InstructGPT: How RLHF Turned a Language Model Into an Assistant
The 2022 OpenAI paper that introduced RLHF to mainstream AI. How SFT + reward model + PPO produced a model that followed instructions — and why it changed everything.
GPT-3 was remarkable and unreliable in equal measure. It could write poetry and answer questions — but it could also confidently give dangerous instructions and complete prompts in ways that had nothing to do with what you wanted. It was a powerful next-token predictor that didn't understand 'helpful'.
In early 2022, OpenAI published 'Training language models to follow instructions with human feedback', the InstructGPT paper. It did not invent Reinforcement Learning from Human Feedback (RLHF), which earlier work had already applied to tasks such as summarisation, but it showed that RLHF could turn an unreliable general-purpose text predictor into an assistant that followed instructions. This recipe underlies ChatGPT, and most instruction-following models (Claude, Gemini, Llama instruct variants) use it or a close relative.
The three-stage training pipeline
- Stage 1, Supervised Fine-Tuning (SFT): Human labelers write the ideal response to each prompt. Fine-tune the base model on these demonstrations.
- Stage 2, Reward Model Training: Show labelers several model outputs for the same prompt and have them rank the outputs from best to worst; each ranking yields a set of pairwise comparisons. Train a separate model to predict these preferences as a scalar score.
- Stage 3, PPO Fine-Tuning: Use the reward model as the training signal. Fine-tune the SFT model with PPO to maximise reward scores, with a KL penalty that keeps the policy from drifting too far from the SFT model.
The key insight: you can't write down a reward function for 'be helpful' — but you can have humans compare two outputs and say which is better. RLHF converts pairwise human preferences into a training signal a model can be optimised against.
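To make the conversion from comparisons to a training signal concrete, here is a minimal sketch of the pairwise loss commonly used for reward models: the negative log-sigmoid of the score gap between the preferred and the rejected response. The function names and the example scores are hypothetical, chosen only for illustration, and a real reward model would produce the scores with a neural network rather than a lookup table.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_ids, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_preferred - r_rejected)."""
    margin = reward_preferred - reward_rejected
    # Equivalent to log(1 + exp(-margin)), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical scalar scores a reward model might assign to three responses.
scores = {"a": 1.2, "b": 0.3, "c": -0.5}

# A labeler ranked a > b > c, which expands into three comparisons.
pairs = ranking_to_pairs(["a", "b", "c"])
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and equals log 2 when it cannot tell them apart. With only two responses per prompt, the ranking reduces to a single pair.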
The striking result
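In the paper's human evaluations, outputs from the 1.3B-parameter InstructGPT model were preferred over outputs from the 175B GPT-3 model, despite the smaller model having over 100× fewer parameters. At equal size, outputs from the 175B InstructGPT model were preferred over 175B GPT-3 outputs about 85% of the time. Alignment training beat raw scale on the measure that actually mattered.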
Reward hacking and the KL penalty
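Without the KL penalty, the policy learns to exploit the reward model, producing responses that score highly but read as incoherent. The reason is that PPO optimises against the reward model, a learned proxy, rather than against human preferences directly, so any blind spot in the proxy becomes a target. The same dynamic partly explains why RLHF models can be sycophantic: if evaluators tend to reward agreeable answers, the model learns to say what evaluators want to hear.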
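A minimal sketch of how the penalty enters the reward, assuming the common formulation in which the reward-model score is reduced by a coefficient times the log-probability ratio between the current policy and the frozen SFT model. The function name, the `beta` value, and the example numbers are illustrative assumptions, not the paper's exact implementation.

```python
def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Reward-model score minus beta times a sampled KL estimate.

    policy_logprobs / sft_logprobs hold the per-token log-probabilities of
    the same sampled response under the current policy and the SFT model.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed per-token log-ratio: a single-sample estimate of KL(policy || SFT).
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical numbers: both responses get the same reward-model score,
# but the second is far more likely under the policy than under the SFT model.
same = kl_penalised_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_penalised_reward(1.0, [-0.1, -0.1], [-3.0, -4.0])
```

Here `drifted` ends up lower than `same` even though the reward model scored both responses identically: the further the policy moves from the SFT model, the more reward it has to earn to justify the move.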
What RLHF changed in practice
- Instruction following: models reliably do what they're asked instead of merely continuing the prompt text
- Refusal behaviour: models learned to decline harmful requests
- Sycophancy: models learned to agree with users even when the user is wrong (a reward hacking artifact)
- Verbosity: RLHF models tend to be more verbose, plausibly because human evaluators often preferred longer responses
RLHF variants: Constitutional AI and DPO
- Constitutional AI (Anthropic, 2022): Replace human labels with AI self-critique guided by written principles — RLAIF.
- DPO (Stanford, 2023): Reformulate RLHF as a classification problem. No separate reward model. No PPO. One fine-tuning pass.
- Process reward models: Reward each reasoning step instead of the final output. Better for math and code.
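To illustrate the DPO bullet above, here is a minimal sketch of the DPO loss for a single preference pair, assuming you already have the summed log-probability of each response under the policy being trained and under a frozen reference model (typically the SFT model). The argument names and the `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy favours each response than the reference does.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits), in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical log-probabilities. Before training, policy == reference.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After training, the policy has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss has the same classification shape as a reward-model loss, but the "reward" is the policy's own log-ratio against the reference, which is why no separate reward model or PPO loop is needed. The reference term plays a role similar to the KL penalty in PPO-based RLHF.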