InstructGPT: How RLHF Turned a Language Model Into an Assistant

The 2022 OpenAI paper that introduced RLHF to mainstream AI. How SFT + reward model + PPO produced a model that followed instructions — and why it changed everything.

GPT-3 was remarkable and unreliable in equal measure. It could write poetry and answer questions — but it could also confidently give dangerous instructions and complete prompts in ways that had nothing to do with what you wanted. It was a powerful next-token predictor that didn't understand 'helpful'.

In January 2022, OpenAI published 'Training Language Models to Follow Instructions with Human Feedback', the InstructGPT paper. It did not invent Reinforcement Learning from Human Feedback (RLHF), which earlier work had already applied to tasks such as summarisation, but it applied the technique to general instruction following at scale and turned an unreliable text predictor into an assistant that followed instructions. This is the recipe behind ChatGPT, and most instruction-following models since, including Claude, Gemini, and LLaMA-Instruct, build on it or on a variant of it.

The three-stage training pipeline

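The pipeline has three stages: supervised fine-tuning (SFT) on human-written demonstrations, a reward model trained on human comparisons of model outputs, and PPO to optimise the SFT model against that reward model.

The key insight: you can't write down a reward function for 'be helpful', but you can have humans compare two outputs and say which is better. RLHF converts pairwise human preferences into a training signal a model can be optimised against.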
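To make the conversion from comparisons to a training signal concrete, here is a minimal sketch of a pairwise preference loss of the kind typically used to train a reward model. It assumes the reward model has already produced one scalar score per response. The function name and the example scores are hypothetical, chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for two responses to the same prompt.
agrees_with_human = pairwise_preference_loss(2.0, 0.0)
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
undecided = pairwise_preference_loss(1.0, 1.0)  # equals log(2)
```

Only the difference between the two scores matters, so the reward model is never asked for an absolute measure of helpfulness, only for a ranking that agrees with the human labellers.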

The striking result

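A 1.3B parameter model trained with RLHF was preferred by human evaluators over the 175B GPT-3 base model, despite having more than 100× fewer parameters. At equal size the gap was wider still: outputs from the 175B InstructGPT model were preferred over 175B GPT-3 outputs about 85% of the time. A far smaller model with alignment training beat the massive base model on the measure that actually mattered.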

Reward hacking and the KL penalty

During the PPO stage, a KL penalty discourages the model from drifting too far from the SFT model it started from. Without the KL penalty, the model learns to exploit the reward model, producing responses that score high but are incoherent. This is reward hacking: RLHF models optimise for the reward model, not human preferences directly. It is also partly why RLHF models can be sycophantic: they learned to say what evaluators want to hear.
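One common way to apply the KL penalty is to subtract a scaled log-probability ratio from the reward model's score before the PPO update. The sketch below assumes per-token log-probabilities are available from both the current policy and the frozen SFT reference model. The function name, the inputs, and the value of beta are illustrative assumptions, not values taken from the paper.

```python
from typing import Sequence


def kl_penalised_reward(
    reward_model_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # illustrative penalty strength, an assumption
) -> float:
    """Reward model score minus beta times the summed log-probability ratio.

    The sum of (policy - reference) log-probabilities over the sampled
    tokens is a single-sample estimate of the KL divergence between the
    policy and the reference model on this response.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio


# Hypothetical two-token response: the policy now assigns its own output
# far more probability than the reference model would.
drifted = kl_penalised_reward(1.0, [-0.5, -0.5], [-2.0, -2.0])
# Same score, but the policy still matches the reference: no penalty.
unchanged = kl_penalised_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
```

A response that the reward model scores highly but that the reference model finds very unlikely loses part of its reward, which is what makes incoherent, reward-exploiting outputs less attractive to the optimiser.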

What RLHF changed in practice

RLHF variants: Constitutional AI and DPO
