GenAI Systems Lab
AI Engineering

InstructGPT → How RLHF Turned Raw LLMs Into Products People Actually Use

OpenAI's 2022 InstructGPT paper applied RLHF to align a general-purpose LLM with human preferences; the technique itself predates the paper. This article covers what the paper showed, how RLHF became the industry-standard alignment recipe, and what PPO vs. DPO means for your fine-tuning strategy.

In January 2022, OpenAI published the InstructGPT paper, which describes fine-tuning GPT-3 to follow instructions using reinforcement learning from human feedback (RLHF). In human evaluations, outputs from the 1.3B-parameter InstructGPT model were preferred over outputs from the 175B-parameter GPT-3 base model, a model more than 100 times larger. That result is the core thesis of the modern AI industry: raw capability is not the same as usefulness.

RLHF went from research curiosity to the dominant method for making LLMs into products. Understanding what it actually does — and where it's been replaced — is essential for anyone building or fine-tuning models.

What InstructGPT proposed: the three-stage RLHF pipeline

Stage 1: Supervised fine-tuning (SFT)

Human contractors write high-quality prompt-response pairs. The base model is fine-tuned on this dataset in a standard supervised way. This gives the model a rough sense of what good responses look like — it's the starting point for preference learning.

Stage 2: Reward model training

Human raters are shown several model outputs for the same prompt and rank them by preference. Each ranking is expanded into pairwise comparisons, and a separate reward model (RM) is trained on them: given a (prompt, response) pair, it predicts a scalar reward score that approximates human preference.
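The pairwise objective can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the Bradley-Terry style loss, -log σ(r_chosen - r_rejected), that the InstructGPT paper uses for its reward model; the function names and the scalar scores are hypothetical stand-ins for the outputs of a real reward model.

```python
import math
from itertools import combinations


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the better response comes first.
    return list(combinations(ranked_responses, 2))
```

The loss is small when the reward model scores the preferred response well above the rejected one, and large when it gets the order wrong. A ranking of K responses yields K(K-1)/2 training pairs.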

Stage 3: RL fine-tuning with PPO

The SFT model is used as the starting policy. PPO (Proximal Policy Optimization) then updates the policy to maximize the reward model's score, with a KL penalty that keeps it from drifting too far from the SFT model. Without that constraint, the policy tends toward degenerate output that games the reward model without actually being good.

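The KL penalty is critical. Without it, models learn to produce outputs that score well on the reward model but are incoherent or nonsensical, a failure known as 'reward hacking'. The reward model is only reliable near the kind of text it was trained on, so the KL term keeps the policy in the region where its scores still mean something.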
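The KL-penalized reward for a single sampled response can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name, the default `beta`, and the per-token log-probability lists are all hypothetical, and a full PPO loop (advantages, clipping, value model) is omitted.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty estimated from one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the current policy and the frozen SFT (reference) model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# A response the policy now finds far more likely than the SFT model does
# is penalized, even though the reward model scored it 3.0.
drifted = kl_penalized_reward(3.0, [-0.1, -0.2], [-2.0, -3.0], beta=0.1)
```

When the policy and the reference assign the same log-probabilities, the penalty is zero and the reward model's score passes through unchanged. A larger `beta` holds the policy closer to the SFT model.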

What production systems actually use

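PPO-based RLHF is expensive. It requires the policy model, reward model, value model, and reference model to all be in memory at the same time during training. For large models, this is often prohibitive. Two alternatives are widely used to cut the cost: one removes the reward model and the RL loop, the other removes the human labelers.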
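A back-of-the-envelope calculation shows why four resident models hurt. The sketch below counts weights only and assumes 16-bit parameters and four models of the same hypothetical 7B size; it ignores gradients, optimizer states, and activations, which add substantially more in practice.

```python
def weight_memory_gb(param_counts, bytes_per_param=2):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return sum(param_counts) * bytes_per_param / 1e9


MODEL_SIZE = 7_000_000_000  # hypothetical 7B-parameter model

# PPO-based RLHF: policy, reward model, value model, and reference model.
ppo_gb = weight_memory_gb([MODEL_SIZE] * 4)   # 56.0
# Plain supervised fine-tuning: a single model.
sft_gb = weight_memory_gb([MODEL_SIZE])       # 14.0
```

The reward and value models are often smaller than the policy in real setups, so treat the 4x ratio as an upper-end illustration rather than a measurement.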

DPO: Direct Preference Optimization

Rafailov et al. (2023) showed you can eliminate the explicit reward model and the PPO loop entirely. DPO reparameterizes the RLHF objective so you can fine-tune the policy model directly on preference pairs (chosen vs. rejected) using a simple binary cross-entropy loss. The reward model is implicit in the policy; the only extra model still needed is a frozen reference, typically the SFT checkpoint. DPO is now a common default in open-source fine-tuning workflows.
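The DPO loss for a single preference pair can be written out directly. This is a minimal sketch in plain Python, assuming sequence-level log-probabilities (summed over tokens) are already available from the policy and from a frozen reference model; the function name and the default `beta` are illustrative choices, not values from the paper.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-logits) == -log(sigmoid(logits)).
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.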

RLAIF: RL from AI Feedback

Constitutional AI (Anthropic, 2022) and related work showed that you can use an LLM as the preference labeller instead of humans. In Constitutional AI, the model first critiques and revises its own outputs against a set of written principles to produce supervised fine-tuning data. A model then judges pairs of responses against those principles, generating the preference data automatically. This dramatically reduces the cost of the feedback collection stage.
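The labelling step can be sketched as a loop that asks a judge to pick the better of two responses. Everything here is hypothetical scaffolding: `build_preference_pairs`, `toy_sampler`, and `toy_judge` are illustrative names, and in a real system both the sampler and the judge would be LLM calls rather than the hard-coded stand-ins below.

```python
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    prompts: List[str],
    sample_two: Callable[[str], Tuple[str, str]],
    judge: Callable[[str, str, str, str], int],
    principle: str,
) -> List[Dict[str, str]]:
    """Label response pairs with an AI judge instead of a human rater."""
    pairs = []
    for prompt in prompts:
        first, second = sample_two(prompt)
        # judge returns 0 if `first` better follows the principle, else 1.
        winner = judge(principle, prompt, first, second)
        chosen, rejected = (first, second) if winner == 0 else (second, first)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy stand-ins so the sketch runs without a model; both are hypothetical.
def toy_sampler(prompt: str) -> Tuple[str, str]:
    return ("I can't verify that, but here is what I know.", "Definitely yes!")


def toy_judge(principle: str, prompt: str, first: str, second: str) -> int:
    # Stand-in for an LLM call: prefer the response that hedges.
    return 0 if "can't verify" in first else 1


pairs = build_preference_pairs(
    ["Is this stock going up?"], toy_sampler, toy_judge,
    principle="Prefer the response that avoids overclaiming.",
)
```

The resulting records use the same chosen/rejected shape as human-labelled preference data, so they can feed either reward model training or direct preference tuning.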

The alignment tax

RLHF and its variants solve a real problem — base models are often unhelpful, inconsistent, and unsafe. But they introduce their own issues. Alignment can reduce benchmark performance on knowledge-intensive tasks (the 'alignment tax'). Over-optimizing for human preference can make models sycophantic — agreeing with users rather than being accurate. Calibrating how much RLHF is enough is an ongoing engineering problem.

Engineering implications
