RLHF and DPO: How Language Models Learn Human Preferences
How reinforcement learning from human feedback works end to end, why DPO replaced it for most teams, and what actually changes in the model during alignment training.
Alignment is not censorship. When people talk about "aligning" a language model, they mean something specific: shifting the output distribution toward responses that humans prefer — more helpful, more accurate, less harmful. The base model, trained on internet text, outputs what's statistically likely. Alignment training steers it toward what's actually good.
RLHF (Reinforcement Learning from Human Feedback) was the technique that made this work at scale. DPO (Direct Preference Optimization) is the technique that replaced it for most teams. Understanding both tells you something important about how modern LLMs actually work.
The 3-Phase RLHF Pipeline
RLHF as introduced in InstructGPT (2022) has three distinct phases. Each builds on the previous.
Phase 1: Supervised Fine-Tuning (SFT)
You start with a pretrained base model and fine-tune it on a curated dataset of (prompt, ideal response) pairs — usually written or heavily edited by human contractors. This teaches the model the general format and style of helpful responses. The base model can generate coherent text; SFT teaches it what a good assistant response looks like.
- Dataset size: typically 10K–100K high-quality (prompt, response) pairs
- Training objective: standard next-token prediction (cross-entropy loss)
- Result: a model that responds helpfully but without preference calibration
- Cost: moderate — the dataset is expensive to produce, training is standard fine-tuning
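To make the training objective concrete, here is a minimal sketch of next-token cross-entropy on a toy example. The function names, the three-token vocabulary, and the logits are all made up for illustration; a real SFT run would compute this over batches of tokenized (prompt, response) pairs inside a deep learning framework.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_step, target_ids):
    """Average cross-entropy of the target tokens under the model's logits."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total += -log_softmax(logits)[target]
    return total / len(target_ids)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, response of 2 tokens.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = next_token_loss(logits, targets)
```

The loss falls as the model puts more probability on the tokens the human-written response actually contains, which is all SFT asks of it.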
Phase 2: Reward Model Training
Human labelers are shown multiple model outputs for the same prompt and rank them by preference. These preference pairs (prompt, better_response, worse_response) are used to train a separate reward model — a model that predicts how much a human would prefer a given response.
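One way a ranking can be turned into the pairwise format is to emit every (better, worse) combination. The sketch below assumes the labeler's ranking arrives as a best-first list; the function name and the example strings are hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (prompt, better, worse) triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs(
    "Explain KL divergence.",
    ["answer A", "answer B", "answer C"],  # best first
)
```

A ranking of K responses yields K·(K−1)/2 pairs, so a single ranking task can produce several training comparisons.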
The reward model is a fine-tuned version of the SFT model with a regression head instead of a next-token prediction head. It outputs a single scalar: the estimated human preference score.
- Training data: ~100K–500K pairwise comparisons
- Labeling cost: significant — each comparison requires a human to read and rank two outputs
- The reward model is a proxy for human judgment — it will have its own failure modes
- Quality of this model directly caps alignment quality — garbage in, garbage out
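A common way to train on these comparisons is a pairwise logistic loss on the difference between the two scalar scores. The sketch below assumes the reward model has already produced one score per response; the function name is illustrative, not taken from any particular library.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), computed stably."""
    margin = score_chosen - score_rejected
    # -log sigmoid(m) equals softplus(-m) = max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the preferred response scores higher,
# and large when the ordering is wrong.
good_order = pairwise_reward_loss(2.0, 0.0)
bad_order = pairwise_reward_loss(0.0, 2.0)
```

Only the difference between the two scores enters the loss, so the reward model's absolute scale is not pinned down by the data.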
Phase 3: PPO (Reinforcement Learning)
The SFT model is now fine-tuned using the reward model as the reward signal. The model generates responses, the reward model scores them, and PPO updates the policy to maximize the reward. A KL penalty term keeps the policy from diverging too far from the SFT model — without it, the model reward-hacks into nonsense.
max_θ E[r_φ(x, y)] − β · KL(π_θ(y|x) || π_SFT(y|x))
Where:
r_φ(x, y) = reward model score for (prompt x, response y)
π_θ = current policy (LLM being trained)
π_SFT = reference SFT policy (frozen)
β = KL penalty coefficient (often quoted as 0.1–0.5, though smaller values are also used)
KL term = prevents reward hacking / distribution collapse
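As a rough sketch of how this objective is estimated from samples: for a response y drawn from the current policy, log π_θ(y|x) − log π_SFT(y|x) is a single-sample estimate of the KL term. The code below works at the sequence level and assumes the log-probabilities have already been summed over tokens; the function names and the default β are illustrative.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Single-sample estimate of r(x, y) - beta * KL for y sampled from the policy."""
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

def objective_estimate(samples, beta=0.1):
    """Average penalized reward over (reward, logp_policy, logp_ref) samples."""
    values = [kl_penalized_reward(r, lp, lr, beta) for r, lp, lr in samples]
    return sum(values) / len(values)

# If the policy assigns a response much more probability than the SFT model
# does, the penalty eats into the reward that response earns.
drifted = kl_penalized_reward(reward=1.0, logp_policy=-5.0, logp_ref=-7.0, beta=0.5)
unchanged = kl_penalized_reward(reward=1.0, logp_policy=-7.0, logp_ref=-7.0, beta=0.5)
```

PPO then maximizes this penalized quantity rather than the raw reward model score.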
PPO is brittle. The KL penalty coefficient β is hard to tune — too low and the model reward-hacks, too high and it barely learns anything. This instability is one of the main reasons DPO took over.
Why PPO Is Brittle in Practice
- Reward hacking: the model finds patterns that score well with the reward model but don't generalize to real human preferences
- KL penalty sensitivity: β requires careful tuning per model and dataset — there's no universal good value
- Training instability: RL training with LLMs can diverge unexpectedly, especially at scale
- Memory overhead: you're running four models simultaneously (policy, reference, reward model, value function)
- Sample inefficiency: PPO requires many rollouts per policy update compared to supervised learning
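A back-of-the-envelope calculation shows why the memory point matters. The sketch assumes every model copy has the same parameter count and is stored at 2 bytes per parameter (for example bf16), and it counts weights only: optimizer state, gradients, and activations are ignored, and in practice the reward and value models may be smaller than the policy.

```python
def weight_memory_gb(num_params, copies, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * copies / 1e9

# Policy, reference, reward model, and value function at 70B parameters each.
four_model_gb = weight_memory_gb(70e9, copies=4)
# Policy and frozen reference only, for comparison.
two_model_gb = weight_memory_gb(70e9, copies=2)
```

Under these assumptions the four-model setup needs about 560 GB for weights alone, against about 280 GB for two copies.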
Meta's 70B Llama RLHF training reportedly cost $5–10M in compute. The complexity is not just financial — the engineering burden of debugging RL training runs at this scale is substantial.
What DPO Does Differently
Direct Preference Optimization (Rafailov et al., 2023) sidesteps the RL loop entirely. The key insight: the optimal policy under the RLHF objective has a closed-form solution. You don't need a separate reward model or PPO — you can derive the alignment objective directly from preference data.
DPO treats the language model itself as the implicit reward model. Given a preference pair (prompt x, chosen response y_w, rejected response y_l), the DPO loss directly increases the probability of y_w relative to y_l, while staying anchored to the SFT reference policy.
L_DPO = -log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x))
− β · log(π_θ(y_l|x) / π_ref(y_l|x)))
Where:
y_w = chosen (preferred) response
y_l = rejected (less preferred) response
π_θ = model being trained
π_ref = frozen SFT reference model
σ = sigmoid function
β = temperature controlling KL penalty strength
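A compressed sketch of where this loss comes from, following the argument in Rafailov et al. (2023). The partition function Z(x) and the optimal policy π* are notation introduced here, and the last line uses the Bradley-Terry preference model covered in the next section.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)
    && \text{(optimum of the KL-penalized objective)} \\
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
    && \text{(solve for the reward)} \\
P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
    &= \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
    && \text{($\beta \log Z(x)$ cancels)}
\end{align*}
```

Replacing π* with the trainable π_θ and taking the negative log-likelihood of the observed preferences gives L_DPO. The partition function, which would be intractable to compute, never has to be evaluated because it appears in both rewards and drops out of the difference.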
DPO eliminates the separate reward model and the PPO loop. It trains directly on (prompt, chosen, rejected) triples using a modified cross-entropy objective. Same preference data, half the complexity.
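The loss for a single triple can be sketched in a few lines. The code assumes you already have the summed log-probability of each response under the trainable model and under the frozen reference; the function name and the default β of 0.1 are illustrative choices, not a prescribed setting.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log(pi_theta / pi_ref) for y_w
    rejected_logratio = logp_rejected - ref_logp_rejected  # log(pi_theta / pi_ref) for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(z) equals softplus(-z) = max(-z, 0) + log(1 + exp(-|z|))
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Before any training the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-probability relative to the reference lowers it.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Both log-ratios are measured against the reference, so the loss rewards moving probability toward the chosen response relative to where the SFT model started, not in absolute terms.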
The Bradley-Terry Model and Why It Works
Both RLHF and DPO are grounded in the Bradley-Terry model of pairwise preference, a probabilistic framework in which the probability of preferring response A over B is exp(r_A) / (exp(r_A) + exp(r_B)), where r_A and r_B are the underlying rewards of the two responses. Equivalently, it is the sigmoid of the reward difference r_A − r_B.
This is a well-studied model from statistics used in sports rankings, psychology, and economics. It's the mathematical bridge between discrete pairwise judgments ("I prefer A to B") and continuous reward signals. RLHF trains a reward model to learn these reward values then uses RL to optimize them; DPO directly optimizes the same Bradley-Terry objective without the intermediate step.
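A small sketch of the Bradley-Terry probability, written in its sigmoid form. The function name is illustrative.

```python
import math

def bradley_terry_prob(r_a, r_b):
    """P(A preferred over B) = exp(r_a) / (exp(r_a) + exp(r_b)) = sigmoid(r_a - r_b)."""
    d = r_a - r_b
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # avoids overflow when d is very negative
    return e / (1.0 + e)

equal = bradley_terry_prob(1.0, 1.0)        # 0.5: no preference either way
p = bradley_terry_prob(2.0, 0.0)
shifted = bradley_terry_prob(102.0, 100.0)  # same difference, same probability
```

Adding a constant to both rewards leaves the probability unchanged, because only the difference matters. That is why pairwise data can pin down rewards only up to an offset.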
What Actually Changes in the Weights
The practical question: when you run alignment training, which layers of the model actually change, and how much?
- Attention layers: moderate changes, especially in middle and upper layers where complex reasoning happens
- MLP layers: significant changes — this is where much of the preference-relevant knowledge seems to live
- Embedding layers: minor changes — token representations stay mostly stable
- Layer norm parameters: small but measurable changes throughout
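One simple way to quantify "how much" is the relative norm of the weight change per parameter group between the SFT checkpoint and the aligned checkpoint. The sketch below uses flat lists of floats and invented toy values purely to show the calculation; the group names and numbers are hypothetical and are not measurements from any model.

```python
import math

def relative_change(before, after):
    """||after - before|| / ||before|| for two flat lists of floats."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    base = math.sqrt(sum(b ** 2 for b in before))
    return diff / base

def change_by_group(sft_weights, aligned_weights):
    """Relative change for each named parameter group."""
    return {name: relative_change(sft_weights[name], aligned_weights[name])
            for name in sft_weights}

# Toy checkpoints (made-up values, for illustration only).
sft = {"embed": [1.0, 2.0, 2.0], "mlp.0": [3.0, 4.0]}
aligned = {"embed": [1.0, 2.0, 2.0], "mlp.0": [3.0, 4.5]}
report = change_by_group(sft, aligned)
```

With real checkpoints you would flatten each tensor and group the parameters by layer type before comparing.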
Mechanistic interpretability research has found that refusal behavior is often mediated by specific directions in the residual stream in mid-to-upper layers. When alignment training succeeds, it amplifies these directions for harmful inputs. When it fails (jailbreaks), adversarial prompts move the activations away from those directions.
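To make "a direction in residual stream space" concrete, here is a toy sketch. It assumes the direction is estimated as the difference between mean activations on harmful and harmless prompts, which is one approach used in interpretability work but is an assumption here, not something this article specifies. The two-dimensional activations are invented.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_direction(harmful_acts, harmless_acts):
    """Normalized difference of mean activations (assumed estimator)."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection(activation, direction):
    """How far an activation extends along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations (hypothetical values).
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
direction = unit_direction(harmful, harmless)
score = projection([2.5, 7.0], direction)
```

In this picture, a jailbreak corresponds to an input whose activation has a small projection on the direction even though the request is harmful.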
When RLHF Still Beats DPO
DPO has largely replaced RLHF for fine-tuning teams using open models. But for frontier model training, RL-based methods (PPO, or alternatives such as REINFORCE and GRPO) still dominate in certain regimes:
- Very large scale: the optimal policy derivation in DPO assumes certain properties that may not hold at 100B+ parameter scale
- Complex reward signals: when the reward isn't reducible to pairwise preferences (e.g., multi-dimensional evaluations, process rewards), RL-based methods are more flexible
- Online learning: DPO is an offline method — it trains on fixed preference data. Online RL can collect new preference data mid-training, which matters for certain capability gains
- Process rewards: rewarding step-by-step reasoning quality requires per-step reward signals that PPO handles naturally but DPO doesn't. Reasoning models like o1 and o3 are trained with large-scale RL, though the details of their reward design are not public
The real frontier labs (OpenAI, Anthropic, Google DeepMind) still use RL-based alignment methods for their most capable models. DPO is the practical choice for teams working with 7B–70B open models on a budget — which is most practitioners.
Key Numbers
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Training phases | 3 (SFT + RM + PPO) | 2 (SFT + DPO) |
| Separate reward model needed | Yes | No |
| Memory overhead | 4× model copies in memory | 2× model copies |
| Training stability | Low (RL instability) | High (supervised objective) |
| Preference data format | Rankings or pairwise | Pairwise (chosen, rejected) |
| Online / offline | Online (PPO rollouts) | Offline (fixed dataset) |
| Cost at 70B scale | ~$5–10M (reported for Meta's 70B Llama) | ~$500K–1M (estimate) |
| Adopted by open-source teams | Rare | Standard (TRL, Axolotl, etc.) |
Key Papers
- InstructGPT (Ouyang et al., 2022) — introduced RLHF for LLMs; the paper that made ChatGPT possible
- Constitutional AI (Bai et al., 2022) — Anthropic's extension using AI feedback instead of human labelers
- DPO (Rafailov et al., 2023) — the closed-form derivation that eliminated the reward model
- RLHF Workflow: From Reward Modeling to Online RLHF (Dong et al., 2024) — practical analysis of when each approach works
- InstructGPT: Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022)
- Direct Preference Optimization (DPO) — Rafailov et al., 2023
- Lilian Weng: Reinforcement Learning from Human Feedback
- Constitutional AI: Harmlessness from AI Feedback — Anthropic, 2022