RLHF and DPO: How Models Learn to Do What You Want
The full alignment pipeline — SFT, reward model training, PPO — and why DPO replaced most of it. Includes the Bradley-Terry model, KL penalty mechanics, reward hacking failure modes, and practical tradeoffs between RLHF and DPO.
Nearly every chat LLM you've used was trained with human feedback at some point. The model you see (helpful, coherent, aligned with what you actually want) is the product of a training process that goes far beyond next-token prediction. RLHF and its successor DPO are the techniques that bridge the gap between 'predicts text' and 'does what you ask'.
This post explains the full pipeline from SFT through reward models to PPO, then shows why DPO quietly replaced most of it — and what the tradeoffs look like in practice.
Step 1: Supervised Fine-Tuning (SFT)
Before any human feedback enters the picture, the base pretrained model is fine-tuned on a curated set of (prompt, ideal response) pairs. This is standard supervised learning — cross-entropy loss on the target tokens. The goal is to get the model into the right 'shape' before the more expensive alignment steps.
- Dataset: typically 10K–100K high-quality demonstrations
- Training: 1–3 epochs, learning rate 1e-5 to 5e-5
- Result: a model that knows the format and rough style of good answers
- Limitation: the model only imitates what humans wrote. It never sees a worse answer to compare against, so it gets no signal about which of two plausible responses people would prefer
SFT alone is often surprisingly good. LIMA (2023) fine-tuned a 65B LLaMA model on just 1,000 carefully chosen examples and found that human raters often judged its responses comparable to those of RLHF-tuned models. The alignment gap is real but sometimes smaller than assumed.
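The SFT objective described above (cross-entropy on the target tokens only) can be sketched in a few lines. This is a minimal pure-Python illustration, not a training loop: the function name `sft_loss`, the toy logits, and the 0/1 `loss_mask` convention are assumptions made for the example.

```python
import math

def sft_loss(logits, targets, loss_mask):
    """Mean cross-entropy over response tokens only.

    logits:    list of per-position score lists (one score per vocab entry)
    targets:   list of target token ids, one per position
    loss_mask: list of 0/1 flags; 1 marks a response token, 0 a prompt token
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt tokens do not contribute to the loss
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # equals -log p(target)
        count += 1
    return total / max(count, 1)

# Toy example: vocabulary of 3 tokens, 2 prompt positions, 2 response positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2, 0]
mask = [0, 0, 1, 1]
print(sft_loss(logits, targets, mask))
```

Masking the prompt means the model is only graded on the response it is supposed to produce, which is what "loss on the target tokens" refers to.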
Step 2: Reward Model Training
A reward model is a separate neural network that assigns a scalar score to a (prompt, response) pair, trained so that responses humans prefer score higher. Human annotators are shown pairs of responses to the same prompt and asked which one is better. This comparison data, often hundreds of thousands of pairwise preferences, trains the reward model.
The reward model uses the Bradley-Terry preference model under the hood: for a pair of responses (y_w, y_l) to prompt x, the probability that y_w is preferred is:
P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))
where:
r(x, y) = reward model score for response y to prompt x
σ = sigmoid function
Training objective: maximize log-likelihood of human preferences
Loss = -E[log σ(r(x, y_w) - r(x, y_l))]
The reward model is typically initialized from the SFT model, with the language-modeling head (the final projection onto the vocabulary) replaced by a head that outputs a single scalar. Training converges relatively quickly, often within a few thousand gradient steps on the preference data.
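The Bradley-Terry probability and the reward model loss above translate directly into code. The sketch below works on plain floats; the scores in `pairs` are made-up numbers standing in for reward model outputs, and the function names are chosen for this example.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))

def reward_model_loss(score_pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) score pairs."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in score_pairs) / len(score_pairs)

# Hypothetical reward scores for three annotated pairs.
# In the last pair the model scores the rejected response higher.
pairs = [(2.0, 0.5), (0.1, 0.0), (-1.0, 1.0)]
print(preference_probability(2.0, 0.5))
print(reward_model_loss(pairs))
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. That is one reason rewards are usually normalized before they are used for RL.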
Step 3: RL Fine-Tuning with PPO
The SFT model is now fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score on generated responses. For each prompt, the policy (LLM) generates a response, the reward model scores it, and the policy parameters are updated to generate higher-scoring responses.
Objective: maximize E[r(x, y)] - β * KL(π_θ || π_ref)
where:
r(x, y) = reward model score
KL(π_θ||π_ref) = KL divergence from reference (SFT) model
β = KL penalty coefficient (typically 0.01–0.1)
The KL penalty prevents the policy from drifting too far
from the SFT model — without it, reward hacking occurs fast.
The KL divergence penalty is critical. Without it, the policy quickly learns to produce outputs that score high on the reward model but are nonsensical to humans — this is reward hacking. With too large a β, the policy barely moves from SFT. Getting β right requires careful tuning.
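To make the KL-penalized objective concrete, here is a minimal sketch of the quantity PPO ends up maximizing for one sampled response. It assumes a single-sample KL estimate (the summed difference of token log-probabilities under the policy and the reference); real implementations often apply the penalty per token. The function name and the example numbers are hypothetical, and `beta=0.05` is just a value inside the range quoted above.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward model score minus a KL penalty for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference (SFT) model.
    """
    # For a response sampled from the policy, the sum of
    # log pi_theta - log pi_ref over its tokens is an unbiased KL estimate.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# The policy has become much more confident than the reference on this response.
policy_lp = [-0.1, -0.2, -0.1]
ref_lp = [-1.0, -1.5, -0.9]
print(kl_shaped_reward(2.0, policy_lp, ref_lp))

# No drift from the reference: the reward passes through unchanged.
print(kl_shaped_reward(2.0, ref_lp, ref_lp))
```

The larger the gap between the two models on a response, the more of its reward is taxed away, which is what keeps the policy near the SFT model.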
Why RLHF Is Expensive
| Cost Component | Why It Hurts | Approximate Scale |
|---|---|---|
| Human annotation | Pairwise comparisons are slow and expensive | 100K–1M pairs, $0.05–0.50/pair |
| Reward model training | Full fine-tune of a separate LLM | Equivalent to SFT training cost |
| PPO stability | Requires careful hyperparameter tuning | Many failed runs before convergence |
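| 4 models in memory | Policy, reference, reward, and value models are all loaded simultaneously | At least 4× inference VRAM; more once gradients and optimizer state for the policy and value models are counted |
| Iteration speed | Each PPO step requires autoregressive generation plus forward passes through several models | 5–20× slower than SFT per token |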
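A back-of-the-envelope estimate shows why the four-model setup is heavy. The sketch below rests on stated assumptions, not measurements: all four models are the same size, frozen models hold 2 bytes per parameter (bf16 weights), and trainable models need about 16 bytes per parameter under mixed-precision Adam (bf16 weights and gradients plus fp32 master weights and two moment estimates). Activations and KV cache are ignored.

```python
def ppo_memory_gb(params_billion, frozen_bytes=2, trainable_bytes=16):
    """Rough weight and optimizer memory in GB (1 GB = 1e9 bytes) for PPO's four models."""
    frozen = params_billion * frozen_bytes        # reference and reward: weights only
    trainable = params_billion * trainable_bytes  # policy and value: weights, grads, Adam state
    return {
        "reference": frozen,
        "reward": frozen,
        "policy": trainable,
        "value": trainable,
        "total": 2 * frozen + 2 * trainable,
    }

print(ppo_memory_gb(7))
```

Under these assumptions the two trainable models dominate the total, so sharing weights between models or using parameter-efficient fine-tuning changes the picture considerably.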
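PPO for LLMs is notoriously unstable. Reward hacking, mode collapse, and training divergence are common. OpenAI's original InstructGPT paper layers several stabilizers on top of standard clipped PPO: the reward model's output is normalized before RL, a per-token KL penalty is taken against the SFT model, and pretraining gradients are mixed into the update (the 'PPO-ptx' variant). All of these matter in practice, and they are easy to get wrong from the paper alone.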
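Part of what PPO does to limit instability is clip how far a single update can move the policy. A minimal sketch of the standard clipped surrogate for one action follows. The function name is made up for this example, and `clip_eps=0.2` is a commonly used default rather than a value taken from this post.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clip_objective(0.0, 0.0, 2.0))   # ratio 1: objective equals the advantage
print(ppo_clip_objective(1.0, 0.0, 1.0))   # large ratio, positive advantage: capped
print(ppo_clip_objective(-1.0, 0.0, -1.0)) # small ratio, negative advantage: capped
```

Clipping bounds the size of each step, but it does nothing about a flawed reward signal, which is why the KL penalty and reward normalization are still needed.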
DPO: The Simpler Alternative
Direct Preference Optimization (DPO), introduced by Rafailov et al. at Stanford in 2023, makes a key observation: the optimal policy under the RLHF objective has a closed-form solution. You don't need a reward model at all — you can fine-tune directly on preference pairs using a simple classification loss.
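The step from "closed-form solution" to "no reward model" is short enough to write out. The sketch below uses the same symbols as the RLHF objective and the Bradley-Terry model above; Z(x) is a normalizing constant introduced for the derivation.

```latex
% Optimal policy for: maximize E[r(x, y)] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\left(\tfrac{1}{\beta} \, r(x, y)\right)

% Solve for the reward in terms of the policy
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substitute into Bradley-Terry: the \beta \log Z(x) terms cancel in the difference
P(y_w \succ y_l \mid x) = \sigma\!\left(
  \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
```

Because the intractable Z(x) drops out, the preference likelihood depends only on policy and reference log-probabilities. Replacing π* with the trainable π_θ and maximizing this likelihood gives the DPO loss.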
The DPO loss is elegant:
DPO Loss = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]
Intuitively: increase the probability of preferred responses
relative to the reference policy, while decreasing the probability
of rejected responses — with β controlling how hard you push.
No reward model needed. No PPO. Just one trainable model, a frozen reference copy, and one loss.
DPO trains on the same (prompt, chosen, rejected) triplets that would go into reward model training, but skips the reward model entirely and fine-tunes the policy directly. The training loop looks like SFT and is far more stable than PPO, although results still depend on β and the learning rate.
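The DPO loss for a single preference pair can be sketched as follows. The inputs are summed sequence log-probabilities of each response under the policy and the frozen reference. The numbers are invented for illustration, and `beta=0.1` is a commonly used starting value, not a recommendation from this post.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # log pi_theta/pi_ref for y_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# After the policy raises the chosen response and lowers the rejected one.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Note that the loss only sees the gap between the two log-ratios. It can fall even if the probability of the chosen response also falls, as long as the rejected response falls faster.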
RLHF vs DPO: When to Use What
| Dimension | RLHF | DPO |
|---|---|---|
| Training stability | Difficult — PPO requires careful tuning | Stable — SFT-like training loop |
| Infrastructure complexity | 4 models in memory simultaneously | 2 models (policy + frozen reference) |
| Data requirements | Same pairwise preference data | Same pairwise preference data |
| Online learning | Can generate new responses during training | Offline by default: trains on a fixed dataset (online and iterative variants exist) |
| Fine-grained control | High — reward shaping possible | Lower — direct on preferences only |
| Compute cost | 5–10× more than DPO | Comparable to SFT |
| Common use cases | Frontier labs with massive compute | Open-source fine-tuning, smaller teams |
In practice: many open-source alignment pipelines use DPO or variants. Zephyr is a well-known example, and Llama-3-Instruct used DPO as one stage of a larger post-training recipe. Full RLHF with PPO is primarily used by labs with the infrastructure to make it stable, such as OpenAI, Anthropic, and Google.
Real Cost Comparison
For a 7B model fine-tuned on 50K preference pairs on 8× A100 GPUs:
- SFT only: ~4–6 hours, ~$80–120 on cloud
- DPO: ~6–10 hours (includes forward passes through the frozen reference model), ~$120–200
- RLHF (reward model + PPO): ~48–72 hours total, ~$800–2000, with high probability of at least one failed run
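The dollar figures above follow from simple arithmetic once a node price is fixed. The sketch below assumes a flat rate of $20 per hour for the 8× A100 node, which is the rate implied by the SFT and DPO lines. That rate is an inference from the numbers, not a quoted price.

```python
NODE_RATE_USD_PER_HOUR = 20.0  # assumption: implied by the SFT and DPO figures

def cost_range(hours_low, hours_high, rate=NODE_RATE_USD_PER_HOUR):
    """Return the (low, high) cost in USD for a run lasting between the two durations."""
    return hours_low * rate, hours_high * rate

runs = {"SFT": (4, 6), "DPO": (6, 10), "RLHF": (48, 72)}
for name, (low, high) in runs.items():
    print(name, cost_range(low, high))
```

At this rate the RLHF run comes to $960–1,440 for a single clean pass. The wider $800–2,000 range quoted above leaves room for price differences and for the failed runs it mentions.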
Beyond DPO: IPO, SimPO, KTO, and ORPO
DPO's weaknesses have spawned a family of improvements:
- IPO (Identity Preference Optimization, 2024): addresses DPO's tendency to overfit on preference pairs by replacing the log-sigmoid loss with a squared loss that pulls the log-ratio margin toward a fixed target, which acts as a regularizer
- SimPO (Simple Preference Optimization, 2024): removes the reference model entirely and uses the length-normalized (average per-token) log-likelihood as the implicit reward, plus a target margin. It is cheaper to train, and its authors report better results than DPO on several benchmarks
- KTO (Kahneman-Tversky Optimization, 2024): works on non-paired feedback (binary good/bad labels), which helps when pairwise comparisons are expensive to collect
- ORPO (Odds Ratio Preference Optimization, 2024): integrates SFT and preference learning into a single stage, with no reference model
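Two of the variants listed above are small changes to the per-pair loss, which is easiest to see in code. The sketch below follows the loss forms given in the IPO and SimPO papers as I understand them. The function names and the default values of `tau`, `beta`, and `gamma` are illustrative assumptions, not recommended settings.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def ipo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, tau=0.1):
    """IPO: squared distance between the log-ratio margin and the target 1/(2*tau)."""
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return (margin - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(policy_chosen_lp, policy_rejected_lp,
               len_chosen, len_rejected, beta=2.0, gamma=1.0):
    """SimPO: no reference model; length-normalized log-probs and a target margin gamma."""
    avg_chosen = policy_chosen_lp / len_chosen
    avg_rejected = policy_rejected_lp / len_rejected
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)

# IPO stops pushing once the margin reaches its target (here 1 / (2 * 0.1) = 5).
print(ipo_loss(-5.0, -10.0, -10.0, -10.0))

# SimPO compares per-token averages, so a longer response is not penalized for length alone.
print(simpo_loss(-20.0, -30.0, len_chosen=20, len_rejected=10))
```

The contrast with DPO is that IPO's loss has a finite minimum in the margin, while DPO's log-sigmoid keeps rewarding ever-larger margins.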
Limitations of Both Approaches
- Reward hacking (RLHF): the model finds responses that score well on the reward model but aren't actually better — more common with small/weak reward models
- Distribution shift: both approaches can cause the model to degrade on tasks not well represented in the preference data
- Label noise: human annotators disagree with each other a substantial fraction of the time (InstructGPT reports inter-labeler agreement of roughly 73%), so preference data has real noise that compounds
- Length and style bias: annotators often prefer longer or more confident-sounding responses regardless of quality
- DPO offline limitation: standard DPO can't improve from responses it generates itself, because it's limited to what's in the fixed dataset
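One of the biases listed above, the preference for longer responses, is cheap to check before training. The sketch below is a rough diagnostic under simple assumptions: length is measured in whitespace-separated words, and the toy `pairs` data is invented for the example.

```python
def longer_chosen_rate(pairs):
    """Fraction of (chosen, rejected) text pairs where the chosen response is strictly longer."""
    longer = sum(1 for chosen, rejected in pairs
                 if len(chosen.split()) > len(rejected.split()))
    return longer / len(pairs)

# Hypothetical preference pairs: (chosen, rejected).
pairs = [
    ("Paris is the capital of France and its largest city.", "Paris."),
    ("Use a context manager so the file is always closed.", "Just open it."),
    ("No.", "It depends on many factors that are hard to pin down."),
    ("The function returns None because there is no return statement.", "It returns None."),
]
print(longer_chosen_rate(pairs))
```

A rate far above 0.5 does not prove the labels are biased, since longer answers can be better. It does suggest checking whether the trained model has learned to be verbose rather than helpful.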