
RLHF and DPO: How Models Learn to Do What You Want

The full alignment pipeline — SFT, reward model training, PPO — and why DPO replaced most of it. Includes the Bradley-Terry model, KL penalty mechanics, reward hacking failure modes, and practical tradeoffs between RLHF and DPO.

Nearly every chat-tuned LLM you've used was trained with human feedback at some point. The model you see (helpful, coherent, aligned with what you actually want) is the product of a training process that goes far beyond next-token prediction. RLHF and its successor DPO are the techniques that bridge the gap between 'predicts text' and 'does what you ask'.

This post explains the full pipeline from SFT through reward models to PPO, then shows why DPO quietly replaced most of it — and what the tradeoffs look like in practice.

Step 1: Supervised Fine-Tuning (SFT)

Before any human feedback enters the picture, the base pretrained model is fine-tuned on a curated set of (prompt, ideal response) pairs. This is standard supervised learning — cross-entropy loss on the target tokens. The goal is to get the model into the right 'shape' before the more expensive alignment steps.

Dataset: typically 10K–100K high-quality demonstrations
Training: 1–3 epochs, learning rate 1e-5 to 5e-5
Result: a model that knows the format and rough style of good answers
Limitation: the model only learns from what humans wrote — it can't generalise to novel preferences
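
To make "cross-entropy loss on the target tokens" concrete, here is a minimal sketch in plain Python. It assumes you already have the log-probability the model assigned to each token in the sequence; the function name `sft_loss`, the mask convention, and the numbers are illustrative and not taken from any particular library.

```python
import math

def sft_loss(token_logprobs, is_target):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model gave to each actual token
    is_target:      True for response tokens, False for prompt tokens
    """
    target_lps = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    if not target_lps:
        raise ValueError("no target tokens to train on")
    return -sum(target_lps) / len(target_lps)

# Hypothetical sequence: 2 prompt tokens (masked out) + 3 response tokens
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]

loss = sft_loss(logprobs, mask)
print(f"SFT loss: {loss:.4f}")
```

Masking the prompt tokens means the model is only penalized for how it answers, not for failing to predict the question.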

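SFT alone is often surprisingly good. LIMA (2023) fine-tuned a 65B model on just 1,000 carefully chosen examples and found that human raters judged its responses equivalent or preferable to those of RLHF-tuned models in a substantial share of comparisons. The alignment gap is real but sometimes smaller than assumed.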

Step 2: Reward Model Training

A reward model is a separate neural network trained to predict which of two responses a human would prefer. Human annotators are shown pairs of responses to the same prompt and asked to rank them. This comparison data — hundreds of thousands of pairwise preferences — trains the reward model.

The reward model uses the Bradley-Terry preference model under the hood: for a pair of responses (y_w, y_l) to prompt x, the probability that y_w is preferred is:

P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))

where:
  r(x, y) = reward model score for response y to prompt x
  σ       = sigmoid function

Training objective: maximize log-likelihood of human preferences
Loss = -E[log σ(r(x, y_w) - r(x, y_l))]

The reward model is typically initialized from the SFT model with the final layer replaced by a scalar head. Training converges relatively quickly — a few thousand gradient steps on the preference data.
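
The loss above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the rewards are given as plain floats rather than produced by a network, and the helper names (`log_sigmoid`, `preference_prob`, `reward_model_loss`) are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))

def reward_model_loss(pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) score pairs."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in pairs) / len(pairs)

# Made-up scores: the loss shrinks as chosen responses outscore rejected ones
undecided = reward_model_loss([(0.0, 0.0)])   # model cannot tell them apart
confident = reward_model_loss([(2.0, -1.0)])  # chosen scored 3 points higher
print(undecided, confident)
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. That is one reason reward values are commonly normalized before being used in RL.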

Step 3: RL Fine-Tuning with PPO

The SFT model is now fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score on generated responses. For each prompt, the policy (LLM) generates a response, the reward model scores it, and the policy parameters are updated to generate higher-scoring responses.

Objective: maximize E[r(x, y)] - β * KL(π_θ || π_ref)

where:
  r(x, y)      = reward model score
  KL(π_θ||π_ref) = KL divergence from reference (SFT) model
  β            = KL penalty coefficient (typically 0.01–0.1)

The KL penalty prevents the policy from drifting too far
from the SFT model — without it, reward hacking occurs fast.

The KL divergence penalty is critical. Without it, the policy quickly learns to produce outputs that score high on the reward model but are nonsensical to humans — this is reward hacking. With too large a β, the policy barely moves from SFT. Getting β right requires careful tuning.
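
A common way to apply the penalty in practice is to subtract a scaled log-ratio from the reward model's score. The sketch below assumes per-token log-probabilities of one sampled response under both the policy and the reference model; summing their differences gives a single-sample estimate of the sequence-level KL. The function name and the numbers are illustrative.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward model score minus beta times an estimate of KL(policy || ref).

    policy_logprobs / ref_logprobs: per-token log-probs of the SAME sampled
    response under the current policy and the frozen reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio

# Hypothetical response where the policy is more confident than the reference
shaped = kl_penalized_reward(
    reward=1.2,
    policy_logprobs=[-0.1, -0.2, -0.3],
    ref_logprobs=[-0.5, -0.6, -0.7],
    beta=0.05,
)
print(shaped)  # slightly below 1.2 because the policy has drifted
```

When the policy and reference agree on every token the penalty is zero, so early in training the shaped reward is just the reward model's score.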

Why RLHF Is Expensive

Cost Component	Why It Hurts	Approximate Scale
Human annotation	Pairwise comparisons are slow and expensive	100K–1M pairs, $0.05–0.50/pair
Reward model training	Full fine-tune of a separate LLM	Equivalent to SFT training cost
PPO stability	Requires careful hyperparameter tuning	Many failed runs before convergence
4 models in memory	Policy, reference, reward, and value models are all loaded at once	Roughly 4× the weight memory of one model, plus gradients and optimizer state for the policy and value models
Iteration speed	Each PPO step requires multiple forward passes	5–20× slower than SFT per token
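
As a rough back-of-the-envelope check on the memory row, the sketch below estimates weight-related memory only. The assumptions are mine, not the article's: all four models are 7B parameters, frozen models are held in 16-bit precision (2 bytes per parameter), and trained models use mixed-precision Adam (about 16 bytes per parameter for weights, gradients, and optimizer state). Activations, KV caches, and sharding are ignored.

```python
GB = 1e9  # decimal gigabytes

def training_memory_gb(n_params, n_trainable, n_frozen,
                       bytes_trainable=16, bytes_frozen=2):
    """Approximate memory for weights, gradients and optimizer state.

    bytes_trainable=16 assumes mixed-precision Adam (an assumption);
    bytes_frozen=2 assumes 16-bit inference-only weights.
    """
    trainable = n_trainable * n_params * bytes_trainable
    frozen = n_frozen * n_params * bytes_frozen
    return (trainable + frozen) / GB

# PPO setup: policy + value model are trained, reference + reward are frozen
ppo_gb = training_memory_gb(7e9, n_trainable=2, n_frozen=2)
print(f"PPO, 7B models: ~{ppo_gb:.0f} GB before activations")
```

Under these assumptions most of the memory goes to the two trained models, which is why memory-saving techniques such as parameter-efficient fine-tuning or sharing a backbone between policy and value model are popular.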

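PPO for LLMs is notoriously unstable. Reward hacking, mode collapse, and training divergence are common. Stable runs lean on implementation details such as reward normalization and PPO's clipped objective, and OpenAI's InstructGPT paper additionally mixes pretraining gradients into the PPO update (the PPO-ptx variant) to limit regressions. These details are essential but often thinly documented.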

DPO: The Simpler Alternative

Direct Preference Optimization (DPO), introduced by Rafailov et al. at Stanford in 2023, makes a key observation: the optimal policy under the RLHF objective has a closed-form solution. You don't need a reward model at all — you can fine-tune directly on preference pairs using a simple classification loss.
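
For readers who want the missing step, here is a sketch of the standard derivation in LaTeX, using the same symbols as the RLHF objective and the Bradley-Terry model above. Z(x) is a normalizing constant introduced here for the derivation; it does not appear in the original text.

```latex
\begin{align}
% 1. Closed-form optimum of the KL-regularized reward objective
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                   \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right) \\
% 2. Solve for the reward in terms of the policy
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
           + \beta \log Z(x) \\
% 3. Substitute into Bradley-Terry: the \beta \log Z(x) terms cancel
P(y_w \succ y_l \mid x) &= \sigma\!\left(
    \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\end{align}
```

Because Z(x) depends only on the prompt, it drops out of the difference, leaving a preference probability expressed purely in terms of policy and reference log-probabilities. Maximizing its log-likelihood over the preference data, with the trainable policy in place of the optimum, is the DPO objective.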

The DPO loss is elegant:

DPO Loss = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]

Intuitively: increase the probability of preferred responses
relative to the reference policy, while decreasing the probability
of rejected responses — with β controlling how hard you push.

No reward model needed. No PPO. Just one trainable model, a frozen reference copy, and one loss.

DPO trains on the same (prompt, chosen, rejected) triplets that would go into reward model training — but skips the reward model entirely and fine-tunes the policy directly. Training is as stable as SFT.
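
A minimal sketch of the DPO loss for a single preference pair, in plain Python. It assumes you have already summed token log-probabilities into one sequence log-probability per response under both the policy and the frozen reference; the function and argument names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    All four arguments are sequence log-probabilities log p(y | x).
    """
    chosen_logratio = pi_chosen - ref_chosen        # log pi/pi_ref for y_w
    rejected_logratio = pi_rejected - ref_rejected  # log pi/pi_ref for y_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# At initialization the policy equals the reference, so the loss is log(2)
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy raises the chosen response and lowers the rejected one
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(start, later)
```

The reference log-probabilities do not depend on the trainable parameters, so they can be computed once up front and cached, which is one way teams avoid keeping the reference model in memory during training.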

RLHF vs DPO: When to Use What

Dimension	RLHF	DPO
Training stability	Difficult — PPO requires careful tuning	Stable — SFT-like training loop
Infrastructure complexity	4 models in memory simultaneously	2 models (policy + frozen reference)
Data requirements	Same pairwise preference data	Same pairwise preference data
Online learning	Can generate new responses during training	Offline only — uses fixed dataset
Fine-grained control	High — reward shaping possible	Lower — direct on preferences only
Compute cost	5–10× more than DPO	Comparable to SFT
Common use cases	Frontier labs with massive compute	Open-source fine-tuning, smaller teams

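In practice: many open-source alignment pipelines use DPO or variants. Zephyr is the canonical example, and Meta reports using DPO alongside SFT, rejection sampling, and PPO for Llama-3-Instruct. (The original Mistral-7B-Instruct release, by contrast, was described as plain instruction fine-tuning.) Full RLHF with PPO is primarily used by labs with the infrastructure to make it stable, such as OpenAI, Anthropic, and Google.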

Real Cost Comparison

Rough estimates for a 7B model fine-tuned on 50K preference pairs on 8× A100 GPUs (actual time and price vary with provider, sequence length, and batch size):

SFT only: ~4–6 hours, ~$80–120 on cloud
DPO: ~6–10 hours (includes reference model), ~$120–200
RLHF (reward model + PPO): ~48–72 hours total, ~$800–2000, with high probability of at least one failed run
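
The dollar figures are simply wall-clock hours times the hourly price of the node. The sketch below assumes a rate of $20 per hour for the 8× A100 node; that rate is my assumption, chosen because it reproduces the SFT and DPO figures above, and it is not a quoted cloud price.

```python
def cost_range(hours_low, hours_high, node_rate_per_hour):
    """Return the (low, high) cost in dollars for a run on one node."""
    return hours_low * node_rate_per_hour, hours_high * node_rate_per_hour

NODE_RATE = 20.0  # assumed USD per hour for an 8x A100 node

runs = {"SFT": (4, 6), "DPO": (6, 10), "RLHF": (48, 72)}
estimates = {name: cost_range(lo, hi, NODE_RATE)
             for name, (lo, hi) in runs.items()}

for name, (low, high) in estimates.items():
    print(f"{name}: ${low:.0f} to ${high:.0f}")
```

At this assumed rate a single clean RLHF pass falls inside the quoted range, and every failed PPO run adds its full hours on top, which is where the real cost gap tends to come from.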

Beyond DPO: IPO, SimPO, KTO, and ORPO

DPO's weaknesses have spawned a family of improvements:

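IPO (Identity Preference Optimization, 2024): targets DPO's tendency to overfit on preference pairs by replacing the log-sigmoid loss with a squared loss that pulls the log-ratio margin toward a fixed target
SimPO (Simple Preference Optimization, 2024): removes the reference model entirely and uses length-normalized (average) log-likelihood as the implicit reward, with a target margin between chosen and rejected responses; cheaper to run, and its authors report gains over DPO on several benchmarks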
KTO (Kahneman-Tversky Optimization, 2024): works on non-paired feedback (binary good/bad labels) — more data-efficient when pairwise comparison is expensive
ORPO (Odds Ratio Preference Optimization, 2024): integrates SFT and preference learning into a single stage, with no reference model
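
To show how two of these variants differ from DPO in code, here is a hedged sketch of the per-pair IPO and SimPO losses as I understand them from the respective papers. Inputs are sequence log-probabilities; the function names and default hyperparameter values are illustrative, not recommended settings.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def ipo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, tau=0.1):
    """Squared loss pulling the log-ratio margin toward 1 / (2 * tau)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return (margin - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(pi_chosen, len_chosen, pi_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """Reference-free loss on length-normalized log-probs with margin gamma."""
    avg_chosen = pi_chosen / len_chosen        # average log-prob per token
    avg_rejected = pi_rejected / len_rejected
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)

# IPO: a margin of exactly 1/(2*tau) = 5 gives zero loss; more is penalized
on_target = ipo_loss(-10.0, -15.0, -12.0, -12.0, tau=0.1)
overshoot = ipo_loss(-8.0, -18.0, -12.0, -12.0, tau=0.1)

# SimPO: no reference log-probs are needed at all
simpo = simpo_loss(-10.0, 20, -30.0, 20)
print(on_target, overshoot, simpo)
```

IPO's bounded target is what limits overfitting: unlike the log-sigmoid, the squared loss stops rewarding the policy once the margin is large enough. SimPO's length normalization keeps long responses from winning on total log-probability alone.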

Limitations of Both Approaches

Reward hacking (RLHF): the model finds responses that score well on the reward model but aren't actually better — more common with small/weak reward models
Distribution shift: both approaches can cause the model to degrade on tasks not well represented in the preference data
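Label noise: human annotators disagree with each other on a substantial share of comparisons (InstructGPT reports inter-labeler agreement of roughly 73%), so preference data carries real noise that the reward model or policy can end up fitting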
Length and style bias: annotators often prefer longer or more confident-sounding responses regardless of quality
DPO offline limitation: DPO can't improve from responses it generates itself — it's limited to what's in the fixed dataset
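
One cheap diagnostic for the bias toward longer responses noted above is to measure how often the chosen response is the longer one in your preference data. This is a minimal sketch: the function name is hypothetical, length is counted in whitespace-separated words, and the toy pairs are made up.

```python
def longer_chosen_rate(pairs):
    """Fraction of non-tied pairs in which the chosen response is longer.

    pairs: list of (chosen_text, rejected_text) tuples.
    """
    longer = shorter = 0
    for chosen, rejected in pairs:
        n_chosen, n_rejected = len(chosen.split()), len(rejected.split())
        if n_chosen > n_rejected:
            longer += 1
        elif n_chosen < n_rejected:
            shorter += 1
    decided = longer + shorter
    if decided == 0:
        raise ValueError("every pair is tied on length")
    return longer / decided

# Toy preference pairs (chosen, rejected)
toy_pairs = [
    ("a detailed answer here", "short"),
    ("brief", "a much longer reply"),
    ("four words in total", "two words"),
    ("same length", "equal size"),  # tie, ignored
]
rate = longer_chosen_rate(toy_pairs)
print(f"chosen is longer in {rate:.0%} of non-tied pairs")
```

A rate far above 50% does not prove the labels are wrong, but it suggests a model trained on them may learn "longer is better" as a shortcut.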
