
RLHF and DPO: How Models Learn to Do What You Want

The full alignment pipeline — SFT, reward model training, PPO — and why DPO replaced most of it. Includes the Bradley-Terry model, KL penalty mechanics, reward hacking failure modes, and practical tradeoffs between RLHF and DPO.

Nearly every capable LLM you've used was trained with human feedback at some point. The model you see (helpful, coherent, aligned with what you actually want) is the product of a training process that goes far beyond next-token prediction. RLHF and its successor DPO are the techniques that bridge the gap between 'predicts text' and 'does what you ask'.

This post explains the full pipeline from SFT through reward models to PPO, then shows why DPO quietly replaced most of it — and what the tradeoffs look like in practice.

Step 1: Supervised Fine-Tuning (SFT)

Before any human feedback enters the picture, the base pretrained model is fine-tuned on a curated set of (prompt, ideal response) pairs. This is standard supervised learning — cross-entropy loss on the target tokens. The goal is to get the model into the right 'shape' before the more expensive alignment steps.
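As a minimal sketch of that loss, the snippet below averages the negative log-likelihood over the response tokens only. Masking out the prompt tokens is a common convention but an assumption here, and the function name `sft_loss` and the example probabilities are hypothetical.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True for response tokens, False for prompt tokens.
    """
    kept = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)

# Hypothetical example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)  # prompt tokens do not contribute
```

In a real training loop the log-probabilities would come from the model's softmax output at each position, and the same masking is usually expressed through an ignore index in the framework's cross-entropy function.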

SFT alone is often surprisingly good. LIMA (2023) fine-tuned a 65B LLaMA model on just 1,000 carefully chosen examples and produced responses that human raters often judged comparable to those of RLHF-tuned models. The alignment gap is real but sometimes smaller than assumed.

Step 2: Reward Model Training

A reward model is a separate neural network trained to predict which of two responses a human would prefer. Human annotators are shown pairs of responses to the same prompt and asked to rank them. This comparison data — hundreds of thousands of pairwise preferences — trains the reward model.

The reward model uses the Bradley-Terry preference model under the hood: for a pair of responses (y_w, y_l) to prompt x, the probability that y_w is preferred is:

P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))

where:
  r(x, y) = reward model score for response y to prompt x
  σ       = sigmoid function

Training objective: maximize log-likelihood of human preferences
Loss = -E[log σ(r(x, y_w) - r(x, y_l))]

The reward model is typically initialized from the SFT model, with the final token-prediction layer replaced by a head that outputs a single scalar. Training converges relatively quickly, often within a few thousand gradient steps on the preference data.
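The Bradley-Terry probability and the resulting loss can be sketched in a few lines of plain Python. This is an illustration on scalar scores, not a training loop: the function names and example scores are hypothetical, and a real reward model would produce r(x, y) from a network.

```python
import math

def log_sigmoid(z):
    # Stable log(sigmoid(z)) that avoids overflow for large |z|.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_prob(r_chosen, r_rejected):
    """P(y_w > y_l | x) = sigmoid(r(x, y_w) - r(x, y_l))."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))

def reward_model_loss(pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in pairs) / len(pairs)

# Hypothetical scores: the first pair is ranked correctly, the second is not.
pairs = [(2.0, 0.0), (0.5, 1.5)]
loss = reward_model_loss(pairs)
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. Equal scores give a preference probability of 0.5 and a per-pair loss of log 2.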

Step 3: RL Fine-Tuning with PPO

The SFT model is now fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score on generated responses. For each prompt, the policy (LLM) generates a response, the reward model scores it, and the policy parameters are updated to generate higher-scoring responses.

Objective: maximize E[r(x, y)] - β * KL(π_θ || π_ref)

where:
  r(x, y)      = reward model score
  KL(π_θ||π_ref) = KL divergence from reference (SFT) model
  β            = KL penalty coefficient (typically 0.01–0.1)

The KL penalty prevents the policy from drifting too far
from the SFT model — without it, reward hacking occurs fast.

The KL divergence penalty is critical. Without it, the policy quickly learns to produce outputs that score high on the reward model but are nonsensical to humans — this is reward hacking. With too large a β, the policy barely moves from SFT. Getting β right requires careful tuning.
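One way to see the effect of β is to fold the penalty into the reward for a single sampled response. The sketch below assumes a single-sample KL estimate, log π_θ(y|x) minus log π_ref(y|x), computed from summed token log-probabilities. Many implementations apply the penalty per token instead, so treat the function name and the numbers as illustrative.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy, logp_ref: summed token log-probs of the sampled response
    under the current policy and the frozen reference (SFT) model.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# Hypothetical response the policy now finds far more likely than the
# reference does: a sign that the policy has drifted.
drifted = kl_shaped_reward(3.0, logp_policy=-20.0, logp_ref=-60.0, beta=0.05)
unpenalized = kl_shaped_reward(3.0, logp_policy=-20.0, logp_ref=-60.0, beta=0.0)
```

With β = 0 the policy keeps the full reward-model score no matter how far it has drifted. With β = 0.05 the same response loses two of its three reward points.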

Why RLHF Is Expensive

| Cost Component | Why It Hurts | Approximate Scale |
| --- | --- | --- |
| Human annotation | Pairwise comparisons are slow and expensive | 100K–1M pairs, $0.05–0.50/pair |
| Reward model training | Full fine-tune of a separate LLM | Equivalent to SFT training cost |
| PPO stability | Requires careful hyperparameter tuning | Many failed runs before convergence |
| 4 models in memory | Policy, reference, reward, value model all loaded simultaneously | 4× inference VRAM during training |
| Iteration speed | Each PPO step requires multiple forward passes | 5–20× slower than SFT per token |
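To put a rough number on the memory row, here is back-of-the-envelope arithmetic under stated assumptions: all four models have 7B parameters, weights are stored in 16-bit precision (2 bytes per parameter), and 1 GB is 10^9 bytes. It counts weights only, so gradients, optimizer state, and activations for the trainable models come on top.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumed setup: four 7B-parameter models held in 16-bit precision.
per_model_gb = weights_gb(7e9)
ppo_weights_gb = 4 * per_model_gb  # policy + reference + reward + value
```

Under these assumptions each model needs about 14 GB for weights and the four together need about 56 GB before any training state is allocated.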

PPO for LLMs is notoriously unstable. Reward hacking, mode collapse, and training divergence are common. OpenAI's original InstructGPT paper mentions 'careful reward normalization' and 'PPO-clip' modifications — both essential but underdocumented.
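For readers who have not seen them, the two ideas named above can be sketched as follows. The first function is the standard PPO clipped surrogate objective. The second is one simple form of reward normalization (batch whitening). Neither is claimed to match what InstructGPT actually did, and `clip_eps=0.2` is an assumed default.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sampled action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the min removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

def normalize_rewards(rewards, eps=1e-8):
    """Shift and scale a batch of rewards to roughly zero mean, unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# The ratio exp(0.5) is about 1.65, so a positive advantage is capped at 1.2x.
capped = ppo_clip_objective(logp_new=0.5, logp_old=0.0, advantage=1.0)
```

Clipping limits how much a single batch can change the policy, and normalization keeps the scale of the reward signal consistent across batches. Both target the instability described above.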

DPO: The Simpler Alternative

Direct Preference Optimization (DPO), introduced by Rafailov et al. at Stanford in 2023, makes a key observation: the KL-regularized RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself. You don't need to train a separate reward model at all; you can fine-tune directly on preference pairs using a simple classification loss.
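The closed-form step can be sketched as follows. The notation matches the objective from Step 3, and Z(x) is a normalizing constant (a symbol introduced here) that depends only on the prompt.

```latex
% Optimal policy for: maximize E[r(x, y)] - \beta \, KL(\pi_\theta || \pi_ref)
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x)
                  \exp\!\left(\frac{1}{\beta} \, r(x, y)\right)

% Solve for the reward
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x)

% Substitute into Bradley-Terry; the \beta \log Z(x) terms cancel
P(y_w \succ y_l \mid x)
  = \sigma\!\left(
      \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```

Because Z(x) cancels in the difference, the preference probability depends only on policy and reference log-probabilities. Maximizing its log-likelihood over the preference data with π_θ in place of π* gives the DPO objective.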

The DPO loss is elegant:

DPO Loss = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]

Intuitively: increase the probability of preferred responses
relative to the reference policy, while decreasing the probability
of rejected responses — with β controlling how hard you push.

No separate reward model needed. No PPO. Just one trainable policy, a frozen reference model, and one loss.

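DPO trains on the same (prompt, chosen, rejected) triplets that would go into reward model training, but skips the reward model entirely and fine-tunes the policy directly. The training loop looks like SFT and is far more stable than PPO, though results remain sensitive to β and to the quality of the preference data.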
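A minimal sketch of the DPO loss for a single triplet, written in plain Python on precomputed log-probabilities. It assumes each input is the summed token log-probability of a full response. The function names, `beta=0.1`, and the example numbers are hypothetical.

```python
import math

def log_sigmoid(z):
    # Stable log(sigmoid(z)) that avoids overflow for large |z|.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    *_w are log-probs of the chosen response, *_l of the rejected one,
    under the trainable policy and the frozen reference model.
    """
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the margin is zero.
initial = dpo_loss(-13.0, -14.0, -13.0, -14.0)
# After the policy shifts probability toward the chosen response.
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

The loss starts at log 2 when the policy and reference agree and falls as the policy raises the chosen response's log-ratio relative to the rejected one. The reference log-probabilities can be computed once in a forward pass with no gradients, which is why only the policy needs optimizer state.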

RLHF vs DPO: When to Use What

| Dimension | RLHF | DPO |
| --- | --- | --- |
| Training stability | Difficult: PPO requires careful tuning | Stable: SFT-like training loop |
| Infrastructure complexity | 4 models in memory simultaneously | 2 models (policy + frozen reference) |
| Data requirements | Same pairwise preference data | Same pairwise preference data |
| Online learning | Can generate new responses during training | Offline only: uses fixed dataset |
| Fine-grained control | High: reward shaping possible | Lower: direct on preferences only |
| Compute cost | 5–10× more than DPO | Comparable to SFT |
| Common use cases | Frontier labs with massive compute | Open-source fine-tuning, smaller teams |

In practice: most open-source alignment pipelines (Zephyr, Mistral-Instruct, Llama-3-Instruct) use DPO or variants. Full RLHF with PPO is primarily used by labs with the infrastructure to make it stable — OpenAI, Anthropic, Google.

Real Cost Comparison

For a 7B model fine-tuned on 50K preference pairs on 8× A100 GPUs:

Beyond DPO: SimPO and IPO

DPO's weaknesses have spawned a family of improvements:

Limitations of Both Approaches

