RLHF and DPO: How Language Models Learn Human Preferences

How reinforcement learning from human feedback works end to end, why DPO replaced it for most teams, and what actually changes in the model during alignment training.

Alignment is not censorship. When people talk about "aligning" a language model, they mean something specific: shifting the output distribution toward responses that humans prefer — more helpful, more accurate, less harmful. The base model, trained on internet text, outputs what's statistically likely. Alignment training steers it toward what's actually good.

RLHF (Reinforcement Learning from Human Feedback) was the technique that made this work at scale. DPO (Direct Preference Optimization) is the technique that replaced it for most teams. Understanding both tells you something important about how modern LLMs actually work.

The 3-Phase RLHF Pipeline

RLHF as introduced in InstructGPT (2022) has three distinct phases. Each builds on the previous.

Phase 1: Supervised Fine-Tuning (SFT)

You start with a pretrained base model and fine-tune it on a curated dataset of (prompt, ideal response) pairs — usually written or heavily edited by human contractors. This teaches the model the general format and style of helpful responses. The base model can generate coherent text; SFT teaches it what a good assistant response looks like.

Dataset size: typically 10K–100K high-quality (prompt, response) pairs
Training objective: standard next-token prediction (cross-entropy loss)
Result: a model that responds helpfully but without preference calibration
Cost: moderate — the dataset is expensive to produce, training is standard fine-tuning
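
The next-token objective can be sketched in plain Python. This is a toy illustration, assuming the model has already produced a probability distribution over a tiny vocabulary at each position. The function name `sft_loss` and the choice to mask prompt tokens out of the loss are assumptions made for the example, not details taken from InstructGPT.

```python
import math

def sft_loss(token_probs, target_ids, loss_mask):
    """Mean cross-entropy over the positions selected by loss_mask.

    token_probs: per-position probability distributions over the vocabulary
    target_ids:  the token the model should predict at each position
    loss_mask:   1 for response tokens (counted), 0 for prompt tokens (ignored)
    """
    total, count = 0.0, 0
    for probs, target, keep in zip(token_probs, target_ids, loss_mask):
        if keep:
            total += -math.log(probs[target])
            count += 1
    if count == 0:
        raise ValueError("loss_mask selects no tokens")
    return total / count

# Toy data: vocabulary of 3 tokens, 3 positions, first position belongs to the prompt.
probs = [
    [0.2, 0.5, 0.3],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
targets = [1, 1, 2]
mask = [0, 1, 1]
loss = sft_loss(probs, targets, mask)  # average of -log(0.8) and -log(0.5)
```

The loss falls as the model puts more probability on the tokens the human-written response actually contains, which is all SFT optimizes.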

Phase 2: Reward Model Training

Human labelers are shown multiple model outputs for the same prompt and rank them by preference. These preference pairs (prompt, better_response, worse_response) are used to train a separate reward model — a model that predicts how much a human would prefer a given response.

The reward model is a fine-tuned version of the SFT model with a regression head instead of a next-token prediction head. It outputs a single scalar: the estimated human preference score.

Training data: ~100K–500K pairwise comparisons
Labeling cost: significant — each comparison requires a human to read and rank two outputs
The reward model is a proxy for human judgment — it will have its own failure modes
Quality of this model directly caps alignment quality — garbage in, garbage out
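
A common way to train on these pairs is a logistic loss on the difference between the two scalar scores. The sketch below assumes the reward model has already scored both responses; `pairwise_reward_loss` is a hypothetical helper name, and a real implementation would compute this over batches of tensors.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response wins: -log(sigmoid(r_w - r_l))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

good = pairwise_reward_loss(2.0, -1.0)  # reward model agrees with the labeler
tie = pairwise_reward_loss(0.5, 0.5)    # no separation between the two; loss is log(2)
bad = pairwise_reward_loss(-1.0, 2.0)   # reward model disagrees with the labeler
```

Only the gap between the two scores enters the loss, so the reward model learns a relative ordering rather than an absolute quality scale.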

Phase 3: PPO (Reinforcement Learning)

The SFT model is now fine-tuned using the reward model as the reward signal. The model generates responses, the reward model scores them, and PPO updates the policy to maximize the reward. A KL penalty term keeps the policy from diverging too far from the SFT model — without it, the model reward-hacks into nonsense.

max_θ E[r_φ(x, y)] − β · KL(π_θ(y|x) || π_SFT(y|x))

Where:
  r_φ(x, y)   = reward model score for (prompt x, response y)
  π_θ          = current policy (LLM being trained)
  π_SFT        = reference SFT policy (frozen)
  β            = KL penalty coefficient (setup-dependent; InstructGPT reports 0.02, other setups use values up to roughly 0.5)
  KL term      = prevents reward hacking / distribution collapse
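
To make the trade-off concrete, here is a minimal sketch of the quantity being maximized for a single sampled response. It assumes the KL term is estimated from one sample as the difference of summed log-probabilities, which is a common estimator but not the only one. The function name and all numbers are hypothetical.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Single-sample estimate of r(x, y) - beta * KL(pi_theta || pi_SFT).

    logp_policy, logp_ref: summed log-probabilities of the sampled response y
    under the current policy and under the frozen SFT reference.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# A response close to what the SFT model would produce: small penalty.
near_ref = kl_penalized_reward(reward=1.0, logp_policy=-12.0, logp_ref=-12.5, beta=0.1)

# A response with a higher raw reward that the SFT model finds very unlikely: large penalty.
drifted = kl_penalized_reward(reward=1.5, logp_policy=-5.0, logp_ref=-30.0, beta=0.1)
```

With these made-up numbers the drifted response ends up with the lower penalized reward despite its higher raw score, which is the mechanism that discourages reward hacking.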

PPO is brittle. The KL penalty coefficient β is hard to tune — too low and the model reward-hacks, too high and it barely learns anything. This instability is one of the main reasons DPO took over.

Why PPO Is Brittle in Practice

Reward hacking: the model finds patterns that score well with the reward model but don't generalize to real human preferences
KL penalty sensitivity: β requires careful tuning per model and dataset — there's no universal good value
Training instability: RL training with LLMs can diverge unexpectedly, especially at scale
Memory overhead: you're running four models simultaneously (policy, reference, reward model, value function)
Sample inefficiency: PPO requires many rollouts per policy update compared to supervised learning
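
The memory point can be illustrated with back-of-the-envelope arithmetic. This sketch counts weights only and assumes 16-bit parameters and four equally sized models. In practice the reward and value models may be smaller than the policy, and optimizer state, gradients, and activations add substantially more, so treat the numbers as a rough lower bound.

```python
def weight_memory_gb(num_params_billion, num_copies, bytes_per_param=2):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bytes_per_param * num_copies

# Hypothetical 7B-parameter setup.
ppo_gb = weight_memory_gb(7, num_copies=4)  # policy, reference, reward model, value function
dpo_gb = weight_memory_gb(7, num_copies=2)  # policy and reference only
```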

Meta's RLHF training for the 70B Llama model reportedly cost $5–10M in compute. The cost is not only financial: debugging RL training runs at this scale is a substantial engineering burden.

What DPO Does Differently

Direct Preference Optimization (Rafailov et al., 2023) sidesteps the RL loop entirely. The key insight: the optimal policy under the RLHF objective has a closed-form solution. You don't need a separate reward model or PPO — you can derive the alignment objective directly from preference data.

DPO treats the language model itself as the implicit reward model. Given a preference pair (prompt x, chosen response y_w, rejected response y_l), the DPO loss directly increases the probability of y_w relative to y_l, while staying anchored to the SFT reference policy.
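
The closed-form step can be written out as follows. This restates the standard derivation from the DPO paper in this article's notation. Z(x) is the per-prompt normalizing constant, a symbol the article does not otherwise use.

```latex
\begin{aligned}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                   \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right) \\[4pt]
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
r(x, y_w) - r(x, y_l) &= \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                        - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{aligned}
```

Because Z(x) cancels in the difference, the preference likelihood depends only on policy log-ratios. Replacing π* with the trainable π_θ and taking the negative log-likelihood gives the expression inside σ in the DPO loss.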

L_DPO = -log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x))
                − β · log(π_θ(y_l|x) / π_ref(y_l|x)))

Where:
  y_w      = chosen (preferred) response
  y_l      = rejected (less preferred) response
  π_θ      = model being trained
  π_ref    = frozen SFT reference model
  σ        = sigmoid function
  β        = temperature controlling KL penalty strength

DPO eliminates the separate reward model and the PPO loop. It trains directly on (prompt, chosen, rejected) triples using a modified cross-entropy objective. Same preference data, half the complexity.
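
The loss for a single triple can be sketched in plain Python. This assumes the summed log-probability of each full response has already been computed under both the trained model and the frozen reference. The function name, the default β of 0.1, and the example numbers are illustrative assumptions, not values from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    logp_*     : summed log-probability of the response under the model being trained
    ref_logp_* : the same quantity under the frozen reference model
    """
    chosen_logratio = logp_w - ref_logp_w
    rejected_logratio = logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), branched for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At initialization the policy equals the reference, so the margin is 0 and the loss is log(2).
at_init = dpo_loss(-20.0, -22.0, -20.0, -22.0)

# Hypothetical later state: chosen became more likely, rejected less likely, relative to the reference.
improved = dpo_loss(-18.0, -25.0, -20.0, -22.0)
```

The loss depends on how far each response has moved relative to the reference, not on its raw probability, which is how the reference model anchors training without an explicit KL term.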

The Bradley-Terry Model and Why It Works

Both RLHF and DPO are grounded in the Bradley-Terry model of pairwise preference, a probabilistic framework in which the probability of preferring response A over B equals exp(r_A) / (exp(r_A) + exp(r_B)), where r is the underlying reward. Equivalently, it is σ(r_A − r_B), so only the difference in rewards matters.

This is a well-studied model from statistics used in sports rankings, psychology, and economics. It's the mathematical bridge between discrete pairwise judgments ("I prefer A to B") and continuous reward signals. RLHF trains a reward model to learn these reward values then uses RL to optimize them; DPO directly optimizes the same Bradley-Terry objective without the intermediate step.
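
A small numeric check makes the Bradley-Terry formula concrete. The reward values here are made up for illustration.

```python
import math

def bradley_terry_prob(r_a, r_b):
    """P(A preferred over B) = exp(r_a) / (exp(r_a) + exp(r_b))."""
    return math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p = bradley_terry_prob(1.0, 0.0)        # about 0.731
same_gap = bradley_terry_prob(3.0, 2.0)  # same reward difference, same probability
```

The two-way softmax and the sigmoid of the reward difference are the same quantity, and shifting both rewards by a constant changes nothing. That invariance is why a reward model only needs to get relative scores right.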

What Actually Changes in the Weights

The practical question: when you run alignment training, which layers of the model actually change, and how much?

Attention layers: moderate changes, especially in middle and upper layers where complex reasoning happens
MLP layers: significant changes — this is where much of the preference-relevant knowledge seems to live
Embedding layers: minor changes — token representations stay mostly stable
Layer norm parameters: small but measurable changes throughout

Mechanistic interpretability research suggests that refusal behavior often corresponds to a small number of specific directions in the residual stream, most visibly in mid-to-upper layers. When alignment training succeeds, it amplifies activation along these directions for harmful inputs. When it fails (jailbreaks), adversarial prompts move the activations away from those directions.
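
As a toy picture of what "moving along a direction" means, the sketch below measures how far an activation vector points along a given direction. Every vector here is invented for illustration. Real residual streams have thousands of dimensions, and how a refusal direction is actually identified is outside the scope of this sketch.

```python
import math

def projection(activation, direction):
    """Scalar projection of an activation vector onto a direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm

# Hypothetical 3-dimensional "residual stream" vectors.
refusal_direction = [0.0, 3.0, 4.0]
harmful_prompt_activation = [1.0, 3.0, 4.0]
jailbreak_activation = [1.0, 0.5, -0.5]

strong = projection(harmful_prompt_activation, refusal_direction)  # large component along the direction
weak = projection(jailbreak_activation, refusal_direction)         # component near zero
```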

When RLHF Still Beats DPO

DPO has largely replaced PPO-based RLHF for fine-tuning teams using open models. But for frontier model training, RL-based methods (PPO, or alternatives such as REINFORCE and GRPO) still dominate in certain regimes:

Very large scale: the optimal policy derivation in DPO assumes certain properties that may not hold at 100B+ parameter scale
Complex reward signals: when the reward isn't reducible to pairwise preferences (e.g., multi-dimensional evaluations, process rewards), RL-based methods are more flexible
Online learning: DPO is an offline method that trains on a fixed preference dataset. Online RL samples fresh responses from the current policy during training and scores them, which matters for certain capability gains
Process rewards: training that rewards step-by-step reasoning quality, the style associated with reasoning models like o1 and o3, requires per-step reward signals that RL methods handle naturally but standard DPO doesn't

The real frontier labs (OpenAI, Anthropic, Google DeepMind) still use RL-based alignment methods for their most capable models. DPO is the practical choice for teams working with 7B–70B open models on a budget — which is most practitioners.

Key Numbers

Aspect	RLHF (PPO)	DPO
Training phases	3 (SFT + RM + PPO)	2 (SFT + DPO)
Separate reward model needed	Yes	No
Memory overhead	4× model copies in memory	2× model copies
Training stability	Low (RL instability)	High (supervised objective)
Preference data format	Rankings or pairwise	Pairwise (chosen, rejected)
Online / offline	Online (PPO rollouts)	Offline (fixed dataset)
Cost at 70B scale	~$5–10M (reported)	~$500K–1M (estimate)
Adopted by open-source teams	Rare	Standard (TRL, Axolotl, etc.)

Key Papers

InstructGPT (Ouyang et al., 2022): applied RLHF to instruction-following LLMs at scale; the paper that made ChatGPT possible
Constitutional AI (Bai et al., 2022): Anthropic's extension using AI feedback instead of human labelers
DPO (Rafailov et al., 2023): the closed-form derivation that eliminated the separate reward model
"RLHF Workflow: From Reward Modeling to Online RLHF" (Dong et al., 2024): practical analysis of when each approach works
