
RLHF and DPO: How Language Models Learn Human Preferences

How reinforcement learning from human feedback works end to end, why DPO replaced it for most teams, and what actually changes in the model during alignment training.

Alignment is not censorship. When people talk about "aligning" a language model, they mean something specific: shifting the output distribution toward responses that humans prefer — more helpful, more accurate, less harmful. The base model, trained on internet text, outputs what's statistically likely. Alignment training steers it toward what's actually good.

RLHF (Reinforcement Learning from Human Feedback) was the technique that made this work at scale. DPO (Direct Preference Optimization) is the technique that replaced it for most teams. Understanding both tells you something important about how modern LLMs actually work.

The 3-Phase RLHF Pipeline

RLHF as introduced in InstructGPT (2022) has three distinct phases. Each builds on the previous.

Phase 1: Supervised Fine-Tuning (SFT)

You start with a pretrained base model and fine-tune it on a curated dataset of (prompt, ideal response) pairs — usually written or heavily edited by human contractors. This teaches the model the general format and style of helpful responses. The base model can generate coherent text; SFT teaches it what a good assistant response looks like.
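As a rough illustration of the SFT objective, the sketch below computes a mean negative log-likelihood over response tokens only. Masking out the prompt tokens is a common convention but an assumption here (the text above does not specify it), and the function name and per-token log-probabilities are made up for the example.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True for tokens in the ideal response, False for prompt tokens.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical values: a 2-token prompt followed by a 3-token response.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)  # averages -log(0.5), -log(0.25), -log(0.5)
```

Lower loss means the model assigns higher probability to the curated response; the prompt tokens contribute nothing under this masking choice.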

Phase 2: Reward Model Training

Human labelers are shown multiple model outputs for the same prompt and rank them by preference. Each ranking is broken into pairwise comparisons (prompt, better_response, worse_response), and these preference pairs are used to train a separate reward model: a model that predicts how much a human would prefer a given response.
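One way to turn a ranking into preference pairs is to emit every (better, worse) combination, so a ranking of K responses yields K·(K−1)/2 pairs. The helper below is a hypothetical sketch of that step, assuming the ranking is given best-first with no ties.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (prompt, better, worse) triples."""
    # combinations() keeps input order, so the first element of each pair
    # always ranks above the second.
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs(
    "Explain KL divergence.",
    ["answer A", "answer B", "answer C"],  # ranked best to worst
)
# Three responses produce three pairs: (A, B), (A, C), (B, C).
```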

The reward model is a fine-tuned version of the SFT model with a regression head instead of a next-token prediction head. It outputs a single scalar: the estimated human preference score.
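The scalar scores are usually trained with a pairwise loss: the reward model is penalized when it scores the rejected response above the preferred one. A minimal sketch, assuming the two scalar scores have already been computed (function names are illustrative):

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(score_better, score_worse):
    """Pairwise loss: -log sigmoid(r(x, better) - r(x, worse))."""
    return -log_sigmoid(score_better - score_worse)

correct_order = reward_model_loss(2.0, 0.0)  # small loss: preferred response scored higher
wrong_order = reward_model_loss(0.0, 2.0)    # large loss: preferred response scored lower
undecided = reward_model_loss(0.0, 0.0)      # equals log(2): the model has no preference
```

Only the difference between the two scores enters the loss, so the absolute scale of the reward is not pinned down by this objective.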

Phase 3: PPO (Reinforcement Learning)

The SFT model is now fine-tuned using the reward model as the reward signal. The model generates responses, the reward model scores them, and PPO updates the policy to maximize the reward. A KL penalty term keeps the policy from diverging too far from the SFT model — without it, the model reward-hacks into nonsense.

max_θ E[r_φ(x, y)] − β · KL(π_θ(y|x) || π_SFT(y|x))

Where:
  r_φ(x, y)   = reward model score for (prompt x, response y)
  π_θ          = current policy (LLM being trained)
  π_SFT        = reference SFT policy (frozen)
  β            = KL penalty coefficient (typically 0.1–0.5)
  KL term      = prevents reward hacking / distribution collapse
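In implementations, the KL term is commonly estimated per sample from the log-probabilities of the generated response under the policy and the frozen reference. The sketch below shows that shaping step for a single response; the function name and the numbers are illustrative assumptions, not values from InstructGPT.

```python
def kl_penalized_reward(reward, policy_logprob, ref_logprob, beta=0.1):
    """Reward-model score minus a KL penalty for one sampled response.

    policy_logprob: log pi_theta(y|x), summed over the response tokens.
    ref_logprob:    log pi_SFT(y|x), summed over the same tokens.
    """
    # Single-sample estimate of KL(pi_theta || pi_SFT) for this response.
    kl_estimate = policy_logprob - ref_logprob
    return reward - beta * kl_estimate

# The policy finds this response far more likely than the SFT model does,
# so part of the reward is taken back: 1.5 - 0.1 * 8.0.
shaped = kl_penalized_reward(reward=1.5, policy_logprob=-12.0, ref_logprob=-20.0)

# If the policy has not moved from the reference, the reward is unchanged.
unshaped = kl_penalized_reward(reward=1.5, policy_logprob=-20.0, ref_logprob=-20.0)
```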

PPO is brittle. The KL penalty coefficient β is hard to tune — too low and the model reward-hacks, too high and it barely learns anything. This instability is one of the main reasons DPO took over.

Why PPO Is Brittle in Practice

Meta's 70B Llama RLHF training reportedly cost $5–10M in compute. The complexity is not just financial — the engineering burden of debugging RL training runs at this scale is substantial.

What DPO Does Differently

Direct Preference Optimization (Rafailov et al., 2023) sidesteps the RL loop entirely. The key insight: the optimal policy under the RLHF objective has a closed-form solution. You don't need a separate reward model or PPO — you can derive the alignment objective directly from preference data.

DPO treats the language model itself as the implicit reward model. Given a preference pair (prompt x, chosen response y_w, rejected response y_l), the DPO loss directly increases the probability of y_w relative to y_l, while staying anchored to the SFT reference policy.
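The closed form can be written out as follows. This is a sketch of the argument in Rafailov et al. (2023), using π_ref for the frozen SFT policy; Z(x) is a normalizing constant introduced here for the derivation.

```latex
% Optimal policy of the KL-regularized objective, for any reward r:
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x)
    \exp\!\left( \tfrac{1}{\beta} \, r(x, y) \right),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\left( \tfrac{1}{\beta} \, r(x, y) \right)

% Solving for the reward in terms of the policy:
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Pairwise preferences depend only on the reward difference, so Z(x) cancels:
r(x, y_w) - r(x, y_l)
  = \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```

Because the intractable Z(x) drops out, the reward difference can be expressed entirely through policy log-ratios, which is what lets the policy stand in for the reward model.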

L_DPO = -log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x))
                − β · log(π_θ(y_l|x) / π_ref(y_l|x)))

Where:
  y_w      = chosen (preferred) response
  y_l      = rejected (less preferred) response
  π_θ      = model being trained
  π_ref    = frozen SFT reference model
  σ        = sigmoid function
  β        = temperature controlling KL penalty strength

DPO eliminates the separate reward model and the PPO loop. It trains directly on (prompt, chosen, rejected) triples with a supervised objective: a binary cross-entropy loss on the difference between the chosen and rejected log-probability ratios. Same preference data, far fewer moving parts.
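The loss for a single preference pair can be sketched in a few lines, assuming the summed log-probabilities of each response under the policy and the frozen reference are already available. Function and argument names are illustrative, not a specific library's API.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log pi(y|x) summed over the response tokens.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # log(pi_theta / pi_ref) for y_w
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # log(pi_theta / pi_ref) for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2).
initial = dpo_loss(-40.0, -42.0, -40.0, -42.0)

# After the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-38.0, -45.0, -40.0, -42.0)
```

Note that the reference log-probabilities never receive gradients; in a real training loop they can be precomputed once per dataset.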

The Bradley-Terry Model and Why It Works

Both RLHF and DPO are grounded in the Bradley-Terry model of pairwise preference, a probabilistic framework that says the probability of preferring response A over B is exp(r_A) / (exp(r_A) + exp(r_B)), which is the same as σ(r_A − r_B), where r is the underlying reward. Only the difference in reward matters.

This is a well-studied model from statistics used in sports rankings, psychology, and economics. It's the mathematical bridge between discrete pairwise judgments ("I prefer A to B") and continuous reward signals. RLHF trains a reward model to learn these reward values, then uses RL to optimize the policy against them; DPO optimizes the same Bradley-Terry objective directly, without the intermediate step.
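A small numerical check makes the Bradley-Terry form concrete. The reward values below are arbitrary; the point is that the preference probability equals a sigmoid of the reward difference and is unchanged when both rewards shift by the same constant.

```python
import math

def bt_prob(r_a, r_b):
    """P(A preferred over B) under the Bradley-Terry model."""
    return math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p = bt_prob(1.0, 0.0)            # A has reward 1 more than B
p_shifted = bt_prob(101.0, 100.0)  # same gap, both rewards shifted by 100
p_tie = bt_prob(0.3, 0.3)        # equal rewards give a coin flip
```

The shift invariance is why a reward model trained on pairwise comparisons has no meaningful absolute scale, only meaningful differences.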

What Actually Changes in the Weights

The practical question: when you run alignment training, which layers of the model actually change, and how much?

Research on mechanistic interpretability suggests that refusal behavior is often mediated by a small number of directions in residual-stream activation space, typically in mid-to-upper layers. When alignment training succeeds, it amplifies activation along these directions for harmful inputs. When it fails (jailbreaks), adversarial prompts push the activations away from those directions.
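To make "a direction in activation space" concrete, the toy sketch below estimates a direction as the difference between mean activations on two groups of prompts, then measures how far a new activation lies along it. Difference-of-means is one simple heuristic chosen here for illustration; the 3-dimensional vectors are made up and stand in for real residual-stream activations, which have thousands of dimensions.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit_difference_of_means(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection(activation, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Made-up activations: the two groups differ mainly along the first axis.
harmful = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]
direction = unit_difference_of_means(harmful, harmless)

strong = projection([2.0, 0.1, 0.0], direction)  # lies far along the direction
weak = projection([0.5, 3.0, 0.0], direction)    # large vector, but mostly off-axis
```

In this picture, a successful jailbreak would correspond to an input whose activation has a small projection even though the request is harmful.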

When RLHF Still Beats DPO

DPO has largely replaced RLHF for fine-tuning teams using open models. But for frontier model training, RL-based methods (PPO, or alternatives such as REINFORCE and GRPO) still dominate in certain regimes.

Frontier labs (OpenAI, Anthropic, Google DeepMind) still use RL-based alignment methods for their most capable models. DPO is the practical choice for teams working with 7B–70B open models on a budget, which describes most practitioners.

Key Numbers

Aspect                        RLHF (PPO)                 DPO
Training phases               3 (SFT + RM + PPO)         2 (SFT + DPO)
Separate reward model needed  Yes                        No
Memory overhead               4× model copies in memory  2× model copies
Training stability            Low (RL instability)       High (supervised objective)
Preference data format        Rankings or pairwise       Pairwise (chosen, rejected)
Online / offline              Online (PPO rollouts)      Offline (fixed dataset)
Cost at 70B scale             ~$5–10M (Meta estimate)    ~$500K–1M (estimate)
Adopted by open-source teams  Rare                       Standard (TRL, Axolotl, etc.)
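The memory row can be turned into a back-of-the-envelope estimate. The sketch below assumes 16-bit weights (2 bytes per parameter), that every copy is the same size as the policy, and that the four PPO copies are the policy, the reference, the reward model, and a value model (the usual reading, though the table does not spell it out). It counts weights only and ignores gradients, optimizer state, and activations, so real requirements are higher.

```python
def weight_memory_gb(num_params, num_copies, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * num_copies / 1e9

params_70b = 70e9

# PPO: policy + reference + reward model + value model (assumed breakdown).
ppo_gb = weight_memory_gb(params_70b, num_copies=4)

# DPO: policy + frozen reference.
dpo_gb = weight_memory_gb(params_70b, num_copies=2)

print(f"PPO weights: {ppo_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
```

Under these assumptions the weights alone come to 560 GB for PPO versus 280 GB for DPO at 70B parameters.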

