DPO: Direct Preference Optimization — Alignment Without a Reward Model
Stanford's 2023 paper simplifying RLHF into a single classification objective. How DPO replaced PPO in most open-source fine-tuning pipelines — and the tradeoffs vs. RLHF.
RLHF works. But RLHF is complicated. It requires training a separate reward model, then using PPO to fine-tune the policy against it — with a KL penalty to prevent reward hacking, and careful hyperparameter tuning to keep PPO stable. A three-stage pipeline with multiple places to fail.
In May 2023, Rafael Rafailov and colleagues at Stanford published 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. They showed that the KL-constrained RLHF objective can be reformulated as a simple classification problem over preference pairs: one fine-tuning pass, no separate reward model, no PPO. DPO has since become a default preference-tuning method in many open-source fine-tuning pipelines.
The key mathematical insight
In RLHF, you train a reward model and use it to fine-tune the policy. The constrained RLHF objective has a known closed-form solution — rearranging it shows the reward can be expressed in terms of policy probabilities directly. The policy IS implicitly a reward model. You don't need to train a separate one.
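The rearrangement can be sketched in four steps. This is a condensed version of the standard derivation, not a full proof; Z(x) is a normalizing constant introduced here for the sketch, and y_w / y_l denote the chosen and rejected responses.

```latex
\begin{align*}
% 1. KL-constrained RLHF objective for a reward function r
&\max_{\pi}\; \mathbb{E}_{x}\Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big] \\
% 2. Its closed-form optimum, with Z(x) = \sum_y \pi_ref(y|x) exp(r(x,y)/\beta)
&\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big) \\
% 3. Take logs and solve for the reward
&r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% 4. Compare two responses to the same prompt: the Z(x) terms cancel
&r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{align*}
```

Because Z(x) depends only on the prompt, it drops out of any pairwise comparison. Feeding the difference in step 4 into a Bradley-Terry preference model, p(y_w preferred over y_l) = σ(r(x, y_w) − r(x, y_l)), and taking the negative log-likelihood gives a loss that involves only policy and reference probabilities.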
RLHF: three stages (SFT → reward model → PPO)
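DPO insight: reward = β · log[ π(y|x) / π_ref(y|x) ] + β · log Z(x)
(Z(x) depends only on the prompt x, so it cancels when two responses to the same prompt are compared)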
DPO loss (binary cross-entropy on preference pairs):
L = −log σ(
β·log[π(y_w|x)/π_ref(y_w|x)] ← push up probability of chosen response
− β·log[π(y_l|x)/π_ref(y_l|x)] ← push down probability of rejected response
)
One fine-tuning pass. No PPO. No reward model.
DPO directly trains the policy to assign higher probability to preferred responses relative to the reference policy. The preference signal is baked into a single classification loss — no reinforcement learning required.
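As a rough illustration, the loss for a single preference pair can be written in a few lines of plain Python. This is a minimal sketch, not a training implementation: it assumes the four sequence log-probabilities have already been computed, the function names `log_sigmoid` and `dpo_loss` are made up for this example, and the default `beta=0.1` is only a placeholder value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # placeholder value, tune for your setup
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is a sequence log-probability log pi(y|x):
    two from the trainable policy, two from the frozen reference.
    """
    # Implicit rewards: beta * log[pi(y|x) / pi_ref(y|x)]
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Binary cross-entropy on "chosen beats rejected"
    return -log_sigmoid(chosen_reward - rejected_reward)


# At the start of training the policy equals the reference, so both
# implicit rewards are zero and the loss is log(2).
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability toward the chosen response,
# the margin is positive and the loss is lower.
later_loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In a real training loop the same expression would be computed on batched tensors so that gradients flow through the two policy log-probabilities; the reference log-probabilities are constants.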
What you need for DPO training
- Preference dataset: (prompt, chosen_response, rejected_response) triplets — the same data you'd use for RLHF reward model training
- Reference model: a frozen copy of the SFT model, used to compute π_ref in the loss
- One fine-tuning run: an ordinary gradient-based fine-tuning loop, with the DPO loss in place of the usual next-token cross-entropy loss
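To show how these ingredients meet, here is a small sketch of one dataset row and of the quantity each model contributes: the log-probability of a response given the prompt, obtained by summing per-token log-probabilities over the response tokens only. The names `PreferenceExample` and `sequence_logp`, the field names, and the numbers are all illustrative assumptions rather than any library's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PreferenceExample:
    """One row of a preference dataset (field names are illustrative)."""
    prompt: str
    chosen: str
    rejected: str


def sequence_logp(token_logps: Sequence[float], response_mask: Sequence[int]) -> float:
    """Sum per-token log-probs over response tokens only.

    token_logps[i] is a model's log-probability of token i given the
    tokens before it; response_mask[i] is 1 for response tokens and
    0 for prompt tokens, which should not contribute to log pi(y|x).
    """
    if len(token_logps) != len(response_mask):
        raise ValueError("token_logps and response_mask must have equal length")
    return sum(lp for lp, m in zip(token_logps, response_mask) if m)


example = PreferenceExample(
    prompt="What is the capital of France?",
    chosen="The capital of France is Paris.",
    rejected="France has no capital.",
)

# Made-up per-token values for a prompt followed by a response; in
# practice they come from a forward pass over prompt + response.
token_logps = [-1.2, -0.7, -0.3, -2.0, -0.5]
response_mask = [0, 0, 0, 1, 1]
logp = sequence_logp(token_logps, response_mask)
```

For each row, this reduction would be applied four times: the trainable policy and the frozen reference each score the chosen and the rejected response. Those four numbers are the only inputs the DPO loss needs.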
DPO vs. RLHF: when each wins
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | 3 stages, 2+ models | 1 stage, 2 model copies |
| Stability | PPO can be unstable | Standard fine-tuning dynamics |
| Online/offline | Online — model explores | Offline — fixed preference dataset |
| Best for | Verifiable reward signals (math, code) | Human preference data as primary signal |
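Key limitation: DPO is offline. It learns only from a fixed preference dataset and cannot sample new responses to discover better ones. For tasks with verifiable reward signals, such as math and code, online reinforcement learning often outperforms DPO. OpenAI, for example, describes its o1 and o3 reasoning models as trained with large-scale reinforcement learning, although it has not published the details, so the specific role of process reward models there is unconfirmed.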
DPO variants
- KTO: uses individual thumbs-up/down feedback instead of pairwise comparisons
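- ORPO: adds an odds-ratio preference penalty to the SFT loss, folding SFT and preference optimization into a single objective with no reference model
- SimPO: uses the length-normalized log-probability of each response as the implicit reward, plus a target margin, so no reference model is needed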
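To make the reference-free idea concrete, here is a hedged sketch of a SimPO-style loss for one preference pair, based on the formulation in the SimPO paper as I understand it: the implicit reward is the average per-token log-probability scaled by β, and a target margin γ is subtracted inside the sigmoid. The function name `simpo_loss` and the default values of `beta` and `gamma` are placeholders chosen for this example.

```python
import math


def simpo_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,   # placeholder value
    gamma: float = 1.0,  # placeholder target margin
) -> float:
    """SimPO-style loss for one pair, using policy log-probs only."""
    if len_chosen <= 0 or len_rejected <= 0:
        raise ValueError("response lengths must be positive")
    # Length-normalized implicit rewards: no reference model involved
    chosen_reward = beta * logp_chosen / len_chosen
    rejected_reward = beta * logp_rejected / len_rejected
    z = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(z)), computed in a numerically stable way
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Same average log-prob for both responses and no margin: loss is log(2).
tied = simpo_loss(-10.0, 5, -10.0, 5, gamma=0.0)

# The chosen response is more likely per token, so the loss drops.
better = simpo_loss(-5.0, 5, -10.0, 5, gamma=0.0)
```

Compared with the DPO loss, the π_ref terms are gone and each log-probability is divided by the response length, which avoids a second forward pass through a frozen model at the cost of losing the explicit anchor to the SFT policy.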