GenAI Systems Lab
AI Engineering

DPO: Direct Preference Optimization — Alignment Without a Reward Model

Stanford's 2023 paper recast RLHF as a single classification objective. How DPO displaced PPO in many open-source fine-tuning pipelines, and the tradeoffs vs. RLHF.

RLHF works. But RLHF is complicated. After supervised fine-tuning (SFT), it requires training a separate reward model, then using PPO to fine-tune the policy against it, with a KL penalty that keeps the policy close to the reference model (which limits reward hacking) and careful hyperparameter tuning to keep PPO stable. The result is a three-stage pipeline with multiple places to fail.

In May 2023, Rafael Rafailov and colleagues at Stanford published 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. The paper showed that the RLHF objective can be reformulated as a simple classification problem: one fine-tuning pass, no separate reward model, no PPO. DPO has since become one of the most widely used alignment methods for open-source models.

The key mathematical insight

In RLHF, you train a reward model and use it to fine-tune the policy. The KL-constrained RLHF objective has a known closed-form solution for the optimal policy. Rearranging that solution expresses the reward directly in terms of policy and reference probabilities, up to a term that depends only on the prompt and cancels when two responses to the same prompt are compared. The policy IS implicitly a reward model. You don't need to train a separate one.
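For readers who want the intermediate steps, the rearrangement can be sketched as follows. The symbols r (reward), Z(x) (a per-prompt normalizer), and the Bradley-Terry preference model follow the DPO paper's setup and are assumptions here, since this article does not spell them out.

```latex
\begin{align*}
% KL-constrained RLHF objective
&\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big] \\[4pt]
% Closed-form optimum
&\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\[4pt]
% Solve for the reward
&r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
% Bradley-Terry preference model: Z(x) cancels in the difference
&p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big(
      \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)
\end{align*}
```

Taking the negative log-likelihood of the last line over a dataset of preference pairs gives the DPO loss.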

RLHF: three stages (SFT → reward model → PPO)

DPO insight: reward = β · log[ π(y|x) / π_ref(y|x) ] + β · log Z(x)
  (Z(x) depends only on the prompt, so it cancels in the pairwise loss)

DPO loss (binary cross-entropy on preference pairs):
  L = −log σ(
    β·log[π(y_w|x)/π_ref(y_w|x)]   ← push up probability of chosen response
  − β·log[π(y_l|x)/π_ref(y_l|x)]  ← push down probability of rejected response
  )

One fine-tuning pass. No PPO. No reward model.

DPO directly trains the policy to assign higher probability to preferred responses relative to the reference policy. The preference signal is baked into a single classification loss — no reinforcement learning required.
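As a rough illustration, the loss for a single preference pair can be computed from four sequence log-probabilities. This is a minimal sketch using only the standard library: the function names are hypothetical, β = 0.1 is an assumed default rather than a value from this article, and the log-probabilities are made-up numbers standing in for what a real policy and reference model would produce.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Implicit reward beta * log[pi(y|x) / pi_ref(y|x)], given sequence log-probs."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(reward_chosen - reward_rejected)."""
    margin = (implicit_reward(policy_logp_w, ref_logp_w, beta)
              - implicit_reward(policy_logp_l, ref_logp_l, beta))
    # -log sigmoid(m) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen response and lowered the rejected one.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

The loss starts at log 2 when the policy equals the reference and falls as the chosen response's log-ratio pulls ahead of the rejected one's. In a real training loop the same expression would be applied to batched tensors so gradients can flow into the policy.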

What you need for DPO training
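
Reading the DPO loss term by term, training appears to need three things: a dataset of preference pairs (x, y_w, y_l), a frozen reference policy π_ref alongside the trainable policy π (commonly both start from the SFT model), and a value for β. The sketch below shows one way those inputs might be represented. The class and function names are hypothetical, and the per-token log-probabilities are invented placeholders for model outputs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred response."""
    prompt: str    # x
    chosen: str    # y_w
    rejected: str  # y_l


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """log pi(y|x): the sum of the per-token log-probabilities of the response."""
    return sum(token_logprobs)


pair = PreferencePair(
    prompt="Explain KL divergence in one sentence.",
    chosen="It measures how much one distribution differs from another.",
    rejected="It is a type of neural network.",
)

# Hypothetical per-token log-probs for pair.chosen under the policy and the reference.
policy_logp_w = sequence_logprob([-0.2, -0.4, -0.1])
ref_logp_w = sequence_logprob([-0.3, -0.5, -0.2])
```

The same two quantities are needed for the rejected response, which is why each training step runs both the policy and the frozen reference over both responses.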

DPO vs. RLHF: when each wins

| Aspect | RLHF (PPO) | DPO |
| --- | --- | --- |
| Complexity | 3 stages, 2+ models | 1 stage, 2 model copies |
| Stability | PPO can be unstable | Standard fine-tuning dynamics |
| Online/offline | Online: model explores | Offline: fixed preference dataset |
| Best for | Verifiable reward signals (math, code) | Human preference data as primary signal |

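Key limitation: DPO is offline, so it cannot explore to find responses better than those in its fixed dataset. For tasks with ground-truth reward signals (math, code), online RL, for example RLHF with process reward models, often outperforms DPO. OpenAI has not published the full training recipe for o1/o3, but it describes them as trained with large-scale reinforcement learning rather than an offline preference loss, which is consistent with this tradeoff.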

DPO variants

