DPO: Direct Preference Optimization — Alignment Without a Reward Model

A 2023 Stanford paper recast RLHF as a single classification objective. How DPO displaced PPO in many open-source fine-tuning pipelines, and where RLHF still wins.

RLHF works. But RLHF is complicated. Starting from a supervised fine-tuned (SFT) model, it requires training a separate reward model, then using PPO to fine-tune the policy against it, with a KL penalty that keeps the policy close to the SFT model to limit reward hacking, and careful hyperparameter tuning to keep PPO stable. That is a three-stage pipeline with multiple places to fail.

In May 2023, Rafael Rafailov and colleagues at Stanford published 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. The paper showed that the RLHF objective can be reformulated as a simple classification problem: one fine-tuning pass on top of the SFT model, no separate reward model, no PPO. DPO has since become one of the most widely used alignment methods for open-source models.

The key mathematical insight

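In RLHF, you train a reward model and use it to fine-tune the policy. The KL-constrained RLHF objective has a known closed-form solution, and rearranging it shows the reward can be expressed in terms of policy probabilities directly, up to a term that depends only on the prompt. Preferences are modeled on the difference in reward between two responses to the same prompt, so that prompt-only term cancels. The policy IS implicitly a reward model. You don't need to train a separate one.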
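The closed-form step can be written out as a short derivation. This is a sketch of the standard KL-constrained formulation; the symbols r(x, y) for the reward, Z(x) for the normalizing constant, and D for the prompt distribution are notation introduced here, not taken from the rest of this article.

```latex
\begin{align*}
&\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}} \Big[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big] \\
&\Rightarrow\quad \pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
  \qquad
  Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
&\Rightarrow\quad r(x, y)
  = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
&\Rightarrow\quad r(x, y_w) - r(x, y_l)
  = \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{align*}
```

Under the Bradley-Terry preference model, the probability that y_w is preferred over y_l is σ(r(x, y_w) − r(x, y_l)). The β·log Z(x) term appears in both rewards and cancels in the difference, which is why the loss needs only policy and reference probabilities.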

RLHF: three stages (SFT → reward model → PPO)

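DPO insight: reward = β · log[ π(y|x) / π_ref(y|x) ] + β · log Z(x)
  (Z(x) depends only on the prompt x, so it cancels when two responses to the same prompt are compared)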

DPO loss (binary cross-entropy on preference pairs):
  L = −log σ(
    β·log[π(y_w|x)/π_ref(y_w|x)]   ← push up probability of chosen response
  − β·log[π(y_l|x)/π_ref(y_l|x)]  ← push down probability of rejected response
  )

One fine-tuning pass. No PPO. No reward model.

DPO directly trains the policy to assign higher probability to preferred responses relative to the reference policy. The preference signal is baked into a single classification loss — no reinforcement learning required.
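The loss above can be sketched in a few lines of plain Python. This is a minimal illustration for a single preference pair, assuming the sequence log-probabilities under the policy and the reference model have already been computed. The function names, the example log-probability values, and the default beta of 0.1 are assumptions made for this sketch, not values from the paper or any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta=0.1 is an illustrative default, not a recommendation.
    """
    # Implicit rewards: beta * log[pi(y|x) / pi_ref(y|x)]
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary cross-entropy on the event "chosen beats rejected"
    return -log_sigmoid(chosen_reward - rejected_reward)


# At initialization the policy equals the reference, so the reward margin
# is 0 and the loss is log(2). The log-prob values here are made up.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Once the policy shifts probability toward the chosen response and away
# from the rejected one, the margin becomes positive and the loss drops.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

In a real training loop the same arithmetic would run on batched tensors and be averaged over the batch, with gradients flowing only through the policy terms because the reference model is frozen.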

What you need for DPO training

Preference dataset: (prompt, chosen_response, rejected_response) triplets — the same data you'd use for RLHF reward model training
Reference model: a frozen copy of the SFT model, used to compute π_ref in the loss
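One fine-tuning run: an ordinary gradient-descent training loop with the DPO loss in place of the usual cross-entropy loss; each batch needs a forward pass through both the policy and the frozen reference model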
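The ingredients above might look like this in code. This is a sketch under stated assumptions: the `PreferencePair` class, the `sequence_logprob` helper, the example text, and the per-token log-probability values are all hypothetical and chosen for illustration. It shows how log π(y|x) for a whole response is the sum of per-token log-probabilities over the response tokens only, computed once under the policy and once under the reference model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def sequence_logprob(token_logprobs: List[float], response_mask: List[int]) -> float:
    """Sum per-token log-probs over response tokens only (mask == 1).

    Prompt tokens are masked out because the loss conditions on the prompt
    rather than scoring it.
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have the same length")
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m == 1)


pair = PreferencePair(
    prompt="Explain KL divergence in one sentence.",
    chosen="KL divergence measures how one probability distribution differs from another.",
    rejected="It is a kind of distance.",
)

# Made-up per-token log-probs for a 2-token prompt followed by a 3-token response.
token_logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
response_mask = [0, 0, 1, 1, 1]
chosen_logp = sequence_logprob(token_logprobs, response_mask)
```

Because the reference model never changes, its log-probabilities for every pair can be computed once before training and cached, which avoids keeping two models in memory at the same time.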

DPO vs. RLHF: when each wins

Aspect	RLHF (PPO)	DPO
Complexity	3 stages, 2+ models	1 stage, 2 model copies
Stability	PPO can be unstable	Standard fine-tuning dynamics
Online/offline	Online — model explores	Offline — fixed preference dataset
Best for	Verifiable reward signals (math, code)	Human preference data as primary signal

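Key limitation: DPO is offline. It learns only from a fixed set of preference pairs and never samples new responses during training, so it can't explore to find better ones. For tasks with verifiable reward signals, such as math and code, online RL often outperforms DPO. OpenAI describes its o1/o3 reasoning models as trained with large-scale reinforcement learning; the exact recipe, including whether process reward models are involved, has not been published.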

DPO variants

KTO: uses individual thumbs-up/down feedback instead of pairwise comparisons
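ORPO: combines SFT and preference optimization into a single training objective by adding an odds-ratio penalty to the SFT loss, removing the reference model entirely
SimPO: also reference-free; uses the length-normalized (average per-token) log-probability of a response as the implicit reward and adds a target margin between chosen and rejected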
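To make the contrast with DPO concrete, here is a minimal sketch of a SimPO-style loss for one preference pair. It assumes the commonly cited form of the objective: the implicit reward is beta times the average per-token log-probability under the policy, and a margin gamma is subtracted before the sigmoid. The function names, the default values of beta and gamma, and the example numbers are placeholders for illustration, not recommended settings.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(chosen_logp: float, chosen_len: int,
               rejected_logp: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO-style loss for one pair; no reference-model terms appear.

    chosen_logp / rejected_logp are summed token log-probs under the policy;
    chosen_len / rejected_len are response lengths in tokens.
    beta and gamma defaults are arbitrary placeholders.
    """
    if chosen_len <= 0 or rejected_len <= 0:
        raise ValueError("response lengths must be positive")
    # Length-normalized implicit rewards
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # Require the chosen reward to beat the rejected one by at least gamma
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)


# Chosen: 8 tokens, average log-prob -1.5. Rejected: 5 tokens, average -3.0.
loss = simpo_loss(chosen_logp=-12.0, chosen_len=8,
                  rejected_logp=-15.0, rejected_len=5)
```

Compared with the DPO loss, the two π_ref terms are gone, so only one model is needed during training. Dividing by length keeps a longer response from being scored lower simply because it contains more tokens.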
