DPO vs RLHF vs GRPO: Which Alignment Method Should You Use?
Three dominant approaches to aligning fine-tuned models with human preferences. How DPO eliminates the reward model, why GRPO is driving reasoning model breakthroughs, and a decision tree for choosing the right method based on your data, compute, and stability requirements.
The Alignment Problem
SFT gives you a model that can follow instructions. Alignment gives you a model that follows instructions *well* — refusing harmful requests, matching human preferences on tone and format, and being honest when uncertain. The three dominant methods for alignment are RLHF, DPO, and GRPO — and choosing wrong costs weeks of training time.
RLHF: The Original Method
Reinforcement Learning from Human Feedback trains a separate reward model on human preference pairs (response A vs response B), then uses PPO to fine-tune the LLM to maximize reward. The core insight: you can't write a reward function for 'good writing', but humans can compare outputs.
RLHF pipeline: (1) SFT on demonstrations → (2) Collect preference pairs → (3) Train reward model → (4) PPO fine-tuning against reward model. Each stage requires separate infrastructure and introduces compounding variance.
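Stage 3 is commonly trained with a pairwise (Bradley-Terry) loss that pushes the reward model to score the chosen response above the rejected one. The sketch below is a scalar illustration in plain Python. It assumes the reward model has already produced one score per response; the name `pairwise_reward_loss` is hypothetical, and a real pipeline would compute this over batches in a tensor framework.

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable softplus form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the reward model ranks the pair more confidently in the right order.
loss_correct = pairwise_reward_loss(2.0, 0.0)   # chosen scored higher: small loss
loss_wrong = pairwise_reward_loss(0.0, 2.0)     # chosen scored lower: large loss
```

When both responses receive the same score, the loss equals log 2, which is the "coin flip" baseline for a pairwise classifier.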
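The problem: reward hacking. The LLM learns to produce responses that score highly on the reward model but diverge from real quality. Mitigating this requires a carefully tuned KL-divergence penalty against the SFT model and periodic reward model refreshes. PPO is also memory-hungry: training keeps the policy, a frozen reference model, the reward model, and a value model in memory at once, which is why running it at scale takes roughly 3–4x the GPU memory of inference.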
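The KL penalty is usually applied by subtracting a scaled log-probability ratio between the policy and the frozen reference model from the reward model's score. A minimal sketch, assuming per-token log-probabilities for the sampled response are already available; `kl_penalized_reward`, the `beta` value, and the sample numbers are illustrative and not taken from any particular system.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward model score minus a penalty for drifting away from the reference model."""
    # Summed per-token log ratios: a single-sample estimate of KL(policy || reference)
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# The policy assigns higher probability to these tokens than the reference does,
# so part of the reward model's score is taken back (result is approximately 0.9).
shaped = kl_penalized_reward(1.0, policy_logprobs=[-0.5, -0.2], ref_logprobs=[-1.0, -0.7])
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; a smaller one gives the policy more room to exploit the reward model.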
DPO: Eliminating the Reward Model
Direct Preference Optimization reframes alignment as a classification problem. Instead of training a reward model and then doing RL, DPO optimizes the LLM directly on (chosen, rejected) preference pairs: it raises the likelihood of the chosen response relative to the rejected one, measured against a frozen reference model (usually the SFT checkpoint).
- No separate reward model needed — reduces infrastructure complexity significantly
- No RL training loop — standard supervised training with a modified loss function
- Much more stable training: no separate reward model to exploit and no PPO instability, though the policy can still overfit the preference data
- Requires high-quality preference pairs — garbage in, garbage out more acutely than RLHF
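The modified loss function mentioned above fits in a few lines. This sketch follows the loss from the DPO paper for a single preference pair, in plain Python. It assumes the summed log-probabilities of each response under the policy and the frozen reference model have already been computed; the function name and the `beta` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed sequence log-probs."""
    # How much more (or less) likely each response is under the policy than the reference
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable softplus form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# At initialization the policy equals the reference, so the loss starts at log 2.
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the only inputs are log-probabilities from two forward passes, this trains with an ordinary supervised loop: no sampling, no reward model, and no value model.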
DPO is a common default for preference tuning in fine-tuning workflows today. Most open-source fine-tuning libraries support it out of the box, and it is significantly easier to debug than a PPO pipeline.
GRPO: Group Relative Policy Optimization
GRPO is the alignment method behind DeepSeek-R1 and the broader reasoning model wave. Instead of preference pairs, it generates multiple responses to the same prompt, scores them with a verifiable reward (correct/incorrect for math, passes tests for code), and optimizes relative to the group.
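"Optimizes relative to the group" means each response's reward is normalized against the other responses sampled for the same prompt. A minimal sketch of that advantage computation, assuming the rewards have already been scored; `group_relative_advantages` and `eps` are illustrative names, and the population standard deviation is used here although implementations differ on that detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math prompt: two correct (1.0), two incorrect (0.0).
# Correct answers get advantages near +1, incorrect ones near -1.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every answer in the group is wrong (or every one is right), all advantages are zero.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case matters in practice: a prompt the model always solves or always fails contributes no gradient, so prompt difficulty has to sit within the model's reach.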
GRPO's key insight: you don't need human preference pairs if your task has a ground-truth verifiable reward. A math answer is either right or wrong; code either passes its tests or fails them. This makes GRPO ideal for domains where correctness is checkable.
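A verifiable reward can be as simple as a function that checks the final answer. The example below is a deliberately naive sketch for numeric math answers; `math_reward` is a hypothetical helper, and it assumes the last number in the completion is the model's final answer, which real graders do not rely on.

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the completion matches the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0  # no numeric answer at all
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0

# Score a small group of sampled completions for the same prompt.
completions = ["6 * 7 = 42, so the answer is 42.", "I think it is 41.", "Not sure."]
rewards = [math_reward(c, "42") for c in completions]
```

No human labels and no learned reward model are involved: the reward comes from the check itself.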
The tradeoff: GRPO requires verifiable rewards, which means it only works cleanly for structured tasks. You can't GRPO your way to better creative writing without a proxy reward model — which reintroduces the reward hacking problem.
Decision Framework
| Scenario | Recommended Method | Reason |
|---|---|---|
| General instruction following, chat | DPO | Stable, no reward model, preference pairs available |
| Math, code, structured reasoning | GRPO | Verifiable rewards, drives inference-time scaling |
| Very large scale, frontier model training | RLHF (PPO) | Maximum flexibility, can refine reward model over time |
| Limited preference data budget | DPO | More sample-efficient than RLHF |
| Unknown task structure | DPO → evaluate → GRPO if structured | DPO as baseline, GRPO if task rewards are verifiable |
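The table can be read as a small decision function. This is one possible encoding, not a definitive rule: the order in which the checks are applied is an assumption, and `choose_alignment_method` and its parameters are hypothetical names.

```python
from typing import Optional

def choose_alignment_method(verifiable_reward: Optional[bool],
                            frontier_scale: bool = False) -> str:
    """Map the scenarios in the table to a recommended method.

    verifiable_reward: True/False if known, None if the task structure is unknown.
    """
    if verifiable_reward is None:
        # Unknown task structure: start with a DPO baseline and re-evaluate
        return "DPO -> evaluate -> GRPO if structured"
    if verifiable_reward:
        return "GRPO"  # math, code, structured reasoning
    if frontier_scale:
        return "RLHF (PPO)"  # very large scale, reward model refined over time
    return "DPO"  # general chat, or a limited preference data budget

chat_choice = choose_alignment_method(verifiable_reward=False)
code_choice = choose_alignment_method(verifiable_reward=True)
```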
- DPO paper (Rafailov et al., 2023)
- GRPO paper (Shao et al., 2024, DeepSeekMath); applied at scale in DeepSeek-R1
- Hugging Face TRL library (DPO + GRPO trainers)