DPO vs RLHF vs GRPO: Which Alignment Method Should You Use?

Three dominant approaches to aligning fine-tuned models with human preferences. How DPO eliminates the reward model, why GRPO is driving reasoning model breakthroughs, and a decision tree for choosing the right method based on your data, compute, and stability requirements.

The Alignment Problem

SFT gives you a model that can follow instructions. Alignment gives you a model that follows instructions *well* — refusing harmful requests, matching human preferences on tone and format, and being honest when uncertain. The three dominant methods for alignment are RLHF, DPO, and GRPO — and choosing wrong costs weeks of training time.

RLHF: The Original Method

Reinforcement Learning from Human Feedback trains a separate reward model on human preference pairs (response A vs response B), then uses PPO to fine-tune the LLM to maximize reward. The core insight: you can't write a reward function for 'good writing', but humans can compare outputs.

RLHF pipeline: (1) SFT on demonstrations → (2) Collect preference pairs → (3) Train reward model → (4) PPO fine-tuning against reward model. Each stage requires separate infrastructure, and errors compound: noise in the preference labels carries into the reward model, and PPO then optimizes against that imperfect signal.
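
Step (3) is commonly trained with a pairwise (Bradley-Terry style) loss: the reward model should score the chosen response above the rejected one. The sketch below shows that loss for a single pair. It assumes the scores are plain floats produced by some reward model; the function name `pairwise_reward_loss` is hypothetical, and a real pipeline would compute this over batches of tensors.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) for one preference pair.

    Hypothetical helper for illustration; the scores are assumed to be scalar
    outputs of a reward model for the same prompt.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The loss shrinks as the reward model ranks the chosen response higher.
    for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
        print(chosen, rejected, round(pairwise_reward_loss(chosen, rejected), 4))
```

When the two scores are equal the loss is log 2, and it approaches zero as the margin in favor of the chosen response grows.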

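The problem: reward hacking. The LLM learns to produce responses that score high on the reward model but diverge from real quality. Mitigating this requires a carefully tuned KL-divergence penalty against the SFT model and periodic reward model refreshes. PPO is also memory-hungry: training typically keeps the policy, a frozen reference model, the reward model, and a value model in memory together, which is why running it at scale takes roughly 3–4x the GPU memory of inference.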
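
One common way to apply the KL penalty is to subtract it from the reward model score before PPO sees it. The sketch below is a simplified, sequence-level version under stated assumptions: the per-token log-probabilities are given as plain lists, the KL term is estimated as the summed log-ratio on the sampled tokens, and `kl_penalized_reward` and `beta` are illustrative names rather than any library's API.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward model score minus a KL penalty toward the reference model.

    Hypothetical helper: the log-probabilities are those each model assigns to
    the tokens the policy actually sampled, one entry per token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Sample-based KL estimate: sum over tokens of log pi(token) - log ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # Policy identical to the reference: no penalty is applied.
    print(kl_penalized_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
    # Policy has drifted toward tokens the reference finds less likely.
    print(kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement; too small a value leaves more room for reward hacking.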

DPO: Eliminating the Reward Model

Direct Preference Optimization reframes alignment as a classification problem. Instead of training a reward model and then doing RL, DPO optimizes the LLM directly on preference pairs: for each (chosen, rejected) pair, it raises the likelihood of the chosen response relative to the rejected one, measured against a frozen reference copy of the model.

No separate reward model needed: a frozen reference model is the only extra component, which reduces infrastructure complexity significantly
No RL training loop: standard supervised training with a modified loss function
Much more stable training: no learned reward model to exploit, no PPO instability
Requires high-quality preference pairs: garbage in, garbage out more acutely than RLHF
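
The modified loss can be sketched for a single preference pair as follows. This is a minimal illustration, not a training loop: it assumes you already have the summed log-probability of each response under the policy being trained and under a frozen reference model, and the function name `dpo_loss` and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response given the
    prompt. Hypothetical helper; real implementations work on batched tensors.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # At initialization the policy equals the reference, so the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # After the policy shifts probability toward the chosen response, the loss drops.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Structurally this is a binary classification loss on the pair, which is why DPO trains with an ordinary supervised loop. `beta` plays a role similar to the KL coefficient in RLHF: it controls how far the policy may move from the reference.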

DPO is a common production choice for fine-tuning workflows today. Most open-source fine-tuning libraries support it out of the box (Hugging Face TRL ships a DPOTrainer, for example), and it's significantly easier to debug than a PPO pipeline.

GRPO: Group Relative Policy Optimization

GRPO is the RL method behind DeepSeek-R1 and the broader reasoning model wave. Instead of preference pairs, it samples a group of responses to the same prompt, scores each with a verifiable reward (correct/incorrect for math, passes tests for code), and updates the policy toward responses that score above the group average. Because the group itself serves as the baseline, GRPO drops the separate value model that PPO needs.

GRPO key insight: you don't need human preference pairs if your task has a ground-truth verifiable reward. A math answer is right or wrong. Code passes its tests or it doesn't. This makes GRPO ideal for domains where correctness is checkable.
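
The "relative to the group" step can be sketched as follows: score every sampled response with a verifiable check, then normalize each reward by the group's mean and standard deviation to get an advantage. The names `exact_match_reward` and `group_relative_advantages` are hypothetical, and using the population standard deviation with a small `eps` is an assumption of this sketch; implementations differ on those details.

```python
import statistics
from typing import List, Sequence


def exact_match_reward(answer: str, ground_truth: str) -> float:
    """Verifiable reward for illustration: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to the same math prompt, with ground truth "42".
    sampled_answers = ["42", "41", " 42 ", "7"]
    rewards = [exact_match_reward(a, "42") for a in sampled_answers]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Correct responses receive positive advantages and incorrect ones negative, so the policy update pushes probability toward the former. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason prompt difficulty matters for GRPO.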

The tradeoff: GRPO requires verifiable rewards, which means it only works cleanly for structured tasks. You can't GRPO your way to better creative writing without a proxy reward model — which reintroduces the reward hacking problem.

Decision Framework

Scenario	Recommended Method	Reason
General instruction following, chat	DPO	Stable, no reward model, preference pairs available
Math, code, structured reasoning	GRPO	Verifiable rewards, no preference labels or value model needed
Very large scale, frontier model training	RLHF (PPO)	Maximum flexibility, can refine reward model over time
Limited preference data budget	DPO	More sample-efficient than RLHF
Unknown task structure	DPO → evaluate → GRPO if structured	DPO as baseline, GRPO if task rewards are verifiable
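
The table can also be read as a small decision function. The sketch below is one possible encoding, not a definitive rule: the function name, the three boolean inputs, and the order in which the checks are applied are all assumptions made for illustration.

```python
def recommend_method(
    has_verifiable_reward: bool,
    frontier_scale: bool = False,
    task_structure_known: bool = True,
) -> str:
    """Map a scenario to an alignment method, following the table above.

    Hypothetical helper; the precedence of the checks is an assumption.
    """
    if not task_structure_known:
        # Start with DPO as a baseline, then revisit once the task is understood.
        return "DPO, then GRPO if rewards turn out to be verifiable"
    if has_verifiable_reward:
        # Math, code, structured reasoning.
        return "GRPO"
    if frontier_scale:
        # Budget for a reward model that can be refined over time.
        return "RLHF (PPO)"
    # General chat and instruction following, including limited preference data.
    return "DPO"


if __name__ == "__main__":
    print(recommend_method(has_verifiable_reward=True))
    print(recommend_method(has_verifiable_reward=False, frontier_scale=True))
    print(recommend_method(has_verifiable_reward=False))
    print(recommend_method(has_verifiable_reward=False, task_structure_known=False))
```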
