GenAI Systems Lab
AI Engineering 11 min read

DPO vs RLHF vs GRPO: Which Alignment Method Should You Use?

A comparison of the three dominant approaches to aligning fine-tuned models with human preferences: how DPO eliminates the reward model, why GRPO is driving reasoning-model breakthroughs, and a decision framework for choosing a method based on your data, compute, and stability requirements.

The Alignment Problem

Supervised fine-tuning (SFT) gives you a model that can follow instructions. Alignment gives you a model that follows instructions *well*: refusing harmful requests, matching human preferences on tone and format, and being honest when uncertain. The three dominant methods for alignment are RLHF, DPO, and GRPO, and choosing the wrong one can cost weeks of training time.

RLHF: The Original Method

Reinforcement Learning from Human Feedback (RLHF) trains a separate reward model on human preference pairs (response A vs. response B), then uses Proximal Policy Optimization (PPO) to fine-tune the LLM to maximize that reward. The core insight: you can't write a reward function for 'good writing', but humans can compare two outputs and say which is better.
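The reward model step can be illustrated with the pairwise (Bradley-Terry style) loss commonly used for this purpose. This is a minimal sketch on scalar scores, not a full training loop; the function name `pairwise_reward_loss` and the toy scores are assumptions made for illustration.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one preference pair: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    ordering wrong.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy scores (assumed values): correct ordering vs. wrong ordering.
good_ordering = pairwise_reward_loss(score_chosen=2.0, score_rejected=0.0)
bad_ordering = pairwise_reward_loss(score_chosen=0.0, score_rejected=2.0)
```

In a real pipeline the two scores would come from a neural reward model evaluated on (prompt, response) pairs, and the loss would be averaged over a batch and backpropagated.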

RLHF pipeline: (1) SFT on demonstrations → (2) Collect preference pairs → (3) Train reward model → (4) PPO fine-tuning against reward model. Each stage requires separate infrastructure and introduces compounding variance.

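The problem: reward hacking. The LLM learns to produce responses that score high under the reward model but diverge from real quality. Mitigating this requires a carefully tuned KL-divergence penalty that keeps the policy close to the SFT reference model, and often frequent reward model refreshes. PPO is also memory-hungry: a typical setup holds four models at once (the policy, a frozen reference, the reward model, and a value model), so running it at scale can require 3–4x the GPU memory of inference.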
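One common way to apply the KL penalty is to subtract a scaled KL estimate from the reward model's score before the PPO update. The sketch below works at the sequence level for simplicity; real implementations often apply the penalty per token. The function name `kl_shaped_reward` and the value `beta=0.1` are illustrative assumptions.

```python
def kl_shaped_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,
) -> float:
    """Reward model score minus a KL penalty against the reference model.

    policy_logprobs / ref_logprobs hold the log-probability each model
    assigns to every token of the sampled response.
    """
    # Single-sample KL estimate: sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# If the policy has drifted toward tokens the reference finds unlikely,
# the shaped reward is lower than the raw reward model score.
shaped = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; a smaller one allows more drift and more room for reward hacking.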

DPO: Eliminating the Reward Model

Direct Preference Optimization (DPO) reframes alignment as a classification problem. Instead of training a reward model and then doing RL, DPO optimizes the LLM directly on (chosen, rejected) pairs: it raises the likelihood of the chosen response relative to the rejected one, measured against a frozen reference model, using a simple logistic loss.
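For a single preference pair, the DPO objective can be sketched as follows. The inputs are summed log-probabilities of each full response under the trainable policy and the frozen reference model. The function name `dpo_loss` and `beta=0.1` are assumptions for illustration, and a real trainer would compute this over batches of tensors.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# At initialization the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

No reward model appears anywhere in the computation: the log-ratio against the reference acts as an implicit reward, which is why the pipeline collapses to a single supervised-style training stage.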

DPO is a common production choice for fine-tuning workflows today. It is widely supported in open-source fine-tuning libraries, and because it runs as a single supervised-style training loop, it is significantly easier to debug than PPO.

GRPO: Group Relative Policy Optimization

GRPO is the alignment method behind DeepSeek-R1 and the broader wave of reasoning models. Instead of preference pairs, it samples multiple responses to the same prompt, scores each with a verifiable reward (correct/incorrect for math, passes tests for code), and uses each response's reward relative to the rest of the group as its advantage, which removes the need for a separate value model.
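The "relative to the group" step can be sketched as a simple normalization: each response's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Responses that beat the group average get a positive advantage,
    the rest get a negative one. No learned value model is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption in this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, so prompts that are far too easy or far too hard for the current model are effectively wasted compute.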

GRPO key insight: you don't need human preference pairs if your task has a ground-truth verifiable reward. A math answer is right or wrong; code passes its tests or it doesn't. This makes GRPO ideal for domains where correctness is checkable.
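A verifiable reward can be as small as a string comparison. The sketch below assumes a hypothetical output convention in which the model ends its response with `Answer: <value>`; the marker, the function name `math_reward`, and the exact-match rule are all illustrative assumptions, not part of any specific training recipe.

```python
def math_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the completion's final answer matches the ground truth, else 0.0."""
    marker = "Answer:"  # hypothetical output convention
    if marker not in completion:
        # No parseable answer counts as incorrect.
        return 0.0
    # Take the text after the last marker so earlier scratch work is ignored.
    answer = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == expected.strip() else 0.0


reward = math_reward("2 + 2 is four. Answer: 4", expected="4")
```

Production checkers are usually more forgiving (normalizing numbers, fractions, or units), but the principle is the same: the reward comes from a deterministic check rather than from a learned model that can be gamed.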

The tradeoff: GRPO requires verifiable rewards, which means it only works cleanly for structured tasks. You can't GRPO your way to better creative writing without a proxy reward model — which reintroduces the reward hacking problem.

Decision Framework

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| General instruction following, chat | DPO | Stable, no reward model, preference pairs available |
| Math, code, structured reasoning | GRPO | Verifiable rewards, drives inference-time scaling |
| Very large scale, frontier model training | RLHF (PPO) | Maximum flexibility, can refine reward model over time |
| Limited preference data budget | DPO | More sample-efficient than RLHF |
| Unknown task structure | DPO → evaluate → GRPO if structured | DPO as baseline, GRPO if task rewards are verifiable |
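The table can be read as a small decision function. The sketch below is one possible encoding; the order in which the conditions are checked and the parameter names are assumptions, since the table itself does not rank the scenarios.

```python
def recommend_method(verifiable_reward: bool, frontier_scale: bool = False) -> str:
    """Map the scenarios in the table to a recommended alignment method."""
    if verifiable_reward:
        # Math, code, structured reasoning: correctness is checkable.
        return "GRPO"
    if frontier_scale:
        # Very large scale training where the reward model is refined over time.
        return "RLHF (PPO)"
    # General chat, limited preference data, or unknown task structure:
    # start with DPO as the baseline and re-evaluate.
    return "DPO"


choice = recommend_method(verifiable_reward=False)
```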
