AI Engineering 12 min read

RLHF in Production: What Actually Works

Reward models, PPO instability, reward hacking, and the lessons learned shipping alignment training at scale.

The InstructGPT paper made RLHF look clean: collect preferences, train a reward model, run PPO, ship. Production reality is messier: reward model collapse, KL penalty death spirals, preference data that doesn't generalise, and an RL training loop that needs roughly 3× the GPU budget of supervised fine-tuning (SFT).

The Reward Model Is Your Biggest Risk

The reward model (RM) is trained to predict which response humans prefer. The problem: it learns whatever predicts your annotators' choices, biases included, rather than abstract quality. Common biases that sneak into reward models are length bias (longer answers score higher regardless of correctness), format bias (markdown looks more thorough), and sycophancy (the RM scores agreeable responses higher than honest ones).
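As a rough illustration of what "trained to predict which response humans prefer" usually means, reward models are commonly fit with a Bradley-Terry style pairwise loss on (chosen, rejected) pairs. The sketch below assumes the RM already returns one scalar score per response. The function names `log_sigmoid` and `pairwise_rm_loss` are hypothetical and not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair.

    The loss is small when the RM scores the human-preferred response
    higher than the rejected one, and large when the order is reversed.
    """
    return -log_sigmoid(score_chosen - score_rejected)


if __name__ == "__main__":
    # The RM agrees with the annotator: low loss.
    print(pairwise_rm_loss(2.0, 0.0))
    # The RM disagrees with the annotator: high loss.
    print(pairwise_rm_loss(0.0, 2.0))
```

Note that nothing in this loss distinguishes "better" from "longer" or "more agreeable": any feature that correlates with the annotators' choices lowers it, which is how the biases above get in.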

A biased reward model produces a biased policy. The policy is only as aligned as the humans who labeled the preference data — and humans are inconsistent, time-pressured, and fallible.
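One cheap way to look for length bias is to correlate RM scores with response length on a held-out set. The sketch below is a first-pass diagnostic under simple assumptions: length is measured in whitespace-separated words, and `length_bias` is a hypothetical helper name. A high correlation is a signal to investigate, not proof of bias, since longer answers can be genuinely better; a stricter check would compare answers of equal correctness.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_bias(responses: Sequence[str], rm_scores: Sequence[float]) -> float:
    """Correlation between word count and RM score for a batch of responses."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, rm_scores)


if __name__ == "__main__":
    responses = ["yes", "yes it is", "yes it is indeed so", "a b c d e f g h"]
    scores = [0.1, 0.3, 0.5, 0.9]  # made-up scores for illustration only
    print(f"length/score correlation: {length_bias(responses, scores):.2f}")
```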

Reward Hacking

Once PPO starts optimising against your RM, it will find and exploit every weakness. Reward hacking happens when the policy finds outputs that score high but are low-quality: responses that are long but repetitive, responses that pattern-match to the RM's surface heuristics, or responses that copy the formatting annotators preferred without adding substance.
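The usual guard against this is a KL penalty that subtracts a cost for drifting away from a frozen reference policy. The sketch below shows a sequence-level version under stated assumptions: per-token log-probabilities of the sampled tokens are already available for both models, and the KL term is approximated by the summed log-ratio on those tokens. `kl_shaped_reward` and its default `beta` are hypothetical choices for illustration.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """RM score minus a KL-style penalty against the reference policy.

    policy_logprobs / ref_logprobs hold log p(token) for each sampled token
    under the trained policy and the frozen reference model respectively.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) on this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # Same RM score, but the second response has drifted far from the reference.
    print(kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
    print(kl_shaped_reward(1.0, [-0.2, -0.2], [-3.0, -3.0]))
```

If `beta` is too small the policy is free to exploit the RM; if it is too large the policy barely moves from the reference, which is why this coefficient tends to need careful tuning.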

Why Most Teams Switch to DPO

Direct Preference Optimization (DPO) eliminates the separate reward model and the RL loop entirely. It reformulates the RLHF objective as a binary classification loss applied directly to the policy, with a frozen reference model playing the role of the KL constraint. No PPO, no KL controller to tune beyond a single β coefficient, and a much smaller reward hacking surface. The trade-off: DPO is offline, so it can't improve beyond the preference data distribution. PPO can explore and find new high-reward outputs; DPO cannot.
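To make "a binary classification loss directly on the policy" concrete, here is a minimal sketch of the per-pair DPO loss. It assumes the summed log-probability of each full response has already been computed under both the policy and a frozen reference model. The function names and the default `beta` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a whole response.
    The implicit reward of a response is beta * (policy_logp - ref_logp).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) is the same quantity as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The only model outputs involved are log-probabilities of fixed responses from the dataset, which is why no sampling, critic, or separate reward model is needed, and also why the method cannot discover responses that are not in the data.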

| | PPO-RLHF | DPO |
|---|---|---|
| Reward model | Required, separate training | Not needed |
| Online exploration | Yes: can discover novel good outputs | No: offline only |
| Reward hacking risk | High without careful KL tuning | Low (no reward model to hack) |
| GPU cost | 3–4× SFT cost | ~1–1.5× SFT cost |
| Implementation complexity | High (PPO is notoriously finicky) | Low (a modified cross-entropy loss) |
| Best for | Complex tasks needing exploration; frontier-scale training | Instruction following; style alignment; most production use cases |

What GRPO Changes

Group Relative Policy Optimization (GRPO, used in DeepSeek-R1) eliminates the critic network that PPO requires. Instead of estimating value per token, GRPO samples G outputs per prompt and uses the group mean reward as the baseline. This makes it significantly cheaper than PPO and more stable than naive REINFORCE, while retaining online exploration that DPO lacks.
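The group baseline can be sketched in a few lines. This version subtracts the group mean, as described above, and also divides by the group standard deviation, which is how the GRPO formulation is usually written. The use of the population standard deviation, the `eps` constant, and the name `group_relative_advantages` are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Advantages for G sampled outputs of the same prompt.

    Each output's advantage is its reward relative to the group mean,
    scaled by the group standard deviation. No learned critic is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every output gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four samples for one prompt, scored 1.0 if correct and 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # All samples equally good: no learning signal for this prompt.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, so prompts that are too easy or too hard for the current policy are effectively skipped.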

GRPO is fast becoming the default for post-training at frontier labs. If you're setting up a new alignment training pipeline today, start with DPO for simplicity, then evaluate GRPO if you need online improvement.

Production Checklist
