RLHF in Production: What Actually Works
Reward models, PPO instability, reward hacking, and the lessons learned shipping alignment training at scale.
The InstructGPT paper made RLHF look clean: collect preferences, train a reward model, run PPO, ship. Production reality is messier: reward model collapse, KL penalty death spirals, preference data that doesn't generalise, and an RL training loop that needs roughly 3–4× the GPU budget of supervised fine-tuning (SFT).
The Reward Model Is Your Biggest Risk
The reward model (RM) is trained to predict which response humans prefer. The problem: it learns to predict your annotators' biases, not abstract quality. Common biases that sneak into reward models: length bias (longer answers score higher regardless of correctness), format bias (markdown looks more thorough), sycophancy (the RM scores agreeable responses higher than honest ones).
A biased reward model produces a biased policy. The policy is only as aligned as the humans who labeled the preference data — and humans are inconsistent, time-pressured, and fallible.
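To make the risk concrete, here is a minimal sketch of the pairwise objective a reward model is commonly trained with, plus a crude probe for length bias. It assumes a Bradley-Terry style loss; the article does not say which loss its RM uses. The function names and the tuple layout are hypothetical, chosen only for illustration.

```python
import math

def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def longer_preferred_rate(pairs) -> float:
    """Fraction of pairs in which the RM scores the longer response higher.

    pairs: iterable of (len_a, score_a, len_b, score_b). Hypothetical layout.
    Pairs with equal lengths are skipped because they say nothing about length.
    """
    longer_wins = 0
    counted = 0
    for len_a, score_a, len_b, score_b in pairs:
        if len_a == len_b:
            continue
        counted += 1
        longer_score, shorter_score = (
            (score_a, score_b) if len_a > len_b else (score_b, score_a)
        )
        if longer_score > shorter_score:
            longer_wins += 1
    return longer_wins / counted if counted else 0.0
```

A rate well above 0.5 on pairs where humans did not prefer the longer answer is one signal that the RM has picked up length bias rather than quality.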
Reward Hacking
Once PPO starts optimising against your RM, it will find and exploit every weakness. Reward hacking happens when the policy finds high-reward outputs that are low-quality: responses that are long but repetitive, responses that pattern-match to the RM's surface heuristics, or responses that use the preferred formatting of training annotators without substance.
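- KL penalty (β): the primary defence against reward hacking. Higher β keeps the policy closer to the reference model (usually the SFT checkpoint); lower β allows more optimisation against the RM. Typical range: 0.1–0.5. Too high = no improvement. Too low = the policy over-optimises the RM and collapses onto a narrow set of degenerate outputs.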
- Reward clipping: clip reward signals to [−4, 4] to prevent outlier rewards from dominating updates
- Periodic RM refresh: reward model should be retrained on outputs from the current policy, not just the SFT model — otherwise you're optimising against a distribution mismatch
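The first two defences can be sketched as a single reward-shaping step. This is a simplified, sequence-level illustration, not a full PPO implementation: it assumes the KL term is approximated by the log-probability difference between the policy and the reference model for the sampled response, and the function name and defaults are hypothetical.

```python
def shaped_reward(
    rm_score: float,
    logp_policy: float,
    logp_ref: float,
    beta: float = 0.1,
    clip: float = 4.0,
) -> float:
    """Clip the RM score, then subtract a KL penalty against the reference."""
    # Reward clipping: keep outlier RM scores from dominating the update.
    clipped = max(-clip, min(clip, rm_score))
    # Single-sample KL estimate: log pi(y|x) - log pi_ref(y|x).
    kl_estimate = logp_policy - logp_ref
    return clipped - beta * kl_estimate
```

With `beta=0.1`, a response scored 10.0 by the RM is clipped to 4.0, and a response the policy has made two nats more likely than the reference loses 0.2 reward.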
Why Most Teams Switch to DPO
Direct Preference Optimization (DPO) eliminates the separate reward model and the RL loop entirely. It reformulates the RLHF objective as a binary classification loss applied directly to the policy. No PPO, no learned reward model for the policy to exploit, and no separate KL penalty to schedule, although DPO's own β still controls how far the policy drifts from the reference model. The trade-off: DPO is offline, so it can't improve beyond the preference data distribution. PPO can explore and find new high-reward outputs; DPO cannot.
| | PPO-RLHF | DPO |
|---|---|---|
| Reward model | Required, separate training | Not needed |
| Online exploration | Yes — can discover novel good outputs | No — offline only |
| Reward hacking risk | High without careful KL tuning | Low (no reward model to hack) |
| GPU cost | 3–4× SFT cost | ~1–1.5× SFT cost |
| Implementation complexity | High (PPO is notoriously finicky) | Low (a modified cross-entropy loss) |
| Best for | Complex tasks needing exploration; frontier-scale training | Instruction following; style alignment; most production use cases |
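The "modified cross-entropy loss" in the DPO column can be written in a few lines. The sketch below computes the per-example DPO loss from summed sequence log-probabilities; the argument names are hypothetical, and a real trainer would compute these values in batches with an autodiff framework.

```python
import math

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one.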
What GRPO Changes
Group Relative Policy Optimization (GRPO, used in DeepSeek-R1) eliminates the critic network that PPO requires. Instead of estimating value per token, GRPO samples G outputs per prompt and uses the group mean reward as the baseline. This makes it significantly cheaper than PPO and more stable than naive REINFORCE, while retaining online exploration that DPO lacks.
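The group baseline is simple enough to show directly. This sketch computes group-relative advantages for the G rewards sampled for one prompt. Dividing by the group standard deviation follows the published GRPO formulation rather than anything stated in this article, and `eps` is an illustrative guard against zero variance.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sampled output relative to its own group."""
    if not rewards:
        raise ValueError("need at least one reward per group")
    mean = statistics.fmean(rewards)   # group mean replaces the learned critic
    std = statistics.pstdev(rewards)   # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Outputs that beat their group mean get a positive advantage and the rest get a negative one. If every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.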
GRPO is fast becoming the default for post-training at frontier labs. If you're setting up a new alignment training pipeline today, start with DPO for simplicity, then evaluate GRPO if you need online improvement.
Production Checklist
- Audit annotator agreement before training the RM — inter-annotator agreement below 70% predicts reward model instability
- Start with a small KL penalty (β=0.1) and increase if you observe reward hacking
- Monitor RM score distribution during PPO — if mean score climbs while output quality degrades, you have reward hacking
- Keep a regression suite of diverse prompts and run it after every checkpoint
- Consider DPO first — it handles 80% of production alignment needs at 1/3 the complexity
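Two of the checks above are easy to automate. The sketch below is a rough illustration: it measures raw pairwise agreement between two annotators and flags checkpoints where the RM score rises while an independent quality metric falls. The function names, the `window` parameter, and the assumption that you have an independent quality score per checkpoint (for example from the regression suite) are all hypothetical.

```python
def pairwise_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two annotators picked the same response."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def reward_hacking_suspected(
    rm_means: list[float], quality_scores: list[float], window: int = 3
) -> bool:
    """True if mean RM score rose while independent quality fell over `window` checkpoints."""
    if len(rm_means) < window or len(quality_scores) < window:
        return False  # not enough history to judge a trend
    rm_delta = rm_means[-1] - rm_means[-window]
    quality_delta = quality_scores[-1] - quality_scores[-window]
    return rm_delta > 0 and quality_delta < 0
```

Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's kappa may be a better fit when the label distribution is skewed.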