
RLHF in Production: What Actually Works

Reward models, PPO instability, reward hacking, and the lessons learned shipping alignment training at scale.

The InstructGPT paper made RLHF look clean: collect preferences, train a reward model, run PPO, ship. Production reality is messier: reward model collapse, KL penalty death spirals, preference data that doesn't generalise, and an RL training loop that needs roughly 3–4× the GPU budget of supervised fine-tuning (SFT).

The Reward Model Is Your Biggest Risk

The reward model (RM) is trained to predict which response humans prefer. The problem: it learns to predict your annotators' biases, not abstract quality. Common biases that sneak into reward models: length bias (longer answers score higher regardless of correctness), format bias (markdown looks more thorough), sycophancy (the RM scores agreeable responses higher than honest ones).
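
For reference, "predict which response humans prefer" is commonly implemented as a pairwise Bradley-Terry loss over scalar scores. The sketch below assumes that formulation (the article does not specify one), and the function name is illustrative. Note that the loss only ever sees which response the annotator picked, which is why annotator biases flow straight into the RM.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes the RM emits one scalar score per response (an assumption,
    not something the article states).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The loss shrinks as the RM scores the human-preferred response higher.
    print(bradley_terry_loss(2.0, 0.0))
    print(bradley_terry_loss(0.0, 2.0))
```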

A biased reward model produces a biased policy. At best, the policy is as aligned as the preference labels it was trained on, and the humans producing those labels are inconsistent, time-pressured, and fallible.
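
One cheap way to look for length bias is to correlate response length with RM score on a held-out set. This is a minimal sketch: the helper names are hypothetical, word count stands in for token count, and the scores are assumed to come from whatever RM you are auditing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses, rm_scores):
    """Correlation between word count and RM score for a batch of responses."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rm_scores)


if __name__ == "__main__":
    responses = ["short answer", "a somewhat longer answer here", "a very long answer " * 5]
    rm_scores = [0.2, 0.9, 1.7]  # placeholder scores, not real RM output
    print(length_score_correlation(responses, rm_scores))
```

A high correlation is a prompt to investigate, not proof of bias, since longer answers are sometimes genuinely better. Comparing pairs of equally correct answers that differ only in length gives a cleaner signal.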

Reward Hacking

Once PPO starts optimising against your RM, it will find and exploit every weakness. Reward hacking happens when the policy finds high-reward outputs that are low-quality: responses that are long but repetitive, responses that pattern-match to the RM's surface heuristics, or responses that use the preferred formatting of training annotators without substance.

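KL penalty (β): the primary defence against reward hacking. Higher β keeps the policy closer to the reference model (usually the SFT checkpoint); lower β allows more optimisation. The range suggested here is 0.1–0.5, but published setups vary (InstructGPT reports β = 0.02), so treat it as a starting point to tune. Too high = no improvement. Too low = the policy drifts far from the reference, and reward hacking or mode collapse follows.
Reward clipping: clip reward signals to a fixed range, for example [−4, 4], to prevent outlier rewards from dominating updates.
Periodic RM refresh: retrain the reward model on outputs sampled from the current policy, not just the SFT model. Otherwise the RM ends up scoring outputs far from the distribution it was trained on, where its predictions are least reliable.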
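
The first two defences above can be combined into a single shaped reward. The sketch below is a sequence-level simplification with illustrative names. Many PPO implementations instead apply the KL term per token and add the RM score at the final token.

```python
def shaped_reward(
    rm_score: float,
    logp_policy: float,
    logp_ref: float,
    beta: float = 0.1,
    clip: float = 4.0,
) -> float:
    """Clipped RM score minus a KL penalty against the frozen starting model.

    logp_policy / logp_ref are the summed log-probabilities of the sampled
    response under the current policy and the frozen reference (assumed
    inputs; how you obtain them depends on your training stack).
    """
    clipped = max(-clip, min(clip, rm_score))
    # Single-sample estimate of the KL divergence for this response
    kl_estimate = logp_policy - logp_ref
    return clipped - beta * kl_estimate


if __name__ == "__main__":
    # An outlier RM score of 10 is clipped to 4 before the penalty is applied.
    print(shaped_reward(rm_score=10.0, logp_policy=-5.0, logp_ref=-6.0))
```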

Why Most Teams Switch to DPO

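Direct Preference Optimization (DPO) eliminates the separate reward model and the RL loop. It reformulates the RLHF objective as a binary classification loss applied directly to the policy, with a frozen reference model standing in for the KL constraint. No PPO and no explicit reward model to exploit, although β still has to be tuned because it sets how far the policy may move from the reference. The trade-off: standard DPO is offline. It learns only from a fixed preference dataset, so unlike PPO it cannot explore and find new high-reward outputs outside that data.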
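
To make "binary classification loss directly on the policy" concrete, here is a per-example sketch of the standard DPO loss. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * log-ratio margin)."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) written as a numerically stable softplus(-logits)
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # The policy has shifted probability toward the chosen response,
    # so the loss falls below log(2), its value when policy == reference.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In a real training loop the same expression is computed on batched tensors and averaged. The scalar version is only meant to show what is being optimised.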

	PPO-RLHF	DPO
Reward model	Required, separate training	Not needed
Online exploration	Yes — can discover novel good outputs	No — offline only
Reward hacking risk	High without careful KL tuning	Lower (no explicit reward model to hack, though it can still overfit to biases in the preference data)
GPU cost	3–4× SFT cost	~1–1.5× SFT cost
Implementation complexity	High (PPO is notoriously finicky)	Low (a modified cross-entropy loss)
Best for	Complex tasks needing exploration; frontier-scale training	Instruction following; style alignment; most production use cases

What GRPO Changes

Group Relative Policy Optimization (GRPO, introduced in DeepSeekMath and used to train DeepSeek-R1) eliminates the critic network that PPO requires. Instead of learning a per-token value estimate, GRPO samples G outputs per prompt, scores each one, and uses the group mean reward as the baseline, so an output's advantage is how much better it scored than its siblings. Dropping the critic makes it significantly cheaper than PPO, the group baseline makes it more stable than naive REINFORCE, and it retains the online exploration that DPO lacks.
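
The group baseline is simple enough to sketch directly. The version below also divides by the group standard deviation, which follows the published GRPO formulation and is not something this article states. The `eps` term and the function name are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled output relative to its own group.

    rewards: scores for the G outputs sampled from a single prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every output gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four samples for one prompt, two of which were rewarded.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every output in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, so prompts that are always solved or never solved add nothing to the update.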

GRPO and its variants have been widely adopted for post-training since DeepSeek-R1, particularly for reasoning tasks where rewards can be checked automatically. If you're setting up a new alignment training pipeline today, start with DPO for simplicity, then evaluate GRPO if you need online improvement.

Production Checklist

Audit annotator agreement before training the RM. As a rule of thumb, inter-annotator agreement below about 70% is a warning sign that the labels are too noisy for a stable reward model
Start with a small KL penalty (β=0.1) and increase if you observe reward hacking
Monitor the RM score distribution during PPO. If the mean score keeps climbing while quality on held-out human or automated evaluations degrades, you have reward hacking
Keep a regression suite of diverse prompts and run it after every checkpoint
Consider DPO first: it covers most production alignment needs at roughly a third of the GPU cost of PPO-RLHF and with far less implementation complexity
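
For the first checklist item, raw pairwise agreement can be computed with a few lines. This sketch assumes each comparison was labelled by two or more annotators with a label such as "A" or "B". The data layout and function name are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_by_item):
    """Fraction of annotator pairs that chose the same label, over all items.

    labels_by_item: one inner list per comparison, holding the label each
    annotator gave for that comparison.
    """
    agree = 0
    total = 0
    for labels in labels_by_item:
        for a, b in combinations(labels, 2):
            total += 1
            agree += int(a == b)
    return agree / total if total else float("nan")


if __name__ == "__main__":
    labels = [
        ["A", "A", "B"],  # annotators split on this comparison
        ["B", "B", "B"],  # unanimous
    ]
    print(pairwise_agreement(labels))
```

Raw agreement does not correct for chance. With only two options, random labelling already agrees about half the time, so a chance-corrected statistic such as Cohen's kappa or Krippendorff's alpha is worth reporting alongside it.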
