
RLHF From Scratch: Reward Model Training, PPO Loop, KL Penalty, and Mode Collapse

Three-phase pipeline: SFT → reward model (Bradley-Terry loss) → PPO fine-tuning with KL penalty. Why the KL penalty is non-negotiable. Mode collapse: symptoms, causes, fixes. What frontier lab RE interviews actually ask you to implement.

RLHF (Reinforcement Learning from Human Feedback) is the dominant technique for aligning language models to human preferences. Understanding it at implementation depth — not just the concept — is what separates Research Engineers from MLEs in frontier lab interviews.

The Three-Phase Pipeline

Phase 1 (SFT): Fine-tune the base model on demonstration data (human-written responses). This gives you a starting policy that's instruction-following.

Phase 2 (Reward model): Train a model to predict human preference between two responses. Input: (prompt, response). Output: scalar reward.

Phase 3 (RL fine-tuning): Use PPO to update the SFT model to generate responses that maximize the reward model's score, subject to a KL penalty against the SFT policy.

Reward Model Training

The reward model (RM) is built on the Bradley-Terry preference model: given two responses A and B to the same prompt, P(A preferred over B) = σ(r(A) - r(B)), where r is the scalar reward and σ is the logistic sigmoid. You train it with a pairwise loss: minimize -log σ(r(chosen) - r(rejected)).
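To make the formula concrete, here is a minimal pure-Python sketch that evaluates the Bradley-Terry probability and the pairwise loss for a single preference pair. The helper names and the reward values are illustrative assumptions, not outputs of a real reward model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen preferred over rejected) under the Bradley-Terry model."""
    return sigmoid(r_chosen - r_rejected)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigma(r_chosen - r_rejected) for one preference pair."""
    return -math.log(preference_probability(r_chosen, r_rejected))


# Illustrative scores only (hypothetical, not from a trained RM)
well_ranked = pairwise_loss(2.0, 0.0)  # RM agrees with the annotator: small loss
mis_ranked = pairwise_loss(0.0, 2.0)   # RM disagrees with the annotator: large loss
tied = pairwise_loss(1.0, 1.0)         # RM is indifferent: loss equals log 2
```

Only the difference r(chosen) - r(rejected) enters the loss, so adding a constant to every reward leaves it unchanged. The absolute scale of RM scores is therefore arbitrary, which is one reason raw rewards are hard to compare across reward models.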

# Reward model loss
import torch
import torch.nn.functional as F

def reward_loss(chosen_rewards, rejected_rewards):
    # Bradley-Terry preference model
    # chosen_rewards, rejected_rewards: (batch_size,)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# The reward model is typically the SFT model
# with the language model head replaced by a scalar head
# Input: (prompt + response tokens)
# Output: single scalar reward

The reward model is the weakest link. Human annotators disagree. Annotation guidelines matter more than architecture. A reward model trained on noisy preferences will produce a policy that optimizes for noise.

PPO for Language Models

PPO (Proximal Policy Optimization) optimizes the policy, initialized from the SFT model, against the reward model. The core idea: take gradient steps that improve reward while staying close to the policy that generated the samples, which the clipped objective enforces.

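# Simplified PPO clipped objective for LMs
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    # logprobs:     log π_θ(a_t | s_t) under the current policy
    # old_logprobs: log π_old(a_t | s_t) under the policy that sampled
    #               the rollout (detached, no gradient)
    # advantages:   advantage estimates A_t, same shape as logprobs
    # eps:          clip range ε (0.2 is a common default, assumed here)

    # Ratio of new to old policy probability, computed in log space
    ratio = torch.exp(logprobs - old_logprobs)

    # Clipped surrogate: prevents large updates
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages

    # PPO maximizes the surrogate, so the loss is its negative
    return -torch.min(unclipped, clipped).mean()

# Full RLHF objective (maximized):
# J = E[r(prompt, response)] - β * KL(π_θ || π_sft)
# The KL penalty limits reward hacking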
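The effect of the clip is easiest to see on single numbers. The sketch below is a scalar, pure-Python version of the clipped objective above. The function name and the clip range of 0.2 are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0) already made much more likely: the gain is capped at (1 + eps) * A
capped_gain = clipped_surrogate(1.5, 1.0)

# Bad action (A < 0) already made much less likely: the objective is held at (1 - eps) * A
capped_penalty = clipped_surrogate(0.5, -1.0)

# Moves in the wrong direction are not clipped, so the full error is still penalized
uncapped_low = clipped_surrogate(0.5, 1.0)
uncapped_high = clipped_surrogate(1.5, -1.0)
```

In the two capped cases the objective no longer depends on the ratio, so its gradient with respect to the policy is zero. That is how PPO stops a single batch from pushing the policy arbitrarily far.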

The KL Penalty: Why It's Non-Negotiable

Without the KL penalty against the SFT policy, the model will reward-hack: find ways to maximize the reward model's score that don't correspond to actual quality. Common failure modes: repetitive text that confuses the RM, very long responses that score higher due to RM bias, degenerate token patterns.

KL(π_θ || π_sft) = E[log π_θ(a) - log π_sft(a)], with the expectation taken over responses a sampled from π_θ, measures how far the current policy has drifted from SFT. β controls the tradeoff: high β keeps the policy close to SFT (safe, low reward gains); low β maximizes reward (risky, prone to reward hacking). Typical β values: 0.01–0.1. Start high, anneal down carefully. Monitor both curves: if the KL divergence grows faster than the reward, suspect reward hacking.
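In practice the KL term is usually estimated from the log-probabilities of the sampled tokens and folded into the reward. The sketch below assumes one common convention: a per-token penalty of -β times the log-ratio, with the reward model score added on the final token. The function names, the β value, and the numbers are hypothetical.

```python
from typing import List


def sequence_kl_estimate(policy_logprobs: List[float], sft_logprobs: List[float]) -> float:
    """Single-sample estimate of KL(pi_theta || pi_sft) for one response.

    Both lists hold log-probabilities of the same tokens, which must have
    been sampled from the current policy pi_theta.
    """
    return sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))


def kl_shaped_rewards(
    rm_score: float,
    policy_logprobs: List[float],
    sft_logprobs: List[float],
    beta: float = 0.05,
) -> List[float]:
    """Per-token rewards: KL penalty on every token, RM score on the last one."""
    rewards = [-beta * (p - q) for p, q in zip(policy_logprobs, sft_logprobs)]
    rewards[-1] += rm_score
    return rewards


# Illustrative three-token response (made-up numbers)
policy_lp = [-1.0, -0.5, -2.0]
sft_lp = [-1.5, -0.5, -3.0]

kl = sequence_kl_estimate(policy_lp, sft_lp)
rewards = kl_shaped_rewards(rm_score=2.0, policy_logprobs=policy_lp, sft_logprobs=sft_lp, beta=0.1)
```

Summing the shaped rewards gives rm_score - β * kl, which is a one-sample version of the objective E[r] - β * KL(π_θ || π_sft). Logging `kl` next to the raw RM score per batch gives the two curves needed for the monitoring check described above.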

Mode Collapse in RLHF

Mode collapse in RLHF: the policy converges to a narrow distribution of responses that score well on the reward model but lack diversity. The model finds one or two 'winning' response patterns and exploits them.

Symptom: responses start sounding the same across different prompts.

Cause: the reward model is overfit to surface patterns, the RM feedback signal is too noisy, or the KL coefficient is too low.

Fix: increase the KL coefficient, use an ensemble of reward models, or add an entropy bonus to the PPO objective.

Diagnose: track output diversity (Self-BLEU, n-gram overlap across responses) alongside reward.
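A rough diversity signal can be computed without any external library. The sketch below uses mean pairwise Jaccard overlap of n-gram sets as a simple stand-in for Self-BLEU. It is not BLEU itself, and the whitespace tokenization, function names, and example responses are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in a response."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def mean_pairwise_overlap(responses: List[str], n: int = 2) -> float:
    """Average Jaccard overlap of n-gram sets over all response pairs.

    0.0 means no shared n-grams, 1.0 means identical n-gram sets.
    """
    scores = []
    for a, b in combinations(responses, 2):
        grams_a, grams_b = ngrams(a, n), ngrams(b, n)
        union = grams_a | grams_b
        if union:  # skip pairs where both responses are shorter than n tokens
            scores.append(len(grams_a & grams_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


# Responses to three different prompts (hypothetical)
collapsed = [
    "I am happy to help with that",
    "I am happy to help with this",
    "I am happy to help with it",
]
diverse = [
    "Paris is the capital of France",
    "Use a hash map for lookups",
    "The integral diverges at zero",
]

collapsed_score = mean_pairwise_overlap(collapsed)
diverse_score = mean_pairwise_overlap(diverse)
```

Computed on a fixed set of evaluation prompts at each checkpoint, a score that climbs while reward also climbs is a warning sign worth checking against human ratings.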

What Frontier Lab Interviews Test

Research Engineer interviews at Anthropic, DeepMind, and Cohere for RLHF roles will ask you to implement the reward loss from scratch, explain the KL penalty derivation, and debug a training run where the reward goes up but the policy degrades. They expect you to name the failure modes without being prompted.

Common interview question: 'Your reward goes up monotonically but human raters say the model is getting worse. What do you investigate?' Expected answer covers: reward hacking via RM exploitation, mode collapse, annotation distribution shift, KL penalty misconfiguration.
