Reward Modeling from Logs: Turning Traffic into Training Signal
How to build a reward model from implicit feedback logs using Bradley-Terry pairwise preference, DPO from preference pairs, and the contamination traps that silently corrupt your reward signal.
You have a stream of implicit feedback from production. Now what? The goal is to convert behavioral signals into training signal: either a reward model that scores responses, or direct preference pairs for DPO. Both require transforming noisy behavioral events into structured (chosen, rejected) pairs — and both have specific contamination traps that corrupt the signal if you are not careful.
From Logs to Preferences
The fundamental unit of reward modeling from logs is a preference pair: two responses to the same query, one implicitly preferred over the other. The standard construction: for a given query, the response that received a positive implicit signal (clicked, copied, led to task completion) is chosen; a response shown but not selected — or followed by a reformulation — is rejected.
The rejected response must have been shown. You cannot sample random model outputs as negative examples — the implicit negative signal only carries information if the user actually saw the alternative and passed over it. This is the exposure constraint, and it is the first thing that breaks naive implementations.
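The pair construction and the exposure constraint can be sketched as follows. This is a minimal illustration, not a reference implementation: the `LogEvent` schema, the `shown` flag, and the signal labels in `POSITIVE` and `NEGATIVE` are hypothetical stand-ins for whatever your logging pipeline actually records.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical signal labels; a real logging schema will differ.
POSITIVE = {"click", "copy", "task_complete"}
NEGATIVE = {"skip", "reformulate"}


@dataclass(frozen=True)
class LogEvent:
    query_id: str
    response: str
    shown: bool   # was the response actually rendered to the user?
    signal: str   # one of the labels above, or "none"


def build_pairs(events):
    """Return (query_id, chosen, rejected) tuples that satisfy the exposure constraint."""
    by_query = defaultdict(list)
    for e in events:
        if e.shown:  # exposure constraint: drop anything the user never saw
            by_query[e.query_id].append(e)

    pairs = []
    for qid, evs in by_query.items():
        chosen = [e.response for e in evs if e.signal in POSITIVE]
        rejected = [e.response for e in evs if e.signal in NEGATIVE]
        for c in chosen:
            for r in rejected:
                if c != r:
                    pairs.append((qid, c, r))
    return pairs


events = [
    LogEvent("q1", "answer A", True, "copy"),
    LogEvent("q1", "answer B", True, "skip"),
    LogEvent("q1", "answer C", False, "skip"),  # never shown: cannot be a negative
    LogEvent("q2", "answer D", True, "click"),  # no shown alternative: no pair
]
print(build_pairs(events))  # [('q1', 'answer A', 'answer B')]
```

Only `q1` yields a pair: `answer C` is excluded because it was never shown, and `q2` has a positive signal but no exposed alternative to compare against.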
Bradley-Terry Model
The Bradley-Terry model is the theoretical foundation for pairwise preference learning. Given two responses A and B to the same query, it models the probability that A is preferred over B as a function of their latent reward scores. The training objective maximizes the log-likelihood of observed preference pairs:
# Bradley-Terry preference probability
import torch
import torch.nn.functional as F

def bt_preference_prob(r_A, r_B):
    # P(A preferred over B) under Bradley-Terry model
    return torch.sigmoid(r_A - r_B)

def bt_loss(r_chosen, r_rejected):
    # Negative log-likelihood; the gradient pushes r_chosen above r_rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Architecture: policy model backbone + scalar regression head
# Training: freeze backbone, train head first; unfreeze for end-to-end fine-tune
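To get a feel for the objective, the sketch below evaluates the same probability and per-pair loss on plain scalar rewards, using only the standard library. The helper names `bt_prob` and `bt_nll` are illustrative and not part of any library.

```python
import math


def bt_prob(r_a: float, r_b: float) -> float:
    """P(A preferred over B) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of a single observed preference."""
    return -math.log(bt_prob(r_chosen, r_rejected))


# Loss as a function of the reward margin (chosen minus rejected).
for margin in (-1.0, 0.0, 1.0, 3.0):
    p = bt_prob(margin, 0.0)
    loss = bt_nll(margin, 0.0)
    print(f"margin={margin:+.1f}  P(chosen)={p:.3f}  loss={loss:.3f}")

# Only the difference matters: shifting both rewards leaves the probability unchanged.
print(abs(bt_prob(5.0, 4.0) - bt_prob(1.0, 0.0)) < 1e-12)  # True
```

A margin of 0 gives a loss of log 2 (about 0.693), and a margin of 1 gives about 0.313. Because only reward differences enter the model, learned rewards are identified only up to an additive constant, so absolute scores should not be compared across separately trained reward models.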
DPO from Implicit Pairs
Direct Preference Optimization bypasses the explicit reward model by directly optimizing the policy from preference pairs. The DPO loss for implicit pairs is identical to explicit pairs — the difference is only in how you construct (chosen, rejected): chosen = response user clicked/copied/used; rejected = response shown but skipped or followed by reformulation.
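import torch.nn.functional as F

def dpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    DPO loss from implicit preference pairs.
    Inputs are per-sequence log-probabilities under the policy and the
    frozen reference model. beta scales the implicit reward: larger values
    penalize drift from the reference policy more strongly.
    """
    # Implicit rewards: scaled log-ratio of policy to reference
    chosen_rewards = beta * (chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()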
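As a worked example, the scalar sketch below computes the same loss for a single pair with plain floats. The log-probability values are made up for illustration, and `dpo_pair_loss` is a hypothetical helper, not a library function.

```python
import math


def dpo_pair_loss(chosen_logp, rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Scalar DPO loss for one pair; inputs are sequence log-probabilities."""
    margin = beta * ((chosen_logp - ref_chosen_logp)
                     - (rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
start = dpo_pair_loss(-42.0, -40.0, -42.0, -40.0)

# Policy moved 5 nats toward the chosen response and 5 nats away from the rejected one.
moved = dpo_pair_loss(-37.0, -45.0, -42.0, -40.0)

# Same policy shift with a larger beta: the implicit reward margin is 5x larger.
moved_high_beta = dpo_pair_loss(-37.0, -45.0, -42.0, -40.0, beta=0.5)

print(f"{start:.4f} {moved:.4f} {moved_high_beta:.4f}")
```

The first value is a useful sanity check: when training starts from the reference policy, the initial loss should be close to log 2 (about 0.693) whatever the data looks like. A very different starting loss usually points at mismatched log-probabilities between the policy and the reference.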
Contamination Traps
Three specific ways implicit reward signal gets corrupted, and how to detect each:
- Exposure bias: the reward model is trained only on pairs where the rejected response was shown, but the policy you are training will produce a different distribution of responses at inference. The reward model is uncalibrated on responses that were never shown. Fix: add a limited share of randomly sampled model outputs as negatives alongside the behaviorally rejected responses to expand the support of your training distribution. These samples do not satisfy the exposure constraint, so tag them separately from behavioral negatives and keep their share small.
- Label noise from ambiguous signals: a user who reformulates a query might be clarifying rather than reacting to a bad answer. A user who abandons a session might have gotten what they needed. Mislabeled negatives add noise that degrades reward model calibration. Fix: apply a confidence filter — only use pairs where the negative signal is unambiguous (explicit skip in a ranked interface, immediate reformulation with identical semantic intent). Discard ambiguous signals rather than guessing.
- Temporal leakage: if your training set includes behavioral data from after your test set queries, the reward model has implicitly seen test-time user behavior. This inflates eval metrics. Fix: strict temporal train/test splits. All behavioral data used for training must predate the test set queries by at least one full deployment cycle.
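The confidence filter and the temporal split from the list above can be sketched as follows. The field names, the signal labels in `UNAMBIGUOUS_NEGATIVES`, and the 30-day gap standing in for "one full deployment cycle" are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical labels for negative signals treated as unambiguous.
UNAMBIGUOUS_NEGATIVES = {"explicit_skip", "immediate_reformulation"}


def confident_pairs(pairs):
    """Keep only pairs whose negative signal is in the unambiguous set."""
    return [p for p in pairs if p["negative_signal"] in UNAMBIGUOUS_NEGATIVES]


def temporal_split(pairs, test_start, gap):
    """Train on pairs logged before test_start - gap; test on pairs at or after test_start.

    Pairs that fall inside the gap are dropped from both sets.
    """
    train_cutoff = test_start - gap
    train = [p for p in pairs if p["logged_at"] < train_cutoff]
    test = [p for p in pairs if p["logged_at"] >= test_start]
    return train, test


pairs = [
    {"id": 1, "logged_at": datetime(2024, 1, 5), "negative_signal": "explicit_skip"},
    {"id": 2, "logged_at": datetime(2024, 1, 20), "negative_signal": "session_abandon"},
    {"id": 3, "logged_at": datetime(2024, 2, 10), "negative_signal": "immediate_reformulation"},
    {"id": 4, "logged_at": datetime(2024, 3, 2), "negative_signal": "explicit_skip"},
]

kept = confident_pairs(pairs)  # drops id 2 (ambiguous abandonment)
train, test = temporal_split(kept, test_start=datetime(2024, 3, 1), gap=timedelta(days=30))
print([p["id"] for p in train], [p["id"] for p in test])  # [1] [4]
```

Pair 3 is discarded even though its signal is unambiguous, because it falls inside the gap before the test window. Dropping data this way costs volume, which is the intended trade for a clean evaluation.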
Quality Gates Before Training
- Deduplication: remove exact-match and near-duplicate query pairs before training. Duplicates inflate confidence in certain patterns and cause overfitting to common queries.
- Length normalization check: if chosen responses are systematically longer than rejected, the reward model will learn length as a proxy for quality. Measure and correct for length correlation in your pair construction.
- Calibration audit on held-out explicit labels: collect a small set of explicit human preferences (100-500 pairs) and measure reward model agreement. If agreement is below 70%, the implicit signal is too noisy to train on.
- Reward hacking detection: after training, generate responses that maximize predicted reward without satisfying the underlying task. Any pattern that scores high but is low-quality indicates the reward model has learned a spurious correlation.
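Three of these gates are simple enough to sketch with the standard library: exact-match deduplication after light normalization, a length-bias measurement, and the agreement audit. The dictionary fields and example data are hypothetical, near-duplicate detection would need a similarity measure not shown here, and the 0.70 cutoff is the threshold stated above.

```python
from statistics import mean


def dedupe(pairs):
    """Drop pairs that repeat the same normalized query, chosen, and rejected text."""
    seen, unique = set(), []
    for p in pairs:
        key = (" ".join(p["query"].lower().split()), p["chosen"], p["rejected"])
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def length_bias(pairs):
    """Return (fraction of pairs where chosen is longer, mean length gap in characters)."""
    gaps = [len(p["chosen"]) - len(p["rejected"]) for p in pairs]
    return sum(g > 0 for g in gaps) / len(gaps), mean(gaps)


def agreement_rate(scored_pairs):
    """Fraction of explicitly labeled pairs where the reward model ranks
    the human-preferred response above the other one."""
    return sum(r_pref > r_other for r_pref, r_other in scored_pairs) / len(scored_pairs)


pairs = [
    {"query": "Reset my password", "chosen": "Go to Settings, then Security, then Reset.", "rejected": "Try again."},
    {"query": "reset  my password ", "chosen": "Go to Settings, then Security, then Reset.", "rejected": "Try again."},
    {"query": "Export data", "chosen": "Use Export.", "rejected": "Open the Data tab and choose Export to CSV."},
]
unique = dedupe(pairs)
frac_longer, mean_gap = length_bias(unique)

# (reward for human-preferred response, reward for the other response)
audit = [(2.1, 0.3), (0.4, 1.0), (1.5, 1.2), (0.9, -0.2)]
agreement = agreement_rate(audit)
print(len(unique), frac_longer, agreement, agreement >= 0.70)
```

A `frac_longer` value far from 0.5 suggests that length is confounded with the preference label in your pairs. In practice the audit set would be the 100-500 explicitly labeled pairs described above, not four hand-written tuples.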
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)
- Learning to Summarize from Human Feedback (Stiennon et al.)
- Reward Model Ensembles Help Mitigate Overoptimization (Eisenstein et al.)