GenAI Systems Lab
AI Engineering 11 min read

Reward Modeling from Logs: Turning Traffic into Training Signal

How to build a reward model from implicit feedback logs with Bradley-Terry pairwise preferences, how to run DPO on preference pairs, and how to avoid the contamination traps that silently corrupt your reward signal.

You have a stream of implicit feedback from production. Now what? The goal is to turn those behavioral events into training signal: either a reward model that scores responses, or preference pairs fed directly to DPO. Both routes require transforming noisy events into structured (chosen, rejected) pairs, and both are exposed to specific contamination traps that corrupt the signal if you are not careful.

From Logs to Preferences

The fundamental unit of reward modeling from logs is a preference pair: two responses to the same query, one implicitly preferred over the other. The standard construction: for a given query, the response that received a positive implicit signal (clicked, copied, led to task completion) is chosen; a response shown but not selected — or followed by a reformulation — is rejected.
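The construction above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the log schema (`query_id`, `response`, `signal`) and the signal names are hypothetical, and every event is assumed to describe a response that was actually shown to the user.

```python
from collections import defaultdict
from itertools import product

# Hypothetical signal vocabularies; adjust to whatever your logs actually record.
POSITIVE_SIGNALS = {"click", "copy", "task_complete"}
NEGATIVE_SIGNALS = {"skipped", "reformulated"}

def build_preference_pairs(events):
    """Group logged events by query and emit (chosen, rejected) pairs.

    Each event is a dict with the assumed keys 'query_id', 'response', 'signal'.
    """
    positives = defaultdict(list)
    negatives = defaultdict(list)
    for e in events:
        if e["signal"] in POSITIVE_SIGNALS:
            positives[e["query_id"]].append(e["response"])
        elif e["signal"] in NEGATIVE_SIGNALS:
            negatives[e["query_id"]].append(e["response"])

    pairs = []
    for qid, chosen_list in positives.items():
        # A pair needs both a positive and a negative for the same query.
        for chosen, rejected in product(chosen_list, negatives.get(qid, [])):
            if chosen != rejected:  # the same text cannot be both chosen and rejected
                pairs.append({"query_id": qid, "chosen": chosen, "rejected": rejected})
    return pairs

events = [
    {"query_id": "q1", "response": "A", "signal": "copy"},
    {"query_id": "q1", "response": "B", "signal": "skipped"},
    {"query_id": "q2", "response": "C", "signal": "click"},  # no negative for q2, so no pair
]
pairs = build_preference_pairs(events)
```

Queries with only a positive or only a negative signal produce no pair, which is intentional: a preference needs two responses to the same query.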

The rejected response must have been shown. You cannot sample random model outputs as negative examples — the implicit negative signal only carries information if the user actually saw the alternative and passed over it. This is the exposure constraint, and it is the first thing that breaks naive implementations.
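One way to enforce the exposure constraint is to filter candidate negatives on impression data before pairing. The sketch below assumes hypothetical impression fields (`rendered`, `visible_ms`) and a hypothetical visibility threshold; the threshold in particular is an added assumption about what counts as "actually saw", not something the logs give you for free.

```python
MIN_VISIBLE_MS = 500  # hypothetical threshold; tune against your own UI

def was_exposed(impression):
    """True only if the response was rendered and plausibly seen by the user."""
    rendered = bool(impression.get("rendered"))
    return rendered and impression.get("visible_ms", 0) >= MIN_VISIBLE_MS

def filter_exposed_negatives(candidates):
    """Keep only rejected candidates that satisfy the exposure constraint."""
    return [c for c in candidates if was_exposed(c)]

candidates = [
    {"response": "B", "rendered": True,  "visible_ms": 2300},  # shown and passed over
    {"response": "C", "rendered": True,  "visible_ms": 40},    # on screen too briefly to count
    {"response": "D", "rendered": False, "visible_ms": 0},     # sampled offline, never shown
]
kept = filter_exposed_negatives(candidates)
```

Only response "B" survives here: "D" is exactly the randomly sampled negative the constraint rules out.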

Bradley-Terry Model

The Bradley-Terry model is the theoretical foundation for pairwise preference learning. Given two responses A and B to the same query, it models the probability that A is preferred over B as the sigmoid of the difference between their latent reward scores: P(A preferred over B) = sigmoid(r_A - r_B). Training maximizes the log-likelihood of the observed preference pairs, which is equivalent to minimizing the negative log-likelihood loss below:

# Bradley-Terry preference probability
import torch
import torch.nn.functional as F

def bt_preference_prob(r_A, r_B):
    # P(A preferred over B) under Bradley-Terry model
    return torch.sigmoid(r_A - r_B)

def bt_loss(r_chosen, r_rejected):
    # Negative log-likelihood — gradient pushes r_chosen > r_rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Architecture: policy model backbone + scalar regression head
# Training: freeze backbone, train head first; unfreeze for end-to-end fine-tune
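The two comment lines above describe an architecture and a staged training schedule; a minimal sketch of both follows. The backbone is a toy stand-in (an embedding layer with mean pooling) so that the example runs on its own, and `RewardModel` and `set_backbone_trainable` are hypothetical names. In practice the backbone would be initialized from the policy model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Backbone plus scalar head. The backbone is a toy placeholder, not a real policy model."""
    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.backbone = nn.Embedding(vocab_size, hidden)  # stand-in for the policy backbone
        self.head = nn.Linear(hidden, 1)                  # scalar regression head

    def forward(self, token_ids):
        h = self.backbone(token_ids).mean(dim=1)  # mean-pool over sequence positions
        return self.head(h).squeeze(-1)           # one scalar reward per sequence

def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze every backbone parameter."""
    for p in model.backbone.parameters():
        p.requires_grad = trainable

model = RewardModel()
set_backbone_trainable(model, False)        # stage 1: train the head only

chosen = torch.randint(0, 100, (4, 8))      # 4 sequences of 8 random token ids
rejected = torch.randint(0, 100, (4, 8))
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()  # Bradley-Terry loss
loss.backward()                             # gradients reach the head, not the backbone

# Stage 2 would call set_backbone_trainable(model, True) and continue training end to end.
```

Training the head first keeps the randomly initialized head from sending large, noisy gradients into pretrained backbone weights during the earliest steps.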

DPO from Implicit Pairs

Direct Preference Optimization bypasses the explicit reward model by optimizing the policy directly from preference pairs. The DPO loss for implicit pairs is identical to the loss for explicit pairs; the only difference is how you construct (chosen, rejected). Here, chosen is the response the user clicked, copied, or used, and rejected is a response that was shown but skipped or followed by a reformulation.

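import torch.nn.functional as F

def dpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    DPO loss from implicit preference pairs.
    Inputs are per-sequence log-probabilities (summed over tokens) under the
    policy and under the frozen reference model.
    beta scales the implicit reward; a larger beta penalizes deviation from
    the reference policy more strongly.
    """
    # Implicit rewards: scaled log-ratio of policy to reference
    chosen_rewards  = beta * (chosen_logps  - ref_chosen_logps)
    rejected_rewards = beta * (rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()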
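To see how the loss behaves, it helps to evaluate it for a single pair by hand. The sketch below re-implements the same formula with only the standard library; `dpo_pair_loss` is a hypothetical helper and the log-probability values are made up for illustration.

```python
import math

def dpo_pair_loss(chosen_logp, rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Scalar DPO loss for one pair, using only the math module."""
    margin = beta * ((chosen_logp - ref_chosen_logp)
                     - (rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raised the chosen log-prob and lowered the rejected one relative to the reference.
improved = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)

# Policy drifted the wrong way: chosen became less likely, rejected more likely.
worse = dpo_pair_loss(-14.0, -13.0, -12.0, -15.0)
```

The loss starts at log(2) when the policy equals the reference, falls as the policy shifts probability toward the chosen response, and rises when it shifts toward the rejected one. Only the change relative to the reference matters, not the absolute log-probabilities.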

Contamination Traps

Three specific ways implicit reward signal gets corrupted, and how to detect each:

Quality Gates Before Training
