Counterfactual Offline Evaluation: IPS and Doubly Robust Estimators Explained

The logging policy bias problem: why your offline eval systematically undervalues items your old model buried. Inverse Propensity Scoring and the Doubly Robust estimator from scratch. When to log propensities at serving time and why that matters.

The Logging Policy Problem

Your production ranking model has a bias problem that's invisible in offline evaluation. Every click you've logged was on an item your previous model chose to show. Items your model ranked low were shown to fewer users and clicked on less — not because they're irrelevant, but because they weren't exposed. Your offline eval data is not a random sample; it's a biased sample produced by your logging policy.

Naive offline evaluation on logged data systematically undervalues items that the old model buried. A new model that ranks those items higher can look worse offline yet perform better in an online A/B test. This is the counterfactual evaluation problem.
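
A small simulation makes the bias concrete. The sketch below assumes two hypothetical items with identical true click-through rates and a logging policy that shows the first item 90% of the time; the function name and all numbers are illustrative, not taken from a real system.

```python
import numpy as np

def simulate_exposure_bias(n_impressions: int = 100_000, seed: int = 0) -> dict:
    """Two items with the SAME true CTR, exposed unequally by the logging policy."""
    rng = np.random.default_rng(seed)
    true_ctr = np.array([0.10, 0.10])     # assumed: both items are equally relevant
    show_prob = np.array([0.90, 0.10])    # assumed: the logging policy favors item 0

    shown = rng.choice(2, size=n_impressions, p=show_prob)
    clicks = rng.random(n_impressions) < true_ctr[shown]

    # Naive view: each item's share of all logged clicks
    click_counts = np.bincount(shown[clicks], minlength=2)
    naive_click_share = click_counts / click_counts.sum()

    # Exposure-corrected view: weight each click by 1 / logging propensity
    ips_ctr = np.bincount(
        shown, weights=clicks / show_prob[shown], minlength=2
    ) / n_impressions

    return {"naive_click_share": naive_click_share, "ips_ctr": ips_ctr}

result = simulate_exposure_bias()
print(result)
```

In the raw logs, item 0 collects most of the clicks purely because it was shown more often. After dividing by the exposure probability, both items recover roughly the same click-through rate.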

Inverse Propensity Scoring (IPS) for Offline Evaluation

If we know the probability that each item was shown by the logging policy, we can weight each observation by the ratio of the new policy's probability to the logging policy's probability. The weighted average estimates the reward the new policy would have earned.
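
In symbols, the textbook form of the estimator over n logged interactions can be written as follows. The notation is introduced here for illustration: x_i is the context, a_i the item shown, r_i the observed reward, \pi_0 the logging policy, and \pi the new policy.

```latex
\hat{V}_{\mathrm{IPS}}(\pi)
  = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i

% Why the reweighting works (for a fixed context x):
\mathbb{E}_{a \sim \pi_0}\!\left[ \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \, r(x, a) \right]
  = \sum_{a} \pi_0(a \mid x) \, \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \, r(x, a)
  = \mathbb{E}_{a \sim \pi}\!\left[ r(x, a) \right]
```

The identity only holds when \pi_0(a \mid x) > 0 for every item that \pi can show. Items the logging policy never exposed cannot be recovered by reweighting.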

import numpy as np

def ips_offline_eval(
    new_policy_scores:  np.ndarray,   # new model's score for each item
    logging_propensity: np.ndarray,   # P(shown | logging_policy) for each item
    reward:             np.ndarray,   # observed reward (1=click, 0=no click)
    clip_threshold:     float = 10.0  # clip high weights to reduce variance
) -> float:
    """
    IPS estimator for expected reward under new_policy.
    Returns estimated reward if we had deployed new_policy.
    """
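    # Importance weights: new_policy / logging_policy.
    # Scores alone are not propensities, so convert them to a distribution
    # with a softmax. This treats every entry as a candidate in one shared
    # candidate set, which is a simplifying assumption.
    shifted = new_policy_scores - new_policy_scores.max()   # avoid overflow in exp
    new_probs = np.exp(shifted) / np.exp(shifted).sum()

    weights = new_probs / (logging_propensity + 1e-8)
    weights = np.clip(weights, 0, clip_threshold)   # cap weights to reduce variance

    # new_probs already sums to 1, so aggregate with a sum; a mean would
    # shrink the estimate by a further factor of 1/n.
    ips_estimate = (weights * reward).sum()
    return float(ips_estimate)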

# Doubly Robust (DR) estimator: usually lower variance than pure IPS when the reward model is reasonable
def doubly_robust_eval(
    new_policy_scores:  np.ndarray,
    logging_propensity: np.ndarray,
    reward:             np.ndarray,
    reward_model:       callable,     # trained on logged data: f(item) → predicted_reward
    clip_threshold:     float = 10.0
) -> float:
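    """
    DR estimator: uses the reward model as a baseline and corrects it with IPS.
    Unbiased if EITHER the propensities OR the reward model is correct
    (before clipping; clipping the weights adds some bias).
    """
    shifted = new_policy_scores - new_policy_scores.max()   # avoid overflow in exp
    new_probs = np.exp(shifted) / np.exp(shifted).sum()
    weights = np.clip(new_probs / (logging_propensity + 1e-8), 0, clip_threshold)

    # The reward model is called with item indices 0..n-1
    predicted_rewards = reward_model(np.arange(len(reward)))
    # Both terms are sums so they share a scale (new_probs sums to 1)
    direct_term = (predicted_rewards * new_probs).sum()
    correction_term = (weights * (reward - predicted_rewards)).sum()
    return float(direct_term + correction_term)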
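
To see how these estimators behave when the ground truth is known, the sketch below runs a context-free simulation in the per-interaction form (one logged item per impression). It is an illustration under stated assumptions rather than a drop-in replacement for the functions above: `compare_estimators`, the click rates, and both policies are hypothetical, and the reward model is deliberately crude.

```python
import numpy as np

def compare_estimators(n: int = 50_000, seed: int = 0) -> dict:
    """Context-free bandit with known ground truth (all numbers are illustrative)."""
    rng = np.random.default_rng(seed)
    true_ctr = np.array([0.05, 0.10, 0.15, 0.20, 0.25])       # assumed click rate per item
    logging_probs = np.array([0.60, 0.20, 0.10, 0.05, 0.05])  # old policy buries the best items
    target_probs = np.array([0.05, 0.05, 0.10, 0.20, 0.60])   # new policy promotes them

    actions = rng.choice(5, size=n, p=logging_probs)           # items the old policy showed
    rewards = (rng.random(n) < true_ctr[actions]).astype(float)

    # Deliberately crude reward model: the global click rate for every item
    predicted = np.full(5, rewards.mean())

    weights = target_probs[actions] / logging_probs[actions]
    naive = rewards.mean()                                     # ignores the policy change
    ips = (weights * rewards).mean()
    direct = (target_probs * predicted).sum()                  # reward-model baseline
    dr = direct + (weights * (rewards - predicted[actions])).mean()

    return {
        "true_value": float(target_probs @ true_ctr),
        "naive": float(naive),
        "ips": float(ips),
        "dr": float(dr),
    }

estimates = compare_estimators()
print(estimates)
```

The naive average reflects the old policy's value and lands far from the new policy's true value. IPS and DR both land close to it, even though the reward model here knows nothing about individual items.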

Logging Propensity Estimation

To compute IPS weights, you need P(item shown | logging policy). For a simple ranker, this is often approximated by the examination propensity at each rank, for example a power-law decay in rank. For complex policies with real-time filtering, rank alone does not determine exposure, so you need to log the propensity at serving time.

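# Log propensity at serving time: critical for offline eval
import json
import logging
import time

# Placeholder log sink (an assumption); swap in your own logging pipeline
def write_to_log(line: str) -> None:
    logging.getLogger("serving").info(line)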

def serve_and_log(query: str, user_id: str, production_model) -> dict:
    ranked_items = production_model.rank(query)
    
    log_entry = {
        "timestamp":  time.time(),
        "query":      query,
        "user_id":    user_id,
        "served": [
            {
                "item_id":    item.id,
                "rank":       rank,
                "score":      item.score,
                "propensity": 1.0 / (rank ** 0.6)   # or your propensity model
            }
            for rank, item in enumerate(ranked_items[:20], start=1)
        ]
    }
    write_to_log(json.dumps(log_entry))
    return {"items": [i["item_id"] for i in log_entry["served"][:10]]}
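
Once propensities are in the log, the offline job only has to read them back. The sketch below assumes the JSON layout produced by the serving code above and a hypothetical `clicked_item_ids` set joined in from a separate click stream; `log_line_to_arrays` is an illustrative helper, not part of any library.

```python
import json
import numpy as np

def log_line_to_arrays(log_line: str, clicked_item_ids: set) -> tuple:
    """Turn one logged impression into (propensity, reward) arrays for offline eval."""
    entry = json.loads(log_line)
    served = entry["served"]
    propensity = np.array([s["propensity"] for s in served], dtype=float)
    # Reward is 1.0 for items that were clicked, 0.0 otherwise
    reward = np.array(
        [1.0 if s["item_id"] in clicked_item_ids else 0.0 for s in served]
    )
    return propensity, reward

# Hand-built example line with the same fields as the serving log
example_line = json.dumps({
    "timestamp": 0.0,
    "query": "running shoes",
    "user_id": "u1",
    "served": [
        {"item_id": f"item_{rank}", "rank": rank, "score": 1.0 / rank,
         "propensity": 1.0 / (rank ** 0.6)}
        for rank in range(1, 4)
    ],
})

propensity, reward = log_line_to_arrays(example_line, clicked_item_ids={"item_2"})
print(propensity, reward)
```

Reading the stored value keeps the offline estimate tied to what the serving stack actually did, instead of a propensity recomputed later from rank.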

When to Use Counterfactual Evaluation

Counterfactual evaluation is most valuable when: (a) your A/B test budget is limited and you need to filter candidates before live testing; (b) your logging data is heavily biased by the previous model; (c) you're evaluating models that significantly change the ranking (not just marginal improvements). For marginal improvements to a stable ranking, standard offline metrics are usually sufficient.

Prefer the DR estimator over pure IPS when you can train a reasonable reward model. It usually has lower variance, and it stays unbiased if either the propensities or the reward model is correct.
Log propensities at serving time. Reconstructing them offline from rank is approximate; the true propensity depends on the real-time filtering logic your serving stack applied.
IPS estimates are unbiased when the propensities are correct and nonzero for every item the new policy can show, but they are high-variance. Use clipping (cap at 10x or 100x) to reduce variance at the cost of some bias.
Counterfactual evaluation does not replace A/B testing; it narrows the candidate set. Only metrics from live experiments are fully trustworthy.
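
The clipping trade-off in the list above can be checked empirically. The sketch below repeats a small simulation many times and reports the bias and standard deviation of clipped IPS at several caps; the policies, click rates, and the `clipping_tradeoff` name are assumptions chosen for illustration.

```python
import numpy as np

def clipping_tradeoff(
    thresholds=(2.0, 5.0, 100.0), n: int = 2_000, n_trials: int = 200, seed: int = 0
) -> dict:
    """Bias and spread of clipped IPS across repeated simulated logs."""
    rng = np.random.default_rng(seed)
    true_ctr = np.array([0.05, 0.10, 0.15, 0.20, 0.25])       # assumed click rates
    logging_probs = np.array([0.60, 0.20, 0.10, 0.05, 0.05])  # assumed old policy
    target_probs = np.array([0.05, 0.05, 0.10, 0.20, 0.60])   # assumed new policy
    true_value = float(target_probs @ true_ctr)

    estimates = {c: [] for c in thresholds}
    for _ in range(n_trials):
        actions = rng.choice(5, size=n, p=logging_probs)
        rewards = (rng.random(n) < true_ctr[actions]).astype(float)
        raw_weights = target_probs[actions] / logging_probs[actions]
        for c in thresholds:
            clipped = np.minimum(raw_weights, c)               # cap each weight at c
            estimates[c].append((clipped * rewards).mean())

    return {
        c: {"bias": float(np.mean(v) - true_value), "std": float(np.std(v))}
        for c, v in estimates.items()
    }

tradeoff = clipping_tradeoff()
for cap, stats in tradeoff.items():
    print(f"cap={cap}: bias={stats['bias']:+.4f}, std={stats['std']:.4f}")
```

In this setup the largest raw weight is 12, so a cap of 100 leaves the estimator unchanged, while a cap of 2 cuts the spread sharply but pulls the estimate well below the true value.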
