The Eval Flywheel: From Implicit Feedback to Continuous Model Improvement
How clicks, dwell time, and bounce rate become training signal. The position bias problem, IPS debiasing for implicit feedback, and the architectural loop that connects user behavior back to model retraining without the held-out test set bottleneck.
The Eval Flywheel Problem
A team ships a model. They evaluate it with a held-out test set. They retrain every month. Three months in, the test set has leaked into the training distribution. Six months in, their offline metrics are excellent and their product metrics are flat. This is the eval flywheel problem: the loop that should produce better models is instead producing models that are better at the test set rather than better in the world.
The solution is a self-reinforcing loop that connects real user behavior back to model improvement without the human-labeled-test-set bottleneck.
Implicit Feedback Signals
Implicit feedback is behavior that reveals preferences without explicit ratings. Users don't say 'this result is good' — they click it, they dwell on it, they return to the session, or they abandon it.
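One way to make these signals concrete is to map each logged impression to a coarse label. The sketch below is illustrative only: the `Impression` schema, the label names, and the dwell-time thresholds are assumptions introduced here, not values from any particular logging system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Impression:
    """One result shown to a user (hypothetical log schema)."""
    query: str
    doc_id: str
    rank: int
    clicked: bool
    dwell_seconds: Optional[float] = None  # time spent on the result after a click
    returned_to_results: bool = False      # user came back to the result list


# Assumed thresholds for illustration; tune them per product.
SHORT_DWELL_SECONDS = 5.0
LONG_DWELL_SECONDS = 30.0


def implicit_signal(imp: Impression) -> str:
    """Collapse raw behavior into a coarse implicit-feedback label."""
    if not imp.clicked:
        return "no_click"
    if imp.dwell_seconds is None:
        return "click"  # clicked, but dwell time was not recorded
    if imp.dwell_seconds < SHORT_DWELL_SECONDS and imp.returned_to_results:
        return "bounce"  # quick return suggests the result did not help
    if imp.dwell_seconds >= LONG_DWELL_SECONDS:
        return "satisfied_click"
    return "click"
```

A bounce can then be treated as a weak negative and a long-dwell click as a stronger positive than a bare click, though how much to trust each label is a product-specific judgment.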
Inverse Propensity Scoring (IPS) for Position Bias
A click at rank 3 is worth more than a click at rank 1, because the probability of a user even examining rank 3 is lower. IPS weights each click by the inverse of the probability that the user examined that position, a quantity that depends on rank and not on the quality of the document shown there.
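A short derivation shows why the inverse weighting removes position bias. It relies on the examination hypothesis, an assumption not spelled out in the text: a click occurs only when a result is both examined and relevant, and examination depends only on rank. The symbols $c$ (click), $e$ (examined), $r$ (relevant), $p_k$ (examination propensity at rank $k$), and $\rho_d$ (relevance probability of document $d$) are introduced here for illustration.

```latex
\begin{align*}
P(c = 1 \mid d, k) &= P(e = 1 \mid k)\, P(r = 1 \mid d) = p_k \, \rho_d, \\
\mathbb{E}\!\left[ \frac{c}{p_k} \,\middle|\, d, k \right] &= \frac{p_k \, \rho_d}{p_k} = \rho_d .
\end{align*}
```

Under this assumption the weighted click $c / p_k$ is an unbiased estimate of relevance regardless of where the document was shown. With a power-law propensity $p_k = k^{-\alpha}$ the weight is $k^{\alpha}$, so for $\alpha = 0.6$ a click at rank 3 counts about 1.93 times as much as a click at rank 1.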
# Examination propensity model: P(examined | rank)
# Fitted from randomized experiments or swap experiments
def examination_propensity(rank: int, alpha: float = 0.6) -> float:
    """Power-law model: P(examine | rank) = 1 / rank^alpha"""
    return 1.0 / (rank ** alpha)


def ips_relevance(click: int, rank: int, alpha: float = 0.6) -> float:
    """
    IPS-debiased relevance estimate.
    click=1 if user clicked, 0 otherwise.
    IPS weight = 1 / P(examine | rank) applied to positive labels only.
    """
    if click == 0:
        return 0.0
    return click / examination_propensity(rank, alpha)


# Training data with position-debiased labels
training_data = [
    {"query": "q1", "doc": "d1", "rank": 1, "click": 1},
    {"query": "q1", "doc": "d2", "rank": 3, "click": 1},  # rank-3 click gets a larger weight (about 1.933)
    {"query": "q2", "doc": "d3", "rank": 2, "click": 0},
]

for row in training_data:
    row["ips_label"] = ips_relevance(row["click"], row["rank"])
    print(f"rank {row['rank']}, click {row['click']} → IPS label = {row['ips_label']:.3f}")
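The comment above says the propensity model is fitted from randomized experiments. A minimal sketch of one way to do that follows, assuming you have click-through rates per rank from traffic where result order was randomized, so that average relevance is the same at every rank and CTR differences reflect examination alone. The function name `fit_alpha` and its input format are hypothetical.

```python
import math
from typing import Mapping


def fit_alpha(ctr_by_rank: Mapping[int, float]) -> float:
    """Least-squares fit of alpha in CTR(rank) ~ CTR(1) / rank**alpha.

    Assumes ctr_by_rank was measured on randomized rankings.
    """
    # In log space the power law is a line: log CTR = intercept - alpha * log rank
    points = [
        (math.log(rank), math.log(ctr))
        for rank, ctr in ctr_by_rank.items()
        if ctr > 0
    ]
    if len(points) < 2:
        raise ValueError("need a positive CTR at two or more ranks")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    return -slope  # the slope of the fitted line is -alpha


# Synthetic check: CTRs generated from alpha = 0.6 should recover alpha = 0.6
synthetic_ctr = {rank: 0.3 / rank ** 0.6 for rank in range(1, 11)}
estimated_alpha = fit_alpha(synthetic_ctr)
```

Real CTR curves are noisy and rarely follow a clean power law, so it is worth plotting the fit before trusting a single exponent.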
The Flywheel: Closing the Loop
Collect implicit feedback from production → clean and debias (IPS for position bias, filtering for bot and spam traffic) → generate weak supervision labels → retrain the ranking or retrieval model → deploy via canary → collect new feedback. When the loop works, each iteration produces a model that better reflects real user intent, which in turn yields higher-quality implicit feedback for the next iteration.
# Minimal flywheel pipeline
# Reuses ips_relevance() from the previous example. The helper methods
# load_clicks, sample_negatives, train_on_pairs, offline_eval,
# current_champion_ndcg, register, and deploy_canary are not defined here;
# they depend on your storage, training, and serving stack.
class EvalFlywheel:
    def __init__(self, click_log_table: str, model_registry_path: str):
        self.click_log = click_log_table
        self.registry = model_registry_path

    def generate_training_data(self, days: int = 30):
        """Pull recent clicks, apply IPS debiasing, output training pairs."""
        clicks = self.load_clicks(days)
        pairs = []
        for click in clicks:
            ips_weight = ips_relevance(click["clicked"], click["rank"])
            if ips_weight > 0:  # only clicked rows become positive pairs
                pairs.append({
                    "query": click["query"],
                    "positive_doc": click["doc_id"],
                    "weight": ips_weight,
                    "negative_docs": self.sample_negatives(click["query"], click["doc_id"]),
                })
        return pairs

    def run_iteration(self):
        data = self.generate_training_data()
        model = self.train_on_pairs(data)
        metrics = self.offline_eval(model)
        # Offline NDCG is only a gate for entering the canary, not for promotion
        if metrics["ndcg@10"] > self.current_champion_ndcg():
            self.register(model)
            self.deploy_canary(model, pct=0.05)
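The pipeline above attaches a `weight` to each training pair but leaves open how training uses it. One plausible reading is an IPS-weighted pairwise logistic loss, sketched below. The `score` callable stands in for whatever model is being trained, and the `max_weight` clipping is a common variance-control step added here as an assumption; the text does not mention it.

```python
import math
from typing import Callable, Dict, List


def weighted_pairwise_loss(
    pairs: List[Dict],
    score: Callable[[str, str], float],
    max_weight: float = 10.0,  # assumed cap; large IPS weights inflate variance
) -> float:
    """Weighted mean of log(1 + exp(-(s_pos - s_neg))) over all (pos, neg) pairs."""
    total = 0.0
    weight_sum = 0.0
    for pair in pairs:
        weight = min(pair["weight"], max_weight)
        pos_score = score(pair["query"], pair["positive_doc"])
        for neg_doc in pair["negative_docs"]:
            margin = pos_score - score(pair["query"], neg_doc)
            # Numerically stable form of log(1 + exp(-margin))
            loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
            total += weight * loss
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


# Usage with a toy scorer that ranks "d2" above everything else
example_pairs = [
    {"query": "q1", "positive_doc": "d2", "weight": 1.933, "negative_docs": ["d7", "d9"]},
]
toy_score = lambda query, doc: 2.0 if doc == "d2" else 0.0
example_loss = weighted_pairwise_loss(example_pairs, toy_score)
```

A scorer that cannot separate positives from negatives yields a loss of log 2 per pair, so values well below that indicate the model is ordering clicked documents above sampled negatives.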
The flywheel only works if you resist the urge to immediately deploy models that beat the offline test set. Always canary-test against real user metrics (CTR, session success rate) before promoting. The whole point of the flywheel is to catch the gap between offline and online quality.
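A promotion gate for the canary can be sketched as a one-sided two-proportion z-test on CTR. This is one simple choice among several and is not prescribed by the text; the function name, argument names, and the 0.05 significance level are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def canary_beats_control(
    control_clicks: int,
    control_sessions: int,
    canary_clicks: int,
    canary_sessions: int,
    significance: float = 0.05,  # assumed threshold
) -> bool:
    """Return True if canary CTR is significantly higher than control CTR."""
    p_control = control_clicks / control_sessions
    p_canary = canary_clicks / canary_sessions

    # Pooled CTR under the null hypothesis that both arms share one rate
    pooled = (control_clicks + canary_clicks) / (control_sessions + canary_sessions)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_sessions + 1 / canary_sessions)
    )
    if std_err == 0:
        return False  # no variation in the data, so nothing to conclude

    z = (p_canary - p_control) / std_err
    p_value = 1.0 - NormalDist().cdf(z)  # one-sided: canary > control
    return p_value < significance


# Usage: a 95/5 traffic split, matching pct=0.05 in the pipeline above
promote = canary_beats_control(
    control_clicks=9_500, control_sessions=95_000,
    canary_clicks=600, canary_sessions=5_000,
)
```

The same check can be repeated for session success rate. A canary that wins offline but fails this gate is exactly the offline/online gap the flywheel is meant to surface.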