The Online Evaluation Loop: Making Your Eval Set Self-Refreshing
Static eval sets go stale. How to pipeline production failures into your eval set automatically, detect distribution shift before it becomes a user complaint, and run shadow evaluations without user exposure.
Static eval sets have a half-life. They go stale as your model improves, as your user population shifts, as your product changes scope. The eval set you built in month one measures month-one failure modes. By month six, you are measuring a past version of your own problems. The solution is to close the loop: make production traffic the source of new eval cases automatically.
Why Static Eval Sets Fail
- Distribution shift: your users change. New use cases emerge, old ones fade. A static eval built from early-adopter traffic does not represent the current user population.
- Overfitting to the eval: your team iterates against the eval set over months. Consciously or not, engineering decisions get made that optimize the measured eval. The eval stops being a proxy for production quality and becomes a target.
- Missing failure modes: your static eval can only contain failure modes you already knew about when you built it. Novel failure modes — which are the most important ones to catch — are systematically absent.
- Staleness of correct answers: ground truth changes. Policy documents update, product features change, correct answers evolve. Static evals with fixed gold labels become wrong over time without anyone noticing.
The Production Failure Pipeline
A production failure pipeline is an automated system that routes production failures into your eval set without human review of every case. The pipeline has three stages: detection, triage, and intake.
```python
import hashlib


class ProductionFailurePipeline:
    def __init__(self, eval_store, model_judge, dedup_store):
        self.eval_store = eval_store
        self.model_judge = model_judge  # LLM-as-judge for triage
        self.dedup_store = dedup_store

    def compute_fingerprint(self, signal_data):
        # Assumed helper (the original calls it without defining it):
        # exact-match dedup on the normalized query text.
        normalized = " ".join(signal_data.query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def process_session(self, session):
        # Stage 1: Detection (identify failure signals).
        # Each signal payload is assumed to expose a `.query` attribute.
        signals = []
        if session.has_reformulation:
            signals.append(("reformulation", session.reformulation_pair))
        if session.has_negative_feedback:
            signals.append(("explicit_negative", session.feedback_event))
        if session.abandoned_after_first_response:
            signals.append(("abandonment", session))
        if not signals:
            return

        # Stage 2: Triage (model judge confirms a genuine failure).
        for signal_type, signal_data in signals:
            j = self.model_judge.evaluate(signal_data)
            if not (j.is_genuine_failure and j.confidence > 0.8):
                continue

            # Stage 3: Intake (deduplicate, then add to the eval set).
            fp = self.compute_fingerprint(signal_data)
            if self.dedup_store.exists(fp):
                continue
            self.eval_store.add(
                query=signal_data.query,
                failure_type=signal_type,
                source="production",
                date=session.timestamp,
                # Accepted cases below 0.95 confidence still get a human look.
                requires_human_review=(j.confidence < 0.95),
            )
            self.dedup_store.add(fp)
```
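Set a budget for human review. Not every case that enters the pipeline needs human adjudication: in the code above, cases with judge confidence of 0.95 or higher are accepted outright, and cases between 0.8 and 0.95 are added but flagged. A reasonable target is for the LLM judge to settle about 80% of cases on its own, leaving the roughly 20% below the review threshold for a weekly human review cycle. This keeps the pipeline sustainable without requiring manual triage of every production failure.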
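One way to make the review budget concrete is a capped weekly queue: take the flagged cases, put the least confident first, and stop at a fixed count. This is a minimal sketch, not part of the pipeline above; `FlaggedCase`, `weekly_review_queue`, and the default budget of 50 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedCase:
    """Hypothetical record for a case the judge accepted into the eval set."""
    query: str
    judge_confidence: float


def weekly_review_queue(cases, review_threshold=0.95, budget=50):
    """Return the cases a human should review this week.

    Only cases below `review_threshold` are eligible. The least confident
    cases come first, and the list is cut off at `budget` items.
    """
    eligible = [c for c in cases if c.judge_confidence < review_threshold]
    eligible.sort(key=lambda c: c.judge_confidence)
    return eligible[:budget]


cases = [
    FlaggedCase("reset password loop", 0.97),
    FlaggedCase("refund policy for annual plan", 0.82),
    FlaggedCase("export to CSV fails", 0.91),
]
queue = weekly_review_queue(cases, budget=2)
```

Cases that do not fit in this week's budget stay flagged and can be picked up in the next cycle.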
Shadow Evaluation Pattern
Shadow evaluation runs a candidate model alongside the live model on real traffic, compares outputs offline, and surfaces cases where the candidate would have behaved differently — without exposing users to the candidate model. This is the safest way to evaluate on production distribution before any live exposure.
- Architecture: each sampled request goes to both the live model and the shadow model. The shadow model response is logged but not returned to the user. An offline comparison job runs nightly.
- What to compare: response length distribution, confidence scores, semantic similarity to the live response (cosine similarity below 0.85 flags a divergence worth reviewing), policy compliance scores, and latency percentiles.
- When shadow disagrees with live: divergent cases are routed to the production failure pipeline for human review. High divergence rate is a signal that the candidate model has meaningfully different behavior — good or bad.
- Cost: shadowing 100% of traffic doubles inference cost. Shadowing a 5-10% sample instead adds only 5-10% to the inference bill, which is usually enough for aggregate comparisons when traffic volume is high.
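The sampling and divergence checks described above can be sketched in a few lines. This is an illustrative sketch using only the standard library; the function names are hypothetical, and the embeddings are assumed to come from whatever sentence encoder you already use for the offline comparison job.

```python
import hashlib
import math


def in_shadow_sample(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically pick a fixed fraction of requests for shadowing.

    Hashing the request ID keeps the decision stable across retries.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_divergent(live_embedding, shadow_embedding, threshold=0.85):
    """Flag a live/shadow response pair whose similarity is below the threshold."""
    return cosine_similarity(live_embedding, shadow_embedding) < threshold
```

Hash-based sampling is one design choice among several; it avoids storing per-request state, at the cost of not adapting the rate to traffic volume.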
Distribution Shift Detection
Before your eval set can catch distribution shift, you need to detect it. Three approaches at different levels of granularity:
- Embedding drift monitoring: embed every incoming query with a frozen encoder and compute the centroid of each week's embeddings. A cosine distance between the week N and week N-1 centroids above an alert threshold (0.05 to 0.1 is a starting range) signals meaningful drift. This catches broad topic shifts.
- Query cluster analysis: run k-means on the embedding space weekly, track cluster membership counts over time. Clusters that grow more than 2x in a week indicate an emerging use case. Clusters that shrink to near-zero indicate a dying use case. Both warrant eval set updates.
- Topic model over time: fit an LDA or BERTopic model on weekly query samples. Track topic proportions. Rising topics not represented in your eval set are eval blind spots. Falling topics that dominate your eval set are wasted eval coverage.
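The first two approaches reduce to small computations once the embeddings and cluster counts exist. The sketch below uses plain Python lists for clarity. The function names are hypothetical, the `dying_share` cutoff of 1% is an assumed stand-in for "near-zero", and the cluster comparison assumes cluster IDs are comparable across weeks (for example, by assigning new queries to a fixed set of centroids rather than refitting from scratch).

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid_drift(last_week, this_week, threshold=0.05):
    """Return (distance, drifted) for two weeks of query embeddings."""
    distance = cosine_distance(centroid(last_week), centroid(this_week))
    return distance, distance > threshold


def cluster_changes(prev_counts, curr_counts, growth_factor=2.0, dying_share=0.01):
    """Split clusters into emerging (more than 2x growth) and dying (tiny share)."""
    total = sum(curr_counts.values()) or 1
    emerging, dying = [], []
    for cluster_id, prev in prev_counts.items():
        curr = curr_counts.get(cluster_id, 0)
        if prev > 0 and curr > growth_factor * prev:
            emerging.append(cluster_id)
        elif prev > 0 and curr / total < dying_share:
            dying.append(cluster_id)
    return emerging, dying
```

Both lists are candidates for eval set updates: emerging clusters need new cases, and dying clusters are candidates for retiring existing ones.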
The Self-Refreshing Eval
A self-refreshing eval set has three components running in parallel: (1) the static core — high-quality hand-curated cases covering fundamental capabilities that never age out; (2) the rolling window — auto-ingested production failures from the last 90 days, continuously refreshed; (3) the cluster-representative sample — one auto-selected query per major cluster, updated weekly to track distribution shifts.
Target ratio for most production systems: 30% static core, 50% rolling window, 20% cluster sample. Adjust toward more static core for high-stakes regulated domains (legal, medical) where stability matters more than recency. Adjust toward more rolling window for consumer products where user behavior evolves rapidly.
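To turn the ratio into concrete case counts, something like the following works. It is a small sketch with a hypothetical function name; percentages are kept as integers to avoid floating-point rounding, and any remainder is assigned to the rolling window, which is an arbitrary choice.

```python
def eval_composition(total_cases, percents=None):
    """Split an eval budget across the three components.

    `percents` maps component name to an integer percentage. Rounding
    remainders go to the rolling window so the parts sum to the total.
    """
    percents = percents or {"static": 30, "rolling": 50, "cluster": 20}
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total_cases * pct // 100 for name, pct in percents.items()}
    counts["rolling"] += total_cases - sum(counts.values())
    return counts


default_mix = eval_composition(1000)
# A more conservative mix for a regulated domain (illustrative numbers only).
regulated_mix = eval_composition(1000, {"static": 50, "rolling": 35, "cluster": 15})
```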
- Governance: every eval case needs a source tag (static/rolling/cluster), an ingestion date, and an expiry policy. Rolling window cases expire at 90 days. Cluster samples expire when the cluster drops below 2% of traffic.
- Versioning: tag your eval set by date. When you report a metric, report it against a specific eval version. This prevents retroactive confusion when the eval set changes and metrics move.
- Drift alert threshold: if more than 30% of your current eval set was ingested from a different distribution than today's live traffic (measured by embedding distance), trigger a refresh. A stale eval set is worse than no eval set because it gives false confidence.
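The governance rules above map naturally onto a small record type and a few predicates. This is a sketch under assumptions: `EvalCase`, `is_expired`, `needs_refresh`, and `eval_version` are hypothetical names, and the per-case off-distribution flag is assumed to be computed elsewhere from embedding distance.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ROLLING_TTL = timedelta(days=90)   # rolling window cases expire at 90 days
MIN_CLUSTER_SHARE = 0.02           # cluster samples expire below 2% of traffic


@dataclass(frozen=True)
class EvalCase:
    query: str
    source: str                    # "static", "rolling", or "cluster"
    ingested_on: date
    cluster_id: Optional[str] = None


def is_expired(case, today, cluster_share):
    """Apply the expiry policy for the case's source tag."""
    if case.source == "rolling":
        return today - case.ingested_on >= ROLLING_TTL
    if case.source == "cluster":
        return cluster_share.get(case.cluster_id, 0.0) < MIN_CLUSTER_SHARE
    return False  # static core cases do not expire


def needs_refresh(off_distribution_flags, max_stale_fraction=0.30):
    """True when more than 30% of cases are flagged as off-distribution."""
    if not off_distribution_flags:
        return False
    stale = sum(1 for flag in off_distribution_flags if flag)
    return stale / len(off_distribution_flags) > max_stale_fraction


def eval_version(today):
    """Date-based version tag to report alongside every metric."""
    return f"eval-{today.isoformat()}"
```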