GenAI Systems Lab
AI Engineering

The Online Evaluation Loop: Making Your Eval Set Self-Refreshing

Static eval sets go stale. How to pipeline production failures into your eval set automatically, detect distribution shift before it becomes a user complaint, and run shadow evaluations without user exposure.

Static eval sets have a half-life. They go stale as your model improves, as your user population shifts, and as your product changes scope. The eval set you built in month one measures month-one failure modes. By month six, you are measuring a past version of your own problems. The solution is to close the loop: make production traffic an automatic source of new eval cases.

Why Static Eval Sets Fail

The Production Failure Pipeline

A production failure pipeline is an automated system that routes production failures into your eval set without human review of every case. The pipeline has three stages: detection, triage, and intake.

import hashlib


class ProductionFailurePipeline:
    def __init__(self, eval_store, model_judge, dedup_store):
        self.eval_store  = eval_store
        self.model_judge = model_judge  # LLM-as-judge for triage
        self.dedup_store = dedup_store

    def compute_fingerprint(self, signal_data):
        # Assumption: this helper is called below but not defined in the
        # original. Here a case is identified by its query text, normalized
        # for case and whitespace, so repeated queries deduplicate.
        normalized = " ".join(signal_data.query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def process_session(self, session):
        # Stage 1: Detection (identify failure signals)
        signals = []
        if session.has_reformulation:
            signals.append(("reformulation", session.reformulation_pair))
        if session.has_negative_feedback:
            signals.append(("explicit_negative", session.feedback_event))
        if session.abandoned_after_first_response:
            signals.append(("abandonment", session))
        if not signals:
            return

        # Stage 2: Triage (model judge confirms genuine failure)
        for signal_type, signal_data in signals:
            j = self.model_judge.evaluate(signal_data)
            if j.is_genuine_failure and j.confidence > 0.8:
                # Stage 3: Intake (deduplicate and add to eval set)
                fp = self.compute_fingerprint(signal_data)
                if not self.dedup_store.exists(fp):
                    self.eval_store.add(
                        query=signal_data.query,
                        failure_type=signal_type,
                        source="production",
                        date=session.timestamp,
                        # Confirmed failures below 0.95 confidence go to human review
                        requires_human_review=(j.confidence < 0.95),
                    )
                    self.dedup_store.add(fp)
Set a budget for human review. Not every case that enters the pipeline needs human adjudication: the LLM judge handles roughly 80% on its own. Flag the remaining 20% or so, where judge confidence falls below the review threshold (0.95 in the code above), for a weekly human review cycle. This keeps the pipeline sustainable without requiring manual triage of every production failure.
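One way to enforce a review budget is a weekly queue that takes the least-confident flagged cases first and defers the rest to the next cycle. The following is a minimal sketch under stated assumptions: `FlaggedCase`, `select_for_review`, the `judge_confidence` field, and the `budget` parameter are hypothetical names introduced for illustration, and the sample queries are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedCase:
    query: str
    judge_confidence: float  # hypothetical field: confidence reported by the LLM judge


def select_for_review(cases, budget, auto_accept_at=0.95):
    """Return (to_review, deferred) for one weekly review cycle.

    Cases at or above `auto_accept_at` skip review entirely. The rest are
    sorted so the least-confident cases are reviewed first, capped at `budget`.
    """
    needs_review = [c for c in cases if c.judge_confidence < auto_accept_at]
    needs_review.sort(key=lambda c: c.judge_confidence)
    return needs_review[:budget], needs_review[budget:]


# Illustrative data only.
cases = [
    FlaggedCase("refund policy for annual plans", 0.97),
    FlaggedCase("export chart as svg", 0.82),
    FlaggedCase("cancel order after shipping", 0.91),
    FlaggedCase("merge two workspaces", 0.86),
]
to_review, deferred = select_for_review(cases, budget=2)
print([c.query for c in to_review])
print([c.query for c in deferred])
```

Reviewing the lowest-confidence cases first is a design choice: it spends the limited human time where the judge is least reliable. If the deferred list grows week over week, the budget or the threshold probably needs adjusting.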

Shadow Evaluation Pattern

Shadow evaluation runs a candidate model alongside the live model on real traffic, compares outputs offline, and surfaces cases where the candidate would have behaved differently — without exposing users to the candidate model. This is the safest way to evaluate on production distribution before any live exposure.
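The pattern can be sketched as a thin wrapper around the two models. This is an illustration, not a production design: `ShadowEvaluator`, `ShadowRecord`, and the callable-model interface are assumptions, and the default "differs" check is a plain string comparison where a real system would more likely use a semantic comparison or a judge model.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    query: str
    live_output: str
    candidate_output: str
    differs: bool


class ShadowEvaluator:
    def __init__(self, live_model, candidate_model, differ=None):
        self.live_model = live_model
        self.candidate_model = candidate_model
        # Assumption: exact (whitespace-trimmed) mismatch counts as "different".
        self.differ = differ or (lambda a, b: a.strip() != b.strip())
        self.log = []

    def handle(self, query):
        live_output = self.live_model(query)
        try:
            candidate_output = self.candidate_model(query)
        except Exception as exc:  # a candidate crash must never affect the user
            candidate_output = f"<candidate error: {exc}>"
        self.log.append(ShadowRecord(
            query, live_output, candidate_output,
            self.differ(live_output, candidate_output),
        ))
        return live_output  # only the live model's answer is ever returned

    def disagreements(self):
        """Cases where the candidate would have behaved differently."""
        return [r for r in self.log if r.differs]


# Stand-in models for demonstration only.
live = lambda q: q.upper()
candidate = lambda q: q.upper() if len(q) < 10 else q.title()

shadow = ShadowEvaluator(live, candidate)
answers = [shadow.handle(q) for q in ["reset pin", "change billing address"]]
print(answers)
print([r.query for r in shadow.disagreements()])
```

For brevity the candidate is called inline here. To keep it fully off the user's request path, the same comparison could instead run asynchronously or over logged traffic replayed later.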

Distribution Shift Detection

Before your eval set can catch distribution shift, you need to detect it. Three approaches at different levels of granularity:
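As one illustration of detection at the cluster level (an assumption of this sketch, not necessarily one of the approaches the article has in mind), you can compare how queries are spread across clusters in a reference window and in a recent window. The sketch below uses the Population Stability Index, PSI = Σ (cur − ref) · ln(cur / ref), over cluster shares. The function names, the cluster labels, and the smoothing constant `eps` are hypothetical, and the cluster assignments are assumed to come from an upstream step not shown here.

```python
import math
from collections import Counter


def cluster_shares(labels, clusters, eps=1e-6):
    """Share of traffic per cluster, smoothed so no share is exactly zero."""
    if not labels:
        raise ValueError("need at least one labeled query")
    counts = Counter(labels)
    total = len(labels)
    # eps keeps the logarithm defined when a cluster is empty in one window
    raw = {c: counts.get(c, 0) / total + eps for c in clusters}
    norm = sum(raw.values())
    return {c: v / norm for c, v in raw.items()}


def population_stability_index(reference_labels, current_labels):
    """PSI between two windows of cluster labels; 0 means identical shares."""
    clusters = sorted(set(reference_labels) | set(current_labels))
    ref = cluster_shares(reference_labels, clusters)
    cur = cluster_shares(current_labels, clusters)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in clusters)


# Illustrative labels only: a new "export" cluster appears in the current window.
reference = ["billing"] * 50 + ["search"] * 50
current = ["billing"] * 20 + ["search"] * 50 + ["export"] * 30

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, current)
print(psi_same, psi_shifted)
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a large shift, but the value depends on `eps` whenever a cluster is new or has disappeared, so any alert threshold should be calibrated on your own traffic.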

The Self-Refreshing Eval

A self-refreshing eval set has three components running in parallel: (1) the static core — high-quality hand-curated cases covering fundamental capabilities that never age out; (2) the rolling window — auto-ingested production failures from the last 90 days, continuously refreshed; (3) the cluster-representative sample — one auto-selected query per major cluster, updated weekly to track distribution shifts.

Target ratio for most production systems: 30% static core, 50% rolling window, 20% cluster sample. Adjust toward more static core for high-stakes regulated domains (legal, medical) where stability matters more than recency. Adjust toward more rolling window for consumer products where user behavior evolves rapidly.
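The target ratio and the 90-day window can be turned into a small composition step. This is a sketch under stated assumptions: cases are plain dicts with hypothetical `query` and `date` keys, `compose_eval_set` is an illustrative name, and sampling within each component is uniformly random, which the article does not specify.

```python
import random
from datetime import datetime, timedelta


def compose_eval_set(static_core, production_cases, cluster_sample, size,
                     now, ratios=(0.30, 0.50, 0.20), window_days=90, seed=0):
    """Build an eval set from the three components at the given ratios."""
    rng = random.Random(seed)  # seeded so a given refresh is reproducible
    cutoff = now - timedelta(days=window_days)
    # Rolling window: only production failures from the last `window_days`
    rolling = [c for c in production_cases if c["date"] >= cutoff]

    n_static = round(size * ratios[0])
    n_rolling = round(size * ratios[1])
    n_cluster = size - n_static - n_rolling  # remainder goes to the cluster sample

    def take(pool, n):
        # If a pool is smaller than its quota, take everything it has
        return rng.sample(pool, min(n, len(pool)))

    return {
        "static_core": take(static_core, n_static),
        "rolling_window": take(rolling, n_rolling),
        "cluster_sample": take(cluster_sample, n_cluster),
    }


# Illustrative pools only.
now = datetime(2025, 6, 1)
static_core = [{"query": f"core-{i}", "date": now} for i in range(60)]
recent = [{"query": f"recent-{i}", "date": now - timedelta(days=i)} for i in range(80)]
stale = [{"query": f"stale-{i}", "date": now - timedelta(days=120 + i)} for i in range(40)]
clusters = [{"query": f"cluster-{i}", "date": now} for i in range(25)]

eval_set = compose_eval_set(static_core, recent + stale, clusters, size=100, now=now)
print({name: len(cases) for name, cases in eval_set.items()})
```

Shifting toward a regulated-domain profile or a fast-moving consumer profile is then a matter of passing different `ratios`, for example a larger first value for more static core.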
