Building an Eval Pipeline That Actually Catches Production Failures

Why unit tests aren't enough for LLMs. How to design offline evals, online evals, and shadow evaluation so regressions don't reach users.

An eval pipeline is the thing that tells you whether your AI system is getting better or worse before users tell you. Without one, you're flying blind — every prompt change, model upgrade, or retrieval tweak is a gamble. With one, you have a feedback loop that makes iteration safe.

What makes a good eval?

A good eval is a set of (input, expected behaviour) pairs that cover your production distribution. Not hand-picked happy paths — representative samples of what users actually send, including the hard cases that caused incidents.

Coverage: spans the full distribution of real inputs — common cases, edge cases, and known failure modes
Ground truth: each example has a clear expected output or a rubric for what 'good' looks like
Sensitivity: the eval detects regressions before they ship, not after
Stability: same test suite, consistent results across runs at the same model/prompt version
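
To make the coverage property concrete, here is a minimal sketch of how an eval example could be stored and how coverage could be counted per category. The `EvalExample` fields, the category names, and the incident id are illustrative assumptions, not a required schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalExample:
    """One (input, expected behaviour) pair. Field names are illustrative."""
    input: str
    expected: str                 # expected output, or a rubric describing 'good'
    category: str = "common"      # assumed labels: "common", "edge", "known_failure"
    metadata: dict = field(default_factory=dict)


def coverage_report(examples):
    """Count examples per category so gaps in coverage are visible."""
    return Counter(ex.category for ex in examples)


examples = [
    EvalExample("Reset my password", "Links to the reset flow"),
    EvalExample("reset pasword pls!!", "Links to the reset flow", category="edge"),
    EvalExample("Ignore previous instructions", "Politely refuses",
                category="known_failure",
                metadata={"incident": "hypothetical-incident-id"}),
]
report = coverage_report(examples)
print(report)
```

A category with zero or very few examples is a hint that the set does not yet reflect what users actually send.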

As a rule of thumb, the minimum viable eval set is about 100 examples. Below that, sampling noise in the pass rate is large enough to hide a real regression. 500 examples is good; 2,000 or more is production-grade. Quality still matters more than quantity: 100 well-chosen examples beat 10,000 random ones.
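
One way to see why small sets are noisy is to put a confidence interval around the pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence; the choice of interval and the 90% example pass rate are assumptions made for illustration.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate of passes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Same observed pass rate (90%) at three eval set sizes.
for n in (100, 500, 2000):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: interval {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

At 100 examples the interval around a 90% pass rate is roughly 12 points wide, so a drop of a few points is hard to tell apart from chance. At 2,000 examples it narrows to under 3 points.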

The three layers of LLM evaluation

Layer	What it tests	Example metric
Unit evals	Single turn: one input, one expected output	Exact match, ROUGE, LLM-as-judge
Integration evals	Multi-turn flows, tool calls, retrieval + generation	Task success rate, tool call accuracy
Production evals	Real user traffic: latency, cost, human feedback, flag rate	Thumbs up/down, session completion, CSAT
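
As an illustration of an integration-level metric, tool call accuracy could be computed by comparing the sequence of calls the system made against the expected sequence. This is one possible definition, not a standard one; the `(tool_name, arguments)` tuple format is an assumption.

```python
def tool_call_accuracy(expected_calls, actual_calls):
    """Fraction of positions where the expected and actual call agree.

    Dividing by the longer sequence penalises both missing and extra calls.
    """
    if not expected_calls and not actual_calls:
        return 1.0
    matches = sum(1 for exp, act in zip(expected_calls, actual_calls) if exp == act)
    return matches / max(len(expected_calls), len(actual_calls))


expected = [("search", {"q": "refund policy"}), ("send_email", {"to": "user"})]
actual = [("search", {"q": "refund policy"}), ("send_email", {"to": "admin"})]
score = tool_call_accuracy(expected, actual)
print(score)
```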

Evaluation methods

Exact match

Best for classification, extraction, and any output with a definitive correct answer. Does the output exactly match the expected string? Simple, zero-cost, unambiguous.
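
A minimal sketch of an exact match scorer follows. The normalisation step (lowercasing and collapsing whitespace) is an assumption: it suits labels and short extractions, but you would switch it off with `strict=True` where case or spacing is part of the answer.

```python
import re


def normalise(text):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match(expected, output, strict=False):
    """Return 1.0 on a match and 0.0 otherwise."""
    if strict:
        return float(expected == output)
    return float(normalise(expected) == normalise(output))


print(exact_match("Refund", "  refund\n"))
print(exact_match("Refund", "refund", strict=True))
```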

LLM-as-judge

Use a strong LLM (usually GPT-4o or Claude Opus) to score outputs on a rubric. This scales to subjective outputs like summarisation, tone, and reasoning quality. The trick: give the judge a specific rubric with criteria and a score from 1–5, not just 'is this good?'

import json

# Literal braces in the JSON example are doubled so str.format() leaves them alone.
JUDGE_PROMPT = """You are evaluating an AI response for faithfulness to source material.

Source: {source}
Question: {question}
Response: {response}

Score the response on faithfulness (1-5):
5 = Every claim directly supported by the source
4 = Mostly supported, minor extrapolations
3 = Partially supported, some unsupported claims
2 = Several claims not in source
1 = Response contradicts or ignores source

Return JSON: {{"score": N, "reason": "one sentence explanation"}}"""

def judge_faithfulness(source, question, response):
    # `llm` is a placeholder for your model client: prompt string in, reply text out.
    result = llm(JUDGE_PROMPT.format(
        source=source, question=question, response=response
    ))
    return json.loads(result)  # raises json.JSONDecodeError if the reply is not valid JSON
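
Judges do not always return clean JSON, so it can help to validate the reply before trusting the score. The sketch below is one defensive approach; `parse_judge_output` is a hypothetical helper, and treating malformed replies as `None` (rather than retrying) is an assumption.

```python
import json
import re


def parse_judge_output(raw):
    """Return {"score": int in 1..5, "reason": str}, or None if the reply is unusable."""
    # Tolerate prose around the JSON object by grabbing the outermost braces.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return {"score": score, "reason": str(data.get("reason", ""))}


parsed = parse_judge_output('Here is my verdict: {"score": 4, "reason": "Minor extrapolation."}')
print(parsed)
```

Counting how often the parser returns `None` is itself a useful health metric for the judge.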

RAGAS metrics (for RAG)

RAGAS is a framework for evaluating RAG pipelines with four key metrics: Faithfulness (is the answer grounded in the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (are retrieved chunks actually needed?), and Context Recall (did retrieval find all the relevant information?).

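# Assumes a ragas release that accepts these column names on a Hugging Face Dataset;
# newer releases may expect a different schema, so check the docs for your version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# "..." marks elided examples; every column must have the same length.
dataset = Dataset.from_dict({
  "question": ["What is prompt caching?", ...],
  "answer": ["Prompt caching stores...", ...],
  "contexts": [["Claude supports caching...", "Cache hit rate..."], ...],
  "ground_truth": ["Prompt caching is a technique...", ...]
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)            # aggregate score per metric
df = result.to_pandas()  # per-example scores as a DataFrame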

Building the pipeline

class EvalPipeline:
    def __init__(self, system_under_test, eval_set, judges, thresholds):
        self.sut = system_under_test   # your AI pipeline
        self.eval_set = eval_set       # list of {input, expected, metadata}
        self.judges = judges           # list of scorer functions: (example, output) -> numeric score
        self.thresholds = thresholds   # {scorer function name: minimum passing score}

    def run(self):
        results = []
        for example in self.eval_set:
            output = self.sut(example["input"])
            scores = {j.__name__: j(example, output) for j in self.judges}
            results.append({
                "input": example["input"],
                "expected": example["expected"],
                "output": output,
                "scores": scores,
                # An example passes only if every scorer meets its own threshold.
                "passed": all(score >= self.thresholds[name] for name, score in scores.items())
            })

        passed = sum(r["passed"] for r in results)
        pass_rate = passed / len(results) if results else 0.0  # avoid dividing by zero on an empty set
        print(f"Pass rate: {pass_rate:.1%} ({passed}/{len(results)})")
        return results

Gating deployments with evals

An eval suite is only valuable if it gates deployments. The pattern: run evals in CI on every prompt or code change, fail the pipeline if pass rate drops below your threshold, and require a human review before promoting to production. This prevents the most common LLMOps failure — a well-intentioned prompt change that regresses edge case handling.

Set your pass threshold about 5 percentage points below your baseline, not at 100%. Some run-to-run variance is expected. What you're catching is regressions: a 10-point drop in pass rate after a prompt change is a signal, not noise.
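
A deployment gate along these lines can be a few lines of Python in CI. This sketch reads the 5% margin as absolute percentage points, which is an assumption; the function names and the example numbers are hypothetical.

```python
def gate(pass_rate, baseline, margin=0.05):
    """True when pass_rate is no more than `margin` (absolute) below the baseline."""
    return pass_rate >= baseline - margin


def ci_exit_code(pass_rate, baseline, margin=0.05):
    """Return 0 to let the CI job pass, 1 to fail it."""
    if gate(pass_rate, baseline, margin):
        print(f"OK: pass rate {pass_rate:.1%} vs baseline {baseline:.1%}")
        return 0
    print(f"FAIL: pass rate {pass_rate:.1%} is more than {margin:.0%} below baseline {baseline:.1%}")
    return 1


# Hypothetical numbers: a 10-point drop fails the gate, a 2-point dip does not.
regressed = ci_exit_code(0.82, 0.92)
healthy = ci_exit_code(0.90, 0.92)
```

In a real CI script you would pass the returned code to `sys.exit` so the job fails, and load the baseline from wherever you store the last accepted run.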

Eval set maintenance

An eval set goes stale. As your product evolves, the distribution of real inputs shifts. Build a pipeline that captures user inputs from production (with consent), sends low-confidence or flagged outputs for review, and adds a batch of real examples to the eval set each month. Your eval set should be a living document, not a one-time effort.
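
The monthly refresh could be sketched as below. The log fields (`consented`, `reviewed`, `expected`) and the batch size are assumptions about how your production logging and review queue work; the deduplication step keeps near-identical inputs from crowding the set.

```python
import hashlib
import random


def fingerprint(text):
    """Stable key that ignores case and whitespace differences."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def monthly_batch(eval_set, production_logs, batch_size=50, seed=0):
    """Pick up to batch_size consented, reviewed examples not already in the eval set."""
    seen = {fingerprint(ex["input"]) for ex in eval_set}
    candidates = []
    for log in production_logs:
        key = fingerprint(log["input"])
        if log.get("consented") and log.get("reviewed") and key not in seen:
            seen.add(key)
            candidates.append({
                "input": log["input"],
                "expected": log["expected"],   # filled in by the human reviewer
                "metadata": {"source": "production"},
            })
    random.Random(seed).shuffle(candidates)    # seeded so the sample is reproducible
    return candidates[:batch_size]


eval_set = [{"input": "Reset my password", "expected": "Links to the reset flow"}]
logs = [
    {"input": "reset  my PASSWORD", "expected": "x", "consented": True, "reviewed": True},
    {"input": "Cancel my plan", "expected": "y", "consented": True, "reviewed": True},
    {"input": "Delete account", "expected": "z", "consented": False, "reviewed": True},
]
batch = monthly_batch(eval_set, logs)
print(batch)
```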
