Building an Eval Pipeline That Actually Catches Production Failures
Why unit tests aren't enough for LLMs. How to design offline evals, online evals, and shadow evaluation so regressions don't reach users.
An eval pipeline is the thing that tells you whether your AI system is getting better or worse before users tell you. Without one, you're flying blind — every prompt change, model upgrade, or retrieval tweak is a gamble. With one, you have a feedback loop that makes iteration safe.
What makes a good eval?
A good eval is a set of (input, expected behaviour) pairs that cover your production distribution. Not hand-picked happy paths — representative samples of what users actually send, including the hard cases that caused incidents.
- Coverage: spans the full distribution of real inputs — common cases, edge cases, and known failure modes
- Ground truth: each example has a clear expected output or a rubric for what 'good' looks like
- Sensitivity: the eval detects regressions before they ship, not after
- Stability: same test suite, consistent results across runs at the same model/prompt version
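A practical minimum for an eval set is about 100 examples. Below that, sampling noise drowns out real signal: at a 90% pass rate, 100 examples give a 95% confidence interval of roughly ±6 points, so a 3-point regression is invisible. 500 examples is good, and 2,000+ is production-grade. Quality still matters more than quantity: 100 carefully chosen, correctly labelled examples beat 10,000 picked at random.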
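To put a number on that noise, the sketch below computes a normal-approximation 95% confidence interval for a measured pass rate at each of the three set sizes. The function name `pass_rate_interval` is illustrative, and the approximation gets rough when the pass rate is close to 0% or 100%.

```python
import math

def pass_rate_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a pass rate (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Same 90% pass rate, measured on eval sets of different sizes.
for n in (100, 500, 2000):
    low, high = pass_rate_interval(passed=n * 9 // 10, total=n)
    print(f"n={n}: 95% CI {low:.1%} to {high:.1%}")
```

The interval narrows with the square root of the set size, which is why going from 100 to 500 examples helps far more than going from 2,000 to 2,400.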
The three layers of LLM evaluation
| Layer | What it tests | Example metric |
|---|---|---|
| Unit evals | Single turn: one input, one expected output | Exact match, ROUGE, LLM-as-judge |
| Integration evals | Multi-turn flows, tool calls, retrieval + generation | Task success rate, tool call accuracy |
| Production evals | Real user traffic: latency, cost, human feedback, flag rate | Thumbs up/down, session completion, CSAT |
Evaluation methods
Exact match
Best for classification, extraction, and any output with a definitive correct answer. Does the output exactly match the expected string? Simple, zero-cost, unambiguous.
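A minimal exact-match scorer might look like the sketch below. The normalization step (lowercasing and collapsing whitespace) is an assumption on top of the strict definition above. Whether to apply it is a judgment call: it forgives formatting noise, but it would be wrong for case-sensitive outputs such as IDs or code. All function names are illustrative.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as a miss."""
    return " ".join(text.lower().split())

def exact_match(expected: str, output: str) -> bool:
    """True if the two strings are identical after normalization."""
    return normalize(expected) == normalize(output)

def exact_match_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (expected, output) pairs that match; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(exact_match(e, o) for e, o in pairs) / len(pairs)
```

For strict string equality, drop `normalize` and compare the raw strings directly.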
LLM-as-judge
Use a strong LLM (usually GPT-4o or Claude Opus) to score outputs on a rubric. This scales to subjective outputs like summarisation, tone, and reasoning quality. The trick: give the judge a specific rubric with criteria and a score from 1–5, not just 'is this good?'
```python
import json

# Literal braces in the JSON example are doubled so str.format() leaves them alone.
JUDGE_PROMPT = """You are evaluating an AI response for faithfulness to source material.

Source: {source}
Question: {question}
Response: {response}

Score the response on faithfulness (1-5):
5 = Every claim directly supported by the source
4 = Mostly supported, minor extrapolations
3 = Partially supported, some unsupported claims
2 = Several claims not in source
1 = Response contradicts or ignores source

Return JSON: {{"score": N, "reason": "one sentence explanation"}}"""

def judge_faithfulness(source, question, response):
    # `llm` is a placeholder for your model client: it takes a prompt string
    # and returns the model's text completion.
    result = llm(JUDGE_PROMPT.format(
        source=source, question=question, response=response
    ))
    return json.loads(result)
```
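In practice a judge model does not always return bare JSON; it may wrap the object in a sentence or a code fence, or return a score outside the rubric. The sketch below is one way to guard against that before the score enters your results. `parse_judge_output` is an illustrative name, and the regex assumes the reply contains a single JSON object.

```python
import json
import re

def parse_judge_output(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from raw model text."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge output: {raw!r}")
    verdict = json.loads(match.group(0))
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"score": score, "reason": str(verdict.get("reason", ""))}
```

Raising on a malformed verdict, rather than silently defaulting to a score, keeps judge failures visible as their own category instead of polluting the pass rate.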
RAGAS metrics (for RAG)
RAGAS is a framework for evaluating RAG pipelines with four key metrics: Faithfulness (is the answer grounded in the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (are retrieved chunks actually needed?), and Context Recall (did retrieval find all the relevant information?).
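```python
# Written against the ragas 0.1-style API; newer releases rename the dataset
# columns, so check the docs for your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# evaluate() expects a Hugging Face Dataset rather than a plain dict.
dataset = Dataset.from_dict({
    "question": ["What is prompt caching?"],  # ...
    "answer": ["Prompt caching stores..."],  # ...
    "contexts": [["Claude supports caching...", "Cache hit rate..."]],  # ...
    "ground_truth": ["Prompt caching is a technique..."],  # ...
})
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)              # aggregate score per metric
print(result.to_pandas())  # DataFrame with per-example scores
```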
Building the pipeline
```python
class EvalPipeline:
    def __init__(self, system_under_test, eval_set, judges, thresholds):
        self.sut = system_under_test   # your AI pipeline: callable(input) -> output
        self.eval_set = eval_set       # list of {input, expected, metadata}
        self.judges = judges           # list of scorer functions: (example, output) -> number
        self.thresholds = thresholds   # {judge function name: minimum passing score}

    def run(self):
        results = []
        for example in self.eval_set:
            output = self.sut(example["input"])
            scores = {j.__name__: j(example, output) for j in self.judges}
            results.append({
                "input": example["input"],
                "expected": example["expected"],
                "output": output,
                "scores": scores,
                # An example passes only if every judge meets its own threshold.
                "passed": all(
                    score >= self.thresholds[name] for name, score in scores.items()
                ),
            })
        n_passed = sum(r["passed"] for r in results)
        pass_rate = n_passed / len(results) if results else 0.0
        print(f"Pass rate: {pass_rate:.1%} ({n_passed}/{len(results)})")
        return results
```
Gating deployments with evals
An eval suite is only valuable if it gates deployments. The pattern: run evals in CI on every prompt or code change, fail the pipeline if pass rate drops below your threshold, and require a human review before promoting to production. This prevents the most common LLMOps failure — a well-intentioned prompt change that regresses edge case handling.
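Set your pass threshold about 5 percentage points below your baseline, not at 100%. Some run-to-run variance is expected, and the gate exists to catch regressions: a 10-point drop in pass rate after a prompt change is a signal, not noise, provided your eval set is large enough that normal variance stays well under that.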
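The gate itself can be a few lines in a CI step. The sketch below reads the threshold as an absolute tolerance of five percentage points below baseline, which is an assumption about how you store and compare pass rates; `gate_exit_code` and its parameters are illustrative names.

```python
def gate_exit_code(pass_rate: float, baseline: float, tolerance: float = 0.05) -> int:
    """Return 0 if pass_rate is within `tolerance` of baseline, else 1.

    Rates are fractions in [0, 1]; tolerance=0.05 means five percentage points.
    """
    floor = baseline - tolerance
    if pass_rate >= floor:
        print(f"PASS: {pass_rate:.1%} (baseline {baseline:.1%}, floor {floor:.1%})")
        return 0
    print(f"FAIL: {pass_rate:.1%} is below the floor of {floor:.1%}")
    return 1  # a non-zero exit code fails the CI job

# Hypothetical CI usage, after running the eval suite:
#   import sys
#   sys.exit(gate_exit_code(pass_rate=0.87, baseline=0.91))
```

Keeping the baseline in version control alongside the prompts means a deliberate change to the bar shows up in review like any other change.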
Eval set maintenance
An eval set goes stale. As your product evolves, the distribution of real inputs shifts. Build a pipeline that captures user inputs from production (with consent), routes low-confidence or user-flagged outputs to human review, and adds a batch of reviewed real examples to the eval set each month. Your eval set should be a living document, not a one-time effort.
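As a rough sketch of the selection step, the function below picks a monthly review batch from production logs. The record fields (`input`, `flagged`, `consented`) and the function name are hypothetical; adapt them to whatever your logging actually stores. It only chooses candidates. A human still has to write the expected output before anything joins the eval set.

```python
import random

def select_review_batch(production_logs, eval_set, batch_size=50, seed=0):
    """Pick consented, flagged production inputs that are not yet in the eval set."""
    known_inputs = {example["input"] for example in eval_set}
    candidates, seen = [], set()
    for record in production_logs:
        text = record["input"]
        if not record.get("consented") or not record.get("flagged"):
            continue  # skip records without consent or without a review flag
        if text in known_inputs or text in seen:
            continue  # skip duplicates of existing or already selected inputs
        seen.add(text)
        candidates.append(record)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(candidates)
    return candidates[:batch_size]
```

Shuffling before truncating avoids always taking the earliest records of the month, which would bias the batch toward whatever traffic arrived first.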