The Eval Crisis: Why Most AI Evals Are Wrong
Four specific ways AI evaluation fails — benchmark contamination, Goodhart's Law, eval-train leakage, and measuring the wrong task — and what a good eval suite actually looks like.
The problem with how we measure AI
Most AI evals are wrong. Not wrong like 'slightly off' — wrong like measuring the wrong thing entirely. Teams benchmark their models, get good numbers, ship to production, and then watch quality degrade in ways the benchmark never predicted. This is not a tooling problem. It is a thinking problem.
There are four specific ways evals fail. Each one is avoidable. None of them gets much attention in the papers that introduce new benchmarks.
Failure 1: Benchmark contamination
The model you are evaluating was trained on data scraped from the internet. So was the benchmark. If MMLU questions, their answers, or their structural patterns appeared anywhere in the training corpus, a high score may reflect recall of memorized content rather than general reasoning, and the score alone cannot tell you which.
This is not hypothetical. In 2023, multiple studies found that GPT-4 and other frontier models showed substantially higher scores on benchmarks than on freshly-created equivalent questions. The gap was not small. On some tasks it was 10–20 percentage points. The model had not learned to do the task — it had learned to recognize the task.
If your benchmark was published before your model's training cutoff, assume contamination until proven otherwise. Published benchmarks become training data. That is the default assumption, not the exception.
The fix is not to avoid benchmarks. It is to treat published benchmark scores as upper bounds on actual capability and to build private holdout evals that were never published anywhere. Your own eval suite, created after the training cutoff and kept private, is the only contamination-safe measurement you have.
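One way to put a number on the gap described above is to score the same model on the published benchmark and on a freshly written private set, then compare accuracies. The following is a minimal sketch, assuming per-question pass/fail results are already available as lists of booleans; the function names and the normal-approximation standard error are illustrative choices, not part of any particular eval framework.

```python
from math import sqrt


def accuracy(results: list[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def contamination_gap(published: list[bool], fresh: list[bool]) -> tuple[float, float]:
    """Return (gap, standard_error), both in percentage points.

    gap is published accuracy minus fresh accuracy. The standard error
    uses the usual two-proportion normal approximation, so treat it as
    a rough guide, especially for small eval sets.
    """
    p_pub, p_fresh = accuracy(published), accuracy(fresh)
    se = sqrt(
        p_pub * (1 - p_pub) / len(published)
        + p_fresh * (1 - p_fresh) / len(fresh)
    )
    return (p_pub - p_fresh) * 100, se * 100


# Hypothetical results: 90/100 on the published set, 70/100 on the fresh set.
gap, se = contamination_gap([True] * 90 + [False] * 10, [True] * 70 + [False] * 30)
print(f"gap = {gap:.1f} pp, standard error = {se:.1f} pp")
```

A gap several standard errors above zero is consistent with contamination, but it can also mean the fresh questions are simply harder, so the two sets should be written to the same difficulty spec.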
Failure 2: Goodhart's Law in eval design
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. In AI evals, this manifests as optimizing the model toward the eval signal until the eval no longer measures what you intended.
The most common version: a team uses an LLM-as-judge eval (GPT-4o scores responses 1–10). They tune prompts, fine-tune the model, and optimize against the judge score. After three weeks the judge score is 8.5. Actual user satisfaction is worse than before. Why? The model learned to produce responses that the judge scores highly — longer, more structured, with more hedging and acknowledgment — rather than responses that are actually useful to humans.
A related version: RLHF optimizes against a reward model. The reward model is a proxy for human preference. After enough optimization, the policy model learns to exploit weaknesses in the reward model rather than generate genuinely preferred outputs. This is reward hacking. It is not a failure of RLHF — it is a failure of treating the proxy as the real objective.
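Countermeasure: run a held-out eval with a different judge model than the one used during optimization. If the two judges' scores diverge significantly, you have a Goodhart problem. Also, periodically have humans rate a random sample of responses. The judge score and the human score should move together over time; if the judge score keeps rising while the human score stays flat or falls, the judge has stopped measuring what you care about.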
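The divergence check can be sketched as a rank correlation between the scores from the judge used during optimization and the scores from a held-out judge (or from human raters) on the same responses. This is a minimal standard-library sketch; the function names and the 0.6 threshold are assumptions chosen for illustration, not established cutoffs.

```python
from math import sqrt


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def judges_diverge(opt_scores: list[float], heldout_scores: list[float],
                   threshold: float = 0.6) -> bool:
    """Flag divergence when rank agreement falls below the (assumed) threshold."""
    return spearman(opt_scores, heldout_scores) < threshold


# Hypothetical scores for the same five responses from two judges.
print(judges_diverge([9, 8, 9, 7, 8], [3, 7, 4, 8, 6]))
```

Rank correlation is used here rather than raw score difference because two judges can use the 1–10 scale differently while still agreeing on which responses are better.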
Failure 3: Eval-train leakage
This is different from contamination. Contamination is about benchmark data appearing in pretraining. Eval-train leakage is about your own eval set bleeding into your fine-tuning or RLHF pipeline.
It happens like this: your team builds a 500-question eval. You run fine-tuning. You improve the eval score. You run more fine-tuning. You improve it more. Each round, you keep the changes that raised the score and discard the ones that did not, so after six iterations you have implicitly trained against the eval distribution. You did not do this deliberately. The model saw eval-like examples in fine-tuning data, or the questions were drafted by the same people who drafted training examples. Either way, the eval is no longer independent.
The structural fix is the same as in classical ML: build your eval set before training begins, keep it strictly separate, never use eval examples as training examples, and rotate in fresh questions periodically. This discipline is routine in academic ML and routinely absent in production LLM teams.
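The "never use eval examples as training examples" rule can be enforced mechanically before each fine-tuning run. The sketch below fingerprints each text after light normalization and reports eval questions that also appear in the training set. The function names are hypothetical, and this only catches exact matches after normalization; paraphrased leakage needs fuzzier matching or a provenance check.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and dropping punctuation and spacing."""
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked(eval_questions: list[str], training_examples: list[str]) -> list[str]:
    """Return eval questions whose fingerprint also occurs in the training data."""
    train_prints = {fingerprint(t) for t in training_examples}
    return [q for q in eval_questions if fingerprint(q) in train_prints]


# Hypothetical data: the first eval question reappears in training
# with different casing, spacing, and punctuation.
leaked = find_leaked(
    ["What is 2+2?", "Name the capital of France."],
    ["what is 2 + 2", "Translate hello."],
)
print(leaked)
```

Running a check like this as a gate in the training pipeline, and failing the run when the list is non-empty, keeps the separation from depending on anyone remembering it.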
Red flag: if your eval score improves faster than quality in deployment, suspect leakage. A well-built eval is harder than average real use, so its score should trail deployment quality slightly. If the eval score runs ahead instead, something has probably leaked.
Failure 4: The wrong task altogether
The most damaging eval failure does not involve data corruption or statistical artifacts. It involves measuring a task that is not the actual task. You are running a summarization benchmark. Your users use the model for customer support. These are not the same thing.
This happens at every scale. Teams use MMLU to evaluate reasoning but their product requires multi-step planning. They use ROUGE to evaluate generation quality but their users care about factual accuracy. They use pass@k to evaluate coding ability but their production code runs in a constrained environment with specific library versions.
The proxies are not wrong in the abstract — they are wrong for the specific product. And because they are published, validated, and easy to run, they get used anyway.
Before building any eval, write the test in this form: 'A user asks [X]. The model does [Y]. This counts as a success if [Z].' If you cannot fill in Z with a concrete, verifiable criterion tied to user value, you do not have an eval — you have a number.
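The X / Y / Z form maps directly onto a small data structure in which Z is an executable check rather than a description. This is a minimal sketch; `EvalCase`, `run_case`, and the refund example are hypothetical names and data invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    user_input: str                    # X: what the user asks
    expected_behavior: str             # Y: what the model should do, in plain words
    is_success: Callable[[str], bool]  # Z: concrete, verifiable check on the output


def run_case(case: EvalCase, model: Callable[[str], str]) -> bool:
    """Call the model on the case input and apply the success criterion."""
    return case.is_success(model(case.user_input))


# Hypothetical customer-support case: the answer must state the 30-day window.
refund_case = EvalCase(
    user_input="How long do I have to return an order?",
    expected_behavior="States the return window from the policy.",
    is_success=lambda output: "30 days" in output.lower(),
)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "You can return items within 30 days of delivery."


print(run_case(refund_case, stub_model))
```

If `is_success` cannot be written for a case, even as a rubric handed to a human rater, that is the signal that Z has not been pinned down yet.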
What a good eval suite actually looks like
- Task-specific: built around the actual use cases your model serves, not general capability proxies
- Contamination-safe: created after training cutoff, kept private, never published
- Leakage-proof: completely separated from training data by provenance, not just by file path
- Multi-judge: uses both automated scoring and human evaluation, correlated to catch judge drift
- Layered: capability evals (can it do the task at all) + quality evals (how well) + regression evals (did the last change break anything)
- Versioned: every eval run logged with the model checkpoint, prompt version, and eval set version — so you can reproduce any historical score
None of this is complicated. All of it is skipped. The teams with good evals built them early and treated them as infrastructure — not as a one-time measurement before a launch.
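As an illustration of the "Versioned" item, each eval run can be logged as one record that ties the score to the model checkpoint, prompt version, and a content hash of the eval set. This is a sketch under assumed names (`EvalRun`, `hash_eval_set`, the checkpoint and prompt labels); any experiment tracker that stores the same fields would serve the same purpose.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def hash_eval_set(questions: list[str]) -> str:
    """Short content hash, so any edit to the eval set yields a new version id."""
    payload = json.dumps(questions, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class EvalRun:
    model_checkpoint: str
    prompt_version: str
    eval_set_version: str
    score: float


def to_log_line(run: EvalRun) -> str:
    """Serialize a run as one JSON line, suitable for an append-only log file."""
    return json.dumps(asdict(run), sort_keys=True)


# Hypothetical run record.
questions = ["How long do I have to return an order?", "Where is my package?"]
run = EvalRun(
    model_checkpoint="ckpt-0042",
    prompt_version="v3",
    eval_set_version=hash_eval_set(questions),
    score=0.81,
)
print(to_log_line(run))
```

Hashing the eval set contents, rather than trusting a hand-maintained version label, means a silently edited question cannot be mistaken for the same eval.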
The uncomfortable truth
Frontier labs have dozens of evaluation researchers. They publish benchmark papers, run contamination analyses, and maintain private holdout sets. Despite all of this, benchmark scores for frontier models are still partially contaminated, still partly Goodhart'd, and still only loosely correlated with what users actually want.
If this is true at the frontier, it is certainly true for teams running fine-tuned models on production tasks with a handful of engineers and a 200-question eval set. The bar is not perfection. The bar is: do you know which failure modes your eval has, and are you correcting for them?
Most teams do not know. That is the crisis.