Goodhart's Law in ML: Benchmark Saturation, Contamination, and When SOTA Numbers Lie

How GLUE was saturated in about 2 years. Why MMLU scores for LLMs are upper bounds inflated by contamination. The difference between metric optimization and metric gaming. What strong evaluation actually looks like when you can't trust leaderboards.

Goodhart's Law in ML: When the Benchmark Becomes the Target

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. ML benchmarks are a canonical example. MMLU, HumanEval, SQuAD, GLUE: each major benchmark has been saturated, gamed, or contaminated to the point where leaderboard position and real-world utility are largely decoupled.

How Benchmark Saturation Happens

Phase 1: Benchmark introduced. Captures a real capability gap. Best models score 50-60%, humans score 90%+. Progress is meaningful.
Phase 2: Community optimizes against it. Architecture search, hyperparameter tuning, data curation all target benchmark performance.
Phase 3: Near-human performance. Models hit 85-90%. Papers claim 'human-level.'
Phase 4: Contamination discovered. Models were trained on internet text that includes the benchmark. Scores don't transfer.
Phase 5: New benchmark introduced. Repeat.

GLUE was saturated in ~2 years (2018-2020). SuperGLUE lasted ~1 year. MMLU questions have leaked into web-scale training data. HumanEval pass@1 numbers from LLMs are unreliable for the same reason. The field keeps building new benchmarks and keeps running into the same problem.

The Contamination Problem for LLMs

Language models trained on internet-scale data have likely seen most public benchmarks. When you evaluate GPT-4 on MMLU, you're not measuring reasoning ability alone. You're measuring a mix of reasoning ability and memorization of benchmark questions, and the score by itself cannot cleanly separate the two.

The honest position: an LLM benchmark score is an upper bound on contamination-free capability. The gap between the reported score and the clean score is what memorization contributes, and we usually don't know how large that gap is. Any paper claiming SOTA on MMLU without a contamination analysis should be read skeptically.
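One common first-pass contamination analysis is an n-gram overlap check between benchmark items and the training corpus. The sketch below is a minimal illustration, not a production detector: the function names, the whitespace tokenizer, and the window size `n=8` are all assumptions chosen for readability. An overlap check only catches verbatim leakage. It misses paraphrases and translations, so a low rate is weak evidence of cleanliness.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: Iterable[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    `n` is an illustrative default (an assumption), not a recommended value.
    """
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    items = list(benchmark_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / len(items)


# Hypothetical data: one item appears verbatim in a scraped forum post.
items = [
    "What is the capital of France ? A) Paris B) Rome C) Berlin D) Madrid",
    "Which gas do plants absorb during photosynthesis from the air around them",
]
corpus = [
    "forum post: what is the capital of france ? a) paris b) rome c) berlin d) madrid answer is A",
]
rate = contamination_rate(items, corpus, n=8)  # 0.5: one of two items flagged
```

At real corpus scale the in-memory set would be replaced by something like a Bloom filter or a suffix array, which is outside the scope of this sketch.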

Metric Gaming vs. Metric Optimization

There's a real distinction. Metric optimization: you improve the metric because you improved the underlying capability it measures. Metric gaming: you improve the metric without improving the capability — by exploiting the measurement methodology.

BLEU gaming: optimize BLEU directly as the training objective (for example with reinforcement learning or minimum risk training, since BLEU itself is not differentiable). BLEU improves; translation quality may not. BLEU rewards n-gram overlap, not semantic accuracy.
ROUGE gaming: generate very long summaries. ROUGE recall goes up; summary quality degrades.
Code benchmark gaming: fine-tune on problems similar to HumanEval. Pass@1 improves; general coding ability may be unchanged.
LLM-as-judge gaming: generate verbose, well-structured outputs. LLM judges tend to favor these styles regardless of content quality.
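The ROUGE case is easy to reproduce by hand. The sketch below computes clipped unigram recall and precision, in the spirit of ROUGE-1 but not the official implementation (no stemming, a naive whitespace tokenizer, and a hypothetical function name). It shows that padding a summary with extra words can raise recall while precision collapses.

```python
from collections import Counter
from typing import Tuple


def unigram_overlap(candidate: str, reference: str) -> Tuple[float, float]:
    """Return (recall, precision) of clipped unigram overlap, ROUGE-1 style."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision


reference = "the model overfits the benchmark"
concise = "the model overfits"
padded = ("the model the benchmark the data the results overfits "
          "and also many other things")

concise_recall, concise_precision = unigram_overlap(concise, reference)
padded_recall, padded_precision = unigram_overlap(padded, reference)
# concise: recall 3/5, precision 3/3
# padded:  recall 5/5, precision 5/14
```

Reporting only recall would rank the padded summary higher. Reporting F1, or capping summary length, removes this particular exploit, though not gaming in general.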

What Strong Evaluation Actually Looks Like

The frontier labs (Anthropic, OpenAI, Google DeepMind) appear to rely less and less on standard public benchmarks for internal evaluation. The typical alternative is a private eval suite with: (1) held-out problems never published, (2) human evaluation on real use cases, (3) red-teaming by domain experts, (4) longitudinal tracking on the same fixed test sets over model generations.

For production systems, the test that matters for any metric is: does it move when the system gets better, and does it predict user outcomes? If NDCG@10 improves by 3% and user satisfaction doesn't change, the metric is probably measuring the wrong thing.
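One way to check whether an offline metric predicts user outcomes is to look back over past launches and correlate the offline metric change with the observed change in the user-facing outcome. The sketch below implements Spearman rank correlation in plain Python. The per-launch numbers are made up for illustration, and the variable names are assumptions.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no ranking signal
    return cov / (var_x * var_y) ** 0.5


# Hypothetical history: one entry per past launch.
ndcg_delta = [0.030, 0.012, -0.004, 0.021, 0.008]
satisfaction_delta = [0.001, 0.015, 0.002, -0.003, 0.010]
rho = spearman(ndcg_delta, satisfaction_delta)
```

A correlation near zero or negative across launches suggests the offline metric is not tracking what users care about. With only a handful of launches the estimate is very noisy, so treat it as a prompt for investigation, not a verdict.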

Distinguishing Real Progress from Benchmark Progress

Cross-benchmark consistency: does the improvement show up on multiple independent benchmarks measuring the same capability? If a method improves MMLU but not BIG-Bench Hard or vice versa, the result is probably noisy.
Out-of-distribution test: does the improvement hold on a held-out private test set the authors didn't touch?
Human eval agreement: do human raters agree the method is better? If not, the metric is probably wrong.
Transfer to downstream tasks: does improving on the benchmark improve the downstream application? If not, the benchmark is measuring something irrelevant.
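The human eval agreement check can be made concrete with a pairwise comparison: for each example, does the automatic metric prefer the same system output that a human rater preferred? The sketch below is one simple way to compute that, assuming a single human judgment per example. The function name and the data are hypothetical.

```python
from typing import Sequence


def metric_human_agreement(metric_a: Sequence[float],
                           metric_b: Sequence[float],
                           human_prefers_a: Sequence[bool]) -> float:
    """Fraction of examples where the metric and the human pick the same winner.

    Examples where the metric scores both systems equally are skipped.
    """
    if not (len(metric_a) == len(metric_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    agree = 0
    compared = 0
    for a, b, human_a in zip(metric_a, metric_b, human_prefers_a):
        if a == b:
            continue  # metric expresses no preference
        compared += 1
        if (a > b) == human_a:
            agree += 1
    if compared == 0:
        raise ValueError("metric never expressed a preference")
    return agree / compared


# Hypothetical per-example metric scores for systems A and B, plus human votes.
scores_a = [0.9, 0.4, 0.7, 0.5]
scores_b = [0.6, 0.8, 0.7, 0.2]
human_prefers_a = [True, False, True, False]
agreement = metric_human_agreement(scores_a, scores_b, human_prefers_a)  # 2 of 3
```

Agreement close to 0.5 means the metric is no better than a coin flip at predicting human preference on this data. With multiple raters per example, a majority vote or an inter-rater agreement statistic would be the natural extension.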

The Research Taste Question This Leads To

In an interview for a high-paying ML role, you might be asked: 'We improved our model's MMLU score from 78% to 81%. How confident are you that it's actually better?' The answer they want walks through: contamination risk, whether the improvement is consistent across subsets, whether human eval agrees, and what the right next experiment is (probably: test on a private eval set with problems written after the training cutoff).
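Before any of those checks, it helps to ask whether a 3-point gain is distinguishable from sampling noise. A paired bootstrap over per-question correctness is one standard way to estimate that. The sketch below assumes you have 0/1 correctness for both models on the same questions. The function name, the default of 2000 resamples, and the 100-question example are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(correct_old: Sequence[int],
                        correct_new: Sequence[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed_diff, ci_low, ci_high) for accuracy(new) - accuracy(old).

    Inputs are per-question 0/1 correctness on the SAME questions, same order.
    """
    if len(correct_old) != len(correct_new) or len(correct_old) == 0:
        raise ValueError("need two non-empty, equal-length correctness lists")
    n = len(correct_old)
    # Per-question difference: +1 fixed, -1 regressed, 0 unchanged.
    deltas = [new - old for old, new in zip(correct_old, correct_new)]
    observed = sum(deltas) / n
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)  # resample questions with replacement
        diffs.append(sum(sample) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return observed, low, high


# Hypothetical 100-question eval: 78 correct before, 81 after, no regressions.
old = [1] * 78 + [0] * 22
new = [1] * 81 + [0] * 19
observed, ci_low, ci_high = paired_bootstrap_ci(old, new)
```

If the interval includes zero, the improvement is not clearly separable from noise at that sample size. Running the same function per subject area is a direct way to check whether the gain is consistent across subsets or concentrated in one of them.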
