GenAI Systems Lab
Evaluation 10 min read

Goodhart's Law in ML: Benchmark Saturation, Contamination, and When SOTA Numbers Lie

How GLUE was saturated in about two years. Why MMLU scores for LLMs are best read as upper bounds inflated by contamination. The difference between metric optimization and metric gaming. What strong evaluation actually looks like when you can't trust leaderboards.

Goodhart's Law in ML: When the Benchmark Becomes the Target

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. ML benchmarks are a canonical example. MMLU, HumanEval, SQuAD, and GLUE have each been saturated, gamed, or contaminated to the point where leaderboard position and real-world utility are largely decoupled.

How Benchmark Saturation Happens

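GLUE was saturated within about two years of its 2018 release: top systems passed its human baseline in 2019. SuperGLUE, the harder successor released in 2019, saw its human baseline passed by early 2021. MMLU questions are publicly available online, so contamination is hard to rule out for any web-trained model. HumanEval pass@1 numbers from LLMs are hard to trust for the same reason. The field keeps building new benchmarks and keeps running into the same problem.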

The Contamination Problem for LLMs

Language models trained on internet-scale data have likely seen most public benchmarks. When you evaluate GPT-4 on MMLU, you're not measuring reasoning ability alone. You're measuring a combination of reasoning ability and memorization of benchmark questions, and the two cannot be cleanly separated from the score.

The honest position: an LLM's benchmark score is an upper bound on its contamination-free capability. Some unknown share of the score comes from memorization, and the score alone doesn't tell us the split. Any paper claiming SOTA on MMLU without a contamination analysis should be read skeptically.
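One common form of contamination analysis is a verbatim n-gram overlap check between benchmark items and training text. The sketch below is a minimal version under stated assumptions: you have access to (a sample of) the training corpus, the function names `ngrams` and `contamination_rate` are hypothetical, and the window size `n=8` is an arbitrary choice for illustration. It catches only near-verbatim copies, so paraphrased or translated leaks pass undetected.

```python
import re

def ngrams(text, n):
    """Set of word n-grams in text, after lowercasing and dropping punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n tokens yield an empty set and are never flagged.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    Returns (rate, flagged_indices). The window size n is an assumption;
    smaller windows flag more items but produce more false positives.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [i for i, item in enumerate(benchmark_items)
               if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged
```

A useful follow-up is to report the score separately on flagged and unflagged items. A large gap between the two suggests memorization is contributing; no gap does not prove the benchmark is clean.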

Metric Gaming vs. Metric Optimization

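There's a real distinction. Metric optimization: you improve the metric because you improved the underlying capability it measures. Metric gaming: you improve the metric without improving the capability, by exploiting the measurement methodology. Repeatedly tuning prompts or hyperparameters against the test set is gaming; the score rises, but the gain doesn't transfer to unseen problems.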

What Strong Evaluation Actually Looks Like

The frontier labs (Anthropic, OpenAI, Google DeepMind) appear to rely heavily on private eval suites rather than standard public benchmarks for internal evaluation. The ingredients of such a suite: (1) held-out problems never published, (2) human evaluation on real use cases, (3) red-teaming by domain experts, (4) longitudinal tracking on the same fixed test sets over model generations.

For production systems, the test of a metric is twofold: does it move when the system gets better, and does it predict user outcomes? If NDCG@10 improves by 3% and user satisfaction doesn't change, the metric is measuring the wrong thing.
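One way to check whether an offline metric predicts user outcomes is to look back over past launches and ask whether the offline delta and the online delta rank those launches the same way. The sketch below computes a Spearman rank correlation in plain Python. The helper names `ranks` and `spearman` are hypothetical, and the numbers in the usage lines are made up purely to show the call shape.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for a constant series; treated as "no signal" here
    return cov / (vx * vy) ** 0.5

# Illustrative, made-up deltas for five past launches.
ndcg_delta = [0.030, 0.012, -0.005, 0.021, 0.008]
satisfaction_delta = [0.001, 0.004, -0.002, 0.000, 0.003]
rho = spearman(ndcg_delta, satisfaction_delta)
```

A correlation near zero across many launches is evidence that the offline metric has drifted away from what users experience. With only a handful of launches the estimate is noisy, so treat it as a prompt for investigation rather than a verdict.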

Distinguishing Real Progress from Benchmark Progress

The Research Taste Question This Leads To

In a high-TC (total compensation) interview, you might be asked: "We improved our model's MMLU score from 78% to 81%. How confident are you that it's actually better?" The answer they want walks through: contamination risk, whether the improvement is consistent across subsets, whether human eval agrees, and what the right next experiment is (probably: test on a private eval set with problems written after the training cutoff).
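The "consistent across subsets" part of that answer can be made concrete. The sketch below assumes you have per-question correctness for both models on the same items, plus a subset label per question; the function names `paired_bootstrap_ci` and `per_subset_delta` are hypothetical. It estimates a percentile bootstrap interval for the accuracy gain and breaks the gain down by subset.

```python
import random
from collections import defaultdict

def paired_bootstrap_ci(old_correct, new_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (new - old).

    Resampling per-item deltas keeps each question's old/new results paired.
    Returns (point_estimate, lower, upper).
    """
    if len(old_correct) != len(new_correct):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    deltas = [int(b) - int(a) for a, b in zip(old_correct, new_correct)]
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lower, upper

def per_subset_delta(subsets, old_correct, new_correct):
    """Accuracy change (new - old) within each subset label."""
    totals = defaultdict(lambda: [0, 0])  # subset -> [sum of deltas, item count]
    for s, a, b in zip(subsets, old_correct, new_correct):
        totals[s][0] += int(b) - int(a)
        totals[s][1] += 1
    return {s: d / c for s, (d, c) in totals.items()}
```

An interval that excludes zero says the gain is unlikely to be resampling noise; it says nothing about contamination. If the per-subset breakdown shows the whole gain coming from one or two subjects, that is a reason to inspect those subjects before claiming a general improvement.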
