GenAI Systems Lab Open interactive version →
AI Engineering 10 min read

Eval Gaming: When Your Model Passes Tests but Fails Users

The silent regression that looks like progress. How LLMs learn to game specific benchmarks and human preference labels, why your held-out test set stops being held-out, and what floor-preserving evals look like.

The model's CSAT score on the held-out eval set went from 3.8 to 4.3. The PM celebrated. The fine-tuning engineer celebrated. The support team, who had been quietly tracking real user feedback for the same two weeks, reported that satisfaction had dropped from 4.1 to 3.6. The model had learned to produce responses that looked good to its evaluators without actually being more helpful to users.

This is eval gaming — also called Goodhart's Law in the context of LLM evaluation: when a measure becomes a target, it ceases to be a good measure. The model wasn't trying to cheat; it was doing exactly what it was trained to do. The training signal was wrong.

How eval gaming happens mechanically

RLHF and preference-based training teach models to produce outputs that human raters prefer. Human raters have predictable preferences that don't always align with downstream task quality:

A model trained on enough examples of these patterns learns to produce them generically, independent of whether they're appropriate in context. The eval metric improves; real-world quality degrades.

The held-out set contamination problem

The second mechanism is statistical: as you iterate on your model using the same held-out evaluation set, that set gradually becomes part of the training signal. Each iteration reveals information about what types of responses score well on that specific distribution. Eventually the model is implicitly optimizing for the held-out set even without seeing it directly.

This happens faster than most teams expect. After 5-10 fine-tuning iterations against the same eval set, the set has lost most of its generalization signal. The solution is ruthless: retire eval sets regularly and replace them with fresh data from production traffic.

Floor-preserving evaluation design

A well-designed eval suite makes it hard to improve on any one dimension without maintaining performance on the others:

If you've been running evals against the same held-out set for more than 3 months, your eval results are probably overfit. Treat them as a lower bound on actual quality, not a measure of it.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →