How to Know If Your Fine-Tune Actually Helped

Most fine-tuning evals catch the win and miss the cost. Task-specific accuracy, catastrophic forgetting, distribution tail coverage, and calibration: the four measurements that make fine-tuning decisions defensible instead of optimistic.

The metric that shows the win and hides the cost

Fine-tuning evaluations almost universally show improvement on the target task. That is what fine-tuning is designed to do. What they almost universally fail to show is the cost: degradation on out-of-scope tasks, narrowing of the model's generalisation, and quiet regression on the 20% of real queries that fell outside the training distribution.

A fine-tuned model that scores 15 points higher on your benchmark and 8 points lower on general reasoning is not a better model for your users — it is a more specialised model that may or may not serve them better depending on what they actually ask. Most fine-tuning evaluations only measure the first number.

The full evaluation framework

A defensible fine-tuning evaluation requires four distinct measurements, run before and after fine-tuning, on different data splits:

Task-specific accuracy: does the model do the target task better? Measured on a held-out test split of the training distribution. This is the only metric most teams run.
Catastrophic forgetting: does the model do general tasks worse? Run a general capability eval (MMLU subset, or your own set of out-of-domain queries) on both the base and fine-tuned model. If the fine-tuned model drops more than 5–10% on general tasks, the fine-tune is too aggressive.
Distribution tail coverage: does the model handle the 20% of queries that are not in your training distribution? Collect edge cases — ambiguous queries, multi-intent queries, queries with unusual formatting — and test both models on them. Fine-tuned models frequently regress specifically on edge cases.
Calibration: does the model's confidence still correlate with accuracy? After fine-tuning, logprob calibration often degrades: the model becomes overconfident on in-distribution inputs and underconfident on everything else. Run calibration checks on both in-distribution and out-of-distribution inputs, for both the base and the fine-tuned model.
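The calibration measurement can be made concrete with expected calibration error (ECE). The sketch below is one common formulation (equal-width confidence bins), not a prescribed method. It assumes you already have, for each eval example, a confidence in [0, 1] (for instance derived from answer logprobs) and a correct/incorrect label; the function name and the bin count are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: floats in [0, 1], one per example (how they are derived
                 from logprobs is an assumption left to the caller).
    correct:     booleans, True where the model's answer was right.
    Returns the example-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Compute it four times: base and fine-tuned model, each on the in-distribution and out-of-distribution sets. The quantity of interest is the change in ECE from base to fine-tuned on each set, not the absolute value.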

Catastrophic forgetting: detecting it before users do

Catastrophic forgetting is the most common hidden cost of fine-tuning and the least often measured. It occurs when updating the model's weights for the target task overwrites knowledge from pretraining. The effect is usually gradual and task-specific — not a sudden collapse in capability but a quiet degradation in the domains that are least represented in fine-tuning data.

Full fine-tuning is most susceptible; LoRA and QLoRA fine-tuning with appropriate rank settings substantially reduce forgetting by freezing the base weights and training only small low-rank adapter matrices, which limits how far the model can move from its pretrained behaviour. But even LoRA can cause forgetting if the learning rate is too high or the training runs too long.

A reliable catastrophic forgetting check: before fine-tuning, generate responses to 200 diverse general-capability prompts from the base model. After fine-tuning, run the same prompts through the fine-tuned model. Score both sets with an LLM judge on a simple rubric (helpful, accurate, coherent). If the fine-tuned model drops more than 8% on this set, reduce learning rate and re-train.
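As a rough sketch of the comparison step, assume the LLM judge has already produced one numeric score per prompt for each model, in the same prompt order. The function name, the score scale, and the reading of "8%" as a relative drop in mean judge score are all assumptions made for illustration.

```python
from statistics import mean


def forgetting_report(base_scores, tuned_scores, max_relative_drop=0.08):
    """Compare per-prompt judge scores for the base and fine-tuned model.

    Both lists must be aligned: index i is the same prompt in each.
    The drop is measured relative to the base model's mean score
    (an assumed interpretation of the threshold).
    """
    if len(base_scores) != len(tuned_scores) or not base_scores:
        raise ValueError("need two non-empty, aligned score lists")

    base_mean = mean(base_scores)
    tuned_mean = mean(tuned_scores)
    if base_mean <= 0:
        raise ValueError("base mean score must be positive")

    relative_drop = (base_mean - tuned_mean) / base_mean
    # Prompt indices where the fine-tuned model scored lower: read these by hand.
    regressed = [i for i, (b, t) in enumerate(zip(base_scores, tuned_scores)) if t < b]

    return {
        "base_mean": base_mean,
        "tuned_mean": tuned_mean,
        "relative_drop": relative_drop,
        "exceeds_threshold": relative_drop > max_relative_drop,
        "regressed_prompts": regressed,
    }
```

The list of regressed prompt indices is often more useful than the aggregate: it shows which domains are degrading, which the mean alone hides.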

The fine-tuning eval checklist

Split correctly: training/validation/test splits must be made before fine-tuning. Any examples seen during training are contaminated. Use a strict 70/15/15 split with no leakage.
Baseline is the base model, not the previous fine-tune: always compare your new fine-tune against the original base model as the reference point, not the previous iteration. Otherwise you hide the cumulative forgetting across iterations.
Run general eval every iteration: if you are doing iterative fine-tuning (multiple rounds), run the forgetting check after each round. Forgetting accumulates: a 3% drop per round adds up to roughly a 15% drop after five rounds, even though no single round looks alarming.
Test on queries you did not write: the most reliable test of a fine-tune is real production queries, not questions written by the same team that wrote the training data. Run shadow eval before deploying.
Document the cost explicitly: in your evaluation report, list both the gain on target task and the loss on general capability. A fine-tune that gains 12 points on task and loses 6 points on general capability is not automatically the right choice — that decision depends on your use case.
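For the "split correctly" item, one way to avoid leakage is to deduplicate before splitting, so that near-identical examples cannot land on both sides of the train/test boundary. The sketch below is illustrative only: the normalisation rule (lowercase, collapsed whitespace) and the function names are assumptions, and real datasets may need stronger near-duplicate detection.

```python
import random


def normalize(text):
    """Crude canonical form used to detect duplicates (an assumed rule)."""
    return " ".join(text.lower().split())


def split_examples(examples, seed=0, train_frac=0.70, val_frac=0.15):
    """Deduplicate, shuffle deterministically, then cut 70/15/15.

    Run this once, before any fine-tuning, and store the result so that
    later iterations reuse exactly the same test set.
    """
    unique = {}
    for example in examples:
        # Keep the first occurrence of each normalized text.
        unique.setdefault(normalize(example), example)

    items = list(unique.values())
    random.Random(seed).shuffle(items)

    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    return {
        "train": items[:n_train],
        "validation": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }
```

Fixing the seed and persisting the split matters as much as the proportions: if the test set is regenerated each round, examples that were test data in one iteration can become training data in the next.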

When fine-tuning evaluation should block deployment

A fine-tune should not be deployed if: forgetting exceeds 10% on general capability, edge case regression exceeds 15%, or calibration degrades substantially (expected calibration error, ECE, increases by more than 0.05). These thresholds are guidelines, not laws; adjust them for your risk tolerance. The key discipline is having thresholds at all. Most teams have none.
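These thresholds are easier to enforce when they are written down as code rather than remembered. A minimal sketch of such a gate follows; the class, field, and function names are hypothetical, the drops are assumed to be expressed as fractions of the base model's score, and the default values simply restate the guideline numbers above.

```python
from dataclasses import dataclass


@dataclass
class EvalDeltas:
    """Change from base model to fine-tuned model (hypothetical structure)."""
    general_drop: float    # fractional drop on general capability, e.g. 0.06
    edge_case_drop: float  # fractional drop on the edge-case set
    ece_increase: float    # absolute increase in expected calibration error


def deployment_blockers(deltas, max_general_drop=0.10,
                        max_edge_case_drop=0.15, max_ece_increase=0.05):
    """Return a list of human-readable reasons to block deployment.

    An empty list means no threshold was exceeded. The defaults are the
    guideline values and should be tuned to your risk tolerance.
    """
    blockers = []
    if deltas.general_drop > max_general_drop:
        blockers.append(
            f"general capability dropped {deltas.general_drop:.1%} "
            f"(limit {max_general_drop:.0%})"
        )
    if deltas.edge_case_drop > max_edge_case_drop:
        blockers.append(
            f"edge case performance dropped {deltas.edge_case_drop:.1%} "
            f"(limit {max_edge_case_drop:.0%})"
        )
    if deltas.ece_increase > max_ece_increase:
        blockers.append(
            f"ECE increased by {deltas.ece_increase:.3f} "
            f"(limit {max_ece_increase:.2f})"
        )
    return blockers
```

Returning reasons rather than a single boolean keeps the evaluation report honest: the gain on the target task and each cost appear side by side, and overriding a blocker becomes an explicit decision.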

A model that passes task-specific eval but fails one of the above checks is a model that will surprise you in production. The production failure will come from the unchecked dimension — the edge cases, the forgotten tasks, the overconfident wrong answers. Run the full check before every deployment.
