GenAI Systems Lab
Evaluation 10 min read

How to Know If Your Fine-Tune Actually Helped

Most fine-tuning evals catch the win and miss the cost. Target-task gain, catastrophic forgetting, distribution tail coverage, and calibration degradation: the four measurements that make fine-tuning decisions defensible instead of optimistic.

The metric that shows the win and hides the cost

Fine-tuning evaluations almost universally show improvement on the target task. That is what fine-tuning is designed to do. What they almost universally fail to show is the cost: degradation on out-of-scope tasks, narrowing of the model's generalisation, and quiet regression on the 20% of real queries that fall outside the training distribution.

A fine-tuned model that scores 15 points higher on your benchmark and 8 points lower on general reasoning is not a better model for your users — it is a more specialised model that may or may not serve them better depending on what they actually ask. Most fine-tuning evaluations only measure the first number.
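To make the trade-off concrete, here is a minimal sketch of the arithmetic, assuming score changes can be averaged linearly over traffic (a simplification) and using the 15-point gain, 8-point loss, and 80/20 traffic split purely as illustrative inputs. The function names are hypothetical.

```python
def net_score_change(in_scope_share, in_scope_delta, out_of_scope_delta):
    """Traffic-weighted average score change across all queries.

    Assumes (for illustration only) that per-segment score deltas
    can be combined linearly by traffic share.
    """
    out_of_scope_share = 1.0 - in_scope_share
    return (in_scope_share * in_scope_delta
            + out_of_scope_share * out_of_scope_delta)


def break_even_share(in_scope_delta, out_of_scope_delta):
    """In-scope traffic share at which the gain and the loss cancel out."""
    return -out_of_scope_delta / (in_scope_delta - out_of_scope_delta)


# Illustrative inputs: +15 on the target task, -8 elsewhere, 80% in-scope traffic.
print(round(net_score_change(0.8, 15.0, -8.0), 2))
# Below this in-scope share, the fine-tune is a net loss under these assumptions.
print(round(break_even_share(15.0, -8.0), 3))
```

The point of the sketch is that the answer depends on the traffic mix, which only a measurement of both numbers can reveal.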

The full evaluation framework

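A defensible fine-tuning evaluation requires four distinct measurements, run before and after fine-tuning, on different data splits: target-task performance, general-capability retention (catastrophic forgetting), coverage of edge cases in the distribution tail, and calibration.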

Catastrophic forgetting: detecting it before users do

Catastrophic forgetting is the most common hidden cost of fine-tuning and the least often measured. It occurs when updating the model's weights for the target task overwrites knowledge from pretraining. The effect is usually gradual and task-specific — not a sudden collapse in capability but a quiet degradation in the domains that are least represented in fine-tuning data.

Full fine-tuning is most susceptible. LoRA and QLoRA with appropriate rank settings substantially reduce forgetting by freezing the base weights and training only small low-rank adapter matrices. But even LoRA can cause forgetting if the learning rate is too high or training runs too long.
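A rough sense of how much LoRA constrains the update comes from counting trainable parameters. The sketch below assumes a single square projection matrix of size 4096 x 4096 (a hypothetical layer shape) and the standard LoRA factorisation into two matrices of rank r. It shows only the parameter count, not how much forgetting actually occurs.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter on the same matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); the base
    weight matrix itself stays frozen.
    """
    return rank * (d_out + d_in)


d = 4096  # hypothetical hidden size, chosen only for illustration
for rank in (4, 8, 64):
    fraction = lora_params(d, d, rank) / full_params(d, d)
    print(f"rank={rank}: {fraction:.2%} of the full matrix is trainable")
```

Higher ranks give the adapter more capacity to fit the target task and, by the same token, more room to drift from the base model's behaviour.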

A reliable catastrophic forgetting check: before fine-tuning, generate responses to 200 diverse general-capability prompts from the base model. After fine-tuning, run the same prompts through the fine-tuned model. Score both sets with an LLM judge on a simple rubric (helpful, accurate, coherent). If the fine-tuned model's score drops more than 8% on this set, reduce the learning rate and retrain.
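The comparison step of this check can be sketched as follows. The sketch assumes the judge scores have already been collected as one number per prompt, and it reads "drops more than 8%" as a relative drop in the mean judge score, which is one possible interpretation. The function names and the sample scores are hypothetical.

```python
from statistics import mean


def relative_drop(base_scores, tuned_scores):
    """Relative drop in mean judge score from base to fine-tuned model."""
    if len(base_scores) != len(tuned_scores):
        raise ValueError("both models must be scored on the same prompts")
    base_mean = mean(base_scores)
    if base_mean <= 0:
        raise ValueError("base mean score must be positive")
    return (base_mean - mean(tuned_scores)) / base_mean


def worst_regressions(prompts, base_scores, tuned_scores, k=5):
    """The k prompts whose score fell the most, as (delta, prompt) pairs."""
    deltas = [(t - b, p) for p, b, t in zip(prompts, base_scores, tuned_scores)]
    return sorted(deltas)[:k]


# Made-up scores on a 1-5 rubric for four prompts (a real run would use ~200).
prompts = ["a", "b", "c", "d"]
base = [4, 5, 4, 3]
tuned = [4, 4, 3, 3]

drop = relative_drop(base, tuned)
print(f"relative drop: {drop:.1%}, retrain: {drop > 0.08}")
print(worst_regressions(prompts, base, tuned, k=2))
```

Looking at the worst individual regressions, not just the mean, helps identify which domains were underrepresented in the fine-tuning data.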

The fine-tuning eval checklist

When fine-tuning evaluation should block deployment

A fine-tuned model should not be deployed if forgetting exceeds 10% on general capability, edge case regression exceeds 15%, or calibration degrades substantially (expected calibration error, ECE, increases by more than 0.05). These thresholds are guidelines, not laws, so adjust them for your risk tolerance. The key discipline is having thresholds at all. Most teams have none.
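For the calibration threshold, one common definition of ECE groups predictions into confidence bins and takes the bin-size-weighted average of the gap between mean confidence and accuracy in each bin. The sketch below is a plain-Python version under stated assumptions: ten equal-width bins, each covering its lower edge, and a per-example confidence and correctness label already extracted from the model. How you obtain a confidence from a generative model is a separate design choice that this sketch does not address.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |mean confidence - accuracy| over bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: the model claims 90% confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

To apply the threshold, compute ECE on the same held-out set for the base and fine-tuned models and compare the difference against 0.05.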

A model that passes task-specific eval but fails one of the above checks is a model that will surprise you in production. The production failure will come from the unchecked dimension — the edge cases, the forgotten tasks, the overconfident wrong answers. Run the full check before every deployment.
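A deployment gate built on these thresholds can be a few lines of code. The sketch below uses the guideline values from this article as defaults and assumes the three measurements are already computed as fractions (0.10 for 10%) and an absolute ECE difference. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Guideline values from the article; adjust for your risk tolerance.
    max_forgetting: float = 0.10
    max_edge_case_regression: float = 0.15
    max_ece_increase: float = 0.05


def blocking_reasons(forgetting, edge_case_regression, ece_increase,
                     limits=Thresholds()):
    """Return a list of human-readable reasons to block deployment.

    An empty list means every check passed.
    """
    reasons = []
    if forgetting > limits.max_forgetting:
        reasons.append(f"forgetting {forgetting:.1%} exceeds {limits.max_forgetting:.0%}")
    if edge_case_regression > limits.max_edge_case_regression:
        reasons.append(
            f"edge case regression {edge_case_regression:.1%} "
            f"exceeds {limits.max_edge_case_regression:.0%}"
        )
    if ece_increase > limits.max_ece_increase:
        reasons.append(f"ECE increase {ece_increase:.3f} exceeds {limits.max_ece_increase}")
    return reasons


# Made-up measurements: fine on forgetting and calibration, bad on edge cases.
reasons = blocking_reasons(forgetting=0.04, edge_case_regression=0.20, ece_increase=0.01)
print("BLOCK" if reasons else "SHIP", reasons)
```

Returning the reasons rather than a bare boolean makes the gate's decision auditable, which matters when someone wants to override it.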
