
Fine-Tuning Fundamentals: What It Is, When to Use It, and When Not To

The mental model every engineer needs before touching fine-tuning. What fine-tuning actually changes in a model, the three real use cases where it beats prompting, and the five situations where it's the wrong tool.

Fine-tuning is one of the most misused techniques in applied AI. Teams reach for it when they hit a quality ceiling, when a prompt isn't doing what they want, when a model doesn't know their domain. Sometimes it's the right tool. Often it's not — and a poorly motivated fine-tuning run costs weeks of engineering time, GPU budget, and ongoing maintenance overhead without solving the underlying problem.

Before you fine-tune anything, you need a clear answer to three questions. What exactly is wrong with the base model? Would more data in context fix it? What would success actually look like? This post gives you the mental model to answer all three.

What fine-tuning actually changes

Fine-tuning updates the weights of a pre-trained model on a new dataset. Unlike prompting — which changes the model's input — fine-tuning changes the model itself. After fine-tuning, the model generates differently even with the same prompt, because its internal representations have shifted.

What shifts specifically depends on what kind of fine-tuning you do and what data you use. But the core mechanism is always the same: gradient descent updates weights to minimise loss on your new dataset, pulling the model's behaviour toward the distribution of your training examples.
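To make the mechanism concrete, here is a minimal sketch of gradient descent pulling a "pre-trained" model toward a new dataset. It assumes a toy two-weight linear model in place of an LLM; the starting weights, data, learning rate, and step count are all hypothetical values chosen for illustration.

```python
# Toy illustration only: a two-weight linear model y = w*x + b stands in for
# a pre-trained network. All numbers below are hypothetical.

def mse_loss(w, b, data):
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, b, data, lr=0.1, steps=300):
    """Plain gradient descent, starting from the existing weights."""
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Step against the gradient to reduce the loss on the new data.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Pre-trained" weights encode y = x; the new dataset follows y = 2x + 1.
pretrained_w, pretrained_b = 1.0, 0.0
new_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

loss_before = mse_loss(pretrained_w, pretrained_b, new_data)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, new_data)
loss_after = mse_loss(tuned_w, tuned_b, new_data)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.6f}")
print(f"weights moved from (1.0, 0.0) to ({tuned_w:.3f}, {tuned_b:.3f})")
```

The same input now produces a different output because the weights themselves moved, which is the difference from prompting described above. In a real model the update touches millions or billions of weights, but the loop has the same shape.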

Fine-tuning does not add knowledge by magic. If the base model has never encountered a concept, fine-tuning on a small dataset won't reliably teach it. Fine-tuning is best at reshaping how a model uses knowledge it already has — style, format, tone, task structure — not at injecting new facts.

The three real use cases for fine-tuning

1. Style and format adaptation

The base model writes in a general style. Your product needs a very specific voice, output format, or structural pattern: a customer support bot that always responds in a specific JSON schema, a code assistant that always writes idiomatic Python with specific patterns, a document analyser that always produces a fixed structured report. Prompting can get close, but fine-tuning makes the target format the model's default behaviour, so it holds up far more consistently across sampling settings and without long system prompts. It is still not a hard guarantee: if downstream code depends on the format, validate the output.
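One way to tell whether format adherence is actually your problem is to measure it. The sketch below computes the fraction of model outputs that parse as JSON and contain the expected keys. The schema (`intent`, `reply`) and the sample outputs are hypothetical; substitute your own.

```python
import json

# Hypothetical schema for a support bot; replace with your real required keys.
REQUIRED_KEYS = {"intent", "reply"}


def is_compliant(raw_output: str) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the format check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o) for o in outputs) / len(outputs)


# Made-up outputs: two valid, one wrapped in chatter, one missing a key.
sample_outputs = [
    '{"intent": "refund", "reply": "Happy to help."}',
    'Sure! Here is the JSON: {"intent": "refund"}',
    '{"intent": "billing"}',
    '{"intent": "login", "reply": "Try resetting your password."}',
]
rate = compliance_rate(sample_outputs)
print(f"format compliance: {rate:.0%}")
```

Running the same check on a prompted baseline and on a fine-tuned model gives a direct, comparable number for this use case.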

2. Task specialisation on a narrow domain

The base model is a generalist. Your use case is highly specific — medical record extraction, legal clause classification, financial report summarisation. The model knows the domain at a broad level, but struggles with the specific vocabulary, edge cases, and output requirements. Fine-tuning on high-quality domain examples teaches the model the specific patterns that matter for your task.

3. Latency and cost optimisation via distillation

You're using a large frontier model and the quality is excellent, but the cost or latency is unsustainable at scale. Use the frontier model to generate high-quality (instruction, response) pairs, then fine-tune a smaller model on this "teacher" data. This is a form of knowledge distillation in which the student trains on the teacher's generated outputs: the small model learns to approximate the large model's behaviour on your task at a fraction of the inference cost.
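The data-generation step can be sketched as follows. This is an illustration under assumptions: `fake_teacher` is a hypothetical stand-in for a call to the large model, and the `instruction`/`response` JSONL layout is one common convention, not a format any particular training tool requires.

```python
import json
from typing import Callable


def build_distillation_set(
    instructions: list[str], teacher: Callable[[str], str]
) -> list[dict]:
    """Pair each instruction with the teacher's response, skipping empty ones."""
    pairs = []
    for instruction in instructions:
        response = teacher(instruction).strip()
        if response:  # do not train the student on empty teacher outputs
            pairs.append({"instruction": instruction, "response": response})
    return pairs


def to_jsonl(pairs: list[dict]) -> str:
    """Serialise the pairs as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


def fake_teacher(instruction: str) -> str:
    """Hypothetical stand-in for the frontier model; swap in a real client call."""
    return "" if "empty" in instruction else f"Answer to: {instruction}"


pairs = build_distillation_set(
    ["Summarise the Q3 report", "empty case"], fake_teacher
)
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

In practice you would also filter the teacher's outputs for quality before training, since the student inherits whatever mistakes remain in the data.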

Five situations where fine-tuning is the wrong tool

Your evaluation is weak: if you can't reliably measure whether the fine-tuned model is better, you can't tell whether you've improved anything. Fix evals before fine-tuning.
You have fewer than ~500 high-quality examples: as a rough rule of thumb, below this threshold fine-tuning tends to hurt general capability more than it helps task performance. Use few-shot prompting instead.
Your knowledge changes frequently: fine-tuning bakes knowledge into weights. If your domain knowledge updates weekly, you'll need to retrain constantly. RAG is almost always better for dynamic knowledge.
You're trying to fix hallucination on factual tasks: fine-tuning on correct facts doesn't reliably reduce hallucination. A small dataset rarely overrides the incorrect patterns the model absorbed during pre-training, and it does nothing for facts that fall outside your training examples. Use RAG or structured retrieval instead.
You haven't exhausted prompt engineering: a well-structured system prompt with good few-shot examples often achieves 80-90% of what fine-tuning would achieve at zero maintenance cost. Always benchmark a strong prompt baseline before investing in fine-tuning.
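Two of the points above, weak evaluation and an unbenchmarked prompt baseline, come down to the same habit: score both options on the same eval set before committing. A minimal sketch, assuming exact-match accuracy is a sensible metric for your task; the stub models, the eval set, and the `min_gain` threshold are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]


def exact_match_accuracy(model: Model, eval_set: list[tuple[str, str]]) -> float:
    """Share of eval items where the model's answer matches the reference."""
    correct = sum(
        model(question).strip().lower() == answer.strip().lower()
        for question, answer in eval_set
    )
    return correct / len(eval_set)


def compare(baseline: Model, candidate: Model,
            eval_set: list[tuple[str, str]], min_gain: float = 0.05) -> dict:
    """Score both models and flag whether the candidate clears the required gain."""
    base = exact_match_accuracy(baseline, eval_set)
    cand = exact_match_accuracy(candidate, eval_set)
    return {"baseline": base, "candidate": cand, "worth_it": cand - base >= min_gain}


# Hypothetical eval set and stub models standing in for real inference calls.
eval_set = [
    ("clause A", "indemnity"),
    ("clause B", "termination"),
    ("clause C", "liability"),
    ("clause D", "confidentiality"),
]
baseline_answers = {"clause A": "indemnity", "clause B": "termination",
                    "clause C": "liability", "clause D": "liability"}
candidate_answers = dict(eval_set)


def prompted_baseline(question: str) -> str:
    return baseline_answers[question]


def fine_tuned_candidate(question: str) -> str:
    return candidate_answers[question]


result = compare(prompted_baseline, fine_tuned_candidate, eval_set)
print(result)
```

A four-item eval set is far too small to trust; the point is only the shape of the comparison. If the candidate does not clear the gain you decided on in advance, the fine-tuning run has not paid for itself.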

The most common fine-tuning mistake: treating it as a shortcut to skip prompt engineering. Fine-tuning is harder to iterate on, harder to debug, and introduces ongoing maintenance overhead. It should only enter the picture after you've maximised what's achievable through prompting.

The fine-tuning decision checklist

Do you have 500+ high-quality, consistent (instruction, response) pairs in your target domain?
Have you built an eval set that can reliably measure improvement?
Have you benchmarked a strong prompt engineering baseline?
Is the failure mode about style, format, or task behaviour rather than factual accuracy?
Does your knowledge change slowly enough that retraining isn't constant overhead?

If you answered yes to all five: fine-tuning is likely the right tool. If you answered no to any: address those first.
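The checklist can be encoded as a small gate that lists what still needs fixing. This is a sketch: the field names are hypothetical, and the 500-example default simply mirrors the rule of thumb used in this post.

```python
from dataclasses import dataclass


@dataclass
class FineTuningReadiness:
    """Answers to the five checklist questions (field names are illustrative)."""
    num_quality_examples: int
    has_reliable_eval: bool
    has_prompt_baseline: bool
    failure_is_style_or_format_not_factual: bool
    knowledge_is_stable: bool


def blockers(r: FineTuningReadiness, min_examples: int = 500) -> list[str]:
    """Return the unmet checklist items; an empty list means all five pass."""
    issues = []
    if r.num_quality_examples < min_examples:
        issues.append(f"collect at least {min_examples} high-quality examples")
    if not r.has_reliable_eval:
        issues.append("build an eval set that can measure improvement")
    if not r.has_prompt_baseline:
        issues.append("benchmark a strong prompt engineering baseline")
    if not r.failure_is_style_or_format_not_factual:
        issues.append("use RAG or retrieval for factual accuracy problems")
    if not r.knowledge_is_stable:
        issues.append("prefer RAG while the knowledge changes frequently")
    return issues


ready = FineTuningReadiness(1200, True, True, True, True)
not_ready = FineTuningReadiness(120, False, True, True, True)

print(blockers(ready))      # no blockers
print(blockers(not_ready))  # two items to address first
```

Returning the list of unmet items, rather than a single yes or no, matches the advice above: address whatever is missing first, then revisit the decision.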
