The Fine-Tuning Playbook: LoRA, QLoRA, and When to Use Each
A practical decision framework for fine-tuning LLMs — from full parameter training to 4-bit QLoRA on consumer GPUs.
**Prerequisite: Step 5 (Pretraining Data) helps but is not required.** After this post you'll know when fine-tuning is the right choice over prompting, what LoRA and PEFT actually do at a conceptual level, and how to decide whether instruction-tuning your model is worth the cost.
Fine-tuning is the most misused tool in the modern ML stack. Teams fine-tune when they should be prompting, prompt when they should be fine-tuning, and almost always skip the step that matters most: building an eval harness before they start.
When Fine-Tuning Actually Wins
Fine-tuning beats prompting when the task requires a consistent output format at high volume; the model needs domain vocabulary it wasn't trained on; latency constraints make long system prompts expensive; or you're doing classification or extraction, where a fine-tuned 7B model can outperform GPT-4 on that narrow task at roughly 10% of the cost.
Fine-tuning doesn't add knowledge — it adjusts behaviour. If the base model doesn't know a fact, fine-tuning won't teach it that fact. Use RAG for knowledge, fine-tuning for style and format.
The Method Decision Framework
| Method | When to use | VRAM | Quality ceiling |
|---|---|---|---|
| Full FT | Significant task distribution shift; large dataset (>50K examples) | High (full model) | Best |
| LoRA | Adapter for a new task; <10K examples; limited GPU budget | Medium (frozen base weights + small trainable adapter) | Near-full-FT |
| QLoRA | Consumer GPU; 4-bit base + LoRA adapter; cost-constrained | Low | Slightly below LoRA |
| Prompting (no weight updates) | API-only access; very small dataset; fast iteration | Zero | Limited |
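To make the VRAM column concrete, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, hidden size 4096, two adapted projection matrices per layer, 7B parameters) is an assumed, hypothetical 7B-class configuration, not a measurement of any specific model. The estimate covers weight storage only and ignores activations, optimizer state, and quantization overhead.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # LoRA learns two small matrices, B (d_out x rank) and A (rank x d_in).
    # The original weight matrix stays frozen.
    return rank * (d_in + d_out)


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed just to store the weights, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed, hypothetical 7B-class model shape (illustration only).
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 2          # e.g. two attention projections per layer
RANK = 8
BASE_PARAMS = 7_000_000_000

adapter_params = LAYERS * ADAPTED_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)

print(f"LoRA adapter parameters: {adapter_params:,}")
print(f"Base weights at 16-bit:  {weight_memory_gb(BASE_PARAMS, 16):.1f} GB")
print(f"Base weights at 4-bit:   {weight_memory_gb(BASE_PARAMS, 4):.1f} GB")
```

Under these assumptions the adapter has about 4.2M trainable parameters, while the frozen base weights take roughly 14 GB at 16-bit and 3.5 GB at 4-bit. The adapter is tiny either way; what QLoRA shrinks is the cost of holding the base model.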
The 5-Step Production Workflow
- Build your eval harness first: define metrics and test cases before touching training data. If you can't measure it, you can't improve it.
- Curate your dataset: quality beats quantity. 500 expertly curated examples often outperform 50,000 scraped ones.
- Establish a baseline: prompt the base model with your best system prompt. If fine-tuning doesn't beat this by a meaningful margin, don't ship it.
- Train with a small LoRA first: rank 8–16, alpha 16–32, 1–3 epochs. Validate on held-out examples before scaling.
- Regression test: fine-tuning can degrade capabilities outside the training distribution. Always test on tasks beyond your specific domain.
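The first and third steps above can be sketched as a very small harness. This is a minimal illustration, assuming an exact-match metric and a model exposed as a plain Python callable; the stub models, the test cases, and the `min_margin` threshold are all hypothetical placeholders, and a real harness would call your actual inference endpoint and use task-appropriate metrics.

```python
def exact_match_accuracy(model_fn, test_cases):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    correct = sum(
        1 for prompt, expected in test_cases
        if model_fn(prompt).strip() == expected.strip()
    )
    return correct / len(test_cases)


def should_ship(baseline_score, finetuned_score, min_margin=0.05):
    """Ship only if fine-tuning beats the prompted baseline by min_margin."""
    return finetuned_score - baseline_score >= min_margin


# Hypothetical held-out test set for a sentiment classification task.
TEST_CASES = [
    ("Great product, works perfectly", "positive"),
    ("Broke after two days", "negative"),
    ("Exactly what I needed", "positive"),
    ("Support never replied", "negative"),
]


def baseline_model(prompt):
    # Stub standing in for the prompted base model: always says "positive".
    return "positive"


def finetuned_model(prompt):
    # Stub standing in for the fine-tuned model: looks up the right answer.
    return dict(TEST_CASES)[prompt]


baseline_score = exact_match_accuracy(baseline_model, TEST_CASES)
finetuned_score = exact_match_accuracy(finetuned_model, TEST_CASES)
print(baseline_score, finetuned_score, should_ship(baseline_score, finetuned_score))
```

The same `exact_match_accuracy` function can be pointed at a second, general-domain test set to cover the regression-test step: score the model before and after fine-tuning and compare.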
Dataset Quality Is the Bottleneck
The most common fine-tuning failure is dataset quality, not model choice or hyperparameters. Every low-quality example you include teaches the model to produce low-quality outputs. Filter aggressively: remove duplicates, remove examples where the reference output is itself wrong, and maintain label balance.
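The filtering steps above might look like the following sketch. It assumes each example is a dict with `input` and `output` keys and that you supply your own validity check; the field names, the `is_reference_valid` callback, and the sample data are hypothetical and only there to make the example runnable.

```python
from collections import Counter


def clean_dataset(examples, is_reference_valid):
    """Drop duplicate examples and examples whose reference output is invalid."""
    seen = set()
    kept = []
    for ex in examples:
        # Normalise whitespace and case so trivially different copies match.
        key = (ex["input"].strip().lower(), ex["output"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        if not is_reference_valid(ex):
            continue
        kept.append(ex)
    return kept


def label_balance(examples):
    """Share of the dataset taken by each output label."""
    counts = Counter(ex["output"] for ex in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical raw data for a ticket-routing classifier.
raw = [
    {"input": "Refund please", "output": "billing"},
    {"input": "refund please ", "output": "billing"},     # duplicate
    {"input": "App crashes on login", "output": "technical"},
    {"input": "Where is my invoice?", "output": ""},      # missing label
]
VALID_LABELS = {"billing", "technical"}

cleaned = clean_dataset(raw, lambda ex: ex["output"] in VALID_LABELS)
print(len(cleaned), label_balance(cleaned))
```

Checking `label_balance` after filtering matters because aggressive filtering can skew the class mix even when the raw data was balanced.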
Dataset sizing heuristics
- Classification/extraction: 500–2,000 examples per class typically sufficient for LoRA
- Style/format transfer: 1,000–5,000 examples of the target style
- Domain adaptation: 10,000+ examples if the domain vocabulary diverges significantly from pretraining
- Instruction following (general): diminishing returns above 50,000 high-quality pairs
The Mistakes That Cost Teams the Most
- No eval before training: you can't know if fine-tuning helped without a baseline and held-out test set
- Overfitting on small datasets: validation loss plateaus and then rises. If you see this, stop training, add more data, or use data augmentation
- Ignoring capability regression: fine-tuned models can lose general ability. Include general-domain examples in your training mix
- Skipping the prompt baseline: prompt engineering can close much of the performance gap, often around 80%, before you touch any training code
- Deploying without shadow evaluation: run your fine-tuned model alongside the base model on live traffic before full cutover
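The overfitting signal in the list above (validation loss plateaus, then rises) can be turned into an automatic stop rule. The sketch below is one simple way to do it, assuming you record one validation loss per epoch; `patience` and `min_delta` are hypothetical parameter names chosen for this example, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improving, then plateauing and rising.
losses = [1.20, 0.95, 0.90, 0.91, 0.93, 0.97]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(f"stop at epoch {stop_epoch}, keep checkpoint from epoch {best_epoch}")
```

The point of returning `best_epoch` is that the checkpoint worth keeping is the one from the lowest validation loss, not the one from the epoch where training stopped.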
The teams that fine-tune best treat it as a last resort, not a first resort. Exhaust prompting and RAG before reaching for gradient updates.