AI Engineering 11 min read

The Fine-Tuning Playbook: LoRA, QLoRA, and When to Use Each

A practical decision framework for fine-tuning LLMs — from full parameter training to 4-bit QLoRA on consumer GPUs.

**Prerequisite: Step 5 (Pretraining Data) helps but is not required.** After this post you'll know when fine-tuning is the right choice over prompting, what LoRA and PEFT actually do at a conceptual level, and how to decide whether instruction-tuning your model is worth the cost.

Fine-tuning is the most misused tool in the modern ML stack. Teams fine-tune when they should be prompting, prompt when they should be fine-tuning, and almost always skip the step that matters most: building an eval harness before they start.

When Fine-Tuning Actually Wins

Fine-tuning beats prompting when: the task requires consistent output format at high volume; the model needs domain vocabulary it wasn't trained on; latency constraints make long system prompts expensive; or you're doing classification/extraction, where a fine-tuned 7B model can outperform GPT-4 on the narrow task at roughly 10% of the cost.

Fine-tuning is an unreliable way to add knowledge; what it mainly adjusts is behaviour. If the base model doesn't know a fact, a fine-tuning run is unlikely to teach it that fact dependably. Use RAG for knowledge, fine-tuning for style and format.

The Method Decision Framework

Method	When to use	VRAM	Quality ceiling
Full FT	Significant task distribution shift; large dataset (>50K examples)	High (full model)	Best
LoRA	Adapter for a new task; <10K examples; limited GPU budget	Medium (frozen 16-bit base + small trainable adapter)	Near-full-FT
QLoRA	Consumer GPU; 4-bit base + LoRA adapter; cost-constrained	Low	Slightly below LoRA
Prompting (few-shot)	API-only access; very small dataset; fast iteration	Zero (no training)	Limited
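
To make the VRAM column concrete, here is a rough back-of-the-envelope estimator. It is a sketch under stated assumptions, not a measurement: the bytes-per-parameter figures are common rules of thumb for mixed-precision Adam training, the 7B model shape is hypothetical, and activations, quantization constants, and framework overhead are ignored, so real usage will be higher.

```python
# Rough memory estimate for weights, gradients and optimizer state only.
# Activations and framework overhead are NOT included.

GIB = 1024 ** 3

# Assumed bytes per parameter (rules of thumb, not measurements):
BYTES_FROZEN_16BIT = 2.0  # frozen fp16/bf16 weight
BYTES_FROZEN_4BIT = 0.5   # frozen 4-bit weight (quantization constants ignored)
BYTES_TRAINABLE = 16.0    # 16-bit weight + 16-bit grad + fp32 master copy + two fp32 Adam moments


def lora_adapter_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """LoRA adds two low-rank factors per adapted matrix: A (rank x d_in) and B (d_out x rank)."""
    return n_matrices * rank * (d_in + d_out)


def estimate_gib(base_params: float, adapter_params: float, method: str) -> float:
    """Estimate training memory in GiB for 'full', 'lora' or 'qlora'."""
    if method == "full":
        total = base_params * BYTES_TRAINABLE
    elif method == "lora":
        total = base_params * BYTES_FROZEN_16BIT + adapter_params * BYTES_TRAINABLE
    elif method == "qlora":
        total = base_params * BYTES_FROZEN_4BIT + adapter_params * BYTES_TRAINABLE
    else:
        raise ValueError(f"unknown method: {method}")
    return total / GIB


# Hypothetical 7B model: 32 layers, hidden size 4096,
# rank-16 adapters on two square projection matrices per layer.
base = 7e9
adapter = lora_adapter_params(d_in=4096, d_out=4096, rank=16, n_matrices=32 * 2)

for method in ("full", "lora", "qlora"):
    print(f"{method:>5}: ~{estimate_gib(base, adapter, method):.1f} GiB before activations")
```

The point of the sketch is the ordering, not the exact numbers: with LoRA and QLoRA the frozen base model dominates memory, and the adapter itself is a rounding error.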

The 5-Step Production Workflow

1. Build your eval harness first: define metrics and test cases before touching training data. If you can't measure it, you can't improve it.
2. Curate your dataset: quality beats quantity. 500 expertly curated examples often outperform 50,000 scraped ones.
3. Establish a baseline: prompt the base model with your best system prompt. If fine-tuning doesn't beat this by a meaningful margin, don't ship it.
4. Train with a small LoRA first: rank 8–16, alpha 16–32, 1–3 epochs. Validate on held-out examples before scaling.
5. Regression test: fine-tuning can degrade capabilities outside the training distribution. Always test on tasks beyond your specific domain.
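
The eval-first, baseline, and regression-test steps above can be sketched as a single ship/no-ship gate. This is a minimal illustration, not a full harness: `exact_match_accuracy`, `ship_decision`, the thresholds `min_gain` and `max_regression`, and the stub models are all hypothetical names and values chosen for the example. A real model would be any function that maps a prompt to a completion.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]     # (prompt, expected output)
Model = Callable[[str], str]  # any function mapping a prompt to a completion


def exact_match_accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples where the output equals the reference after trimming whitespace."""
    if not examples:
        raise ValueError("eval set is empty")
    hits = sum(model(prompt).strip() == expected.strip() for prompt, expected in examples)
    return hits / len(examples)


def ship_decision(
    baseline: Model,
    candidate: Model,
    task_set: List[Example],
    general_set: List[Example],
    min_gain: float = 0.05,        # assumed "meaningful margin" on the target task
    max_regression: float = 0.02,  # assumed tolerated drop on general tasks
) -> Dict[str, object]:
    """Compare a fine-tuned candidate against the prompted baseline on both eval sets."""
    task_gain = exact_match_accuracy(candidate, task_set) - exact_match_accuracy(baseline, task_set)
    general_drop = exact_match_accuracy(baseline, general_set) - exact_match_accuracy(candidate, general_set)
    return {
        "task_gain": task_gain,
        "general_drop": general_drop,
        "ship": task_gain >= min_gain and general_drop <= max_regression,
    }


# Stub models standing in for the prompted base model and the fine-tuned model.
task_set = [
    ("Refund not received", "billing"),
    ("Charged twice", "billing"),
    ("App crashes on login", "bug"),
    ("Button does nothing", "bug"),
]
general_set = [("capital of France", "Paris"), ("opposite of hot", "cold")]

baseline_answers = {
    "Refund not received": "billing",
    "App crashes on login": "bug",
    "capital of France": "Paris",
    "opposite of hot": "cold",
}
candidate_answers = {**baseline_answers, "Charged twice": "billing", "Button does nothing": "bug"}


def baseline_model(prompt: str) -> str:
    return baseline_answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    return candidate_answers.get(prompt, "")


report = ship_decision(baseline_model, candidate_model, task_set, general_set)
print(report)
```

Exact match suits classification and extraction; for free-form output you would swap in a metric that fits the task, but the gate logic stays the same.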

Dataset Quality Is the Bottleneck

The most common fine-tuning failure is dataset quality, not model choice or hyperparameters. Every low-quality example you include teaches the model to produce low-quality outputs. Filter aggressively: remove duplicates, remove examples where the reference output is itself wrong, and maintain label balance.
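
Two of those filters are easy to automate. The sketch below assumes each example is a dict with hypothetical `input` and `label` keys, treats inputs that differ only in case or whitespace as duplicates, and balances classes by downsampling to the smallest one, which is one reasonable choice among several. Catching wrong reference outputs still needs human review or a separate checker and is not shown.

```python
import hashlib
import random
from collections import Counter, defaultdict
from typing import Dict, List

Example = Dict[str, str]  # assumed shape: {"input": ..., "label": ...}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash to the same key."""
    return " ".join(text.lower().split())


def dedupe(examples: List[Example]) -> List[Example]:
    """Keep the first occurrence of each normalized input."""
    seen = set()
    kept = []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["input"]).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept


def balance(examples: List[Example], seed: int = 0) -> List[Example]:
    """Downsample every class to the size of the smallest class."""
    if not examples:
        return []
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    cap = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], cap))
    return balanced


raw = [
    {"input": "Refund my order", "label": "billing"},
    {"input": "refund  my ORDER", "label": "billing"},  # duplicate after normalization
    {"input": "Card declined", "label": "billing"},
    {"input": "App crashes", "label": "bug"},
]

clean = balance(dedupe(raw))
print(Counter(ex["label"] for ex in clean))
```

Downsampling throws data away, so on small datasets you may prefer to collect more examples for the minority class instead.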

Dataset sizing heuristics

Classification/extraction: 500–2,000 examples per class are typically sufficient for LoRA
Style/format transfer: 1,000–5,000 examples of the target style
Domain adaptation: 10,000+ examples if the domain vocabulary diverges significantly from pretraining
Instruction following (general): diminishing returns above 50,000 high-quality pairs

The Mistakes That Cost Teams the Most

No eval before training: you can't know if fine-tuning helped without a baseline and held-out test set
Overfitting on small datasets: validation loss plateaus, then rises. If you see this, stop training early or add more (or augmented) data
Forgetting capability regression: fine-tuned models can lose general ability. Include general-domain examples in your training mix and re-test general tasks after training
Skipping the prompt baseline: prompt engineering often closes 80% of the performance gap before you touch any training code
Deploying without shadow evaluation: run your fine-tuned model alongside the base model on live traffic before full cutover
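
The overfitting signal in the list above (validation loss plateaus, then rises) can be caught with a simple patience rule. This is a minimal sketch: `best_stopping_epoch`, `patience`, and `min_delta` are illustrative names and defaults, and the loss values are made up for the example rather than taken from a real run.

```python
from typing import List, Tuple


def best_stopping_epoch(
    val_losses: List[float],
    patience: int = 2,       # assumed: epochs without improvement before stopping
    min_delta: float = 0.0,  # assumed: minimum decrease that counts as improvement
) -> Tuple[int, int]:
    """Return (best_epoch, stopped_at) as 0-based indices into val_losses."""
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stopped_at = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_at = epoch
                break
    return best_epoch, stopped_at


# Made-up curve: improves for three epochs, then starts rising.
losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90]
best, stopped = best_stopping_epoch(losses)
print(f"keep checkpoint from epoch {best}, stopped at epoch {stopped}")
```

In practice you would save a checkpoint at each improvement and restore the one from the best epoch rather than keeping the final weights.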

The teams that fine-tune best treat it as a last resort, not a first one. Exhaust prompting and RAG before reaching for gradient updates.
