
Full Fine-Tuning vs. PEFT vs. Prompting: The Decision Framework

A practical decision tree for choosing between full fine-tuning, LoRA/QLoRA, and prompt engineering — based on data size, latency requirements, update frequency, and budget.

You've decided fine-tuning is the right approach. Now comes the second decision: which kind of fine-tuning? Full fine-tuning updates every parameter. Parameter-efficient fine-tuning (PEFT) methods like LoRA update a tiny fraction of parameters. And sometimes, prompt engineering with a few-shot example set is close enough that no fine-tuning is warranted at all.

These aren't points on a single scale where more compute means better results. They have different cost profiles, update characteristics, inference tradeoffs, and failure modes. Choosing the wrong one means wasted GPU hours at best and a model that performs worse than your baseline at worst.

The three approaches compared

Approach	What Gets Updated	Training VRAM (70B model)	Quality Ceiling	Best For
Prompt engineering	Nothing (inference only)	None	~80–90% of fine-tuned quality on format and style tasks	Format, tone, structured output, few-shot tasks
LoRA / QLoRA (PEFT)	~0.5–2% of params (adapter matrices)	~40–48GB with QLoRA (the 4-bit base weights alone are ~35GB)	Close to full FT on most tasks	Domain adaptation, style, task specialisation
Full fine-tuning	100% of params	560GB+ (weights, gradients, and Adam state all in 16-bit)	Highest possible	Deep domain embedding, distillation, novel capabilities

When to use prompt engineering (not fine-tuning)

You need results in days, not weeks
Your dataset is small (<500 quality examples)
Your task format is well-defined and expressible in a system prompt
You need to update behaviour frequently without retraining
You're testing a product hypothesis before committing engineering time

A well-engineered prompt with 10–20 few-shot examples typically achieves 80–90% of the quality a fine-tuned model would achieve on format and style tasks. Measure the gap before deciding fine-tuning is worth the investment.
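One way to measure that gap is to score the prompt baseline on a held-out eval set and compare the mean against a quality target, with a bootstrap interval to check that the shortfall is consistent rather than noise. The sketch below assumes per-example scores in [0, 1] from some grader of your own; the function name `gap_to_target`, the target value, and the sample scores are all hypothetical.

```python
import random
from statistics import mean

def gap_to_target(scores, target, n_resamples=2000, seed=0):
    """Return (gap, ci_low, ci_high), where gap = target - mean(scores).

    The interval is a 95% percentile bootstrap over eval examples.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    # Resample the eval set with replacement and recompute the gap each time.
    gaps = sorted(
        target - mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    low = gaps[n_resamples * 25 // 1000]
    high = gaps[n_resamples * 975 // 1000 - 1]
    return target - mean(scores), low, high

# Hypothetical prompt-baseline scores on a 20-example eval set.
baseline_scores = [0.6] * 10 + [0.8] * 10
gap, low, high = gap_to_target(baseline_scores, target=0.9)
print(f"gap to target: {gap:.2f} (95% CI {low:.2f} to {high:.2f})")
```

If the whole interval sits above zero, the baseline is reliably short of the target, and fine-tuning is worth evaluating. If the interval straddles zero, more prompt work or a larger eval set is probably the cheaper next step.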

When to use LoRA / QLoRA

You have 500–50K high-quality training examples
You need consistent behaviour that a system prompt can't reliably enforce
You're adapting to a specific domain, persona, or task format
You want to run fine-tuning on a single GPU or small cluster
You need multiple specialised models from the same base (LoRA adapters can be swapped at inference time)
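To make the adapter idea concrete, here is a minimal pure-Python sketch of the standard LoRA update, y = Wx + (alpha / r) * B(Ax), with the base weight W left untouched and adapters chosen per request. It is an illustration only, not how a training library implements it; the adapter names, matrices, and helper functions are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, adapter, x):
    """Compute W x + (alpha / r) * B (A x); W itself is never modified."""
    base = matvec(W, x)
    if adapter is None:
        return base
    A, B, alpha = adapter["A"], adapter["B"], adapter["alpha"]
    r = len(A)  # A is r x d_in, B is d_out x r
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def lora_param_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the d_out x d_in matrix they adapt."""
    return rank * (d_in + d_out) / (d_out * d_in)

# One shared base weight, two hypothetical task adapters swapped at call time.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "legal": {"A": [[1.0, 1.0]], "B": [[1.0], [0.0]], "alpha": 2.0},
    "support": {"A": [[0.0, 1.0]], "B": [[0.0], [1.0]], "alpha": 1.0},
}
x = [1.0, 2.0]
for name in (None, "legal", "support"):
    print(name, lora_forward(W, adapters.get(name), x))

print(f"{lora_param_fraction(8192, 8192, 64):.4%} of one 8192 x 8192 matrix at rank 64")
```

Because only A and B differ between tasks, several specialised variants can share one copy of the base weights. The overall trainable fraction depends on the rank and on which layers receive adapters.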

When to use full fine-tuning

You have 50K+ high-quality examples and a specific quality target that LoRA doesn't reach
You're doing knowledge distillation from a larger teacher model
You're embedding deep domain knowledge (medical, legal, scientific) that requires broad weight updates
You have the infrastructure to run multi-GPU training and can absorb the compute cost

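Full fine-tuning a 70B model with Adam needs about 560GB of GPU memory even if the weights, gradients, and both optimiser moments are all kept in 16-bit (8 bytes per parameter), before counting activations. That is roughly 8x A100-80GB cards just to fit the model, gradients, and optimiser state. The more common mixed-precision setup, which also keeps fp32 master weights and moments, comes to about 16 bytes per parameter, or roughly 1.1TB. Unless you have this infrastructure or a compelling reason LoRA won't work, LoRA/QLoRA is almost always the right starting point.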
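The arithmetic behind estimates like this is a bytes-per-parameter count. The sketch below works it through for a 70B model under three accounting assumptions; activations, framework overhead, and LoRA adapter state are deliberately left out, so treat the results as lower bounds. The dictionary keys and function names are illustrative.

```python
import math

# Bytes per parameter under each assumption (activations excluded).
BYTES_PER_PARAM = {
    # 16-bit weights, gradients, and two Adam moments
    "all_16bit_adam": 2 + 2 + 2 + 2,
    # fp16 weights and gradients, plus fp32 master weights and two fp32 moments
    "mixed_precision_adam": 2 + 2 + 4 + 4 + 4,
    # frozen 4-bit base weights only, as in QLoRA
    "qlora_4bit_base": 0.5,
}

def memory_gb(n_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for n_params at the given bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def gpus_needed(total_gb, gb_per_gpu=80):
    """Minimum number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gb_per_gpu)

for name, bytes_each in BYTES_PER_PARAM.items():
    gb = memory_gb(70e9, bytes_each)
    print(f"{name}: {gb:.0f} GB, at least {gpus_needed(gb)} x 80GB GPUs")
```

In practice, sharding strategies, optimiser offloading, and 8-bit optimisers change these numbers, which is one reason to run a small experiment before committing to hardware.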

The decision tree

Do you have a strong prompt baseline?
  No → Build one first. Measure the gap.
  Yes → Is the gap significant and consistent across your eval set?
    No → Prompt engineering is sufficient.
    Yes → Do you have 500+ high-quality training examples?
      No → Collect more data first.
      Yes → Can LoRA reach your quality target? (run small experiment)
        Yes → Use LoRA / QLoRA
        No → Do you have multi-GPU infrastructure?
          Yes → Full fine-tuning
          No → Get infrastructure or revisit quality target
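
The same tree can be written as a small function, which is handy for recording the decision alongside an experiment log. This is a direct transcription of the branches above; the `Situation` fields and the `recommend` name are hypothetical, and the 500-example threshold is the one used in this article.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    has_prompt_baseline: bool
    gap_is_significant: bool     # significant and consistent across the eval set
    n_examples: int              # high-quality training examples available
    lora_reaches_target: bool    # result of a small LoRA experiment
    has_multi_gpu: bool

def recommend(s, min_examples=500):
    """Walk the decision tree top to bottom and return the first matching outcome."""
    if not s.has_prompt_baseline:
        return "Build a prompt baseline first and measure the gap"
    if not s.gap_is_significant:
        return "Prompt engineering is sufficient"
    if s.n_examples < min_examples:
        return "Collect more data first"
    if s.lora_reaches_target:
        return "Use LoRA / QLoRA"
    if s.has_multi_gpu:
        return "Full fine-tuning"
    return "Get infrastructure or revisit quality target"

# Example: a clear gap, 3,000 examples, and a LoRA pilot that hit the target.
print(recommend(Situation(True, True, 3000, True, False)))
```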

