PEFT Methods Compared: LoRA, Prefix Tuning, Prompt Tuning, and Adapters
The four main parameter-efficient fine-tuning approaches — what each trains, when each wins, and why LoRA dominates most production use cases despite not being the oldest method.
LoRA gets most of the attention in the PEFT ecosystem — and for good reason. But it's not the only parameter-efficient fine-tuning method, and understanding what the others do (and why LoRA beat them in practice) builds a clearer mental model of what PEFT is actually optimising for.
The four main PEFT families: LoRA (low-rank adapter matrices), Prefix Tuning (prepend trainable tokens to the key-value sequence), Prompt Tuning (prepend trainable tokens to the input), and Adapter Layers (insert small bottleneck networks into transformer layers). They all share the same goal — adapt a large pretrained model efficiently — but make different tradeoffs.
LoRA (Low-Rank Adaptation)
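Injects trainable low-rank matrices alongside frozen weight matrices. For a frozen weight W of shape d × k, the update is ΔW = BA, where B is d × r, A is r × k, and r is much smaller than d and k. Because the update is linear, the adapter can be merged into the base weights (W + BA) for inference, which leaves zero inference overhead. Rank r controls capacity: a higher rank can capture more complex adaptations but uses more parameters.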
- Parameters: ~0.1–2% of base model (depending on rank and target modules)
- Inference overhead: zero after merging; ~5% before merging
- Works well for: style, format, domain adaptation, instruction following
- Weakness: the low-rank update limits capacity, so it can fall short of full fine-tuning on tasks that need large changes to the model's behaviour
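To make the merge claim concrete, here is a minimal sketch in plain Python (no ML library) of a single LoRA-adapted linear layer. The sizes, the random initialisation, and the helper names `matmul`, `matadd`, and `rand_matrix` are illustrative assumptions. Real implementations also scale the update by a constant factor and typically start one of the two matrices at zero, both of which are omitted here.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    """Random matrix, used here only to get non-trivial numbers."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, r = 6, 8, 2             # illustrative sizes; r is the LoRA rank

W = rand_matrix(d_out, d_in, rng)    # frozen pretrained weight
B = rand_matrix(d_out, r, rng)       # trainable low-rank factor
A = rand_matrix(r, d_in, rng)        # trainable low-rank factor
x = rand_matrix(d_in, 1, rng)        # one input, as a column vector

# Before merging: the base path and the low-rank path run separately.
y_unmerged = matadd(matmul(W, x), matmul(B, matmul(A, x)))

# After merging: fold BA into W once, then run a single matmul.
W_merged = matadd(W, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(y_unmerged, y_merged))
lora_params = r * (d_in + d_out)     # entries in B plus entries in A
full_params = d_in * d_out           # entries in W

print(f"largest output difference: {max_diff:.2e}")
print(f"trainable: {lora_params} of {full_params} weights in this layer")
```

The two outputs agree up to floating-point rounding, which is why a merged model costs nothing extra at inference. The parameter saving looks modest at these toy sizes; it grows as d_in and d_out get large relative to r.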
Adapter Layers
Insert small bottleneck networks (down-project → nonlinearity → up-project) into each transformer layer. The adapter learns a residual transformation: output = input + adapter(input). Only the adapter weights are trained.
- Parameters: ~1–8% of base model
- Inference overhead: permanent. The nonlinearity prevents merging the adapter into the base weights, so the extra layers run on every forward pass and add latency
- Works well for: multi-task learning (one adapter per task, same base model)
- Weakness: the permanent inference overhead makes it less attractive for production single-task models
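The residual bottleneck described above can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than a specific library's implementation: the sizes, the ReLU nonlinearity, the omission of biases, and the names `linear` and `adapter_block`. The zero-initialised up-projection is included only to show that an adapter can start out as an identity function.

```python
import random

def linear(W, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def adapter_block(x, W_down, W_up):
    """output = input + up(relu(down(input))); biases omitted for brevity."""
    hidden = [max(0.0, h) for h in linear(W_down, x)]          # down-project, then ReLU
    return [xi + ui for xi, ui in zip(x, linear(W_up, hidden))]  # up-project, add residual

rng = random.Random(0)
d_model, bottleneck = 8, 2           # illustrative sizes

W_down = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(bottleneck)]
W_up_zero = [[0.0] * bottleneck for _ in range(d_model)]
W_up_trained = [[rng.uniform(-1, 1) for _ in range(bottleneck)] for _ in range(d_model)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# With a zero up-projection the block passes its input through unchanged.
identity_out = adapter_block(x, W_down, W_up_zero)

# With non-zero weights the block adds a learned correction to the input.
adapted_out = adapter_block(x, W_down, W_up_trained)

adapter_params = 2 * d_model * bottleneck   # down + up matrices, no biases
print(identity_out == x, adapter_params)
```

Because the ReLU sits between the two projections, the block cannot be collapsed into a single matrix and folded into the base weights, which is the source of the permanent overhead.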
Prefix Tuning
Prepends a sequence of trainable 'virtual tokens' to the key and value tensors at every attention layer. These prefix vectors are never part of the input text — they exist only inside the attention computation, acting as soft prompts that influence every layer's attention.
- Parameters: prefix length × model dimension × num layers × 2 (K and V) — typically ~0.1% of base model
- Inference overhead: small — each attention computation includes the prefix tokens
- Works well for: generation tasks, summarisation, translation
- Weakness: harder to optimise than LoRA; can be unstable on small models; prefix length is a difficult hyperparameter
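As a worked example of the parameter formula above, the sketch below plugs in hypothetical numbers: a 32-layer model with dimension 4096, roughly 7B base parameters, and a prefix of 20 virtual tokens. None of these figures come from a specific model, and the count covers only the prefix vectors themselves.

```python
def prefix_tuning_params(prefix_len, d_model, num_layers):
    """prefix length x model dimension x num layers x 2 (one prefix each for K and V)."""
    return prefix_len * d_model * num_layers * 2

# Hypothetical configuration, chosen only for illustration.
prefix_len, d_model, num_layers = 20, 4096, 32
base_params = 7_000_000_000

prefix_params = prefix_tuning_params(prefix_len, d_model, num_layers)
prefix_share = 100 * prefix_params / base_params

print(f"{prefix_params:,} trainable parameters ({prefix_share:.3f}% of base)")
# prints: 5,242,880 trainable parameters (0.075% of base)
```

Under these assumptions the result lands close to the ~0.1% figure quoted above, and it scales linearly with prefix length.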
Prompt Tuning
The simplest of the PEFT methods. Prepends a small number of trainable continuous vectors (soft prompt tokens) to the input embeddings. Only these prompt embeddings are trained; the rest of the model, including its own embedding table, stays frozen. At very large model scales (11B+), prompt tuning was shown to approach full fine-tuning quality.
- Parameters: prompt length × embedding dimension — typically <0.01% of base model
- Inference overhead: the prompt tokens add to sequence length
- Works well for: classification tasks on very large models
- Weakness: significantly underperforms LoRA on smaller models; unstable optimisation
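The mechanism is small enough to sketch directly. In the plain-Python example below, the tiny dimensions, the constant values, and the names `prepend_soft_prompt` and `prompt_tuning_params` are illustrative assumptions; the final count reuses a hypothetical model dimension of 4096 and a base of roughly 7B parameters.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Place the trainable prompt vectors in front of the frozen token embeddings."""
    return soft_prompt + token_embeddings

def prompt_tuning_params(prompt_len, d_model):
    """prompt length x embedding dimension."""
    return prompt_len * d_model

d_model = 4                                              # toy embedding size
soft_prompt = [[0.1] * d_model for _ in range(3)]        # 3 trainable vectors
token_embeddings = [[1.0] * d_model for _ in range(5)]   # 5 real tokens, frozen

model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(model_input))   # 8: the sequence grows by the prompt length

# Hypothetical larger configuration for the parameter count.
prompt_params = prompt_tuning_params(prompt_len=20, d_model=4096)
prompt_share = 100 * prompt_params / 7_000_000_000
print(f"{prompt_params:,} trainable parameters ({prompt_share:.5f}% of base)")
# prints: 81,920 trainable parameters (0.00117% of base)
```

The longer sequence is the only inference cost, and the parameter count has no per-layer factor, which is why it is so much smaller than the other methods.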
Why LoRA dominates in practice
| Method | Params | Inference Overhead | Stability | Production Use |
|---|---|---|---|---|
| LoRA | 0.1–2% | Zero (after merge) | High | Dominant |
| QLoRA | 0.1–2% | Zero (after merge) | High | Dominant (low VRAM) |
| Adapters | 1–8% | Permanent (extra layers on every forward pass) | High | Multi-task setups |
| Prefix Tuning | ~0.1% | Small | Medium | Niche |
| Prompt Tuning | <0.01% | Minimal | Low on small models | Mostly replaced by LoRA |
LoRA wins on the combination of four properties: zero inference overhead after merging, stable optimisation across model sizes, flexibility in rank selection, and the ability to train on top of a quantised base model (QLoRA, which keeps the base weights quantised and trains LoRA adapters on top). No other PEFT method covered here checks all four boxes.