
QLoRA: Fine-Tuning 70B Models on a Single GPU

How 4-bit quantisation + LoRA adapters make large model fine-tuning accessible. NF4 quantisation explained, double quantisation, bfloat16 adapters, and what you lose vs. full LoRA.

LoRA reduced fine-tuning VRAM requirements dramatically — but even with LoRA, a 70B model requires ~140GB of VRAM just to load the model weights in fp16. That's still two A100-80GB cards minimum, before gradients and optimiser state. For most teams, 70B fine-tuning was still out of reach.

QLoRA, published by Tim Dettmers and colleagues in May 2023, broke that wall. By quantising the frozen base model to 4-bit and training LoRA adapters on top of the quantised weights, QLoRA made it possible to fine-tune a 65B-class model (the paper's headline result; 70B models land in the same range) on a single 48GB GPU, and a 13B model on a single consumer RTX 3090. The paper reported quality on par with 16-bit fine-tuning on its benchmarks.

How QLoRA stacks the savings

Approach	70B model VRAM (training)	13B model VRAM (training)
Full fine-tuning (fp16)	~560GB	~104GB
LoRA (fp16 base)	~280GB (gradients only on adapters)	~52GB
QLoRA (NF4 base + bf16 adapters)	~48GB	~12GB
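
The weight-only part of these figures is simple arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement: it counts decimal gigabytes, ignores activations and framework overhead, and assumes full fine-tuning keeps weights, gradients, and both Adam moments at 2 bytes each. The function names are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures in the table


def weights_gb(n_params, bits_per_param):
    """Memory needed just to hold the weights at a given precision."""
    return n_params * bits_per_param / 8 / GB


def full_finetune_gb(n_params, bytes_per_value=2):
    """Assumption: weights + gradients + Adam momentum + Adam variance,
    each stored at bytes_per_value. Activations are ignored."""
    return n_params * bytes_per_value * 4 / GB


for name, n_params in [("70B", 70e9), ("13B", 13e9)]:
    print(
        f"{name}: fp16 weights {weights_gb(n_params, 16):.1f} GB, "
        f"4-bit weights {weights_gb(n_params, 4):.1f} GB, "
        f"full fine-tuning state {full_finetune_gb(n_params):.0f} GB"
    )
```

Under these assumptions the 4-bit base alone takes a quarter of the fp16 footprint; the rest of the QLoRA budget goes to adapters, activations, and quantisation constants.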

NF4: the right quantisation format for weights

QLoRA uses NormalFloat 4-bit (NF4) quantisation for the frozen base model weights. NF4 is designed specifically for zero-centred, normally distributed values, which is approximately how trained LLM weights are distributed. Unlike INT4 (which divides the numeric range into equal buckets), NF4 allocates more buckets near zero where most weight values cluster, reducing quantisation error.

The QLoRA paper presents NF4 as information-theoretically optimal for zero-centred, normally distributed weights. The base model weights are stored in 4-bit and dequantised to bf16 on the fly whenever a layer is used, in both the forward and backward passes. Only the LoRA adapter weights stay in bf16 throughout; they are small enough that this adds minimal memory overhead.
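
The idea of placing levels at normal quantiles can be sketched in a few lines. This is a simplified illustration, not the NF4 codebook that bitsandbytes ships (the real one includes an exact zero, for instance); `normal_quantile_levels`, `uniform_levels`, `quantise_block`, and `dequantise_block` are hypothetical helper names.

```python
from statistics import NormalDist


def normal_quantile_levels(k=16):
    """k levels at equal-probability quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def uniform_levels(k=16):
    """k evenly spaced levels on [-1, 1], as in a plain INT4-style grid."""
    return [-1 + 2 * i / (k - 1) for i in range(k)]


def quantise_block(block, levels):
    """Scale a block by its absolute maximum, then snap each value to the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero on an all-zero block
    codes = [
        min(range(len(levels)), key=lambda j: abs(levels[j] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantise_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weight values."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
print("gap between the two central levels:", round(gaps[7], 3))
print("gap between the two outermost levels:", round(gaps[0], 3))
```

The quantile-based levels are tightly packed around zero and sparse at the extremes, whereas `uniform_levels()` spaces all sixteen evenly.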

Double quantisation

QLoRA also introduces double quantisation: quantising the quantisation constants themselves. Standard block-wise quantisation stores one fp32 scaling factor per block of 64 weights, an overhead of 0.5 bits per parameter. Double quantisation quantises these scaling factors to 8-bit, cutting the overhead to roughly 0.127 bits per parameter. The saving of about 0.37 bits per parameter is small but meaningful at 70B scale, where it comes to around 3GB.
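
The per-parameter overhead can be worked out directly. The sketch below assumes the block sizes described in the QLoRA paper: 64 weights per first-level block and 256 first-level constants per second-level block. The variable names are illustrative.

```python
BLOCK_SIZE = 64            # weights sharing one scaling factor
SECOND_LEVEL_BLOCK = 256   # scaling factors sharing one second-level fp32 constant

# Without double quantisation: one fp32 constant per block of weights.
overhead_plain = 32 / BLOCK_SIZE

# With double quantisation: one 8-bit constant per block of weights, plus
# one fp32 constant amortised over 256 of those 8-bit constants.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

saving_bits = overhead_plain - overhead_double
saving_gb_70b = 70e9 * saving_bits / 8 / 1e9  # decimal GB for a 70B-parameter model

print(f"overhead without double quantisation: {overhead_plain:.3f} bits/param")
print(f"overhead with double quantisation:    {overhead_double:.3f} bits/param")
print(f"saving at 70B parameters:             {saving_gb_70b:.2f} GB")
```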

Paged optimisers

A third QLoRA innovation: paged optimisers, built on NVIDIA unified memory. The optimiser state (momentum and variance for Adam) is allocated in paged memory, so when the GPU runs short it is evicted to CPU RAM and paged back into VRAM when the optimiser step needs it. This prevents OOM errors on memory spikes during training without requiring the entire optimiser state to fit in VRAM at once.
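
With the Hugging Face Trainer, a paged optimiser is usually selected by name rather than constructed by hand. The fragment below is a configuration sketch only: it assumes `transformers` and `bitsandbytes` are installed, and the output path, batch size, accumulation steps, and learning rate are placeholder values rather than recommendations.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-out",            # hypothetical output path
    optim="paged_adamw_32bit",         # paged AdamW backed by bitsandbytes
    per_device_train_batch_size=1,     # illustrative values only
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
)
```

An 8-bit variant, `paged_adamw_8bit`, is also available if optimiser state needs to shrink further.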

QLoRA setup

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 — optimal for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,     # double quantisation
)

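# Placeholder: substitute the Hugging Face ID of the checkpoint you are fine-tuning
model_id = "your-org/your-model"

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)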

# Prepare for k-bit training (cast LayerNorm to fp32, etc.)
model = prepare_model_for_kbit_training(model)

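# Apply LoRA on top of quantised model
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],   # attention projections that receive adapters
    # ... (remaining LoraConfig arguments elided in the original)
)
model = get_peft_model(model, lora_config)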
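
To see why the adapters add so little memory, it helps to count their parameters. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` parameters. The sketch below uses hypothetical shapes (hidden size 8192, 80 layers, square `q_proj` and `v_proj`) chosen purely for illustration, not read from any specific checkpoint.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Hypothetical shapes for illustration only.
hidden = 8192
n_layers = 80
r = 16

per_layer = 2 * lora_param_count(hidden, hidden, r)  # q_proj + v_proj, assumed square
total = n_layers * per_layer
adapter_mb_bf16 = total * 2 / 1e6                    # 2 bytes per bf16 value

print(f"{total:,} adapter parameters, about {adapter_mb_bf16:.0f} MB in bf16")
```

On a real model, calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual trainable count.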

What you lose with QLoRA

Training speed: QLoRA is roughly 30% slower than fp16 LoRA, because the base weights must be dequantised in both the forward and backward passes
Slight quality gap: on complex reasoning tasks, QLoRA fine-tuned models show a small but measurable quality reduction vs. 16-bit LoRA
Quantisation noise: NF4 introduces small quantisation errors that can compound in very deep models, though this matters little for most practical tasks

For most production use cases, the quality difference between QLoRA and 16-bit LoRA is negligible on task-specific benchmarks. Start with QLoRA: it unlocks fine-tuning for teams without enterprise GPU clusters, and the 30% speed penalty is a reasonable tradeoff.

