LoRA: One Base Model, N Adapters — What Actually Changes

LoRA freezes the pretrained weight matrix W and trains two small matrices, A (d×r) and B (r×d). The effective weight is W + A×B. At rank 8, the trainable parameters for a 4096×4096 matrix drop from roughly 16.8M to roughly 65K. Adapters can be hot-swapped at inference time.

The team needs to serve 200 customer-specific model variants. Each customer's model is a fine-tuned 70B. Storage math: at 2 bytes per parameter in fp16, one model is about 140 GB, so 200 variants × 140 GB = 28 terabytes, before considering inference serving. Loading a different model per request takes seconds. Keeping all 200 loaded simultaneously is prohibitively expensive. There has to be a different architecture.

Finding that architecture starts with understanding what fine-tuning actually changes in a large weight matrix.

A transformer's core computation is dominated by large linear layers, many of them square matrices of shape d_model × d_model. For a 70B-class model d_model is typically around 8192; the worked example below uses 4096 × 4096 to keep the numbers round. Full fine-tuning modifies every value in every one of these matrices. But there is an empirical observation that turns out to matter enormously: the updates that fine-tuning makes to these matrices are approximately low-rank. The change from a pretrained model to a task-specific model can be approximated well by the product of two much smaller matrices rather than by a full matrix of the original shape. This has been observed across a range of tasks and model families, although it is an empirical regularity and not a guarantee.

LoRA (Low-Rank Adaptation) is the method that exploits this observation. It freezes the original pretrained weight matrix W and instead trains two small matrices: A with shape d × r and B with shape r × d, where r is the rank and r is much smaller than d. At inference time the effective weight is W + A×B (the original LoRA formulation also multiplies the A×B term by a constant scaling factor, omitted here for simplicity). The pretrained knowledge lives in the frozen W. The task-specific adaptation lives entirely in the tiny A and B factors.
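The equivalence between "merge the adapter into W" and "keep W frozen and add the adapter path separately" can be checked with a toy example. This is a minimal sketch that uses plain Python lists in place of tensors, with toy sizes (d = 4, r = 2); the `matmul` and `add` helpers and all the numbers are illustrative assumptions, not part of any library.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


d, r = 4, 2  # toy sizes; the article's example uses d = 4096, r = 8

# Frozen pretrained weight W (d x d): here simply 2 on the diagonal.
W = [[2 if i == j else 0 for j in range(d)] for i in range(d)]
# Trainable adapter factors: A is d x r, B is r x d.
A = [[1, 0], [0, 1], [1, 1], [0, 2]]
B = [[1, 2, 0, 1], [0, 1, 3, 0]]

x = [[1, 2, 3, 4]]  # one input row vector (1 x d)

# Option 1: merge the adapter into the weight, then apply it.
merged = matmul(x, add(W, matmul(A, B)))
# Option 2: keep W untouched and add the low-rank path (x·A)·B.
unmerged = add(matmul(x, W), matmul(matmul(x, A), B))

print(merged)              # [[6, 25, 45, 12]]
print(merged == unmerged)  # True
```

Option 2 is the form that matters for serving many adapters: W is never modified, and the adapter path only ever multiplies by the thin d × r and r × d factors.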

Single attention weight matrix W: 4096 × 4096

Full fine-tuning:
  Trainable params:  4096 × 4096 = 16,777,216
  Storage (fp16):    ~32 MB per matrix

LoRA, rank r = 8:
  A: 4096 ×    8 =     32,768 params
  B:    8 × 4096 =     32,768 params
  Total:               65,536 params  (0.39% of full)
  Storage (fp16):   ~0.12 MB per matrix

200 customer adapters (all layers, full model):
  Full fine-tuning:  200 × ~140 GB = 28 TB
  LoRA at r=8:       200 × ~100 MB = 20 GB total

Serving model:  one 70B base loaded once (frozen)
                + customer adapter loaded per request (~100 MB, tens of ms)

Hot-swapping at inference means loading a different pair of A and B matrices without reloading W. The base model stays resident in GPU VRAM. For each request, the serving layer identifies the correct adapter, loads the A and B matrices for that customer (roughly 100 MB total across all layers), and applies them during the forward pass as an additive offset to the frozen weights. In practice the offset does not need to be materialized as a full d × d matrix: the layer can compute x·W and (x·A)·B separately and sum the results, so W itself is never modified. The latency cost is tens of milliseconds, not the multiple seconds of a full model load.
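One way to picture the per-request adapter lookup is a small in-memory cache sitting in front of adapter storage. The sketch below is an assumption-heavy illustration: the `AdapterCache` class, the `fake_loader` function, the customer names, and the least-recently-used eviction policy are all hypothetical, since the article does not say how adapters are cached or evicted.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps a bounded number of adapters in memory, evicting the least recently used."""

    def __init__(self, loader, capacity):
        self.loader = loader    # hypothetical callable: customer_id -> adapter weights
        self.capacity = capacity
        self._cache = OrderedDict()
        self.loads = 0          # counts slow-path loads, for illustration only

    def get(self, customer_id):
        if customer_id in self._cache:
            self._cache.move_to_end(customer_id)  # mark as most recently used
            return self._cache[customer_id]
        adapter = self.loader(customer_id)        # slow path: fetch A and B
        self.loads += 1
        self._cache[customer_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)       # evict the least recently used
        return adapter


def fake_loader(customer_id):
    # Stand-in for reading one customer's A and B matrices from storage.
    return {"customer": customer_id, "A": [[0]], "B": [[0]]}


cache = AdapterCache(fake_loader, capacity=2)
for customer in ["acme", "globex", "acme", "initech", "globex"]:
    cache.get(customer)

print(cache.loads)         # 4: the second "acme" request was a cache hit
print(list(cache._cache))  # ['initech', 'globex']
```

The base model never appears in this code because it is loaded once and left alone; only the small adapter entries move in and out of memory.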

The rank r controls the expressiveness of the adapter. At r = 1 the adapter can only represent rank-1 updates, a highly constrained set of changes. At r = 64 the adapter can represent richer adaptations, at higher storage and compute cost. For instruction-following and style adaptation tasks, r = 8 or r = 16 is usually sufficient. Tasks that require adding substantial factual knowledge benefit from higher ranks, but rarely need to exceed r = 64.
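To see how constrained r = 1 is, note that x·A collapses to a single number, so the adapter's contribution to any input is just a multiple of B's only row. A minimal sketch with toy values (d = 3; the `matmul` helper and all numbers are illustrative assumptions):

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Rank-1 adapter for d = 3: A is 3 x 1, B is 1 x 3.
A = [[1], [2], [-1]]
B = [[2, 0, 5]]

inputs = [[1, 0, 0], [0, 1, 0], [3, 1, 4], [7, -2, 1]]

deltas = []
for x in inputs:
    scale = matmul([x], A)[0][0]     # x·A is a single number when r = 1
    delta = matmul([[scale]], B)[0]  # adapter output = scale * B's only row
    deltas.append(delta)
    print(x, "->", delta)

# Every adapter output points along the same direction, [2, 0, 5].
print(deltas)  # [[2, 0, 5], [4, 0, 10], [2, 0, 5], [4, 0, 10]]
```

With a larger r the adapter can combine r such directions, which is where the extra expressiveness (and the extra parameters) come from.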

The team moved to LoRA. One 70B base model, 200 adapters totaling 20 GB. Inference cost dropped by an order of magnitude. Customer-specific behavior was preserved. The only thing that changed per customer was which adapter was active: in the 4096 × 4096 example, the 65,536 values in A and B that define the low-rank update A×B applied to each weight matrix during the forward pass.

LoRA works because fine-tuning updates are empirically close to low-rank. Freezing W and training only the small A and B matrices captures the task-specific delta while preserving everything pretrained, which makes it possible to serve hundreds of specialized variants from a single loaded base model.
