LoRA From Scratch: Low-Rank Adaptation in 30 Lines of PyTorch
Implement the LoRA adapter: freeze the base weights, attach two small matrices A and B, train only the adapter. Count parameters before and after. Merge the adapter for zero-overhead inference. The paper made concrete.
Fine-tuning a 7B parameter model updates 7 billion floats. At FP16 that is 14GB of weight updates, plus roughly another 28GB of optimizer state if the optimizer keeps two FP16 buffers per parameter, as Adam does. LoRA (Hu et al., 2021) makes this feasible with one observation: weight updates during fine-tuning have low intrinsic rank. Instead of learning a full delta matrix ΔW of shape (d_out × d_in), learn two small matrices A (r × d_in) and B (d_out × r) where r is tiny, so that ΔW = BA. For r=8 and a 4096×4096 layer: 8×8192 = 65,536 parameters instead of 16,777,216, a 256× reduction.
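The arithmetic above can be reproduced in a few lines. This is a rough sketch, assuming 2 bytes per FP16 value and an optimizer that keeps two buffers per parameter at the same precision (an assumption; setups that keep FP32 optimizer state would double that figure). The helper names are illustrative only.

```python
def full_finetune_gb(n_params, bytes_per_value=2, optimizer_buffers=2):
    """Rough memory in GB (1e9 bytes) for weight updates and optimizer state."""
    updates = n_params * bytes_per_value / 1e9
    optimizer = n_params * bytes_per_value * optimizer_buffers / 1e9
    return updates, optimizer


def lora_params(d_in, d_out, r):
    """Trainable parameters in one adapter: one factor has r * d_in entries,
    the other has r * d_out."""
    return r * (d_in + d_out)


updates_gb, optimizer_gb = full_finetune_gb(7_000_000_000)
print(f"weight updates: {updates_gb:.0f} GB, optimizer state: {optimizer_gb:.0f} GB")

full = 4096 * 4096
adapter = lora_params(4096, 4096, r=8)
print(f"full: {full:,}  LoRA: {adapter:,}  reduction: {full // adapter}x")
```

Changing `r` or the layer shape in `lora_params` shows how quickly the adapter grows: the count is linear in r, while the full matrix is fixed at d_in × d_out.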
Why low-rank updates work
The pre-trained model has learned rich priors. Fine-tuning nudges them toward a specific task. The empirical claim (supported by experiments in the LoRA paper) is that this nudge lives in a low-dimensional subspace of the full weight matrix. Gradient descent during fine-tuning is mostly making low-rank moves. LoRA constrains the update to rank at most r and lets training pick the subspace, giving up directions that would rarely be used anyway.
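The structural half of that claim is easy to check: whatever values the two factors take, their product has rank at most r. The sketch below uses random factors with arbitrary illustrative sizes. It says nothing about whether real fine-tuning updates are low-rank, only that a LoRA update cannot be anything else.

```python
import torch

torch.manual_seed(0)
d, r = 64, 4

# Two thin factors: B is (d x r), A is (r x d). float64 keeps the rank test clean.
B = torch.randn(d, r, dtype=torch.float64)
A = torch.randn(r, d, dtype=torch.float64)
delta_W = B @ A  # full-size (d x d) update built from only 2*d*r numbers

rank = torch.linalg.matrix_rank(delta_W).item()
print(f"delta_W shape: {tuple(delta_W.shape)}, rank: {rank}")

# Singular values past the first r are zero up to floating-point error.
singular_values = torch.linalg.svdvals(delta_W)
print(f"largest: {singular_values[0].item():.3f}")
print(f"(r+1)-th: {singular_values[r].item():.2e}")
```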
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # freeze base
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # zero init → no change at start

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

    def merge(self):
        """Merge LoRA into base for inference (zero overhead)."""
        delta_W = (self.lora_B.weight @ self.lora_A.weight) * self.scaling
        self.base.weight.data += delta_W
        return self.base


# Parameter count comparison
d = 4096
full = nn.Linear(d, d, bias=False)
lora = LoRALinear(d, d, rank=8, alpha=16)
full_p = sum(p.numel() for p in full.parameters())
train_p = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Full fine-tuning: {full_p:,} trainable parameters")
print(f"LoRA trainable: {train_p:,} ({train_p/full_p*100:.2f}%)")
print(f"Ratio: {full_p//train_p}x fewer gradients to store")
Rank, alpha, and which modules to target
Rank r controls expressiveness. r=4 or r=8 is a common starting point. Higher rank is not always better: it means more parameters and potential overfitting on small datasets. Alpha controls the scale: effective_update = (B@A) * (alpha/r). Many practitioners use alpha=2r as a default.
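To see how the two knobs interact, the short sketch below tabulates the alpha/r multiplier for a few ranks. The helper name `lora_scaling` is illustrative, and the ranks are arbitrary examples rather than recommendations.

```python
def lora_scaling(alpha, rank):
    """Multiplier applied to B @ A, matching self.scaling in LoRALinear."""
    return alpha / rank


# Fixed alpha: the multiplier shrinks as the rank grows.
for rank in (4, 8, 16, 64):
    print(f"alpha= 16, r={rank:>2}: scaling={lora_scaling(16, rank):.2f}")

# alpha = 2r: the multiplier stays at 2 for every rank.
for rank in (4, 8, 16, 64):
    print(f"alpha={2 * rank:>3}, r={rank:>2}: scaling={lora_scaling(2 * rank, rank):.2f}")
```

One practical consequence: with alpha held fixed, changing the rank also changes the scale of the update, so a rank sweep is really varying two things at once. Tying alpha to r keeps the multiplier constant across the sweep.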
Standard targets: Q, K, V, O attention matrices. Adding FFN layers often improves performance at a small cost. Do not target embedding layers unless you are doing domain adaptation with unusual vocabulary.
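Choosing targets usually comes down to matching module names. The sketch below lists candidate layers in a toy block. The class names and the `k_proj`, `o_proj`, `up_proj`, and `down_proj` names are assumptions for illustration; real architectures name these layers differently, so inspect `named_modules()` on the actual model first.

```python
import torch.nn as nn


class ToyAttention(nn.Module):
    """Stand-in for an attention block; layer names are hypothetical."""
    def __init__(self, d=32):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.o_proj = nn.Linear(d, d, bias=False)


class ToyBlock(nn.Module):
    """Attention plus a feed-forward pair and an embedding table."""
    def __init__(self, d=32):
        super().__init__()
        self.attn = ToyAttention(d)
        self.up_proj = nn.Linear(d, 4 * d, bias=False)
        self.down_proj = nn.Linear(4 * d, d, bias=False)
        self.embed = nn.Embedding(100, d)


def find_lora_targets(model, target_names):
    """Qualified names of nn.Linear modules whose last name component matches."""
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and name.split(".")[-1] in target_names
    ]


block = ToyBlock()
attention_names = {"q_proj", "k_proj", "v_proj", "o_proj"}
attention_only = find_lora_targets(block, attention_names)
with_ffn = find_lora_targets(block, attention_names | {"up_proj", "down_proj"})
print(attention_only)
print(with_ffn)
```

Filtering on `nn.Linear` keeps the embedding table out of the target list even if its name were to match.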
Merging for zero-overhead inference
merge() adds the LoRA update into the base weights: base.weight += (B@A) * scaling. After merging, the layer costs exactly what the original did: a single matmul, with no overhead per forward pass. Multiple adapters (one per user, one per domain) can be stored separately and swapped at serving time; to swap a merged adapter, subtract its update or reload the base weights, then merge the next one. This is what makes multi-tenant fine-tuned serving practical.
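The equivalence behind merging can be checked with raw tensors. This is a minimal sketch with arbitrary small sizes; `W`, `A`, and `B` stand in for base.weight, lora_A.weight, and lora_B.weight in the nn.Linear layout (out_features × in_features), and B is random rather than zero to mimic a trained adapter.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 12, 4, 8
scaling = alpha / r

W = torch.randn(d_out, d_in, dtype=torch.float64)  # frozen base weight
A = torch.randn(r, d_in, dtype=torch.float64)      # stands in for lora_A.weight
B = torch.randn(d_out, r, dtype=torch.float64)     # stands in for lora_B.weight
x = torch.randn(5, d_in, dtype=torch.float64)

# Unmerged: base path plus the scaled adapter path (two extra small matmuls).
y_unmerged = x @ W.T + scaling * (x @ A.T) @ B.T

# Merged: fold the update into the weight, then use a single matmul.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T
print(torch.allclose(y_unmerged, y_merged))

# Swapping adapters: subtracting the same update recovers the base weight.
W_restored = W_merged - scaling * (B @ A)
print(torch.allclose(W_restored, W))
```

In lower precision the subtraction will not restore the base weight bit-for-bit, which is one reason to keep an untouched copy of the base weights when adapters are swapped often.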
Inject LoRALinear into a Hugging Face model's attention layers (monkey-patch self.q_proj, self.v_proj). Because LoRALinear creates a fresh nn.Linear as its base, copy the pretrained weight into base.weight when you swap the layer in. Count trainable parameters. Run a forward+backward pass. Verify that base.weight.grad is None and lora_A.weight.grad is populated. This confirms the freeze is working: the base weights receive no gradient update.
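The same checks can be rehearsed without downloading a model. The sketch below is an assumption-laden stand-in: `ToyAttention` is not a real Hugging Face module, and `LoRAWrapper` is a hypothetical variant of LoRALinear that wraps an existing layer, so the pretrained weight is reused rather than copied.

```python
import torch
import torch.nn as nn


class LoRAWrapper(nn.Module):
    """Hypothetical variant of LoRALinear that wraps an existing nn.Linear."""
    def __init__(self, base, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze weight (and bias, if present)
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))


class ToyAttention(nn.Module):
    """Stand-in for an attention layer; the forward pass is a placeholder."""
    def __init__(self, d=32):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)

    def forward(self, x):
        return self.q_proj(x) + self.v_proj(x)  # not real attention


torch.manual_seed(0)
attn = ToyAttention()
x = torch.randn(4, 32)
before = attn(x).detach()

# Monkey-patch the projections in place.
attn.q_proj = LoRAWrapper(attn.q_proj)
attn.v_proj = LoRAWrapper(attn.v_proj)

# Zero-initialised B means the patched model reproduces the original output.
print(torch.allclose(attn(x), before))

trainable = sum(p.numel() for p in attn.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")

attn(x).sum().backward()
print(attn.q_proj.base.weight.grad is None)        # frozen: no gradient stored
print(attn.q_proj.lora_A.weight.grad is not None)  # populated, but all zeros on
                                                   # the first step since B is zero
print(attn.q_proj.lora_B.weight.grad.abs().sum().item() > 0)  # non-zero gradient
```

With a real model, the same pattern applies once the actual attribute names of the projection layers are known.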