
LoRA From Scratch: Low-Rank Adaptation in 30 Lines of PyTorch

Implement the LoRA adapter: freeze the base weights, attach two small matrices A and B, train only the adapter. Count parameters before and after. Merge the adapter for zero-overhead inference. The paper made concrete.

Full fine-tuning of a 7B-parameter model updates all 7 billion weights. At FP16 that is 14 GB of weight updates, and an Adam-style optimizer keeps two states per parameter, adding another 28 GB even at 16-bit precision. LoRA (Hu et al., 2021) cuts this cost with one observation: weight updates during fine-tuning tend to have low intrinsic rank. Instead of learning a full delta matrix ΔW of shape (d_out × d_in), learn two small matrices B (d_out × r) and A (r × d_in), where r is tiny, and set ΔW = BA. For r=8 and a 4096×4096 layer, that is 8 × (4096 + 4096) = 65,536 parameters instead of 16,777,216, a 256× reduction.
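
To make the arithmetic concrete, here is a minimal sketch that reproduces the counts for a single weight matrix. The helper name `lora_param_count` is hypothetical (it is not from the paper or any library), and it ignores biases and every other layer in the model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) parameter counts for one weight matrix."""
    full = d_in * d_out          # dense delta matrix
    lora = r * d_in + r * d_out  # the two thin factors combined
    return full, lora


full, lora = lora_param_count(4096, 4096, r=8)
print(f"full delta: {full:,}")         # 16,777,216
print(f"LoRA A + B: {lora:,}")         # 65,536
print(f"reduction:  {full // lora}x")  # 256x
```

Because the adapter grows linearly in r while the dense matrix grows with d_in × d_out, doubling the rank to 16 still leaves a 128× reduction for this layer.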

Why low-rank updates work

The pre-trained model has learned rich priors. Fine-tuning nudges them toward a specific task. The empirical claim (supported by the experiments in the LoRA paper) is that this nudge lives in a low-dimensional subspace of the full weight matrix. Gradient descent during fine-tuning is mostly making low-rank moves. LoRA constrains the update to rank at most r and lets training choose the subspace, discarding directions that would contribute little anyway.
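
As a small illustration of what "constraining the update to low rank" means structurally, the sketch below builds an update from two thin random factors and checks its rank. The sizes (d=64, r=4) and the random factors are assumptions chosen for the demo; it shows the constraint itself, not the empirical claim about fine-tuning.

```python
import torch

torch.manual_seed(0)
d, r = 64, 4                      # illustrative sizes, chosen for this demo

B = torch.randn(d, r)             # tall factor
A = torch.randn(r, d)             # wide factor
delta_W = B @ A                   # full (d x d) shape, but rank can be at most r

rank = torch.linalg.matrix_rank(delta_W).item()
singular_values = torch.linalg.svdvals(delta_W)

print(delta_W.shape)              # torch.Size([64, 64])
print(rank)                       # never more than r
print(singular_values[:r + 2])    # first r values dominate, the rest are near zero
```

The matrix has 4,096 entries, yet everything it can express is determined by the 512 numbers in B and A.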

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base   = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)        # freeze base

        self.lora_A = nn.Linear(in_features,  rank,          bias=False)
        self.lora_B = nn.Linear(rank,          out_features,  bias=False)
        self.scaling = alpha / rank

        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)            # zero init → no change at start

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

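    @torch.no_grad()
    def merge(self):
        """Fold the LoRA update into the base weight and return the plain nn.Linear.

        Use the returned layer for inference: calling forward() on this wrapper
        after merging would apply the update twice.
        """
        delta_W = (self.lora_B.weight @ self.lora_A.weight) * self.scaling
        self.base.weight += delta_W
        return self.base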

# Parameter count comparison
d = 4096
full  = nn.Linear(d, d, bias=False)
lora  = LoRALinear(d, d, rank=8, alpha=16)

full_p  = sum(p.numel() for p in full.parameters())
train_p = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Full fine-tuning:  {full_p:,} trainable parameters")
print(f"LoRA trainable:    {train_p:,}  ({train_p/full_p*100:.2f}%)")
print(f"Ratio: {full_p//train_p}x fewer gradients to store")

Rank, alpha, and which modules to target

Rank r controls expressiveness. r=4 or r=8 is a common starting point and is often enough. Higher rank is not always better: it adds parameters and can overfit on small datasets. Alpha controls the scale: effective_update = (B@A) * (alpha/r). Many practitioners use alpha=2r as a default, which keeps the scale factor at 2 whatever the rank.
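
The interaction between alpha and r is easy to tabulate. This is a minimal sketch; `lora_scaling` is a hypothetical helper that mirrors the `scaling = alpha / rank` line in LoRALinear, and the rank values are arbitrary examples.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scale applied to B @ A, matching scaling = alpha / rank in LoRALinear."""
    return alpha / rank


ranks = (4, 8, 16, 32)

fixed_alpha = {r: lora_scaling(16, r) for r in ranks}     # alpha held at 16
tied_alpha = {r: lora_scaling(2 * r, r) for r in ranks}   # alpha = 2r

print(fixed_alpha)   # {4: 4.0, 8: 2.0, 16: 1.0, 32: 0.5}
print(tied_alpha)    # {4: 2.0, 8: 2.0, 16: 2.0, 32: 2.0}
```

With alpha held fixed, raising the rank shrinks the scale, so changing r alone also changes how strongly the adapter output is weighted. Tying alpha to r removes that coupling.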

Standard targets: Q, K, V, O attention matrices. Adding FFN layers often improves performance at a small cost. Do not target embedding layers unless you are doing domain adaptation with unusual vocabulary.
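
Choosing targets usually comes down to matching module names. The sketch below is an assumption-laden toy: `ToyAttention`, `ToyBlock`, and `find_targets` are hypothetical, and the layer names (`q_proj`, `up_proj`, and so on) mirror one common naming convention. Real architectures differ, so inspect `model.named_modules()` before relying on any name.

```python
import torch.nn as nn


class ToyAttention(nn.Module):
    """Stand-in for an attention block; holds only the four projections."""
    def __init__(self, d=32):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.o_proj = nn.Linear(d, d, bias=False)


class ToyBlock(nn.Module):
    """Stand-in for a transformer block: attention plus a two-layer FFN."""
    def __init__(self, d=32):
        super().__init__()
        self.attn = ToyAttention(d)
        self.up_proj = nn.Linear(d, 4 * d, bias=False)
        self.down_proj = nn.Linear(4 * d, d, bias=False)


def find_targets(model, suffixes):
    """Return dotted names of nn.Linear modules whose last name part is in suffixes."""
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and name.split(".")[-1] in suffixes
    ]


model = ToyBlock()
attention = {"q_proj", "k_proj", "v_proj", "o_proj"}

attn_only = find_targets(model, attention)
with_ffn = find_targets(model, attention | {"up_proj", "down_proj"})

print(attn_only)       # the four attention projections
print(len(with_ffn))   # 6
```

Each returned name identifies one layer that would be swapped for a LoRA-wrapped version; everything else stays frozen and untouched.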

Merging for zero-overhead inference

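merge() adds the LoRA update into the base weights: base.weight += (B@A) * scaling. After merging, the layer does exactly the same amount of compute as the original, with no overhead per forward pass. Unmerged adapters are small, so many of them (one per user, one per domain) can be stored separately and swapped at serving time over a single shared base model. To swap out an adapter that has already been merged, subtract its update or reload the base weights. This is what makes multi-tenant fine-tuned serving practical.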
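
The equivalence between the adapter path and the merged weight can be checked numerically. This sketch uses raw tensors rather than the LoRALinear class; the sizes are arbitrary, B is random (standing in for a trained adapter), and float64 is used only so rounding does not cloud the comparison.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 12, 4, 8
scaling = alpha / r
f64 = torch.float64

W = torch.randn(d_out, d_in, dtype=f64)   # frozen base weight, nn.Linear layout (out, in)
A = torch.randn(r, d_in, dtype=f64)       # stands in for lora_A.weight
B = torch.randn(d_out, r, dtype=f64)      # stands in for a trained lora_B.weight
x = torch.randn(5, d_in, dtype=f64)

# Unmerged: base path plus adapter path (two extra matmuls per call).
y_adapter = x @ W.T + scaling * (x @ A.T) @ B.T

# Merged: fold the update into W once, then a single matmul per call.
delta_W = scaling * (B @ A)
W_merged = W + delta_W
y_merged = x @ W_merged.T

# Unmerge: subtract the same update to recover the base weight.
W_restored = W_merged - delta_W

print(torch.allclose(y_adapter, y_merged))
print(torch.allclose(W_restored, W))
```

Subtracting `delta_W` is what allows one adapter to be swapped for another without reloading the base model, although repeated merge and unmerge cycles in low precision can accumulate rounding error.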

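Inject LoRALinear into a HuggingFace model's attention layers (monkey-patch self.q_proj, self.v_proj). Because LoRALinear creates a fresh, randomly initialized base layer, copy each pretrained projection weight into base.weight when you swap it in. Count trainable parameters. Run a forward+backward pass. Verify that base.weight.grad is None and lora_A.weight.grad is populated (on the very first step it is all zeros, because B starts at zero, while lora_B.weight.grad is nonzero). This confirms the freeze is working: the base weights receive no gradient update.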
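
One way to rehearse this check without downloading a model is to run it on a toy module first. Everything here is a sketch under stated assumptions: `ToyAttention` is a stand-in rather than a HuggingFace class, `wrap_linear` is a hypothetical helper, and LoRALinear is repeated in compact form only so the block runs on its own.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Compact copy of the adapter defined earlier, repeated for self-containment."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))


def wrap_linear(linear, rank=8, alpha=16):
    """Hypothetical helper: wrap a pretrained bias-free nn.Linear, keeping its weight."""
    assert linear.bias is None, "this sketch only handles bias-free projections"
    wrapped = LoRALinear(linear.in_features, linear.out_features, rank, alpha)
    with torch.no_grad():
        wrapped.base.weight.copy_(linear.weight)   # reuse the pretrained weight
    return wrapped


class ToyAttention(nn.Module):
    """Stand-in with q_proj and v_proj attributes; not real attention."""
    def __init__(self, d=32):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)

    def forward(self, x):
        return self.q_proj(x) + self.v_proj(x)


torch.manual_seed(0)
attn = ToyAttention()
for p in attn.parameters():
    p.requires_grad_(False)                 # freeze the whole "pretrained" model

attn.q_proj = wrap_linear(attn.q_proj)      # monkey-patch the two projections
attn.v_proj = wrap_linear(attn.v_proj)

attn(torch.randn(4, 32)).sum().backward()

trainable = sum(p.numel() for p in attn.parameters() if p.requires_grad)
print("trainable params:", trainable)       # 2 layers * (8*32 + 32*8) = 1024
print("base grad:", attn.q_proj.base.weight.grad)   # None
print("lora_A grad all zero:", bool((attn.q_proj.lora_A.weight.grad == 0).all()))
print("lora_B grad nonzero:", bool(attn.q_proj.lora_B.weight.grad.abs().sum() > 0))
```

On a real model the same pattern applies, but projections that carry a bias need that bias preserved, which this sketch deliberately refuses to handle.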
