Why Full Fine-Tuning Destroys What Pretraining Built

Pretraining encodes world knowledge through gradient updates computed over an enormous, diverse corpus. Full fine-tuning on a narrow task applies thousands more gradient steps that overwrite those same weights. The model improves at the task and loses general ability along the way: catastrophic forgetting.

The model scores 94% on the customer service evaluation. The team ships it. Three days later a user asks "what is photosynthesis?" and the model replies: "I'm sorry, I can only help with billing and account questions." The team checks other general knowledge queries. Same pattern. The model has forgotten the world.

This is catastrophic forgetting, and it is not a bug in the training code. It is the mathematically expected outcome of full fine-tuning applied to a large pretrained model.

Pretraining is expensive precisely because it is thorough. In a model trained on a trillion tokens of diverse text, every one of those tokens has contributed gradient signal, each update nudging the weights toward better next-token prediction across an enormous range of domains, styles, and topics. The resulting weights encode world knowledge implicitly: facts, causal relationships, syntax across languages, scientific concepts, coding patterns. None of this is stored in named slots. It is distributed across the weights by optimization pressure applied to a massive corpus over weeks of compute time.

Full fine-tuning takes these weights and runs gradient descent again, this time on a small, narrow dataset. Thousands of gradient steps, each computed on customer service transcripts, push the weights toward minimizing loss on that specific distribution. The optimizer has no way to know or care that the previous parameter values encoded something important. It updates wherever the gradient points. The longer fine-tuning runs on the narrow dataset, the further the weights shift toward specialization, and the more the broadly distributed pretrained knowledge erodes.
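The mechanism can be reproduced at toy scale. The sketch below is not a transformer: it assumes a single linear model whose "pretrained" weights already fit a broad dataset, then runs plain gradient descent on a narrow dataset that only exercises two input dimensions. The names w_pretrained, w_task, X_broad and X_narrow are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Stand-in for pretrained weights: assumed to already fit the broad data.
w_pretrained = rng.normal(size=d)

# Broad data: inputs spread over all d dimensions, labels from w_pretrained.
X_broad = rng.normal(size=(512, d))
y_broad = X_broad @ w_pretrained

# Narrow data: inputs use only 2 of the d dimensions, and the task wants
# a different mapping there (w_task is a hypothetical task optimum).
w_task = w_pretrained.copy()
w_task[:2] += 3.0
X_narrow = np.zeros((128, d))
X_narrow[:, :2] = rng.normal(size=(128, 2))
y_narrow = X_narrow @ w_task


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


w = w_pretrained.copy()
broad_before = mse(w, X_broad, y_broad)
narrow_before = mse(w, X_narrow, y_narrow)

# Full fine-tuning: every weight is trainable and only the narrow loss is seen.
lr = 0.05
for _ in range(500):
    grad = 2 * X_narrow.T @ (X_narrow @ w - y_narrow) / len(y_narrow)
    w -= lr * grad

broad_after = mse(w, X_broad, y_broad)
narrow_after = mse(w, X_narrow, y_narrow)

print(f"narrow loss: {narrow_before:.3f} -> {narrow_after:.3f}")
print(f"broad  loss: {broad_before:.3f} -> {broad_after:.3f}")
```

The narrow loss falls while the broad loss rises, even though nothing in the loop is "wrong". The optimizer only ever sees the narrow objective, so the weights it shares with the broad task move wherever that objective sends them.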

Pretrained weight matrix W  (4096 × 4096 = 16,777,216 parameters)

Full fine-tuning:
  W_new = W + ΔW   ← ΔW has the same shape as W
  All 16.8M values updated on each gradient step
  After 10,000 steps on narrow data:
    customer service task     → 94% accuracy  ✓
    "what is photosynthesis?" → "I only handle billing"  ✗

LoRA (rank r = 8):
  W stays frozen           ← never modified
  Train A: 4096 × 8  =  32,768 params
      + B: 8 × 4096  =  32,768 params
  Effective weight:  W + A×B
  Only 65,536 of 16.8M parameters trained (0.39%)
  Pretrained knowledge preserved in frozen W
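The parameter arithmetic above is easy to check. This is a small sketch in plain Python. The helper lora_param_counts is a hypothetical name, and the extra ranks in the loop are only there to show how the trainable share scales.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> dict:
    """Count trainable parameters for one layer: full update vs. LoRA factors."""
    full = d_in * d_out                   # delta W has the same shape as W
    lora = d_in * rank + rank * d_out     # A is d_in x rank, B is rank x d_out
    return {"full": full, "lora": lora, "percent": 100 * lora / full}


c = lora_param_counts(4096, 4096, 8)
print(f"{c['lora']} of {c['full']} ({c['percent']:.2f}%)")
# 65536 of 16777216 (0.39%)

# The trainable share grows linearly with the rank.
for rank in (4, 8, 16, 64):
    c = lora_param_counts(4096, 4096, rank)
    print(rank, c["lora"], f"{c['percent']:.2f}%")
```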

LoRA limits catastrophic forgetting because it never modifies W. Instead of updating the pretrained weights, it trains two small matrices whose product is a low-rank, task-specific delta. At inference time the effective weight is the frozen W plus the adapter's contribution, A×B. The knowledge behind general questions still lives in W, which is unchanged, and a rank-8 delta has far less capacity to pull the layer away from its pretrained behavior than an unconstrained ΔW. The adapter adds the specialized behavior on top, and removing it restores the original model exactly.
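A minimal NumPy sketch of that idea follows, assuming a single linear layer, a squared-error loss, and hand-written gradients. It uses the same naming as the text (A is d × r, B is r × d) and starts B at zero so the adapted layer initially matches the pretrained one. The sizes, learning rate, and synthetic target are all assumptions made for the illustration. Real LoRA implementations typically also scale the adapter term, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 2, 64

W = rng.normal(size=(d, d)) / np.sqrt(d)   # stand-in for a frozen pretrained layer
W_snapshot = W.copy()

# Adapter factors. B = 0 means W + A @ B == W before training starts.
A = rng.normal(size=(d, r)) / np.sqrt(d)
B = np.zeros((r, d))

# Hypothetical task: targets come from W plus a small rank-r shift.
X = rng.normal(size=(n, d))
delta_true = (rng.normal(size=(d, r)) @ rng.normal(size=(r, d))) / d
T = X @ (W + delta_true)


def forward(X):
    # Equivalent to X @ (W + A @ B), without forming the d x d product.
    return X @ W + (X @ A) @ B


def loss():
    return 0.5 * float(np.sum((forward(X) - T) ** 2)) / n


loss_before = loss()
lr = 0.02
for _ in range(1000):
    err = forward(X) - T                  # shape (n, d)
    grad_A = X.T @ (err @ B.T) / n        # shape (d, r)
    grad_B = (X @ A).T @ err / n          # shape (r, d)
    A -= lr * grad_A                      # only the adapter is updated
    B -= lr * grad_B
loss_after = loss()

print(f"task loss: {loss_before:.4f} -> {loss_after:.4f}")
print("W unchanged:", np.array_equal(W, W_snapshot))
print("rank of learned delta:", np.linalg.matrix_rank(A @ B))
```

No gradient is ever computed for W, so it cannot drift, and the learned delta A @ B can have rank at most r regardless of how long training runs.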

This is also why LoRA adapters are small (65,536 parameters versus 16.8 million for a single layer) and why one base model can serve many adapters at once. The base knowledge is shared. Only the task-specific behavior is stored separately, and it is swappable.
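The sharing described here can be sketched with one base matrix and a dictionary of adapters. The task names "billing" and "legal" are hypothetical, and random matrices stand in for trained factors. A production serving stack would batch and route requests differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2

W = rng.normal(size=(d, d))   # shared frozen base layer, stored once

# One (A, B) pair per task. Random values stand in for trained factors.
adapters = {
    "billing": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "legal": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}


def forward(x, task=None):
    """Apply the base layer, plus the selected adapter if a task is given."""
    y = x @ W
    if task is not None:
        A, B = adapters[task]
        y = y + (x @ A) @ B
    return y


x = rng.normal(size=(1, d))
base_out = forward(x)                  # base behavior, no adapter
billing_out = forward(x, "billing")    # same W, billing adapter
legal_out = forward(x, "legal")        # same W, legal adapter

adapter_params = sum(A.size + B.size for A, B in adapters.values())
print("base params:", W.size, "| all adapter params:", adapter_params)
```

Switching tasks is a dictionary lookup rather than a model reload, and calling forward without a task gives the untouched base behavior.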

The customer service model that forgot photosynthesis was retrained with LoRA on the same dataset. Customer service accuracy came in at 93%, one point below the fully fine-tuned model. Photosynthesis came back. The only thing that changed was that the optimizer was never given permission to touch the weights that encoded it.

Full fine-tuning causes catastrophic forgetting not because something goes wrong, but because gradient descent on a narrow dataset has no reason to preserve the broadly distributed knowledge that pretraining baked into the weights.
