Knowledge Distillation at Scale: Teaching Small Models to Think Big
How to transfer capabilities from large frontier models to smaller, cheaper, deployable ones — and where it breaks down.
You have a frontier model that's too expensive to serve. Knowledge distillation is how you aim for roughly 80% of its quality at around 10% of the cost (a rule-of-thumb target, not a guarantee). Here's what the technique actually involves.
The Core Idea
Instead of training a student model only on hard labels (a single correct next token, encoded one-hot), distillation trains it to match the teacher model's output distribution: the full softmax over the vocabulary. These 'soft targets' contain more information than hard labels: the teacher's confidence, its second-best guesses, and implicit structure about token relationships.
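To make the difference concrete, here is a minimal sketch in plain Python that compares the two training signals for a single next-token prediction. The four-token vocabulary and all probabilities are made-up illustrative values, and `cross_entropy` and `kl_divergence` are helper names introduced here, not part of any library.

```python
import math

def cross_entropy(hard_label: int, student_probs: list[float]) -> float:
    """Loss against a one-hot label: only the correct token's probability matters."""
    return -math.log(student_probs[hard_label])

def kl_divergence(teacher_probs: list[float], student_probs: list[float]) -> float:
    """KL(teacher || student): penalizes mismatch on every token the teacher gives mass to."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

# Hypothetical 4-token vocabulary: ["cat", "kitten", "dog", "car"]
teacher = [0.70, 0.25, 0.04, 0.01]    # teacher treats "kitten" as a plausible alternative
student_a = [0.70, 0.25, 0.04, 0.01]  # matches the teacher on every token
student_b = [0.70, 0.01, 0.04, 0.25]  # same top-1 probability, but prefers "car" to "kitten"

# The hard-label loss cannot tell the two students apart...
hard_a = cross_entropy(0, student_a)
hard_b = cross_entropy(0, student_b)

# ...while the soft-target loss is zero for student_a and clearly positive for student_b.
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)
```

Both students assign 0.70 to the correct token, so a hard label gives them the same loss. Only the soft-target loss notices that `student_b` has learned the wrong relationship between the remaining tokens.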
Types of Distillation
| Type | What's Transferred | Cost | Quality |
|---|---|---|---|
| Output distillation | Final token probabilities | Low (inference only) | Good |
| Feature distillation | Intermediate hidden states | Medium | Better |
| Sequence-level KD | Generated sequences as training data | Low (generate once) | Good |
| Online distillation | Teacher generates responses during student training | High | Best |
Sequence-Level Distillation at Scale
The most practical form in 2025: use a frontier teacher (GPT-4, Claude, Gemini) to generate a large synthetic dataset, then fine-tune a smaller student on those generations. This is how Alpaca (GPT-3.5 → LLaMA), Vicuna, and many open instruction models were built. The student's quality is bounded by the teacher's on the distilled tasks, but it often lands surprisingly close.
# Sequence-level distillation pipeline (pseudocode: teacher_api, student,
# and prompt_dataset stand in for your own API client, model, and data)
teacher_responses = []
for prompt in prompt_dataset:
    response = teacher_api.generate(prompt)  # e.g. GPT-4 or Claude
    teacher_responses.append({"prompt": prompt, "response": response})
# Fine-tune the student on the teacher-generated data
student.fine_tune(teacher_responses, method="LoRA")
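In practice the collection step usually needs some hygiene before fine-tuning. The following is a minimal sketch under stated assumptions: `build_distillation_set`, `to_jsonl`, and `fake_teacher` are hypothetical names introduced here, the teacher is passed in as a plain callable so no specific vendor API is assumed, and the `min_chars` threshold is an arbitrary illustrative choice.

```python
import json
from typing import Callable, Iterable

def build_distillation_set(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    min_chars: int = 20,
) -> list[dict]:
    """Collect teacher responses, skipping duplicate prompts and very short outputs."""
    seen: set[str] = set()
    records: list[dict] = []
    for prompt in prompts:
        key = prompt.strip().lower()
        if not key or key in seen:
            continue  # skip empty or repeated prompts so the teacher is called once each
        seen.add(key)
        response = generate(prompt).strip()
        if len(response) < min_chars:
            continue  # drop empty or truncated generations
        records.append({"prompt": prompt, "response": response})
    return records

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, a common input format for fine-tuning tools."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Stand-in teacher for demonstration only; replace with a real API call.
def fake_teacher(prompt: str) -> str:
    return "" if "empty" in prompt else f"A detailed answer to: {prompt}"

dataset = build_distillation_set(
    ["What is KD?", "what is kd?  ", "empty case", "Explain LoRA."],
    fake_teacher,
)
jsonl_text = to_jsonl(dataset)
```

Passing the teacher as a callable keeps the filtering logic testable offline and makes it easy to swap providers later.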
The Temperature Parameter
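Classic distillation divides both the teacher's and the student's logits by a temperature T > 1 before the softmax, then computes the KL divergence between the two softened distributions. Higher T spreads probability mass across more tokens, revealing the teacher's uncertainty. T=2–4 is typical. In sequence-level distillation (training on generated text rather than logits), temperature plays a different role: it is the sampling temperature of the teacher, and it controls how diverse the generated responses are.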
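The logit-level case can be sketched in a few lines of plain Python. The logits below are illustrative values, and `softmax` and `kd_loss` are helper names introduced here. The `temperature ** 2` factor follows Hinton et al. (2015), who note that soft-target gradients shrink by 1/T² and should be rescaled when mixed with a hard-label loss.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """T^2 * KL(softmax(teacher / T) || softmax(student / T))."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher_logits = [6.0, 3.0, 1.0, -2.0]  # illustrative values for a 4-token vocabulary

sharp = softmax(teacher_logits, temperature=1.0)  # almost all mass on the top token
soft = softmax(teacher_logits, temperature=4.0)   # mass spread to lower-ranked tokens

loss_matched = kd_loss(teacher_logits, teacher_logits)         # student equals teacher
loss_uniform = kd_loss([0.0, 0.0, 0.0, 0.0], teacher_logits)   # uninformed student
```

Comparing `sharp` and `soft` shows the effect described above: at T=4 the top token loses probability and the tail tokens gain it, so the student receives a usable signal about the teacher's lower-ranked choices.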
Where Distillation Breaks Down
- Capability compression limits: a 3B student will not fully match a 70B teacher's reasoning, no matter how much distillation data you use. Model capacity sets a ceiling on what the student can absorb.
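- Teacher/student mismatch: distillation works best when teacher and student share similar architectures and training distributions. Cross-architecture distillation is harder, especially with different tokenizers: logit-level methods need a shared vocabulary to compare distributions token by token, whereas sequence-level distillation sidesteps this by training on text.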
- Hallucination inheritance: the student learns to sound like the teacher, including confident-sounding errors. Run your own factual evals — don't assume distillation improves truthfulness.
- Cost of online distillation: having the teacher generate in the training loop is expensive. Usually only justified for the final few training runs.
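For the hallucination-inheritance point above, one simple check is to score teacher and student against a trusted reference set and count how often the student repeats the teacher's wrong answer. This is a minimal sketch, assuming short-answer questions where normalized exact match is a fair metric; `inherited_error_report` is a hypothetical helper and the three QA rows are toy data.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())

def inherited_error_report(
    references: list[str],
    teacher_answers: list[str],
    student_answers: list[str],
) -> dict:
    """Compare teacher and student answers against trusted references."""
    n = len(references)
    teacher_correct = student_correct = inherited = 0
    for ref, t, s in zip(references, teacher_answers, student_answers):
        ref_n, t_n, s_n = normalize(ref), normalize(t), normalize(s)
        teacher_correct += t_n == ref_n
        student_correct += s_n == ref_n
        # An inherited error: the student repeats the teacher's wrong answer.
        inherited += (t_n != ref_n) and (s_n == t_n)
    return {
        "teacher_accuracy": teacher_correct / n,
        "student_accuracy": student_correct / n,
        "inherited_errors": inherited,
    }

# Toy data for illustration only.
report = inherited_error_report(
    references=["Paris", "1969", "Canberra"],
    teacher_answers=["Paris", "1969", "Sydney"],
    student_answers=["paris", "1968", "sydney"],
)
```

Here the student copies the teacher's "Sydney" mistake (an inherited error) and also introduces one of its own ("1968"), which is why the two kinds of failure are worth tracking separately.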
Speculative Decoding as Implicit Distillation
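Speculative decoding (small draft model + large verifier) is closely related to distillation. The verification/rejection step does not train the draft model; it only guarantees that the final output follows the target model's distribution. But the speed-up depends on how often drafted tokens are accepted, which in turn depends on how closely the draft model's token distribution matches the target's. That is exactly what distillation optimizes, so methods built for speculative decoding (e.g., Medusa, EAGLE) typically train their draft heads against the target model's own predictions or features, which is effectively a distillation objective.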
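One way to see the link: in standard speculative sampling, a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution. Averaged over drafts, the acceptance rate is the sum over tokens of min(p(x), q(x)), so it rises as q approaches p. The sketch below computes that quantity for illustrative distributions; `acceptance_rate` is a helper name introduced here and all probabilities are made up.

```python
def acceptance_rate(target_probs: list[float], draft_probs: list[float]) -> float:
    """Expected probability that one drafted token is accepted by the target model.

    Equals sum_x q(x) * min(1, p(x) / q(x)) = sum_x min(p(x), q(x)).
    """
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

# Illustrative next-token distributions over a 4-token vocabulary.
target = [0.70, 0.25, 0.04, 0.01]
poor_draft = [0.25, 0.25, 0.25, 0.25]       # draft model that ignores the context
distilled_draft = [0.65, 0.28, 0.05, 0.02]  # draft model trained to mimic the target

rate_poor = acceptance_rate(target, poor_draft)
rate_distilled = acceptance_rate(target, distilled_draft)
rate_perfect = acceptance_rate(target, target)
```

With these numbers the uninformed draft is accepted about 55% of the time and the distilled one about 95%, which is why a better-distilled draft model translates directly into longer accepted runs and faster decoding.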
Practical heuristic: if your teacher is 3–4× the size of your student, distillation will usually recover most of the quality gap. If the teacher is 10× larger (70B → 7B), expect the student to retain roughly 60–80% of the teacher's capability, with the hardest reasoning tasks suffering most. Treat these figures as rules of thumb and measure on your own tasks.
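To check a heuristic like this on your own benchmark, it helps to pin down what "recovering the gap" means. A minimal sketch, assuming a single scalar benchmark score where higher is better; `gap_recovered` is a hypothetical helper and the scores are invented for illustration.

```python
def gap_recovered(
    teacher_score: float,
    baseline_student_score: float,
    distilled_student_score: float,
) -> float:
    """Fraction of the teacher-student gap closed by distillation (1.0 = fully closed)."""
    gap = teacher_score - baseline_student_score
    if gap <= 0:
        raise ValueError("teacher must outscore the baseline student")
    return (distilled_student_score - baseline_student_score) / gap

# Hypothetical scores: teacher 80, student before distillation 50, after 71.
example = gap_recovered(
    teacher_score=80.0,
    baseline_student_score=50.0,
    distilled_student_score=71.0,
)
```

Computing this per task category, rather than as one aggregate, makes it easier to see whether the hardest reasoning tasks are the ones lagging.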
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
- Sequence-Level Knowledge Distillation (Kim & Rush, 2016)
- Alpaca (Taori et al., 2023)