AI Engineering 9 min read

Knowledge Distillation at Scale: Teaching Small Models to Think Big

How to transfer capabilities from large frontier models to smaller, cheaper, deployable ones — and where it breaks down.

You have a frontier model that's too expensive to serve. Knowledge distillation is a way to keep most of its quality at a fraction of the serving cost; a figure like 80% of the quality at 10% of the cost is a rough target, not a guarantee. Here's what the technique actually involves.

The Core Idea

Instead of training a student model only on hard labels (one-hot targets that mark a single correct next token), distillation trains it to match the teacher model's output distribution: the full softmax over the vocabulary. These 'soft targets' carry more information than hard labels: the teacher's confidence, its second-best guesses, and implicit structure about which tokens are related.
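
To make the difference concrete, here is a minimal sketch with a toy four-token vocabulary. The logits and the two student distributions are made up for illustration; the point is only that a hard label cannot tell the two students apart, while the teacher's soft target can.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, student_probs):
    """Cross-entropy H(target, student); lower means a closer match."""
    return -sum(t * math.log(s) for t, s in zip(target, student_probs) if t > 0)

# Toy 4-token vocabulary; all numbers are made up for illustration.
vocab = ["cat", "kitten", "dog", "car"]
teacher_logits = [4.0, 3.0, 1.0, -2.0]

hard_label = [1.0, 0.0, 0.0, 0.0]        # one-hot: only "cat" counts
soft_target = softmax(teacher_logits)    # the teacher's full distribution

# Two hypothetical students: same probability on "cat", different runner-up.
student_a = [0.6, 0.30, 0.08, 0.02]   # runner-up "kitten", like the teacher
student_b = [0.6, 0.02, 0.08, 0.30]   # runner-up "car"

for name, probs in [("A", student_a), ("B", student_b)]:
    hard = cross_entropy(hard_label, probs)
    soft = cross_entropy(soft_target, probs)
    print(f"student {name}: hard-label loss={hard:.3f}  soft-target loss={soft:.3f}")
```

Both students get the same hard-label loss because they put the same probability on "cat". The soft-target loss is lower for student A, whose runner-up matches the teacher's.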

Types of Distillation

Type	What's Transferred	Cost	Quality
Output distillation	Final token probabilities	Low (inference only)	Good
Feature distillation	Intermediate hidden states	Medium	Better
Sequence-level KD	Generated sequences as training data	Low (generate once)	Good
Online distillation	Teacher generates responses during student training	High	Best
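
Feature distillation is the least self-explanatory row in the table, so here is a rough sketch of one way it can look. It assumes the student's hidden width is smaller than the teacher's and bridges the gap with a linear projection, which is a common choice but not the only one. The function names, matrix, and vectors are illustrative; in real training the projection is learned jointly with the student.

```python
def project(hidden, weight):
    """Map a student hidden vector (width d_s) to the teacher's width d_t.

    weight is a d_t x d_s matrix given as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(squared)

# Made-up example: student width 2, teacher width 3.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]
student_hidden = [1.0, 2.0]

print(feature_loss(student_hidden, [1.0, 2.0, 3.0], weight))  # states agree after projection
print(feature_loss(student_hidden, [1.0, 2.0, 5.0], weight))  # last feature is off by 2
```

This term is typically added to the output-level loss rather than replacing it, which is part of why the table lists feature distillation as more expensive: you need the teacher's hidden states, not just its outputs.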

Sequence-Level Distillation at Scale

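The most practical form in 2025: use a frontier teacher (GPT-4, Claude, Gemini) to generate a large synthetic dataset, then fine-tune a smaller student on those generations. Alpaca was built this way: text-davinci-003, a GPT-3.5-family model, generated about 52K instruction examples that were used to fine-tune LLaMA 7B. Vicuna followed a close variant, fine-tuning LLaMA on user-shared ChatGPT conversations from ShareGPT, and many open instruction models since have used the same recipe. Student quality on the distilled tasks is bounded by the teacher's, but often lands surprisingly close.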

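# Sequence-level distillation pipeline (sketch).
# teacher_generate and fine_tune are placeholders for your own teacher API
# client and training wrapper; neither is a real library call.
def distill(prompt_dataset, teacher_generate, fine_tune):
    # 1. Query the teacher once per prompt and keep prompt/response pairs.
    teacher_responses = []
    for prompt in prompt_dataset:
        response = teacher_generate(prompt)  # e.g. GPT-4 / Claude
        if response:  # skip empty or failed generations
            teacher_responses.append({"prompt": prompt, "response": response})

    # 2. Fine-tune the student on the teacher-generated data (e.g. with LoRA).
    return fine_tune(teacher_responses, method="LoRA")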

The Temperature Parameter

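Classic distillation divides both the teacher's and the student's logits by a temperature T > 1 before the softmax, then minimizes the KL divergence between the two softened distributions. Higher T spreads probability mass across more tokens, exposing the teacher's relative preferences among lower-ranked tokens. T = 2–4 is typical, and the KL term is usually multiplied by T² so that its gradients stay on a scale comparable to the hard-label loss. For sequence-level distillation (training on generated text rather than logits), temperature is instead the teacher's sampling temperature and controls response diversity.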
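
The logit-level version can be sketched in a few lines. This is a minimal single-position example, assuming the teacher and student share a vocabulary; the logits are made up, and the T² scaling follows the convention from the original knowledge distillation formulation. A real implementation would run batched in a tensor library.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher_T || student_T) scaled by T^2, for one token position."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher_logits = [5.0, 2.0, 0.5, -1.0]   # made-up logits for one position
student_logits = [3.0, 2.5, 0.0, -0.5]

for t in (1.0, 2.0, 4.0):
    top = max(softmax_t(teacher_logits, t))
    loss = kd_loss(teacher_logits, student_logits, t)
    print(f"T={t}: teacher top-token prob={top:.3f}, KD loss={loss:.4f}")
```

As T increases, the teacher's top-token probability falls and the lower-ranked tokens contribute more to the loss.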

Where Distillation Breaks Down

Capability compression limits: a 3B student is unlikely to match a 70B teacher's reasoning no matter how much distillation data you use. Model capacity sets a ceiling on what the student can absorb.
Architecture and tokenizer mismatch: distillation works best when teacher and student share similar architectures and training distributions. Logit-level distillation across different tokenizers is harder because the two vocabularies do not line up token for token; sequence-level distillation sidesteps this by passing plain text.
Hallucination inheritance: the student learns to sound like the teacher, including confident-sounding errors. Run your own factual evals — don't assume distillation improves truthfulness.
Cost of online distillation: having the teacher generate in the training loop is expensive. Usually only justified for the final few training runs.
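
For the hallucination point, a small factual eval can be sketched as below. It assumes you have a set of questions with gold answers plus the teacher's and student's answers to each. The field names, the exact-match scoring, and the four sample rows are illustrative assumptions; exact match is crude, and a real eval would need a more forgiving comparison.

```python
def normalize(text):
    """Case- and whitespace-insensitive comparison key."""
    return " ".join(text.lower().split())

def factual_report(examples):
    """examples: list of dicts with 'gold', 'teacher', and 'student' answer strings."""
    n = len(examples)
    teacher_ok = student_ok = teacher_wrong = inherited = 0
    for ex in examples:
        gold, t, s = (normalize(ex[k]) for k in ("gold", "teacher", "student"))
        teacher_ok += t == gold
        student_ok += s == gold
        if t != gold:
            teacher_wrong += 1
            inherited += s == t   # student repeats the teacher's wrong answer
    return {
        "teacher_accuracy": teacher_ok / n,
        "student_accuracy": student_ok / n,
        "inherited_error_rate": inherited / teacher_wrong if teacher_wrong else 0.0,
    }

# Made-up eval rows for illustration.
examples = [
    {"gold": "Canberra", "teacher": "Sydney", "student": "sydney"},  # inherited error
    {"gold": "1969", "teacher": "1969", "student": "1969"},
    {"gold": "Au", "teacher": "Au", "student": "Ag"},                # student-only error
    {"gold": "8", "teacher": "9", "student": "8"},                   # teacher-only error
]
report = factual_report(examples)
print(report)
```

The inherited error rate is the share of the teacher's mistakes that the student reproduces verbatim. A high value suggests the student is copying the teacher's errors rather than making independent ones.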

Speculative Decoding as Implicit Distillation

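Speculative decoding (small draft model + large verifier) is closely related to distillation. At inference time the draft model does not learn anything: the target model only accepts or rejects the tokens the draft proposes. But the speedup depends on how often those proposals are accepted, which rises as the draft's token distribution gets closer to the target's, so draft models are often trained to imitate the target. Methods built for speculative decoding, such as Medusa and EAGLE, attach lightweight draft heads to the target model and, in several variants, train them against the target's own features or outputs.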
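
The link between draft quality and speed can be shown with the standard speculative sampling acceptance rule: a token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution. Averaged over draft samples, that gives an expected acceptance rate of sum over x of min(p(x), q(x)). The sketch below uses made-up three-token distributions.

```python
import random

def accept_draft_token(token, target_probs, draft_probs, rng=random):
    """Accept a token that was sampled from the draft, with probability min(1, p/q)."""
    p, q = target_probs[token], draft_probs[token]
    return rng.random() < min(1.0, p / q)

def acceptance_rate(target_probs, draft_probs):
    """Expected per-token acceptance probability: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

target = [0.7, 0.2, 0.1]        # made-up target distribution
draft_close = [0.6, 0.3, 0.1]   # draft that roughly matches the target
draft_far = [0.1, 0.2, 0.7]     # draft that disagrees with the target

print(f"close draft: {acceptance_rate(target, draft_close):.2f}")
print(f"far draft:   {acceptance_rate(target, draft_far):.2f}")
```

The closer the draft's distribution is to the target's, the higher the acceptance rate, which is why training the draft with a distillation loss against the target pays off directly in decoding speed.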

Practical heuristic: if your teacher is 3–4× the size of your student, distillation will usually recover most of the quality gap. If it's 10× larger (70B → 7B), a rough expectation is recovering 60–80% of the teacher's capability, with the hardest reasoning tasks suffering most. Treat both figures as rules of thumb and confirm them on your own evals.
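
One way to turn "recovering the quality gap" into a number you can track is sketched below. It assumes a single benchmark score for the teacher, the undistilled student, and the distilled student. The function name and the scores are hypothetical, and other definitions of recovery are equally reasonable.

```python
def gap_recovered(teacher_score, base_student_score, distilled_student_score):
    """Fraction of the teacher-student gap closed by distillation (can exceed 1 or go negative)."""
    gap = teacher_score - base_student_score
    if gap <= 0:
        raise ValueError("teacher must outscore the undistilled student")
    return (distilled_student_score - base_student_score) / gap

# Made-up benchmark scores for illustration.
fraction = gap_recovered(teacher_score=80.0, base_student_score=50.0,
                         distilled_student_score=71.0)
print(f"gap recovered: {fraction:.0%}")
```

Computing this per task rather than on an aggregate score makes it easier to see where the hardest reasoning tasks lag behind.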
