
Knowledge Distillation at Scale: Teaching Small Models to Think Big

How to transfer capabilities from large frontier models to smaller, cheaper, deployable ones — and where it breaks down.

You have a frontier model that's too expensive to serve. Knowledge distillation is how you try to keep most of its quality, on the order of 80%, at roughly a tenth of the serving cost. Here's what the technique actually involves.

The Core Idea

Instead of training a student model only on hard labels (a single correct next token, one-hot over the vocabulary), distillation trains it to match the teacher model's output distribution: the full softmax over the vocabulary. These 'soft targets' carry more information than hard labels: the teacher's confidence, its second-best guesses, and implicit structure about which tokens are near-interchangeable in a given context.
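To make the difference concrete, here is a minimal sketch comparing the loss against a hard label with the loss against a teacher's soft targets. The three-word vocabulary and all logits are made-up illustration values, not numbers from any real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_probs):
    """H(target, student) = -sum_i target_i * log(student_i)."""
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)

# Hypothetical next-token logits over a toy vocabulary.
vocab = ["cat", "dog", "car"]
teacher_logits = [4.0, 3.0, -1.0]   # teacher: "cat" likely, "dog" plausible, "car" not
student_logits = [2.0, 0.5, 0.0]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Hard label: only "cat" counts, so the loss ignores how the student ranks the rest.
hard_target = [1.0, 0.0, 0.0]
hard_loss = cross_entropy(hard_target, student_probs)

# Soft targets: the student is also penalized for under-weighting "dog".
soft_loss = cross_entropy(teacher_probs, student_probs)

for token, p in zip(vocab, teacher_probs):
    print(f"{token}: teacher p = {p:.3f}")
print(f"hard-label loss = {hard_loss:.3f}, soft-target loss = {soft_loss:.3f}")
```

The hard-label loss depends only on the probability the student gives "cat". The soft-target loss also rewards the student for learning that "dog" is a far better runner-up than "car", which is the extra signal distillation exploits.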

Types of Distillation

Type | What's Transferred | Cost | Quality
Output distillation | Final token probabilities | Low (inference only) | Good
Feature distillation | Intermediate hidden states | Medium | Better
Sequence-level KD | Generated sequences as training data | Low (generate once) | Good
Online distillation | Teacher generates responses during student training | High | Best
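Feature distillation is the least self-explanatory row, so here is a rough sketch of one common formulation: project the student's hidden state into the teacher's hidden dimension and penalize the mean squared error. The vectors, the projection matrix, and the helper names `project` and `mse` are illustrative assumptions. In practice the projection is learned jointly with the student.

```python
def project(hidden, weight):
    """Map a student hidden vector (size d_s) into teacher space (size d_t).

    `weight` is a d_t x d_s matrix given as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy values: the student is narrower (d_s = 2) than the teacher (d_t = 3).
student_hidden = [0.5, -1.0]
teacher_hidden = [0.2, 0.1, -0.4]

# Fixed here for illustration; a real setup would train this matrix.
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

projected = project(student_hidden, projection)
feature_loss = mse(projected, teacher_hidden)
print(f"feature distillation loss = {feature_loss:.4f}")
```

The projection is what makes this workable when teacher and student have different widths. It is also part of why the table lists the cost as medium: you need access to the teacher's internals, which rules out API-only teachers.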

Sequence-Level Distillation at Scale

The most practical form in 2025: use a frontier teacher (GPT-4, Claude, Gemini) to generate a large synthetic dataset, then fine-tune a smaller student on those generations. This is how Alpaca (text-davinci-003 outputs → LLaMA), Vicuna (ChatGPT conversations shared on ShareGPT → LLaMA), and many other open instruction models were built. The student's quality is generally bounded by the teacher's, but on tasks the synthetic data covers well it often comes surprisingly close.

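# Sequence-level distillation pipeline (sketch).
# prompt_dataset, teacher_api, and student are placeholders for your own objects.
import json

teacher_responses = []
for prompt in prompt_dataset:
    response = teacher_api.generate(prompt)  # GPT-4 / Claude behind your own client
    if response and response.strip():        # skip empty generations
        teacher_responses.append({"prompt": prompt, "response": response})

# Save the generations so the expensive teacher calls only happen once
with open("teacher_responses.jsonl", "w", encoding="utf-8") as f:
    f.writelines(json.dumps(row, ensure_ascii=False) + "\n" for row in teacher_responses)

# Fine-tune student on teacher-generated data
student.fine_tune(teacher_responses, method="LoRA")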

The Temperature Parameter

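Classic distillation divides both the teacher's and the student's logits by a temperature T > 1 before the softmax, then minimizes the KL divergence between the two softened distributions. Higher T spreads probability mass across more tokens, exposing the teacher's relative preferences among unlikely tokens that would otherwise round to zero. T=2–4 is typical. For sequence-level distillation (training on generated text rather than logits), temperature plays a different role: it controls the diversity of the teacher's sampled responses.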
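A minimal sketch of the logit-based loss follows, assuming toy logits. The function names `softmax_t` and `kd_loss` are illustrative. Multiplying by T² is a common convention from the original distillation formulation, used so that gradient magnitudes stay comparable as T changes.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature):
    """T^2 * KL(teacher_T || student_T) on temperature-softened distributions."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Made-up logits over a four-token vocabulary.
teacher_logits = [6.0, 3.0, 1.0, -2.0]
student_logits = [4.0, 3.5, 0.0, -1.0]

sharp = softmax_t(teacher_logits, 1.0)
soft = softmax_t(teacher_logits, 3.0)
print("T=1:", [round(p, 3) for p in sharp])
print("T=3:", [round(p, 3) for p in soft])
print("KD loss at T=3:", round(kd_loss(teacher_logits, student_logits, 3.0), 4))
```

At T=1 the top token takes almost all of the mass. At T=3 the lower-ranked tokens become visible to the loss, which is where most of the extra signal lives.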

Where Distillation Breaks Down

Speculative Decoding as Implicit Distillation

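Speculative decoding (small draft model + large verifier) is closely related to distillation. The verification/rejection step does not train the draft model; it only accepts or rejects proposed tokens. But the speedup depends directly on how closely the draft model's token distribution matches the target's, so a good draft model is in effect a distilled copy of the target. Approaches built specifically for speculative decoding (e.g., Medusa, EAGLE) train lightweight draft heads against the target model's own outputs or hidden features, which is a form of distillation.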
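The link between distribution match and speed can be sketched with the standard accept/reject rule from speculative sampling: accept a drafted token with probability min(1, p_target / p_draft), otherwise resample from the normalized positive part of (p_target - p_draft). The distributions below are toy values, and `speculative_step` and `expected_acceptance` are illustrative names, not any library's API.

```python
import random

def speculative_step(draft_probs, target_probs, rng):
    """Propose one token from the draft, then accept or resample against the target.

    Returns (token_index, accepted).
    """
    tokens = list(range(len(draft_probs)))
    proposed = rng.choices(tokens, weights=draft_probs, k=1)[0]
    accept_prob = min(1.0, target_probs[proposed] / draft_probs[proposed])
    if rng.random() < accept_prob:
        return proposed, True
    # Rejected: resample from the part of the target the draft under-covers.
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    return rng.choices(tokens, weights=residual, k=1)[0], False

def expected_acceptance(draft_probs, target_probs):
    """Per-token acceptance rate: sum_i min(p_i, q_i), i.e. 1 - total variation distance."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

target = [0.7, 0.2, 0.1]
well_distilled_draft = [0.6, 0.3, 0.1]
poorly_matched_draft = [0.1, 0.2, 0.7]

print("well distilled:", round(expected_acceptance(well_distilled_draft, target), 3))
print("poorly matched:", round(expected_acceptance(poorly_matched_draft, target), 3))

rng = random.Random(0)
samples = [speculative_step(well_distilled_draft, target, rng) for _ in range(10_000)]
print("empirical acceptance:", sum(ok for _, ok in samples) / len(samples))
```

Every rejected token costs a wasted draft step, so a draft that tracks the target more closely translates directly into more accepted tokens per verifier call. That is the practical reason distillation losses show up in draft-model training.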

Practical heuristic: if your teacher is 3–4× the size of the student you plan to serve in production, distillation will typically recover most of the quality gap. If it's 10× larger (70B → 7B), expect to recover 60–80% of capabilities, with the hardest reasoning tasks suffering most.

