Chinchilla: The Scaling Laws Paper That Made Bigger Models Smaller
DeepMind's 2022 paper showing that GPT-3 was trained on more than 11× too few tokens for its size. The compute-optimal rule that reshaped how labs train models, and why smaller + more data beats larger + less.
In 2020, OpenAI published the original scaling laws paper. The implication everyone drew: make the model bigger. GPT-3 had 175 billion parameters trained on 300 billion tokens. In March 2022, DeepMind published 'Training Compute-Optimal Large Language Models' — the Chinchilla paper. The finding: GPT-3 was massively undertrained. Not in parameters — in tokens. The entire industry had been scaling in the wrong direction.
The compute-optimal formula
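Chinchilla's rule: for compute-optimal training, model parameters N and training tokens D should scale in equal proportion. Because training compute grows roughly as C ≈ 6 × N × D, doubling the compute means scaling both N and D by about √2 ≈ 1.4×; doubling both model size AND training tokens takes about 4× the compute. The original scaling laws instead suggested putting most additional compute into model size.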
Optimal: D ≈ 20 × N (train on ~20 tokens per parameter)
GPT-3: 175B params × 20 = 3.5T tokens needed (trained on 300B, about 11.7× too few)
Chinchilla: 70B params, trained on 1.4T tokens (compute-optimal)
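Result: Chinchilla 70B outperforms GPT-3 175B on a wide range of benchmarks
with 2.5× fewer parameters, using roughly 1.9× GPT-3's training compute (by C ≈ 6 × N × D).
Its compute budget matched DeepMind's own Gopher (280B), which it also outperforms.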
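The rule of thumb above can be turned into a small calculator. This is a minimal sketch, not the paper's fitting procedure: it assumes the common approximation C ≈ 6 × N × D for training FLOPs together with the 20-tokens-per-parameter ratio, and the names `compute_optimal` and `training_flops` are illustrative.

```python
import math

TOKENS_PER_PARAM = 20      # rule-of-thumb ratio D ≈ 20 × N from the text
FLOPS_PER_PARAM_TOKEN = 6  # assumption: training FLOPs C ≈ 6 × N × D


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute under the C ≈ 6 × N × D assumption."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend `compute_flops` at D = 20 × N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    chinchilla_c = training_flops(70e9, 1.4e12)
    n, d = compute_optimal(chinchilla_c)
    print(f"budget {chinchilla_c:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")

    # Doubling compute scales params and tokens by sqrt(2) each, not by 2.
    n2, d2 = compute_optimal(2 * chinchilla_c)
    print(f"2x compute -> params x{n2 / n:.2f}, tokens x{d2 / d:.2f}")

    # GPT-3's token shortfall relative to the 20:1 rule.
    print(f"GPT-3 shortfall: {TOKENS_PER_PARAM * 175e9 / 300e9:.1f}x")
```

Fed Chinchilla's own budget, the function recovers roughly 70B parameters and 1.4T tokens. Real training runs would also need to account for data availability and hardware constraints, which this sketch ignores.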
What it changed
- LLaMA (Meta, 2023): trained smaller models on far more tokens. LLaMA-7B and LLaMA-13B each saw 1T tokens, and Meta reported LLaMA-13B outperforming GPT-3 (175B) on most benchmarks.
- The 'bigger is always better' intuition broke: a well-trained 7B model is frequently a better inference choice than a poorly trained 70B.
- Mistral 7B, Phi, Gemma: small models that lean on more (or more carefully curated) training data rather than more parameters, and punch well above their weight class.
The inference implication
Inference cost scales with parameter count, not training token count. A 7B model costs roughly 10× less compute per token to serve than a 70B model, so a small model trained on enough data to close most of the quality gap is far cheaper to run. This is one reason the Llama 3 family became so widely used among open-weight models.
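A rough way to see the serving-cost gap is to count forward-pass FLOPs. The sketch below assumes the standard approximation that a dense decoder-only transformer spends about 2 × N FLOPs per generated token; attention over long contexts, memory bandwidth, and batching are ignored, so treat it as a first-order estimate. The function names are illustrative.

```python
def forward_flops_per_token(params: float) -> float:
    """Approximate forward-pass FLOPs per token: one multiply and one add per weight."""
    return 2.0 * params


def relative_serving_cost(small_params: float, large_params: float) -> float:
    """How many times more compute per token the larger model needs."""
    return forward_flops_per_token(large_params) / forward_flops_per_token(small_params)


if __name__ == "__main__":
    ratio = relative_serving_cost(7e9, 70e9)
    print(f"70B vs 7B: {ratio:.0f}x forward FLOPs per token")
    # Training token count never appears here: it affects quality, not per-token cost.
```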
When evaluating models for production, always check training token count alongside parameter count. A 7B model trained on 10T tokens can match or outperform a 70B model trained on 500B tokens on many practical benchmarks, while costing roughly 10× less to serve.
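One way to sanity-check a comparison like this is the parametric loss fit reported in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, with the paper's fitted constants E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28. The sketch below plugs in the two configurations. It extrapolates well beyond the range the constants were fitted on, predicts pre-training loss rather than benchmark scores, and ignores data quality, so read the output as a rough indication only. The name `estimated_loss` is illustrative.

```python
# Constants from the parametric fit reported in the Chinchilla paper
# (assumption: they transfer only roughly to other datasets and tokenizers).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def estimated_loss(params: float, tokens: float) -> float:
    """Estimated pre-training loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / params**ALPHA + B / tokens**BETA


if __name__ == "__main__":
    configs = {
        "7B on 10T tokens": (7e9, 10e12),
        "70B on 500B tokens": (70e9, 500e9),
    }
    for name, (n, d) in configs.items():
        print(f"{name}: estimated loss {estimated_loss(n, d):.3f}")
```

Under these assumptions the 7B configuration comes out slightly ahead (roughly 1.97 versus 1.99). The margin is small, which is a reason to treat the comparison as plausible rather than guaranteed.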
Inference-optimal vs. training-optimal
Chinchilla optimises for training efficiency: minimum compute to reach a given loss. For deployed models, where inference cost accumulates over the model's lifetime, it often pays to overtrain a small model, shifting compute from parameters to data. This is why Meta trained Llama 3 8B on more than 15 trillion tokens, close to 1,900 tokens per parameter and far beyond the roughly 160 billion tokens that the 20:1 rule would suggest for a model of that size.
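A quick way to see how far a model sits from the 20:1 rule is to compute its tokens-per-parameter ratio. The sketch below uses the parameter and token counts quoted in this article, plus the publicly reported figure of more than 15T tokens for Llama 3 8B (treated here as exactly 15T, which is an assumption). The helper name `tokens_per_param` is illustrative.

```python
CHINCHILLA_RATIO = 20  # tokens per parameter, the rule of thumb from the text

# name -> (parameters, training tokens)
MODELS = {
    "GPT-3 175B": (175e9, 300e9),
    "Chinchilla 70B": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1e12),
    "Llama 3 8B": (8e9, 15e12),  # assumption: "over 15T" rounded down to 15T
}


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        ratio = tokens_per_param(n, d)
        print(f"{name:15s} {ratio:8.1f} tokens/param  "
              f"({ratio / CHINCHILLA_RATIO:.2f}x the 20:1 rule)")
```

Ratios below 1× the rule indicate an undertrained model by Chinchilla's standard; ratios far above it indicate deliberate overtraining for cheaper inference.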