
Chinchilla: The Scaling Laws Paper That Made Bigger Models Smaller

DeepMind's 2022 paper showing GPT-3 was trained on roughly a tenth of the tokens its size called for. The compute-optimal rule that reshaped how labs train models, and why smaller + more data can beat larger + less.

In 2020, OpenAI published the original scaling laws paper. The implication everyone drew: make the model bigger. GPT-3 had 175 billion parameters trained on 300 billion tokens. In March 2022, DeepMind published 'Training Compute-Optimal Large Language Models' — the Chinchilla paper. The finding: GPT-3 was massively undertrained. Not in parameters — in tokens. The entire industry had been scaling in the wrong direction.

The compute-optimal formula

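Chinchilla's rule: for compute-optimal training, model parameters N and training tokens D should scale in equal proportion. Because training compute grows roughly with the product N × D, doubling the compute means growing both model size and training tokens by about 1.4× (√2); doubling both takes about 4× the compute. The original scaling laws suggested putting most of the extra compute into model size.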

Optimal: D ≈ 20 × N  (train on ~20 tokens per parameter)

GPT-3:      175B params × 20 = 3.5T tokens needed  (trained on 300B, ~12× too few)
Chinchilla:  70B params,  trained on 1.4T tokens  (compute-optimal)

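Result: Chinchilla 70B outperforms Gopher 280B, trained with the same compute,
using 4× fewer parameters. It also beats GPT-3 175B on nearly every benchmark
with 2.5× fewer parameters, though with roughly 2× GPT-3's training compute
(by the 6 × N × D estimate).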
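
The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 × N × D), which this article does not state, and treating 20 tokens per parameter as a rule of thumb rather than an exact constant. The function names are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb from the text: D ≈ 20 × N
FLOPS_PER_PARAM_TOKEN = 6    # assumption: training compute C ≈ 6 × N × D


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(flops: float) -> tuple:
    """Split a FLOP budget into (params, tokens) under D = 20 × N.

    C = 6 × N × (20 × N) = 120 × N², so N = sqrt(C / 120).
    """
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


def token_shortfall(params: float, tokens: float) -> float:
    """How many times more tokens the 20-per-parameter rule asks for."""
    return TOKENS_PER_PARAM * params / tokens


gpt3_shortfall = token_shortfall(175e9, 300e9)        # GPT-3: 175B params, 300B tokens
chinchilla_budget = training_flops(70e9, 1.4e12)      # Chinchilla: 70B params, 1.4T tokens
n_opt, d_opt = compute_optimal(chinchilla_budget)     # recovers the 70B / 1.4T split
n_2x, d_2x = compute_optimal(2 * chinchilla_budget)   # same rule with twice the compute

print(f"GPT-3 token shortfall: {gpt3_shortfall:.1f}x")
print(f"Optimal split for Chinchilla's budget: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")
print(f"Doubling compute grows each by {n_2x / n_opt:.2f}x")
```

Under these assumptions, feeding Chinchilla's own budget back into the rule returns its 70B / 1.4T configuration, and doubling the budget grows N and D by about 1.41× each rather than 2×.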

What it changed

LLaMA (Meta, 2023): trained smaller models on far more tokens. LLaMA-7B saw 1T tokens, and LLaMA-13B outperforms GPT-3 175B on most benchmarks.
The 'bigger is always better' intuition broke: a well-trained 7B model is frequently a better inference choice than a poorly trained 70B.
Mistral 7B, Phi, Gemma: small models built on data-heavy recipes rather than at the compute-optimal point, which punch well above their weight class.

The inference implication

Inference cost scales with parameter count, not training token count. A 7B model costs roughly 10× less per token to serve than a 70B model, and if it has seen enough training data it can come close in quality on many tasks. This is one reason the LLaMA 3 family became such a widely used open-weight stack.

When evaluating models for production, always check training token count alongside parameter count. A 7B model trained on 10T tokens can match or beat a 70B model trained on 500B tokens on many practical benchmarks, while costing roughly 10× less to serve.
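
To make the serving trade-off concrete, here is a rough sketch comparing the two hypothetical models from the paragraph above. It assumes dense models, about 6 × N × D FLOPs to train and about 2 × N FLOPs per generated token to serve; both are common approximations rather than figures from this article, and they ignore attention over long contexts, batching, and hardware utilisation. The `Model` class and function names are made up for illustration.

```python
from dataclasses import dataclass

TRAIN_FLOPS_PER_PARAM_TOKEN = 6  # assumption: training ≈ 6 × N × D FLOPs
INFER_FLOPS_PER_PARAM_TOKEN = 2  # assumption: serving ≈ 2 × N FLOPs per token


@dataclass(frozen=True)
class Model:
    name: str
    params: float        # parameter count N
    train_tokens: float  # training tokens D

    def training_flops(self) -> float:
        return TRAIN_FLOPS_PER_PARAM_TOKEN * self.params * self.train_tokens

    def inference_flops_per_token(self) -> float:
        return INFER_FLOPS_PER_PARAM_TOKEN * self.params

    def lifetime_flops(self, served_tokens: float) -> float:
        """Training compute plus the compute to serve `served_tokens` tokens."""
        return self.training_flops() + self.inference_flops_per_token() * served_tokens


def break_even_served_tokens(small: Model, large: Model) -> float:
    """Served tokens at which the small model's extra training cost is repaid.

    Assumes `small` has fewer parameters than `large`.
    """
    extra_training = small.training_flops() - large.training_flops()
    saving_per_token = large.inference_flops_per_token() - small.inference_flops_per_token()
    return extra_training / saving_per_token


small = Model("7B on 10T tokens", 7e9, 10e12)
large = Model("70B on 500B tokens", 70e9, 500e9)

serve_cost_ratio = large.inference_flops_per_token() / small.inference_flops_per_token()
break_even = break_even_served_tokens(small, large)

print(f"Per-token serving cost, large vs small: {serve_cost_ratio:.0f}x")
print(f"Small model is cheaper overall after ~{break_even / 1e12:.1f}T served tokens")
```

In this toy comparison the 7B model costs about twice as much to train, but the 10× cheaper inference repays that difference once total served tokens pass the break-even point. The sketch compares compute only; it says nothing about which model is better in quality.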

Inference-optimal vs. training-optimal

Chinchilla optimises for training efficiency: minimum compute to reach a given loss. It does not account for the cost of serving the model afterwards. For deployed models, the better strategy is often to overtrain a small model: spend extra compute on data rather than parameters, accepting a less efficient training run in exchange for cheaper inference. This is why Meta trained Llama 3 8B on over 15 trillion tokens, close to 100× what the 20-tokens-per-parameter rule would suggest.
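
One way to see why overtraining still pays off is the parametric loss fit from the Chinchilla paper, L(N, D) = E + A / N^α + B / D^β. The sketch below uses the fitted constants as reported in that paper (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28); they are not given in this article, and the fit applies to that paper's data and tokenizer, so the outputs are illustrative rather than predictions for any real model. The run names are hypothetical labels.

```python
# Fitted constants as reported in the Chinchilla paper (assumption: not from this article).
E = 1.69                  # irreducible loss term
A, ALPHA = 406.4, 0.34    # model-size term
B, BETA = 410.7, 0.28     # data-size term


def predicted_loss(params: float, tokens: float) -> float:
    """Parametric loss estimate L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / params**ALPHA + B / tokens**BETA


# Hypothetical runs: (parameters, training tokens)
runs = {
    "8B at 20 tokens/param": (8e9, 160e9),
    "8B overtrained on 15T": (8e9, 15e12),
    "70B at 20 tokens/param": (70e9, 1.4e12),
}

for name, (n, d) in runs.items():
    print(f"{name:<26} predicted loss {predicted_loss(n, d):.3f}")
```

Under this fit, extra tokens keep lowering the 8B model's predicted loss, with diminishing returns, and for a fixed N the loss can never drop below E + A / N^α. That floor is the price of staying small; cheaper inference is what makes it worth paying.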

