Chinchilla: Bigger Is Not Always Better — It's About the Ratio

The Chinchilla paper (DeepMind, 2022) showed most pre-2022 models were over-parameterized and undertrained. The compute-optimal ratio is ~20 tokens per parameter. A smaller model trained longer often beats a larger model trained shorter.

A team picks the largest open model available, at 70B parameters, and trains it for as long as their compute budget allows. They stop at 300 billion tokens and evaluate. The model underperforms a 7B model released six months earlier that was trained on more data. The 70B has ten times the parameters, yet the 7B is better. This is not a fluke or a benchmarking artifact. It is the direct consequence of ignoring one of the most important empirical findings about large language model training from the past several years.

Before 2022, the dominant intuition was that bigger models are better models: add parameters, improve capability. Researchers at DeepMind ran a systematic experiment to test whether this held under a fixed compute budget. They trained more than 400 models of different sizes on different amounts of data, grouped the runs that consumed the same total number of FLOPs, and compared final loss within each group. The question was not which model is biggest but which model is best for a given amount of compute.

The result, published as the Chinchilla paper in 2022, was that nearly every major language model trained before it was simultaneously too large and undertrained. The optimal allocation of compute is not to maximize parameter count — it is to balance parameter count against training tokens according to a specific ratio.

Empirical rule: optimal training tokens ≈ 20 × model parameters

Model     Optimal tokens    Pre-2022 actual   Verdict
──────────────────────────────────────────────────────────
  7B          140B              ~30B           undertrained by 110B
 13B          260B              ~40B           undertrained by 220B
 70B          1.4T             ~300B           undertrained by 1.1T
175B          3.5T             ~300B           undertrained by 3.2T
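
The table's arithmetic can be reproduced in a few lines. This is a minimal sketch that treats the 20:1 rule as exact, which it is not; the function names are illustrative, and the token counts are copied from the table above rather than from any independent source.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above; an approximation, not an exact constant


def optimal_tokens(params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal training tokens for a model with `params` parameters."""
    return ratio * params


def shortfall(params: float, actual_tokens: float) -> float:
    """Tokens missing relative to the rule of thumb (0 if the model already meets it)."""
    return max(0.0, optimal_tokens(params) - actual_tokens)


def fmt(n: float) -> str:
    """Render a count as e.g. '140B' or '1.4T'."""
    if n >= 1e12:
        return f"{n / 1e12:.1f}T"
    return f"{n / 1e9:.0f}B"


# (parameters, tokens actually used) pairs copied from the table above
rows = [(7e9, 30e9), (13e9, 40e9), (70e9, 300e9), (175e9, 300e9)]

for params, actual in rows:
    print(f"{fmt(params):>5}  optimal {fmt(optimal_tokens(params)):>5}  "
          f"short by {fmt(shortfall(params, actual))}")
```

Changing `TOKENS_PER_PARAM` shows how sensitive the verdict column is to the exact ratio assumed.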

Fixed compute budget example (the paper's own comparison):
  Option A: train 280B params on 300B tokens  → Gopher
  Option B: train 70B params on 1.4T tokens   → Chinchilla, ~20 tokens per parameter

  Option B wins on the large majority of downstream benchmarks
  at roughly the same total training compute cost
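
The comparison above only makes sense if both runs cost about the same compute. A rough way to check is the common approximation that training takes about 6 FLOPs per parameter per token. That approximation is an assumption here, not something stated in this article, and the function names are hypothetical. The sketch plugs in the two configurations the Chinchilla paper compared, Gopher (280B parameters, 300B tokens) and Chinchilla (70B parameters, 1.4T tokens), and then inverts the formula to split a budget under the 20:1 rule.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the common C ~ 6 * N * D estimate."""
    return 6.0 * params * tokens


def compute_optimal(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) under D = tokens_per_param * N.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)
print(f"Gopher-style run:     {gopher:.2e} FLOPs")
print(f"Chinchilla-style run: {chinchilla:.2e} FLOPs")

# Invert the budget back into a model size and token count.
n, d = compute_optimal(chinchilla)
print(f"Optimal split: {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Under these assumptions the two budgets land within about 20% of each other, and the second budget maps back to a model of about 70B parameters trained on about 1.4T tokens.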

The insight is that a model parameter is not inherently valuable — it is valuable only if it has seen enough data to learn something reliable. An undertrained large model has many weights that encode noise or redundancy from insufficient exposure to diverse training signal. A smaller model trained on far more tokens has weights that encode more generalizable patterns, because each parameter was updated many more times across more varied data.

This is why the LLaMA family, smaller models trained on substantially more tokens than was conventional, outperformed much larger models at release: the LLaMA paper reports its 13B model beating the 175B-parameter GPT-3 on most benchmarks. It is also why open model training now weighs token count alongside parameter count when allocating compute. More compute should flow into more training tokens, not just more parameters.

The team with the underperforming 70B then trained a 7B model on their full token budget rather than stopping early. The 7B matched the 70B across evaluations at roughly one-tenth the inference cost. The lesson was not "always choose smaller models"; it was to never leave training tokens unspent when you have the compute to run them.
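
The trade in this story can be put into rough numbers. The sketch below assumes training compute scales as parameters times tokens and that per-token inference cost scales linearly with parameter count. Both are common approximations rather than claims from this article, and the helper names are hypothetical. The article does not say how many tokens the team actually had, so the token figure is an upper bound set by compute alone.

```python
def tokens_at_equal_compute(big_params: float, big_tokens: float, small_params: float) -> float:
    """Tokens a smaller model can process for the same N * D training budget."""
    return big_params * big_tokens / small_params


def inference_cost_ratio(small_params: float, big_params: float) -> float:
    """Relative per-token inference cost, assuming cost grows linearly with parameters."""
    return small_params / big_params


# The 70B run stopped at 300B tokens; how far could a 7B go on that same compute?
affordable = tokens_at_equal_compute(70e9, 300e9, 7e9)
print(f"7B tokens at equal compute: {affordable / 1e12:.1f}T")
print(f"Tokens per parameter:       {affordable / 7e9:.0f}")
print(f"Inference cost vs. 70B:     {inference_cost_ratio(7e9, 70e9):.0%}")
```

Under these assumptions the compute spent on the 70B run would cover about 3T tokens for a 7B model, far beyond the 20:1 ratio. Whether that much data is available is a separate constraint.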

Chinchilla showed that model size and training data must scale together. For a fixed compute budget, training a smaller model on proportionally more tokens tends to beat training a larger model on fewer, because undertrained parameters spend compute on weights that never learned enough.
