Chain-of-Thought: The Prompting Technique That Unlocked LLM Reasoning
Google's 2022 paper showed that prompting models to write out intermediate reasoning steps dramatically improves multi-step reasoning. Here is why it works and when to use it in production.
Large language models in 2021 had a strange property: scaling them up helped on most tasks, but not on multi-step reasoning. On math word problems, commonsense inference chains, and symbolic manipulation, accuracy stayed nearly flat as models grew. Adding parameters alone wasn't helping with tasks that required step-by-step thinking.
In January 2022, a team at Google Brain published 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models'. The finding: include a few examples in the prompt where the reasoning steps are written out explicitly, and the model produces intermediate steps for new questions too, with large accuracy gains on multi-step tasks. This was not a training change. It was a prompting change.
Standard prompting vs. chain-of-thought
Standard:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many tennis balls does he have now?
A: 11
Chain-of-thought:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11. The answer is 11.
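The two formats above differ only in how the example answer is written, which makes them easy to generate from the same data. Below is a minimal Python sketch of that idea. The `Exemplar` class and `build_prompt` function are hypothetical names introduced for illustration, not part of the paper or of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One worked example to place in the prompt (hypothetical helper type)."""
    question: str
    reasoning: str  # the intermediate steps, written out in plain language
    answer: str


def build_prompt(exemplars: List[Exemplar], question: str, chain_of_thought: bool) -> str:
    """Assemble a few-shot prompt; only the exemplar answers differ between modes."""
    blocks = []
    for ex in exemplars:
        if chain_of_thought:
            shown = f"{ex.reasoning} The answer is {ex.answer}."
        else:
            shown = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {shown}")
    # Leave the last answer open so the model continues in the exemplars' style.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


ROGER = Exemplar(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
        "How many tennis balls does he have now?"
    ),
    reasoning="Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11.",
    answer="11",
)

if __name__ == "__main__":
    new_question = "A baker has 12 eggs and uses 3 eggs per cake. How many cakes can she bake?"
    print(build_prompt([ROGER], new_question, chain_of_thought=True))
```

Keeping the reasoning as a separate field means the same exemplar set can drive an A/B comparison between standard and chain-of-thought prompts.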
Zero-shot chain-of-thought
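A follow-up paper, 'Large Language Models are Zero-Shot Reasoners' (Kojima et al., 2022), found you don't always need few-shot examples. Appending 'Let's think step by step' to a question sharply improved accuracy on reasoning tasks; on the MultiArith benchmark the authors report a jump from 17.7% to 78.7% with an InstructGPT model. This zero-shot variant became standard in production prompts.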
The usual explanation for why 'Let's think step by step' works is that it shifts the generation strategy: instead of immediately predicting the final answer, the model first generates intermediate reasoning steps, and the final answer is then conditioned on that richer context.
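In code, the zero-shot variant is little more than string concatenation plus a way to pull the final answer out of a longer completion. The sketch below assumes a numeric answer and uses a "last number in the text" heuristic, which is a simplification chosen for illustration. The function names are hypothetical.

```python
import re
from typing import Optional

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger so the completion starts with reasoning, not the answer."""
    return f"Q: {question.strip()}\nA: {TRIGGER}"


def extract_final_number(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none.

    Heuristic only: it assumes the model states its final answer last.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


if __name__ == "__main__":
    print(zero_shot_cot_prompt("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
                               "How many tennis balls does he have now?"))
    sample = "Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11. The answer is 11."
    print(extract_final_number(sample))
```

A more robust setup would ask the model for a fixed answer format and parse that instead of relying on position.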
Why it works — and its limits
- Works best on tasks where step-by-step reasoning is natural: math, logic, multi-step analysis
- Less useful for tasks that don't require chaining: simple factual recall, classification
- Can produce plausible-looking but wrong chains — confident-sounding steps that don't reach the right answer
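- Requires a sufficiently large model: in the original paper, gains appeared only at roughly 100B parameters, and smaller models produced fluent but illogical chains that scored worse than standard prompting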
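Because a chain can sound confident and still be wrong, it can help to check whatever parts of it are mechanically checkable. The sketch below re-computes simple arithmetic steps such as "2 cans × 3 balls = 6" inside a chain and flags the ones that do not add up. It is an illustrative assumption about how such a check might look, not a method from the paper. It only catches arithmetic slips, not a wrong problem setup, and the regular expression handles only one binary operation per step.

```python
import re
from typing import List

NUM = r"-?\d+(?:\.\d+)?"
UNIT = r"(?:\s+[A-Za-z]+)?"  # optional unit word, e.g. "cans" or "balls"

# Matches steps like "5 + 6 = 11" or "2 cans × 3 balls = 6".
STEP = re.compile(rf"({NUM}){UNIT}\s*([+\-×*/])\s*({NUM}){UNIT}\s*=\s*({NUM})")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "×": lambda a, b: a * b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_bad_steps(chain: str) -> List[str]:
    """Return every arithmetic step whose stated result does not match."""
    bad = []
    for match in STEP.finditer(chain):
        left, op, right, claimed = match.groups()
        try:
            actual = OPS[op](float(left), float(right))
        except ZeroDivisionError:
            bad.append(match.group(0))
            continue
        if abs(actual - float(claimed)) > 1e-9:
            bad.append(match.group(0))
    return bad


if __name__ == "__main__":
    good = "Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11."
    wrong = "Roger started with 5 balls. 2 cans × 3 balls = 5 more balls. 5 + 5 = 10."
    print(find_bad_steps(good))
    print(find_bad_steps(wrong))
```

In the second chain the final addition is internally consistent, so the answer looks well supported; only the earlier multiplication step reveals the error.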
Production applications
- Complex decision pipelines: prompt the model to reason through criteria before outputting a decision
- Code generation: ask for a plan before the implementation
- Document analysis: ask for observations before conclusions
- System prompts: 'Always reason through the user's request before responding'
For production pipelines, consider separating reasoning and answer into two completions. First call: generate the reasoning chain. Second call: given the reasoning, produce the final output. This lets you log and validate reasoning independently from the output.
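The two-completion pattern can be sketched as follows. The `complete` argument is a stand-in for whatever function calls your model: it is assumed to take a prompt string and return the generated text, and is not a real library API. The prompt wording and the `non_empty` validator are likewise placeholders you would replace with your own.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("cot_pipeline")

# Assumed interface: prompt text in, completion text out.
Complete = Callable[[str], str]


@dataclass
class Result:
    reasoning: str
    answer: str


def non_empty(reasoning: str) -> bool:
    """Placeholder validator; swap in domain-specific checks."""
    return bool(reasoning.strip())


def reason_then_answer(
    task: str,
    complete: Complete,
    validate: Callable[[str], bool] = non_empty,
) -> Result:
    """Call 1 produces the reasoning chain; call 2 turns it into the final output."""
    reasoning = complete(
        f"{task}\n\nReason through this step by step. Do not state a final answer yet."
    )
    logger.info("reasoning: %s", reasoning)  # logged separately from the answer
    if not validate(reasoning):
        raise ValueError("reasoning failed validation; no answer was requested")
    answer = complete(
        f"{task}\n\nReasoning:\n{reasoning}\n\n"
        "Based only on the reasoning above, give the final answer."
    )
    return Result(reasoning=reasoning, answer=answer.strip())
```

Splitting the calls costs an extra round trip, but it gives you a point where a bad chain can be rejected before any answer is produced.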