
Chain-of-Thought: The Prompting Technique That Unlocked LLM Reasoning

Google's 2022 paper showed that prompting models with worked-out reasoning steps substantially improves multi-step reasoning, and a follow-up showed that simply asking a model to 'think step by step' helps too. Why it works and when to use it in production.

Large language models in 2021 had a strange property: scaling them up improved most benchmarks but did little for multi-step reasoning. On math word problems, commonsense inference chains, and symbolic manipulation, accuracy stayed close to flat as parameter counts grew. Adding parameters alone wasn't helping with tasks that required step-by-step thinking.

In January 2022, a team at Google Brain published 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models'. The finding: put a few examples in the prompt where the reasoning steps are written out explicitly, and the model writes out intermediate steps for new questions too. On sufficiently large models, accuracy on multi-step tasks rises substantially. This was not a training change. It was a prompting change.

Standard prompting vs. chain-of-thought

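Standard:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many tennis balls does he have now?
A: 11

Chain-of-thought:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11. The answer is 11.

In both cases the Q/A pair is an exemplar placed in the prompt ahead of the real question. Only the chain-of-thought version shows the model how the answer was reached.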
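The two prompt styles above can be sketched as a small prompt builder. This is a minimal illustration, not the paper's code: the `Exemplar` class, the `build_prompt` function, and the baker question are all hypothetical names and data introduced here, and the 'The answer is X.' suffix is just one reasonable convention for making the final answer easy to find.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    question: str
    reasoning: str  # worked steps in prose; ignored for standard prompting
    answer: str


def build_prompt(exemplars: List[Exemplar], question: str,
                 with_reasoning: bool = True) -> str:
    """Format exemplars as Q/A pairs, then append the new question with an open 'A:'."""
    blocks = []
    for ex in exemplars:
        if with_reasoning and ex.reasoning:
            # Chain-of-thought exemplar: reasoning first, final answer last.
            answer_line = f"A: {ex.reasoning} The answer is {ex.answer}."
        else:
            # Standard exemplar: final answer only.
            answer_line = f"A: The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\n{answer_line}")
    # The model continues from the open "A:" of the real question.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


ROGER = Exemplar(
    question=("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
              "How many tennis balls does he have now?"),
    reasoning="Roger started with 5 balls. 2 cans × 3 balls = 6 more balls. 5 + 6 = 11.",
    answer="11",
)

# Hypothetical new question, used only to show the prompt shape.
NEW_QUESTION = "A baker has 4 trays of 6 rolls and sells 5 rolls. How many rolls are left?"

cot_prompt = build_prompt([ROGER], NEW_QUESTION)
standard_prompt = build_prompt([ROGER], NEW_QUESTION, with_reasoning=False)
```

Sending `cot_prompt` and `standard_prompt` to the same model over the same question set is the simplest way to check whether chain-of-thought helps for a given task.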

Zero-shot chain-of-thought

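A follow-up paper, 'Large Language Models are Zero-Shot Reasoners' (Kojima et al., 2022), found you don't always need few-shot examples. Appending 'Let's think step by step' to a question raised accuracy on the MultiArith arithmetic benchmark from 17.7% to 78.7% with a large InstructGPT model. This zero-shot variant became standard in production prompts.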

The likely reason 'Let's think step by step' works is that it shifts the generation strategy: instead of immediately predicting the final answer token, the model first generates intermediate reasoning steps, and the final answer is then conditioned on that richer context.
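A minimal sketch of the zero-shot variant, assuming a `complete` callable that takes a prompt string and returns the model's continuation. That callable is a stand-in for whatever client you use, not a real library API. The answer extractor is deliberately naive: it takes the last number in the text, which is only sensible for numeric tasks.

```python
import re
from typing import Callable, Optional, Tuple

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase so the model starts with reasoning, not an answer."""
    return f"Q: {question}\nA: {TRIGGER}"


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in the text, or None if there is none."""
    # Strip thousands separators so "1,250" is read as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def answer_with_cot(question: str,
                    complete: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Run one completion and return (reasoning text, extracted final number)."""
    reasoning = complete(zero_shot_cot_prompt(question))
    return reasoning, extract_final_number(reasoning)
```

Keeping the reasoning text alongside the extracted answer makes it possible to inspect the chain when the answer turns out to be wrong.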

Why it works — and its limits

Works best on tasks where step-by-step reasoning is natural: math, logic, multi-step analysis
Less useful for tasks that don't require chaining: simple factual recall, classification
Can produce plausible-looking but wrong chains — confident-sounding steps that don't reach the right answer
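Requires a sufficiently large model: in the original paper, gains appeared only at roughly 100B parameters, and smaller models tended to produce fluent but illogical chains that scored below standard prompting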
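Because chains can look convincing while containing a wrong step, it can help to check whatever is mechanically checkable. The sketch below is one illustrative approach, not a technique from the papers: it scans a chain for simple `a op b = c` statements and reports the ones whose arithmetic does not hold. The `find_bad_steps` name and the regular expression are assumptions made for this example, and the pattern only handles two-operand steps with an optional one-word unit after each operand.

```python
import math
import operator
import re

# Matches steps like "5 + 6 = 11" or "2 cans × 3 balls = 6".
STEP = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:[a-zA-Z]+\s*)?"   # left operand, optional unit word
    r"([+×*/-])\s*"                          # operator
    r"(\d+(?:\.\d+)?)\s*(?:[a-zA-Z]+\s*)?"   # right operand, optional unit word
    r"=\s*(\d+(?:\.\d+)?)"                   # claimed result
)

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "×": operator.mul,
    "*": operator.mul,
    "/": operator.truediv,
}


def find_bad_steps(chain: str) -> list:
    """Return the arithmetic steps in the chain whose stated result is wrong."""
    bad = []
    for m in STEP.finditer(chain):
        left, op, right, claimed = m.groups()
        try:
            actual = OPS[op](float(left), float(right))
        except ZeroDivisionError:
            bad.append(m.group(0))
            continue
        if not math.isclose(actual, float(claimed), rel_tol=1e-9, abs_tol=1e-9):
            bad.append(m.group(0))
    return bad
```

An empty result does not prove the chain is right. It only means no simple arithmetic step was caught being wrong, so logical errors and misread quantities still need other checks.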

Production applications

Complex decision pipelines: prompt the model to reason through criteria before outputting a decision
Code generation: ask for a plan before the implementation
Document analysis: ask for observations before conclusions
System prompts: 'Always reason through the user's request before responding'

For production pipelines, consider separating reasoning and answer into two completions. First call: generate the reasoning chain. Second call: given the reasoning, produce the final output. This lets you log and validate reasoning independently from the output.
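The two-completion pattern might look like the following sketch. The `complete` callable is again a hypothetical stand-in for your model client, and the prompt wording, the `CotResult` class, and the optional `validate` hook are assumptions made for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("cot_pipeline")


@dataclass
class CotResult:
    reasoning: str
    answer: str


def reason_then_answer(task: str,
                       complete: Callable[[str], str],
                       validate: Optional[Callable[[str], bool]] = None) -> CotResult:
    """Generate reasoning in one call, then the final output in a second call."""
    # First call: reasoning only.
    reasoning = complete(
        f"{task}\n\nReason through this step by step. "
        "Do not state the final answer yet."
    ).strip()
    logger.info("reasoning: %s", reasoning)

    # Validate the chain before spending a second call on it.
    if validate is not None and not validate(reasoning):
        raise ValueError("reasoning failed validation")

    # Second call: final output, conditioned on the logged reasoning.
    answer = complete(
        f"{task}\n\nReasoning:\n{reasoning}\n\n"
        "Based on the reasoning above, give only the final answer."
    ).strip()
    return CotResult(reasoning=reasoning, answer=answer)
```

Raising on a failed validation is one option among several. Retrying the first call or routing the task to human review are reasonable alternatives, depending on the pipeline.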
