
What Changed: Base LLMs vs. Reasoning Models

o1, o3, Claude 3.7 Sonnet: what makes a 'reasoning model' different from a base LLM? Chain-of-thought at training time, hidden scratchpads, inference-time compute scaling, and why these models can cost 10x more per query.

OpenAI o1. o3. Claude 3.7 Sonnet. Gemini 2.0 Flash Thinking. A new class of model started appearing in late 2024: one that still predicts tokens, but spends extra compute thinking before it answers. Here's exactly what changed.

The core shift: inference-time compute scaling

Standard LLMs improve quality by making the model bigger (more parameters) or training it on more data. Reasoning models add a third axis: they spend more compute at inference time. Instead of producing the answer in a single pass, they first generate a long internal chain of thought, then the final answer.

What's actually different?

Hidden scratchpad: the model generates intermediate reasoning tokens that are usually hidden from the user or shown only in summarized form. These are real tokens: they take time to generate and they cost money.
Trained on thinking traces: reasoning models are trained to reason step by step before answering, typically with reinforcement learning that rewards correct outcomes and, in some setups, the intermediate steps (process-level rewards).
Longer TTFT: because the model may generate thousands of thinking tokens before the first response token, time to first token (TTFT) is dramatically higher than for standard models.
Better on multi-step problems: math olympiad, competitive coding, legal reasoning, complex debugging—tasks that require planning and error-correction benefit most.
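
To make the scratchpad and TTFT points concrete, here is a rough back-of-the-envelope estimator. It is only a sketch: the prices, decode speed, prefill time, and token counts are made-up illustrative values, not any provider's actual rates, and it assumes thinking tokens are billed like output tokens and generated sequentially before the first visible token.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    cost_usd: float
    ttft_seconds: float


def estimate_query(
    prompt_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_mtok: float,   # hypothetical USD per 1M input tokens
    output_price_per_mtok: float,  # hypothetical USD per 1M output tokens
    decode_tokens_per_sec: float,  # hypothetical generation speed
    prefill_seconds: float = 0.3,  # hypothetical prompt-processing time
) -> QueryEstimate:
    # Assumption: hidden thinking tokens are billed at the output rate.
    billed_output = thinking_tokens + answer_tokens
    cost = (
        prompt_tokens * input_price_per_mtok
        + billed_output * output_price_per_mtok
    ) / 1_000_000
    # Assumption: the first visible token arrives only after all thinking tokens.
    ttft = prefill_seconds + thinking_tokens / decode_tokens_per_sec
    return QueryEstimate(cost_usd=cost, ttft_seconds=ttft)


# Same hypothetical prices and speed; only the thinking-token count differs.
base = estimate_query(1_000, 0, 500, 2.0, 8.0, 100.0)
reasoning = estimate_query(1_000, 2_000, 500, 2.0, 8.0, 100.0)
```

With these made-up numbers, the reasoning query costs roughly 3.7x more and delivers its first token about 20 seconds later, even though the per-token price is identical. A higher per-token price on top of that widens the gap further.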

What it means for your system

Dimension	Base LLM (e.g. GPT-4o)	Reasoning model (e.g. o3)
TTFT	< 1 second	5–30 seconds
Cost per query	Low	10–20x higher
Accuracy (math/code)	Moderate	State of the art
Accuracy (simple tasks)	Baseline	About the same, but slower
Context window	128K–200K tokens	128K–200K tokens

Use reasoning models when accuracy on a hard task is worth paying for. Use standard models for everything else. The key skill is routing correctly.
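
One way to picture routing is a small rule-based gate in front of the two model tiers. The sketch below is an assumption-heavy illustration, not a recommended production design: the model labels, the keyword hints, the `high_stakes` flag, and the 5-second TTFT threshold are all hypothetical, and a real router would more likely use a classifier or evaluation data than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical model labels; substitute whatever your provider exposes.
STANDARD_MODEL = "standard-model"
REASONING_MODEL = "reasoning-model"

# Hypothetical hints that a task needs multi-step planning.
HARD_TASK_HINTS = ("prove", "debug", "step by step", "optimize", "derive")


@dataclass
class Request:
    prompt: str
    max_ttft_seconds: float    # latency budget the caller can tolerate
    high_stakes: bool = False  # True when a wrong answer is costly


def looks_hard(prompt: str) -> bool:
    """Crude keyword check standing in for a real difficulty classifier."""
    text = prompt.lower()
    return any(hint in text for hint in HARD_TASK_HINTS)


def route(request: Request, reasoning_ttft_seconds: float = 5.0) -> str:
    # Latency first: if the caller cannot wait for thinking tokens,
    # the reasoning model is not an option regardless of difficulty.
    if request.max_ttft_seconds < reasoning_ttft_seconds:
        return STANDARD_MODEL
    # Pay for reasoning only when the task looks hard or errors are costly.
    if request.high_stakes or looks_hard(request.prompt):
        return REASONING_MODEL
    return STANDARD_MODEL
```

For example, `route(Request("Debug this race condition", 60.0))` selects the reasoning tier, while the same prompt with a 1-second latency budget falls back to the standard tier. Checking latency before difficulty reflects the table above: a slow first token is a hard constraint for interactive use, whereas accuracy is a trade-off.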
