What Changed: Base LLMs vs. Reasoning Models
o1, o3, Claude 3.7 Sonnet: what makes a 'reasoning model' different from a base LLM? Reasoning learned during training, hidden scratchpads, inference-time compute scaling, and why these models can cost 10x more per query.
OpenAI o1 and o3. Claude 3.7 Sonnet. Gemini 2.0 Flash Thinking. Starting in late 2024, a new class of model appeared: one that still generates tokens one at a time, but spends extra compute producing reasoning before it answers. Here's what changed.
The core shift: inference-time compute scaling
Standard LLMs improve quality by making the model bigger (more parameters) or training it on more data. Reasoning models add a third axis: they spend more compute at inference time. Instead of producing the answer directly, they first generate a long internal chain of thought and only then write the final answer. More thinking generally means better answers on hard problems, paid for in latency and cost.
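One simple, well-documented way to trade inference compute for accuracy is self-consistency: sample several independent answers and take a majority vote. This is not how o1 or Claude produce their internal reasoning (they extend one long chain of thought rather than voting), but it makes the third scaling axis concrete. The sketch assumes each sample is independently correct with probability `p` and that all wrong samples agree with each other, which is the pessimistic case for voting; the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Toy model: each sample is correct with probability p, and all wrong samples
    are assumed to give the same wrong answer (worst case for voting).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    need = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    # Each extra sample costs one more full generation: compute grows linearly with n.
    for n in (1, 3, 5, 9, 21):
        print(f"samples={n:>2}  accuracy={majority_vote_accuracy(0.6, n):.3f}")
```

With p = 0.6, accuracy rises from 0.6 with one sample to 0.648 with three and about 0.683 with five, while cost grows linearly with the number of samples. That trade-off (more inference compute, better accuracy, bigger bill) is the same one reasoning models make, just through one longer chain instead of many short ones.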
What's actually different?
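- Hidden scratchpad: the model generates intermediate reasoning tokens before answering. OpenAI's o-series hides them (you see at most a summary), while Claude 3.7 Sonnet's extended thinking can return them. Either way they are real tokens: they take time to generate and are billed as output tokens.
- Trained to think: reasoning models are typically trained with large-scale reinforcement learning (often after fine-tuning on step-by-step traces) to reason before answering. Exact recipes, such as whether rewards score only the final answer or each intermediate step, are mostly undisclosed.
- Longer TTFT: because the model may generate hundreds to thousands of thinking tokens before the first answer token, Time To First Token (TTFT) is much higher than for standard models.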
- Better on multi-step problems: math olympiad, competitive coding, legal reasoning, complex debugging—tasks that require planning and error-correction benefit most.
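To see why thinking tokens dominate latency, here is a back-of-the-envelope TTFT model. It assumes the prompt is prefilled at one throughput and every hidden thinking token is decoded sequentially at another before the first answer token can stream. The class and function names and the throughputs in the usage lines are hypothetical placeholders; queueing and network time are ignored.

```python
from dataclasses import dataclass


@dataclass
class RequestShape:
    input_tokens: int
    thinking_tokens: int  # hidden reasoning tokens (0 for a base LLM)
    answer_tokens: int


def estimate_ttft_seconds(req: RequestShape, prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Rough seconds until the first visible answer token.

    Prompt tokens are processed in a batched prefill pass; thinking tokens must
    be generated one at a time (decode) before the answer can start.
    """
    prefill_s = req.input_tokens / prefill_tok_per_s
    thinking_s = req.thinking_tokens / decode_tok_per_s
    return prefill_s + thinking_s


def billed_output_tokens(req: RequestShape) -> int:
    # Thinking tokens are billed as output even when the user never sees them.
    return req.thinking_tokens + req.answer_tokens


if __name__ == "__main__":
    base = RequestShape(input_tokens=2_000, thinking_tokens=0, answer_tokens=500)
    reasoning = RequestShape(input_tokens=2_000, thinking_tokens=2_000, answer_tokens=500)
    for name, req in (("base", base), ("reasoning", reasoning)):
        ttft = estimate_ttft_seconds(req, prefill_tok_per_s=5_000, decode_tok_per_s=100)
        print(f"{name:>9}: TTFT ~{ttft:.1f}s, billed output tokens = {billed_output_tokens(req)}")
```

With these made-up throughputs the base request starts answering after about 0.4 s, while the reasoning request waits about 20.4 s and is billed for 2,500 output tokens instead of 500.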
What it means for your system
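| Dimension | Base LLM (e.g., GPT-4o) | Reasoning model (e.g., o3) |
|---|---|---|
| TTFT | Usually under 1 second | Often 5–30 seconds; longer at high reasoning effort |
| Cost per query | Low | Often 10–20x higher (more output tokens and a higher per-token price) |
| Accuracy (math/code) | Moderate | State of the art |
| Accuracy (simple tasks) | Good | About the same, but slower and more expensive |
| Context window | 128K–200K tokens | 128K–200K tokens |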
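The cost row compounds two effects: reasoning models usually charge more per token, and they emit many more output tokens. Below is a minimal per-query cost sketch. The per-million-token prices are hypothetical round numbers, not any provider's current price list, and the request sizes are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_million: float   # USD per 1M input tokens (hypothetical)
    output_per_million: float  # USD per 1M output tokens (hypothetical)


def query_cost(p: Pricing, input_tokens: int, thinking_tokens: int, answer_tokens: int) -> float:
    """Cost in USD of one request; thinking tokens are charged at the output rate."""
    output_tokens = thinking_tokens + answer_tokens
    return (input_tokens * p.input_per_million + output_tokens * p.output_per_million) / 1_000_000


if __name__ == "__main__":
    base_price = Pricing(input_per_million=2.50, output_per_million=10.00)
    reasoning_price = Pricing(input_per_million=10.00, output_per_million=40.00)

    base_cost = query_cost(base_price, input_tokens=2_000, thinking_tokens=0, answer_tokens=500)
    reasoning_cost = query_cost(reasoning_price, input_tokens=2_000, thinking_tokens=2_000, answer_tokens=500)
    print(f"base ${base_cost:.4f}  reasoning ${reasoning_cost:.4f}  ratio {reasoning_cost / base_cost:.1f}x")
```

Output spend alone rises 20x here (4x the price, 5x the tokens); because input tokens only see the 4x price increase, the whole query lands near 12x, inside the table's 10–20x range. The thinking-token count is the biggest lever, and some APIs let you influence or cap it.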
Use reasoning models when accuracy on a hard task is worth the extra cost and latency. Use standard models for everything else. The key skill is routing each request to the right model.
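Routing can start as a few explicit rules before you reach for a learned classifier. The sketch below is a hypothetical rule-based router: the model identifiers, task labels, and the 30-second TTFT assumption are placeholders, not any provider's API. It sends a request to the reasoning model only when the task type benefits from multi-step planning, accuracy matters, and the caller can wait.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider exposes.
BASE_MODEL = "base-model"
REASONING_MODEL = "reasoning-model"

# Task types where planning and error-correction tend to pay off (hypothetical labels).
HARD_TASKS = {"math", "competitive_coding", "legal_reasoning", "debugging"}


@dataclass
class Task:
    kind: str                  # e.g. "math", "summarize", "chat"
    latency_budget_s: float    # how long the caller can wait for the first answer token
    needs_high_accuracy: bool


def route(task: Task, reasoning_ttft_s: float = 30.0) -> str:
    """Return the model to call; default to the cheap, fast model."""
    if task.kind not in HARD_TASKS:
        return BASE_MODEL  # simple tasks: no accuracy gain from reasoning
    if not task.needs_high_accuracy:
        return BASE_MODEL  # hard but low-stakes: not worth 10x the cost
    if task.latency_budget_s < reasoning_ttft_s:
        return BASE_MODEL  # caller can't wait for the thinking phase
    return REASONING_MODEL


if __name__ == "__main__":
    print(route(Task(kind="math", latency_budget_s=60.0, needs_high_accuracy=True)))
    print(route(Task(kind="chat", latency_budget_s=60.0, needs_high_accuracy=True)))
```

Keeping the rules explicit makes each routing decision easy to log and audit; a natural next step is replacing the `kind` label with a cheap classifier while keeping the accuracy and latency gates.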