GenAI Systems Lab
AI Engineering 10 min read

What Changed: Base LLMs vs. Reasoning Models

o1, o3, Claude 3.7 Sonnet: what makes a 'reasoning model' different from a base LLM? Chain-of-thought at training time, hidden scratchpads, inference-time compute scaling, and why these models can cost 10–20x more per query.

OpenAI o1. o3. Claude 3.7 Sonnet. Gemini 2.0 Flash Thinking. A new class of model began appearing in 2024: one that doesn't just start emitting its answer right away, but first spends extra compute reasoning through the problem. Here's what changed.

The core shift: inference-time compute scaling

Standard LLMs scale quality by making the model bigger (more parameters) or by training on more data. Reasoning models add a third axis: they spend more compute at inference time. Instead of producing the answer in a single pass, they first generate a long internal chain-of-thought, and only then write the final answer.
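
To make that trade-off concrete, here is a rough sketch of how a hidden chain-of-thought affects per-query cost and the wait before the first visible token. It assumes reasoning tokens are billed at the output-token rate and must be generated before the answer begins. The prices, token counts, decode speed, and prefill time are made-up illustrative numbers, not any vendor's actual figures.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    prompt_tokens: int
    reasoning_tokens: int  # hidden chain-of-thought; 0 for a base LLM
    answer_tokens: int


def query_cost(profile, price_in_per_m, price_out_per_m):
    """Dollar cost, ASSUMING reasoning tokens are billed as output tokens."""
    billed_output = profile.reasoning_tokens + profile.answer_tokens
    total = profile.prompt_tokens * price_in_per_m + billed_output * price_out_per_m
    return total / 1_000_000


def seconds_to_first_visible_token(profile, decode_tokens_per_s, prefill_s=0.5):
    """The answer cannot start until the chain-of-thought has been generated."""
    return prefill_s + profile.reasoning_tokens / decode_tokens_per_s


# Hypothetical workloads: identical prompt and answer, only the reasoning differs.
base = QueryProfile(prompt_tokens=1_000, reasoning_tokens=0, answer_tokens=300)
reasoning = QueryProfile(prompt_tokens=1_000, reasoning_tokens=4_000, answer_tokens=300)

# Illustrative prices in dollars per million tokens, kept equal for both
# profiles so the difference comes purely from the extra tokens.
PRICE_IN, PRICE_OUT = 2.0, 8.0

base_cost = query_cost(base, PRICE_IN, PRICE_OUT)
reasoning_cost = query_cost(reasoning, PRICE_IN, PRICE_OUT)

print(f"cost ratio: {reasoning_cost / base_cost:.1f}x")
print(f"wait before answer: {seconds_to_first_visible_token(reasoning, 200):.1f}s")
```

With these invented numbers, the extra reasoning tokens alone account for a cost ratio of roughly 8x and a wait of about 20 seconds, even though the per-token price is the same for both profiles.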

What's actually different architecturally?

What it means for your system

| Dimension | Base LLM (e.g. GPT-4o) | Reasoning Model (e.g. o3) |
| --- | --- | --- |
| TTFT | < 1 second | 5–30 seconds |
| Cost/query | Low | 10–20x higher |
| Accuracy (math/code) | Moderate | State of the art |
| Accuracy (simple tasks) | Same | Same or slower |
| Context window | 128K–200K | 128K–200K |

Use reasoning models when accuracy on a hard task is worth paying for. Use standard models for everything else. The key skill is routing correctly.
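
Routing can start as a simple rule sitting in front of the two model tiers. The sketch below is one hypothetical heuristic, not a recommended policy. The keyword pattern, the 5-second latency threshold, and the tier names `base-model` and `reasoning-model` are placeholders invented for illustration.

```python
import re
from dataclasses import dataclass

# Placeholder tier names; substitute whatever models you actually deploy.
BASE_MODEL = "base-model"
REASONING_MODEL = "reasoning-model"

# Crude, hypothetical signals that a prompt involves multi-step math or code.
HARD_TASK_PATTERN = re.compile(
    r"\b(prove|derive|debug|optimi[sz]e|refactor|algorithm|step[- ]by[- ]step)\b",
    re.IGNORECASE,
)


@dataclass
class Route:
    model: str
    reason: str


def route(prompt: str, latency_budget_s: float, error_cost_is_high: bool) -> Route:
    """Pay for reasoning only when the task looks hard, a wrong answer is
    costly, and the caller can tolerate a multi-second wait."""
    looks_hard = bool(HARD_TASK_PATTERN.search(prompt))
    if latency_budget_s < 5.0:
        return Route(BASE_MODEL, "latency budget too tight for a reasoning pass")
    if looks_hard and error_cost_is_high:
        return Route(REASONING_MODEL, "hard task where accuracy is worth the cost")
    return Route(BASE_MODEL, "default: cheap and fast is good enough")


hard = route("Debug this race condition in my scheduler", 60, True)
easy = route("Summarize this email in two sentences", 60, True)
urgent = route("Debug this race condition in my scheduler", 1, True)
print(hard.model, "|", easy.model, "|", urgent.model)
```

A keyword match is a weak proxy for difficulty. In practice the same interface could sit in front of a small classifier, or a base-model first pass that escalates on low confidence, without changing the callers.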
