Model Strategy: When to Use GPT-4, Claude, Gemini, or an Open Model

The model selection decision — capability, cost, latency, data privacy, and fine-tunability. How to build a model strategy that holds up as models evolve.

At some point you will be in a meeting where someone asks, 'Should we use Claude or GPT-4?' The wrong answer is 'whichever benchmarks best.' The right answer is a framework that maps your specific requirements to a model, and that mapping needs revisiting every six months as the landscape shifts.

This is that framework.

The dimensions that actually matter

Dimension	Questions to ask
Task complexity	Is this a lookup, a reasoning task, a creative task, or a multi-step agent workflow?
Latency budget	What's your P99 target? Chat needs <3s TTFT. Background jobs can tolerate 30s.
Cost per request	What's the monthly volume? Can you route by complexity?
Context length	Do you need 200K tokens for long documents, or does 8K cover your task?
Multimodal	Do you need vision? Audio? If yes, that narrows the field significantly.
Tool use quality	For agents, test function calling accuracy. Models vary significantly here.
Output format	Structured JSON? Markdown? Code? Some models are much more reliable for specific formats.
Compliance	Does data need to stay in a specific region? Does your contract require HIPAA/SOC2 coverage?
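The latency and cost rows come down to simple arithmetic. Below is a minimal sketch, assuming you have a list of measured latencies and your provider's per-million-token prices. The function names are hypothetical, and any numbers you pass in are your own measurements, not real pricing.

```python
def p99(latencies_s: list[float]) -> float:
    """Nearest-rank 99th percentile of measured latencies, in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.99 * n), which avoids float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def monthly_cost(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,   # assumption: price is quoted per million input tokens
    usd_per_m_output: float,  # assumption: price is quoted per million output tokens
) -> float:
    """Estimated monthly spend in USD for a fixed average request shape."""
    per_request = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return requests_per_month * per_request
```

Comparing `p99(samples)` against your latency target and `monthly_cost(...)` against your budget gives two pass/fail checks per candidate model before any quality comparison starts.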

The current model landscape (mid-2025)

This section ages fast. Always benchmark the latest model releases against your eval set before switching. Leaderboard rankings do not predict performance on your specific task.

Model	Strongest at	Watch out for
Claude Opus 4	Deep reasoning, long-context, nuanced writing, safety-critical tasks	Slower and pricier than Sonnet; overkill for simple tasks
Claude Sonnet 4	Balanced performance/speed/cost; strong coding and tool use	Not the top choice for very long unstructured creative output
Claude Haiku 4.5	High-volume, latency-sensitive, simple classification and extraction	Weaker on multi-step reasoning
GPT-4o	Multimodal tasks (vision + audio), wide third-party integrations	Context window smaller than Claude at same tier
GPT-4o-mini	Cost-optimised tasks where GPT-4o quality isn't needed	Noticeably weaker reasoning than GPT-4o
Gemini 1.5 Pro	1M token context window, document-heavy tasks, Google Workspace integration	Availability can lag in some regions
Llama 3.1 70B	Self-hosted, cost control, compliance-heavy environments	Needs serving infra; weaker instruction following than frontier
Mistral Large	European data residency, strong code, function calling	Smaller ecosystem than OpenAI/Anthropic

The routing decision tree

Is it a simple classification, extraction, or yes/no task? → Use a small/fast model (Haiku, GPT-4o-mini, Mistral Small)
Does it require multi-step reasoning or tool calls? → Benchmark Sonnet vs. GPT-4o on your task
Is it a long-document task (>50K tokens)? → Claude or Gemini 1.5 Pro
Is it multimodal (images/audio)? → GPT-4o or Gemini
Is data sovereignty required? → Self-hosted (Llama, Mistral) or region-locked API
Is it a high-stakes reasoning task where quality outweighs everything else? → Benchmark Claude Opus against GPT-4o on your eval set
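The questions above can be sketched as a small rule table. The `Task` fields and the `recommendations` function are hypothetical names chosen for illustration. The article does not say how to resolve a task that matches several questions, so this sketch returns every matching recommendation in the listed order rather than picking a single winner.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields, one per question in the decision tree.
    simple: bool = False            # classification, extraction, yes/no
    multi_step: bool = False        # multi-step reasoning or tool calls
    input_tokens: int = 0           # approximate prompt size
    multimodal: bool = False        # images or audio
    data_sovereignty: bool = False  # data must stay in a region or on-prem
    high_stakes: bool = False       # quality outweighs cost and latency


def recommendations(task: Task) -> list[str]:
    """Return the advice for every question the task answers 'yes' to."""
    rules = [
        (task.simple, "small/fast model"),
        (task.multi_step, "benchmark Sonnet vs. GPT-4o"),
        (task.input_tokens > 50_000, "Claude or Gemini 1.5 Pro"),
        (task.multimodal, "GPT-4o or Gemini"),
        (task.data_sovereignty, "self-hosted or region-locked API"),
        (task.high_stakes, "benchmark Claude Opus vs. GPT-4o"),
    ]
    return [advice for matched, advice in rules if matched]
```

When several rules match, for example a simple extraction task that also has a data sovereignty requirement, treating the compliance rule as a hard filter on the other candidates is one reasonable policy. That choice is yours to make.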

Build a model selection eval

Don't pick based on vibes. Build a 100-example eval on your specific task. Run every candidate model. Score with your LLM judge. Normalise by cost per request. The table of (model, quality score, cost) is the only honest basis for a model selection decision.
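A minimal sketch of that table, assuming you already have per-example judge scores and a measured cost per request for each candidate. The `Result` type and `selection_table` function are hypothetical, and "quality per dollar" is only one way to read "normalise by cost". You may prefer a quality floor followed by picking the lowest cost.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    model: str
    scores: list[float]      # judge score per eval example, assumed to be in 0..1
    cost_per_request: float  # measured average cost in USD, must be > 0


def selection_table(results: list[Result]) -> list[tuple[str, float, float, float]]:
    """Rows of (model, mean quality, cost per request, quality per dollar),
    sorted with the best quality per dollar first."""
    rows = []
    for r in results:
        quality = mean(r.scores)
        rows.append((r.model, quality, r.cost_per_request, quality / r.cost_per_request))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Keeping the raw quality and cost columns next to the ratio matters: a cheap model can top the ratio while still falling below the quality your task needs.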

Rerun this eval every quarter. The landscape shifts. A model that was the clear winner 6 months ago may have been overtaken — or may have degraded if the provider updated the serving infrastructure in ways that affect your use case (this happens more often than providers admit).
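One way to make the quarterly rerun actionable is to diff the new scores against the previous run. This is a sketch under assumptions: scores are mean judge scores keyed by model name, and the `tolerance` default is an arbitrary placeholder you would tune to the noise in your own eval.

```python
def compare_runs(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumption: drops smaller than this are noise
) -> tuple[str, list[str]]:
    """Return (best model in the current run, models whose score dropped)."""
    best = max(current, key=current.get)
    regressed = sorted(
        model
        for model in current
        if model in previous and previous[model] - current[model] > tolerance
    )
    return best, regressed
```

A change in `best` suggests the incumbent has been overtaken. A non-empty `regressed` list suggests a model's behaviour has shifted on your task since the last run.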
