Model Strategy: When to Use GPT-4, Claude, Gemini, or an Open Model
The model selection decision — capability, cost, latency, data privacy, and fine-tunability. How to build a model strategy that holds up as models evolve.
You will, at some point, be in a meeting where someone asks: 'Should we use Claude or GPT-4?' The wrong answer is 'whichever benchmarks best.' The right answer is a framework that maps your specific requirements to a model. The framework stays stable; the model it points to changes roughly every six months as the landscape shifts.
This is that framework.
The dimensions that actually matter
| Dimension | Questions to ask |
|---|---|
| Task complexity | Is this a lookup, a reasoning task, a creative task, or a multi-step agent workflow? |
| Latency budget | What's your P99 target? Interactive chat needs time to first token (TTFT) under 3s. Background jobs can tolerate 30s. |
| Cost per request | What's the monthly volume? Can you route by complexity? |
| Context length | Do you need 200K tokens for long documents, or does 8K cover your task? |
| Multimodal | Do you need vision? Audio? If yes, that narrows the field significantly. |
| Tool use quality | For agents, test function calling accuracy. Models vary significantly here. |
| Output format | Structured JSON? Markdown? Code? Some models are much more reliable for specific formats. |
| Compliance | Does data need to stay in a specific region? Does your contract require HIPAA or SOC 2 coverage? |
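
Several of these dimensions are hard constraints rather than trade-offs, so it can help to filter candidates on them before comparing quality. The following is a minimal sketch, assuming you record each candidate's properties yourself; the `Requirements` and `Candidate` types, the model names, and all numbers are hypothetical placeholders, not vendor specifications.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Requirements:
    max_ttft_seconds: float          # latency budget (time to first token)
    min_context_tokens: int          # longest input you must support
    needs_vision: bool
    needs_audio: bool
    required_region: Optional[str]   # None means no residency constraint


@dataclass(frozen=True)
class Candidate:
    name: str
    measured_ttft_seconds: float     # measure this yourself; do not trust a spec sheet
    context_tokens: int
    vision: bool
    audio: bool
    regions: FrozenSet[str]


def passes_hard_constraints(req: Requirements, cand: Candidate) -> bool:
    """True if the candidate violates none of the non-negotiable requirements."""
    if cand.measured_ttft_seconds > req.max_ttft_seconds:
        return False
    if cand.context_tokens < req.min_context_tokens:
        return False
    if req.needs_vision and not cand.vision:
        return False
    if req.needs_audio and not cand.audio:
        return False
    if req.required_region is not None and req.required_region not in cand.regions:
        return False
    return True


def shortlist(req: Requirements, candidates: List[Candidate]) -> List[str]:
    """Names of the candidates worth running through the quality eval."""
    return [c.name for c in candidates if passes_hard_constraints(req, c)]


# Hypothetical example data, for illustration only.
req = Requirements(max_ttft_seconds=3.0, min_context_tokens=100_000,
                   needs_vision=True, needs_audio=False, required_region="eu")
candidates = [
    Candidate("model-a", 1.2, 200_000, True, False, frozenset({"us", "eu"})),
    Candidate("model-b", 0.6, 32_000, True, True, frozenset({"us", "eu"})),
    Candidate("model-c", 2.0, 128_000, True, False, frozenset({"us"})),
]
print(shortlist(req, candidates))
```

Only the candidates that survive this filter need to go through the more expensive quality evaluation.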
The current model landscape (mid-2025)
This section ages fast. Always benchmark the latest model releases against your eval set before switching. Leaderboard rankings do not predict performance on your specific task.
| Model | Strongest at | Watch out for |
|---|---|---|
| Claude Opus 4 | Deep reasoning, long-context, nuanced writing, safety-critical tasks | Slower and pricier than Sonnet; overkill for simple tasks |
| Claude Sonnet 4 | Balanced performance/speed/cost; strong coding and tool use | Not the top choice for very long unstructured creative output |
| Claude Haiku 4.5 | High-volume, latency-sensitive, simple classification and extraction | Weaker on multi-step reasoning |
| GPT-4o | Multimodal tasks (vision + audio), wide third-party integrations | Smaller context window than Claude at the same tier |
| GPT-4o-mini | Cost-optimised tasks where GPT-4o quality isn't needed | Noticeably weaker reasoning than GPT-4o |
| Gemini 1.5 Pro | 1M token context window, document-heavy tasks, Google Workspace integration | Availability can lag in some regions |
| Llama 3.1 70B | Self-hosted, cost control, compliance-heavy environments | Needs serving infra; weaker instruction following than frontier |
| Mistral Large | European data residency, strong code, function calling | Smaller ecosystem than OpenAI/Anthropic |
The routing decision tree
- Is it a simple classification, extraction, or yes/no task? → Use a small/fast model (Haiku, GPT-4o-mini, Mistral small)
- Does it require multi-step reasoning or tool calls? → Benchmark Sonnet vs. GPT-4o on your task
- Is it a long-document task (>50K tokens)? → Claude or Gemini 1.5 Pro
- Is it multimodal (images/audio)? → GPT-4o or Gemini
- Is data sovereignty required? → Self-hosted (Llama, Mistral) or region-locked API
- Is it a high-stakes reasoning task where quality > everything? → Benchmark Claude Opus vs. GPT-4o on your eval set
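
The rules above can be sketched as a routing function. This is an illustration, not a definitive router: the `Task` type and its fields are hypothetical, and the precedence order is an assumption (the list does not say which rule wins when several match). Here hard constraints are checked first, then capability needs, then cost.

```python
from dataclasses import dataclass

SELF_HOSTED = "self-hosted (Llama, Mistral) or region-locked API"
MULTIMODAL = "GPT-4o or Gemini"
LONG_DOC = "Claude or Gemini 1.5 Pro"
HIGH_STAKES = "benchmark Claude Opus vs. GPT-4o on your eval set"
SMALL_FAST = "small/fast model (Haiku, GPT-4o-mini, Mistral small)"
MID_TIER = "benchmark Sonnet vs. GPT-4o on your task"
NO_MATCH = "no rule matched: run the full eval"  # fallback is an assumption

SIMPLE_KINDS = {"classification", "extraction", "yes_no"}
REASONING_KINDS = {"multi_step_reasoning", "tool_use"}


@dataclass(frozen=True)
class Task:
    kind: str                       # one of the kinds above, or anything else
    input_tokens: int = 0
    multimodal: bool = False
    data_sovereignty: bool = False
    high_stakes: bool = False


def route(task: Task) -> str:
    """Return a routing recommendation; assumed precedence, first match wins."""
    if task.data_sovereignty:           # hard constraint, so checked first
        return SELF_HOSTED
    if task.multimodal:
        return MULTIMODAL
    if task.input_tokens > 50_000:      # the >50K threshold from the list
        return LONG_DOC
    if task.high_stakes:
        return HIGH_STAKES
    if task.kind in SIMPLE_KINDS:
        return SMALL_FAST
    if task.kind in REASONING_KINDS:
        return MID_TIER
    return NO_MATCH


print(route(Task(kind="classification")))
print(route(Task(kind="tool_use", input_tokens=80_000)))
```

Note that several branches return "benchmark X vs. Y" rather than a single model: the tree narrows the field, and the eval makes the final call.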
Build a model selection eval
Don't pick based on vibes. Build a 100-example eval on your specific task. Run every candidate model. Score with your LLM judge. Normalise by cost per request. The table of (model, quality score, cost) is the only honest basis for a model selection decision.
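
A minimal sketch of that table follows, assuming you already have per-example judge scores in the range 0 to 1 and a measured cost per request for each model. "Normalise by cost" is interpreted here as quality per dollar, which is one reasonable choice among several; the `min_quality` floor is an added assumption so that a cheap but unusable model cannot win on price alone. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass(frozen=True)
class EvalResult:
    model: str
    judge_scores: Tuple[float, ...]   # one LLM-judge score per eval example, 0..1
    cost_per_request_usd: float


def summarise(results: List[EvalResult], min_quality: float = 0.0):
    """Rows of (model, quality, cost, quality per dollar), best value first."""
    rows = []
    for r in results:
        quality = mean(r.judge_scores)
        if quality < min_quality:
            continue  # below the quality floor: not a candidate at any price
        rows.append((r.model, quality, r.cost_per_request_usd,
                     quality / r.cost_per_request_usd))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical results; a real eval would have ~100 scores per model.
results = [
    EvalResult("model-a", (0.9, 0.8), 0.02),
    EvalResult("model-b", (0.7, 0.7), 0.002),
]

for model, quality, cost, per_dollar in summarise(results, min_quality=0.8):
    print(f"{model}: quality={quality:.2f} cost=${cost:.4f} quality/$={per_dollar:.1f}")
```

Without the floor, the cheaper model ranks first on quality per dollar; with `min_quality=0.8` it drops out entirely. Where you set that floor is a product decision, not something the table can tell you.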
Rerun this eval every quarter. The landscape shifts. A model that was the clear winner 6 months ago may have been overtaken — or may have degraded if the provider updated the serving infrastructure in ways that affect your use case (this happens more often than providers admit).
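
One way to make the quarterly rerun actionable is to compare each run against a stored baseline and flag drops. This is a sketch under assumptions: scores are mean judge scores keyed by model name, and the `tolerance` of 0.02 is an arbitrary placeholder that you would tune to the noise level of your own eval.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Models whose quality dropped by more than `tolerance` since the baseline.

    Models missing from `current` are skipped, and models that are new in
    `current` are ignored, since neither has a score pair to compare.
    """
    regressions = {}
    for model, old_score in baseline.items():
        new_score = current.get(model)
        if new_score is not None and old_score - new_score > tolerance:
            regressions[model] = round(old_score - new_score, 4)
    return regressions


# Hypothetical scores from two quarterly runs.
last_quarter = {"model-a": 0.85, "model-b": 0.70}
this_quarter = {"model-a": 0.79, "model-b": 0.71, "model-c": 0.90}

print(find_regressions(last_quarter, this_quarter))
```

Keeping the eval set fixed between runs matters here: if the examples change, a score drop no longer tells you anything about the model.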