GenAI Systems Lab

Model Strategy: When to Use GPT-4, Claude, Gemini, or an Open Model

The model selection decision — capability, cost, latency, data privacy, and fine-tunability. How to build a model strategy that holds up as models evolve.

You will, at some point, be in a meeting where someone asks: 'Should we use Claude or GPT-4?' The wrong answer is 'whichever benchmarks best.' The right answer is a framework that maps your specific requirements to a model. The framework stays stable; the model it picks will change every six months or so as the landscape shifts.

This is that framework.

The dimensions that actually matter

| Dimension | Questions to ask |
| --- | --- |
| Task complexity | Is this a lookup, a reasoning task, a creative task, or a multi-step agent workflow? |
| Latency budget | What's your P99 target? Chat needs <3s TTFT. Background jobs can tolerate 30s. |
| Cost per request | What's the monthly volume? Can you route by complexity? |
| Context length | Do you need 200K tokens for long documents, or does 8K cover your task? |
| Multimodal | Do you need vision? Audio? If yes, that narrows the field significantly. |
| Tool use quality | For agents, test function calling accuracy. Models vary significantly here. |
| Output format | Structured JSON? Markdown? Code? Some models are much more reliable for specific formats. |
| Compliance | Does data need to stay in a specific region? Does your contract require HIPAA/SOC2 coverage? |
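
Several of these dimensions are hard constraints rather than trade-offs: a model that cannot fit your context, lacks vision, or is not served in your region is out before any quality comparison. A minimal sketch of that first-pass filter follows. The `Requirements` and `Candidate` types, their fields, and the `model-a`/`model-b`/`model-c` entries are hypothetical placeholders, not real model specs; fill them in from provider documentation and your own latency measurements.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hard constraints for one use case (illustrative field names)."""
    min_context_tokens: int
    needs_vision: bool
    allowed_regions: frozenset[str]
    max_p99_latency_s: float


@dataclass(frozen=True)
class Candidate:
    """One candidate model; values should come from docs and your own measurements."""
    name: str
    context_tokens: int
    supports_vision: bool
    regions: frozenset[str]
    measured_p99_latency_s: float


def shortlist(req: Requirements, candidates: list[Candidate]) -> list[Candidate]:
    """Drop every candidate that violates a hard constraint."""
    kept = []
    for c in candidates:
        if c.context_tokens < req.min_context_tokens:
            continue  # cannot fit the documents
        if req.needs_vision and not c.supports_vision:
            continue  # missing a required modality
        if not (c.regions & req.allowed_regions):
            continue  # not served in any permitted region
        if c.measured_p99_latency_s > req.max_p99_latency_s:
            continue  # too slow for the latency budget
        kept.append(c)
    return kept


# Placeholder data: these are not real model specifications.
req = Requirements(
    min_context_tokens=100_000,
    needs_vision=False,
    allowed_regions=frozenset({"eu"}),
    max_p99_latency_s=3.0,
)
candidates = [
    Candidate("model-a", 200_000, True, frozenset({"us", "eu"}), 2.1),
    Candidate("model-b", 8_000, False, frozenset({"eu"}), 0.6),
    Candidate("model-c", 128_000, True, frozenset({"us"}), 1.5),
]
print([c.name for c in shortlist(req, candidates)])  # ['model-a']
```

Only the models that survive this filter are worth comparing on quality and cost.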

The current model landscape (mid-2025)

This section ages fast. Always benchmark the latest model releases against your eval set before switching. Leaderboard rankings do not predict performance on your specific task.

| Model | Strongest at | Watch out for |
| --- | --- | --- |
| Claude Opus 4 | Deep reasoning, long-context, nuanced writing, safety-critical tasks | Slower and pricier than Sonnet; overkill for simple tasks |
| Claude Sonnet 4 | Balanced performance/speed/cost; strong coding and tool use | Not the top choice for very long unstructured creative output |
| Claude Haiku 4.5 | High-volume, latency-sensitive, simple classification and extraction | Weaker on multi-step reasoning |
| GPT-4o | Multimodal tasks (vision + audio), wide third-party integrations | Context window smaller than Claude at same tier |
| GPT-4o-mini | Cost-optimised tasks where GPT-4o quality isn't needed | Noticeably weaker reasoning than GPT-4o |
| Gemini 1.5 Pro | 1M token context window, document-heavy tasks, Google Workspace integration | Availability can lag in some regions |
| Llama 3.1 70B | Self-hosted, cost control, compliance-heavy environments | Needs serving infra; weaker instruction following than frontier |
| Mistral Large | European data residency, strong code, function calling | Smaller ecosystem than OpenAI/Anthropic |

The routing decision tree
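
A routing tree of this kind can be written as a small function that maps a request's properties to a model tier. The sketch below is one possible shape, not a prescribed tree: the tier names, the `Complexity` categories (borrowed from the task complexity question above), the order of the checks, and the 3-second interactive threshold are all illustrative assumptions you should replace with what your own eval supports.

```python
from enum import Enum


class Complexity(Enum):
    LOOKUP = "lookup"
    REASONING = "reasoning"
    CREATIVE = "creative"
    AGENT = "agent"


# Assumption: requests with a budget at or under this are interactive chat.
INTERACTIVE_BUDGET_S = 3.0


def route(complexity: Complexity, latency_budget_s: float, must_self_host: bool) -> str:
    """Return a hypothetical tier label for one request."""
    if must_self_host:
        # Compliance overrides everything else.
        return "self_hosted_open"
    if complexity is Complexity.LOOKUP:
        # Simple lookups and extraction go to the cheapest tier.
        return "small_fast"
    if latency_budget_s <= INTERACTIVE_BUDGET_S:
        # Harder tasks on an interactive budget go to a mid-tier model.
        return "balanced"
    # Harder tasks with a relaxed budget can afford the strongest model.
    return "frontier"


print(route(Complexity.LOOKUP, 30.0, must_self_host=False))     # small_fast
print(route(Complexity.REASONING, 2.0, must_self_host=False))   # balanced
print(route(Complexity.AGENT, 30.0, must_self_host=False))      # frontier
print(route(Complexity.AGENT, 30.0, must_self_host=True))       # self_hosted_open
```

Keeping the tree in one function makes it easy to unit test and to update when an eval run changes which model backs each tier.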

Build a model selection eval

Don't pick based on vibes. Build a 100-example eval on your specific task. Run every candidate model. Score with your LLM judge. Normalise by cost per request. The table of (model, quality score, cost) is the only honest basis for a model selection decision.
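
A minimal sketch of building that table, assuming you already have per-example judge scores in the range 0 to 1 and a measured average cost per request for each model. The names `EvalRow`, `build_table`, and `cheapest_above` are hypothetical, the numbers are placeholder data rather than real results, and "quality per dollar" is only one way to normalise by cost.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalRow:
    model: str
    quality: float           # mean judge score over the eval set, 0..1
    cost_per_request: float  # average USD per request, measured on the same set

    @property
    def quality_per_dollar(self) -> float:
        return self.quality / self.cost_per_request


def build_table(judge_scores: dict[str, list[float]],
                cost_per_request: dict[str, float]) -> list[EvalRow]:
    """One row per model, sorted by quality per dollar (best first)."""
    rows = [
        EvalRow(model, mean(scores), cost_per_request[model])
        for model, scores in judge_scores.items()
    ]
    return sorted(rows, key=lambda r: r.quality_per_dollar, reverse=True)


def cheapest_above(rows: list[EvalRow], min_quality: float) -> EvalRow | None:
    """Cheapest model that still clears the quality bar, or None if none does."""
    good_enough = [r for r in rows if r.quality >= min_quality]
    return min(good_enough, key=lambda r: r.cost_per_request, default=None)


# Placeholder data for two hypothetical models, not real measurements.
judge_scores = {"model-a": [1.0, 0.5], "model-b": [0.5, 0.5]}
cost_per_request = {"model-a": 0.03, "model-b": 0.005}

table = build_table(judge_scores, cost_per_request)
for row in table:
    print(f"{row.model}: quality={row.quality:.2f}, cost=${row.cost_per_request:.3f}")

pick = cheapest_above(table, min_quality=0.7)
print(pick.model if pick else "no model clears the bar")
```

Quality per dollar alone tends to favour the cheapest model, so in practice it helps to set a minimum quality bar first and then choose the cheapest model above it, as `cheapest_above` does.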

Rerun this eval every quarter. The landscape shifts. A model that was the clear winner six months ago may have been overtaken, or may have degraded if the provider updated the serving infrastructure in ways that affect your use case (this happens more often than providers admit).
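
One way to make the quarterly rerun actionable is to diff the new scores against the previous run and flag any model whose quality dropped by more than a tolerance. This is a sketch under assumptions: `flag_regressions` is a hypothetical helper, the scores are placeholder values, and the 0.02 tolerance is arbitrary. Set it from the run-to-run noise you actually observe in your judge.

```python
def flag_regressions(previous: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return {model: score change} for models that dropped by more than `tolerance`.

    Only models present in both runs are compared; new models have no baseline.
    """
    shared = previous.keys() & current.keys()
    return {
        model: current[model] - previous[model]
        for model in shared
        if previous[model] - current[model] > tolerance
    }


# Placeholder scores from two hypothetical eval runs.
last_quarter = {"model-a": 0.80, "model-b": 0.70}
this_quarter = {"model-a": 0.70, "model-b": 0.71, "model-c": 0.90}

for model, delta in flag_regressions(last_quarter, this_quarter).items():
    print(f"{model} regressed by {abs(delta):.2f}")
```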
