Model Routing: How to Send the Right Query to the Right Model
Complexity-based routing, cost-based routing, and capability routing. How model routers cut inference cost by 40-70% without hurting quality.
Not every query needs GPT-4. A question like "what's the capital of France?" doesn't need the same model as a request to "write a complex multi-step agentic workflow in Python". Model routing sends each query to the cheapest model that can handle it, cutting inference cost by 40–70% without meaningful quality loss.
Why routing works
Large frontier models (GPT-4o, Claude Opus, Gemini Ultra) are 10–50× more expensive per token than smaller models (GPT-4o-mini, Claude Haiku, Gemini Flash). Most queries in production are simple. Routing simple queries to cheap models and hard queries to expensive ones exploits this distribution.
In a typical production chatbot, 60–80% of queries are simple enough for a cheap model. Only 10–20% truly require frontier-model reasoning. The remaining 10–20% are borderline and benefit from being routed conservatively to the stronger model.
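To see how the query distribution translates into savings, here is a minimal back-of-the-envelope sketch. It assumes every query uses roughly the same number of tokens and ignores the cost of the router itself; the function names and the 20× price ratio are illustrative, not taken from any provider's price list.

```python
def blended_cost(cheap_share: float, price_ratio: float) -> float:
    """Average cost per query under routing, in units of the cheap model's cost.

    cheap_share: fraction of queries sent to the cheap model (0 to 1).
    price_ratio: how many times more the frontier model costs per query
                 (assumed value, e.g. somewhere in the 10-50x range).
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    # Cheap queries cost 1 unit each; the rest cost price_ratio units each.
    return cheap_share * 1.0 + (1.0 - cheap_share) * price_ratio


def savings_vs_frontier_only(cheap_share: float, price_ratio: float) -> float:
    """Fractional saving relative to sending every query to the frontier model."""
    return 1.0 - blended_cost(cheap_share, price_ratio) / price_ratio
```

Under these assumptions, a 20× price ratio with 60% or 70% of traffic on the cheap model gives savings of about 57% and 67% respectively. Escalations, retries, and router overhead all eat into that figure in practice.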
Routing strategies
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Complexity classifier | Train a small model to score query complexity, route above/below threshold | Fast, cheap to run | Requires labelled training data |
| Length-based | Short queries → small model, long queries → large model | Trivial to implement | Very coarse: length ≠ complexity |
| LLM-as-router | Use a cheap model to decide which model to use | Works out of the box | Adds the latency of one extra LLM call |
| Cascade routing | Always try small model first; escalate if confidence is low | Strong cost-quality trade-off | Adds latency and a second call on escalations |
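The first two rows of the table share one shape: compute a complexity score, then compare it to a threshold. The sketch below shows that shape with a pluggable scorer. The heuristic scorer (word count plus a handful of keywords) is a stand-in invented for illustration; a trained classifier would replace it, and the keyword list, the 100-word scale, and the 0.5 threshold are all assumptions.

```python
import re
from typing import Callable

# Model names reused from the cascade example in this article.
SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"

# Illustrative hints only; a real router would learn signals from labelled data.
HARD_HINTS = {"step-by-step", "multi-step", "prove", "refactor", "workflow", "debug"}


def heuristic_complexity(query: str) -> float:
    """Return a rough complexity score in [0, 1] from length and keyword hints."""
    tokens = re.findall(r"[a-z0-9-]+", query.lower())
    length_score = min(len(tokens) / 100.0, 1.0)  # 100+ words counts as maximally long
    hint_score = 1.0 if HARD_HINTS.intersection(tokens) else 0.0
    return max(length_score, hint_score)


def route(
    query: str,
    scorer: Callable[[str], float] = heuristic_complexity,
    threshold: float = 0.5,
) -> str:
    """Pick a model name: large model at or above the threshold, small model below."""
    return LARGE_MODEL if scorer(query) >= threshold else SMALL_MODEL
```

Passing a different `scorer` (for example, a wrapper around a trained classifier's predicted probability) changes the strategy without touching the routing logic, and the threshold becomes the single knob for trading cost against quality.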
Cascade routing in practice
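def routed_completion(query: str) -> str:
    """Cascade: answer with the small model, escalate only when confidence is low.

    call_model and score_confidence are placeholders for your own client
    wrapper and confidence scorer.
    """
    # Step 1: try the small model
    small_resp = call_model("gpt-4o-mini", query)
    # Step 2: score confidence (e.g., with a self-eval prompt)
    confidence = score_confidence(small_resp, query)
    if confidence > 0.85:
        return small_resp  # good enough, save money
    # Step 3: escalate to the large model
    return call_model("gpt-4o", query)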
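The confidence step is where most of the design work hides. One possible way to implement the self-eval prompt mentioned above is sketched here. This is an assumption-laden illustration, not a reference implementation: the prompt wording, the 0 to 100 scale, and the choice to pass the model client in as a `call_model(model_name, prompt)` callable (so the sketch runs without any SDK) are all hypothetical.

```python
import re
from typing import Callable

# Hypothetical client signature: (model_name, prompt) -> response text.
ModelFn = Callable[[str, str], str]

SELF_EVAL_PROMPT = (
    "Question:\n{query}\n\nProposed answer:\n{answer}\n\n"
    "On a scale from 0 to 100, how likely is the proposed answer to be "
    "correct and complete? Reply with a single integer only."
)


def score_confidence(
    answer: str, query: str, call_model: ModelFn, judge_model: str = "gpt-4o-mini"
) -> float:
    """Ask a judge model to grade the answer; return a confidence in [0, 1]."""
    reply = call_model(judge_model, SELF_EVAL_PROMPT.format(query=query, answer=answer))
    match = re.search(r"\d+", reply)
    if match is None:
        return 0.0  # unparseable grade: treat as low confidence so the caller escalates
    return min(int(match.group()), 100) / 100.0


def cascade(query: str, call_model: ModelFn, threshold: float = 0.85) -> str:
    """Same cascade as above, with the model client injected for testability."""
    small_resp = call_model("gpt-4o-mini", query)
    if score_confidence(small_resp, query, call_model) > threshold:
        return small_resp
    return call_model("gpt-4o", query)
```

Note that the self-eval call is itself an extra LLM request, so the cheap path costs two small-model calls rather than one. Self-reported confidence can also be poorly calibrated, so the threshold is worth tuning against labelled examples rather than taken on faith.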
RouteLLM and open-source routers
RouteLLM (open-sourced by LMSYS) provides pre-trained routing classifiers that you can drop into any LLM application. Its authors report 40–70% cost reduction with less than 5% quality degradation on standard benchmarks.