AI Engineering 8 min read

Model Routing: How to Send the Right Query to the Right Model

Complexity-based routing, cost-based routing, and capability routing. How model routers cut inference cost by 40-70% without hurting quality.

Not every query needs GPT-4. A question like "what's the capital of France?" doesn't need the same model as "write a complex multi-step agentic workflow in Python". Model routing sends each query to the cheapest model that can handle it, and can cut inference costs by 40–70% without meaningful quality loss.

Why routing works

Large frontier models (GPT-4o, Claude Opus, Gemini Ultra) are 10–50× more expensive per token than smaller models (GPT-4o-mini, Claude Haiku, Gemini Flash). Most queries in production are simple. Routing simple queries to cheap models and hard queries to expensive ones exploits this distribution.

In a typical production chatbot, 60–80% of queries are simple enough for a cheap model. Only 10–20% truly require frontier-model reasoning. The remaining 10–20% are borderline and benefit from being routed conservatively to the stronger model.
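To make the arithmetic concrete, here is a rough sketch of the blended cost of a routed workload relative to sending everything to the large model. It assumes every query uses about the same number of tokens on either model and that routing itself is free; the function name `blended_cost_ratio` and the 20× price ratio are illustrative choices, not figures from a specific provider.

```python
def blended_cost_ratio(cheap_share: float, price_ratio: float) -> float:
    """Cost of a routed workload relative to an all-large-model baseline.

    cheap_share: fraction of queries sent to the small model (0..1).
    price_ratio: how many times more the large model costs per query.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    large_share = 1.0 - cheap_share
    # Normalise the large model's per-query cost to 1.0, so the small
    # model costs 1 / price_ratio per query.
    return large_share * 1.0 + cheap_share * (1.0 / price_ratio)


if __name__ == "__main__":
    # Assumed 20x price gap, inside the 10-50x range quoted above.
    for cheap_share in (0.6, 0.7, 0.8):
        ratio = blended_cost_ratio(cheap_share, price_ratio=20.0)
        print(f"{cheap_share:.0%} routed to small model -> cost ratio {ratio:.3f}")
```

Under these assumptions the saving is dominated by the share of traffic that still reaches the large model, which is why misrouted simple queries are expensive and why the price ratio matters less once it exceeds roughly 10×.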

Routing strategies

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Complexity classifier | Train a small model to score query complexity; route above/below a threshold | Fast, cheap to run | Requires labelled training data |
| Length-based | Short queries → small model, long queries → large model | Almost no implementation effort | Very coarse: length ≠ complexity |
| LLM-as-router | Use a cheap model to decide which model to use | Works out of the box | Adds the latency of one extra LLM call |
| Cascade routing | Always try the small model first; escalate if confidence is low | Strong cost-quality trade-off | Adds latency on escalations |
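The first three strategies can be sketched as small functions that each return a model name. This is a minimal illustration, not a reference implementation: the word-count cutoff, the 0.5 threshold, the prompt wording, and the `score_fn` / `ask_cheap_model` callables are all assumptions standing in for a trained classifier and a real API client.

```python
from typing import Callable

SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"


def route_by_length(query: str, max_small_words: int = 30) -> str:
    """Length-based routing: short queries go to the small model."""
    return SMALL_MODEL if len(query.split()) <= max_small_words else LARGE_MODEL


def route_by_score(
    query: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Complexity-classifier routing.

    score_fn is a placeholder for a trained classifier that returns a
    complexity score in [0, 1]; higher means harder.
    """
    return LARGE_MODEL if score_fn(query) >= threshold else SMALL_MODEL


# Hypothetical prompt; real wording would need tuning and evaluation.
ROUTER_PROMPT = (
    "Classify the user query as SIMPLE or COMPLEX. "
    "Answer with one word.\n\nQuery: {query}"
)


def route_by_llm(query: str, ask_cheap_model: Callable[[str], str]) -> str:
    """LLM-as-router: a cheap model labels the query.

    Anything other than a clean "SIMPLE" label escalates, so malformed
    router output fails towards the stronger model.
    """
    label = ask_cheap_model(ROUTER_PROMPT.format(query=query)).strip().upper()
    return SMALL_MODEL if label == "SIMPLE" else LARGE_MODEL


if __name__ == "__main__":
    print(route_by_length("what's the capital of France?"))
    print(route_by_score("refactor this service", lambda q: 0.9))
    print(route_by_llm("hello", lambda prompt: "SIMPLE"))
```

Defaulting to the large model on ambiguous router output mirrors the conservative handling of borderline queries described earlier: a wrong escalation costs money, while a wrong downgrade costs quality.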

Cascade routing in practice

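def routed_completion(query: str) -> str:
    """Cascade: answer with the small model, escalate when confidence is low.

    call_model and score_confidence are placeholders for your own API
    client and confidence scorer; they are not defined here.
    """
    # Step 1: try small model
    small_resp = call_model("gpt-4o-mini", query)

    # Step 2: score confidence in [0, 1] (e.g., with a self-eval prompt)
    confidence = score_confidence(small_resp, query)

    if confidence > 0.85:
        return small_resp  # good enough, save money

    # Step 3: escalate to large model
    return call_model("gpt-4o", query)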
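The snippet above depends on `call_model` and `score_confidence`, which are left undefined. A self-contained variant, under the assumption that both are passed in as plain callables, might look like the sketch below. The `fake_*` functions are toy stand-ins so the example runs without an API key, and `cascade_cost_ratio` is a hypothetical helper that ignores the cost of the confidence check itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    model: str       # model that produced the final answer
    escalated: bool  # True if the large model was also called


def cascade(
    query: str,
    call_model: Callable[[str, str], str],
    score_confidence: Callable[[str, str], float],
    threshold: float = 0.85,
    small: str = "gpt-4o-mini",
    large: str = "gpt-4o",
) -> CascadeResult:
    """Try the small model first; escalate when confidence is too low."""
    small_resp = call_model(small, query)
    if score_confidence(small_resp, query) > threshold:
        return CascadeResult(small_resp, small, escalated=False)
    return CascadeResult(call_model(large, query), large, escalated=True)


def cascade_cost_ratio(escalation_rate: float, price_ratio: float) -> float:
    """Cascade cost relative to always calling the large model.

    Every query pays for the small model; escalated queries pay for the
    large model on top of that.
    """
    return 1.0 / price_ratio + escalation_rate


# Toy stand-ins for a real API client and a real confidence scorer.
def fake_call_model(model: str, query: str) -> str:
    return f"[{model}] answer to: {query}"


def fake_confidence(response: str, query: str) -> float:
    return 0.95 if len(query.split()) < 10 else 0.40


if __name__ == "__main__":
    result = cascade("capital of France?", fake_call_model, fake_confidence)
    print(result.model, result.escalated)
```

Because escalated queries are billed twice, a cascade only pays off when the escalation rate is low enough; with an escalation rate near 1.0 it costs more than calling the large model directly.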

RouteLLM and open-source routers

RouteLLM (open-sourced by LMSYS) provides pre-trained routing classifiers that you can drop into an existing LLM application. The project reports 40–70% cost reduction with less than 5% quality degradation on standard benchmarks.
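A classifier-based router still needs a threshold that decides how much traffic reaches the strong model. The sketch below shows one generic way to pick it from a sample of scores; it is not RouteLLM's API, and `calibrate_threshold` and the sample scores are made up for illustration. It assumes higher scores mean the query is more likely to need the strong model.

```python
def calibrate_threshold(scores: list[float], strong_share: float) -> float:
    """Pick a threshold so roughly strong_share of queries score >= it.

    scores: router scores on a representative sample of past queries.
    strong_share: target fraction of traffic for the strong model (0..1].
    Ties at the threshold can push the actual share slightly higher.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < strong_share <= 1.0:
        raise ValueError("strong_share must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of sample queries that should reach the strong model.
    k = max(1, round(strong_share * len(ranked)))
    return ranked[k - 1]


if __name__ == "__main__":
    sample = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
    threshold = calibrate_threshold(sample, strong_share=0.2)
    routed_to_strong = sum(score >= threshold for score in sample)
    print(threshold, routed_to_strong)
```

Calibrating on your own traffic matters because the score distribution of a pre-trained router on benchmark data may differ from the distribution it sees in production.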
