GenAI Systems Lab
AI Engineering 9 min read

Cost vs. Latency Tradeoffs in LLM Systems: How to Budget Both

TTFT, tokens-per-second, and end-to-end latency explained. How to set SLAs, model latency against user tolerance, and build a cost/latency budget.

Every production AI decision is a tradeoff between what the system costs to run and how fast it responds. Getting this wrong in either direction is expensive. Overspend on a frontier model for a simple classification task and you burn 10× what you need to. Under-provision a user-facing chat interface and the latency drives users away.

The cost structure of an LLM call

For API-based models, cost is driven by token counts. Input tokens (your prompt) and output tokens (the model's response) are priced separately, with output tokens typically costing 3–5× more than input tokens. At GPT-4o list prices of roughly $2.50 per million input tokens and $10 per million output tokens, a 2,000-token RAG prompt with a 500-token response costs about $0.01: $0.005 for the input plus $0.005 for the output. At 100K requests/day, that's $1,000/day, or about $30K/month.

| Cost driver | Typical range | How to reduce |
| --- | --- | --- |
| Input token count | High for RAG (500–5,000 tokens) | Smaller chunks, better retrieval precision (fewer chunks needed) |
| Output token count | Moderate (100–1,000 tokens) | Set max_tokens, use concise output instructions |
| Model tier | 10–100× difference between tiers | Route simple queries to smaller models |
| Request volume | Linear with usage | Cache responses for identical or near-identical queries |
| System prompt | Repeated on every request | Use prompt caching (80–90% savings on cached prefix) |
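
The per-request arithmetic can be sketched as a small helper. This is a minimal sketch, not a pricing tool: the per-million-token prices below are assumptions standing in for a provider's current price sheet, and the 30-day month is a simplification.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the dollar cost of one call, given prices per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Assumed list prices in dollars per million tokens; substitute current values.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

per_request = request_cost(2_000, 500, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M)
per_day = per_request * 100_000   # 100K requests/day
per_month = per_day * 30          # simplified 30-day month

print(f"per request: ${per_request:.4f}")
print(f"per day:     ${per_day:,.0f}")
print(f"per month:   ${per_month:,.0f}")
```

Keeping input and output prices as separate parameters makes it easy to see which side of the call dominates: with these assumed prices, 500 output tokens cost as much as 2,000 input tokens.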

The latency structure

LLM latency has two headline measurements: Time to First Token (TTFT), how long until the first output token arrives, and Time to Last Token (TTLT), the total time until generation finishes. TTLT is roughly TTFT plus the output length divided by the generation speed in tokens per second. For user-facing applications, TTFT determines perceived responsiveness. Streaming hides much of TTLT by showing tokens as they are generated.

| Latency driver | Typical range | How to reduce |
| --- | --- | --- |
| Model size | Smaller = faster | Route to smaller models where quality permits |
| Input length | Longer = slower TTFT | Reduce prompt length, use caching |
| Output length | Longer = slower TTLT | Limit max_tokens, stream to user |
| Provider load | Variable | Batch less-urgent requests during off-peak |
| Cold start | First request in session | Keep-alive connections, pre-warm |
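
To turn these drivers into a budget check, one rough approach is to model TTLT as TTFT plus decode time at a constant tokens-per-second rate. The sketch below assumes that simplification; the function names, the sample measurements, and the budget values are all illustrative, and real decode speed varies with load and context length.

```python
def estimate_latency(ttft_s, output_tokens, tokens_per_second):
    """Return (ttft, ttlt) in seconds, assuming a constant decode rate."""
    decode_s = output_tokens / tokens_per_second
    return ttft_s, ttft_s + decode_s


def within_budget(ttft_s, ttlt_s, ttft_budget_s, ttlt_budget_s):
    """True if both the first-token and last-token targets are met."""
    return ttft_s <= ttft_budget_s and ttlt_s <= ttlt_budget_s


# Illustrative measurements: 0.4 s to first token, 500 output tokens at 50 tokens/s.
ttft, ttlt = estimate_latency(ttft_s=0.4, output_tokens=500, tokens_per_second=50)

# Hypothetical SLA for a streaming chat UI: first token within 1 s, done within 15 s.
print(ttft, ttlt, within_budget(ttft, ttlt, ttft_budget_s=1.0, ttlt_budget_s=15.0))
```

Checking TTFT and TTLT separately matters because a streaming interface can tolerate a long TTLT as long as TTFT stays low, while a batch job only cares about TTLT.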

The model routing strategy

Not all requests need the same model. A well-designed system classifies incoming requests by complexity and routes each one to the cheapest model that can handle it. Simple factual lookups → small, fast model. Complex reasoning → frontier model. Borderline → try the small model first and escalate on low confidence.

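def route_request(query, context_length):
    """Route a request to the cheapest model likely to handle it.

    Assumes is_classification_task, call_model, and get_cached_response
    (hypothetical; returns None on a miss) are application-specific helpers
    defined elsewhere.
    """
    # A cached answer needs no model call, so return it before routing.
    cached = get_cached_response(query)
    if cached is not None:
        return cached

    # Each signal is a bool; True counts as 1 when summed.
    simple_signals = [
        len(query.split()) < 15,           # Short query
        context_length < 500,              # Minimal context (tokens)
        is_classification_task(query),     # Simple classification
    ]

    if sum(simple_signals) >= 2:
        return call_model("gpt-4o-mini", query)   # ~10× cheaper
    return call_model("gpt-4o", query)            # Full capability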
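
The borderline case, trying the small model and escalating on low confidence, can be sketched as a two-step cascade. This is a minimal sketch under stated assumptions: each model is a callable returning an `(answer, confidence)` pair, and how confidence is obtained (token log-probabilities, a self-rating, a separate verifier) is left open. The stub models and the 0.7 threshold are hypothetical.

```python
from typing import Callable, Tuple

# Assumption: a model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, small: Model, large: Model,
            min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try the small model first; escalate when its confidence is too low.

    Returns (answer, tier_used) so callers can track the escalation rate.
    """
    answer, confidence = small(query)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(query)
    return answer, "large"


# Stub models for demonstration only; replace with real model calls.
def stub_small(query: str) -> Tuple[str, float]:
    confidence = 0.9 if len(query.split()) < 15 else 0.4
    return f"small:{query}", confidence


def stub_large(query: str) -> Tuple[str, float]:
    return f"large:{query}", 0.95


print(cascade("What is TTFT?", stub_small, stub_large))
```

An escalated request pays for both calls and waits for both, so the cascade only saves money when most borderline requests are resolved by the small model.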

Caching strategies

Exact response caching

Cache the full response for identical inputs. Works well for high-repetition use cases (FAQ bots, standard report templates). Use a hash of the full input (the prompt plus the model name and generation parameters) as the cache key. Set the TTL according to how often your knowledge base changes.
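
A minimal in-memory sketch of this idea follows. The `ExactCache` class and its key layout are illustrative assumptions, not a specific library's API; a production system would more likely back this with a shared store such as Redis.

```python
import hashlib
import json
import time


class ExactCache:
    """In-memory exact-match response cache with a per-entry TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (expires_at, response)

    @staticmethod
    def key(model, prompt, params=None):
        # Hash everything that affects the output, serialized deterministically.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params or {}},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]   # drop the expired entry
            return None
        return response

    def set(self, key, response):
        self._store[key] = (self.clock() + self.ttl, response)


cache = ExactCache(ttl_seconds=3600)
k = ExactCache.key("gpt-4o-mini", "What is your refund policy?")
if cache.get(k) is None:
    cache.set(k, "Refunds are available within 30 days.")  # stand-in for a model call
print(cache.get(k))
```

Including the model name and parameters in the key avoids serving a response generated under different settings for the same prompt.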

Semantic caching

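Embed incoming queries and check for near-duplicate cached queries (for example, cosine similarity > 0.95), returning the cached response on a match. This catches questions that are semantically similar to previously answered ones even when the wording differs. It can reduce LLM calls by 20–40% on high-volume consumer applications, at the risk of returning a mismatched answer if the threshold is too loose. Tools: GPTCache, or a similarity lookup built on a vector store.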
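
The lookup logic can be sketched in plain Python. This sketch assumes embeddings are produced elsewhere by whatever embedding model the application uses; the `SemanticCache` class is hypothetical, the toy three-dimensional vectors are for illustration only, and the linear scan would be replaced by a vector index at any real volume.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached response when a stored query embedding is close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []   # list of (embedding, response)

    def lookup(self, embedding):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def store(self, embedding, response):
        self._entries.append((embedding, response))


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0, 0.0], "Refunds are available within 30 days.")
print(cache.lookup([0.99, 0.05, 0.0]))   # near-duplicate query: cache hit
print(cache.lookup([0.0, 1.0, 0.0]))     # unrelated query: miss
```

The threshold is the main tuning knob: raising it lowers the hit rate but reduces the chance of answering a different question than the one asked.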

Prompt caching

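Reuse the provider-side KV cache for a repeated prompt prefix (system prompt plus static RAG context) instead of reprocessing it on every call. Anthropic's prompt caching bills cache reads at 10% of the base input price, a 90% saving on cached input tokens, though cache writes carry a surcharge. A 4,000-token system prompt called 1M times/day adds up to 120B prefix tokens a month. At $3 per million input tokens (Claude Sonnet list pricing, used here as an assumption), that is about $360,000/month uncached versus about $36,000/month at the cached-read rate, a saving of roughly $324,000/month before cache-write surcharges.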
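
The savings estimate is simple enough to compute directly. The helper below is a rough sketch: the $3 per million input-token price is an assumption, the result scales linearly with it, and the function ignores cache-write surcharges and treats the hit rate as a single number.

```python
def monthly_prefix_savings(prefix_tokens, calls_per_day, input_price_per_m,
                           cached_discount=0.90, hit_rate=1.0, days=30):
    """Estimate monthly dollars saved by caching a repeated prompt prefix.

    cached_discount is the fraction saved on each cached input token;
    hit_rate is the fraction of calls that actually read from the cache.
    """
    monthly_tokens = prefix_tokens * calls_per_day * days
    uncached_cost = monthly_tokens * input_price_per_m / 1_000_000
    return uncached_cost * cached_discount * hit_rate


# 4,000-token system prompt, 1M calls/day, assumed $3 per million input tokens.
savings = monthly_prefix_savings(4_000, 1_000_000, input_price_per_m=3.00)
print(f"${savings:,.0f} per month")
```

Lowering `hit_rate` models traffic where the cache expires between calls, which is common for low-volume endpoints with short cache lifetimes.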

Quantisation and self-hosted tradeoffs

For very high volume, self-hosting open-source models becomes cost-competitive. A 70B-parameter Llama model quantised to INT4 runs on 2× A100 GPUs, about $5/hour on most cloud providers. At a sustained 50 requests/minute (3,000 requests/hour), that's roughly $0.0017 per request in GPU cost, versus roughly $0.01 for a 2,000-token-input, 500-token-output request on GPT-4o at list prices of about $2.50 and $10 per million tokens. At scale, a gap of around 6× is significant, provided the GPUs stay busy.

Self-hosting looks cheaper per token but adds engineering overhead: serving infrastructure, scaling, model updates, compliance. Below $50K/month in API spend, self-hosting usually doesn't pencil out when you factor in engineering time.
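
A quick way to sanity-check the hardware side of this comparison is to compute the GPU cost per request and the request rate at which it matches the API price. This sketch covers GPU rental only; the engineering overhead described above is deliberately left out, and the $0.01 API cost per request is an assumed value to replace with your own.

```python
def self_hosted_cost_per_request(gpu_cost_per_hour, requests_per_minute):
    """GPU cost per request at a sustained request rate (hardware only)."""
    return gpu_cost_per_hour / (requests_per_minute * 60)


def break_even_requests_per_minute(gpu_cost_per_hour, api_cost_per_request):
    """Sustained rate above which the GPUs cost less per request than the API."""
    return gpu_cost_per_hour / (api_cost_per_request * 60)


GPU_COST_PER_HOUR = 5.0       # 2× A100, figure from the text
API_COST_PER_REQUEST = 0.01   # assumed API cost per request; substitute your own

hosted = self_hosted_cost_per_request(GPU_COST_PER_HOUR, requests_per_minute=50)
break_even = break_even_requests_per_minute(GPU_COST_PER_HOUR, API_COST_PER_REQUEST)

print(f"self-hosted: ${hosted:.4f} per request")
print(f"break-even:  {break_even:.1f} requests/minute")
```

The break-even rate assumes traffic is sustained around the clock. GPUs billed by the hour cost the same when idle, so bursty traffic pushes the real break-even point higher.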

Setting budgets and alerts

Model cost calculator: estimate monthly costs across model tiers and request volumes in the Systems module.
