Cost vs. Latency Tradeoffs in LLM Systems: How to Budget Both

TTFT, tokens-per-second, and end-to-end latency explained. How to set SLAs, model latency against user tolerance, and build a cost/latency budget.

Most production AI decisions involve a tradeoff between what the system costs to run and how fast it responds. Getting this wrong in either direction is expensive. Overspend on a frontier model for a simple classification task and you burn 10× what you need to; under-invest in latency on a user-facing chat interface and you lose users.

The cost structure of an LLM call

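For API-based models, cost is driven by token counts. Input tokens (your prompt) and output tokens (the model's response) are priced separately, with output tokens typically costing 3–5× more than input tokens. Take a 2,000-token RAG prompt with a 500-token response: at GPT-4o list prices of roughly $2.50 per million input tokens and $10 per million output tokens (rates change, so check the current price sheet), that is about $0.005 for the input plus $0.005 for the output, or roughly $0.01 per request. At 100K requests/day, that's about $1,000/day, or $30K/month.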
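The per-request arithmetic above can be sketched in a few lines. The function name and the per-million-token prices below are illustrative assumptions, not quoted rates; substitute your provider's current price sheet.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, with prices given per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed example rates: $2.50 per 1M input tokens, $10 per 1M output tokens.
per_request = request_cost(2_000, 500, 2.50, 10.00)
per_day = per_request * 100_000      # 100K requests/day
per_month = per_day * 30

print(f"${per_request:.4f}/request, ${per_day:,.0f}/day, ${per_month:,.0f}/month")
```

Keeping the prices as parameters makes it easy to rerun the same estimate for each model tier you are considering.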

Cost driver	Typical range	How to reduce
Input token count	High for RAG (500–5000 tokens)	Smaller chunks, better retrieval precision (fewer chunks needed)
Output token count	Moderate (100–1000 tokens)	Set max_tokens, use concise output instructions
Model tier	10–100× difference between tiers	Route simple queries to smaller models
Request volume	Linear with usage	Cache responses for identical or near-identical queries
System prompt	Repeated on every request	Use prompt caching (roughly 50–90% savings on the cached prefix, depending on provider)

The latency structure

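LLM latency is usually described with two measurements: Time to First Token (TTFT), how long until the first output token arrives, and Time to Last Token (TTLT), the end-to-end time until generation finishes. TTLT is roughly TTFT plus the number of output tokens divided by the decode rate in tokens per second. For user-facing applications, TTFT determines perceived responsiveness. Streaming does not shorten TTLT, but it hides most of it by showing tokens as they are generated.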
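A rough way to reason about these two numbers is to treat total time as TTFT plus decode time. This is a simplified model that assumes a constant decode rate; the function name and the example figures are illustrative, not measurements.

```python
def estimate_ttlt(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to last token: wait for the first token, then decode the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    remaining = max(output_tokens - 1, 0)   # the first token arrives at TTFT
    return ttft_s + remaining / tokens_per_second


# Assumed figures: 0.5 s TTFT, 50 tokens/s decode rate.
for n in (100, 500, 1000):
    print(n, "tokens ->", round(estimate_ttlt(0.5, n, 50.0), 2), "s")
```

With these assumed figures, output length dominates total time well before TTFT does, which is why capping output and streaming matter for long responses.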

Latency driver	Typical effect	How to reduce
Model size	Smaller = faster	Route to smaller models where quality permits
Input length	Longer = slower TTFT	Reduce prompt length, use caching
Output length	Longer = slower TTLT	Limit max_tokens, stream to user
Provider load	Variable	Batch less-urgent requests during off-peak
Cold start	First request in session	Keep-alive connections, pre-warm
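Before tuning any of these drivers it helps to measure them. The sketch below times any iterable of streamed tokens; `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only simulates a provider's streaming response with sleeps.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Consume a token stream; return (ttft_s, ttlt_s, token_count).

    Timing starts when this function is called, so call it immediately
    after issuing the request.
    """
    start = clock()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = clock() - start   # first token observed
        count += 1
    ttlt = clock() - start
    return (ttft if ttft is not None else ttlt), ttlt, count


def fake_stream():
    """Stand-in for a provider's streaming response (hypothetical)."""
    time.sleep(0.05)          # simulated prefill delay
    for word in ["hello", "world", "!"]:
        yield word
        time.sleep(0.01)      # simulated per-token decode time


ttft, ttlt, n = measure_stream(fake_stream())
print(f"TTFT={ttft:.3f}s TTLT={ttlt:.3f}s tokens={n}")
```

Recording both numbers per request, rather than only the total, shows whether a slow response came from a long prompt (TTFT) or a long answer (decode time).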

The model routing strategy

Not all requests need the same model. A well-designed system classifies incoming requests by complexity and routes each one to the cheapest model that can handle it. Simple factual lookups → small, fast model. Complex reasoning → frontier model. Borderline → try the small model first and escalate when its confidence is low.

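def route_request(query, context_length):
    # has_cached_response, get_cached_response, is_classification_task and
    # call_model are application-specific helpers that are not shown here
    # (get_cached_response is a hypothetical companion to has_cached_response).

    # A cached answer needs no model call at all, so check for it first
    if has_cached_response(query):
        return get_cached_response(query)

    # Signals that the request is simple enough for the cheaper model
    simple_signals = [
        len(query.split()) < 15,           # Short query
        context_length < 500,              # Minimal context (tokens)
        is_classification_task(query),     # Simple classification
    ]

    if sum(simple_signals) >= 2:
        return call_model("gpt-4o-mini", query)   # >10× cheaper per token
    return call_model("gpt-4o", query)            # Full capability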
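The borderline case described above (try the small model, escalate on low confidence) might look like the following sketch. How a confidence score is obtained is an assumption here: the model callables are hypothetical stand-ins that return an (answer, confidence) pair, and the 0.7 threshold is arbitrary.

```python
from typing import Callable, Tuple

ModelFn = Callable[[str], Tuple[str, float]]   # returns (answer, confidence in [0, 1])


def answer_with_escalation(query: str, small: ModelFn, large: ModelFn,
                           min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; fall back to the large one if it is unsure.

    Returns (answer, tier_used).
    """
    answer, confidence = small(query)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(query)
    return answer, "large"


# Hypothetical stand-ins for real model calls.
def small_model(q: str) -> Tuple[str, float]:
    return ("4", 0.95) if q == "2 + 2?" else ("not sure", 0.2)


def large_model(q: str) -> Tuple[str, float]:
    return ("detailed answer", 0.9)


print(answer_with_escalation("2 + 2?", small_model, large_model))
print(answer_with_escalation("Compare two contract clauses", small_model, large_model))
```

Note that an escalated request pays for both calls and waits for both, so this pattern only helps when most traffic stays on the small tier.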

Caching strategies

Exact response caching

Cache the full response for identical inputs. Works well for high-repetition use cases (FAQ bots, standard report templates). Use a hash of the input as cache key. TTL depends on how often your knowledge base changes.
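A minimal in-memory sketch of this idea follows. The class name, the choice to include the model name in the key, and the example strings are assumptions for illustration; a production system would more likely use a shared store such as Redis.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ExactCache:
    """In-memory response cache keyed by a hash of model + prompt, with a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Include the model name so different models never share an entry.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        k = self.key(model, prompt)
        entry = self._store.get(k)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[k]          # expired
            return None
        return response

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self.key(model, prompt)] = (self.clock(), response)


cache = ExactCache(ttl_seconds=3600)
cache.put("small-model", "What are your opening hours?", "9am to 5pm, Mon-Fri.")
print(cache.get("small-model", "What are your opening hours?"))
print(cache.get("small-model", "what are your opening hours?"))  # different string, so a miss
```

The second lookup misses because exact caching is sensitive to every character; normalising whitespace and case before hashing raises the hit rate at the cost of occasionally conflating inputs.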

Semantic caching

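Embed incoming queries and check for near-duplicate cached queries (for example, cosine similarity > 0.95); on a hit, return the cached response instead of calling the model. This catches questions that are worded differently but mean the same thing as ones already answered. It can reduce LLM calls by 20–40% on high-volume consumer applications, though the hit rate depends heavily on how repetitive the traffic is. Tools: GPTCache, or a semantic caching layer built on a vector store.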
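The lookup can be sketched without any external library. Everything here is illustrative: the embedding function is passed in, the toy vectors stand in for a real embedding model, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan semantic cache; a real system would use a vector index."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self.embed(query), response))


# Toy embedding table standing in for a real embedding model (hypothetical values).
TOY_VECTORS = {
    "how do I reset my password": (1.0, 0.0, 0.1),
    "how can I reset my password": (0.99, 0.05, 0.1),
    "what is your refund policy": (0.0, 1.0, 0.2),
}

cache = SemanticCache(embed=lambda text: TOY_VECTORS[text])
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
print(cache.get("how can I reset my password"))   # near-duplicate of the stored query
print(cache.get("what is your refund policy"))    # unrelated to the stored query
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, so it is worth validating against real query pairs before trusting a value like 0.95.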

Prompt caching

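Reuse the provider-side KV cache for a repeated prompt prefix (system prompt + static RAG context). Anthropic's prompt caching bills cache reads at about 10% of the normal input price, a 90% saving on cached input tokens, with a surcharge on the request that writes the cache. For a system with a 4,000-token system prompt called 100K times/day, the prefix alone is 12B input tokens/month; at an assumed $3 per million input tokens that is about $36,000/month uncached, so caching saves roughly $32,000/month.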
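The savings estimate is easy to reproduce for your own traffic. The function below is a back-of-the-envelope sketch: the name, the default 90% discount, the hit-rate parameter, and the example inputs are all assumptions, and it ignores cache-write surcharges.

```python
def monthly_prefix_savings(prefix_tokens: int, requests_per_day: int,
                           input_price_per_m: float, cached_discount: float = 0.90,
                           hit_rate: float = 1.0, days: int = 30) -> float:
    """Approximate monthly dollars saved by caching a shared prompt prefix.

    Ignores cache-write surcharges and assumes the discount applies only
    to requests that hit the cache.
    """
    monthly_tokens = prefix_tokens * requests_per_day * days
    uncached_cost = monthly_tokens * input_price_per_m / 1_000_000
    return uncached_cost * cached_discount * hit_rate


# Assumed inputs: 4,000-token prefix, 100K requests/day, $3 per 1M input tokens.
saved = monthly_prefix_savings(4_000, 100_000, 3.00)
print(f"~${saved:,.0f}/month saved")
```

The `hit_rate` parameter matters for low-traffic systems, where the cached prefix may expire between requests and the discount applies less often.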

Quantisation and self-hosted tradeoffs

For very high volume, self-hosting open-source models becomes cost-competitive. A 70B parameter Llama model quantised to INT4 runs on 2× A100 GPUs, roughly $5/hour on many cloud providers. At a sustained 50 requests/minute (3,000 requests/hour) averaging 2,000 tokens each, that's $5 ÷ 3,000 ≈ $0.0017 per request, versus $0.005 or more for the same request on GPT-4o. At scale, a difference of 3× or more is significant, but it only holds while the GPUs stay busy.
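The per-request figure for a self-hosted node is just the hourly price divided by the requests served in that hour. The sketch below adds a utilization parameter, which is an assumption layered on top of the text's numbers, to show how idle time changes the result.

```python
def self_hosted_cost_per_request(gpu_cost_per_hour: float,
                                 requests_per_minute: float,
                                 utilization: float = 1.0) -> float:
    """Dollar cost per request for a fixed-price GPU node.

    utilization is the fraction of each hour the node actually serves
    traffic at the given rate; idle time is still billed.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    requests_per_hour = requests_per_minute * 60 * utilization
    return gpu_cost_per_hour / requests_per_hour


# Figures from the text: ~$5/hour node, 50 requests/minute.
print(f"${self_hosted_cost_per_request(5.0, 50):.4f} per request at full utilization")
print(f"${self_hosted_cost_per_request(5.0, 50, utilization=0.25):.4f} per request at 25% utilization")
```

Because the node is billed whether or not it is serving traffic, the advantage over per-token API pricing shrinks quickly when load is bursty.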

Self-hosting looks cheaper per token but adds engineering overhead: serving infrastructure, scaling, model updates, compliance. Below $50K/month in API spend, self-hosting usually doesn't pencil out when you factor in engineering time.

Setting budgets and alerts

Set per-user daily token limits to prevent runaway abuse
Alert at 50%, 80%, 100% of monthly budget — don't wait for the bill
Track cost per feature: not just overall spend, but which features drive it
Budget both per-request cost (for pricing decisions) and monthly spend (for planning)
Run weekly cost reviews for the first 3 months after a new feature launches
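The first two items in the list above can be sketched in a few lines. The names, the in-memory counters, and the example numbers are all hypothetical; a real deployment would persist usage and reset it on a schedule.

```python
from collections import defaultdict
from typing import Dict, List

ALERT_THRESHOLDS = (0.5, 0.8, 1.0)   # fractions of the monthly budget


def crossed_thresholds(previous_spend: float, current_spend: float,
                       monthly_budget: float) -> List[float]:
    """Return thresholds crossed since the last check, so each alert fires once."""
    return [t for t in ALERT_THRESHOLDS
            if previous_spend < t * monthly_budget <= current_spend]


class DailyTokenLimiter:
    """Per-user daily token cap; call reset() at the start of each day."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self._used: Dict[str, int] = defaultdict(int)

    def allow(self, user_id: str, tokens: int) -> bool:
        if self._used[user_id] + tokens > self.daily_limit:
            return False
        self._used[user_id] += tokens
        return True

    def reset(self) -> None:
        self._used.clear()


print(crossed_thresholds(4_000, 8_500, 10_000))   # spend moved from 40% to 85% of budget

limiter = DailyTokenLimiter(daily_limit=50_000)
print(limiter.allow("user-1", 30_000))
print(limiter.allow("user-1", 30_000))   # would exceed the cap
```

Comparing against the previous spend, rather than only the current total, keeps the same threshold from alerting on every check once it has been passed.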
