Cost vs. Latency Tradeoffs in LLM Systems: How to Budget Both
TTFT, tokens-per-second, and end-to-end latency explained. How to set SLAs, model latency against user tolerance, and build a cost/latency budget.
Every production AI decision is a tradeoff between what the system costs to run and how fast it responds. Getting this wrong in either direction is expensive. Overspend on a frontier model for a simple classification task and you can burn 10× what you need to. Under-budget latency on a user-facing chat interface and you lose users.
The cost structure of an LLM call
For API-based models, cost is driven by token counts. Input tokens (your prompt) and output tokens (the model's response) are priced separately, with output tokens typically costing 3–5× more than input tokens. Assuming GPT-4o list rates of $2.50 per million input tokens and $10 per million output tokens, a 2,000-token RAG prompt with a 500-token response costs roughly $0.01: about $0.005 for the input and $0.005 for the output. At 100K requests/day, that's $1,000/day, or about $30K/month.
| Cost driver | Typical range | How to reduce |
|---|---|---|
| Input token count | High for RAG (500–5000 tokens) | Smaller chunks, better retrieval precision (fewer chunks needed) |
| Output token count | Moderate (100–1000 tokens) | Set max_tokens, use concise output instructions |
| Model tier | 10–100× difference between tiers | Route simple queries to smaller models |
| Request volume | Linear with usage | Cache responses for identical or near-identical queries |
| System prompt | Repeated on every request | Use prompt caching (80–90% savings on cached prefix) |
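The per-request arithmetic can be sketched as a small helper. The rates used here ($2.50 and $10 per million input and output tokens) are assumptions for illustration, not a quote of any provider's current price sheet, and the function names are hypothetical.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost in dollars of one call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(cost_per_request, requests_per_day, days=30):
    """Projected monthly spend at a steady request rate."""
    return cost_per_request * requests_per_day * days


# Assumed rates, in dollars per million tokens (check current pricing).
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

per_request = request_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
per_month = monthly_cost(per_request, requests_per_day=100_000)
print(f"${per_request:.4f} per request, ${per_month:,.0f} per month")
```

With these assumed rates, the 500 output tokens account for half of the per-request cost even though they are only a fifth of the tokens, which is why capping output length is often the first lever to pull.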
The latency structure
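LLM latency is usually described with two measurements: Time to First Token (TTFT), how long until the first output token arrives, and Time to Last Token (TTLT), the end-to-end time until the response is complete. TTLT is roughly TTFT plus the output length divided by the generation speed in tokens per second. For user-facing applications, TTFT determines perceived responsiveness. Streaming does not shorten TTLT, but it masks it by showing tokens as they are generated.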
| Latency driver | Typical range | How to reduce |
|---|---|---|
| Model size | Smaller = faster | Route to smaller models where quality permits |
| Input length | Longer = slower TTFT | Reduce prompt length, use caching |
| Output length | Longer = slower TTLT | Limit max_tokens, stream to user |
| Provider load | Variable | Batch less-urgent requests during off-peak |
| Cold start | First request in session | Keep-alive connections, pre-warm |
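As a rough planning aid, end-to-end latency can be estimated from TTFT and decode speed, and the same relation can be inverted to pick a `max_tokens` cap that fits a latency budget. This is a minimal sketch: the function names and the example measurements (0.5 s TTFT, 50 tokens/s) are hypothetical, and real decode speed varies with load.

```python
def estimate_ttlt(ttft_s, output_tokens, tokens_per_second):
    """End-to-end latency: time to first token plus generation time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


def max_output_tokens(ttlt_budget_s, ttft_s, tokens_per_second):
    """Largest max_tokens value that still fits an end-to-end budget."""
    remaining = ttlt_budget_s - ttft_s
    return max(0, int(remaining * tokens_per_second))


# Hypothetical measurements: 0.5 s TTFT, 50 tokens/s decode speed.
latency = estimate_ttlt(ttft_s=0.5, output_tokens=500, tokens_per_second=50)
cap = max_output_tokens(ttlt_budget_s=5.0, ttft_s=0.5, tokens_per_second=50)
print(f"500 tokens take ~{latency:.1f} s; a 5 s budget allows ~{cap} tokens")
```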
The model routing strategy
Not all requests need the same model. A well-designed system classifies incoming requests by complexity and routes to the cheapest model that can handle it. Simple factual lookups → small fast model. Complex reasoning → frontier model. Borderline → try small model, escalate on low-confidence.
# is_classification_task and call_model are application-defined helpers.
def route_request(query, context_length):
    """Route to the cheaper model when at least two 'simple' signals fire."""
    # Serve cache hits from the cache before routing is reached, so cache
    # status is not counted as a routing signal here.
    simple_signals = [
        len(query.split()) < 15,        # Short query
        context_length < 500,           # Minimal context (tokens)
        is_classification_task(query),  # Simple classification
    ]
    if sum(simple_signals) >= 2:
        return call_model("gpt-4o-mini", query)  # Roughly an order of magnitude cheaper
    return call_model("gpt-4o", query)           # Full capability
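The "try small model, escalate on low-confidence" path for borderline requests can be sketched as a two-step cascade. This is an illustration under stated assumptions: the model callables, the `(answer, confidence)` return shape, and the 0.7 threshold are all hypothetical, and how a confidence score is obtained (log-probabilities, a verifier, a self-rating) is left to the caller.

```python
def answer_with_escalation(query, small_model, large_model, min_confidence=0.7):
    """Try the cheap model first; escalate when its confidence is low.

    small_model and large_model are caller-supplied callables that take a
    query and return (answer, confidence in [0, 1]).
    """
    answer, confidence = small_model(query)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large_model(query)
    return answer, "large"


# Stand-in models for illustration only.
def fake_small(query):
    return ("short answer", 0.9 if len(query.split()) < 15 else 0.3)


def fake_large(query):
    return ("detailed answer", 0.95)


print(answer_with_escalation("What is TTFT?", fake_small, fake_large))
print(answer_with_escalation(" ".join(["word"] * 40), fake_small, fake_large))
```

Escalated requests pay for both calls and incur both latencies, so the threshold is worth tuning against how often escalation actually happens.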
Caching strategies
Exact response caching
Cache the full response for identical inputs. Works well for high-repetition use cases (FAQ bots, standard report templates). Use a hash of the input as cache key. TTL depends on how often your knowledge base changes.
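A minimal in-memory sketch of this idea follows. The class name and the choice to include the model name in the key are assumptions made for illustration; a production system would more likely use a shared store such as Redis.

```python
import hashlib
import time


class ExactCache:
    """In-memory exact-match response cache with a per-entry TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expires_at, response)

    @staticmethod
    def make_key(model, prompt):
        # Include the model name so different models never share entries.
        payload = f"{model}\n{prompt}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, model, prompt):
        key = self.make_key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]    # expired: drop and report a miss
            return None
        return response

    def put(self, model, prompt, response):
        key = self.make_key(model, prompt)
        self._store[key] = (self.clock() + self.ttl_seconds, response)


exact_cache = ExactCache(ttl_seconds=3600)
exact_cache.put("gpt-4o-mini", "What are your opening hours?", "9am to 5pm.")
print(exact_cache.get("gpt-4o-mini", "What are your opening hours?"))  # hit
print(exact_cache.get("gpt-4o-mini", "what are your opening hours?"))  # miss: case differs
```

The second lookup misses because exact caching is sensitive to every character of the input; normalising whitespace and case before hashing raises the hit rate at the cost of treating slightly different prompts as identical.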
Semantic caching
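Embed incoming queries and return a cached response when a previous query is a near-duplicate (for example, cosine similarity > 0.95). This catches questions that are worded differently from ones already answered but mean the same thing. It can reduce LLM calls by 20–40% on high-volume consumer applications, though a threshold set too low will return wrong answers for questions that only look similar. Tools: GPTCache, or the semantic caching layers offered by some vector stores.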
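The lookup can be sketched with a linear scan and a caller-supplied embedding function. Everything here is illustrative: the class name is hypothetical, the toy vectors stand in for a real embedding model, and a real deployment would use a vector index rather than scanning every entry.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan semantic cache; fine for a sketch, not for large stores."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # caller-supplied: text -> list of floats
        self.threshold = threshold
        self._entries = []          # list of (embedding, response)

    def get(self, query):
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response):
        self._entries.append((self.embed(query), response))


# Toy embeddings for illustration; a real system would call an embedding model.
TOY_VECTORS = {
    "how do I reset my password": [1.0, 0.0, 0.1],
    "how can I reset my password": [0.98, 0.05, 0.1],
    "what is your refund policy": [0.0, 1.0, 0.2],
}

semantic_cache = SemanticCache(embed=TOY_VECTORS.__getitem__)
semantic_cache.put("how do I reset my password", "Use the 'Forgot password' link.")
print(semantic_cache.get("how can I reset my password"))  # near-duplicate: hit
print(semantic_cache.get("what is your refund policy"))   # unrelated: miss
```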
Prompt caching
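Reuse the KV cache for a repeated prompt prefix (system prompt + static RAG context) instead of recomputing it on every call. Anthropic's prompt caching bills cache reads at 10% of the base input price, a 90% saving on cached input tokens. For a system with a 4,000-token system prompt called 1M times/day, that prefix adds up to 120B input tokens/month: about $360,000 uncached at an assumed $3 per million input tokens, versus about $36,000 cached, a saving of roughly $324,000/month before cache-write surcharges.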
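The savings arithmetic for a repeated prefix can be checked with a few lines. The inputs are assumptions: $3 per million input tokens, cached reads billed at 10% of the base rate, and a cache that stays warm, so cache-write surcharges and misses are ignored. The function name is hypothetical.

```python
def prefix_cost_per_month(prefix_tokens, calls_per_day, input_price_per_m,
                          read_multiplier=1.0, days=30):
    """Monthly cost of sending a fixed prompt prefix on every call."""
    tokens = prefix_tokens * calls_per_day * days
    return tokens / 1_000_000 * input_price_per_m * read_multiplier


# Assumptions: $3 per million input tokens, cached reads billed at 10% of
# the base rate, and a cache that stays warm (write surcharges ignored).
uncached = prefix_cost_per_month(4_000, 1_000_000, input_price_per_m=3.0)
cached = prefix_cost_per_month(4_000, 1_000_000, input_price_per_m=3.0,
                               read_multiplier=0.1)
print(f"uncached ${uncached:,.0f}, cached ${cached:,.0f}, "
      f"saved ${uncached - cached:,.0f} per month")
```

Under these assumptions the cached prefix costs about a tenth of the uncached one, so the saving is the remaining nine tenths.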
Quantisation and self-hosted tradeoffs
For very high volume, self-hosting open-source models becomes cost-competitive. A 70B-parameter Llama model quantised to INT4 can run on 2× A100 GPUs at roughly $5/hour, though prices vary widely by provider. At a sustained 50 requests/minute (3,000 requests/hour) averaging 2,000 tokens each, that's roughly $0.0017 per request, vs. $0.005 or more for a GPT-4o call of similar size. At scale, a difference of 3× or more is significant.
Self-hosting looks cheaper per token but adds engineering overhead: serving infrastructure, scaling, model updates, compliance. Below $50K/month in API spend, self-hosting usually doesn't pencil out when you factor in engineering time.
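A quick way to sanity-check the hardware side of this comparison is to compute the per-request cost at a sustained load and the load at which self-hosting breaks even with an API price. This sketch covers GPU rental only: it assumes the cluster is billed hourly whether busy or idle, ignores the engineering overhead described above, and uses $0.005 per request as an assumed API comparison point. The function names are hypothetical.

```python
def self_hosted_cost_per_request(gpu_cost_per_hour, requests_per_minute):
    """Per-request cost when the cluster is billed hourly, busy or idle."""
    return gpu_cost_per_hour / (requests_per_minute * 60)


def break_even_requests_per_minute(gpu_cost_per_hour, api_cost_per_request):
    """Sustained load at which self-hosting matches the API per request."""
    return gpu_cost_per_hour / (api_cost_per_request * 60)


# $5/hour and 50 requests/minute come from the example; the API price is assumed.
hosted = self_hosted_cost_per_request(gpu_cost_per_hour=5.0, requests_per_minute=50)
break_even = break_even_requests_per_minute(gpu_cost_per_hour=5.0,
                                            api_cost_per_request=0.005)
print(f"${hosted:.4f} per request; break-even at ~{break_even:.1f} requests/minute")
```

Below the break-even load the GPUs sit partly idle and the API is cheaper per request, which is one reason low-volume systems rarely benefit from self-hosting.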
Setting budgets and alerts
- Set per-user daily token limits to prevent runaway abuse
- Alert at 50%, 80%, 100% of monthly budget — don't wait for the bill
- Track cost per feature: not just overall spend, but which features drive it
- Budget both per-request cost (for pricing decisions) and monthly spend (for planning)
- Run weekly cost reviews for the first 3 months after a new feature launches
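The alert thresholds and per-user limits in the list above can be sketched as two small checks. The function names and example figures are hypothetical, and a real system would read spend and usage from its billing or metering store.

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)   # fractions of the monthly budget


def newly_crossed_thresholds(previous_spend, current_spend, monthly_budget,
                             thresholds=ALERT_THRESHOLDS):
    """Thresholds crossed since the last check, so each alert fires once."""
    return [t for t in thresholds
            if previous_spend < t * monthly_budget <= current_spend]


def within_daily_limit(tokens_used_today, requested_tokens, daily_limit):
    """True if the user can spend requested_tokens without exceeding the limit."""
    return tokens_used_today + requested_tokens <= daily_limit


print(newly_crossed_thresholds(4_000, 8_500, monthly_budget=10_000))
print(within_daily_limit(95_000, 10_000, daily_limit=100_000))
```

Comparing against the previous spend figure, rather than only the current one, keeps the same threshold from alerting on every subsequent check.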