LLM Cost Optimization Playbook: From $10K to $1K Monthly
A systematic approach to cutting LLM inference costs by 80–90% without degrading quality. Caching, model routing, prompt compression, and batch processing.
Most teams discover their LLM cost problem the same way: they get the first month's bill and it is 10x what they expected. By then the product is live, the architecture is set, and 'we will optimise later' has become a $50K/month line item.
The good news: 80–90% cost reduction is achievable in most workloads without touching model quality. The work is systematic, not heroic. Here is the playbook.
Step 1: The cost audit — where is the money actually going?
Before optimising anything, measure everything. Most teams guess wrong about where their costs come from. Run this audit first:
| Cost category | How to measure | Typical % of total cost |
|---|---|---|
| Input tokens — system prompt | Token count x input price x request volume | 30–60% (often the biggest lever) |
| Input tokens — user content | Avg user message tokens x volume | 5–15% |
| Input tokens — retrieved context (RAG) | Avg chunks x chunk size x volume | 10–30% |
| Output tokens | Avg response length x output price x volume | 15–40% (output costs 3–5x input) |
| Embedding calls | Docs embedded x embedding price | 1–5% |
The most common finding: 40–60% of costs come from the system prompt being resent on every request with no caching. A 10,000-token system prompt at $3/1M tokens, sent 100,000 times/month = $3,000/month on the system prompt alone. Prompt caching fixes this in an afternoon.
The 5-lever framework
Lever 1: Cache aggressively
If any part of your prompt is repeated across requests — system prompt, knowledge base content, few-shot examples — cache it. Cache read cost is 10% of standard input cost on Claude, 50% on OpenAI. For a 50K-token system prompt, this is a 90% reduction on that token block.
Expected savings: 40–80% of total cost for workloads with large static system prompts. Implementation time: half a day.
Lever 2: Route by complexity
Not every query needs your most capable model. A routing layer classifies each request and routes it to the cheapest model that can handle it.
| Query type | Example | Routed to | Approx cost/1M queries |
|---|---|---|---|
| Simple lookup / FAQ | 'What are your business hours?' | Claude Haiku / GPT-4o mini | $0.15–0.40 |
| Standard reasoning | 'Summarise this doc and identify risks' | Claude Sonnet / GPT-4o | $2.50–3.00 |
| Complex analysis | 'Analyse this financial model for DCF errors' | Claude Opus / GPT-4o full | $15.00 |
Without routing: 100% queries to large model @ $15.00/1M = $15.00 blended
With routing (80% simple, 20% complex):
80% x $0.30 = $0.24
20% x $15.00 = $3.00
Blended = $3.24/1M queries
Savings: 78% cost reduction, quality preserved for complex queries
The router itself should be fast and cheap — a small classifier or a tiny LLM with a complexity-score prompt. If your router costs more than the savings it generates, you have over-engineered it. A simple heuristic (query length + keyword matching) often captures 70% of the routing benefit.
Lever 3: Compress prompts
LLM prompts written by humans are verbose. Technical documentation, legal text, and marketing copy often contain 3–5x more tokens than necessary to convey the meaning. Techniques that work:
- Remove filler phrases: 'Please note that', 'It is important to understand that', 'As mentioned above'
- Compress few-shot examples: rewrite them to be tighter without losing the demonstrated pattern
- Use structured formats: bullet points and tables over paragraphs for reference material
- LLMLingua: an open-source model specifically designed to compress prompts 2-4x with less than 5% quality degradation
- Semantic deduplication: if your RAG retrieves overlapping chunks, deduplicate before injecting into context
Expected savings: 20–40% reduction in input tokens on prompt-heavy workloads.
Lever 4: Batch offline work
Real-time APIs charge premium pricing for synchronous responses. Batch APIs (OpenAI Batch API at 50% discount, with 24-hour SLA) process requests asynchronously. Any workload that does not need a synchronous response is a candidate: document classification, bulk summarisation, nightly knowledge base updates, evaluation runs, report generation.
OpenAI Batch API: 50% discount on all inputs and outputs, results within 24 hours. If you are running nightly RAG index updates, document ingestion pipelines, or any bulk processing — there is no reason to use the synchronous API for these tasks.
Lever 5: Fine-tune for repetitive tasks
If you have a high-volume task with consistent structure — same input format, same output format, same reasoning pattern — fine-tuning a small model on that specific task is often 10–50x cheaper than using a large general model.
The economics: fine-tuning GPT-4o mini costs ~$0.008/1K tokens for training. At inference, a fine-tuned small model at $0.001/1K input vs. GPT-4o at $0.0025/1K input, the break-even on training cost is reached at roughly 1–2M inferences — typically 2–4 weeks for high-volume workloads.
Real savings examples
| Workload | Before | After | Primary lever |
|---|---|---|---|
| Customer support (50K token system prompt, 100K req/day) | $4,590/mo | $540/mo | Prompt caching |
| Document analysis pipeline (10M docs/month) | $12,000/mo | $2,400/mo | Batch API + model routing |
| RAG with verbose retrieved chunks | $8,000/mo | $3,200/mo | Prompt compression + smaller model for simple queries |
| Agent with uncapped step loops | $6,000/mo | $1,800/mo | Step budgets + route simple sub-tasks to small model |
Monitor cost per task completion, not cost per token
Cost per token incentivises shorter outputs that may be lower quality and does not account for retry costs from failures. The right metric is cost per successful task completion: tokens used, number of retries, routing overhead, and whether the output actually accomplished the goal.
A cheaper model that requires 3 retries may cost more per completion than a more expensive model that gets it right first time. Measure at the task level.
Build a cost dashboard showing: total spend by feature, cost per successful completion by feature, cost breakdown by token type (system/user/retrieved/output), and week-over-week trend. Instrument this before you optimise — you need a baseline to know whether your changes are working.
Priority order for most workloads
- 1. Audit first — measure where the money goes before changing anything
- 2. Add prompt caching — highest ROI, lowest risk, fastest win
- 3. Implement step budgets for agents — prevents runaway loops burning budget
- 4. Add model routing — route simple queries to small models
- 5. Compress prompts — rewrite verbose system prompts and few-shot examples
- 6. Move batch workloads to batch API — immediate 50% discount
- 7. Fine-tune for high-volume repetitive tasks — highest complexity, highest ROI at scale
Model Strategy Lab →: Compare cost profiles across models and routing strategies for your specific workload.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →