GenAI Systems Lab Open interactive version →
AI Engineering 10 min read

LLM Cost Optimization Playbook: From $10K to $1K Monthly

A systematic approach to cutting LLM inference costs by 80–90% without degrading quality. Caching, model routing, prompt compression, and batch processing.

Most teams discover their LLM cost problem the same way: they get the first month's bill and it is 10x what they expected. By then the product is live, the architecture is set, and 'we will optimise later' has become a $50K/month line item.

The good news: 80–90% cost reduction is achievable in most workloads without touching model quality. The work is systematic, not heroic. Here is the playbook.

Step 1: The cost audit — where is the money actually going?

Before optimising anything, measure everything. Most teams guess wrong about where their costs come from. Run this audit first:

Cost categoryHow to measureTypical % of total cost
Input tokens — system promptToken count x input price x request volume30–60% (often the biggest lever)
Input tokens — user contentAvg user message tokens x volume5–15%
Input tokens — retrieved context (RAG)Avg chunks x chunk size x volume10–30%
Output tokensAvg response length x output price x volume15–40% (output costs 3–5x input)
Embedding callsDocs embedded x embedding price1–5%

The most common finding: 40–60% of costs come from the system prompt being resent on every request with no caching. A 10,000-token system prompt at $3/1M tokens, sent 100,000 times/month = $3,000/month on the system prompt alone. Prompt caching fixes this in an afternoon.

The 5-lever framework

Lever 1: Cache aggressively

If any part of your prompt is repeated across requests — system prompt, knowledge base content, few-shot examples — cache it. Cache read cost is 10% of standard input cost on Claude, 50% on OpenAI. For a 50K-token system prompt, this is a 90% reduction on that token block.

Expected savings: 40–80% of total cost for workloads with large static system prompts. Implementation time: half a day.

Lever 2: Route by complexity

Not every query needs your most capable model. A routing layer classifies each request and routes it to the cheapest model that can handle it.

Query typeExampleRouted toApprox cost/1M queries
Simple lookup / FAQ'What are your business hours?'Claude Haiku / GPT-4o mini$0.15–0.40
Standard reasoning'Summarise this doc and identify risks'Claude Sonnet / GPT-4o$2.50–3.00
Complex analysis'Analyse this financial model for DCF errors'Claude Opus / GPT-4o full$15.00
Without routing: 100% queries to large model @ $15.00/1M = $15.00 blended

With routing (80% simple, 20% complex):
  80% x $0.30  = $0.24
  20% x $15.00 = $3.00
  Blended      = $3.24/1M queries

Savings: 78% cost reduction, quality preserved for complex queries

The router itself should be fast and cheap — a small classifier or a tiny LLM with a complexity-score prompt. If your router costs more than the savings it generates, you have over-engineered it. A simple heuristic (query length + keyword matching) often captures 70% of the routing benefit.

Lever 3: Compress prompts

LLM prompts written by humans are verbose. Technical documentation, legal text, and marketing copy often contain 3–5x more tokens than necessary to convey the meaning. Techniques that work:

Expected savings: 20–40% reduction in input tokens on prompt-heavy workloads.

Lever 4: Batch offline work

Real-time APIs charge premium pricing for synchronous responses. Batch APIs (OpenAI Batch API at 50% discount, with 24-hour SLA) process requests asynchronously. Any workload that does not need a synchronous response is a candidate: document classification, bulk summarisation, nightly knowledge base updates, evaluation runs, report generation.

OpenAI Batch API: 50% discount on all inputs and outputs, results within 24 hours. If you are running nightly RAG index updates, document ingestion pipelines, or any bulk processing — there is no reason to use the synchronous API for these tasks.

Lever 5: Fine-tune for repetitive tasks

If you have a high-volume task with consistent structure — same input format, same output format, same reasoning pattern — fine-tuning a small model on that specific task is often 10–50x cheaper than using a large general model.

The economics: fine-tuning GPT-4o mini costs ~$0.008/1K tokens for training. At inference, a fine-tuned small model at $0.001/1K input vs. GPT-4o at $0.0025/1K input, the break-even on training cost is reached at roughly 1–2M inferences — typically 2–4 weeks for high-volume workloads.

Real savings examples

WorkloadBeforeAfterPrimary lever
Customer support (50K token system prompt, 100K req/day)$4,590/mo$540/moPrompt caching
Document analysis pipeline (10M docs/month)$12,000/mo$2,400/moBatch API + model routing
RAG with verbose retrieved chunks$8,000/mo$3,200/moPrompt compression + smaller model for simple queries
Agent with uncapped step loops$6,000/mo$1,800/moStep budgets + route simple sub-tasks to small model

Monitor cost per task completion, not cost per token

Cost per token incentivises shorter outputs that may be lower quality and does not account for retry costs from failures. The right metric is cost per successful task completion: tokens used, number of retries, routing overhead, and whether the output actually accomplished the goal.

A cheaper model that requires 3 retries may cost more per completion than a more expensive model that gets it right first time. Measure at the task level.

Build a cost dashboard showing: total spend by feature, cost per successful completion by feature, cost breakdown by token type (system/user/retrieved/output), and week-over-week trend. Instrument this before you optimise — you need a baseline to know whether your changes are working.

Priority order for most workloads

Model Strategy Lab →: Compare cost profiles across models and routing strategies for your specific workload.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →