AI Engineering 10 min read

LLM Cost Optimization Playbook: From $10K to $1K Monthly

A systematic approach to cutting LLM inference costs by 80–90% without degrading quality. Caching, model routing, prompt compression, and batch processing.

Most teams discover their LLM cost problem the same way: they get the first month's bill and it is 10x what they expected. By then the product is live, the architecture is set, and 'we will optimise later' has become a $50K/month line item.

The good news: 80–90% cost reduction is achievable in most workloads without touching model quality. The work is systematic, not heroic. Here is the playbook.

Step 1: The cost audit — where is the money actually going?

Before optimising anything, measure everything. Most teams guess wrong about where their costs come from. Run this audit first:

Cost category	How to measure	Typical % of total cost
Input tokens — system prompt	Token count x input price x request volume	30–60% (often the biggest lever)
Input tokens — user content	Avg user message tokens x volume	5–15%
Input tokens — retrieved context (RAG)	Avg chunks x chunk size x volume	10–30%
Output tokens	Avg response length x output price x volume	15–40% (output costs 3–5x input)
Embedding calls	Docs embedded x embedding price	1–5%

The most common finding: 40–60% of costs come from the system prompt being resent on every request with no caching. A 10,000-token system prompt at $3/1M tokens, sent 100,000 times/month = $3,000/month on the system prompt alone. Prompt caching fixes this in an afternoon.

The 5-lever framework

Lever 1: Cache aggressively

If any part of your prompt is repeated across requests — system prompt, knowledge base content, few-shot examples — cache it. Cache read cost is 10% of standard input cost on Claude, 50% on OpenAI. For a 50K-token system prompt, this is a 90% reduction on that token block.

Expected savings: 40–80% of total cost for workloads with large static system prompts. Implementation time: half a day.

Lever 2: Route by complexity

Not every query needs your most capable model. A routing layer classifies each request and routes it to the cheapest model that can handle it.

Query type	Example	Routed to	Approx cost/1M queries
Simple lookup / FAQ	'What are your business hours?'	Claude Haiku / GPT-4o mini	$0.15–0.40
Standard reasoning	'Summarise this doc and identify risks'	Claude Sonnet / GPT-4o	$2.50–3.00
Complex analysis	'Analyse this financial model for DCF errors'	Claude Opus / GPT-4o full	$15.00

Without routing: 100% queries to large model @ $15.00/1M = $15.00 blended

With routing (80% simple, 20% complex):
  80% x $0.30  = $0.24
  20% x $15.00 = $3.00
  Blended      = $3.24/1M queries

Savings: 78% cost reduction, quality preserved for complex queries

The router itself should be fast and cheap — a small classifier or a tiny LLM with a complexity-score prompt. If your router costs more than the savings it generates, you have over-engineered it. A simple heuristic (query length + keyword matching) often captures 70% of the routing benefit.

Lever 3: Compress prompts

LLM prompts written by humans are verbose. Technical documentation, legal text, and marketing copy often contain 3–5x more tokens than necessary to convey the meaning. Techniques that work:

Remove filler phrases: 'Please note that', 'It is important to understand that', 'As mentioned above'
Compress few-shot examples: rewrite them to be tighter without losing the demonstrated pattern
Use structured formats: bullet points and tables over paragraphs for reference material
LLMLingua: an open-source model specifically designed to compress prompts 2-4x with less than 5% quality degradation
Semantic deduplication: if your RAG retrieves overlapping chunks, deduplicate before injecting into context

Expected savings: 20–40% reduction in input tokens on prompt-heavy workloads.

Lever 4: Batch offline work

Real-time APIs charge premium pricing for synchronous responses. Batch APIs (OpenAI Batch API at 50% discount, with 24-hour SLA) process requests asynchronously. Any workload that does not need a synchronous response is a candidate: document classification, bulk summarisation, nightly knowledge base updates, evaluation runs, report generation.

OpenAI Batch API: 50% discount on all inputs and outputs, results within 24 hours. If you are running nightly RAG index updates, document ingestion pipelines, or any bulk processing — there is no reason to use the synchronous API for these tasks.

Lever 5: Fine-tune for repetitive tasks

If you have a high-volume task with consistent structure — same input format, same output format, same reasoning pattern — fine-tuning a small model on that specific task is often 10–50x cheaper than using a large general model.

The economics: fine-tuning GPT-4o mini costs ~$0.008/1K tokens for training. At inference, a fine-tuned small model at $0.001/1K input vs. GPT-4o at $0.0025/1K input, the break-even on training cost is reached at roughly 1–2M inferences — typically 2–4 weeks for high-volume workloads.

Real savings examples

Workload	Before	After	Primary lever
Customer support (50K token system prompt, 100K req/day)	$4,590/mo	$540/mo	Prompt caching
Document analysis pipeline (10M docs/month)	$12,000/mo	$2,400/mo	Batch API + model routing
RAG with verbose retrieved chunks	$8,000/mo	$3,200/mo	Prompt compression + smaller model for simple queries
Agent with uncapped step loops	$6,000/mo	$1,800/mo	Step budgets + route simple sub-tasks to small model

Monitor cost per task completion, not cost per token

Cost per token incentivises shorter outputs that may be lower quality and does not account for retry costs from failures. The right metric is cost per successful task completion: tokens used, number of retries, routing overhead, and whether the output actually accomplished the goal.

A cheaper model that requires 3 retries may cost more per completion than a more expensive model that gets it right first time. Measure at the task level.

Build a cost dashboard showing: total spend by feature, cost per successful completion by feature, cost breakdown by token type (system/user/retrieved/output), and week-over-week trend. Instrument this before you optimise — you need a baseline to know whether your changes are working.

Priority order for most workloads

1. Audit first — measure where the money goes before changing anything
2. Add prompt caching — highest ROI, lowest risk, fastest win
3. Implement step budgets for agents — prevents runaway loops burning budget
4. Add model routing — route simple queries to small models
5. Compress prompts — rewrite verbose system prompts and few-shot examples
6. Move batch workloads to batch API — immediate 50% discount
7. Fine-tune for high-volume repetitive tasks — highest complexity, highest ROI at scale

Model Strategy Lab →: Compare cost profiles across models and routing strategies for your specific workload.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →