
Prompt Caching: Cut Your LLM Costs by 90% Without Changing Your Prompts

How prefix caching works in Anthropic Claude and OpenAI APIs, when it applies, how to structure prompts to maximise cache hits, and real cost savings math.

Prompt caching is the single highest-ROI optimisation available for most production LLM workloads. You do not change your model, you do not change your prompts, and you do not degrade quality. You just stop paying full price for tokens you have already sent.

In the right workload it cuts costs by 80–90%. In the wrong workload it does nothing. The difference is understanding exactly how it works.

What prefix caching is

Every time you send a request to an LLM API, the model processes all your input tokens from scratch — computing key-value (KV) pairs for each token in the attention layers. Prefix caching means: if the beginning of your prompt is identical to a recent previous request, the API reuses the pre-computed KV cache from that request instead of recomputing it.

The result: those cached tokens are dramatically cheaper to process. You still pay to read from cache, but at a fraction of the cost of a full compute pass.

Anthropic Claude pricing (2025): cache write = 125% of the standard input token cost (a 25% premium), cache read = 10% of the standard input cost. OpenAI automatic caching: cached tokens cost 50% of standard input. The savings are real and substantial at scale.

How Claude cache_control works

With Claude, caching is explicit. You mark which parts of your prompt should be cached using the cache_control parameter. The API stores the KV cache for those blocks and reuses it on subsequent requests with the same prefix.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
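    system=[
        {
            "type": "text",
            # Single-line string with escaped newlines; the bracketed text stands in for the real docs
            "text": "You are an expert analyst.\n\n[50,000 tokens of docs here...]",
            "cache_control": {"type": "ephemeral"}  # cache the prefix up to and including this block
        }
    ],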
    messages=[
        {"role": "user", "content": "What is the capital gains tax rate for equity held > 2 years?"}
    ]
)
# First call: cache WRITE (125% of standard input cost for the cached block)
# Later calls with the same prefix, inside the TTL: cache READ (10% of standard input cost)
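
To check whether a given call wrote to the cache or read from it, you can look at the usage block on the response. The sketch below assumes the usage fields the Messages API reports for caching (cache_creation_input_tokens and cache_read_input_tokens). The helper name classify_cache_usage and the stand-in usage objects with made-up token counts are illustrative only, not part of the SDK.

```python
from types import SimpleNamespace


def classify_cache_usage(usage):
    """Label a usage block as 'write', 'read', 'read+write', or 'none'."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if written and read:
        return "read+write"  # part of the prefix was reused, a new part was stored
    if read:
        return "read"
    if written:
        return "write"
    return "none"  # nothing cached: no cache_control, or prefix too short


# Stand-in objects for illustration; in real code pass response.usage instead.
first_call = SimpleNamespace(
    input_tokens=25, cache_creation_input_tokens=50_000, cache_read_input_tokens=0
)
second_call = SimpleNamespace(
    input_tokens=25, cache_creation_input_tokens=0, cache_read_input_tokens=50_000
)

print(classify_cache_usage(first_call))
print(classify_cache_usage(second_call))
```

Logging this label per request is a cheap way to see your real hit rate before trusting any savings estimate.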

The 5-minute TTL

Claude's cache has a 5-minute TTL (time-to-live). If no request uses a cached prefix within 5 minutes, the cache entry expires and must be rewritten on the next request. This has important implications:

High-traffic endpoints benefit most — at 100+ requests/minute the cache never expires
Low-traffic endpoints may never get cache hits — requests coming in every 10 minutes always pay write price
Batch workloads should maximise prefix overlap and run within the TTL window
The TTL resets on each cache hit — sustained traffic keeps the cache alive indefinitely
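
The effect of traffic rate on the TTL can be sketched with a small simulation. This is a simplified model, not the provider's actual eviction logic: it assumes a single cached prefix, a sliding 5-minute window that is refreshed by every request, and request timestamps in seconds. The function name simulate_cache is made up for this example.

```python
def simulate_cache(request_times_s, ttl_s=300):
    """Count (writes, reads) for request timestamps given in seconds.

    Models a sliding TTL: every request, whether it writes or hits,
    moves the expiry to its own time plus ttl_s.
    """
    writes = reads = 0
    expires_at = None
    for t in sorted(request_times_s):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still alive: cache hit
        else:
            writes += 1  # no entry or expired: pay the write price
        expires_at = t + ttl_s
    return writes, reads


# One request per minute for an hour: one write, then only hits.
busy = [60 * i for i in range(60)]
# One request every 10 minutes for an hour: every request is a write.
quiet = [600 * i for i in range(6)]

print(simulate_cache(busy))   # (1, 59)
print(simulate_cache(quiet))  # (6, 0)
```

Feeding real request timestamps from your logs into a model like this gives a rough hit-rate estimate before you enable caching.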

OpenAI's caching is automatic and has a similar TTL mechanism, but without explicit cache_control markers. It applies to prompts of 1,024 tokens or more: you do not need to opt in, but you also cannot force a specific block to cache.

Cost breakdown

Token type	Claude 3.5 Sonnet (per 1M tokens)	vs. standard input
Standard input	$3.00	baseline
Cache write	$3.75	+25% (one-time per TTL window)
Cache read	$0.30	-90%
Output	$15.00	n/a

The write premium is paid once per cache entry per TTL window. After that, every read costs one-tenth of standard input. Break-even comes quickly: one write plus one read costs 1.25 + 0.10 = 1.35x the standard input price, versus 2x for two uncached requests, so a single cache hit within the TTL window already saves money versus not caching.
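
The break-even arithmetic can be checked with a few lines of Python. This sketch uses the per-1M-token prices from the table above and assumes one cache write followed by a number of reads inside the same TTL window. The function name and the 50,000-token block size are illustrative choices.

```python
def cached_vs_uncached_cost(block_tokens, reads,
                            input_price=3.00, write_price=3.75, read_price=0.30):
    """Return (uncached_usd, cached_usd) for one write followed by `reads` hits.

    Prices are USD per 1M tokens. Only the cacheable block is counted.
    """
    millions = block_tokens / 1_000_000
    uncached = (1 + reads) * millions * input_price
    cached = millions * write_price + reads * millions * read_price
    return uncached, cached


for reads in (0, 1, 2, 10):
    uncached, cached = cached_vs_uncached_cost(50_000, reads)
    print(f"{reads:>2} reads: uncached ${uncached:.4f}  cached ${cached:.4f}")
```

With zero reads the cached path costs more, because you paid the write premium for nothing. From the first read onward it is cheaper under these prices.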

Prompt structure for maximum cache hits

Caching works on prefixes — identical leading tokens. To maximise cache hits, structure your prompts with the most static content first and the most dynamic content last:

1. Static system prompt (instructions, persona, output format) — cache this
2. Static knowledge (product docs, policy docs, knowledge base content) — cache this
3. Dynamic per-user context (user profile, session history) — not cached, comes after cached block
4. Dynamic per-request content (the actual user query) — never in cache

A common mistake: putting the user's name or any personalisation in the system prompt. 'You are helping John Smith' breaks caching for every other user. Keep the system prompt entirely generic and move personalisation into the dynamic section after the cache boundary.
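
One way to enforce that ordering is to assemble every request through a single helper, so static blocks always come first and personalisation always lands after the cache boundary. The sketch below only builds the keyword arguments for client.messages.create and makes no network call. The helper name build_request and the sample strings are hypothetical.

```python
def build_request(instructions, knowledge, user_context, user_query,
                  model="claude-3-5-sonnet-20241022"):
    """Assemble Messages API kwargs with static content first, dynamic content last."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": instructions},
            # Cache breakpoint on the last static block:
            # everything up to and including it forms the shared prefix.
            {"type": "text", "text": knowledge,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Per-user and per-request text sits after the breakpoint,
            # so it never changes the cached prefix.
            {"role": "user", "content": f"{user_context}\n\n{user_query}"},
        ],
    }


a = build_request("You are a support agent.", "[policy docs]",
                  "Customer: John Smith", "Where is my refund?")
b = build_request("You are a support agent.", "[policy docs]",
                  "Customer: Ana Lopez", "How do I cancel?")

# Both users share a byte-identical prefix, so the second call can hit the cache.
assert a["system"] == b["system"]
```

A call would then look like client.messages.create(**build_request(...)). Keeping the generic instructions and the knowledge base as separate blocks is a design choice here, not a requirement: a single static block with cache_control works the same way.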

Worked cost example

Customer support chatbot: 50,000-token system prompt + policy docs, 1,000 requests/day, ~200 output tokens each.

Scenario	Input cost/day	Output cost/day	Monthly
No caching	1,000 x 50K x $3/1M = $150	1,000 x 200 x $15/1M = $3	$4,590
With caching (1 write/hour + reads)	24 x 50K x $3.75/1M + 976 x 50K x $0.30/1M = $19.14	$3	$664

That is $4,590 down to about $664 per month, an 85.5% cost reduction, on the same model, same quality, zero prompt changes. The only change is adding cache_control to the system prompt block.
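
To rerun this estimate with your own traffic numbers, the calculation can be written as a short script. It is a sketch under the same simplifications as the table: Claude 3.5 Sonnet list prices per 1M tokens, a 30-day month, and the small per-request user query left out of the input cost. The names PRICE_PER_M and daily_cost are made up for this example.

```python
PRICE_PER_M = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00}


def daily_cost(requests, prompt_tokens, output_tokens, cache_writes=None):
    """Daily cost in USD. cache_writes=None means caching is off."""
    def usd(tokens, kind):
        return tokens / 1_000_000 * PRICE_PER_M[kind]

    output = requests * usd(output_tokens, "output")
    if cache_writes is None:
        return requests * usd(prompt_tokens, "input") + output
    reads = requests - cache_writes  # every non-write request is a cache hit
    return (cache_writes * usd(prompt_tokens, "cache_write")
            + reads * usd(prompt_tokens, "cache_read")
            + output)


uncached = daily_cost(1_000, 50_000, 200)
cached = daily_cost(1_000, 50_000, 200, cache_writes=24)

print(f"uncached: ${uncached:.2f}/day, ${uncached * 30:.2f}/month")
print(f"cached:   ${cached:.2f}/day, ${cached * 30:.2f}/month")
print(f"reduction: {1 - cached / uncached:.1%}")
```

The cache_writes argument is the number to be honest about: it should come from your measured or simulated miss count, not from a best-case guess.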

When caching does not help

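Short prompts (< 1,024 tokens): below the minimum cacheable prompt length (1,024 tokens for Claude 3.5 Sonnet and for OpenAI's automatic caching), so the request is processed with no caching at all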
High-variation system prompts: if your system prompt changes per user or per request, there is no shared prefix to cache
Single-shot batch jobs: if each prompt is unique, cache entries are written but never read
Embeddings workloads: caching applies to generation calls, not embedding API calls
Very low traffic: if requests come less frequently than the TTL, you will mostly pay write prices
