Prompt Caching and Token Cost Optimisation at Scale

How prompt caching works, when it pays off, and how to reduce inference cost by 60-80% on repetitive system prompt workloads.

If you're calling an LLM thousands of times per day with the same system prompt, you're paying full input price for the same tokens every single time. Prompt caching lets you pay full price once per cache window and a fraction of it on every reuse, and on prompt-heavy workloads the savings add up quickly.

How prompt caching works

When you make an API call, the provider checks whether the first N tokens of your prompt match a recently cached prefix. If they do, those tokens are served from cache at a fraction of the normal cost (Anthropic: 10%, OpenAI: 50%).
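
To make the prefix rule concrete, here is a toy sketch of the matching step. It is not how any provider implements its cache: whitespace-split words stand in for real tokens, and the function name and the "Acme" prompt are made up for illustration. The point it shows is that only the leading run of identical tokens can be reused, so a per-request value at the start of the prompt defeats caching entirely.

```python
def shared_prefix_len(cached: list[str], prompt: list[str]) -> int:
    """Return how many leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# Whitespace-split words stand in for real tokens in this toy example.
system = "You are a support agent for Acme . Answer briefly .".split()
first_call = system + "How do I reset my password ?".split()
second_call = system + "Where is my invoice ?".split()

# A per-request value placed first changes the prefix from token 0 onwards.
stamped_call = ["2024-05-01T12:00:00Z"] + second_call

print(shared_prefix_len(first_call, second_call))   # the whole system prompt matches
print(shared_prefix_len(first_call, stamped_call))  # nothing matches
```

The practical takeaway: keep timestamps, user IDs, and anything else that varies per request after the static part of the prompt.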

On Anthropic models with a base input price of $3.00 per 1M tokens, cached reads (reported as cache_read_input_tokens) cost $0.30 per 1M, a 90% discount. OpenAI cached input tokens are 50% off. The cache window is typically 5 minutes, and each hit restarts it.
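
The interaction between the 5-minute window and the discounted read price can be sketched with a small simulation. This is a rough model under stated assumptions, not provider behaviour: one shared prefix, a window that restarts on every call, the $3.00 and $0.30 per 1M prices quoted above, and no surcharge for writing to the cache. The names `Tally`, `tally_hits`, and `prefix_cost` are hypothetical.

```python
from dataclasses import dataclass

# Prices quoted in the text, in USD per 1M input tokens.
BASE_PRICE = 3.00
CACHE_READ_PRICE = 0.30
TTL_SECONDS = 5 * 60  # cache window; restarted by every call in this model


@dataclass
class Tally:
    writes: int = 0  # calls that found no live cache entry
    reads: int = 0   # calls served from the cache


def tally_hits(call_times: list[float], ttl: float = TTL_SECONDS) -> Tally:
    """Count cache hits for one shared prefix, given call times in seconds."""
    tally = Tally()
    expires_at = None
    for t in sorted(call_times):
        if expires_at is not None and t < expires_at:
            tally.reads += 1
        else:
            tally.writes += 1
        expires_at = t + ttl  # a write or a hit both restart the window
    return tally


def prefix_cost(tokens: int, tally: Tally) -> float:
    """USD spent on the shared prefix, ignoring any cache-write surcharge."""
    per_token_units = tally.writes * BASE_PRICE + tally.reads * CACHE_READ_PRICE
    return tokens * per_token_units / 1_000_000


# Five calls: four arrive within the window, the last one after it lapsed.
calls = [0, 60, 200, 499, 900]
result = tally_hits(calls)
print(result, prefix_cost(2_000, result))
```

Steady traffic keeps the window alive indefinitely; a gap longer than the window forces the next call to pay full price again.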

What to cache

System prompts: anything that's identical across all users and requests is a perfect cache candidate
Few-shot examples: if your prompt includes 10 examples that never change, cache them
Tool definitions: function schemas sent on every call can be expensive and are perfectly static
RAG context: if users frequently ask about the same documents, cache the document content by placing it ahead of the user's question
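
One way to combine these candidates is to put everything static first and the per-request text last. The sketch below only builds the keyword arguments for `client.messages.create()`; it makes no API call. `build_request` and its parameters are hypothetical names, and it assumes (as I read Anthropic's documentation) that the cached prefix is assembled in the order tools, system, messages, so a `cache_control` marker on the system block also covers the tool definitions before it.

```python
from typing import Any


def build_request(
    system_prompt: str,
    tools: list[dict[str, Any]],
    few_shot: list[tuple[str, str]],
    user_message: str,
    model: str = "claude-opus-4-6",
) -> dict[str, Any]:
    """Assemble keyword arguments for client.messages.create()."""
    messages: list[dict[str, Any]] = []
    for question, answer in few_shot:
        messages.append({"role": "user", "content": question})
        messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": answer}]}
        )
    if messages:
        # Second breakpoint: the end of the static few-shot examples.
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}

    # The only part that changes per request goes last, after every breakpoint.
    messages.append({"role": "user", "content": user_message})

    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # First breakpoint: the end of the system prompt.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }


request = build_request(
    system_prompt="You are a support agent.",
    tools=[],
    few_shot=[("How do I log in?", "Use the Sign in button.")],
    user_message="Where is my invoice?",
)
print(request["messages"][-1])
```

A real call would then be `client.messages.create(**request)`. Frequently requested RAG documents would sit in the same static portion, ahead of the user's question.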

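from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The static prefix: load it once and send it unchanged on every call.
system_prompt = Path("long_system_prompt.txt").read_text(encoding="utf-8")
user_message = "How do I reset my password?"  # placeholder; varies per request

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Check cache usage
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")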
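
Those two counters are more useful as a ratio tracked over time. A minimal sketch, assuming the usage object exposes `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`, and that `input_tokens` counts only the uncached remainder of the prompt; `cache_hit_fraction` is a hypothetical helper, and a `SimpleNamespace` stands in for a real `response.usage`.

```python
from types import SimpleNamespace


def cache_hit_fraction(usage) -> float:
    """Share of this request's input tokens that were served from the cache."""
    # Fields may be absent or None depending on the SDK version, so default to 0.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for response.usage on a request that hit the cache.
example = SimpleNamespace(
    input_tokens=50,
    cache_read_input_tokens=1950,
    cache_creation_input_tokens=0,
)
print(f"Cache hit fraction: {cache_hit_fraction(example):.1%}")
```

Logging this per request makes it obvious when a prompt change has silently broken the cacheable prefix: the fraction drops to zero and the write counter climbs.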

Real-world savings

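System prompt size	Daily calls	Daily cost without cache	Daily cost with cache (Anthropic)	Monthly saving
500 tokens	1,000	$1.50	$0.15	$40.50
2,000 tokens	10,000	$60	$6	$1,620
8,000 tokens	50,000	$1,200	$120	$32,400

Figures cover system prompt tokens only and assume $3.00 per 1M input tokens, a 30-day month, and that every call is a cache hit. Providers also enforce a minimum cacheable prefix length (1,024 tokens on many models), so the 500-token row is illustrative only.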

A cache miss costs more than an uncached call: the request that populates the cache pays a write premium (about 25% extra on Anthropic). Design your prompt so the cacheable prefix is truly static, or you'll write more than you read.
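
The write premium turns caching into a break-even question: how often must a call hit the cache before it pays off? A minimal sketch, assuming the multipliers quoted in this article (writes at 1.25x the base input price, reads at 0.10x) and counting only the cacheable prefix; both function names are hypothetical.

```python
WRITE_MULTIPLIER = 1.25  # cache write: about 25% above the base input price
READ_MULTIPLIER = 0.10   # cache read: 10% of the base input price


def relative_cost(hit_rate: float) -> float:
    """Expected prefix cost per call, relative to no caching (which is 1.0)."""
    return hit_rate * READ_MULTIPLIER + (1 - hit_rate) * WRITE_MULTIPLIER


def break_even_hit_rate() -> float:
    """Hit rate at which caching costs exactly the same as not caching."""
    # Solve h * READ + (1 - h) * WRITE = 1 for h.
    return (WRITE_MULTIPLIER - 1) / (WRITE_MULTIPLIER - READ_MULTIPLIER)


print(f"break-even hit rate: {break_even_hit_rate():.1%}")
for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: {relative_cost(rate):.3f}x the uncached cost")
```

Under these assumptions the break-even point sits a little above a 20% hit rate. Below it, caching costs more than sending the prompt uncached; above it, every extra hit widens the saving.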
