GenAI Systems Lab
AI Engineering 7 min read

Prompt Caching and Token Cost Optimisation at Scale

How prompt caching works, when it pays off, and how to reduce inference cost by 60-80% on repetitive system prompt workloads.

If you're calling an LLM thousands of times per day with the same system prompt, you're paying full input price for the same tokens every single time. Prompt caching lets you pay once and reuse — and the savings are dramatic.

How prompt caching works

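When you make an API call, the provider checks whether the beginning of your prompt matches a recently cached prefix. If it does, those tokens are billed at a fraction of the normal input price (Anthropic: 10%, OpenAI: 50%). The match must be exact from the first token onward, so any change early in the prompt invalidates the cache for everything after it. OpenAI applies caching automatically to sufficiently long prompts, while Anthropic's API expects you to mark the cacheable prefix with cache_control.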

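On Anthropic, tokens read from cache (reported as cache_read_input_tokens) cost $0.30/1M versus $3.00/1M for normal input tokens on a model at that price point, a 90% discount. OpenAI cached input tokens are 50% off. The cache window is typically 5 minutes and is refreshed each time the cached prefix is read. Rates differ by model and change over time, so check the current pricing pages before budgeting.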
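
To make the mechanism concrete, here is a toy in-memory model of a prefix cache with a sliding five-minute window. It is only an illustration: real providers match token prefixes server-side, and the class `ToyPrefixCache`, its `lookup` method, and the explicit `now` argument are hypothetical names invented for this sketch.

```python
import hashlib

CACHE_TTL_SECONDS = 300  # the 5-minute window described above


class ToyPrefixCache:
    """Illustrative only: maps a hash of the prompt prefix to its expiry time."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS) -> None:
        self.ttl = ttl
        self._expiry: dict[str, float] = {}

    def lookup(self, prefix: str, now: float) -> bool:
        """Return True on a cache hit; write or refresh the entry either way."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = self._expiry.get(key, float("-inf")) > now
        # A miss writes the entry; a hit extends the window from the current time.
        self._expiry[key] = now + self.ttl
        return hit


cache = ToyPrefixCache()
system_prompt = "You are a support assistant. " * 100

first = cache.lookup(system_prompt, now=0)      # miss: nothing cached yet
second = cache.lookup(system_prompt, now=200)   # hit: inside the window
third = cache.lookup(system_prompt, now=450)    # hit: window was extended at t=200
fourth = cache.lookup(system_prompt, now=800)   # miss: 350 s since the last hit
changed = cache.lookup(system_prompt + "!", now=810)  # miss: the prefix differs
```

The third call is the interesting one: it arrives 450 seconds after the first write but still hits, because the hit at 200 seconds moved the expiry forward. Steady traffic keeps a cache warm indefinitely; bursty traffic with gaps longer than the window pays for a fresh write each time.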

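What to cache

Cache whatever is long and identical across calls: the system prompt, tool definitions, few-shot examples, or a shared reference document. Keep per-request content such as the user message after the cached prefix. The example marks a system prompt for caching with Anthropic's Python SDK: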

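from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The static part of the prompt: identical on every call, so it can be cached.
system_prompt = Path("long_system_prompt.txt").read_text(encoding="utf-8")
user_message = "Summarise the refund policy."  # placeholder; varies per call

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # mark the prefix up to here for caching
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Check cache usage
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")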
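
Printing the counters for one call is a start, but what matters in production is the share of input tokens served from cache over many calls. A minimal sketch follows, assuming each usage object exposes `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` as in the response above. The helper `cache_hit_ratio` is a hypothetical name, and `SimpleNamespace` objects stand in for real `response.usage` values so the snippet runs without an API key.

```python
from types import SimpleNamespace


def cache_hit_ratio(usages) -> float:
    """Fraction of all input tokens that were served from cache."""
    read = sum(u.cache_read_input_tokens or 0 for u in usages)
    written = sum(u.cache_creation_input_tokens or 0 for u in usages)
    uncached = sum(u.input_tokens or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-ins for response.usage objects: one cache write, then two cache reads.
sample = [
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=2000, cache_read_input_tokens=0),
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=2000),
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=2000),
]
ratio = cache_hit_ratio(sample)
print(f"Cache hit ratio: {ratio:.1%}")
```

A ratio that stays near zero despite steady traffic usually means the prefix is changing between calls or is too short to be cached.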

Real-world savings

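The figures below are daily costs for the system prompt tokens only, computed at $3.00/1M for normal input and $0.30/1M for cached reads. They assume every call is a cache hit and a 30-day month.

| System prompt size | Daily calls | Daily cost without cache | Daily cost with cache (Anthropic) | Monthly saving |
|---|---|---|---|---|
| 500 tokens | 1,000 | $1.50 | $0.15 | $40.50 |
| 2,000 tokens | 10,000 | $60 | $6 | $1,620 |
| 8,000 tokens | 50,000 | $1,200 | $120 | $32,400 |

Anthropic only caches prefixes above a minimum length (1,024 tokens or more, depending on the model), so the 500-token row is illustrative: a prompt that short would need to be combined with other static content before it could be cached.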
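
The arithmetic behind an estimate like this is simple enough to script for your own workload. The sketch below is a rough steady-state model under stated assumptions: every call hits the cache, the write premium is ignored, a month has 30 days, and the default prices are the $3.00/1M and $0.30/1M rates quoted earlier. The function names are hypothetical.

```python
def daily_prompt_cost(prompt_tokens: int, calls_per_day: int, price_per_mtok: float) -> float:
    """Dollar cost per day of sending `prompt_tokens` on every call."""
    return prompt_tokens * calls_per_day * price_per_mtok / 1_000_000


def monthly_saving(
    prompt_tokens: int,
    calls_per_day: int,
    base_price: float = 3.00,    # $/1M normal input tokens (assumed rate)
    cached_price: float = 0.30,  # $/1M cached read tokens (assumed rate)
    days: int = 30,
) -> float:
    """Steady-state saving, assuming every call is a cache hit."""
    uncached = daily_prompt_cost(prompt_tokens, calls_per_day, base_price)
    cached = daily_prompt_cost(prompt_tokens, calls_per_day, cached_price)
    return (uncached - cached) * days


for tokens, calls in [(2_000, 10_000), (8_000, 50_000)]:
    saving = monthly_saving(tokens, calls)
    print(f"{tokens} tokens x {calls} calls/day: ${saving:,.2f}/month")
```

Pass your model's actual rates as `base_price` and `cached_price`; the saving scales linearly with prompt size, call volume, and the gap between the two prices.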

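The call that populates the cache costs more than a normal call: you pay a cache write premium (~25% extra for Anthropic) on those tokens. Design your prompt so the cacheable prefix is truly static. If the prefix changes between calls, or calls arrive further apart than the cache window, every request becomes a write and you end up paying more than you would without caching.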
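
How many calls does it take for the write premium to pay for itself? A small sketch, assuming a 1.25x multiplier for the write and a 0.10x multiplier for each read (the Anthropic ratios mentioned above), and assuming all calls land within one cache lifetime. Costs are expressed in units of one uncached call; the function names are hypothetical.

```python
def cached_cost_units(
    calls_in_window: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Prefix cost over `calls_in_window` calls, in units of one uncached call."""
    if calls_in_window <= 0:
        return 0.0
    # The first call writes the cache; every later call reads it.
    return write_multiplier + read_multiplier * (calls_in_window - 1)


def break_even_calls(write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> int:
    """Smallest number of calls per cache lifetime at which caching is cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as normal input")
    n = 1
    while cached_cost_units(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n


print(f"Break-even at {break_even_calls()} calls per cache lifetime")
print(f"10 calls cost {cached_cost_units(10):.2f} units with caching vs 10.00 without")
```

Under these ratios a single cache hit is enough to come out ahead, so the real risk is a prefix that is rewritten on every call rather than the premium itself.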

