Prompt Caching and Token Cost Optimisation at Scale
How prompt caching works, when it pays off, and how to reduce inference cost by 60-80% on repetitive system prompt workloads.
If you're calling an LLM thousands of times per day with the same system prompt, you're paying full input price for the same tokens every single time. Prompt caching lets you pay once and reuse — and the savings are dramatic.
How prompt caching works
When you make an API call, the provider checks whether the first N tokens of your prompt match a recently cached prefix. If they do, those tokens are served from cache at a fraction of the normal cost (Anthropic: 10%, OpenAI: 50%).
Anthropic cache_read_input_tokens are priced at $0.30/1M vs $3.00/1M for normal input tokens — 90% discount. OpenAI cached input tokens are 50% off. The cache window is typically 5 minutes, extended on each hit.
What to cache
- System prompts — anything that's the same across all users and requests is a perfect cache candidate
- Few-shot examples — if your prompt includes 10 examples that never change, cache them
- Tool definitions — function schemas sent on every call can be expensive and are perfectly static
- RAG context — if users frequently ask about the same documents, cache the document content
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": open("long_system_prompt.txt").read(),
"cache_control": {"type": "ephemeral"} # mark for caching
}
],
messages=[{"role": "user", "content": user_message}]
)
# Check cache usage
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
Real-world savings
| System prompt size | Daily calls | Without cache | With cache (Anthropic) | Monthly saving |
|---|---|---|---|---|
| 500 tokens | 1,000 | $0.75 | $0.075 | $20 |
| 2,000 tokens | 10,000 | $30 | $3 | $810 |
| 8,000 tokens | 50,000 | $600 | $60 | $16,200 |
Cache misses are more expensive than normal calls — you pay a cache write premium (~25% extra for Anthropic) on the first call that populates the cache. Design your prompt so the cacheable prefix is truly static, or you'll write more than you read.
Model caching ROI in Systems →: Calculate your potential caching savings based on your actual call volume and prompt structure.
- Prompt Caching — Anthropic Documentation
- KV Cache Compression Survey (Shi et al., 2024)
- Semantic Caching for LLMs — GPTCache
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →