GenAI Systems Lab
AI Engineering 7 min read

Semantic Caching: The LLM Cost Reduction Most Teams Discover Late

How to cache LLM responses by semantic similarity — so 'What is RAG?' and 'Explain RAG to me' hit the same cache entry. When it works, when it breaks, and how to tune the similarity threshold.

Prompt caching reduces cost on repeated prefixes. Semantic caching reduces cost on repeated intent, even when the wording changes. Instead of matching token sequences, it embeds the incoming query, compares that embedding against the embeddings of previously cached queries using cosine similarity, and returns the cached response when the best match is similar enough. 'What is RAG?' and 'Can you explain retrieval augmented generation?' hit the same cache entry.

How semantic caching works

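On a cache hit, semantic caching bypasses the LLM entirely, so it saves the output tokens and the latency as well as the input tokens. A cache hit on a 500-token query costs ~$0.0001 (embedding call) vs ~$0.01 (LLM call). In high-volume FAQ or customer support use cases, hit rates of 30–60% are common, which cuts LLM spend by nearly the same margin: slightly less, because every query pays the small embedding cost whether it hits or misses.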
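The lookup flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model (so it only matches queries that share words, not true paraphrases), the linear scan would normally be a vector index, and `SemanticCache`, `answer`, and `call_llm` are hypothetical names introduced here.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: word counts. Swap in a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (query embedding, response)

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, if similar enough."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # linear scan; use a vector index at scale
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from the cache on a hit; otherwise call the LLM and store the result."""
    cached = cache.get(query)
    if cached is not None:
        return cached  # the LLM is never called on a hit
    response = call_llm(query)
    cache.put(query, response)
    return response
```

Note that every call to `answer` pays for one embedding, hit or miss, which is why the embedding cost appears in the cost comparison above.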

The threshold calibration problem

The similarity threshold is the critical parameter. Too high (0.98): only exact paraphrases hit the cache — low hit rate, minimal savings. Too low (0.80): semantically adjacent but factually different queries hit the same cache — wrong answers delivered with confidence. The correct threshold depends on your task: FAQ systems can tolerate 0.90–0.93 because answers to similar questions are usually the same; complex reasoning tasks should not use semantic caching at all.

| Threshold | Hit rate | Risk | Best for |
| --- | --- | --- | --- |
| 0.98+ | ~5% | Minimal: only near-identical queries hit cache | Sensitive queries where slight differences matter |
| 0.92–0.97 | 20–40% | Low: paraphrases safely cached | FAQ, customer support, product documentation |
| 0.85–0.91 | 40–60% | Medium: some semantically different queries merged | High-volume, low-stakes, repetitive queries |
| < 0.85 | 60%+ | High: topic drift causes wrong cached answers | Not recommended for most tasks |
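One way to choose a threshold empirically is to sweep candidate values over a labelled sample of query pairs from your own traffic. The sketch below assumes you have already computed a similarity score for each pair and marked whether the two queries should share an answer. `Pair`, `sweep`, and `pick_threshold` are hypothetical helpers, and the sample data is made up purely to exercise the code.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Pair(NamedTuple):
    similarity: float   # cosine similarity between the two queries
    same_answer: bool   # human label: should they share a cached answer?


def sweep(pairs: Sequence[Pair], thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, false_hit_rate)."""
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        wrong = [p for p in hits if not p.same_answer]
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        false_hit_rate = len(wrong) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


def pick_threshold(pairs: Sequence[Pair], thresholds: Sequence[float],
                   max_false_hit_rate: float) -> Optional[float]:
    """Lowest candidate threshold whose false-hit rate stays within budget."""
    ok = [t for t, _, fhr in sweep(pairs, thresholds) if fhr <= max_false_hit_rate]
    return min(ok) if ok else None


# Illustrative, made-up labels; replace with pairs sampled from real traffic.
sample = [
    Pair(0.99, True), Pair(0.95, True), Pair(0.93, True), Pair(0.90, False),
    Pair(0.88, True), Pair(0.82, False), Pair(0.70, False), Pair(0.60, False),
]
for t, hit_rate, false_hit_rate in sweep(sample, [0.98, 0.92, 0.85, 0.80]):
    print(f"threshold={t:.2f} hit_rate={hit_rate:.3f} false_hit_rate={false_hit_rate:.3f}")
```

The false-hit rate is the fraction of cache hits that would have returned a wrong answer. It is the number to budget against, since a higher hit rate is only worth having while that fraction stays acceptable for the task.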

When semantic caching breaks

Never use semantic caching for personalised, user-specific, or time-sensitive queries. The cache is shared across users — a hit returns a previous user's response. This is not just a quality problem; it is a data leakage problem in regulated environments.
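A simple way to enforce this rule is a guard that decides whether a request may touch the shared cache at all. The sketch below is deliberately crude and conservative. The keyword patterns and the `has_user_context` flag are assumptions made for illustration, and a real system would more likely rely on explicit routing metadata than on regexes over the query text.

```python
import re

# Hypothetical heuristics; tune or replace for your own traffic.
TIME_SENSITIVE = re.compile(
    r"\b(today|now|latest|current|currently|yesterday|tomorrow)\b", re.IGNORECASE
)
PERSONAL = re.compile(r"\b(my|mine|our)\b", re.IGNORECASE)


def is_cacheable(query: str, has_user_context: bool = False) -> bool:
    """Return True only if the query is safe to serve from a shared cache."""
    if has_user_context:
        # The prompt includes account data, history, or other per-user state.
        return False
    if TIME_SENSITIVE.search(query) or PERSONAL.search(query):
        # Likely user-specific or time-dependent: always go to the LLM.
        return False
    return True
```

Requests that fail the guard should skip both the cache read and the cache write, so that a user-specific response is never stored where another user's query could match it. Wrongly bypassing the cache only costs one extra LLM call, which is why the guard errs on the side of returning `False`.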

Tooling

Combining with prompt caching

Semantic caching and prompt caching are complementary, not competing. Semantic caching eliminates the LLM call altogether on a hit. Prompt caching (Anthropic/OpenAI) reduces input token cost on the calls that still reach the LLM. Apply both: a semantic cache at the application layer for repeated intent, and a prefix cache at the API layer for shared system prompts. The combination can reduce total inference cost by 70–85% in high-volume FAQ deployments.
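The way the two savings compound can be checked with a small cost model. This is a back-of-the-envelope sketch. The per-call costs are the approximate figures quoted earlier in this article, while the 50% hit rate and the 50% prompt-cache saving on the remaining calls are assumed values chosen for illustration, not provider pricing.

```python
def cost_per_query(llm_cost: float, embed_cost: float,
                   hit_rate: float, prompt_cache_saving: float) -> float:
    """Expected cost per query with both caches enabled.

    Every query pays for an embedding. Only semantic-cache misses reach the
    LLM, and those calls are discounted by the prompt-cache saving, expressed
    as a fraction of the full LLM call cost.
    """
    return embed_cost + (1.0 - hit_rate) * llm_cost * (1.0 - prompt_cache_saving)


def reduction(llm_cost: float, embed_cost: float,
              hit_rate: float, prompt_cache_saving: float) -> float:
    """Fractional saving relative to sending every query to the LLM uncached."""
    combined = cost_per_query(llm_cost, embed_cost, hit_rate, prompt_cache_saving)
    return 1.0 - combined / llm_cost


# Assumed inputs: $0.01 per LLM call, $0.0001 per embedding,
# 50% semantic hit rate, 50% prompt-cache saving on the remaining calls.
semantic_only = reduction(0.01, 0.0001, hit_rate=0.5, prompt_cache_saving=0.0)
both = reduction(0.01, 0.0001, hit_rate=0.5, prompt_cache_saving=0.5)
print(f"semantic cache only: {semantic_only:.0%}")  # 49%
print(f"both caches:         {both:.0%}")           # 74%
```

Under these assumptions the semantic cache alone saves slightly less than its hit rate, because of the embedding overhead, and adding prompt caching on the misses brings the total into the range quoted above.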

