
Semantic Caching: The LLM Cost Reduction Most Teams Discover Late

How to cache LLM responses by semantic similarity — so 'What is RAG?' and 'Explain RAG to me' hit the same cache entry. When it works, when it breaks, and how to tune the similarity threshold.

Prompt caching reduces cost on repeated prefixes. Semantic caching reduces cost on repeated intent, even when the wording changes. Instead of matching token sequences, it embeds the query, compares it against a cache of prior queries using cosine similarity, and returns the cached response when the two are similar enough. 'What is RAG?' and 'Can you explain retrieval augmented generation?' hit the same cache entry.

How semantic caching works

On each query: embed the query text (using the same model you use for retrieval)
Search the cache: find the nearest prior query by cosine similarity
If similarity > threshold (e.g. 0.92): return the cached response — no LLM call
If below threshold: call the LLM, get a response, store (query embedding, response) in the cache
Cache stores: (embedding vector, original query text, response, timestamp, hit count)
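
The steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: `SemanticCache`, `CacheEntry` and `toy_embed` are hypothetical names, the linear scan stands in for a real vector index, and `toy_embed` is a keyword-count stand-in for a real embedding model (it only exists so the example runs without external services).

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    # Fields mirror the tuple described above.
    embedding: List[float]
    query: str
    response: str
    created_at: float = field(default_factory=time.time)
    hit_count: int = 0


class SemanticCache:
    def __init__(self, embed: Callable[[str], List[float]],
                 call_llm: Callable[[str], str], threshold: float = 0.92):
        self.embed = embed          # assumed: your embedding function
        self.call_llm = call_llm    # assumed: your LLM client wrapper
        self.threshold = threshold
        self.entries: List[CacheEntry] = []

    def get_or_call(self, query: str) -> Tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        vector = self.embed(query)
        best, best_score = None, -1.0
        for entry in self.entries:  # linear scan; use a vector index at scale
            score = cosine_similarity(vector, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score > self.threshold:
            best.hit_count += 1
            return best.response, True
        response = self.call_llm(query)
        self.entries.append(CacheEntry(vector, query, response))
        return response, False


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: keyword counts."""
    vocab = ["rag", "retrieval", "price", "refund"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


# Usage with a stand-in LLM function (hypothetical).
cache = SemanticCache(toy_embed, lambda q: f"LLM answer to: {q}")
first = cache.get_or_call("What is RAG?")        # miss: calls the LLM
second = cache.get_or_call("Explain RAG to me")  # hit: served from cache
```

With a real embedding model the similarity scores will be spread across a continuous range rather than landing on 0 or 1 as they do with the toy function, which is why the threshold needs tuning.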

Semantic caching bypasses the LLM entirely, not just the input tokens. A cache hit on a 500-token query costs ~$0.0001 (embedding call) vs ~$0.01 (LLM call). In high-volume FAQ or customer support use cases, hit rates of 30–60% are common. Because the embedding cost is paid on every query but is small next to an LLM call, LLM spend falls by almost the same proportion.
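
The arithmetic behind that claim is simple enough to check. The sketch below uses the per-call figures quoted above as defaults and assumes the uncached baseline pays only for the LLM call; the function names are illustrative.

```python
def expected_cost_per_query(hit_rate: float,
                            embedding_cost: float = 0.0001,
                            llm_cost: float = 0.01) -> float:
    """Every query pays for an embedding; only cache misses pay for the LLM."""
    return embedding_cost + (1.0 - hit_rate) * llm_cost


def savings_fraction(hit_rate: float,
                     embedding_cost: float = 0.0001,
                     llm_cost: float = 0.01) -> float:
    """Fraction of the uncached per-query cost that the cache removes."""
    cached = expected_cost_per_query(hit_rate, embedding_cost, llm_cost)
    return 1.0 - cached / llm_cost


low = savings_fraction(0.30)   # roughly 0.29
high = savings_fraction(0.60)  # roughly 0.59
```

The one-point gap between hit rate and savings is the embedding overhead. It grows if your embedding calls are expensive relative to the LLM calls they replace.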

The threshold calibration problem

The similarity threshold is the critical parameter. Too high (0.98): only near-identical phrasings hit the cache, so the hit rate is low and savings are minimal. Too low (0.80): semantically adjacent but factually different queries hit the same entry, and wrong answers are delivered with confidence. The correct threshold depends on your task: FAQ systems can tolerate 0.90–0.93 because answers to similar questions are usually the same; complex reasoning tasks should not use semantic caching at all.

Threshold	Hit rate	Risk	Best for
0.98+	~5%	Minimal — only near-identical queries hit cache	Sensitive queries where slight differences matter
0.92–0.97	20–40%	Low — paraphrases safely cached	FAQ, customer support, product documentation
0.85–0.91	40–60%	Medium — some semantically different queries merged	High-volume, low-stakes, repetitive queries
< 0.85	60%+	High — topic drift causes wrong cached answers	Not recommended for most tasks
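
One way to pick a threshold for your own traffic, rather than relying on the table, is to label a sample of query pairs and sweep candidate thresholds. The sketch below assumes you have already computed a similarity score for each pair and marked whether the two queries should receive the same answer; the sample data is made up for illustration.

```python
from typing import List, Tuple


def sweep_thresholds(labelled_pairs: List[Tuple[float, bool]],
                     thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """For each threshold return (threshold, hit_rate, false_hit_rate).

    labelled_pairs: (similarity, same_answer) for sampled query pairs.
    hit_rate: share of pairs that would be served from cache.
    false_hit_rate: share of those hits where the cached answer is wrong.
    """
    total = len(labelled_pairs)
    results = []
    for t in thresholds:
        hits = [same for sim, same in labelled_pairs if sim > t]
        wrong = sum(1 for same in hits if not same)
        hit_rate = len(hits) / total if total else 0.0
        false_hit_rate = wrong / len(hits) if hits else 0.0
        results.append((t, hit_rate, false_hit_rate))
    return results


# Hypothetical labelled sample: (cosine similarity, same answer expected?)
sample = [(0.99, True), (0.95, True), (0.93, True),
          (0.90, False), (0.88, True), (0.82, False)]

for threshold, hit_rate, false_hit_rate in sweep_thresholds(
        sample, [0.98, 0.92, 0.85, 0.80]):
    print(f"{threshold:.2f}  hit rate {hit_rate:.0%}  wrong hits {false_hit_rate:.0%}")
```

Choose the lowest threshold whose wrong-hit rate you can accept for the task, and re-run the sweep whenever you change the embedding model, since similarity scores are not comparable across models.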

When semantic caching breaks

Time-sensitive queries: 'What is the current price of X?' cached 2 hours ago is wrong now — add TTL (time-to-live) per query type
Personalised queries: 'What are my recent transactions?' is semantically similar across users but factually different — never cache user-specific queries
Paraphrase attacks: adversarial users can probe the cache boundary and extract prior responses to similar queries — a security issue in multi-tenant systems
High-variance tasks: creative writing and code generation, where the same prompt is expected to produce varied outputs and a cache hit returns identical text every time
Freshness requirements: news summarisation, price queries, availability checks — where staleness is expensive

Never use semantic caching for personalised or user-specific queries, and cache time-sensitive queries only under a short TTL, if at all. The cache is shared across users, so a hit returns a previous user's response. This is not just a quality problem; it is a data leakage problem in regulated environments.
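
These rules are easiest to enforce as an explicit policy checked before the cache is consulted. The sketch below assumes queries have already been classified into a type by some upstream step. The type names, TTL values and function names are all hypothetical, and unknown types default to not cacheable.

```python
import time
from typing import Optional

# Hypothetical per-type TTLs in seconds. None means "never cache":
# the query always goes to the LLM and its response is not stored.
TTL_BY_QUERY_TYPE = {
    "faq": 7 * 24 * 3600,   # stable answers, long TTL
    "pricing": 15 * 60,     # time-sensitive: short TTL, or None to bypass
    "personal": None,       # user-specific: never served from a shared cache
}


def cache_ttl(query_type: str) -> Optional[int]:
    """TTL for a query type, or None if it must bypass the cache."""
    return TTL_BY_QUERY_TYPE.get(query_type)


def is_fresh(created_at: float, query_type: str,
             now: Optional[float] = None) -> bool:
    """True if an entry created at `created_at` may still be served."""
    ttl = cache_ttl(query_type)
    if ttl is None:
        return False
    current = time.time() if now is None else now
    return (current - created_at) <= ttl
```

Failing closed matters here: a query type nobody thought about should miss the cache and cost an LLM call rather than risk returning another user's answer.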

Tooling

GPTCache (open-source): drop-in semantic cache for OpenAI-compatible APIs, supports Redis and Faiss backends
Redis + pgvector: build your own with an embedding step, cosine similarity query, and TTL on entries
LangChain RedisSemanticCache: built-in semantic cache layer for LangChain LLM calls, backed by Redis
Momento Semantic Cache: managed service with embedding + cache in one API call

Combining with prompt caching

Semantic caching and prompt caching are complementary, not competing. Semantic caching removes the LLM call entirely on a hit. Prompt caching (Anthropic/OpenAI) reduces input token cost on the calls that do reach the LLM. Apply both: semantic cache at the application layer for repeated intent, prefix cache at the API layer for shared system prompts. The combination can reduce total inference cost by 70–85% in high-volume FAQ deployments.
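
To see how the two layers compound, here is a back-of-the-envelope model. It is a simplification under stated assumptions: it ignores embedding cost and any cache-write surcharge, and it treats the cacheable prefix as a fixed share of each call's cost. The parameter values in the example are illustrative, not measurements.

```python
def remaining_cost_fraction(semantic_hit_rate: float,
                            prefix_share: float,
                            cached_price_ratio: float) -> float:
    """Fraction of baseline inference cost left after both caches.

    semantic_hit_rate: share of queries answered from the semantic cache.
    prefix_share: share of a call's cost spent on the cacheable prefix.
    cached_price_ratio: price of cached prefix tokens relative to normal
        input tokens (provider-specific; treat as an assumption).
    """
    # Cost of one call that reaches the LLM, relative to an uncached call.
    per_call = 1.0 - prefix_share * (1.0 - cached_price_ratio)
    # Only semantic-cache misses reach the LLM at all.
    return (1.0 - semantic_hit_rate) * per_call


# Illustrative inputs: 50% semantic hits, prefix is 60% of call cost,
# cached prefix tokens billed at 10% of the normal price.
remaining = remaining_cost_fraction(0.5, 0.6, 0.1)
reduction = 1.0 - remaining  # about 0.77 under these assumptions
```

The two savings multiply rather than add, so a modest improvement in either layer shows up in the total.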

