Context Bleed: When One User's Data Poisons Another's Response
The subtle session isolation failure that affects multi-tenant LLM apps. How system prompt fragments, conversation history, and cached context leak across users — and what it costs you.
In February 2024, a large enterprise SaaS company discovered that some users were seeing fragments of other users' system prompts in their responses. The product used a multi-tenant architecture where each customer had a customized system prompt. A caching bug caused one customer's prompt to leak into another's conversation context. Nobody noticed for three weeks because the leaked fragments were syntactically valid — they just happened to mention a competitor's product name.
Context bleed is the class of failures where information that should be scoped to one session, user, or tenant appears in another's context. Unlike a typical data breach, no attacker gains unauthorized access; the system itself mixes data that should have stayed separate. The consequences can be just as serious.
Context bleed failures are especially dangerous because they often pass automated testing. Your test suite uses isolated test users, so it never catches cross-user contamination. The only signals are in production logs — which you may not be watching.
The three mechanisms of context bleed
Context bleed happens through three distinct mechanisms, each requiring a different fix.
1. Prompt caching without proper invalidation
Many production LLM systems cache compiled system prompts to avoid reprocessing them on every request. If that cache uses a key that doesn't include the tenant/user identifier — or if the key is computed incorrectly — two different users will get the same cached prompt.
The insidious part: caching libraries often use content hashes as keys. Suppose two users share the same base system prompt but have different variable substitutions. If the cache key is computed from the base template while the cached value is the substituted prompt, the lookup for the second user hits the first user's entry, and the second user gets the first user's fully substituted prompt.
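A minimal sketch of this failure and one way to avoid it, assuming a hypothetical in-process dictionary cache and a template rendered with `str.format`. The names `render_prompt_buggy`, `render_prompt_safe`, and `_cache` are illustrative and do not come from any particular caching library.

```python
import hashlib

TEMPLATE = "You are the assistant for {company}. Never mention {competitor}."

_cache: dict[str, str] = {}  # hypothetical in-process prompt cache


def _key(*parts: str) -> str:
    """Hash the parts with a separator so ("ab", "c") and ("a", "bc") differ."""
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def render_prompt_buggy(template: str, variables: dict[str, str]) -> str:
    # BUG: the key covers only the shared template, but the cached value is
    # the fully substituted prompt of whoever populated the cache first.
    key = _key(template)
    if key not in _cache:
        _cache[key] = template.format(**variables)
    return _cache[key]


def render_prompt_safe(tenant_id: str, template: str, variables: dict[str, str]) -> str:
    # The key covers the tenant, the template, and every substituted value,
    # so two tenants can never share a cache entry.
    var_parts = [f"{name}={value}" for name, value in sorted(variables.items())]
    key = _key(tenant_id, template, *var_parts)
    if key not in _cache:
        _cache[key] = template.format(**variables)
    return _cache[key]


acme = {"company": "Acme", "competitor": "Globex"}
globex = {"company": "Globex", "competitor": "Acme"}

first = render_prompt_buggy(TEMPLATE, acme)
second = render_prompt_buggy(TEMPLATE, globex)
print(second == first)  # True: the second tenant received the first tenant's prompt
```

Including the tenant identifier in the key is deliberately redundant with the variable values: even if two tenants have identical variables today, they still get separate entries.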
2. Conversation history in shared message buffers
Streaming LLM APIs write tokens to a buffer. If that buffer isn't fully flushed and reset between requests, or if a worker process picks up a new request before the previous one has finished cleaning up, fragments of a previous conversation can appear in a new one. This is more common with serverless workers that are reused across requests than with deployments that start a fresh container for each request.
3. RAG retrieval across tenant boundaries
If your vector store doesn't enforce tenant isolation at query time, a metadata filter bug can make a query retrieve documents from a different tenant's namespace. The model then happily incorporates that information into its response. This is particularly common when metadata filters are optional — a bug where the filter is accidentally omitted will silently expose cross-tenant data.
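One way to make an omitted filter impossible is to require the tenant identifier in the function signature. The sketch below uses a toy in-memory store with dot-product ranking as a stand-in for a real vector database; `TenantScopedStore` and `Doc` are hypothetical names, and a production client would pass the equivalent filter to whatever store it wraps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str
    embedding: tuple[float, ...]


class TenantScopedStore:
    """Toy in-memory store; tenant_id is required on every read and write."""

    def __init__(self) -> None:
        self._docs: list[Doc] = []

    def add(self, *, tenant_id: str, text: str, embedding: tuple[float, ...]) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required on every write")
        self._docs.append(Doc(tenant_id, text, embedding))

    def query(self, *, tenant_id: str, embedding: tuple[float, ...], k: int = 3) -> list[Doc]:
        if not tenant_id:
            raise ValueError("tenant_id is required on every query")
        # Filter first, then rank: other tenants' documents are never scored.
        candidates = [d for d in self._docs if d.tenant_id == tenant_id]

        def score(doc: Doc) -> float:
            return sum(a * b for a, b in zip(doc.embedding, embedding))

        return sorted(candidates, key=score, reverse=True)[:k]


store = TenantScopedStore()
store.add(tenant_id="tenant-a", text="A pricing sheet", embedding=(1.0, 0.0))
store.add(tenant_id="tenant-b", text="B roadmap", embedding=(1.0, 0.0))

results = store.query(tenant_id="tenant-a", embedding=(1.0, 0.0))
print([d.text for d in results])  # ['A pricing sheet']
```

Because `tenant_id` is a keyword-only parameter with no default, calling `store.query(embedding=...)` without it raises a `TypeError` instead of silently searching every tenant.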
Detection is hard — here's what works
Standard functional testing won't catch context bleed. What does:
- Canary users: maintain a set of synthetic 'sentinel' users with unique, easily detectable strings in their system prompts (e.g. a UUID). Log and alert on any response that contains a sentinel's string but was returned to a different user.
- Cross-tenant retrieval audit: periodically run queries as tenant A and check if any retrieved documents have tenant B metadata. Alert if count > 0.
- Response PII scanning: run all LLM responses through a PII detector before returning them to the user. If someone else's email address appears in your response, that's a signal.
- Buffer fence tests: in integration tests, run two back-to-back requests with different system prompts and assert that response 2 contains no content unique to prompt 1.
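The canary check from the list above can be sketched as a small response scanner. The registry `SENTINELS` and the function `find_leaked_sentinels` are hypothetical names; how markers are planted in system prompts and where alerts are sent are left out and would depend on your stack.

```python
import uuid

# Hypothetical sentinel registry: user_id -> unique marker planted in that
# user's system prompt.
SENTINELS: dict[str, str] = {
    "sentinel-1": f"canary-{uuid.uuid4()}",
    "sentinel-2": f"canary-{uuid.uuid4()}",
}


def find_leaked_sentinels(recipient_id: str, response_text: str) -> list[str]:
    """Return the sentinel users whose marker appears in a response sent to someone else."""
    return [
        owner
        for owner, marker in SENTINELS.items()
        if owner != recipient_id and marker in response_text
    ]


clean = find_leaked_sentinels("user-42", "Here is your summary.")
leak = find_leaked_sentinels("user-42", f"Policy {SENTINELS['sentinel-1']} applies.")
print(clean)  # []
print(leak)   # ['sentinel-1']
```

A sentinel seeing its own marker is not flagged, but one sentinel seeing another's marker is, so the sentinels also act as probes for each other.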
The fix hierarchy
Fix context bleed in this order, from most to least impactful:
- Tenant isolation at the vector store layer: every document write should include a tenant_id field; every query should include a mandatory metadata filter on that field. Make the filter non-optional in your client code.
- Session-scoped prompt compilation: never cache a compiled system prompt for longer than a single session. The savings from caching are usually less than 5% of total latency; the risk isn't worth it.
- Worker process isolation: use fresh processes or, at minimum, explicit buffer flushes between requests in any serverless or worker-pool setup.
- Audit log with lineage: every LLM call should log which documents were retrieved, which system prompt was used, and which user ID triggered it. This makes post-incident forensics tractable.
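A lineage record for the audit log might look like the sketch below. The field names and the function `build_audit_record` are assumptions chosen for illustration; the choice to store a hash of the system prompt rather than its text is also an assumption, made so the audit log does not itself become a second copy of tenant data.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(
    user_id: str,
    tenant_id: str,
    system_prompt: str,
    retrieved_doc_ids: list[str],
) -> str:
    """Serialize one LLM call's lineage as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tenant_id": tenant_id,
        # A hash identifies which prompt was used without copying it into logs.
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "retrieved_doc_ids": retrieved_doc_ids,
    }
    return json.dumps(record, sort_keys=True)


line = build_audit_record(
    user_id="user-42",
    tenant_id="tenant-a",
    system_prompt="You are the assistant for Acme.",
    retrieved_doc_ids=["doc-1", "doc-7"],
)
print(line)
```

With records like this, a post-incident query such as "which calls for tenant A retrieved a document owned by tenant B" becomes a join between the log and the document store rather than guesswork.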
The rule of thumb: treat every piece of context you inject as potentially visible to any user of your system. Design your isolation architecture against that threat model.