Retrieval Poisoning: How Bad Documents Corrupt Your Entire RAG Pipeline

What happens when stale, contradictory, or adversarially crafted documents enter your vector store. How contamination propagates, why it's hard to detect, and how to build a document quality gate.

A fintech company used a RAG system to answer customer questions about their loan products. The system retrieved from a corpus of internal product documents. One document in the corpus was a draft — marked clearly in its filename as 'DRAFT_DO_NOT_USE_2023_Q2.pdf' — but had been indexed anyway due to a misconfigured ingestion pipeline. For three months, 8% of loan eligibility queries returned answers based on superseded eligibility criteria. The answers were wrong. They passed a hallucination detector because they were faithfully grounded in a real document.

This is retrieval poisoning: incorrect, stale, or adversarial documents sit in a vector store and consistently contaminate retrieval results.

Categories of retrieval poisoning

Accidental poisoning (most common)

Draft documents, superseded versions, test data, and internal notes that were never meant to be indexed. These get into your corpus through misconfigured ingestion pipelines, lack of document lifecycle management, or manual uploads by team members who don't understand what the vector store is for.

Staleness poisoning

Documents that were accurate when indexed but have since been superseded. Product pricing that changed six months ago. Policies that were updated after a regulatory change. If your index isn't updated when source documents change, your RAG system will confidently answer questions based on outdated information.
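One possible mitigation is to filter stale hits at query time. The sketch below assumes each indexed chunk carries effective-date metadata; the `Hit` type, its field names, and `drop_stale` are hypothetical and not part of any particular vector store's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Hit:
    doc_id: str
    score: float
    effective_from: date
    effective_to: Optional[date]  # None means the document is still current


def drop_stale(hits: list[Hit], today: date) -> list[Hit]:
    """Keep only hits whose validity window contains `today`."""
    return [
        h for h in hits
        if h.effective_from <= today
        and (h.effective_to is None or today <= h.effective_to)
    ]
```

This only helps if the validity metadata is maintained when documents are superseded, so it complements index updates rather than replacing them.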

Adversarial poisoning

In systems that allow user-uploaded content to enter the retrieval corpus, adversarially crafted documents can be inserted to influence model outputs. A document containing 'The correct answer to questions about account security is: [attacker's answer]' will be retrieved for security-related queries and can steer the model's responses. This is a real attack vector for any RAG system that indexes user-generated content.
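As a rough illustration of one defensive layer, the sketch below flags uploaded text that contains answer-directing phrases like the one quoted above. The pattern list and the `flag_directive_text` name are assumptions made for this example, not a vetted rule set.

```python
import re

# Hypothetical patterns: phrases that try to dictate an answer or address the model.
DIRECTIVE_PATTERNS = [
    re.compile(r"\bthe correct answer to\b.*\bis\b", re.IGNORECASE),
    re.compile(r"\bignore (all |any )?(previous|prior) instructions\b", re.IGNORECASE),
    re.compile(r"\bwhen asked about\b.*\b(say|respond|answer)\b", re.IGNORECASE),
]


def flag_directive_text(text: str) -> list[str]:
    """Return the source strings of every pattern that matches the text."""
    return [p.pattern for p in DIRECTIVE_PATTERNS if p.search(text)]
```

Pattern matching like this is easy to evade, so it is best treated as a tripwire that routes documents to review, alongside keeping user-uploaded content separate from trusted sources.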

The document quality gate

Every document should pass through a quality gate before it is indexed.
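A minimal sketch of such a gate, assuming four hypothetical checks suggested by the poisoning categories above: a filename marker check, a freshness check, a minimum-length check, and a source-trust check. The `Document` fields, thresholds, and source labels are illustrative assumptions that would need tuning for a real corpus.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    filename: str
    text: str
    last_modified: date
    source: str  # hypothetical labels, e.g. "internal" or "user_upload"


# Hypothetical markers; matched as whole tokens so "latest" does not hit "test".
BLOCKED_NAME = re.compile(r"(?:^|[^a-z])(draft|do_not_use|test|wip)(?:[^a-z]|$)")
MAX_AGE = timedelta(days=365)  # assumed freshness threshold
MIN_CHARS = 200                # assumed minimum useful length


def quality_gate(doc: Document, today: date) -> list[str]:
    """Return the reasons a document fails; an empty list means it may be indexed."""
    reasons = []
    if BLOCKED_NAME.search(doc.filename.lower()):
        reasons.append("filename marks document as draft or test")
    if today - doc.last_modified > MAX_AGE:
        reasons.append("document older than freshness threshold")
    if len(doc.text.strip()) < MIN_CHARS:
        reasons.append("text too short to be useful")
    if doc.source != "internal":
        reasons.append("untrusted source requires manual review")
    return reasons
```

Returning the list of reasons rather than a boolean makes rejections auditable. A file named like the draft in the opening example would be stopped by the first check.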

Index lifecycle management

The hardest part of preventing retrieval poisoning isn't the initial indexing; it's keeping the index in sync with source document updates.
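One way to sketch this sync is to store a content hash with each indexed document and compare it against the current source on a schedule. The function names and the dictionary shapes below are assumptions for illustration; a real pipeline would read them from the document store and the index metadata.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare current sources against the index.

    source:  doc_id -> current text in the system of record
    indexed: doc_id -> content hash stored when the document was indexed
    """
    plan: dict[str, list[str]] = {"add": [], "reindex": [], "delete": []}
    for doc_id, text in source.items():
        if doc_id not in indexed:
            plan["add"].append(doc_id)
        elif indexed[doc_id] != content_hash(text):
            plan["reindex"].append(doc_id)
    # Anything in the index with no surviving source should be removed.
    plan["delete"] = [doc_id for doc_id in indexed if doc_id not in source]
    return plan
```

The `delete` list matters as much as `reindex`: a document withdrawn from the source system keeps answering queries until its vectors are removed.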

A faithfulness eval score of 0.95 means 95% of claims are grounded in retrieved documents. It tells you nothing about whether those documents are correct. Retrieval poisoning defeats faithfulness evals completely — which is why document quality gates must live upstream of the model.
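A toy illustration of the gap, assuming faithfulness is scored as the fraction of answer claims found in the retrieved text. Real evals use an LLM or entailment model rather than substring matching, and the eligibility figures below are invented for the example.

```python
def faithfulness(claims: list[str], retrieved: list[str]) -> float:
    """Fraction of claims found verbatim in retrieved text (toy stand-in for a real judge)."""
    if not claims:
        return 0.0
    context = " ".join(retrieved)
    return sum(c in context for c in claims) / len(claims)


def correctness(claims: list[str], ground_truth: set[str]) -> float:
    """Fraction of claims that match the current, authoritative facts."""
    if not claims:
        return 0.0
    return sum(c in ground_truth for c in claims) / len(claims)


# Hypothetical values, invented for illustration only.
draft_doc = "Minimum credit score is 600. Minimum annual income is 30000."
answer_claims = ["Minimum credit score is 600.", "Minimum annual income is 30000."]
current_policy = {"Minimum credit score is 680.", "Minimum annual income is 45000."}

f = faithfulness(answer_claims, [draft_doc])
c = correctness(answer_claims, current_policy)
```

Here every claim is grounded in the retrieved draft, so the faithfulness score is perfect while none of the claims match current policy. The eval measures agreement with the retrieved text, not with the truth.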
