Case Study: How Stale Documents Made a Compliance Chatbot Confidently Wrong

Two policy versions in the corpus, top_k=1, no freshness filter. The chatbot answered with 3-year-old data. How to reproduce and fix this exact failure.

This is a real story. A compliance chatbot at a mid-sized financial firm ran retrieval-augmented generation (RAG) over the firm's internal policy documents. For nine months, it worked perfectly. Then, quietly, the compliance team updated their trading restriction policy. The document was updated in the CMS. The RAG index was not.

For three weeks, the chatbot continued answering questions about trading restrictions based on the old policy. Nobody noticed — until someone made a trade based on the chatbot's guidance that violated the new rules. The investigation cost more than the entire AI project budget for the year.

Stale documents are not an edge case. They are the inevitable result of any knowledge base that isn't actively maintained — and almost none of them are.

Why stale documents are insidious

Unlike a hallucination, a stale-document failure produces a confident, well-sourced answer. The model isn't making things up; it is accurately describing what the document says. The document itself is out of date. This makes the failure much harder to catch in evals, because your golden dataset was built when the document was still correct.

A RAG system without document freshness monitoring is a time bomb. The longer it runs without maintenance, the more likely it is that at least one retrieved document contains outdated information, and the more likely that outdated information ends up in a user-facing answer.

Prevention: the document freshness architecture

1. Timestamp every chunk

Every chunk in your vector store should carry two metadata fields: `last_updated` (when the source document was last modified) and `indexed_at` (when this chunk was embedded). These are different, and freshness decisions should use `last_updated`: you want to know when the *source* was last changed, not when you processed it.
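
As a minimal sketch of what that metadata could look like, the snippet below builds both fields for a chunk and measures the gap between them. The `ChunkMetadata` class and the helper names are hypothetical, not part of any particular vector store's API, and timestamps are assumed to be timezone-aware ISO 8601 strings.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChunkMetadata:
    doc_id: str
    last_updated: str  # when the source document was last modified (ISO 8601)
    indexed_at: str    # when this chunk was embedded (ISO 8601)


def build_chunk_metadata(doc_id: str, source_modified: datetime,
                         now: Optional[datetime] = None) -> dict:
    """Build the per-chunk metadata dict; `now` is injectable for testing."""
    indexed = now or datetime.now(timezone.utc)
    return asdict(ChunkMetadata(
        doc_id=doc_id,
        last_updated=source_modified.isoformat(),
        indexed_at=indexed.isoformat(),
    ))


def index_lag_days(metadata: dict) -> float:
    """Days between the source edit and the moment the chunk was embedded."""
    modified = datetime.fromisoformat(metadata["last_updated"])
    indexed = datetime.fromisoformat(metadata["indexed_at"])
    return (indexed - modified).total_seconds() / 86400
```

Keeping the two timestamps separate means a recent `indexed_at` can never mask an old source: a document embedded this morning but last edited years ago still reports its real age.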

2. Change detection at ingestion

Hash the content of each source document. On each ingestion run, compare the current hash to the stored hash, and re-embed only the documents that have changed. This keeps your index up to date without a full re-index, and creates an audit trail of what changed and when.

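import hashlib
from datetime import datetime, timezone

def get_document_hash(content: str) -> str:
    """Return a stable SHA-256 fingerprint of the document text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def sync_document(doc_id: str, current_content: str, vector_store, hash_store,
                  source_modified_at: str | None = None) -> str:
    """Re-index a document only when its content hash has changed.

    chunk_document and embed_chunks are assumed to be defined elsewhere;
    vector_store and hash_store are whatever clients your stack provides.
    """
    current_hash = get_document_hash(current_content)
    stored_hash = hash_store.get(doc_id)

    if stored_hash == current_hash:
        return "unchanged"

    # Document changed: drop the stale chunks, then re-embed and re-insert.
    now = datetime.now(timezone.utc).isoformat()
    vector_store.delete_by_metadata({"doc_id": doc_id})
    chunks = chunk_document(current_content)
    embeddings = embed_chunks(chunks)
    vector_store.upsert(chunks, embeddings, metadata={
        "doc_id": doc_id,
        # Prefer the source's own modification time (e.g. from the CMS);
        # fall back to the sync time when the caller does not supply it.
        "last_updated": source_modified_at or now,
        "indexed_at": now,
        "content_hash": current_hash,
    })
    hash_store.set(doc_id, current_hash)
    return "re-indexed"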

3. Freshness scoring in retrieval

Weight retrieval scores by document freshness. A highly relevant chunk from a 2-year-old document should score lower than a moderately relevant chunk from last week, especially for fast-changing domains like compliance, pricing, and product documentation.
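
One way to sketch this is an exponential decay on document age, blended with the similarity score. Everything here is an assumption for illustration: the half-life of 180 days, the `alpha` blend factor, and the shape of each hit (a dict with `score` and `metadata["last_updated"]` as a timezone-aware ISO 8601 string). The right values depend on how quickly your domain changes.

```python
from datetime import datetime, timezone
from typing import Optional


def freshness_weight(last_updated: str, now: datetime,
                     half_life_days: float = 180.0) -> float:
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    age_seconds = (now - datetime.fromisoformat(last_updated)).total_seconds()
    age_days = max(age_seconds / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def rerank_by_freshness(hits, now: Optional[datetime] = None,
                        half_life_days: float = 180.0, alpha: float = 0.5):
    """Blend similarity with freshness and return hits sorted best-first.

    alpha=0 ignores freshness entirely; alpha=1 scales the similarity
    score fully by the freshness weight.
    """
    now = now or datetime.now(timezone.utc)
    rescored = []
    for hit in hits:
        weight = freshness_weight(hit["metadata"]["last_updated"], now, half_life_days)
        final = hit["score"] * ((1 - alpha) + alpha * weight)
        rescored.append({**hit, "final_score": final})
    return sorted(rescored, key=lambda h: h["final_score"], reverse=True)
```

With these defaults, a hit with similarity 0.90 from a two-year-old document ranks below a hit with similarity 0.70 from a document updated last week. For slow-changing content, a longer half-life or a smaller `alpha` keeps relevance dominant.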

4. Cite the source date in responses

Instruct the model to include the source document's last-updated date in its response: 'According to the trading policy last updated March 2025...' This makes staleness visible to users and creates a natural feedback loop when they notice the date is old.
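
A rough sketch of how this could be wired into prompt construction is shown below. The chunk shape (`text` plus `metadata` with `doc_id` and `last_updated`) and the prompt wording are illustrative assumptions; the point is only that the date has to be placed in the context for the model to be able to cite it.

```python
from datetime import datetime


def format_context(chunks):
    """Render retrieved chunks with their source date in front of each one."""
    blocks = []
    for chunk in chunks:
        meta = chunk["metadata"]
        updated = datetime.fromisoformat(meta["last_updated"]).strftime("%B %Y")
        blocks.append(f"[{meta['doc_id']}, last updated {updated}]\n{chunk['text']}")
    return "\n\n".join(blocks)


def build_prompt(question, chunks):
    """Assemble a prompt that asks the model to cite each source's date."""
    return (
        "Answer using only the sources below. "
        "When you rely on a source, state its last-updated date, for example: "
        "'According to the trading policy last updated March 2025...'.\n\n"
        f"Sources:\n{format_context(chunks)}\n\n"
        f"Question: {question}"
    )
```

Putting the date next to each chunk, rather than only in the instructions, lets the model attribute the right date to the right source when several documents are retrieved.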

Detection: freshness monitoring

Build a document freshness monitor: implement change detection and staleness alerting in the RAG lab.
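
As a starting point, a staleness check can be sketched as a periodic job that compares each source's modification time against the time it was last indexed. The record shape (`doc_id`, `source_modified_at`, `indexed_at`), the reason labels, and the one-year review threshold are all assumptions for illustration; in practice the records would come from your CMS and your index metadata.

```python
from datetime import datetime, timezone
from typing import Optional


def find_stale_documents(records, now: Optional[datetime] = None,
                         max_age_days: int = 365):
    """Return (doc_id, reason) pairs for documents that need attention.

    "index_behind_source": the source changed after it was last indexed,
    so the index is serving an old version.
    "needs_review": the index is current, but the source itself has not
    been touched for longer than max_age_days.
    """
    now = now or datetime.now(timezone.utc)
    alerts = []
    for rec in records:
        modified = datetime.fromisoformat(rec["source_modified_at"])
        indexed = datetime.fromisoformat(rec["indexed_at"])
        if modified > indexed:
            alerts.append((rec["doc_id"], "index_behind_source"))
        elif (now - modified).days > max_age_days:
            alerts.append((rec["doc_id"], "needs_review"))
    return alerts
```

The first condition is the one that would have caught the failure in this case study: the policy changed in the CMS, and the index kept serving the earlier version.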
