Why Your RAG System Lies

Faithfulness failures, hallucination in retrieval-augmented contexts, and the five production mitigations that actually work. Why high retrieval recall doesn't prevent confident wrong answers.

RAG is supposed to solve hallucination. You ground the model in retrieved facts — it should only say what the documents say. In practice, RAG systems confidently produce wrong answers even when the correct answer is sitting in the retrieved context. This post is about why that happens and what you can actually do about it.

RAG reduces hallucination from parametric memory but introduces a new failure mode: faithfulness failures. The model generates claims that aren't supported by, or that directly contradict, the retrieved context. High retrieval recall does not prevent this.

The five ways RAG lies

1. Context ignored under pressure

When the retrieved context conflicts with the model's parametric knowledge (what it learned during pretraining), the model sometimes ignores the context and generates from memory. This is worse for high-confidence parametric facts: the model 'knows' something strongly and the retrieved context doesn't override it.

Example: your HR policy doc says the notice period is 30 days. The model was trained on data where standard notice is 2 weeks. Query: 'How much notice do I need to give?' Answer: 'Two weeks.' The correct document was retrieved. The model didn't use it.

2. Aggregation hallucination

The retriever returns 5 chunks from 5 different documents. The answer requires synthesising across all five. Instead of combining them faithfully, the model takes signals from each and produces a plausible-sounding synthesis that doesn't accurately reflect any of them. The individual chunks are all true. The synthesised answer is fabricated.

This is particularly common for questions like 'What are the main themes across all feedback?' or 'Summarise what our policy says about X.' The model aggregates rather than cites.

3. Over-extrapolation

The retrieved chunk contains a true statement. The model extends it beyond what the document says. Example: the document says 'Product A supports API version 2.' Query: 'Does Product A support API version 3?' Answer: 'Yes, Product A supports API versions 2 and 3.' The extension is plausible but fabricated.

4. Stale context with confident generation

The retrieved document is outdated. The model generates confidently from the stale content without flagging uncertainty. The model doesn't know your document is three years old unless you tell it. A pricing document from 2021 will produce confident wrong prices in 2024.
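
A minimal sketch of one way to tell the model how old a document is, assuming each chunk is stored with a `last_updated` date. The `Chunk` type, the field name, and the prices are hypothetical and only there to make the example run. It drops chunks older than a cutoff and prints the date next to each surviving chunk so the model can see it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Chunk:
    text: str
    last_updated: date  # hypothetical metadata field stored with each chunk


def filter_fresh(chunks, today, max_age_days):
    """Keep only chunks whose source document was updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [c for c in chunks if c.last_updated >= cutoff]


def render_with_dates(chunks):
    """Prefix each chunk with its date so the model can see how old it is."""
    return "\n\n".join(
        f"[{i}] (last updated {c.last_updated.isoformat()})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )


# Illustrative data only: these prices are made up.
chunks = [
    Chunk("Pro plan: $49 per month.", date(2021, 3, 1)),
    Chunk("Pro plan: $79 per month.", date(2024, 1, 15)),
]
fresh = filter_fresh(chunks, today=date(2024, 6, 1), max_age_days=365)
print(render_with_dates(fresh))
```

With a 365-day cutoff only the 2024 chunk survives. The right cutoff depends on how quickly the corpus changes, so treat `max_age_days` as something to tune per query type.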

5. Lost-in-the-middle faithfulness failure

The correct evidence is in the retrieved context but positioned in the middle of a long context window. The model anchors on the beginning and end of the context (the lost-in-the-middle effect) and generates from those anchors, ignoring the correct middle passage.
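
If the model attends most to the two ends of the context, one reordering strategy follows: put the highest-scoring chunks at the start and end and let the weakest sit in the middle. The sketch below assumes chunks arrive as `(text, score)` pairs. The function name and scores are hypothetical, and the benefit is something to verify on your own data.

```python
def reorder_for_edges(chunks_with_scores):
    """Place the strongest chunks at the start and end, weakest in the middle."""
    ranked = sorted(chunks_with_scores, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for i, pair in enumerate(ranked):
        # Even ranks go to the front half, odd ranks to the back half.
        (front if i % 2 == 0 else back).append(pair)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


scored = [("a", 0.91), ("b", 0.85), ("c", 0.70), ("d", 0.62), ("e", 0.40)]
ordered = [text for text, _ in reorder_for_edges(scored)]
print(ordered)  # ['a', 'c', 'e', 'd', 'b']
```

The best chunk lands first, the second-best last, and the lowest-scoring chunk in the middle.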

Why recall doesn't fix this

A common mistake: 'our retrieval recall is 94%, so faithfulness should be high.' Recall measures whether the correct document is present in the top-k. Faithfulness measures whether the model's answer is grounded in the retrieved context. High recall makes a faithful answer possible but does not guarantee one. You can have 94% recall and 60% faithfulness. The document is there; the model just didn't use it correctly.
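
To make the distinction concrete, here is a small sketch that computes both numbers from the same eval records. The records are invented for illustration. Faithfulness is pooled over all claims, which is only one of several reasonable ways to aggregate.

```python
# Each record: (gold document in top-k?, supported claims, total claims)
# Illustrative values only.
records = [
    (True, 4, 4),
    (True, 1, 3),   # right document retrieved, answer mostly unsupported
    (True, 2, 4),
    (False, 0, 2),
    (True, 3, 3),
]

recall_at_k = sum(hit for hit, _, _ in records) / len(records)
faithfulness_score = (
    sum(supported for _, supported, _ in records)
    / sum(total for _, _, total in records)
)

print(f"recall@k     = {recall_at_k:.3f}")
print(f"faithfulness = {faithfulness_score:.3f}")
```

Here recall is 0.8 and faithfulness is 0.625. The second and third records show the gap: retrieval succeeded, yet most of the claims in those answers are unsupported.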

Eval trap: if you only measure retrieval metrics (recall, precision, MRR), you are not measuring faithfulness. A system with perfect retrieval can still lie 40% of the time. You need a separate faithfulness eval.

Five mitigations that actually work

Mitigation	What it does	When to use
RAGAS Faithfulness eval	LLM checks each claim in the answer against retrieved context. Score = supported_claims / total_claims.	Production monitoring: run on 5–10% of live traffic
Citation requirement in prompt	'Every claim must be followed by the source chunk number.' Forces the model to attribute rather than synthesise freely.	Most RAG systems: low cost, high impact
Answer abstention policy	'If the context does not contain sufficient information, respond: "I don't have enough information."' Reduces confident wrong answers.	High-stakes domains: legal, medical, financial
Freshness metadata filtering	Filter retrieved chunks by document date before generation. Exclude documents older than X days for time-sensitive queries.	Any corpus with version-controlled or time-sensitive documents
Chunk position reranking	Reorder retrieved chunks so the highest-similarity chunks appear first in the context window. Reduces lost-in-the-middle faithfulness failures.	When using large context windows with many retrieved chunks
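
The citation and abstention rows can be combined in a single prompt template. The sketch below is one possible wording, not a tested prompt. The helper names are hypothetical, and `citations_valid` only checks that cited chunk numbers exist, not that the cited chunk supports the claim.

```python
import re

ABSTAIN = "I don't have enough information."


def build_prompt(question, chunks):
    """Number the chunks and add the citation and abstention rules."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n"
        "Every claim must be followed by the source chunk number, e.g. [2].\n"
        f"If the context does not contain sufficient information, respond: {ABSTAIN}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def citations_valid(answer, n_chunks):
    """True if the answer abstains, or cites at least one chunk and only existing ones."""
    if answer.strip() == ABSTAIN:
        return True
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    return bool(cited) and all(1 <= n <= n_chunks for n in cited)


chunks = ["The notice period is 30 days.", "Notice must be given in writing."]
prompt = build_prompt("How much notice do I need to give?", chunks)
print(prompt)
print(citations_valid("You need to give 30 days notice [1].", len(chunks)))  # True
print(citations_valid("Two weeks.", len(chunks)))                            # False
```

An answer with no citation at all, like 'Two weeks.', fails the format check and can be rejected or retried before it reaches the user.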

Evaluating faithfulness in production

The cleanest faithfulness eval uses an LLM judge (GPT-4o or Claude Sonnet) to check each claim in the generated answer against the retrieved context. The RAGAS framework implements this as: decompose the answer into atomic claims, check each claim against context, report faithfulness = supported_claims / total_claims.
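
The decompose, check, and divide steps can be sketched as below. This is not the RAGAS API. The claim splitter and the judge are passed in as plain functions, and the two toy stand-ins (sentence splitting and substring matching) exist only so the example runs without an LLM. In practice both would be LLM calls.

```python
from typing import Callable, List


def faithfulness(
    answer: str,
    context: str,
    decompose: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """supported_claims / total_claims for one answer."""
    claims = decompose(answer)
    if not claims:
        return 1.0  # assumption: an answer with no claims counts as faithful
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


# Toy stand-ins so the sketch runs without an API key.
def split_sentences(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def substring_judge(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()


context = "Product A supports API version 2. Product A is available in the EU."
answer = "Product A supports API version 2. Product A supports API version 3."
print(faithfulness(answer, context, split_sentences, substring_judge))  # 0.5
```

The answer makes two claims and only the first appears in the context, so the score is 0.5.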

Run this on a random sample of production traffic (5–10%) and set an alert threshold. A faithfulness score below 0.85 in a business-critical RAG system means more than 15% of the claims users see, roughly 1 in 7, are not supported by the retrieved context. That number gets management's attention faster than any benchmark.
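
A minimal sketch of the sampling and alerting logic, assuming faithfulness scores are collected into a rolling window elsewhere. The function names and the `min_samples` guard are assumptions added so a handful of bad answers does not trigger an alert by itself.

```python
import random


def should_sample(rng, rate=0.05):
    """Decide whether to send this request to the faithfulness eval."""
    return rng.random() < rate


def check_alert(scores, threshold=0.85, min_samples=50):
    """Return True when mean faithfulness over the window is below threshold."""
    if len(scores) < min_samples:
        return False  # too few samples to alert on
    return sum(scores) / len(scores) < threshold


rng = random.Random(0)
sampled = sum(should_sample(rng, rate=0.05) for _ in range(10_000))
print(f"sampled {sampled} of 10000 requests")

print(check_alert([0.80] * 50))  # True: mean is below 0.85
print(check_alert([0.90] * 50))  # False: mean is above 0.85
```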

The leading indicator to track: faithfulness score over time, correlated with document corpus changes. When someone adds or removes documents from the knowledge base, faithfulness often drops. The eval catches it before users do.

Quick audit: take 20 questions you know the answers to, run them through your RAG system, and manually check each answer against the retrieved chunks (not against ground truth, but against what was actually retrieved). Count how many answers make claims not in the context. If it's more than 3 of the 20, you have a faithfulness problem.
