Why RAG Fails in the Middle of Documents

Transformers attend better to context at the start and end than to tokens in the middle. When RAG retrieves multiple chunks and concatenates them, the chunks in the middle get systematically less attention — even when the right answer is there.

Consider a RAG system with a measured top-5 retrieval precision of 84%. The chunk containing the correct answer is retrieved in 91% of test cases. Yet end-to-end accuracy on multi-document queries is 61%. Someone retrieves all five chunks manually for a failing query, pastes them into the prompt in order, and asks the model directly. The answer is wrong. The relevant chunk is chunk 3 of 5. The text is right there. The model misses it.

Retrieval is not the problem. The problem is what happens after retrieval, when all the chunks are concatenated and handed to the model as a long context.

How well a transformer uses a piece of information depends on where that information sits in the context. The model computes attention weights over all tokens in the context window, but those weights are not uniformly distributed across positions. The most direct evidence is the Lost in the Middle paper (Liu et al., 2023), from Stanford and collaborators, which measured accuracy on multi-document question answering and key-value retrieval while moving the relevant information through the context. Accuracy was highest when the relevant information sat at the very beginning or the very end of the context and dropped significantly when it sat in the middle, giving a U-shaped curve. The effect appeared across several model families and context lengths, which suggests it reflects how these models process long sequences rather than a quirk of one model's training.

Context: 5 retrieved chunks, ~2000 tokens total

Position    Content                 Relative attention weight
──────────────────────────────────────────────────────────────
0–400       Chunk 1 (background)    HIGH   ████████████  0.81
400–800     Chunk 2 (related)       low    ████          0.43
800–1200    Chunk 3 ← answer here   low    ███           0.38  ← trough
1200–1600   Chunk 4 (related)       med    █████         0.52
1600–2000   Chunk 5 (background)    HIGH   ███████████   0.79

Correct answer sits at the minimum attention position.
Retrieved correctly. Effectively invisible to the model.
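One way to check whether a given system shows this pattern is a position sweep: take a query whose relevant chunk is known, move that chunk through every slot in the context, and compare answer accuracy per slot. The sketch below only builds the prompt variants. The function names, the chunk labels, and the prompt template are assumptions made for illustration, and the call to the model is left out because it depends on the stack in use.

```python
from typing import List


def position_sweep(relevant: str, distractors: List[str]) -> List[List[str]]:
    """Return one chunk ordering per slot, with `relevant` placed at slot 0, 1, 2, ...

    The distractors keep their original relative order in every ordering.
    """
    orderings = []
    for slot in range(len(distractors) + 1):
        orderings.append(distractors[:slot] + [relevant] + distractors[slot:])
    return orderings


def build_prompt(chunks: List[str], question: str) -> str:
    """Concatenate chunks into a single prompt (hypothetical template)."""
    context = "\n\n".join(
        f"[Chunk {i + 1}]\n{text}" for i, text in enumerate(chunks)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    variants = position_sweep("GOLD", ["d1", "d2", "d3", "d4"])
    for slot, chunks in enumerate(variants):
        prompt = build_prompt(chunks, "What is the answer?")
        # Send `prompt` to the model here and record correctness per slot.
        print(slot, chunks)
```

If accuracy dips for the middle slots and recovers at the first and last ones, the failure is positional rather than a retrieval miss.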

The naive fix, retrieving more chunks to increase recall, makes this worse. More chunks means a longer context, and the well-attended regions near each boundary cover a smaller share of it, so the poorly attended middle grows. Retrieving 10 chunks instead of 5 does not increase the probability that the relevant chunk lands in an attended position. It increases the probability that it lands deep in the middle, where attention is lowest.
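The arithmetic behind this can be sketched under a simplifying assumption that is not from the article: the relevant chunk is equally likely to land in any slot, and only the first and last slots count as boundary positions. Both helper names are hypothetical.

```python
def boundary_share(k: int) -> float:
    """Fraction of k slots that are boundary slots (first or last)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return min(2, k) / k


def deepest_slot_distance(k: int) -> int:
    """Largest distance, in chunks, from any slot to its nearest boundary."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) // 2


if __name__ == "__main__":
    for k in (3, 5, 10):
        print(k, round(boundary_share(k), 2), deepest_slot_distance(k))
```

Under that assumption, the share of boundary slots falls from about 67% at top-3 to 40% at top-5 and 20% at top-10, while the deepest slot moves from one chunk away from a boundary to four.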

Three interventions actually help and require no model changes. First: use reranker scores not just to filter chunks but to set their position in the context — place the highest-ranked chunk first, not somewhere in the middle. Second: reduce chunk count aggressively. Top-3 retrieval with a good reranker often beats top-10 for end-to-end accuracy, because fewer chunks means the relevant one is more likely to be near a boundary. Third: for critical queries, place the most relevant chunk at both the start and end of the context. It costs tokens but the accuracy gain is consistent.
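The three interventions fit in one small context-assembly step. This is a minimal sketch, assuming each chunk arrives as a (text, reranker score) pair with higher scores meaning more relevant. The function name, its parameters, and the choice to order the remaining chunks by descending score are illustrative assumptions, not a prescribed implementation.

```python
from typing import List, Tuple


def assemble_context(
    scored_chunks: List[Tuple[str, float]],
    top_k: int = 3,
    repeat_best_at_end: bool = False,
) -> List[str]:
    """Order chunks for the prompt using reranker scores.

    - Keeps only the `top_k` highest-scoring chunks (fewer chunks, shorter middle).
    - Places the highest-scoring chunk first.
    - Optionally repeats that chunk at the end, trading tokens for a second
      boundary position on critical queries.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [text for text, _score in ranked[:top_k]]
    if repeat_best_at_end and len(kept) > 1:
        kept.append(kept[0])
    return kept


if __name__ == "__main__":
    chunks = [("c1", 0.20), ("c2", 0.90), ("c3", 0.50), ("c4", 0.70), ("c5", 0.10)]
    print(assemble_context(chunks))
    print(assemble_context(chunks, repeat_best_at_end=True))
```

The returned list is in prompt order, so joining it with a separator gives the context string. Repeating the best chunk is off by default because it raises token cost on every query.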

The system that ran at 61% accuracy reached 79% with two small changes to context assembly: retrieve top-3 instead of top-5 and place the highest-scoring chunk first in the context. No retraining. No architecture change. Only chunk count and position.

Retrieving the right chunk is necessary but not sufficient — transformers attend best at context boundaries, so a correctly retrieved chunk placed in the middle of a long concatenated context can be effectively invisible to the model even though it is present in the prompt.
