
Reranker Inversion: When Your Reranker Makes Retrieval Worse

The counterintuitive failure where adding a reranker reduces end-to-end RAG quality. Why cross-encoder rerankers underperform on domain-specific queries, and how to measure before you deploy.

We added a cross-encoder reranker to our RAG pipeline expecting a 15% improvement in retrieval quality. Instead, end-to-end RAG quality, measured by our eval suite, dropped by 8%. The reranker was making retrieval worse, and we had just spent two weeks integrating it.

Reranker inversion — where adding a reranker reduces end-to-end quality — is more common than the RAG literature acknowledges. Most published results for rerankers come from information retrieval benchmarks like MS MARCO, not from domain-specific RAG applications. The benchmark results don't transfer reliably.

Why rerankers fail in domain-specific RAG

Distribution mismatch

Cross-encoder rerankers (like Cohere Rerank, bge-reranker, or cross-encoder/ms-marco-MiniLM) are typically trained on general web search data. They learn to prefer documents that share the query's keywords and read as topically coherent general English prose. In a specialized domain (legal, medical, financial, technical documentation), the vocabulary and relevance signals are different, so the reranker can penalize technically correct documents simply because they don't match its learned notion of relevance.

Reranking a small candidate set

Rerankers add value when they narrow a large candidate set (top-50 or top-100) down to a small one (top-5). If your initial retrieval already returns only the top-5 results and you ask a reranker to reorder those five, you add latency and risk of degradation without the benefit of a large candidate pool.
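A minimal sketch of the two-stage shape this implies, assuming the first-stage ranking is already available as a list of document IDs and the reranker is any callable that scores a (query, document) pair. The names `rerank_top_k` and `overlap_score` and the toy corpus are hypothetical stand-ins for illustration, not a real reranker API.

```python
from typing import Callable, Dict, List


def rerank_top_k(
    query: str,
    candidates: List[str],
    docs: Dict[str, str],
    score_fn: Callable[[str, str], float],
    n_candidates: int = 50,
    top_k: int = 5,
) -> List[str]:
    """Rerank the first n_candidates IDs and return the best top_k."""
    pool = candidates[:n_candidates]
    # sorted() is stable, so tied scores keep their first-stage order.
    ranked = sorted(pool, key=lambda doc_id: score_fn(query, docs[doc_id]), reverse=True)
    return ranked[:top_k]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Hypothetical corpus and first-stage order, for illustration only.
docs = {
    "d1": "quarterly revenue summary",
    "d2": "indemnification clause for supplier contracts",
    "d3": "office holiday schedule",
    "d4": "supplier onboarding checklist",
}
first_stage = ["d1", "d3", "d4", "d2"]
query = "supplier indemnification clause"

wide = rerank_top_k(query, first_stage, docs, overlap_score, n_candidates=4, top_k=2)
narrow = rerank_top_k(query, first_stage, docs, overlap_score, n_candidates=2, top_k=2)
print("wide pool:  ", wide)
print("narrow pool:", narrow)
```

With the wide pool the scorer can pull the best match up from the bottom of the first-stage list. With the narrow pool it can only reshuffle what the first stage already chose.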

Rerankers don't improve recall — they improve precision

If your underlying retrieval is missing the relevant document entirely (it's not in the top-50), a reranker won't help. Rerankers can only improve precision within the candidate set. A system with low recall needs better embeddings or hybrid search, not a reranker.
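One way to check this before integrating anything is to compute the best recall any reranker could reach from your candidate pool. The sketch below assumes binary relevance labels and a first-stage ranking given as document IDs; `rerank_recall_ceiling` is a hypothetical helper name, and the data is made up.

```python
from typing import List, Set


def rerank_recall_ceiling(
    candidates: List[str], relevant: Set[str], n_candidates: int, top_k: int
) -> float:
    """Best Recall@top_k any reranker could reach from the first n_candidates."""
    if not relevant:
        return 0.0
    hits_in_pool = len(set(candidates[:n_candidates]) & relevant)
    # A reranker can surface at most top_k documents, and only ones in the pool.
    return min(hits_in_pool, top_k) / len(relevant)


# Hypothetical first-stage ranking of 100 document IDs: doc0, doc1, ..., doc99.
first_stage = [f"doc{i}" for i in range(100)]
relevant = {"doc3", "doc72"}  # doc72 sits outside the top-50

ceiling_50 = rerank_recall_ceiling(first_stage, relevant, n_candidates=50, top_k=5)
ceiling_100 = rerank_recall_ceiling(first_stage, relevant, n_candidates=100, top_k=5)
print(f"ceiling with top-50 pool:  {ceiling_50:.2f}")
print(f"ceiling with top-100 pool: {ceiling_100:.2f}")
```

If the ceiling averaged over your eval queries is already low, the fix belongs in the first stage, since no reordering can recover a document that never entered the pool.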

Measuring before you deploy

Before adding a reranker to your production pipeline, measure its effect specifically on your data:

1. Build a retrieval eval set with ground-truth relevant documents for roughly 100-200 representative queries.
2. Measure retrieval metrics (NDCG@5, MRR@10, Recall@10) with and without the reranker.
3. Measure end-to-end RAG quality (faithfulness, answer relevance, factual accuracy) with and without the reranker.
4. Compare the two runs. If retrieval metrics improve but end-to-end quality stays flat, the reranker is adding latency for no gain; if end-to-end quality drops, the reranker is hurting you.
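The retrieval half of this comparison can be sketched in a few lines. The code below assumes binary relevance judgments and rankings stored as plain dictionaries keyed by query ID; the function names and the two-query toy data are illustrative assumptions, and the end-to-end metrics (faithfulness, answer relevance, factual accuracy) need a separate evaluator that is not shown here.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: each relevant hit at rank r adds 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def evaluate(rankings: Dict[str, List[str]], judgments: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average NDCG@5, MRR@10, and Recall@10 over every judged query."""
    totals = {"ndcg@5": 0.0, "mrr@10": 0.0, "recall@10": 0.0}
    for query_id, relevant in judgments.items():
        ranked = rankings.get(query_id, [])  # a missing query scores zero
        totals["ndcg@5"] += ndcg_at_k(ranked, relevant, 5)
        totals["mrr@10"] += mrr_at_k(ranked, relevant, 10)
        totals["recall@10"] += recall_at_k(ranked, relevant, 10)
    n = max(len(judgments), 1)
    return {name: total / n for name, total in totals.items()}


# Toy eval set: two queries, hypothetical document IDs.
judgments = {"q1": {"a"}, "q2": {"x", "y"}}
baseline = {"q1": ["b", "a", "c"], "q2": ["x", "z", "y"]}
reranked = {"q1": ["a", "b", "c"], "q2": ["z", "x", "y"]}

base_scores = evaluate(baseline, judgments)
rerank_scores = evaluate(reranked, judgments)
for name in base_scores:
    delta = rerank_scores[name] - base_scores[name]
    print(f"{name}: {base_scores[name]:.3f} -> {rerank_scores[name]:.3f} ({delta:+.3f})")
```

Reporting per-metric deltas rather than a single score makes it easier to spot a reranker that helps one metric while leaving or hurting another.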

A common finding: the reranker improves NDCG (ordering quality) but hurts Recall@5 by pushing relevant documents from ranks 4-5 down to ranks 6-7. If your LLM only sees ranks 1-5, that relevant context disappears from its view.
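A toy ranking shows that NDCG and Recall@5 really can move in opposite directions. The example assumes binary relevance and two relevant documents; the document IDs and both orderings are invented for illustration and say nothing about how often this happens in practice.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Binary gains: a relevant document at rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal


relevant = {"r1", "r2"}
# Before reranking: both relevant documents sit at ranks 4 and 5.
before = ["n1", "n2", "n3", "r1", "r2", "n4", "n5"]
# After reranking: r1 is promoted to rank 1, but r2 slips to rank 6.
after = ["r1", "n1", "n2", "n3", "n4", "r2", "n5"]

ndcg_before, ndcg_after = ndcg_at_k(before, relevant, 5), ndcg_at_k(after, relevant, 5)
recall_before, recall_after = recall_at_k(before, relevant, 5), recall_at_k(after, relevant, 5)
print(f"NDCG@5:   {ndcg_before:.3f} -> {ndcg_after:.3f}")
print(f"Recall@5: {recall_before:.2f} -> {recall_after:.2f}")
```

NDCG rewards the promotion to rank 1 more than it penalizes losing the document at rank 5, so the ordering metric rises even though half of the relevant context has left the window the LLM sees.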

When rerankers do help

Rerankers reliably improve quality when: (1) your initial retrieval returns a large candidate set (top-50+), (2) the reranker is domain-adapted or fine-tuned on your data, (3) you have heterogeneous document types where raw embedding similarity is unreliable, (4) your queries are verbose and the reranker can use cross-attention to find fine-grained relevance signals the embedding missed.

The rule: never add a reranker to production without measuring its effect on your specific data and eval suite. Benchmark numbers from MS MARCO don't predict what will happen in your deployment. Measure first, ship second.

