Reranking: Why Top-K Retrieval Isn't Enough
How cross-encoders and rerankers improve precision after initial retrieval, when the latency cost is worth it, and how to evaluate reranker quality.
Your vector retriever returns the top-K most semantically similar chunks. But semantic similarity is not the same as answer relevance. Reranking is the second pass that reorders those K chunks by how likely each one is to actually answer the query.
Why top-K retrieval isn't enough
Bi-encoder embeddings (the kind used in vector search) are optimised for speed: you compute one vector per query and one per document, then do cheap dot products. Because the query and the document are encoded independently, the model never sees them side by side, so it misses subtle relevance signals. A chunk about "Python performance optimisation" will score highly for "how do I make my code faster?" even if it's about a different framework than the user is asking about.
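To make the bi-encoder setup concrete, here is a minimal sketch of top-K retrieval by dot product. The three-dimensional vectors are invented stand-ins for real embeddings, and `dot` and `top_k` are hypothetical helpers written for this example, not library functions.

```python
# Toy bi-encoder retrieval: document vectors are computed once, ahead of time,
# and each query costs one embedding plus cheap dot products.
# All vectors below are made up for illustration; they are not real embeddings.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents with the highest dot product."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

doc_vecs = {
    "django_perf": [0.9, 0.1, 0.0],     # Python performance, Django-specific
    "numpy_perf": [0.8, 0.2, 0.1],      # Python performance, NumPy-specific
    "returns_policy": [0.0, 0.1, 0.9],  # unrelated topic
}
query_vec = [0.85, 0.15, 0.05]  # stand-in for "how do I make my code faster?"

print(top_k(query_vec, doc_vecs, k=2))
```

Both performance chunks land in the top 2 regardless of which framework the user meant. The vectors capture the topic, not the fit between this query and this chunk.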
Retrieval is a recall problem: get all potentially relevant chunks. Reranking is a precision problem: from those, find the actually relevant ones. Separating these concerns and using the right tool for each is the core insight.
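The split between the two stages can be sketched as a small pipeline. This is one possible shape, not a fixed recipe: `retrieve_then_rerank`, `toy_retrieve`, and `toy_score` are hypothetical names, and the word-overlap scorer only stands in for a real reranker.

```python
from typing import Callable, List

def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    k_retrieve: int = 20,
    k_final: int = 5,
) -> List[str]:
    """Stage 1 casts a wide net (recall); stage 2 keeps the best few (precision)."""
    candidates = retrieve(query, k_retrieve)
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k_final]

corpus = [
    "Speeding up Django views with caching",
    "Speeding up NumPy code with vectorisation",
    "Return policy for damaged items",
]

def toy_retrieve(query: str, k: int) -> List[str]:
    # Stand-in for vector search: simply returns the first k documents.
    return corpus[:k]

def toy_score(query: str, doc: str) -> float:
    # Stand-in for a reranker: counts words shared by query and document.
    return len(set(query.lower().split()) & set(doc.lower().split()))

result = retrieve_then_rerank(
    "speeding up numpy loops", toy_retrieve, toy_score, k_retrieve=3, k_final=1
)
print(result)
```

Keeping `retrieve` and `score` as separate callables means you can widen `k_retrieve` to protect recall without changing how precision is decided.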
Cross-encoders: how rerankers work
A cross-encoder takes the query and a document together as a single input and produces a relevance score. Unlike a bi-encoder, it sees the interaction between query tokens and document tokens, which typically yields much higher precision. The cost is that nothing can be precomputed: scoring n candidates takes O(n) forward passes at query time, one per candidate document.
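from sentence_transformers import CrossEncoder
# Load a small cross-encoder checkpoint trained on MS MARCO passage ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is the return policy for damaged items?"
# retrieved_chunks: the top-20 results from vector search, each exposing a .text attribute
candidates = [chunk.text for chunk in retrieved_chunks]
# Score every (query, document) pair jointly
scores = reranker.predict([(query, doc) for doc in candidates])
# Sort on the score alone so tied scores never fall back to comparing chunk objects
ranked = sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)
# Keep the five highest-scoring chunks
top_5 = [chunk for _, chunk in ranked[:5]]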
LLM-based reranking
Use an LLM to score each candidate for relevance. This is more expensive than a cross-encoder but handles complex queries, multi-part questions, and domain-specific relevance better. If you would rather not host a reranking model yourself, Cohere Rerank and Jina Reranker provide hosted reranking APIs.
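A minimal sketch of prompt-based scoring is shown here. It assumes you pass in your own `complete` function that sends a prompt to an LLM and returns its text reply. The prompt wording, `parse_score`, `llm_rerank`, and `fake_complete` are all hypothetical and exist only so the example runs without a provider.

```python
import re
from typing import Callable, List, Tuple

PROMPT = (
    "Rate from 0 to 10 how well the passage answers the query. "
    "Reply with the number only.\n\nQuery: {query}\n\nPassage: {passage}\n\nScore:"
)

def parse_score(reply: str) -> float:
    """Pull the first number out of the model reply; treat anything else as 0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def llm_rerank(
    query: str,
    passages: List[str],
    complete: Callable[[str], str],
    top_n: int = 5,
) -> List[Tuple[float, str]]:
    """Score each passage with one LLM call, then return the top_n by score."""
    scored = []
    for passage in passages:
        reply = complete(PROMPT.format(query=query, passage=passage))
        scored.append((parse_score(reply), passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def fake_complete(prompt: str) -> str:
    # Stand-in for a real LLM call; swap in your provider's client here.
    passage_part = prompt.split("Passage:")[1].lower()
    return "9" if "damaged" in passage_part else "2"

passages = [
    "Shipping takes 3 to 5 business days.",
    "Damaged items can be returned within 30 days.",
]
ranked = llm_rerank("What is the return policy for damaged items?", passages, fake_complete)
print(ranked)
```

One call per candidate is the simplest design but also the most expensive. Parsing defensively matters because a model may reply with extra words around the number.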
When reranking is worth the latency cost
| Scenario | Reranking benefit | Worth it? |
|---|---|---|
| Simple factual Q&A | Low — top-1 vector result is usually right | No |
| Complex multi-part queries | High — different chunks answer different sub-questions | Yes |
| Legal / medical / finance | High — wrong context is dangerous | Always |
| High-volume consumer chat | Marginal — latency cost hurts UX | Maybe (async) |
| Enterprise search | High — precision matters, latency tolerance is higher | Yes |
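The "Worth it?" column is a rule of thumb. One way to check it against your own workload is to compare ranking metrics before and after reranking on queries with labelled relevant chunks. The sketch below uses precision@k and reciprocal rank; the chunk ids and labels are invented for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labelled relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

# Hypothetical labelled example: one query with two relevant chunks.
relevant = {"c3", "c7"}
vector_order = ["c1", "c3", "c5", "c7", "c9"]    # order from vector search
reranked_order = ["c3", "c7", "c1", "c5", "c9"]  # order after reranking

before = (precision_at_k(vector_order, relevant, k=2),
          reciprocal_rank(vector_order, relevant))
after = (precision_at_k(reranked_order, relevant, k=2),
         reciprocal_rank(reranked_order, relevant))
print("before:", before)
print("after: ", after)
```

Averaging the reciprocal rank over many queries gives mean reciprocal rank (MRR). If the gain on your labelled set is small, the added latency is harder to justify.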