GenAI Systems Lab
RAG & Retrieval

Reranking: Why Top-K Retrieval Isn't Enough

How cross-encoders and rerankers improve precision after initial retrieval, when the latency cost is worth it, and how to evaluate reranker quality.

Your vector retriever returns the top-K most semantically similar chunks. But semantic similarity is not the same as answer relevance. Reranking is the second pass that reorders those K chunks by how likely each one is to actually answer the query.

Why top-K retrieval isn't enough

Bi-encoder embeddings (the kind used in vector search) are optimised for speed: you compute one vector per query and one per document, then do cheap dot products. Because each document is embedded without ever seeing the query, the score misses subtle relevance signals. A chunk about "Python performance optimisation" will score highly for "how do I make my code faster?" even if it covers a different framework from the one the user is asking about.
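To make the bi-encoder scoring step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions. The point is that document vectors are fixed before any query arrives, so ranking reduces to dot products.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k documents with the highest dot-product score."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings (hypothetical values, computed once and stored in the index).
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.7, 0.6, 0.1],  # doc 1
    [0.0, 0.2, 0.9],  # doc 2
]
query_vec = [0.8, 0.5, 0.0]  # computed once per query

top_2 = top_k(query_vec, doc_vecs, k=2)
print(top_2)
```

Nothing in this scoring step lets the query and document tokens interact, which is the gap a reranker fills.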

Retrieval is a recall problem: get all potentially relevant chunks. Reranking is a precision problem: from those, find the actually relevant ones. Separating these concerns and using the right tool for each is the core insight.
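One way to keep the two concerns separate is to measure them separately: recall@K on the retriever output and precision@k on the reranked list. The sketch below assumes you have a small labelled set of relevant chunk IDs per query; the IDs and lists are hypothetical.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top k chunks that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    return hits / len(top)


# Hypothetical labels and rankings for a single query.
relevant = {"c3", "c7"}
retrieved = ["c1", "c3", "c4", "c9", "c7"]  # order from vector search
reranked = ["c3", "c7", "c1", "c4", "c9"]   # same chunks after reranking

retriever_recall = recall_at_k(retrieved, relevant, k=5)
precision_before = precision_at_k(retrieved, relevant, k=2)
precision_after = precision_at_k(reranked, relevant, k=2)
print(retriever_recall, precision_before, precision_after)
```

In this toy case the retriever already found both relevant chunks, and reranking only changes which of them land in the top 2. A reranker cannot recover a chunk the retriever never returned.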

Cross-encoders: how rerankers work

A cross-encoder takes the query and a document together as a single input and produces a relevance score. Unlike a bi-encoder, it can see the interaction between query tokens and document tokens, which typically yields higher precision. The cost is one forward pass per candidate document at query time, O(n) for n candidates, and none of that work can be precomputed because every score depends on the query.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the return policy for damaged items?"
candidates = [chunk.text for chunk in retrieved_chunks]  # top-20 from vector search

scores = reranker.predict([(query, doc) for doc in candidates])

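# Sort by score only (highest first), then take the top 5.
# The explicit key avoids comparing chunk objects when two scores tie.
ranked = sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)
top_5 = [chunk for _, chunk in ranked[:5]]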
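The same steps can be wrapped in a small helper that takes the scorer as a parameter, which makes it easy to swap models or test the pipeline without downloading one. This is a sketch, not part of any library: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy stand-in that counts shared words.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    docs: Sequence[str],
    score_fn: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return the top_n docs with their scores."""
    if not docs:
        return []
    pairs = [(query, doc) for doc in docs]
    scores = score_fn(pairs)  # one score per candidate
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy scorer (assumption for the demo): number of words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = [
    "Shipping takes five days",
    "Damaged items can be returned within 30 days under our return policy",
    "Our policy on gift cards",
]
results = rerank("return policy damaged items", docs, overlap_score, top_n=2)
for doc, score in results:
    print(score, doc)
```

With the model loaded above, passing `reranker.predict` as `score_fn` should work, since it accepts a list of (query, document) pairs and returns one score per pair.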

LLM-based reranking

Use an LLM to score each candidate for relevance. This is more expensive than a cross-encoder, but it tends to handle complex queries, multi-part questions, and domain-specific relevance better. If you would rather not host a reranking model yourself, Cohere Rerank and Jina Reranker provide hosted reranking APIs.
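A minimal sketch of pointwise LLM scoring follows. It assumes a `complete` callable that sends a prompt to whatever model you use and returns the reply text; that callable, the prompt wording, and the 0 to 10 scale are all illustrative choices, not a specific vendor's API. The demo uses a fake function in place of a real model.

```python
import re
from typing import Callable, List, Sequence, Tuple

PROMPT = (
    "Rate from 0 to 10 how well the passage answers the question.\n"
    "Reply with a single integer.\n\n"
    "Question: {query}\n"
    "Passage: {passage}\n"
    "Score:"
)


def parse_score(reply: str) -> float:
    """Take the first integer in the reply, capped at 10; unparseable replies score 0."""
    match = re.search(r"\d+", reply)
    if match is None:
        return 0.0
    return float(min(int(match.group()), 10))


def llm_rerank(
    query: str,
    passages: Sequence[str],
    complete: Callable[[str], str],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Ask the model for one score per passage and return the top_n."""
    scored = []
    for passage in passages:
        reply = complete(PROMPT.format(query=query, passage=passage))
        scored.append((passage, parse_score(reply)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the demo runnable."""
    return "9" if "Refunds" in prompt else "Score: 2"


passages = [
    "Our stores open at 9am.",
    "Refunds are issued for items that arrive broken.",
]
ranked = llm_rerank("Can I get my money back for a broken item?", passages, fake_llm)
print(ranked)
```

This makes one model call per candidate, so cost and latency grow linearly with the candidate count. Parsing defensively matters because model replies do not always follow the requested format.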

When reranking is worth the latency cost

| Scenario | Reranking benefit | Worth it? |
| --- | --- | --- |
| Simple factual Q&A | Low: top-1 vector result is usually right | No |
| Complex multi-part queries | High: different chunks answer different sub-questions | Yes |
| Legal / medical / finance | High: wrong context is dangerous | Always |
| High-volume consumer chat | Marginal: latency cost hurts UX | Maybe (async) |
| Enterprise search | High: precision matters, latency tolerance is higher | Yes |
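Whether the latency is acceptable depends on your own hardware and candidate count, so it helps to measure it. The sketch below times any pair-scoring function and checks the result against a response-time budget. The helper names, the dummy scorer, and all millisecond figures are hypothetical.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def median_latency_ms(
    score_fn: Callable[[List[Pair]], Sequence[float]],
    pairs: List[Pair],
    runs: int = 5,
) -> float:
    """Median wall-clock time in milliseconds for scoring all pairs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        score_fn(pairs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fits_budget(rerank_ms: float, other_ms: float, budget_ms: float) -> bool:
    """True if reranking plus the rest of the pipeline stays within the budget."""
    return rerank_ms + other_ms <= budget_ms


def dummy_scorer(pairs: List[Pair]) -> List[float]:
    """Placeholder scorer; swap in your real reranker's scoring call."""
    return [float(len(doc)) for _, doc in pairs]


pairs = [("example query", f"candidate chunk {i}") for i in range(20)]
measured_ms = median_latency_ms(dummy_scorer, pairs)
print(f"median rerank latency: {measured_ms:.3f} ms")

# Made-up figures: 600 ms for retrieval plus generation, 1000 ms total budget.
print(fits_budget(rerank_ms=120.0, other_ms=600.0, budget_ms=1000.0))
print(fits_budget(rerank_ms=450.0, other_ms=600.0, budget_ms=1000.0))
```

Measuring with the candidate count you actually plan to use matters, because cross-encoder cost scales with the number of pairs scored.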
