GenAI Systems Lab

Two-Stage Retrieval: Why a Reranker Exists

Bi-encoders maximize recall by independently embedding query and documents — fast enough for millions of docs, but similarity is not relevance. Cross-encoders maximize precision by scoring query-document pairs with full attention. Two-stage retrieval combines both: bi-encoder handles recall (top-N candidates), cross-encoder handles precision (final top-K). Each stage has distinct failure modes, different latency profiles, and a specific role the other cannot fill.

**Prerequisite: Step 10 (Bi-Encoder vs Cross-Encoder).** After this post you'll understand how to compose a production retrieval stack: recall at scale with a bi-encoder, then precision with a cross-encoder reranker — and where each stage fails.

Vector search returns documents that are similar to your query — and similarity is not relevance. A bi-encoder embeds your query and every document independently, then ranks by cosine similarity. It is fast enough to search millions of documents in milliseconds. But because query and document never see each other during encoding, the model cannot detect relevance signals that only emerge when you read them together. A cross-encoder fixes this — but introduces a latency cost that rules it out as a first-stage retriever.

Vector search maximizes recall. A reranker maximizes precision. They solve different problems and fail in different ways. Two-stage retrieval is what happens when you need both.

Stage 1 — Bi-Encoder: Recall at Speed

A bi-encoder encodes the query and each document independently into a single dense vector. Relevance is approximated by cosine similarity between these vectors. This is fast — vectors are precomputed at index time, so retrieval is a nearest-neighbor lookup. The tradeoff: because query and document are encoded separately, the model cannot attend to the interaction between them. It sees 'Is this document similar in embedding space?' not 'Does this document answer this specific question?'
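
As a rough illustration of the lookup described above, the sketch below ranks a handful of hand-written vectors by cosine similarity. The vectors, the document ids, and the `retrieve` helper are invented for the example. A real system would take its vectors from an embedding model and would usually replace this exhaustive scan with an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical document vectors, computed once at index time.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.2, 0.8, 0.1],
    "doc_c": [0.7, 0.3, 0.2],
}

def retrieve(query_vector, index, recall_k):
    """Score every indexed document against the query and keep the top recall_k."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:recall_k]

query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded query
top = retrieve(query_vector, doc_vectors, recall_k=2)
print([doc_id for doc_id, _ in top])
```

Note that the document text never enters the query-time computation: only the stored vectors do, which is exactly why the model cannot react to how a specific query and document interact.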

Stage 2 — Cross-Encoder: Precision at Cost

A cross-encoder takes a query-document pair as a single concatenated input and produces a relevance score using full attention across both sequences. Because the model reads query and document together, it detects relevance signals invisible to a bi-encoder — nuance, negation, partial match, the difference between a document that mentions a topic and one that actually answers the question. The cost: you cannot precompute anything. Every query-document pair is scored at inference time, which is why cross-encoders are only viable for small candidate sets, not full-corpus retrieval.
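
The scaling argument can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes pairs are scored one after another and takes 50 ms per pair, the low end of the range quoted in this post, as a working figure. Both are assumptions. Batching on a GPU usually lowers the absolute numbers, but the cost still grows linearly with the number of pairs.

```python
def scoring_time_seconds(num_docs: int, ms_per_pair: float) -> float:
    """Time to score num_docs query-document pairs sequentially."""
    return num_docs * ms_per_pair / 1000.0

MS_PER_PAIR = 50.0  # assumption: low end of the 50-200 ms range quoted in this post

full_corpus = scoring_time_seconds(1_000_000, MS_PER_PAIR)  # every document in a 1M corpus
candidate_set = scoring_time_seconds(100, MS_PER_PAIR)      # 100 Stage 1 candidates

print(f"full corpus:   {full_corpus:,.0f} s per query")
print(f"candidate set: {candidate_set:,.0f} s per query")
```

Under these assumptions a single query against the full corpus takes roughly 14 hours, while the 100-candidate set costs 10,000 times less.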

| Dimension | Bi-Encoder (Stage 1) | Cross-Encoder (Stage 2) |
| --- | --- | --- |
| Input | Query and docs encoded separately | Query + doc concatenated together |
| Speed | Milliseconds (precomputed doc vectors) | 50–200 ms per pair at inference time |
| Scales to | Millions of documents | 100–1,000 candidates max |
| What it optimizes | Recall: finds similar documents | Precision: ranks by true relevance |
| Fails when | Query/doc vocabulary gap | Relevant doc not in candidate set |
| Typical models | sentence-transformers, E5, BGE | cross-encoder/ms-marco-MiniLM, Cohere Rerank |

The Two-Stage Pipeline

Two-stage retrieval combines both in sequence. Stage 1 (bi-encoder) retrieves recall_k candidates, typically 50 to 200, using fast vector search. Stage 2 (cross-encoder) reranks those candidates and returns the final top-K. The bi-encoder handles scale; the cross-encoder handles quality. The most common production mistake is treating the two stages as interchangeable, or assuming that improving one stage fixes failures that originate in the other.
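
The control flow can be sketched in a few lines. Everything named here is hypothetical: `two_stage_search`, `toy_first_stage`, and `toy_score_pair` are placeholders, and the word-overlap scorer is only a crude stand-in for a cross-encoder so the example runs without a model.

```python
from typing import Callable, Dict, List, Tuple

def two_stage_search(
    query: str,
    documents: Dict[str, str],
    first_stage: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    recall_k: int = 100,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 proposes recall_k candidate ids; Stage 2 rescores only those."""
    candidate_ids = first_stage(query, recall_k)
    scored = [(doc_id, score_pair(query, documents[doc_id])) for doc_id in candidate_ids]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

documents = {
    "d1": "Rerankers are mentioned in many retrieval papers.",
    "d2": "A reranker rescores the top candidates so the best answer is ranked first.",
    "d3": "Vector databases store embeddings.",
}

def toy_first_stage(query: str, recall_k: int) -> List[str]:
    # Stand-in for vector search: pretend d1 came back as the most similar document.
    return ["d1", "d2", "d3"][:recall_k]

def toy_score_pair(query: str, document: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words that appear in the document.
    query_words = query.lower().split()
    doc_words = set(document.lower().replace(".", "").split())
    return sum(word in doc_words for word in query_words) / len(query_words)

results = two_stage_search(
    "what does a reranker do", documents, toy_first_stage, toy_score_pair,
    recall_k=3, top_k=2,
)
print(results)
```

Keeping the two stages behind separate callables makes it easier to measure and tune each one independently, since `recall_k` and `top_k` control different trade-offs.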

The critical failure mode: setting recall_k too small. If the relevant document is at position 120 in the bi-encoder output and recall_k is 100, the cross-encoder never sees it. No amount of reranker quality fixes this. Recall is a Stage 1 problem.
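
The position-120 scenario can be reproduced with a small recall check on Stage 1 output. The ranking below is synthetic and the `recall_at_k` helper is a hypothetical name; in practice the ranked ids would come from your vector search and the relevant ids from a labeled evaluation set.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the first k Stage 1 results."""
    if not relevant_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return sum(1 for doc_id in relevant_ids if doc_id in top) / len(relevant_ids)

# Synthetic Stage 1 output: 200 ids, with the only relevant document at position 120.
ranking = [f"doc_{i}" for i in range(1, 201)]
relevant = {"doc_120"}

recall_small = recall_at_k(ranking, relevant, 100)  # cutoff falls before position 120
recall_large = recall_at_k(ranking, relevant, 150)  # cutoff falls after position 120
print(recall_small, recall_large)
```

With the cutoff at 100 the reranker receives a candidate set that does not contain the answer, so its own quality is irrelevant. Measuring recall at the chosen recall_k on Stage 1 output alone is one way to tell which stage a bad result comes from.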
