Bi-Encoder vs Cross-Encoder: The Retrieval Architecture Decision That Determines Latency
Bi-encoders pre-compute document vectors (fast, scalable). Cross-encoders score pairs jointly (accurate, slow). Why production retrieval is almost always two-stage: bi-encoder recall + cross-encoder rerank.
**Prerequisite: Steps 8–9 (RAG + Vector DBs).** After this post you'll understand why fast retrieval and accurate ranking are fundamentally in tension, and why almost every production retrieval system uses both a bi-encoder and a cross-encoder.
The Retrieval Architecture Decision That Determines Latency
You have a query and 10 million documents. You need the most relevant ones, fast. The architecture you choose — bi-encoder, cross-encoder, or a combination — determines whether your system responds in 50ms or 5 seconds. This is the most important retrieval design decision, and the wrong choice at scale is not fixable with hardware.
Bi-Encoder Architecture
A bi-encoder encodes queries and documents separately, with one encoder for each side (typically both are the same pretrained model). Each input is encoded into a single dense vector without seeing the other side. Similarity is then computed between these vectors, most commonly with cosine similarity or dot product.
# Bi-encoder inference
query_vec = encoder(query) # [768]
doc_vecs = encoder(documents) # [N, 768] — precomputed at index time
scores = cosine_sim(query_vec, doc_vecs) # [N]
# Key: doc_vecs are computed ONCE and stored in a vector index
# Query encoding: ~10ms. ANN search over 10M docs: ~50ms.
# Total: ~60ms. ANN search cost grows sublinearly with corpus size, so this stays roughly flat.
The critical property: document embeddings can be computed offline and indexed. At query time, only the query is encoded (fast), then ANN search finds nearest neighbors (fast). This scales to hundreds of millions of documents.
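The index-time versus query-time split can be sketched in a few lines of plain Python. This is a minimal illustration, not a real bi-encoder: `build_vocab`, `encode`, and `search` are hypothetical helpers, the "encoder" is a normalized bag-of-words vector standing in for a neural model, and the search is brute force rather than ANN.

```python
import math


def build_vocab(texts):
    """Map each distinct token in the corpus to a vector dimension."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, vocab):
    """Toy stand-in for a neural encoder: L2-normalized bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def search(query_vec, doc_vecs, k):
    """Brute-force cosine search (vectors are normalized, so dot product = cosine)."""
    scores = [sum(q * d for q, d in zip(query_vec, dv)) for dv in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


documents = [
    "how to reset a router password",
    "bank robbery suspects arrested downtown",
    "river bank erosion after flooding",
]

# Index time: every document is encoded once and stored.
vocab = build_vocab(documents)
doc_vecs = [encode(doc, vocab) for doc in documents]

# Query time: only the query is encoded, then compared against stored vectors.
top = search(encode("bank robbery", vocab), doc_vecs, k=2)
for doc_id, score in top:
    print(f"{score:.3f}  {documents[doc_id]}")
```

The point of the sketch is the shape of the work: the expensive per-document step happens before any query arrives, and the query path touches only one new encoding plus cheap vector arithmetic.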
Cross-Encoder Architecture
A cross-encoder takes the query and a document concatenated as a single input, processes them jointly through the full transformer, and outputs a relevance score. The key difference: query tokens and document tokens attend to each other at every layer, so the model can capture fine-grained interactions between them.
# Cross-encoder inference
for doc in candidates:
    pair_input = tokenize('[CLS] ' + query + ' [SEP] ' + doc + ' [SEP]')
    score = cross_encoder(pair_input)  # one forward pass per (query, doc) pair
# 10M documents × 20ms each = 200,000 seconds. Not viable.
# Even 1,000 candidates × 20ms = 20 seconds. Too slow.
Cross-encoders cannot scale to large corpora because they require a separate forward pass for every (query, document) pair. You cannot pre-compute the document representations because the document representation depends on the query. The computation is fundamentally online.
Why Cross-Encoders Are More Accurate
Cross-encoders outperform bi-encoders on relevance because joint attention captures semantic interactions that a single vector similarity score misses. Consider the query 'bank robbery' and a document about 'financial institutions near rivers.' A bi-encoder may score this pair highly, because each text is compressed into one vector on its own and the two vectors overlap on the related 'bank' and 'financial' vocabulary. A cross-encoder reads both texts together, so it can separate the financial and geographical meanings in context.
On MS MARCO benchmarks, a well-trained cross-encoder achieves NDCG@10 around 0.72, while a bi-encoder achieves around 0.64. The 8-point gap is significant, but the cross-encoder is roughly 1000x slower at scale.
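For readers who have not computed NDCG@10 by hand, here is a minimal sketch of the metric. It assumes the linear-gain variant (gain equals the relevance label) and, as a simplification, derives the ideal ordering from the labels of the ranked list itself rather than from every judged document. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each label is discounted by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order each system ranked the same four documents.
perfect_ranking = [3, 2, 1, 0]
flawed_ranking = [0, 1, 2, 3]

print(ndcg_at_k(perfect_ranking))
print(ndcg_at_k(flawed_ranking))
```

Because the discount is steepest at the top positions, swapping a relevant document out of rank 1 costs far more than the same swap at rank 9, which is why reranking the head of the list moves this metric so much.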
The Production Pattern: Two-Stage Retrieval
Production retrieval is almost always a two-stage pipeline: a bi-encoder retrieves the top-K candidates (fast, recall-focused), then a cross-encoder reranks those K candidates (slow, precision-focused). This gives you near cross-encoder quality at near bi-encoder latency.
# Two-stage production pattern
# Stage 1: Bi-encoder retrieval — fast, high recall
candidates = ann_index.search(query_embedding, k=100) # top-100
# Stage 2: Cross-encoder reranking — accurate, bounded cost
scores = cross_encoder.predict([(query, doc) for doc in candidates])
results = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:10]  # sort on score only
# Latency: ~50ms (ANN) + ~200ms (100 pairs × ~2ms, assuming batching amortizes the per-pair cost) = ~250ms
# Quality: approaches full cross-encoder on the retrieved set
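The same pattern can be run end to end with toy scorers. In this sketch `cheap_score` and `joint_score` are hypothetical stand-ins: the first plays the bi-encoder role (token overlap, blind to word order), the second plays the cross-encoder role (it looks at query and document together and rewards the query phrase appearing in order). Neither is a real model.

```python
def cheap_score(query, doc):
    """Stage-1 stand-in: count of shared tokens, ignoring word order."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def joint_score(query, doc):
    """Stage-2 stand-in: token overlap plus a bonus for query bigrams found in the doc."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    phrase_hits = sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)
    return cheap_score(query, doc) + 2.0 * phrase_hits


def two_stage_search(query, corpus, k=3, top_n=2):
    # Stage 1: cheap scoring over the whole corpus, keep the top-k candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    # Stage 2: expensive scoring over only those k candidates.
    reranked = sorted(candidates, key=lambda doc: joint_score(query, doc), reverse=True)
    return reranked[:top_n]


corpus = [
    "robbery insurance offered by the bank",
    "police investigate bank robbery downtown",
    "river bank flooding report",
    "weather forecast for tuesday",
]

results = two_stage_search("bank robbery", corpus, k=3, top_n=2)
for doc in results:
    print(doc)
```

Stage 1 cannot tell the first two documents apart, since both contain "bank" and "robbery". Stage 2 promotes the one where the words form the actual phrase, and it only had to score three documents to do so.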
Choosing K in the Two-Stage Pipeline
K — the number of candidates passed from stage 1 to stage 2 — directly controls the latency-quality tradeoff. K=10 is fast but you risk missing relevant documents not in the top-10 bi-encoder results. K=1000 is slow but maximizes recall. Production systems typically use K=50 to K=200 for a 250–500ms total budget.
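A rough way to pick K is to work backward from the latency budget. The sketch below assumes a simple linear cost model using the figures from the latency estimate earlier in this post (about 50ms for ANN search and about 2ms per reranked pair). Real costs depend on hardware, batching, and document length, so treat the defaults as placeholders to replace with your own measurements.

```python
def pipeline_latency_ms(k, ann_ms=50.0, rerank_ms_per_pair=2.0):
    """Estimated end-to-end latency: fixed ANN cost plus a per-candidate rerank cost."""
    return ann_ms + k * rerank_ms_per_pair


def max_k_for_budget(budget_ms, ann_ms=50.0, rerank_ms_per_pair=2.0):
    """Largest K whose estimated latency still fits inside the budget."""
    return max(0, int((budget_ms - ann_ms) // rerank_ms_per_pair))


for k in (10, 50, 100, 200, 1000):
    print(f"K={k:>4}: ~{pipeline_latency_ms(k):.0f} ms")

print("max K for a 250 ms budget:", max_k_for_budget(250))
print("max K for a 500 ms budget:", max_k_for_budget(500))
```

Under these assumed costs, K=1000 lands around two seconds, which is why it rarely survives contact with an interactive latency target.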
The two-stage pipeline has a hard quality ceiling: if a relevant document is not in the bi-encoder's top-K, the cross-encoder never sees it. Bi-encoder recall@K is therefore the critical metric to monitor. If recall@100 is 0.85, 15% of relevant documents are lost before reranking begins.
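Recall@K is straightforward to compute from a labeled evaluation set. A minimal sketch, assuming each evaluation query comes with the ranked document ids returned by stage 1 and a set of ids judged relevant (the ids and helper names here are made up for illustration):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(retrieved_ids[:k]))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # second relevant doc sits at rank 4
    (["d5", "d9"], {"d5"}),
]

print(mean_recall_at_k(runs, k=2))
print(mean_recall_at_k(runs, k=4))
```

Tracking this number at the K you actually deploy tells you how much of the quality gap is the retriever's fault and therefore out of the reranker's reach.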
When to Use Each