
How I'd Build AI Search: Query Understanding, Retrieval, and Ranking at Scale

The full architecture for a production AI search system — query intent classification, hybrid retrieval, LLM reranking, freshness signals, and why the first version almost always gets the retrieval layer wrong.

I've built or reviewed AI search systems at three different companies. The same three mistakes keep appearing: retrieving the right documents for the wrong reason, ranking without a feedback signal, and ignoring freshness until users start complaining. Here's how I'd build it if I were starting today.

Start With Query Understanding

The retrieval layer is not where AI search begins. Query understanding is. Before you hit any index, classify the query: is this navigational (user wants a specific page), informational (user wants an answer), or transactional (user wants to do something)? Each intent type gets a different retrieval strategy.

Navigational → exact match / BM25, metadata filter by URL or title
Informational → hybrid search (dense + sparse), reranking, RAG-style answer generation
Transactional → structured lookup, not retrieval

Build a query intent classifier first. A small fine-tuned model (for example, Llama 3.1 8B) classifying the three intent types above adds under 50 ms and routes each query to the right retrieval pipeline. Without it, your vector index ends up serving navigational queries it was never meant to handle.
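A minimal sketch of that routing layer, assuming the three intents listed above. The keyword heuristic in `classify_intent` is only a stand-in for the fine-tuned classifier, and the three pipeline functions are hypothetical placeholders that return a label instead of hitting a real index.

```python
from enum import Enum
from typing import Callable, Dict


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"


def classify_intent(query: str) -> Intent:
    """Toy heuristic standing in for the fine-tuned classifier (assumption)."""
    q = query.lower().strip()
    if q.startswith(("http://", "https://", "www.")) or q.endswith(
        (" login", " homepage", " docs")
    ):
        return Intent.NAVIGATIONAL
    first_word = q.split(" ", 1)[0]
    if first_word in {"buy", "cancel", "reset", "book", "upgrade"}:
        return Intent.TRANSACTIONAL
    return Intent.INFORMATIONAL


# Hypothetical pipelines: each would call a different backend in a real system.
def exact_match_pipeline(query: str) -> str:
    return f"bm25+metadata:{query}"


def hybrid_pipeline(query: str) -> str:
    return f"hybrid+rerank:{query}"


def structured_lookup(query: str) -> str:
    return f"lookup:{query}"


ROUTES: Dict[Intent, Callable[[str], str]] = {
    Intent.NAVIGATIONAL: exact_match_pipeline,
    Intent.INFORMATIONAL: hybrid_pipeline,
    Intent.TRANSACTIONAL: structured_lookup,
}


def route(query: str) -> str:
    """Classify first, then dispatch to the pipeline for that intent."""
    return ROUTES[classify_intent(query)](query)
```

The point of the dispatch table is that swapping the heuristic for a real model changes only `classify_intent`; the pipelines never need to know how the intent was decided.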

Hybrid Retrieval Is Non-Negotiable

Pure vector search fails on product codes, version numbers, proper nouns, and exact terminology that the embedding model never saw in its training data. Pure BM25 fails on synonyms, paraphrases, and intent. Hybrid search is the default for production systems, with Reciprocal Rank Fusion as the usual way to merge the dense and sparse result lists. When the fusion is weighted, I start at alpha=0.5 (equal weight on dense and sparse) and tune from there with offline NDCG experiments.
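A small sketch of Reciprocal Rank Fusion over a dense and a sparse ranking. The smoothing constant `k=60` is a commonly used default rather than something from this article, and applying `alpha` as a per-list weight (`alpha` for dense, `1 - alpha` for sparse) is one assumed way to combine it with RRF. The function names and document IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """Fuse ranked lists of doc IDs; each list adds weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_fuse(
    dense: Sequence[str], sparse: Sequence[str], alpha: float = 0.5, k: int = 60
) -> List[str]:
    """alpha weights the dense list, (1 - alpha) the sparse list (assumption)."""
    return reciprocal_rank_fusion([dense, sparse], k=k, weights=[alpha, 1.0 - alpha])


dense_hits = ["a", "b", "c"]   # e.g. from the vector index
sparse_hits = ["c", "a", "d"]  # e.g. from BM25
fused = hybrid_fuse(dense_hits, sparse_hits, alpha=0.5)
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on very different scales.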

The Reranking Trap

The instinct is to add a cross-encoder reranker on top of everything. The trap: cross-encoders are expensive (200-400 ms for top-100 reranking) and you may not need them. Before adding a reranker, measure your precision@1 and NDCG@5. If precision@1 is already above 75%, ordering is mostly fine and the remaining misses are more likely documents that were never retrieved, so the retrieval layer is your bottleneck, not ranking. Rerankers help when your retriever gets the right documents in the wrong order, not when it retrieves the wrong documents to begin with.
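For reference, a minimal sketch of the two metrics named above, computed for one query from graded relevance labels in ranked order. It assumes linear gain and computes the ideal ordering from the retrieved list only, so it cannot see relevant documents that were never retrieved; measuring that gap needs a separate recall check.

```python
import math
from typing import Sequence


def precision_at_1(relevances: Sequence[int]) -> float:
    """1.0 if the top-ranked result is relevant (label > 0), else 0.0."""
    return 1.0 if relevances and relevances[0] > 0 else 0.0


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Labels for one query's results, in the order the system returned them.
labels = [0, 1]
p1 = precision_at_1(labels)
ndcg5 = ndcg_at_k(labels, 5)
```

Averaging both numbers over a judged query set gives the baseline to compare against before and after adding a reranker.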

Freshness Is a First-Class Signal

Vector similarity doesn't know about time. A document about a product bug that was fixed two years ago still scores high on similarity for queries about current bugs. I add a freshness decay multiplier to final scores: score_final = score_retrieval × decay(days_since_modified). The decay function is exponential, with a half-life of 90 days for support content and 365 days for evergreen docs.
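A minimal sketch of that multiplier, using the half-lives given above. The content-type keys and function names are illustrative, and clamping negative ages to zero is an added assumption to guard against clock skew.

```python
# Half-lives in days, taken from the text; the dictionary keys are illustrative.
HALF_LIFE_DAYS = {"support": 90.0, "evergreen": 365.0}


def decay(days_since_modified: float, half_life_days: float) -> float:
    """Exponential decay: 1.0 at age zero, 0.5 after one half-life."""
    age = max(days_since_modified, 0.0)  # assumption: treat future dates as age 0
    return 0.5 ** (age / half_life_days)


def final_score(
    score_retrieval: float, days_since_modified: float, content_type: str
) -> float:
    """score_final = score_retrieval * decay(days_since_modified)."""
    return score_retrieval * decay(days_since_modified, HALF_LIFE_DAYS[content_type])


fresh = final_score(0.8, 0, "support")
stale = final_score(0.8, 180, "support")  # two half-lives old
```

Since the multiplier only ever shrinks a score, an old document can still win when nothing newer is similar enough, which is usually the behavior you want for evergreen material.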

The Feedback Loop That Compounds

This is how the system improves without retraining the retriever: log every query, every retrieved result, and every user click (or the absence of a click after 5 seconds of display). Build a click-through rate model on top of these logs, and use CTR as the ranking signal that corrects the initial retriever. Relevance labeling from query logs is roughly 100× cheaper than manual annotation.
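A minimal sketch of turning those logs into a per-(query, document) CTR signal. The `Impression` record is a hypothetical log schema, and the additive smoothing prior (1 click per 10 impressions) is an assumption added so that rarely shown documents do not get extreme rates; it is not part of the approach described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Impression:
    """One logged result shown to a user (hypothetical schema)."""
    query: str
    doc_id: str
    clicked: bool
    seconds_displayed: float


def smoothed_ctr(
    log: Iterable[Impression],
    min_display_seconds: float = 5.0,
    prior_clicks: float = 1.0,
    prior_impressions: float = 10.0,
) -> Dict[Tuple[str, str], float]:
    """CTR per (query, doc_id); unclicked views shorter than the threshold are skipped."""
    clicks: Dict[Tuple[str, str], float] = defaultdict(float)
    views: Dict[Tuple[str, str], float] = defaultdict(float)
    for imp in log:
        # A non-click only counts as a negative after enough display time.
        if not imp.clicked and imp.seconds_displayed < min_display_seconds:
            continue
        key = (imp.query, imp.doc_id)
        views[key] += 1.0
        if imp.clicked:
            clicks[key] += 1.0
    return {
        key: (clicks[key] + prior_clicks) / (views[key] + prior_impressions)
        for key in views
    }


log = [
    Impression("vpn setup", "a", clicked=True, seconds_displayed=2.0),
    Impression("vpn setup", "a", clicked=False, seconds_displayed=6.0),
    Impression("vpn setup", "b", clicked=False, seconds_displayed=1.0),  # too brief
]
ctr = smoothed_ctr(log)
```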

Survivorship bias: you only see clicks on what you showed. To keep the retriever from trapping itself in its own blind spots, run random result injection on 5% of traffic: show a random document in a lower position and observe its click rate. This uncovers high-relevance documents your retriever consistently misses.
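A minimal sketch of that injection step at the 5% rate. The function name, the `candidate_pool` argument, and the choice of position 3 or lower as "a lower position" are assumptions for illustration. The injected ID is returned so it can be logged separately from organic results.

```python
import random
from typing import List, Optional, Sequence, Tuple


def maybe_inject(
    results: Sequence[str],
    candidate_pool: Sequence[str],
    rng: random.Random,
    rate: float = 0.05,
    min_position: int = 3,
) -> Tuple[List[str], Optional[str]]:
    """With probability `rate`, insert one unseen doc at index >= min_position."""
    out = list(results)
    if rng.random() >= rate:
        return out, None  # most traffic is left untouched
    unseen = [doc for doc in candidate_pool if doc not in out]
    if not unseen:
        return out, None
    injected = rng.choice(unseen)
    # Never displace the top results; append if the list is shorter than that.
    position = rng.randint(min(min_position, len(out)), len(out))
    out.insert(position, injected)
    return out, injected


rng = random.Random(0)
shown, injected = maybe_inject(
    ["a", "b", "c", "d", "e"], ["a", "x", "y"], rng, rate=1.0
)
```

Passing the random generator in explicitly keeps the behavior reproducible in tests; in production the injected impressions should be tagged in the log so the CTR model can treat them as exploration data.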

The Answer Generation Layer

If you're generating AI answers (not just returning links), the answer layer needs its own quality signals: groundedness (is every claim in the answer attributable to a retrieved chunk?), coverage (does the answer address all aspects of the query?), and confidence (does the system know when it doesn't know?). I use LLM-as-judge for groundedness at a 10% sample rate in production, because it's too expensive to run on everything.
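One way to implement the 10% sampling is to hash a request ID, so the same request is always in or out of the sample. This is a sketch under that assumption. The `judge` callable is a hypothetical stand-in for whatever LLM-as-judge call scores groundedness; no real API is implied.

```python
import hashlib
from typing import Callable, Optional, Sequence


def in_sample(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of request IDs via a hash bucket."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def maybe_judge(
    request_id: str,
    answer: str,
    chunks: Sequence[str],
    judge: Callable[[str, Sequence[str]], float],
    rate: float = 0.10,
) -> Optional[float]:
    """Run the (expensive) groundedness judge only for sampled requests."""
    if not in_sample(request_id, rate):
        return None
    return judge(answer, chunks)
```

Hash-based sampling also makes it easy to replay the exact judged subset later, which helps when comparing judge prompts or models.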
