
How I'd Build E-Commerce Search Ranking (BM25 + Neural + LTR)

Hybrid retrieval (BM25 + bi-encoder), LambdaMART for learning-to-rank, position bias correction with IPS, and sponsored item injection as a constrained optimization. The architecture interview question that separates senior from staff engineers in Bangalore.

E-commerce Search Is Not Information Retrieval

When a user searches 'running shoes under 2000' on Flipkart, they're not looking for the most relevant documents about running shoes. They're looking for the shoes most likely to result in a purchase they'll be happy with. These are different objectives. A shoe that ranks first in BM25 relevance (many mentions of 'running', 'shoe') might have a 30% return rate and terrible reviews. A shoe with fewer keyword matches but a 4.8 rating and free delivery might be the right answer.

Query Understanding: Before Retrieval

The query 'running shoes under 2000' needs to be understood before any retrieval happens. Start with query classification: is this navigational (the user knows exactly which product they want), informational, or transactional? For transactional queries, extraction matters: pull out the price constraint (under 2000), the category (running shoes), and implicit constraints (the user's usual size from past orders, preferred brands from browsing). Then apply spell correction and synonym expansion: 'nike runing shoes' → 'nike running shoes', and 'shoes' expands to include 'footwear' and 'sneakers'.
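The extraction, spell-correction, and expansion steps above can be sketched in a few lines. This is a minimal illustration, not a production parser: the `SPELL_FIXES` and `SYNONYMS` tables, the `ParsedQuery` type, and the `parse_query` function are hypothetical names, and a real system would likely learn corrections and synonyms from query logs rather than hard-code them.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical hand-written tables, for illustration only.
SPELL_FIXES = {"runing": "running", "sheos": "shoes"}
SYNONYMS = {"shoes": ["footwear", "sneakers"]}

# Matches phrases such as "under 2000", "below 2,000", "less than 2000".
PRICE_CAP = re.compile(r"\b(?:under|below|less than)\s+(\d[\d,]*)")


@dataclass
class ParsedQuery:
    text: str                  # corrected query with the price phrase removed
    max_price: Optional[int]   # extracted price ceiling, if any
    expansions: List[str]      # extra terms to OR into retrieval


def parse_query(raw: str) -> ParsedQuery:
    query = raw.lower().strip()
    max_price = None
    match = PRICE_CAP.search(query)
    if match:
        max_price = int(match.group(1).replace(",", ""))
        # Drop the price phrase so it does not pollute keyword matching.
        query = query[:match.start()] + query[match.end():]
    tokens = [SPELL_FIXES.get(tok, tok) for tok in query.split()]
    expansions = [syn for tok in tokens for syn in SYNONYMS.get(tok, [])]
    return ParsedQuery(" ".join(tokens), max_price, expansions)
```

Calling `parse_query("nike runing shoes under 2000")` yields the corrected text, a price ceiling of 2000 that can be applied as a hard filter, and the synonym terms for retrieval.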

Two-Stage Retrieval

Stage 1 (Recall): Run BM25 over product text (title, description, attributes) for keyword matches, and dense retrieval with a bi-encoder for semantic matches. The dense side catches 'running shoes' when the product description says 'athletic footwear for jogging'. Apply category and attribute filtering: if the query intent is classified as 'running shoes', restrict to the footwear category and apply the attribute filters from the extracted constraints. Take the union of these signals and target 500-1000 candidates.
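One way to combine the recall signals is sketched below. It assumes each retriever has already returned a ranked list of item IDs and that `catalog` is an in-memory dict of item attributes; both, along with the function names, are assumptions for illustration. Interleaving the two lists is just one simple merge policy among several.

```python
from itertools import zip_longest


def union_candidates(bm25_ids, dense_ids, max_candidates=1000):
    """Interleave two ranked ID lists, dropping duplicates, up to a cap.

    Assumes item IDs are never None (None is used as padding here).
    """
    merged, seen = [], set()
    for pair in zip_longest(bm25_ids, dense_ids):
        for item_id in pair:
            if item_id is None or item_id in seen:
                continue
            seen.add(item_id)
            merged.append(item_id)
            if len(merged) == max_candidates:
                return merged
    return merged


def apply_filters(item_ids, catalog, category=None, max_price=None):
    """Keep only items that satisfy the constraints extracted from the query."""
    kept = []
    for item_id in item_ids:
        item = catalog[item_id]
        if category is not None and item["category"] != category:
            continue
        if max_price is not None and item["price"] > max_price:
            continue
        kept.append(item_id)
    return kept
```

Interleaving keeps the top results of both retrievers in the candidate set even when the cap is hit, so neither the keyword nor the semantic signal can crowd the other out.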

Stage 2 (Ranking): An LTR (Learning to Rank) model with hundreds of features, which fall into roughly five groups. Query-item relevance features (BM25 score, semantic similarity score, category match). Item quality features (rating, review count, return rate, seller reliability). Business features (whether the item is Prime-eligible, the current delivery estimate, price competitiveness vs. similar items). User-item personalization features (whether this user has bought from this brand, their price sensitivity). Historical CTR and conversion rate of the item for similar queries.

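# Learning to Rank with LightGBM
import lightgbm as lgb
import numpy as np

# Features for each (query, item) pair
features = [
    "bm25_score", "semantic_similarity", "category_match",
    "item_rating", "review_count", "return_rate",
    "user_brand_affinity", "price_vs_budget_ratio",
    "historical_ctr_for_similar_queries",
    "delivery_eta_minutes", "is_prime_eligible"
]

# Training data: one row per (query, item) pair, rows grouped by query.
# Labels: 0=not relevant, 1=clicked, 2=purchased, 3=repurchased
# The data below is synthetic placeholder data so the snippet runs end to end;
# replace it with features and labels built from real search logs.
rng = np.random.default_rng(0)

def make_dataset(n_queries, items_per_query, reference=None):
    n_rows = n_queries * items_per_query
    X = rng.random((n_rows, len(features)))
    y = rng.integers(0, 4, size=n_rows)
    # group = number of candidate items per query, in row order
    group = [items_per_query] * n_queries
    return lgb.Dataset(X, label=y, group=group,
                       feature_name=features, reference=reference)

train_data = make_dataset(n_queries=200, items_per_query=20)
val_data = make_dataset(n_queries=50, items_per_query=20, reference=train_data)

# LambdaMART = gradient-boosted trees trained with the lambdarank objective,
# whose pairwise gradients are weighted by the change in NDCG from swapping items.
params = {
    "objective": "lambdarank",  # ranking objective tied to NDCG
    "metric": "ndcg",
    "ndcg_eval_at": [5, 10, 20],
    "num_leaves": 127,
    "learning_rate": 0.05,
}

model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    num_boost_round=500
)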
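Since the model is evaluated on NDCG at several cutoffs, it helps to see what that metric computes for a single query. The sketch below uses the common formulation with gain 2^label - 1 and a log2 position discount, applied to the graded labels above (0 to 3). The function names are illustrative, and returning 0.0 for a query with no positive labels is an assumption made here; libraries differ on that convention.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of graded labels listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_ranked_order, k):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: queries with no positive labels score 0
    return dcg_at_k(labels_in_ranked_order, k) / ideal
```

Because of the exponential gain, placing a repurchased item (label 3, gain 7) below a merely clicked item (label 1, gain 1) costs far more than the reverse mistake among low-grade items, which is the behavior you want when purchases matter more than clicks.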

Dealing With Position Bias

Items ranked higher get more clicks, not because they're better but because they're seen more often. Training on click data without correcting for position bias teaches the model to reproduce whatever the previous ranker put on top, not to find good items. Inverse Propensity Scoring (IPS) corrects for this: upweight clicks on lower-ranked items (where the probability of being seen is lower) and downweight clicks on top-ranked items. The propensities themselves are estimated from randomization experiments, where you occasionally show items in random positions and measure the click rate by position.
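A minimal sketch of that estimation and weighting follows, assuming you have a log of `(position, clicked)` pairs from randomized traffic. Under randomization, item quality is the same at every position on average, so click rate by position is proportional to the chance of being seen. The function names, the normalization to the top position, and the `max_weight` clipping (a common way to limit variance) are assumptions of this sketch.

```python
from collections import defaultdict


def estimate_propensities(randomized_log):
    """Estimate relative examination propensity per position.

    randomized_log: iterable of (position, clicked) pairs collected from
    traffic where item order was randomized.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in randomized_log:
        shown[position] += 1
        clicks[position] += int(clicked)
    ctr = {pos: clicks[pos] / shown[pos] for pos in shown}
    top_ctr = ctr[min(ctr)]  # click rate at the best (lowest-numbered) position
    if top_ctr == 0:
        raise ValueError("no clicks at the top position; cannot normalize")
    # Normalize so the top position has propensity 1.0.
    return {pos: ctr[pos] / top_ctr for pos in sorted(ctr)}


def ips_weight(position, propensities, max_weight=10.0):
    """Weight for a click observed at `position`, clipped to limit variance."""
    propensity = propensities.get(position, 0.0)
    if propensity <= 0:
        return max_weight
    return min(1.0 / propensity, max_weight)
```

One option for using these weights is to pass them as per-row sample weights when building the training dataset, so that a click at position 8 counts for more than a click at position 1.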

Query-Item Feature Freshness

The historical CTR of an item for a query is a powerful ranking feature but goes stale. An item's CTR for 'winter jackets' in summer is irrelevant. A new product has no historical CTR. Solution: stratify CTR estimates by season, recency-weight historical observations (exponential decay), use category-level priors for new items with no query-specific history. This is the feature engineering work that takes months to get right and is what differentiates mature search systems from basic ones.
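The recency weighting and category-level prior can be combined in one estimator, sketched below. The function name, the 30-day half-life, and the prior strength of 50 pseudo-impressions are illustrative assumptions, not tuned values; `prior_ctr` stands in for the category-level CTR you would look up for the item.

```python
import math


def decayed_ctr(events, now_day, half_life_days=30.0,
                prior_ctr=0.02, prior_strength=50.0):
    """Recency-weighted CTR, shrunk toward a category-level prior.

    events: iterable of (day, impressions, clicks) for one (query, item) pair.
    prior_strength: how many pseudo-impressions the prior is worth.
    """
    decay = math.log(2) / half_life_days
    weighted_impressions = 0.0
    weighted_clicks = 0.0
    for day, impressions, clicks in events:
        # An observation loses half its weight every half_life_days.
        weight = math.exp(-decay * (now_day - day))
        weighted_impressions += weight * impressions
        weighted_clicks += weight * clicks
    return (weighted_clicks + prior_strength * prior_ctr) / (
        weighted_impressions + prior_strength
    )
```

A new item with no events gets exactly the prior, and as old observations decay the estimate drifts back toward the prior, which covers both the new-product case and the stale-seasonal-CTR case with one formula. Seasonal stratification would sit on top of this, for example by keeping a separate event list per season.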

The system design question interviewers love: 'how do you handle a brand new query that has never appeared before in your system?' Your LTR model has no historical features for this query. Fallback: semantic similarity from dense retrieval, global item quality features (rating, sales velocity), category-level signals from query classification. This is why the retrieval stage must be robust to cold-start queries before the ranking stage can do its job.
