
How I'd Build a Recommendation Feed (Bangalore Scale)

Two-tower retrieval + LightGBM ranker + multi-objective reranker, designed for 50M daily active users. Feature store design, real-time vs batch feature split, cold start handling, A/B test framework, and the exact failure modes that kill production recommenders.

The Problem: 50M Users, 200M Items, 100ms Budget

The brief: build a personalized feed for a consumer app with 50 million daily active users. Each user opens the app and expects to see items they actually want. You have a 100ms end-to-end latency budget and a catalog of 200 million items. The naive solution, scoring every item for every user, is computationally infeasible: a single request alone would need 200 million model evaluations. This is the design problem that Meesho, Flipkart, and other large consumer apps solve every day.
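A quick back-of-envelope check makes the scale concrete. The figures come from the brief; treating each user as opening the feed once per day is a simplifying assumption.

```python
# Back-of-envelope numbers for brute-force scoring, using the brief's figures.
# Assumption: each daily active user opens the feed once per day.
DAILY_ACTIVE_USERS = 50_000_000
CATALOG_SIZE = 200_000_000
LATENCY_BUDGET_MS = 100

# Scoring the whole catalog for a single request, inside the latency budget:
scores_per_second_per_request = CATALOG_SIZE * 1000 // LATENCY_BUDGET_MS

# Every user-item pair, once per day:
pairs_per_day = DAILY_ACTIVE_USERS * CATALOG_SIZE

print(f"{scores_per_second_per_request:,} scores/s for one request")
print(f"{pairs_per_day:.1e} user-item pairs per day")
```

Two billion model evaluations per second for one request, before any network or feature-fetch cost, is why the pipeline narrows the catalog in stages.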

Architecture: Three Stages

Stage 1: Candidate Generation (10ms budget). Several retrieval sources run in parallel. Two-tower ANN retrieval pulls 300 personalized candidates. Collaborative filtering retrieves 100 from the histories of similar users. Trending/popularity retrieval adds 100 high-engagement items from the user's preferred categories. Rule-based retrieval adds 50 re-engagement items (items the user viewed but didn't purchase in the last 7 days). Total: ~550 candidates, deduplicated to ~400.
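A minimal sketch of Stage 1 in pure Python. It assumes the two towers have already produced a user vector and item vectors, and it uses an exact dot-product top-k as a stand-in for the ANN index (FAISS, ScaNN, or similar) that a production system would query. The source names and the priority order used for deduplication are hypothetical, and the parallel fan-out is omitted for brevity.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def two_tower_retrieve(user_vec: Sequence[float],
                       item_vecs: Dict[str, Sequence[float]],
                       k: int) -> List[str]:
    """Exact top-k items by user-item dot product (stand-in for an ANN lookup)."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def merge_candidates(sources: Dict[str, List[str]],
                     priority: List[str]) -> List[Tuple[str, str]]:
    """Union the retrieval sources, dropping duplicate item IDs.

    Sources are visited in `priority` order, so each item is attributed to the
    first source that returned it. Returns (item_id, source) pairs.
    """
    seen = set()
    merged = []
    for source in priority:
        for item_id in sources.get(source, []):
            if item_id not in seen:
                seen.add(item_id)
                merged.append((item_id, source))
    return merged


# Toy usage with 2-dimensional embeddings.
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
user = [1.0, 0.2]
sources = {
    "two_tower": two_tower_retrieve(user, items, k=2),  # ["a", "c"]
    "collab_filter": ["c", "d"],
    "trending": ["e", "a"],
    "re_engagement": ["f"],
}
candidates = merge_candidates(sources, ["two_tower", "collab_filter", "trending", "re_engagement"])
print(candidates)
```

Keeping the source tag on each candidate means it can later be used as a ranker feature, and it shows how many candidates each source still contributes after deduplication.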

Stage 2: Ranking (50ms budget). A LightGBM or DNN ranker scores all ~400 candidates. Features include user-item cross features (has the user viewed this brand before?), real-time context (device, time of day, session length so far), item quality signals (return rate, seller rating, price competitiveness), and position-bias correction features. Output: the 400 candidates in ranked order.
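The sketch below shows how the four feature groups might be assembled into one vector per candidate. All class and field names are hypothetical, and `predict` is a placeholder for a trained LightGBM or DNN model. For position bias it uses one common approach, assumed here rather than taken from the original: the logged display position is a feature at training time and is fixed to a constant at serving time, so candidates are compared as if they were all shown in the same slot.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class UserContext:                 # hypothetical request-time context
    viewed_brands: Set[str]
    device: str                    # e.g. "android" or "ios"
    hour_of_day: int
    session_length_s: float


@dataclass
class Item:                        # hypothetical item record
    item_id: str
    brand: str
    return_rate: float
    seller_rating: float
    price_vs_median: float         # item price / median price of similar items


SERVING_POSITION = 0.0             # constant position value used at inference


def build_features(user: UserContext, item: Item, position: float) -> List[float]:
    return [
        1.0 if item.brand in user.viewed_brands else 0.0,  # user-item cross feature
        1.0 if user.device == "ios" else 0.0,              # real-time context
        float(user.hour_of_day),
        user.session_length_s,
        item.return_rate,                                  # item quality signals
        item.seller_rating,
        item.price_vs_median,
        position,                                          # position-bias feature
    ]


def rank(user: UserContext, candidates: List[Item],
         predict: Callable[[List[float]], float]) -> List[str]:
    """Score every candidate with the same fixed position and sort best first."""
    scored = [(predict(build_features(user, item, SERVING_POSITION)), item.item_id)
              for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


def toy_model(features: List[float]) -> float:
    """Hand-written linear stand-in for a trained ranker: brand affinity + rating - returns."""
    return 2.0 * features[0] + features[5] - features[4]


user = UserContext(viewed_brands={"acme"}, device="android", hour_of_day=21, session_length_s=95.0)
candidates = [
    Item("i2", "other", return_rate=0.05, seller_rating=4.5, price_vs_median=0.9),
    Item("i1", "acme", return_rate=0.10, seller_rating=4.0, price_vs_median=1.1),
]
print(rank(user, candidates, toy_model))  # ['i1', 'i2']
```

In production the ~400 feature vectors would usually be scored in one batched model call rather than one at a time, which matters inside a 50ms budget.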

Stage 3: Re-ranking (10ms budget). Business rules are applied: diversity constraints (no more than 3 items from the same seller in the top 20), sponsored items inserted at contracted positions, out-of-stock items filtered out, and safety filters applied. Output: the final 20-item feed.
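A minimal greedy version of Stage 3, assuming the ranker output is a list of item IDs, best first. The function and parameter names are hypothetical, the safety filter is reduced to a blocklist, and the 3-per-seller cap is counted over the whole 20-item feed, sponsored slots included.

```python
from collections import Counter
from typing import Dict, List, Set


def rerank(ranked: List[str],
           seller_of: Dict[str, str],
           in_stock: Set[str],
           blocked: Set[str],
           sponsored: Dict[int, str],
           feed_size: int = 20,
           max_per_seller: int = 3) -> List[str]:
    """Apply stock, safety, seller-diversity, and sponsored-slot rules.

    ranked: item IDs from the ranker, best first.
    sponsored: 0-based contracted position -> sponsored item ID.
    """
    # Sponsored items must also be in stock and pass the safety filter.
    placed = {pos: item for pos, item in sponsored.items()
              if pos < feed_size and item in in_stock and item not in blocked}
    sponsored_ids = set(placed.values())
    seller_count = Counter(seller_of[item] for item in sponsored_ids)

    # Fill the organic slots greedily in ranker order.
    organic: List[str] = []
    organic_slots = feed_size - len(placed)
    for item in ranked:
        if len(organic) == organic_slots:
            break
        if item in blocked or item not in in_stock or item in sponsored_ids:
            continue
        seller = seller_of[item]
        if seller_count[seller] >= max_per_seller:
            continue  # diversity cap reached: a lower-ranked item takes the slot
        seller_count[seller] += 1
        organic.append(item)

    # Interleave: sponsored items at contracted positions, organic items elsewhere.
    feed: List[str] = []
    organic_iter = iter(organic)
    for pos in range(feed_size):
        if pos in placed:
            feed.append(placed[pos])
        else:
            nxt = next(organic_iter, None)
            if nxt is not None:
                feed.append(nxt)
    return feed


ranked = ["a1", "a2", "a3", "a4", "b1", "oos", "c1"]
seller_of = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B", "oos": "B", "c1": "C", "s1": "S"}
in_stock = {"a1", "a2", "a3", "a4", "b1", "c1", "s1"}
print(rerank(ranked, seller_of, in_stock, blocked={"c1"}, sponsored={1: "s1"}))
# ['a1', 's1', 'a2', 'a3', 'b1']
```

When the candidate pool is too small to fill every organic slot, this sketch simply returns a shorter feed; a real system might backfill from another source instead.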

The Feature Store

Real-time features (current delivery ETA, live inventory, session signals) come from an online feature store (Redis or DynamoDB). Batch features (historical click rates, user embeddings, item embeddings) come from an offline feature store (Hive or BigQuery) refreshed daily. The critical discipline is preventing training-serving skew: every feature used at serving time must be constructed identically during training, with point-in-time correctness (no leakage from future signals).
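Point-in-time correctness usually comes down to an as-of join: each training example gets the feature value that existed when the event happened, not the latest one. A minimal sketch, assuming a hypothetical feature log of timestamped values per user (for example, a daily-refreshed historical click rate) sorted by timestamp:

```python
import bisect
from typing import Dict, List, Optional, Tuple

# Hypothetical feature log: user_id -> [(written_at_ts, value), ...], sorted by time.
FeatureLog = Dict[str, List[Tuple[int, float]]]


def value_as_of(log: FeatureLog, user_id: str, event_ts: int) -> Optional[float]:
    """Latest value written strictly before event_ts, or None if none existed yet."""
    history = log.get(user_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_left(timestamps, event_ts)  # number of writes with ts < event_ts
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(events: List[Tuple[str, str, int, int]],
                        log: FeatureLog) -> List[Tuple[str, str, Optional[float], int]]:
    """events: (user_id, item_id, event_ts, label) -> rows with the as-of feature attached."""
    return [(user_id, item_id, value_as_of(log, user_id, ts), label)
            for user_id, item_id, ts, label in events]


click_rate_log: FeatureLog = {"u1": [(100, 0.10), (200, 0.25), (300, 0.40)]}
events = [("u1", "x", 250, 1), ("u1", "y", 50, 0)]
print(build_training_rows(events, click_rate_log))
# [('u1', 'x', 0.25, 1), ('u1', 'y', None, 0)]
```

Joining on the latest value instead (0.40 for both events) is exactly the future leakage described above: the model would train on information it will not have at serving time.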

What to Optimize For

Click-through rate (CTR) is easy to measure, but optimizing for it alone produces clickbait feeds. Purchase rate is a better target, but its labels are delayed and sparse. Optimizing for purchase value (GMV) risks showing only expensive items. The right objective for a consumer feed is typically a blend: a weighted combination of click probability, purchase probability given a click, and a long-term engagement signal (did the user return the next day?). Each component requires its own model and its own training data.
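One way to write the blend, as a sketch: the weights below are hypothetical placeholders (in practice they would be tuned, for example through online experiments), and each probability is assumed to come from its own model.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    p_click: float                  # P(click), from the click model
    p_purchase_given_click: float   # P(purchase | click), from the conversion model
    p_return_next_day: float        # long-term engagement proxy, from a retention model


# Hypothetical weights; not from the original design.
W_CLICK, W_PURCHASE, W_RETURN = 0.2, 0.6, 0.2


def blended_score(p: Predictions) -> float:
    p_purchase = p.p_click * p.p_purchase_given_click  # P(click and purchase)
    return W_CLICK * p.p_click + W_PURCHASE * p_purchase + W_RETURN * p.p_return_next_day


clickbait = Predictions(p_click=0.30, p_purchase_given_click=0.01, p_return_next_day=0.10)
solid = Predictions(p_click=0.15, p_purchase_given_click=0.20, p_return_next_day=0.20)
print(blended_score(clickbait), blended_score(solid))  # ~0.0818 vs ~0.088
```

Multiplying P(click) by P(purchase | click) turns the conditional into an unconditional purchase probability, so an item cannot score well on the purchase term unless users also click it. With these toy numbers, the high-CTR, low-conversion item loses to the item with lower CTR but better conversion and return rate.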

Evaluation: Offline vs Online

Offline metrics: NDCG@20 for the final ranking, Recall@100 for retrieval (did the items the user actually engaged with appear in the top 100 retrieved candidates?), and hit rate per category. But offline metrics are unreliable predictors of online performance: the test set, built from logs of what the existing system showed, doesn't capture position effects, impression effects, or novelty effects. Always shadow-test before an A/B test: run the new model in shadow mode on live traffic, log its recommendations without serving them, then compare them with a held-out user cohort.
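Both offline metrics are short to implement. A minimal sketch, assuming linear graded gains per item; the 1-for-click, 3-for-purchase grading in the example is a hypothetical choice, and the exponential gain 2^rel - 1 is a common alternative.

```python
import math
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], engaged: Set[str], k: int) -> float:
    """Fraction of the items the user engaged with that appear in the top k retrieved."""
    if not engaged:
        return 0.0
    return len(set(retrieved[:k]) & engaged) / len(engaged)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with linear gains: DCG of the ranking divided by DCG of the ideal ranking."""
    dcg = sum(gains.get(item, 0.0) / math.log2(pos + 2) for pos, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


gains = {"bought": 3.0, "clicked": 1.0}  # hypothetical grading
print(recall_at_k(["x", "clicked", "y", "bought"], set(gains), k=3))  # 0.5
print(ndcg_at_k(["bought", "clicked", "x"], gains, k=20))            # 1.0
print(round(ndcg_at_k(["clicked", "bought", "x"], gains, k=20), 3))  # 0.797
```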

The question every interviewer asks: how do you prevent the feedback loop? Your model recommends item X → X gets more clicks → X gets more training signal → X gets recommended more. Left unchecked, the feed collapses to a small set of viral items. Solutions: an exploration budget (Thompson sampling on 5% of feed slots), counterfactual debiasing during training (inverse propensity scoring), and diversity constraints in re-ranking.
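Two of those remedies, sketched under stated assumptions. `thompson_pick` fills exploration slots (in a 20-item feed, 5% is one slot) by sampling each candidate's CTR from a Beta posterior over its clicks and impressions, with a uniform Beta(1, 1) prior assumed. `ips_weighted_log_loss` reweights logged training examples by the inverse of the probability that the logging policy showed the item; the cap at 10 is a hypothetical value for clipped IPS, a common way to control variance. All names are illustrative.

```python
import math
import random
from typing import Dict, List, Tuple


def thompson_pick(stats: Dict[str, Tuple[int, int]], n: int, rng: random.Random) -> List[str]:
    """stats: item_id -> (clicks, impressions). Returns n items for the exploration slots."""
    sampled = {
        item: rng.betavariate(1 + clicks, 1 + impressions - clicks)  # Beta(1, 1) prior
        for item, (clicks, impressions) in stats.items()
    }
    return sorted(sampled, key=sampled.get, reverse=True)[:n]


def ips_weight(propensity: float, clip: float = 10.0) -> float:
    """Inverse propensity weight, capped so rarely shown items don't dominate the loss."""
    return min(1.0 / propensity, clip)


def ips_weighted_log_loss(examples: List[Tuple[int, float, float]]) -> float:
    """examples: (label, predicted_prob, propensity). Mean IPS-weighted binary cross-entropy."""
    total = 0.0
    for label, p, propensity in examples:
        bce = -(label * math.log(p) + (1 - label) * math.log(1 - p))
        total += ips_weight(propensity) * bce
    return total / len(examples)


rng = random.Random(42)
stats = {"proven": (500, 1000), "dud": (1, 1000), "fresh": (0, 0)}
picks = [thompson_pick(stats, n=1, rng=rng)[0] for _ in range(1000)]
print({item: picks.count(item) for item in stats})
```

With these toy counts, the never-shown item should win roughly half the draws (its uniform prior is as likely to land above the proven item's ~0.5 CTR as below it), while the item with 1 click in 1,000 impressions practically never wins. That is the exploration behavior the feedback loop needs: new items get exposure without displacing proven ones for good.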
