
Two-Tower Recommendation Architecture: In-Batch Negatives, L2 Norm, and ANN Serving

The architecture behind YouTube, Spotify, and Pinterest recommendations. User tower + item tower trained with in-batch negatives, L2 normalization for cosine similarity at dot-product speed, and approximate nearest neighbor serving at billion-item scale.

Why Recommendation Is a Retrieval Problem

Flipkart has 500 million products. Swiggy has 200,000 restaurants. Meesho has 100 million items. When a user opens the app, you have 100 milliseconds to surface 20 things they might want. You cannot score 500 million candidates one by one inside that budget: even at one microsecond per item, a full scan takes 500 seconds. This is the fundamental constraint that shapes every large-scale recommender system. The problem must be decomposed into two stages, and the first stage must be fast.
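The budget arithmetic behind this constraint can be checked in a few lines. This is only a back-of-envelope sketch: the catalog size and the 100 ms budget come from the text, while the one-microsecond per-item cost is a hypothetical figure, and the calculation assumes sequential scoring with no batching or parallelism.

```python
# Back-of-envelope latency check (sequential scoring, no parallelism assumed).
N_ITEMS = 500_000_000   # catalog size from the text
BUDGET_S = 0.100        # 100 ms end-to-end budget from the text
PER_ITEM_S = 1e-6       # hypothetical cost: 1 microsecond per item


def exhaustive_scan_seconds(n_items: int, per_item_s: float) -> float:
    """Time to score every item once at a fixed per-item cost."""
    return n_items * per_item_s


def max_per_item_seconds(n_items: int, budget_s: float) -> float:
    """Largest per-item cost that still fits a full scan into the budget."""
    return budget_s / n_items


scan_s = exhaustive_scan_seconds(N_ITEMS, PER_ITEM_S)
allowed_s = max_per_item_seconds(N_ITEMS, BUDGET_S)

print(f"full scan at 1 us/item: {scan_s:.0f} s")
print(f"per-item cost that fits the budget: {allowed_s * 1e9:.2f} ns")
```

A full scan would need to cost a fraction of a nanosecond per item to fit, which is why the first stage relies on an index lookup instead of per-item model inference.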

The two-tower architecture is the dominant solution. One tower encodes the user and the other encodes the item, and the two towers produce their embedding vectors independently. Similarity between user and item is computed as a dot product. The item embeddings are pre-computed offline and indexed in a vector store. At serving time, you embed the user and run approximate nearest neighbour search, using the same kind of HNSW or IVF index used in RAG. You retrieve 500 to 1,000 candidates in milliseconds. Then a slower, more accurate ranking model scores just those candidates.

The Two-Tower Architecture

Each tower is a neural network that takes features as input and produces a dense embedding as output. The user tower takes user features: user ID (as a learned embedding), recent interaction history, demographic signals, device type, time of day. The item tower takes item features: item ID embedding, category, price range, textual description (optionally encoded with a language model), seller rating, historical engagement rates.

Training objective: given a user and an item they interacted with (a positive pair), the model should produce similar embeddings. Given a user and a random item (a negative pair), the model should produce dissimilar embeddings. The standard loss is in-batch softmax: treat every other item in the training batch as a negative. With batch size 4096, each positive gets 4095 negatives for free. This is called in-batch negative sampling and is what makes two-tower training tractable at scale.
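Written out, the in-batch softmax objective for a batch of B positive pairs can be expressed as follows. The notation is an assumption introduced here for illustration: u_i and v_i are the L2-normalized user and item embeddings of the i-th pair, and τ is a temperature that scales the similarities.

```latex
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B}
  \log \frac{\exp\left(u_i^{\top} v_i / \tau\right)}
            {\sum_{j=1}^{B} \exp\left(u_i^{\top} v_j / \tau\right)}
```

The denominator sums over all B items in the batch, so each user contributes one positive term (j = i) and B - 1 negative terms (j ≠ i).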

import torch
import torch.nn as nn
import torch.nn.functional as F

class UserTower(nn.Module):
    def __init__(self, n_users, n_items, embed_dim=128):
        super().__init__()
        self.user_embed = nn.Embedding(n_users, embed_dim)
        self.item_history_embed = nn.Embedding(n_items, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
    
    def forward(self, user_ids, history_item_ids):
        user_emb = self.user_embed(user_ids)
        history_emb = self.item_history_embed(history_item_ids).mean(dim=1)
        combined = torch.cat([user_emb, history_emb], dim=-1)
        return F.normalize(self.mlp(combined), dim=-1)  # L2 normalize

class ItemTower(nn.Module):
    def __init__(self, n_items, embed_dim=128):
        super().__init__()
        self.item_embed = nn.Embedding(n_items, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
    
    def forward(self, item_ids):
        item_emb = self.item_embed(item_ids)
        return F.normalize(self.mlp(item_emb), dim=-1)

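# In-batch softmax loss: row i's positive is column i; every other column is a negative
def two_tower_loss(user_embs, item_embs, temperature=0.07):
    # Inputs are L2-normalized, so each logit is a cosine similarity scaled by 1/temperature
    logits = torch.matmul(user_embs, item_embs.T) / temperature
    # Build the labels directly on the same device as the embeddings
    labels = torch.arange(user_embs.size(0), device=user_embs.device)
    return F.cross_entropy(logits, labels)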

The In-Batch Negative Problem

In-batch negatives are efficient but introduce a bias: popular items appear more often in batches and therefore serve as negatives more often. The model learns to penalize popular items, which suppresses legitimate recommendations. The fix is a popularity-based correction to the negatives. A common form is the logQ correction, which subtracts the log of each item's estimated sampling probability from its logit during training, so that frequent items are not over-penalized merely for showing up in many batches.
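One way to sketch a popularity correction is the logQ adjustment to the in-batch softmax. The function name and the `item_sampling_prob` argument are hypothetical, and the sketch assumes you already have a per-item estimate of how likely each item is to land in a batch (for example, its share of training interactions). The correction is typically applied only during training, with plain dot products used at serving time.

```python
import torch
import torch.nn.functional as F


def logq_corrected_loss(user_embs, item_embs, item_sampling_prob, temperature=0.07):
    """In-batch softmax loss with a logQ correction (illustrative sketch).

    item_sampling_prob[j] is an assumed, externally estimated probability
    that the item in column j appears in a batch.
    """
    logits = user_embs @ item_embs.T / temperature
    # Subtract log Q(j) from column j. Frequently sampled items get a larger
    # value subtracted, which offsets their over-representation as negatives.
    logits = logits - torch.log(item_sampling_prob).unsqueeze(0)
    labels = torch.arange(user_embs.size(0), device=user_embs.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random, L2-normalized embeddings (not real data).
torch.manual_seed(0)
users = F.normalize(torch.randn(8, 16), dim=-1)
items = F.normalize(torch.randn(8, 16), dim=-1)
probs = torch.tensor([0.5] + [0.5 / 7] * 7)  # item 0 is far more popular
print(logq_corrected_loss(users, items, probs).item())
```

With uniform sampling probabilities the correction subtracts the same constant from every logit, so the loss reduces to the uncorrected in-batch softmax.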

A more fundamental problem is false negatives: items that appear as negatives in the batch but that the user would actually click if shown. At scale, any given batch is likely to contain some of them. Hard negative mining, which deliberately samples items the user nearly interacted with, improves model quality, but it raises the false-negative risk and must be introduced carefully to avoid making the task too hard during early training.
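One narrow but provable class of false negatives can be removed mechanically: the same item appearing more than once in a batch, where it would otherwise act as a negative for a user who actually interacted with it. The sketch below masks those duplicates. The function name and the `item_ids` argument are assumptions made for illustration, and this does not address the harder case of items the user would like but never saw.

```python
import torch
import torch.nn.functional as F


def masked_in_batch_loss(user_embs, item_embs, item_ids, temperature=0.07):
    """In-batch softmax that ignores duplicate copies of a row's own positive."""
    logits = user_embs @ item_embs.T / temperature
    # same_item[i, j] is True when column j holds the same item as row i's positive.
    same_item = item_ids.unsqueeze(1) == item_ids.unsqueeze(0)
    # Keep the diagonal (the true positive); mask only off-diagonal duplicates.
    off_diag = ~torch.eye(item_ids.size(0), dtype=torch.bool, device=item_ids.device)
    logits = logits.masked_fill(same_item & off_diag, float("-inf"))
    labels = torch.arange(user_embs.size(0), device=user_embs.device)
    return F.cross_entropy(logits, labels)


# Toy usage: rows 0 and 1 share item id 7, so neither treats the other as a negative.
torch.manual_seed(0)
users = F.normalize(torch.randn(4, 8), dim=-1)
items = F.normalize(torch.randn(4, 8), dim=-1)
print(masked_in_batch_loss(users, items, torch.tensor([7, 7, 3, 4])).item())
```

Masked positions receive a logit of negative infinity, so they contribute zero probability mass to the softmax and no gradient.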

Serving: Offline Index + Online Lookup

Once trained, the item tower runs offline over the entire item catalog, so every item gets an embedding. These embeddings are indexed in FAISS (IVF or HNSW, depending on corpus size). At serving time, user features arrive, the user tower runs online (a single forward pass, ~5ms), FAISS ANN search retrieves the top 500 candidates (~1ms), and the ranking model scores those 500 candidates (~50-100ms). Retrieval itself (user tower plus ANN search) stays under 10ms; the ranker accounts for most of the remaining latency. Because the ranker sees only 500 items rather than the full catalog, the two stages can have completely different model architectures and cost budgets.
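The offline-index and online-lookup split can be sketched with an exact top-k search standing in for the ANN index. Everything here is illustrative: the random tensors stand in for real tower outputs, the sizes are toy values, and `retrieve` is a hypothetical helper. A production system would replace the brute-force matrix product with an approximate index that returns nearly the same neighbours without scanning every item.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N_ITEMS, DIM, TOP_K = 10_000, 128, 500  # toy sizes, not production values

# Offline: stand-in for running the item tower over the whole catalog.
item_index = F.normalize(torch.randn(N_ITEMS, DIM), dim=-1)


def retrieve(user_emb: torch.Tensor, index: torch.Tensor, k: int):
    """Exact top-k by inner product; an ANN index would approximate this."""
    scores = index @ user_emb                    # shape: (n_items,)
    top_scores, top_ids = torch.topk(scores, k)  # sorted, best first
    return top_ids, top_scores


# Online: stand-in for the user tower's output for a single request.
user_emb = F.normalize(torch.randn(DIM), dim=-1)
candidate_ids, candidate_scores = retrieve(user_emb, item_index, TOP_K)

# candidate_ids would then be handed to the slower ranking model.
print(candidate_ids[:5].tolist())
```

Because both sides are L2-normalized, the inner product used here equals cosine similarity, so an inner-product index is sufficient.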

What Interviewers Actually Ask

The single biggest mistake in two-tower interviews: confusing the retrieval recall metric with the final ranking metric. A retrieval model that puts the right item at rank 450 out of 500 has succeeded, because retrieval is judged only on whether the item made the candidate set. Moving it to the top is the ranker's job. Conflating the two stages shows you haven't shipped a real recommender.
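The distinction is easy to see with a small recall@k helper. The function and the toy data are hypothetical, and the sketch assumes the retrieved list is ordered best first.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear anywhere in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval output: item ids 0..999, best first.
retrieved = list(range(1000))
# The one relevant item sits at rank 450 (zero-based position 449).
relevant = {449}

print(recall_at_k(retrieved, relevant, 500))  # 1.0: inside the candidate set
print(recall_at_k(retrieved, relevant, 100))  # 0.0: outside a smaller candidate set
```

Recall@500 is indifferent to where inside the top 500 the item lands. Position-sensitive metrics belong to the ranking stage.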
