Explore-Exploit in Recommendations: Thompson Sampling, UCB, and ε-Greedy
The explore-exploit tradeoff in recommendation systems. Thompson Sampling with Beta distribution posteriors, UCB confidence bounds, ε-greedy simplicity, and contextual bandits when user features matter. Why pure exploitation starves the system of signal.
The Feedback Loop Problem
A recommender system trained on its own outputs creates a feedback loop. It recommends popular items → popular items get more interactions → interactions train the model → model recommends popular items more strongly. Over time, the diversity of recommendations collapses. The system becomes confidently narrow, showing every user a slight variation of the same popular items. Long-tail items that might be perfect for specific users never get shown, never accumulate data, and the model never learns they're good.
The explore-exploit dilemma is the core tension: exploiting known good recommendations maximizes short-term engagement, but exploring unknown items is necessary to avoid long-term stagnation. Pure exploitation → filter bubble. Pure exploration → random recommendations. The question is how to balance them.
ε-Greedy: The Baseline
With probability ε, show a random item instead of the top-ranked recommendation. With probability 1-ε, show the best predicted item. Simple to implement, easy to tune. ε = 0.1 means 10% of recommendations are random exploration. Problems: random exploration is inefficient — you're equally likely to explore a clearly bad item as a potentially good one. And ε is a global constant that doesn't adapt to uncertainty.
Upper Confidence Bound (UCB)
UCB adds an exploration bonus to items with high uncertainty. Score(item) = estimated_value(item) + c × √(log(t) / n_i), where t is total impressions, n_i is impressions for item i, and c controls exploration aggressiveness. Items shown rarely get a high exploration bonus. As items accumulate data, the bonus shrinks. UCB naturally shifts from exploration to exploitation as confidence grows.
Thompson Sampling: The Production Standard
Thompson sampling maintains a probability distribution over the true value of each item. At decision time, sample from each distribution and recommend the item with the highest sampled value. Items with high uncertainty have wide distributions — they'll sometimes get high samples and get shown, generating the data needed to narrow the distribution. Items with low uncertainty have narrow distributions — they get shown when they're actually good.
import numpy as np
from scipy.stats import beta
class ThompsonSamplingBandit:
def __init__(self, n_items):
# Beta distribution parameters for each item
# α = successes (clicks/orders), β = failures (impressions without click)
self.alpha = np.ones(n_items) # prior: 1 success
self.beta = np.ones(n_items) # prior: 1 failure
def select_item(self, candidate_items):
# Sample from each item's beta distribution
samples = np.array([
np.random.beta(self.alpha[i], self.beta[i])
for i in candidate_items
])
return candidate_items[np.argmax(samples)]
def update(self, item_id, reward):
# reward = 1 for click/order, 0 for impression without interaction
if reward:
self.alpha[item_id] += 1
else:
self.beta[item_id] += 1
Contextual Bandits: The Real Production Setup
Pure bandits treat all users the same. Contextual bandits condition on user context: given this user's features, which item's true value is highest? LinUCB models the expected reward as a linear function of context features and maintains a confidence interval. Neural contextual bandits (NeuralUCB, NeuralTS) use a neural network for the reward model and UCB/TS for exploration. At scale, contextual bandits are typically applied to a subset of recommendation slots (e.g., the top-3 positions) while the rest use the standard ranking model.
What Gets Asked in Interviews
- Why not just A/B test everything? (A/B test is offline experiment design — it assigns users to treatments upfront. Bandits adapt in real time. A/B test is better for measuring, bandit is better for optimizing while measuring.)
- How do you handle delayed rewards? (In food delivery, the outcome — did they order and enjoy it? — is known 45-90 minutes after the recommendation. Thompson sampling with delayed updates — buffer the outcome and update when it arrives.)
- How do you prevent exploration from hurting key business metrics? (Cap exploration rate, apply exploration only in designated slots, ensure explored items meet minimum quality thresholds.)
- What's the regret bound of Thompson sampling? (O(√(KT log T)) where K is items, T is timesteps — theoretically near-optimal.)
The most common interview mistake: treating exploration as a nice-to-have diversity feature. It's not. Without systematic exploration, a recommender's performance degrades over time as the training data distribution narrows. Exploration is what keeps the feedback loop from collapsing the system.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →