GenAI Systems Lab
Production & LLMOps

Why Classic A/B Testing Breaks for AI — and What to Do Instead

Classic A/B was designed for click-through rates. AI quality metrics break five of its assumptions. This article covers interleaved testing, switchback testing, multi-armed bandits, and permanent holdouts: when each applies, and why most teams only learn this the hard way.

The assumption mismatch

Classic A/B testing was designed for click-through rates, conversion percentages, and page dwell time. These metrics have low variance and respond to changes quickly. On a high-traffic product, a 5% improvement in click-through rate can typically be measured within days.

AI quality metrics break this. CSAT scores have high variance. Task completion rates require longer sessions to observe. The effect of a better model may only become visible after a user has built trust over multiple sessions. Run a classic A/B test on a model swap for two weeks and you will likely see no statistically significant result — not because the model is not better, but because the experiment was under-powered for the metric you care about.
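To make "under-powered" concrete, here is a minimal sketch of the standard two-sample sample-size approximation, n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 per arm. The function name, the default significance level and power, and the example numbers are all assumptions chosen for illustration; they are not taken from any real experiment.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(sigma: float, delta: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided, two-sample z-test.

    sigma: standard deviation of the metric (assumed equal in both arms)
    delta: smallest absolute difference in means you want to detect
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical illustration: hold the detectable effect fixed and let the
# metric get noisier. Required traffic grows with the square of sigma.
for sigma in (0.3, 0.6, 1.2):
    print(sigma, samples_per_arm(sigma=sigma, delta=0.05))
```

Doubling the metric's standard deviation quadruples the traffic needed for the same detectable effect, which is why a two-week window that is ample for a click metric can be far too short for a noisy quality score.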

Classic A/B is not wrong; it is misapplied. It is the right tool for fast, low-variance binary metrics that resolve within a single session. It is the wrong tool for AI quality metrics that take multiple sessions, not clicks, to manifest.

Interleaved testing: 50x more efficient

For ranking and retrieval tasks, interleaved testing can achieve roughly 50 times the statistical efficiency of classic A/B. Instead of splitting users between model A and model B, you mix results from both models in every response. Each user interaction contributes signal for both models simultaneously.

Airbnb reported that interleaved tests on search ranking reached comparable statistical power with roughly 50 times fewer user sessions than equivalent A/B tests. The intuition is simple: in classic A/B, the two models are judged by different users, so user-to-user variation adds noise to the comparison. In interleaved testing, the same user compares both models on the same query, so that variation cancels out and every impression is informative for both models.

The constraint is real: interleaving only works for ranked outputs — search results, recommendations, document retrieval. You cannot interleave two free-form generations of an AI summary. But for the RAG retrieval layer, the recommendation engine, or the reranker, it is the right default.
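One common way to mix two rankings is team-draft interleaving. The sketch below is a minimal version of that technique, not a description of any particular company's implementation. The function names, the document IDs, and the click-counting credit rule are illustrative assumptions.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Merge two rankings into one list of up to k items.

    Returns (interleaved, team_of), where team_of maps each shown item to the
    model ("A" or "B") that contributed it.
    """
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < k:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for team in order:
            # Each team contributes its highest-ranked item not yet shown.
            doc = next((d for d in sources[team] if d not in team_of), None)
            if doc is not None:
                interleaved.append(doc)
                team_of[doc] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team_of


def credit_clicks(clicked_docs, team_of):
    """Count clicks per contributing model (a simple, assumed credit rule)."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


# Hypothetical rankings from two retrievers for the same query.
mixed, team_of = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4
)
print(mixed, credit_clicks(mixed[:1], team_of))
```

Aggregated over many queries, the model whose contributed items attract more clicks is preferred. Because both models are judged by the same user on the same query, the comparison is paired.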

Switchback testing: when users are not independent

Classic A/B assumes users are independent. In marketplace systems, they are not. If you run Uber and assign 50% of drivers to a new route-optimization model, the drivers on the old model are competing with drivers on the new model for the same riders. The treatment group affects the control group's outcomes. User-level splits produce biased results.

Switchback testing solves this by treating time windows as the experimental unit instead of users. The entire system alternates between treatment and control on a fixed schedule, for example hourly or daily. Within any given window every user sees the same model, which eliminates cross-group interference. You measure the difference in aggregate metrics between treatment windows and control windows.

The design challenge is carryover effects: if treatment window effects bleed into the following control window, results are biased. Switchback windows must be long enough for effects to clear between switches, and the analysis must account for time-of-day patterns that correlate with window assignments.
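The window assignment and the carryover guard can be sketched as follows. This is a minimal illustration under stated assumptions: windows are assigned at random rather than strictly alternated (so that an arm is not tied to the same hours every day), a fixed burn-in period at the start of each window is discarded, and the function names and event format are hypothetical.

```python
import random
from datetime import datetime, timedelta
from statistics import mean


def make_schedule(start, n_windows, window, rng):
    """Assign each time window to an arm at random (an assumed design choice)."""
    return [(start + i * window, rng.choice(["control", "treatment"]))
            for i in range(n_windows)]


def switchback_effect(events, schedule, window, burn_in):
    """Estimate treatment minus control from (timestamp, metric) events.

    Events in the first `burn_in` of each window are dropped to limit
    carryover. Each window is averaged first, so the window (not the event)
    is the unit of analysis.
    """
    per_window = {}
    for ts, value in events:
        for window_start, arm in schedule:
            if window_start + burn_in <= ts < window_start + window:
                per_window.setdefault((window_start, arm), []).append(value)
                break
    by_arm = {"control": [], "treatment": []}
    for (_, arm), values in per_window.items():
        by_arm[arm].append(mean(values))
    # Raises statistics.StatisticsError if an arm has no usable windows.
    return mean(by_arm["treatment"]) - mean(by_arm["control"])


# Hypothetical two-day schedule with hourly windows.
schedule = make_schedule(datetime(2024, 1, 1), n_windows=48,
                         window=timedelta(hours=1), rng=random.Random(7))
print(schedule[:3])
```

This estimator is a plain difference of window means. A production analysis would also adjust for time-of-day and day-of-week effects, for instance by comparing windows within the same hour slot.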

Multi-armed bandits: minimising regret

Classic A/B allocates traffic 50/50 until the experiment ends, then switches everyone to the winner. For the duration of the experiment, half of your users are on whichever variant turns out to be worse. Multi-armed bandits continuously update traffic allocation as results come in, shifting more traffic toward the better-performing variant while maintaining enough exploration to keep learning about the others.

For AI systems, MAB is appropriate when you have multiple known variants — different prompt versions, different model router configurations — and cannot afford to waste traffic on weak variants. The downside: the non-uniform traffic allocation makes clean frequentist significance testing harder. Bayesian approaches (Thompson Sampling) are more natural for MAB analysis.
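Thompson Sampling for a binary outcome (for example, "task completed" or not) can be sketched with a Beta posterior per variant. This is a minimal illustration, assuming a uniform Beta(1, 1) prior and immediate binary feedback. The class name, variant names, and success rates in the simulation are hypothetical.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a fixed set of variants."""

    def __init__(self, arms, rng=None):
        self.rng = rng or random.Random()
        self.successes = {arm: 0 for arm in arms}
        self.failures = {arm: 0 for arm in arms}

    def choose(self):
        # Draw one plausible success rate per arm from its posterior,
        # then serve the arm whose draw is highest.
        draws = {
            arm: self.rng.betavariate(self.successes[arm] + 1,
                                      self.failures[arm] + 1)
            for arm in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, arm, success):
        if success:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the sampler against made-up true success rates; return pull counts."""
    rng = random.Random(seed)
    sampler = ThompsonSampler(list(true_rates), rng)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = sampler.choose()
        pulls[arm] += 1
        sampler.update(arm, rng.random() < true_rates[arm])
    return pulls


# Hypothetical prompt variants with made-up success rates.
print(simulate({"prompt_v1": 0.2, "prompt_v2": 0.7}, rounds=2000))
```

Traffic drifts toward the stronger variant as evidence accumulates, while the posterior draws keep a small amount of exploration on the weaker one. The per-arm posteriors can be reported directly, which sidesteps the fixed-allocation assumption behind a standard significance test.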

Permanent holdouts: the experiment most teams skip

After six months of running A/B experiments and declaring winners, most teams cannot answer: has our AI product actually improved? Each experiment was run in isolation, declared a winner against the previous baseline, and shipped. The baselines keep shifting. There is no fixed reference point.

A permanent holdout is a group of users, typically 5 to 10 percent, who are permanently excluded from all experiments and from every change those experiments ship. This group stays on the original baseline experience. At any point in time, you can compare current product metrics against the holdout and measure the cumulative quality lift from everything you shipped in the last six months. It is the only way to answer the cumulative question.
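Holdout membership has to be stable across sessions and deployments. One common way to get that is to hash the user ID into a bucket. The sketch below assumes string user IDs and a salt that never changes; the function name, salt value, and bucket count are illustrative.

```python
import hashlib


def in_permanent_holdout(user_id: str, holdout_pct: float = 5.0,
                         salt: str = "permanent-holdout-v1") -> bool:
    """Deterministically place about holdout_pct percent of users in the holdout.

    The same user_id and salt always give the same answer, so membership
    survives restarts and redeploys. Changing the salt reshuffles everyone,
    which would destroy the fixed reference point.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < holdout_pct * 100


# Check membership before routing a user into any experiment or new feature.
print(in_permanent_holdout("user-12345"))
```

Every experiment and feature flag would consult this check first and serve the baseline experience to holdout users.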

The ethical tension is real: by design, users in the holdout are denied every improvement you ship, so their experience is deliberately worse than it could be. For most consumer products, this is an acceptable trade-off. For safety-critical systems, it may not be. Decide this explicitly before setting up the holdout.

The five ways classic A/B breaks for AI

The A/B Testing for AI Systems module walks through all five strategies with concrete scenario matching — which approach for which situation, and the anti-patterns that indicate you picked the wrong one.
