A/B Testing LLM Systems: Statistical Significance and Evaluation Metrics
How to run controlled experiments on LLM outputs, which metrics to use (win-rate, NDCG, preference), and how to avoid common A/B traps.
Congratulations — you have two prompts and no idea which one is better. You've eyeballed 20 examples, your teammate prefers the other one, and your PM wants a number. This is the moment that separates rigorous AI teams from teams flying blind.
A/B testing LLM systems is harder than testing a button colour. The output is text — subjective, variable, and impossible to compare with a simple ==. Here's how to do it right.
Why LLM A/B tests are different
- Non-deterministic: the same prompt produces different outputs. Run each condition multiple times.
- No ground truth: 'better' is defined by a rubric, not an exact match
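- Correlated samples: both variants are scored on the same inputs (or shown to the same users), so the two sets of scores are not independent. An independent-samples t-test is the wrong tool; use a paired test.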
- Multivariate confounds: model, prompt, temperature, and context all change quality simultaneously
- Latency and cost are dimensions too — a 'better' response that's 2× slower may be worse for your users
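To make the first point concrete, here is a minimal sketch of averaging over repeated generations for a single input. The names `mean_score_over_runs`, `variant`, and `score` are hypothetical stand-ins for whatever generation and scoring functions you already have, and the default of 5 runs is an arbitrary assumption.

```python
import statistics
from typing import Callable, Dict


def mean_score_over_runs(
    variant: Callable[[str], str],
    score: Callable[[str, str], float],
    user_input: str,
    n_runs: int = 5,
) -> Dict[str, float]:
    """Call a non-deterministic variant n_runs times on one input and summarise the scores.

    `variant` and `score` are hypothetical callables: `variant(input)` returns
    the model output, `score(input, output)` returns a number.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    scores = [score(user_input, variant(user_input)) for _ in range(n_runs)]
    return {
        "mean": statistics.fmean(scores),
        # The sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "n_runs": n_runs,
    }
```

A large `stdev` relative to the gap between variants suggests that a single run per input will not separate them reliably.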
Offline A/B testing: the right default
Before touching production traffic, run your A/B test offline on a golden eval set. Prepare 200–500 representative inputs. Run both variants on every input. Score both outputs with an LLM judge. Compare mean scores and test for statistical significance. Only put the better-performing variant in production.
import numpy as np
import scipy.stats as stats


def run_ab_eval(eval_set, variant_a, variant_b, judge):
    """Score both variants on every input and compare them with a paired t-test."""
    scores_a, scores_b = [], []
    for example in eval_set:
        out_a = variant_a(example["input"])
        out_b = variant_b(example["input"])
        scores_a.append(judge(example, out_a)["score"])
        scores_b.append(judge(example, out_b)["score"])

    mean_a = float(np.mean(scores_a))
    mean_b = float(np.mean(scores_b))

    # Paired t-test (same inputs, so samples are paired).
    # If every pair of scores is identical, p_value is NaN and the result is "inconclusive".
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    significant = bool(p_value < 0.05)

    if significant and mean_b > mean_a:
        winner = "B"
    elif significant and mean_a > mean_b:
        winner = "A"
    else:
        winner = "inconclusive"

    return {
        "mean_a": mean_a, "mean_b": mean_b,
        "delta": mean_b - mean_a,
        "p_value": float(p_value),
        "significant": significant,
        "winner": winner,
    }
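p < 0.05 is a starting point, not a finish line. With a small eval set (fewer than 100 examples), a real difference may not reach significance because the test lacks power. With a large set (more than 2,000), tiny, practically meaningless differences will come out 'significant'. Always look at the effect size (delta) and decide up front how large a delta is worth shipping, rather than reading the p-value alone.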
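One way to look past the p-value is to report the size of the paired difference with an interval around it. The sketch below is an illustration under assumptions, not part of the original pipeline: `paired_effect_size` is a hypothetical helper, the percentile bootstrap is one choice among several interval methods, and `n_boot=2000` is an arbitrary default.

```python
import numpy as np


def paired_effect_size(scores_a, scores_b, n_boot=2000, seed=0):
    """Summarise the per-example difference (B minus A) for paired scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size < 2:
        raise ValueError("need two 1-D score lists of equal length (>= 2)")

    diffs = b - a
    mean_diff = diffs.mean()
    sd = diffs.std(ddof=1)
    # Standardised paired effect size: mean difference / SD of the differences.
    d_z = mean_diff / sd if sd > 0 else float("nan")

    # Percentile bootstrap: resample examples with replacement and
    # recompute the mean difference each time.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    return {
        "mean_diff": float(mean_diff),
        "d_z": float(d_z),
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
        # Share of examples where B scored strictly higher than A.
        "win_rate_b": float(np.mean(diffs > 0)),
    }
```

If the interval excludes zero but sits entirely below the smallest delta you would act on, the result is statistically significant and still not worth shipping.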
Online A/B testing: when you need it
Online testing — splitting live traffic — is necessary when: you need real user behaviour signals (engagement, task completion, thumbs-up rate), your task is too subjective to eval offline reliably, or you need to measure business metrics alongside quality. Use canary deployment: route 5% of traffic to variant B, monitor for 48–72 hours, check both quality metrics and error rates.
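A minimal sketch of the two mechanical pieces of a canary: sticky traffic assignment and a significance check on a binary signal such as thumbs-up rate. Both function names are hypothetical, hashing the user ID is one common way to keep a user in the same arm across requests, and the two-proportion z-test assumes independent users and reasonably large counts.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, b_fraction: float = 0.05) -> str:
    """Deterministically route a fixed share of users to variant B."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < b_fraction else "A"


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> dict:
    """Two-sided z-test for a difference between two rates (e.g. thumbs-up rate)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms are all-success or all-failure: no evidence of a difference.
        return {"delta": p_b - p_a, "z": 0.0, "p_value": 1.0}
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"delta": p_b - p_a, "z": z, "p_value": p_value}
```

Including the experiment name in the hash means a user who lands in B for one experiment is not automatically in B for the next one.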
Measuring semantic similarity
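For cases where outputs should be similar (same content, just better phrased), cosine similarity between embeddings of variant A and B outputs can surface regressions. If variant B's output has cosine similarity below roughly 0.85 to variant A's output for the same input, something substantive probably changed and the pair is worth manual review. The right cutoff depends on the embedding model, so calibrate it on pairs you have already reviewed.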
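The check can be sketched as follows, assuming you have already embedded both sets of outputs with the same model and stored them row-aligned (row i of each array is the same input). `flag_semantic_drift` is a hypothetical helper; how the embeddings are produced is left out on purpose.

```python
import numpy as np


def flag_semantic_drift(emb_a, emb_b, threshold=0.85):
    """Return per-pair cosine similarities and the indices that fall below threshold."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two 2-D arrays of identical shape")

    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    if np.any(norms == 0):
        raise ValueError("zero-length embedding; cosine similarity is undefined")

    # Row-wise dot product divided by the product of the row norms.
    sims = np.sum(a * b, axis=1) / norms
    flagged = np.flatnonzero(sims < threshold).tolist()
    return sims, flagged
```

The returned indices point back into the eval set, so the flagged pairs can be pulled up for manual review.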
The interrater reliability problem
LLM judges are not perfectly consistent. Run the same (example, output) pair through your judge 5 times and check variance. High variance means your judge rubric is underspecified. Tighten the rubric with specific criteria and examples until the judge's variance is low enough to trust.
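A minimal sketch of that consistency check, assuming the judge has the same call shape as in the eval code above (`judge(example, output)` returning a dict with a `"score"` key). `judge_consistency` is a hypothetical name, and what counts as "low enough" variance is left to you.

```python
import statistics


def judge_consistency(judge, example, output, n_runs=5):
    """Score one fixed (example, output) pair repeatedly and summarise the spread."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    scores = [judge(example, output)["score"] for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.fmean(scores),
        # The sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "range": max(scores) - min(scores),
    }
```

Running this over a few dozen pairs and sorting by `stdev` shows which kinds of output the rubric handles least consistently.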