
A/B Testing LLM Systems: Statistical Significance and Evaluation Metrics

How to run controlled experiments on LLM outputs, which metrics to use (win-rate, NDCG, preference), and how to avoid common A/B traps.

Congratulations — you have two prompts and no idea which one is better. You've eyeballed 20 examples, your teammate prefers the other one, and your PM wants a number. This is the moment that separates rigorous AI teams from teams flying blind.

A/B testing LLM systems is harder than testing a button colour. The output is text — subjective, variable, and impossible to compare with a simple ==. Here's how to do it right.

Why LLM A/B tests are different

Non-deterministic: the same prompt produces different outputs. Run each condition multiple times.
No ground truth: 'better' is defined by a rubric, not an exact match.
Correlated samples: both variants are scored on the same inputs (or shown to the same users), so an independent-samples t-test isn't valid; use a paired test.
Multivariate confounds: model, prompt, temperature, and context all affect quality, so change one at a time or you can't attribute the difference.
Latency and cost are dimensions too — a 'better' response that's 2× slower may be worse for your users

Offline A/B testing: the right default

Before touching production traffic, run your A/B test offline on a golden eval set. Prepare 200–500 representative inputs. Run both variants on every input. Score both outputs with an LLM judge. Compare mean scores and test for statistical significance. Only put the better-performing variant in production.

import numpy as np
from scipy import stats

def run_ab_eval(eval_set, variant_a, variant_b, judge, alpha=0.05):
    """Score two variants on the same inputs and compare them with a paired t-test."""
    scores_a, scores_b = [], []
    for example in eval_set:
        # Both variants see the same input, so the two score lists stay aligned.
        out_a = variant_a(example["input"])
        out_b = variant_b(example["input"])
        scores_a.append(judge(example, out_a)["score"])
        scores_b.append(judge(example, out_b)["score"])

    mean_a = float(np.mean(scores_a))
    mean_b = float(np.mean(scores_b))

    # Paired t-test (same inputs, so samples are paired).
    # p_value can be NaN if the two score lists are identical; NaN < alpha
    # is False, so that case falls through to "inconclusive".
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    significant = bool(p_value < alpha)

    if not significant:
        winner = "inconclusive"
    else:
        winner = "B" if mean_b > mean_a else "A"

    return {
        "mean_a": mean_a, "mean_b": mean_b,
        "delta": mean_b - mean_a,
        "p_value": float(p_value),
        "significant": significant,
        "winner": winner,
    }

p < 0.05 is a starting point, not a finish line. With a small eval set (<100 examples), even a real difference may not reach significance. With a large set (>2000), differences too small to matter can still come out 'significant'. Always look at the effect size (delta), not just the p-value.
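
To put a number on effect size, here is a minimal sketch that takes the two paired score lists and reports the mean difference, a standardized effect size for paired data, and a bootstrap 95% confidence interval on the difference. The function name paired_effect_size, the 10,000 resamples, and the fixed seed are illustrative assumptions, not part of any library.

```python
import numpy as np

def paired_effect_size(scores_a, scores_b, n_boot=10_000, seed=0):
    """Summarize how large the B - A difference is on paired scores.

    Hypothetical helper: the name and defaults are illustrative.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size < 2:
        raise ValueError("need two equal-length 1-D score lists with >= 2 items")

    diffs = b - a                      # per-example difference (paired)
    mean_diff = float(diffs.mean())
    sd_diff = float(diffs.std(ddof=1))
    # Standardized mean difference for paired data (mean / sd of differences).
    d_z = mean_diff / sd_diff if sd_diff > 0 else float("nan")

    # Percentile bootstrap: resample examples with replacement and
    # recompute the mean difference each time.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    return {
        "mean_diff": mean_diff,
        "d_z": d_z,
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
    }
```

A reasonable way to read the result: if the whole interval sits inside a range you would consider negligible for your rubric, the change is probably not worth shipping even when the p-value is small.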

Online A/B testing: when you need it

Online testing — splitting live traffic — is necessary when: you need real user behaviour signals (engagement, task completion, thumbs-up rate), your task is too subjective to eval offline reliably, or you need to measure business metrics alongside quality. Use canary deployment: route 5% of traffic to variant B, monitor for 48–72 hours, check both quality metrics and error rates.
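
One common way to implement the traffic split is deterministic hashing, so a given user always lands in the same variant across requests. The sketch below is one possible approach under stated assumptions, not a prescribed method: assign_variant, the experiment name, and string user IDs are all hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   treatment_fraction: float = 0.05) -> str:
    """Deterministically map a user to variant "A" or "B".

    Hypothetical helper: hashing experiment + user_id gives a stable bucket,
    and a different experiment name reshuffles users independently.
    """
    if not 0.0 <= treatment_fraction <= 1.0:
        raise ValueError("treatment_fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < treatment_fraction else "A"
```

Logging the assigned variant next to each quality and error metric keeps the later comparison straightforward.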

Measuring semantic similarity

For cases where outputs should be similar (same content, just better phrased), cosine similarity between embeddings of variant A and B outputs can surface regressions. If variant B produces outputs with cosine similarity < 0.85 to variant A, something substantive has probably changed and the example is worth manual review. Treat 0.85 as a rule of thumb: the right cutoff depends on your embedding model and output length.
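
A minimal sketch of that check, assuming you already have one embedding vector per output from whatever embedding model you use (the embedding step itself is not shown). The names cosine_similarity and flag_for_review are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)

def flag_for_review(embeddings_a, embeddings_b, threshold=0.85):
    """Return indices of examples whose A and B outputs drifted apart.

    embeddings_a[i] and embeddings_b[i] must come from the same input.
    The 0.85 default mirrors the rule of thumb in the text.
    """
    return [
        i
        for i, (ea, eb) in enumerate(zip(embeddings_a, embeddings_b))
        if cosine_similarity(ea, eb) < threshold
    ]
```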

The interrater reliability problem

LLM judges are not perfectly consistent. Run the same (example, output) pair through your judge 5 times and check variance. High variance means your judge rubric is underspecified. Tighten the rubric with specific criteria and examples until the judge's variance is low enough to trust.
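
The repeat-and-check step can be sketched as follows, assuming the same judge(example, output) callable used in run_ab_eval, which returns a dict with a "score" key. The helper name judge_consistency is hypothetical, and what counts as 'low enough' variance is a judgment call for your own rubric and score scale.

```python
import statistics

def judge_consistency(judge, example, output, n_runs=5):
    """Score one (example, output) pair several times and summarize the spread.

    Hypothetical helper: judge is any callable returning {"score": number}.
    """
    if n_runs < 2:
        raise ValueError("need at least 2 runs to measure variance")
    scores = [judge(example, output)["score"] for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),   # sample standard deviation
        "range": max(scores) - min(scores),
    }
```

Running this over a few dozen pairs from the eval set, rather than a single one, gives a better picture of where the rubric is underspecified.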
