Statistical Testing for ML: McNemar's Test, Multiple Comparisons, and Power Analysis
Why t-tests are wrong for comparing classifiers. McNemar's test for paired classifier comparison, Bonferroni correction when you're testing 20 model variants, and power analysis to know before you run the experiment whether you'll have enough data to detect the difference.
Your Model Improved. Did It Actually Improve?
A model's offline metric goes from 0.82 to 0.84. Is this real? It depends on how much variance there is in your measurement, how many comparisons you've made, and whether your test set is representative. Applied Scientist interviews go hard on statistical testing because bad statistical practice is a common reason models look good offline but fail in production.
The Null Hypothesis Framework
Null hypothesis H₀: there is no difference between model A and model B. You compute a test statistic from your data. The p-value is the probability of observing a test statistic at least as extreme as yours if H₀ were true. If p < α (typically 0.05), you reject H₀ and conclude the difference is statistically significant. If p ≥ α, you fail to reject H₀ — but this does NOT mean the models are equivalent. You might just not have enough data to detect the difference.
Comparing Two Models on the Same Test Set
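McNemar's test: compare the error patterns of two classifiers on the same examples, treating each prediction as simply correct or incorrect. Build a 2×2 table: how many examples did A get right and B wrong? How many did B get right and A wrong? Examples where both models are right or both are wrong carry no information about which model is better, so only these two discordant counts enter the statistic. The test asks whether the disagreements are symmetric. If they're not, one model is systematically better. This is more powerful than an unpaired comparison of accuracies because it uses the paired structure of the data.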
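from scipy.stats import chi2
import numpy as np

def mcnemar_test(model_a_correct, model_b_correct):
    """
    model_a_correct: boolean array, True if model A correct on example i
    model_b_correct: boolean array, True if model B correct on example i
    """
    model_a_correct = np.asarray(model_a_correct, dtype=bool)
    model_b_correct = np.asarray(model_b_correct, dtype=bool)
    # Count the two discordant cells of the 2x2 contingency table
    a_right_b_wrong = np.sum(model_a_correct & ~model_b_correct)  # b
    a_wrong_b_right = np.sum(~model_a_correct & model_b_correct)  # c
    n_discordant = a_right_b_wrong + a_wrong_b_right
    if n_discordant == 0:
        # The models never disagree, so there is no evidence of a difference
        return 0.0, 1.0
    # McNemar statistic (with continuity correction)
    statistic = (abs(a_right_b_wrong - a_wrong_b_right) - 1)**2 / n_discordant
    p_value = chi2.sf(statistic, df=1)  # upper tail, chi-square with 1 df
    return statistic, p_value
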
# Paired t-test for continuous metrics (e.g., RMSE per user)
from scipy.stats import ttest_rel
per_user_metric_a = [...] # RMSE for each user from model A
per_user_metric_b = [...] # RMSE for each user from model B
statistic, p_value = ttest_rel(per_user_metric_a, per_user_metric_b)
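The chi-square form of McNemar's test is an approximation, and a common rule of thumb is to switch to the exact binomial version when there are fewer than about 25 discordant examples. The sketch below is one way to do that; the function name `mcnemar_exact` and the toy arrays are illustrative assumptions, not part of the original code.

```python
import numpy as np
from scipy.stats import binom

def mcnemar_exact(model_a_correct, model_b_correct):
    """Two-sided exact McNemar p-value (hypothetical helper).

    Under H0, each discordant example is equally likely to favor A or B,
    so the smaller discordant count follows Binomial(n_discordant, 0.5).
    """
    a = np.asarray(model_a_correct, dtype=bool)
    b = np.asarray(model_b_correct, dtype=bool)
    a_right_b_wrong = int(np.sum(a & ~b))
    a_wrong_b_right = int(np.sum(~a & b))
    n_discordant = a_right_b_wrong + a_wrong_b_right
    if n_discordant == 0:
        return 1.0  # the models never disagree
    k = min(a_right_b_wrong, a_wrong_b_right)
    # Double the lower tail for a two-sided test, capped at 1
    return float(min(1.0, 2.0 * binom.cdf(k, n_discordant, 0.5)))

# Toy data: 10 examples only A gets right, 2 only B gets right, 88 both right
a_correct = np.array([True] * 10 + [False] * 2 + [True] * 88)
b_correct = np.array([False] * 10 + [True] * 2 + [True] * 88)
print(f"exact McNemar p-value: {mcnemar_exact(a_correct, b_correct):.4f}")
```

With only 12 discordant examples, the exact p-value is the safer number to report than the chi-square approximation.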
Multiple Comparisons Problem
You train 20 model variants and compare each to baseline. Even if none is truly better, the probability that at least one appears better by chance (p < 0.05) is 1 - 0.95^20 ≈ 64%, assuming the comparisons are independent. Running many comparisons inflates your false positive rate. Bonferroni correction: require p < α/n, where n is the number of comparisons (here 0.05/20 = 0.0025). This keeps the chance of any false positive below α, but it is conservative. FDR control (Benjamini-Hochberg) instead controls the expected proportion of false positives among the results declared significant, which makes it more powerful than Bonferroni when many comparisons are made.
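As a rough illustration of the two corrections, here is a minimal NumPy sketch. The function names and the five example p-values are made up for demonstration, and the family-wise calculation assumes independent tests.

```python
import numpy as np

def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 only where p < alpha / (number of comparisons)."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank passing its threshold
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject

# Hypothetical p-values from comparing five variants against a baseline
p_values = [0.001, 0.008, 0.012, 0.03, 0.2]
print("FWER for 20 tests:", round(family_wise_error_rate(0.05, 20), 3))
print("Bonferroni rejects:", int(bonferroni_reject(p_values).sum()))
print("Benjamini-Hochberg rejects:", int(benjamini_hochberg_reject(p_values).sum()))
```

On these example values Bonferroni keeps fewer variants than Benjamini-Hochberg, which is the trade-off described above: stricter control of any false positive versus control of the proportion of false positives.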
Statistical Power and Sample Size
Power = probability of detecting a real effect if one exists (= 1 - probability of false negative). Typical target: 0.8. Power depends on: effect size (how big is the real difference?), sample size (more data → more power), significance threshold α, and variance. Before running an experiment: power analysis tells you how much data you need to detect the effect size you care about. Running underpowered experiments wastes resources and produces false negatives.
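A back-of-the-envelope power calculation might look like the sketch below. It assumes an unpaired two-proportion z-test with a normal approximation, which is a simplification: for a paired test such as McNemar's, the required size depends on how often the two models disagree. The function names and the 0.82 vs 0.84 accuracies are illustrative.

```python
import math
from scipy.stats import norm

def required_n_per_model(acc_a, acc_b, alpha=0.05, power=0.8):
    """Approximate test-set size needed to detect acc_a vs acc_b
    (two-sided, unpaired two-proportion z-test; an assumption)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    pooled = (acc_a + acc_b) / 2
    sd_null = math.sqrt(2 * pooled * (1 - pooled))
    sd_alt = math.sqrt(acc_a * (1 - acc_a) + acc_b * (1 - acc_b))
    n = ((z_alpha * sd_null + z_power * sd_alt) / abs(acc_b - acc_a)) ** 2
    return math.ceil(n)

def approximate_power(acc_a, acc_b, n, alpha=0.05):
    """Approximate power of the same test with n examples per model."""
    z_alpha = norm.ppf(1 - alpha / 2)
    pooled = (acc_a + acc_b) / 2
    sd_null = math.sqrt(2 * pooled * (1 - pooled))
    sd_alt = math.sqrt(acc_a * (1 - acc_a) + acc_b * (1 - acc_b))
    z = (abs(acc_b - acc_a) * math.sqrt(n) - z_alpha * sd_null) / sd_alt
    return float(norm.cdf(z))

n_needed = required_n_per_model(0.82, 0.84)
print("examples needed for 80% power:", n_needed)
print("power with 500 examples:", round(approximate_power(0.82, 0.84, 500), 2))
```

Under these assumptions, detecting a 0.82 to 0.84 improvement with 80% power takes on the order of 5,500 examples, while a 500-example test set has well under 20% power.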
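The most common statistical mistake in ML: reporting 'model A is 2% better than model B on our test set' with no statistical test. With a test set of 500 examples, a 2-point accuracy difference is usually not statistically significant (p is roughly 0.3 to 0.4 for an unpaired comparison at accuracies in the 80-90% range). With 10,000 examples, the same difference is highly significant (p < 0.001). The exact values depend on the baseline accuracy and, for a paired test, on how often the two models disagree. Always report confidence intervals or p-values alongside metric improvements.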
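To see how test-set size changes the picture, the sketch below runs an unpaired two-proportion z-test and a 95% confidence interval on the same 2-point gap. It assumes accuracies of 0.82 and 0.84 (the figures from the opening example) and treats the two models as independent samples, which ignores pairing; the exact p-value shifts with the baseline accuracy and the test used. The helper name is hypothetical.

```python
import math
from scipy.stats import norm

def two_proportion_summary(acc_a, acc_b, n):
    """Two-sided p-value and 95% CI for acc_b - acc_a, n examples per model.

    Unpaired normal approximation (an assumption of this sketch).
    """
    diff = acc_b - acc_a
    pooled = (acc_a + acc_b) / 2
    # Pooled standard error for the test of H0: no difference
    se_pooled = math.sqrt(pooled * (1 - pooled) * 2 / n)
    p_value = float(2 * norm.sf(abs(diff) / se_pooled))
    # Unpooled standard error for the confidence interval
    se_unpooled = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    half_width = float(norm.ppf(0.975)) * se_unpooled
    return p_value, (diff - half_width, diff + half_width)

for n in (500, 10_000):
    p_value, (low, high) = two_proportion_summary(0.82, 0.84, n)
    print(f"n={n}: p={p_value:.4f}, 95% CI for the gain = [{low:.3f}, {high:.3f}]")
```

At 500 examples the interval comfortably includes zero; at 10,000 it does not, even though the point estimate of the gain is identical.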