Foundations & Architecture 10 min read

Experimental Design for ML: Ablation Studies, Baselines, and What Makes a Result Credible

How to design an ML experiment that produces trustworthy results. Ablation study requirements (one variable at a time, leave-one-out design), choosing the right baseline, multiple seeds for variance, and the common credibility killers that get papers rejected.

Ablation Studies Are Arguments, Not Experiments

An ablation study removes components of a system one at a time and measures the performance impact. The goal is to attribute performance gains to specific design choices. But an ablation study is only as good as the claim it's making. A bad ablation: 'we added components A, B, C and performance improved 5%.' A good ablation: 'component B accounts for 3.2% of the 5% improvement (±0.4%, p=0.02), component A accounts for 1.1% (±0.3%, p=0.01), and component C has no statistically significant effect (p=0.31).' The good version is an argument with evidence. The bad version is a story.

Designing a Clean Ablation

The key requirement: each ablation should change exactly one thing. If you change two things simultaneously, you can't attribute the performance difference. Common violations: removing component A requires also changing the architecture (now you're measuring A + architecture change). Removing A changes the number of parameters (now you're measuring A + capacity change). Good ablation design controls for these confounders explicitly.

Isolate the variable: remove only what you're testing. If removing a component changes model size, add a dummy component to restore the parameter count.
Run multiple seeds: a single run has high variance. Report mean and standard deviation across 3-5 random seeds for stable metrics.
Use the same test set: changing the test set between ablations introduces evaluation variance. Freeze the test set before any ablations.
Report the right metric: ablate on the metric that matters for your research claim, not the easiest metric to improve.

Baselines: The Other Half of the Argument

Your proposed method needs to be compared against the right baselines. A strong baseline makes a convincing paper and a convincing interview answer. Common mistakes: (1) Weak baselines — comparing against a method from 5 years ago that newer work has superseded. (2) Poorly tuned baselines — your method with extensive hyperparameter search vs. baseline with default settings. (3) Missing oracle baseline — a baseline that represents an upper bound (e.g., using ground-truth features instead of predicted features) that tells you how much headroom exists.

Hyperparameter Search and Reporting

If you searched over hyperparameters, you must report the search strategy and the final values used. Grid search over {0.001, 0.01, 0.1} is a 3-point search. Random search over log-uniform(1e-4, 1e-1) is more thorough. Bayesian optimization (Optuna, Ray Tune) is the production standard for expensive experiments. Critical: use a separate validation set for hyperparameter search and a held-out test set for final evaluation. Using the test set for hyperparameter tuning inflates reported metrics.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    hidden_size = trial.suggest_categorical("hidden_size", [128, 256, 512])
    
    model = build_model(hidden_size=hidden_size, dropout=dropout)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    
    # Train on train set, evaluate on VALIDATION set
    val_metric = train_and_evaluate(model, optimizer, train_data, val_data)
    return val_metric  # Optuna maximizes this

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
best_params = study.best_params

# Only now evaluate on test set — once, with best_params
final_model = build_model(**best_params)
test_metric = evaluate(final_model, test_data)

What Applied Scientist Interviewers Are Testing

The interviewer isn't asking you to memorize statistical tests. They're testing whether you understand the difference between a result and evidence of a result. 'Our model improved' is not a research finding. 'Our model improved by X on metric Y, statistically significant at p=0.02 over N experiments, with ablations showing component Z accounts for most of the improvement' is a research finding. The discipline of experimental design is what separates applied research from exploratory hacking.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →