Reading ML Papers Critically: Baselines, Contamination, Missing Ablations, and Compute Opacity

The 5 questions to ask immediately when reading any ML paper. How to spot under-tuned baselines, test set contamination, missing ablations, and compute opacity. The 20-minute reading protocol used by researchers at frontier labs.

How to Read an ML Paper's Evaluation Section — and Spot What's Missing

Research taste is the ability to distinguish papers that advance the field from papers that advance careers. Both get published. Senior AI engineers at frontier labs and AI-native startups are expected to have this skill. Most engineers don't develop it until they have spent three or four years actively reading papers.

The Five Questions to Ask Immediately

Baseline Selection: The Most Common Manipulation

Authors choose their own baselines, which is a conflict of interest. Too often the strongest baseline in the table is one the method barely beats: weaker baselines get included, while strong ones that would make the method look bad get excluded or minimally tuned.

Red flag: a paper proposes method X and compares against vanilla BERT, AdaGrad, and some 2020 method — but doesn't compare against the obvious 2023 competitor. Either they missed it (unlikely) or it beats them (likely).

What to check: are the baselines from the same year as the paper? Did they use the same data preprocessing, hyperparameter search budget, and compute for baselines? Is there a citation to where the baseline numbers come from — their own run or the original paper? (Original paper numbers are often cherry-picked too.)

Test Set Contamination

If any part of the test data was visible during training, validation, or model selection, the reported numbers are optimistic. This is surprisingly common and sometimes unintentional.
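One rough way to screen for contamination is to look for long n-grams that appear in both the training and test text. The sketch below is an assumption-laden illustration, not a standard tool: the function names, the whitespace tokenization, and the default of 8-token n-grams are all choices made here for the example.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_indices(
    train_docs: Iterable[str], test_docs: List[str], n: int = 8
) -> List[int]:
    """Indices of test documents that share at least one n-gram with training data."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(test_docs) if ngrams(doc, n) & train_grams]


# Toy, made-up documents purely for illustration.
train = ["the quick brown fox jumps over the lazy dog near the river bank"]
test = [
    "Question: the quick brown fox jumps over the lazy dog near what?",
    "an entirely unrelated sentence about gradient descent and learning rates",
]
print(contaminated_indices(train, test, n=8))
```

Exact n-gram matching only catches verbatim leakage. Paraphrased or translated test items pass straight through, so a clean result here is weak evidence, not proof.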

Missing Ablations

An ablation study removes one component at a time to show that each one contributes. If the paper has 5 components and only compares the full model against a version with none of them, you don't know which ones actually matter. The honest version: full model, remove A, remove B, remove C, remove A+B, etc.
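The grid described above (full model, then each component removed, then pairs) can be enumerated mechanically. This is a minimal sketch: `ablation_configs` and the component names "A", "B", "C" are hypothetical, and each resulting config would still need its own training run.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def ablation_configs(
    components: Sequence[str], max_removed: int = 2
) -> List[Dict[str, bool]]:
    """Full model first, then every config with 1..max_removed components off."""
    configs = [{name: True for name in components}]
    for k in range(1, max_removed + 1):
        for removed in combinations(components, k):
            configs.append({name: name not in removed for name in components})
    return configs


for cfg in ablation_configs(["A", "B", "C"]):
    off = [name for name, enabled in cfg.items() if not enabled]
    print("full model" if not off else "remove " + "+".join(off))
```

For a 5-component method with single and pairwise removals this comes to 1 + 5 + 10 = 16 runs, which is a useful number to hold against the handful of rows an ablation table actually shows.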

Common ablation omissions: the ablation is run on a toy dataset rather than the main benchmark; components are ablated only under favorable conditions; or the ablation shows that a component 'helps', but the starting point it was ablated from is already the authors' own non-standard variant.

Statistical Significance in ML Results

A 0.3-point BLEU improvement from a single run is indistinguishable from noise. The same improvement with p < 0.01 over 5 seeds on a 5K-sample test set might be real. Most ML papers don't report the latter.

# What rigorous reporting looks like:
# Method A: 72.3 ± 0.4 (mean ± std, n=5 seeds)
# Method B: 72.8 ± 0.6 (mean ± std, n=5 seeds)
# McNemar's test on paired per-example predictions (held-out test set): p=0.03

# What you usually see:
# Method A: 72.3
# Method B: 72.8 (+0.5)  ← single run, no test, bold in table
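To make the rigorous version concrete, here is a small standard-library sketch of the two pieces: a mean ± std summary across seeds, and an exact two-sided McNemar's test on paired per-example correctness. The function names and the toy inputs are assumptions for illustration, and McNemar's test applies to accuracy-style metrics where each test example is simply right or wrong.

```python
from math import comb
from statistics import mean, stdev
from typing import Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format per-seed scores as mean ± sample standard deviation."""
    return f"{mean(scores):.1f} ± {stdev(scores):.1f} (mean ± std, n={len(scores)} seeds)"


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-example correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    # Only discordant pairs (one system right, the other wrong) carry information.
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = a_only + b_only
    if n == 0:
        return 1.0
    # Under the null, each discordant pair favors either system with probability 0.5.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up example: 62 test items, A alone correct on 2, B alone correct on 10.
a = [True] * 2 + [False] * 10 + [True] * 50
b = [False] * 2 + [True] * 10 + [True] * 50
print(f"p = {mcnemar_exact(a, b):.4f}")
```

Note that the test ignores the examples both systems get right or wrong together, so a large test set with few discordant pairs has much less statistical power than its size suggests.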

Compute Opacity

A method that achieves a 1% improvement with 5× the training compute is usually worse once you normalize by compute budget, yet papers rarely normalize this way. What to look for: reported GPU hours (for example, A100-hours), training time, and parameter count. If none of these appear, treat it as a warning sign that the method is expensive and the authors know it.
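When a paper does report GPU count and training time, a back-of-the-envelope normalization is straightforward. The sketch below uses "score gain per doubling of compute" as the normalization, which is one illustrative choice made here rather than an established convention; the `Run` class, its fields, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from math import log2


@dataclass
class Run:
    name: str
    score: float       # benchmark metric, higher is better
    num_gpus: int
    wall_hours: float  # wall-clock training time

    @property
    def gpu_hours(self) -> float:
        return self.num_gpus * self.wall_hours


def gain_per_doubling(baseline: Run, method: Run) -> float:
    """Score gain per doubling of training compute (illustrative normalization)."""
    ratio = method.gpu_hours / baseline.gpu_hours
    if ratio <= 1.0:
        raise ValueError("method uses no more compute than the baseline; compare scores directly")
    return (method.score - baseline.score) / log2(ratio)


# Made-up numbers mirroring the text: +1 point at 5x the GPU-hours.
baseline = Run("baseline", score=72.0, num_gpus=8, wall_hours=24.0)
method = Run("method X", score=73.0, num_gpus=8, wall_hours=120.0)
print(f"{method.gpu_hours / baseline.gpu_hours:.0f}x compute, "
      f"{gain_per_doubling(baseline, method):.2f} points per doubling")
```

The more informative comparison, where it is affordable, is to give the baseline the same compute budget (longer training, a larger model, or more hyperparameter search) and see whether the gap survives.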

How to Read a Paper Efficiently (20-Minute Protocol)

The Research Taste Interview Format

Interviews at companies like Cohere, Anthropic, and Mistral sometimes hand you a paper and ask 'what do you think of this evaluation?' The answer they want is not a summary; it's a critique. Walk through: what claim are they making, what evidence do they provide, what would make you more or less confident in that evidence, and what experiment would you run to falsify the core claim.

The tell of a strong candidate: they immediately ask 'where did the baseline numbers come from: did the authors re-run the baselines or take the numbers from the original papers?' Weak candidates summarize results. Strong candidates interrogate methodology.
