Reading ML Papers Critically: Baselines, Contamination, Missing Ablations, and Compute Opacity
The 5 questions to ask immediately when reading any ML paper. How to spot under-tuned baselines, test set contamination, missing ablations, and compute opacity. The 20-minute reading protocol used by researchers at frontier labs.
How to Read an ML Paper's Evaluation Section — and Spot What's Missing
Research taste is the ability to distinguish papers that advance the field from papers that advance careers. Both get published. Senior AI engineers at frontier labs and AI-native startups are expected to have this skill. Most engineers don't develop it until their third or fourth year of actively reading papers.
The Five Questions to Ask Immediately
- What baselines did they choose, and why those? A weak baseline makes any method look good.
- Did they tune the baseline as carefully as their method? Most papers don't, and this is where gains evaporate.
- What dataset did they train and test on? If train and test come from the same narrow distribution, or the test set was used for development, the numbers don't transfer.
- Are confidence intervals reported? A 0.5% improvement reported without a standard deviation over at least 3 seeds is indistinguishable from noise.
- Is compute disclosed? A method that requires 10× the training compute for a 2% improvement is usually not an improvement.
Baseline Selection: The Most Common Manipulation
Authors choose their own baselines, which is a conflict of interest. In practice, the strongest baseline in the table is often whichever one the method barely beats: weaker baselines get included, while strong ones that would make the method look bad get excluded or minimally tuned.
Red flag: a paper proposes method X and compares against vanilla BERT, AdaGrad, and some 2020 method — but doesn't compare against the obvious 2023 competitor. Either they missed it (unlikely) or it beats them (likely).
What to check: are the baselines from the same year as the paper? Did they use the same data preprocessing, hyperparameter search budget, and compute for baselines? Is there a citation to where the baseline numbers come from — their own run or the original paper? (Original paper numbers are often cherry-picked too.)
Test Set Contamination
If any part of the test data was visible during training, validation, or model selection, the reported numbers are optimistic. This is surprisingly common and sometimes unintentional.
- Data leakage: test examples similar or identical to training examples. Common when datasets are scraped from the web and the test set was published years earlier.
- Benchmark contamination in LLMs: models trained on internet text have almost certainly seen benchmark questions. MMLU and HumanEval contamination is documented. Any LLM paper reporting MMLU scores should be read skeptically.
- Development overfitting: multiple rounds of hyperparameter tuning on validation performance effectively makes the validation set part of training. The held-out test set should be touched exactly once.
- Selection bias: reporting only the best run out of 10 seeds. The right thing is median or mean ± std over 3-5 seeds.
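One rough way to probe for the first two failure modes is to look for long word n-grams shared between training and test text. The sketch below is only an illustration: the function names, the whitespace tokenization, and the choice of 8-word n-grams are assumptions, not a standard tool, and it catches verbatim overlap only, not paraphrases.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    # Texts shorter than n words produce an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(train_docs, test_docs, n=8):
    """Return (fraction, flagged) where flagged are the test docs that
    share at least one n-gram with any training doc."""
    if not test_docs:
        return 0.0, []
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [doc for doc in test_docs if ngrams(doc, n) & train_grams]
    return len(flagged) / len(test_docs), flagged


# Hypothetical toy data: the first test item overlaps the training text verbatim.
train = ["the quick brown fox jumps over the lazy dog near the river bank"]
test = [
    "the quick brown fox jumps over the lazy dog today",
    "a completely different sentence about gradient descent and learning rates here",
]
fraction, flagged = contaminated_fraction(train, test)
print(f"{fraction:.0%} of test items share an 8-gram with training data")
```

If a paper describes no check of this kind for a web-scraped training set, treat its benchmark numbers as an upper bound.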
Missing Ablations
An ablation study removes one component at a time to show each contributes. If the paper has 5 components and only shows the full model vs. no components, you don't know which ones actually matter. The honest version: full model, remove A, remove B, remove C, remove A+B, etc.
Common ablation omissions: the ablation is run on a toy dataset rather than the main benchmark; it is run only under favorable conditions; or a component 'helps' only relative to a baseline that is already the authors' own non-standard variant.
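To see how many runs the honest version implies, it helps to enumerate the configurations. This is a minimal sketch; `ablation_configs` and the component names are hypothetical, and capping removals at two components is an assumption made to keep the grid small.

```python
from itertools import combinations


def ablation_configs(components, max_removed=2):
    """Yield (label, kept_components) for the full model and for every
    removal of up to max_removed components."""
    yield "full", tuple(components)
    for k in range(1, max_removed + 1):
        for removed in combinations(components, k):
            kept = tuple(c for c in components if c not in removed)
            yield "remove " + "+".join(removed), kept


configs = list(ablation_configs(["A", "B", "C"]))
for label, kept in configs:
    print(f"{label:12s} -> trains with {kept}")
```

With 5 components and removals of up to two at a time, this gives 1 + 5 + 10 = 16 configurations. A paper that reports only two of them (full model and no components) leaves most of the table blank.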
Statistical Significance in ML Results
A 0.3-point BLEU improvement from one run is noise. A 0.3-point improvement with p < 0.01 over 5 seeds on a 5K-sample test set might be real. Most ML papers don't report that level of detail.
# What rigorous reporting looks like:
# Method A: 72.3 ± 0.4 (mean ± std, n=5 seeds)
# Method B: 72.8 ± 0.6 (mean ± std, n=5 seeds)
# McNemar's test p=0.03 on held-out test set
# What you usually see:
# Method A: 72.3
# Method B: 72.8 (+0.5) ← single run, no test, bold in table
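The two ingredients of the rigorous version can be sketched with the standard library alone. The per-seed scores and correctness lists below are made up for illustration, and the function names are hypothetical. The McNemar test here is the exact binomial form, which compares two models' paired predictions on the same test examples; it is one reasonable choice for classification-style metrics, not the only valid test.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness.

    Only discordant pairs matter:
      b = A right, B wrong;  c = A wrong, B right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-seed scores for one method.
mean, std = summarize([72.1, 72.6, 71.9, 72.5, 72.4])
print(f"{mean:.1f} ± {std:.1f} (mean ± std, n=5 seeds)")

# Hypothetical paired correctness on 30 test examples.
correct_a = [True] * 1 + [False] * 9 + [True] * 20
correct_b = [False] * 1 + [True] * 9 + [True] * 20
p_value = mcnemar_exact(correct_a, correct_b)
print(f"McNemar exact p = {p_value:.3f}")
```

Note that the test needs per-example predictions, which is another reason to prefer papers that release model outputs and not just aggregate scores.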
Compute Opacity
A method that achieves 1% improvement with 5× the training compute is usually worse when you normalize by compute budget. Papers rarely normalize this way. What to look for: GPU hours or A100-hours reported, training time reported, parameter count reported. If none of these appear, the method is probably expensive and the authors know it.
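One back-of-the-envelope way to normalize is to ask how much improvement each doubling of compute bought. The metric below is an illustrative assumption, not a standard one, and `gain_per_compute_doubling` is a hypothetical helper; the fair comparison is against what the baseline itself gains when given the same extra budget.

```python
import math


def gain_per_compute_doubling(baseline_score, method_score, compute_ratio):
    """Score improvement divided by the number of compute doublings spent.

    compute_ratio is the method's training compute divided by the baseline's.
    """
    if compute_ratio <= 1:
        raise ValueError("compute_ratio must be > 1 for this normalization")
    return (method_score - baseline_score) / math.log2(compute_ratio)


# The example from the text: +1 point at 5x the training compute.
gain = gain_per_compute_doubling(72.0, 73.0, 5)
print(f"{gain:.2f} points per doubling of compute")
```

If simply training the baseline longer or scaling it up yields more per doubling than this, the new method is losing at equal budget.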
How to Read a Paper Efficiently (20-Minute Protocol)
- Minutes 0-3: Abstract + conclusion. What did they claim to do and what did they claim to show?
- Minutes 3-8: Figure 1 + main results table. What's the headline number and against what baseline?
- Minutes 8-14: Experimental setup section. Dataset, baselines, evaluation metric, how many seeds, is the test set clean?
- Minutes 14-19: Ablation table. Does removing each component actually hurt? Is the baseline in the ablation sensible?
- Minutes 19-20: Related work. Who did they not cite that they should have? That's usually where the strongest comparison lives.
The Research Taste Interview Format
Cohere, Anthropic, and Mistral-type interviews sometimes hand you a paper and ask 'what do you think of this evaluation?' The answer they want is not a summary — it's a critique. Walk through: what claim are they making, what evidence do they provide, what would make you more or less confident in that evidence, and what experiment would you run to falsify the core claim.
The tell of a strong candidate: they immediately ask 'what dataset did the baseline numbers come from — did the authors re-run it or take it from the original paper?' Weak candidates summarize results. Strong candidates interrogate methodology.