
The Research Engineer Interview: Paper Implementation, Research Taste, Experimental Design, Open Problems

Four rounds, all different from an MLE loop: implement a paper contribution from scratch in 45 to 60 minutes, critique an evaluation setup, design a rigorous experiment, and name an important open problem with a research direction. What each round tests and how to prepare.

The Research Engineer Interview: What It Actually Tests

Research Engineer is one of the most misunderstood roles in AI. Candidates often prepare as if it were a senior MLE role with a paper-reading component. It's not. The interview probes a specific combination: deep mathematical understanding, the ability to implement a paper from scratch under time pressure, the research taste to critique methodology, and enough engineering rigor to make research code reliable in production.

How It Differs from MLE and SWE Interviews

Round 1: Implement the Paper

You're given a 2-3 page excerpt from a paper — just the method section, not the results. You have 45-60 minutes to implement the core contribution in Python/NumPy. No framework. Tests provided.

What they're testing: can you translate mathematical notation into working code? Do you understand the method deeply enough to handle the edge cases the paper doesn't mention?

Common papers used: attention variants (linear attention, sparse attention), contrastive learning losses (SimCLR, NT-Xent), simple fine-tuning methods (LoRA, prompt tuning), retrieval components (BM25, two-tower).

Where candidates fail: they understand the concept but can't implement the normalization step, handle the batching correctly, or pass the numerical tolerance test against the reference implementation.

What to practice: read a method section, close it, implement it. Check against PyTorch or the official repo. Do this 10 times.
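To make the failure points concrete, here is a minimal sketch of the NT-Xent loss in NumPy, checked against a slow loop-based reference. The function names, the default temperature of 0.5, and the tolerance are assumptions chosen for illustration, not the setup any particular interviewer uses. The three places people tend to slip are marked in the comments: the L2 normalization, the positive-pair indexing across the batch, and the numerically stable log-sum-exp.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for N positive pairs (z1[i], z2[i]); inputs have shape (N, d)."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0).astype(np.float64)        # (2N, d)
    # Normalization step: L2-normalize so dot products are cosine similarities.
    z = z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    sim = (z @ z.T) / temperature                                   # (2N, 2N)
    # A sample is never its own negative: drop the diagonal from the softmax.
    np.fill_diagonal(sim, -np.inf)
    # Batching: the positive for row i is row i+N (first half) or i-N (second half).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Stable log-sum-exp: subtract the row max before exponentiating.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), pos] - log_denom
    return float(-log_prob.mean())

def nt_xent_reference(z1, z2, temperature=0.5):
    """Slow, literal transcription of the formula, used only as a check."""
    z = [v / np.linalg.norm(v) for v in np.concatenate([z1, z2], axis=0)]
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)                                       # index of the positive
        denom = sum(np.exp(z[i] @ z[k] / temperature) for k in range(2 * n) if k != i)
        total += -np.log(np.exp(z[i] @ z[j] / temperature) / denom)
    return total / (2 * n)

# The kind of tolerance test the round is likely to include.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
assert np.isclose(nt_xent_loss(z1, z2), nt_xent_reference(z1, z2), atol=1e-8)
```

Writing the slow reference first and the vectorized version second is a reasonable habit for this round: the loop is easy to get right from the paper's notation, and it gives you something to test against before the provided tests run.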

Round 2: Research Taste and Critique

You're handed a paper — sometimes a real published paper, sometimes a fabricated one with deliberate flaws — and asked: 'What do you think of this?' The answer they want is not a summary. It's a structured critique.

Walk through in this order: what claim is the paper making? What experiment would falsify that claim? Did they run it? If not, why might they have avoided it?

Baseline check: are the baselines from the same year? Were they tuned as carefully as the proposed method?

Ablation check: can you tell from the ablation table which component actually drives the improvement?

Contamination check: if it's an LLM paper, what's the contamination risk on the benchmark?

The question that separates strong from weak: 'What experiment would you run first to check if this result is real?' Weak candidates summarize. Strong candidates propose a falsification test.
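If you propose a contamination check as your first experiment, it helps to be able to say what it would look like. The sketch below flags benchmark items that share any exact n-gram with a training corpus. Everything here is an assumption for illustration: the function names are hypothetical, tokenization is plain lowercase whitespace splitting, and n=8 is an arbitrary default. Exact-match overlap also misses paraphrased or translated leakage, so treat a low rate as weak evidence rather than a clean bill of health.

```python
def ngrams(text, n):
    """Set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(benchmark_items, corpus_docs, n=8):
    """Return (fraction of items sharing an n-gram with the corpus, flagged items)."""
    if not benchmark_items:
        return 0.0, []
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    # Items shorter than n tokens produce no n-grams and are never flagged.
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged

benchmark = [
    "what is the capital of france",
    "name the largest planet in the solar system",
]
corpus = [
    "quiz answers: what is the capital of france ? paris",
    "unrelated text about cooking pasta at home",
]
rate, flagged = contaminated_fraction(benchmark, corpus, n=4)
print(rate, flagged)
```

The natural follow-up is to report the headline metric separately on flagged and unflagged items. If the gain disappears on the unflagged subset, that is the falsification result.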

Round 3: Experimental Design

Given a research hypothesis, design the experiment that would test it rigorously. This is the inverse of the paper critique round — instead of finding what's missing, you're designing from scratch.

Hypothesis example: 'We believe instruction tuning improves zero-shot performance more than few-shot examples for low-resource languages.'

Strong answer structure: define the metric (what does 'improves' mean, and on what tasks?), define the baseline (instruction-tuned vs. what?), define the evaluation set (must be held out, must cover the low-resource languages in question), define the control variables (model size, pretraining data, same prompt template), plan the statistical test (how many seeds, what's the power analysis?), name the failure mode (what result would disprove the hypothesis?).

Weak answer: proposes an experiment without specifying what 'works' means or how many seeds they'd run.
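The "how many seeds" question has a concrete answer once you pick a test. As one possible choice (an assumption, not the only defensible one), the sketch below runs an exact sign-flip permutation test on per-seed paired differences between two conditions. The function name and the scores are made up for illustration.

```python
from itertools import product
from statistics import mean

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-seed scores.

    Returns (mean difference, p-value). Under the null hypothesis the sign of
    each paired difference is arbitrary, so all 2^k sign patterns are enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(flipped)) >= observed - 1e-12:
            as_extreme += 1
    return mean(diffs), as_extreme / total

# Hypothetical accuracies over 5 seeds, same seeds for both conditions.
instruction_tuned = [0.71, 0.69, 0.72, 0.70, 0.73]
few_shot = [0.65, 0.66, 0.64, 0.67, 0.65]
diff, p = paired_sign_flip_test(instruction_tuned, few_shot)
print(diff, p)
```

With k seeds, the smallest two-sided p-value this test can produce is 2 / 2^k. In the example every seed favors the same condition and p is still 2/32 = 0.0625, so five seeds cannot clear a 0.05 threshold no matter how large the effect is. Being able to say that before running anything is the point of the power-analysis question.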

Round 4: Open Problems

'What do you think is the most important unsolved problem in [retrieval / alignment / efficient inference / multimodal learning]?' This round has no right answer. It's testing whether you've thought seriously about the field.

The answer structure that works: name the problem, say why current approaches fail to solve it (be specific — name the papers), say what a solution would look like and what would need to be true for it to exist, name the experiment you'd run if you had 3 months. The worst answer: names a problem and immediately says 'it's really hard.' That tells them nothing about how you think.

What to Build to Prepare

Implement 10 paper contributions from scratch. Start with: attention (Vaswani et al.), NT-Xent loss (SimCLR), LoRA (Hu et al.), BPR (Rendle et al.), BM25 (Robertson et al.).

Critique 5 papers using the structured framework: claim, falsification test, baseline quality, ablation completeness, contamination risk.

Pick 3 open problems and write 2 paragraphs each: current approaches, why they fail, what a solution requires.

Read the methods section of one paper per week and implement it before reading the results.
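As a sense of scale for one of these exercises, a from-scratch BM25 scorer fits in about twenty lines. This is a sketch under stated assumptions: whitespace tokenization, k1=1.5 and b=0.75 as conventional defaults, and the IDF variant with +1 inside the log (which keeps scores non-negative). Other BM25 variants define IDF differently, so expect small numerical differences against any given library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in docs against query with a basic BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs       # average document length
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "the quick brown fox"]
print(bm25_scores("cat", docs))
```

Note that "cats" does not match "cat" here because there is no stemming. That is exactly the kind of edge case a method section leaves out and a test suite may probe.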
