The AI Engineering Interview Sprint: 1 Week, 3 Days, 1 Day, 2 Hours

A time-horizon prep guide for AI engineering interviews. What to study when, the 20 highest-signal questions, exact numbers to memorize, and what interviewers actually penalize — organized by how much time you have.

Most interview prep fails not because candidates don't know enough, but because they study the wrong things in the wrong order. This post gives you a concrete plan, organized by how much time you have left. The structure assumes you're interviewing for mid-level to senior AI engineer roles covering RAG, agents, evaluation, LLMOps, and NLP foundations.

This post maps directly to GenAI Systems Lab (GSL) content. Each section links to specific modules and PrepLab question clusters so you know exactly what to open next. Start at the section that matches how much time you have left and work down from there.

One Week Out: Build the Architecture Map

With a week, you have time to build genuine understanding — not just memorize answers. The goal this week is to internalize the mental models that let you answer novel questions, not just rehearsed ones. Interviewers at senior levels ask about tradeoffs, not facts.

RAG pipeline: Can you trace a query from raw text to final answer and name what can break at each step? Know: chunking strategies, embedding models (bi-encoder), ANN index choice, reranking (cross-encoder), prompt assembly. Read: how-rag-works, bi-encoder-vs-cross-encoder, ann-algorithms-deep-dive.
Agents: Do you understand ReAct, memory patterns, tool use, and where agents fail? Know: the four memory problem types, when to use LangGraph vs simple orchestration, how tool poisoning works. Read: react-pattern, agent-memory-architecture, agent-failure-modes.
Evaluation: Can you pick the right metric for a given task? Know: NDCG vs MRR (when each applies), LLM-as-judge bias modes, when human eval is required, calibration. Read: ndcg-mrr-from-scratch, llm-judge-calibration, calibration-ece-from-scratch.
Production: Can you describe a monitoring stack for an LLM system? Know: what to log, PSI/KS/MMD for drift detection, canary vs shadow deployment, retraining triggers. Read: llm-observability, drift-detection-production, deployment-patterns-ml.
NLP foundations: Do you know the difference between bi-encoder and cross-encoder, why [CLS] pooling often underperforms mean pooling for sentence embeddings, and when to use T5 vs a decoder-only model? Read: bert-internals-explained, sentence-transformers-production, encoder-decoder-architecture.
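
For the NDCG vs MRR distinction in the evaluation item above, a small worked sketch can make the tradeoff concrete. This is a minimal pure-Python illustration, not the code from the linked modules. The function names are hypothetical, the gain is linear (some formulations use 2^rel - 1 instead), and the ideal ranking is built only from the relevances in the list, which assumes every relevant document was retrieved.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the same relevances sorted best-first.

    Assumption: the ideal ranking is derived from the retrieved list only.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(ranked_lists):
    """Average reciprocal rank across queries."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)

# Two hypothetical rankings for the same query, graded 0 (irrelevant) to 3 (best).
ranking_a = [3, 0, 2, 1]   # best document first
ranking_b = [1, 0, 2, 3]   # marginal document first, best document last

print("RR   a:", reciprocal_rank(ranking_a), " b:", reciprocal_rank(ranking_b))
print("NDCG a:", round(ndcg_at_k(ranking_a, 4), 3), " b:", round(ndcg_at_k(ranking_b, 4), 3))
```

Both rankings have a relevant result in position one, so reciprocal rank cannot tell them apart. NDCG scores the first ranking higher because it rewards putting the most relevant documents near the top. That is the usual argument for MRR when only the first good hit matters and NDCG when graded relevance across the whole list matters.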

End-of-week check: open PrepLab, pick 10 questions across all topics, and track your accuracy. Identify your two weakest topic clusters. Those become your focus for the next phase.

Three Days Out: Harden the Weak Spots and Run Scenarios

With three days, you should no longer be doing broad surveys. You now know what you don't know. Spend these three days on: (1) drilling your weak topic clusters in PrepLab, (2) working through at least two scenario questions, and (3) running the Systems modules for your weakest areas.

Run at least one full PrepLab scenario (tools: scenario-4 tool poisoning, scenario-5 catastrophic forgetting, scenario-6 eval distribution mismatch). These are 4-step walkthroughs of production incidents — exactly the structure some interviews use.

The most common three-day mistake: drilling MCQs and calling it prep. MCQs build recognition, not production reasoning. Make yourself explain your answers aloud. If you can't say why the other three options are wrong, you haven't learned it yet.

One Day Out: The 20 Highest-Signal Questions

With 24 hours left, stop acquiring new information. Every hour spent learning something new is an hour not spent sharpening what you already know. Your job today is to make your existing knowledge retrieval-fast under pressure.

These are the questions most likely to appear in a mid-to-senior AI engineering loop, based on interview signal data from 22 practitioners. Have a crisp answer to each:

Walk me through how you'd build a RAG system from scratch. What are the failure modes at each step?
When would you use a bi-encoder vs a cross-encoder? What's the production pattern?
Your LLM system's quality has degraded in production. How do you diagnose it?
How does fine-tuning differ from RAG? When would you choose one over the other?
What is NDCG and when do you use it instead of accuracy?
Explain how an agent can fail even when the underlying LLM is working correctly.
What is training-serving skew and how do feature stores prevent it?
You're using LLM-as-judge for evaluation. What biases should you control for?
How does KV cache reduce LLM inference latency?
What does temperature do to an LLM's output distribution? When do you set it to 0?
How does BERT's masked language modeling differ from GPT's causal language modeling?
You want to fine-tune a sentence transformer for medical retrieval. What training data and loss function?
How does speculative decoding improve throughput without degrading quality?
A canary deployment reveals a regression for a specific user segment. What does this tell you about routing?
How would you detect that your embedding model has drifted for your domain?
You need to search 10M documents with filtering by category and date. Which vector database and why?
Explain the difference between data drift, concept drift, and label drift.
How does LoRA reduce fine-tuning memory cost without reducing model expressiveness significantly?
What is a two-tower model and why does it dominate large-scale retrieval?
Your cross-encoder reranker is accurate but too slow for your latency SLA. What's the fix?
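
For the temperature question in the list above, it helps to be able to reproduce the arithmetic on demand. The sketch below is a minimal pure-Python illustration, not any provider's implementation, and the function names are hypothetical. It assumes temperature divides the logits before the softmax and treats temperature 0 as greedy argmax. That is a common convention, but check it for the API you actually use.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        # Division by zero is undefined; temperature 0 is handled by greedy_pick.
        raise ValueError("temperature must be > 0; use greedy_pick for temperature 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Index of the highest logit: the assumed meaning of temperature 0."""
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
print("greedy index:", greedy_pick(logits))
```

Lower temperatures concentrate probability on the top token and higher temperatures flatten the distribution. A crisp interview answer ties this to use case: near-deterministic settings for extraction, classification, and evaluation runs, and higher settings when diversity matters.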

Two Hours Out: The Mental Model Refresher

Stop drilling questions. You're in consolidation mode. Read these numbers and frameworks once, slowly, and let them settle. These are the facts that create instant credibility when stated precisely.

The two-hour rule: if you don't already know something, you won't learn it in the next two hours. Stop trying to patch knowledge gaps. Focus entirely on communicating clearly what you already know. At this stage, clarity beats coverage.

What Interviewers Actually Penalize

Based on 22 practitioner interview experiences in the GSL Interview Signal database, the most common reasons candidates fail AI engineering loops are not knowledge gaps — they're communication patterns:

Giving the definition instead of the tradeoff. Interviewers know what NDCG is. They want to know when you'd use it instead of accuracy, and when you wouldn't.
Treating every problem as a fine-tuning problem or every problem as a RAG problem. The correct answer is almost always 'it depends, and here are the factors I'd consider.'
Not naming the failure mode. When asked about any system, the candidate who says 'and here's where this breaks' is immediately more credible than the candidate who only describes the happy path.
Saying 'we used X at my company' without explaining why X was chosen. A process answer without reasoning signals pattern-following, not engineering judgment.
Not knowing the numbers. Vague answers like 'HNSW is fast' are much weaker than 'HNSW recall@10 at efSearch=16 is typically 0.85 on 1M vectors, and you tune efSearch up to trade latency for recall.'
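
A recall@k figure like the one quoted above is only credible if you can say how it was measured. The sketch below shows one way to do that: compare an approximate index's results against exact brute-force neighbors. It is a pure-Python illustration with hypothetical function names and made-up toy data, and it does not call any real ANN library. In practice you would average recall over many queries drawn from your own workload.

```python
def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors by squared Euclidean distance."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    return [idx for _, idx in sorted(dists)[:k]]

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Toy data: four 2-D vectors and one query (illustrative values only).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
query = [0.0, 0.0]

truth = exact_top_k(query, vectors, k=2)
approx = [0, 3]   # hypothetical output of an ANN index that missed one neighbor

print("exact neighbors:", truth)
print("recall@2:", recall_at_k(approx, truth, k=2))
```

Running this measurement at several search-effort settings (efSearch, in HNSW terms) gives you the latency versus recall curve for your own data, which is a stronger answer than a quoted benchmark number.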
