The No-Spec System Design Round: A Framework for Ambiguous ML Problems

Most candidates fail ambiguous system design rounds by drawing architecture boxes immediately. This article lays out a 4-phase framework: clarification (5 non-negotiable minutes), explicit assumption surfacing, one clear architecture argument with defended tradeoffs, and preemptive failure mode analysis. It closes with the ML design question that separates senior from staff.

The highest-signal system design questions have no clean spec. 'Design a recommendation system for 500M users.' 'Build a search ranking system from scratch.' 'Design an eval framework for our LLM product.' The interviewer is not withholding information to be difficult — they're watching how you handle the thing that's normal in their job: ambiguity.

Why Candidates Fail This Round

Most candidates start drawing architecture boxes immediately. This is the wrong move. You're designing a system for a problem you don't understand yet. The first 5 minutes should produce no boxes — only questions.

The failure mode: the candidate jumps to 'okay so we'd have a two-tower retrieval model, then a ranking model, then a reranker' before knowing the product type, traffic pattern, latency requirement, or what 'recommendation' means in this context. The interviewer mentally downgrades the candidate on the spot.

Phase 1: Clarification (5 minutes, non-negotiable)

Before any design work, surface the constraints that will define the architecture. Four categories:

- Scale: DAU/MAU, QPS at peak, data size (items, users, events). 500M users served synchronously vs. 500M users with precomputed recommendations are completely different problems.
- Latency budget: p50/p99 SLA. 100ms total budget with 50ms for retrieval means a different architecture than 500ms total. Real-time vs. batch vs. precomputed.
- Quality definition: what does 'good recommendation' mean here? Click-through? Purchase? Watch time? Session length? The metric determines the training objective.
- Constraints: existing infra, team size, on-call burden tolerance, budget. 'We have 3 ML engineers' means no 7-model ensemble.
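The scale and latency questions above turn into back-of-envelope arithmetic once the interviewer answers them. A minimal sketch, assuming hypothetical inputs: the `TrafficAssumptions` fields, the peak-to-average ratio, and the per-stage latency names are illustrative and not taken from any real system.

```python
from __future__ import annotations

from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class TrafficAssumptions:
    # Hypothetical placeholders; confirm each one during clarification.
    daily_active_users: int
    requests_per_user_per_day: float
    peak_to_average_ratio: float


def peak_qps(t: TrafficAssumptions) -> float:
    """Average request rate scaled by the assumed peak factor."""
    average_qps = t.daily_active_users * t.requests_per_user_per_day / SECONDS_PER_DAY
    return average_qps * t.peak_to_average_ratio


def remaining_budget_ms(total_ms: float, stage_ms: dict[str, float]) -> float:
    """Latency left for ranking and everything else after the listed stages."""
    return total_ms - sum(stage_ms.values())


if __name__ == "__main__":
    traffic = TrafficAssumptions(100_000_000, 10, 3.0)  # assumed values
    print(f"peak QPS ~ {peak_qps(traffic):,.0f}")
    print(remaining_budget_ms(100, {"retrieval": 50, "network": 10}), "ms left")
```

With a 100ms budget and 50ms already spent on retrieval, whatever remains after network overhead is all the ranker gets, which is why the latency answer has to come before any model choice.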

Phase 2: State Your Assumptions Explicitly

After clarification, some things will still be unknown. Don't skip them — name them and make a bet.

// Good assumption surfacing:
"I'll assume:
- Implicit feedback only (clicks/purchases), no explicit ratings
- Latency budget: 150ms p99 end-to-end
- Cold start is a real problem: 20% of users are <1 week old
- We're optimizing for 7-day retention, not single-session CTR

If any of these are wrong, the architecture changes significantly.
Should I proceed on these or correct them?"

// What most candidates do:
"Okay so I'll design this system..."

Phase 3: One Clear Architecture Argument

Don't present options. Make a decision and defend it. Interviewers at top-paying companies want to see that you have taste: the ability to look at constraints and arrive at a design rather than present a menu.

State the core insight first: 'Given 150ms p99 and 500M users, we can't afford joint scoring of all items at query time. This forces a two-stage architecture: fast approximate retrieval (ANN over user-item embeddings), then precision ranking on a small candidate set.'

Name the architecture: 'Two-tower retrieval trained with in-batch negatives → ANN index (HNSW, for example via FAISS) → LightGBM ranker on dense features → diversity/business logic reranker.'

Defend the key tradeoffs: 'I chose LightGBM over a deep ranker because (1) inference latency: GBDT is roughly 5ms vs. 40ms for a small DNN, and we need to rank 500 candidates; (2) feature importance interpretability matters for business rules injection; (3) it is easier to debug when CTR drops unexpectedly.'
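The two-stage argument can be sketched in a few lines. This is a toy illustration under stated assumptions, not the production design: exact dot-product search stands in for the ANN index, `score_fn` is a hypothetical stand-in for the trained ranker, and all function names are invented for the example.

```python
import heapq
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: Sequence[float],
             item_vecs: Sequence[Sequence[float]],
             k: int) -> List[int]:
    """Stage 1: ids of the k items with the highest embedding dot product.

    Exact brute force here; a real system would query an ANN index instead.
    """
    return heapq.nlargest(
        k, range(len(item_vecs)), key=lambda i: dot(user_vec, item_vecs[i])
    )


def rank(candidates: List[int], score_fn: Callable[[int], float]) -> List[int]:
    """Stage 2: rescore only the retrieved candidates with a costlier model."""
    return sorted(candidates, key=score_fn, reverse=True)


def recommend(user_vec: Sequence[float],
              item_vecs: Sequence[Sequence[float]],
              score_fn: Callable[[int], float],
              k_retrieve: int = 500,
              k_final: int = 10) -> List[int]:
    candidates = retrieve(user_vec, item_vecs, k_retrieve)
    return rank(candidates, score_fn)[:k_final]
```

The expensive scorer runs on `k_retrieve` items rather than the full catalog, which is the whole point of the split. The cost is that an item the retriever misses can never be recommended, however highly the ranker would have scored it.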

Phase 4: Preempt the Follow-Ups

After presenting the design, proactively identify the two or three things most likely to fail in production.

Cold start: 'We have 20% new users. For them, the user tower has no history to encode, so the embedding is effectively uninformative. Fix: (1) use a geographic/demographic cluster centroid as initialization, (2) add a session-based model (GRU4Rec) that works from the first click with no user history.'

Distribution shift: 'Item embeddings go stale as item metadata and interaction patterns change. Fix: refresh item embeddings on a 6-hour schedule and detect staleness via embedding drift monitoring.'

Position bias: 'Our training data has position bias: the item in slot 1 gets clicked more regardless of relevance. Fix: inverse propensity scoring (IPS) during ranker training, weighting each click by the inverse of its examination propensity, with propensity approximated as 1/rank.'
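The position-bias fix can be made concrete with a small sketch. It assumes a simple position-based examination model in which propensity decays as (1/rank)^eta; the `eta` exponent and the `max_weight` clip are illustrative additions, and in practice propensities are usually estimated from data (for example, via randomized slot experiments) rather than assumed.

```python
from typing import List


def examination_propensity(rank: int, eta: float = 1.0) -> float:
    """Assumed probability that a user examines the slot at `rank` (1-indexed)."""
    if rank < 1:
        raise ValueError("rank is 1-indexed")
    return (1.0 / rank) ** eta


def ips_weights(click_ranks: List[int],
                eta: float = 1.0,
                max_weight: float = 10.0) -> List[float]:
    """Per-click training weights: inverse propensity, clipped to limit variance."""
    return [
        min(1.0 / examination_propensity(r, eta), max_weight)
        for r in click_ranks
    ]


if __name__ == "__main__":
    # Clicks observed at slots 1, 2, 5, and 50.
    print(ips_weights([1, 2, 5, 50]))
```

A click in slot 5 counts about five times as much as a click in slot 1, because far fewer users ever looked at slot 5. The clip keeps rare deep-slot clicks from dominating the loss. The resulting values would be passed to the ranker as per-example sample weights.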

The ML System Design Question That Separates Senior from Staff

'How do you evaluate whether the system is actually getting better?' This is the question most candidates answer wrong at senior level and right at staff level.

Senior answer: 'We'd track CTR, conversion, NDCG@10.'

Staff answer: 'Offline metrics are necessary but not sufficient: they optimize for the metric, not the user outcome. We'd run an A/B test with a proper power analysis before assuming any offline improvement is real. For the online experiment, the primary metric is 7-day retention, guardrail metrics are session length and return rate, and secondary metrics are CTR and conversion. We'd run for 2 weeks minimum so that novelty effects have time to fade. And we'd keep a 5% holdout group to measure the cumulative effect of model improvements over quarters.'
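The "proper power analysis" in the staff answer comes down to a sample-size estimate made before the experiment starts. A rough sketch using the standard normal-approximation formula for a two-sided, two-proportion z-test; the 40% baseline retention and one-point lift in the usage lines are assumed numbers for illustration only.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate: float,
                  min_detectable_lift: float,
                  alpha: float = 0.05,
                  power: float = 0.8) -> int:
    """Approximate users needed in each arm of a two-proportion z-test.

    `min_detectable_lift` is absolute: 0.01 means one percentage point.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Assumed: 40% baseline 7-day retention, detect a 1-point absolute lift.
    print(users_per_arm(0.40, 0.01))
    # Halving the detectable lift roughly quadruples the requirement.
    print(users_per_arm(0.40, 0.005))
```

Under those assumptions the requirement is on the order of tens of thousands of users per arm. Because the required sample grows with the inverse square of the lift, small retention improvements need large, long-running experiments.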

Common Ambiguous Design Questions and the First Clarification to Ask
