How to Answer 'Design a RAG System' in a System Design Interview

A complete framework for tackling RAG system design questions: how to scope requirements, walk through the architecture, discuss failure modes, and show depth on retrieval quality vs. latency tradeoffs.

System design interviews at AI-focused companies increasingly include retrieval-augmented generation (RAG). The question usually sounds like: 'Design a question-answering system over our internal documentation' or 'How would you build a support bot that uses our knowledge base?' The interviewer wants to see whether you can scope a real system, not just recite the acronym.

Here's a framework for answering this question well, including what separates a strong answer from a weak one.

Step 1: Scope the requirements (2–3 minutes)

Before drawing anything, ask questions. This is not stalling; it's what senior engineers do. The answers shape most of the architectural decisions that follow.

What's the document corpus? Size, update frequency, types of documents (PDFs, wiki, code, structured data)?
What's the latency requirement? <1s? <3s? Does it vary by use case?
Do we need citations/sources in the response?
Is there access control — some users can only see some documents?
What's the accuracy requirement? Is a wrong answer worse than no answer?

Interviewers tend to reward candidates who ask whether the workload needs exact keyword matching (where BM25 fits) or semantic similarity (where dense retrieval fits), rather than defaulting to 'vector database' for everything.

Step 2: Walk through the architecture top-down

Structure your answer in two pipelines: ingestion (offline) and query (online).

Ingestion pipeline

Document loading: connectors for each document source (S3, Confluence, Google Drive). Handle format diversity (PDF parsing with PyMuPDF, HTML stripping, code extraction).
Chunking: fixed-size with overlap for simple docs; semantic chunking for prose-heavy content. Chunk size is a tunable hyperparameter — mention that 256–512 tokens is a common starting range.
Metadata extraction: document ID, source URL, section heading, timestamp, access control tags.
Embedding: use an embedding model (text-embedding-3-large, BGE-M3, or similar). Store in vector DB with metadata.
Re-ingestion strategy: full re-index or delta updates? How do you detect document changes?
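
The chunking and change-detection steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: whitespace splitting stands in for the embedding model's tokenizer, and the `Chunk`, `chunk_tokens`, and `build_chunks` names are hypothetical. The per-chunk SHA-256 hash is one possible way to detect changes for delta updates.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    content_hash: str  # lets a later run skip chunks whose text is unchanged


def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


def build_chunks(doc_id, text, chunk_size=256, overlap=32):
    # Assumption: whitespace tokens approximate the real tokenizer's tokens.
    tokens = text.split()
    chunks = []
    for i, window in enumerate(chunk_tokens(tokens, chunk_size, overlap)):
        chunk_text = " ".join(window)
        digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        chunks.append(Chunk(doc_id, i, chunk_text, digest))
    return chunks
```

On re-ingestion, comparing each new `content_hash` against the stored one would let the pipeline re-embed only the chunks that changed, which is the delta-update option mentioned above.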

Query pipeline

Query understanding: optionally rewrite the user query (HyDE, step-back, multi-query expansion) to improve retrieval.
Retrieval: hybrid search — dense vector retrieval for semantic similarity + BM25 for exact match. Merge with RRF (Reciprocal Rank Fusion).
Reranking: a cross-encoder reranker re-scores the top retrieved candidates (for example, the top 20) and keeps only the best few (for example, the top 5) before passing them to the LLM.
Prompt assembly: system prompt + retrieved context + user question. Handle token limits — truncate or summarize long passages.
Generation: LLM produces answer with source citations.
Response: return answer + source list with confidence scores.
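
The merge step in hybrid retrieval can be sketched with a small Reciprocal Rank Fusion function. This is an illustrative sketch under stated assumptions: the document IDs and the two input rankings are made up, and `k=60` is a commonly used default rather than a value this article prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents that appear in both lists rise to the top.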

Step 3: Discuss failure modes

This is where most candidates go shallow. Senior engineers talk about failure modes proactively.

Missing context: the answer exists in the corpus but wasn't retrieved. Solution: improve chunking granularity, add metadata filtering, tune embedding model.
Hallucination: the LLM generates plausible-sounding content not grounded in retrieved context. Solution: faithfulness scoring, citation grounding (an NLI-based check, or an embedding cosine-similarity check between each answer sentence and its cited passage), temperature reduction.
Stale data: ingestion lag means the answer is outdated. Solution: version-aware retrieval, timestamp filtering in metadata.
Query mismatch: the user's question is underspecified or ambiguous. Solution: query clarification dialog, multi-query expansion.
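
The metadata filtering mentioned above, covering both timestamp freshness and the access control tags captured at ingestion, might look like the following. This is a sketch with hypothetical names (`RetrievedChunk`, `filter_chunks`) and an assumed group-based permission model; it is written as a post-retrieval filter only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RetrievedChunk:
    text: str
    updated_at: datetime
    allowed_groups: set = field(default_factory=set)  # access control tags


def filter_chunks(chunks, user_groups, not_older_than):
    """Keep chunks the user may read and that were updated recently enough."""
    kept = []
    for chunk in chunks:
        if not (chunk.allowed_groups & user_groups):
            continue  # user shares no group with the chunk's access tags
        if chunk.updated_at < not_older_than:
            continue  # likely stale relative to the freshness cutoff
        kept.append(chunk)
    return kept


cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical cutoff
```

In a real system it is usually preferable to push the same conditions into the vector store's query as a pre-filter where the store supports it, so that restricted or stale chunks never take up slots in the top-k results.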

Step 4: Discuss evaluation

Mention RAGAS metrics: faithfulness (is the answer grounded?), answer relevance (does it answer the question?), context precision (is the retrieved context relevant?). Distinguish between offline eval (benchmark dataset) and online eval (user thumbs down, rephrasing rate, session abandonment).
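
For the offline side, a benchmark of questions with labeled relevant documents lets you score the retriever on its own. The sketch below computes plain precision@k and recall@k. These are simpler, label-based stand-ins and not the RAGAS definitions, which rely on an LLM judge. The function names and the benchmark shape are assumptions for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of the labeled-relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def evaluate(benchmark, k=5):
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    if not benchmark:
        return {"precision": 0.0, "recall": 0.0}
    n = len(benchmark)
    return {
        "precision": sum(precision_at_k(r, rel, k) for r, rel in benchmark) / n,
        "recall": sum(recall_at_k(r, rel, k) for r, rel in benchmark) / n,
    }
```

Tracking recall@k separately from answer quality helps separate the 'missing context' failure mode (a retrieval problem) from hallucination (a generation problem).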

What makes a strong vs. weak answer

Weak answer	Strong answer
Jumps to implementation	Asks scoping questions first
Only mentions vector DB	Discusses hybrid search + reranking
No failure modes	Proactively lists 3–4 failure modes with mitigations
No evaluation plan	Mentions specific metrics (RAGAS, faithfulness)
No access control	Notes document-level permissions in metadata + retrieval filter
