
How to Answer 'How Would You Evaluate This LLM System?' in an Interview

Evaluation is where most candidates go blank. This post gives you a reusable framework: task decomposition, metric selection, offline vs. online eval, human labelling, and how to talk about hallucination measurement.

When an interviewer asks 'how would you evaluate this LLM system?', most candidates name a few metrics and go quiet. Strong candidates pull out a framework. Here's one that works.

Step 1: Decompose the task

Before naming metrics, ask: what does 'good output' mean for this specific system? A customer support bot, a code assistant, and a RAG document Q&A system have different quality definitions. Break the task into measurable dimensions first.

Correctness: is the answer factually accurate? Does it solve the task?
Relevance: does the output address what was asked, or does it answer a related but different question?
Groundedness (for RAG): are claims supported by retrieved context, or is the model hallucinating?
Completeness: does the output cover all required aspects, or does it omit critical information?
Safety/Appropriateness: does the output violate guidelines? Is tone appropriate?
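
One way to make these dimensions concrete is to write them down as a scoring rubric that human labelers and automated judges both fill in. The sketch below is a minimal illustration, not a standard: the 1-5 scale, the `RubricScore` class, and the weights are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Dimension names mirror the list above. The 1-5 scale is an illustrative
# assumption, not something the framework prescribes.
DIMENSIONS = ("correctness", "relevance", "groundedness", "completeness", "safety")


@dataclass(frozen=True)
class RubricScore:
    """Per-dimension scores for one model response (hypothetical structure)."""

    scores: dict[str, int]  # dimension name -> score on a 1-5 scale

    def __post_init__(self) -> None:
        unknown = set(self.scores) - set(DIMENSIONS)
        if unknown:
            raise ValueError(f"unknown dimensions: {sorted(unknown)}")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} score {value} is outside 1-5")

    def weighted(self, weights: dict[str, float]) -> float:
        """Weighted mean over the dimensions that were actually scored."""
        total_weight = sum(weights[d] for d in self.scores)
        return sum(self.scores[d] * weights[d] for d in self.scores) / total_weight


# Example: a support bot might weight correctness above relevance.
score = RubricScore({"correctness": 5, "relevance": 3})
overall = score.weighted({"correctness": 2.0, "relevance": 1.0})
```

Keeping the per-dimension scores rather than only the aggregate makes it easier to say in an interview which dimension regressed, not just that quality dropped.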

Step 2: Offline vs. online evaluation

Separate your evaluation strategy into two phases. Interviewers want to see you think about both.

Offline evaluation

Golden dataset: 200–500 human-labeled prompt-response pairs. Ground truth for core metrics.
Model-based eval: use a stronger LLM (GPT-4, Claude) as a judge. Prompt it to score responses on specific dimensions, and check its scores against human ratings on a shared sample before trusting it at scale.
RAG-specific: RAGAS metrics such as faithfulness, answer relevancy, context precision, and context recall.
Regression suite: every production bug becomes a test case. Run this before every model swap.
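
Calibrating a judge against human ratings can be as simple as measuring agreement on the golden dataset. The sketch below uses Cohen's kappa, which corrects raw agreement for chance; it is one reasonable choice among several, and the label values and the `cohens_kappa` helper are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both raters labeled at random with their own
    # marginal label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four golden-set items.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5 for these labels
```

If agreement is low, the usual next step is to tighten the judge prompt or the rubric and re-measure before relying on the judge for the rest of the suite.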

Online evaluation

Explicit signals: thumbs up/down, helpful/not helpful buttons. High-signal but low-volume.
Implicit signals: session abandonment, query rephrasing (a user asking the same question in different words suggests the first answer failed), escalation to a human agent.
A/B testing: route a fraction of traffic to a new model/prompt. Statistical significance testing on key metrics.
LLM-as-judge at scale: run a smaller judge model on sampled production outputs. Cheap enough to run continuously.
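
For the A/B testing item, it helps to be able to say what "statistical significance testing" means in practice. A minimal sketch, assuming the key metric is a binary outcome such as thumbs-up rate and that samples are large enough for a normal approximation: a two-sided two-proportion z-test. The function name and the traffic numbers are hypothetical.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled rate under the null hypothesis that both arms perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variance: every outcome in both arms is identical")
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: control gets 400/1000 thumbs up, new prompt 460/1000.
z, p = two_proportion_z_test(400, 1000, 460, 1000)
significant = p < 0.05
```

In an interview it is worth adding that the sample size and the metric should be fixed before the experiment starts, so the test is not rerun until it happens to pass.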

Step 3: Hallucination specifically

Hallucination deserves special treatment because it's what non-technical stakeholders care about most. Explain your hallucination detection strategy:

For RAG systems: NLI-based faithfulness check — verify each claim in the output is entailed by retrieved passages.
For general LLMs: self-consistency (sample N times and check whether the answers agree) and truthfulness benchmarks such as TruthfulQA. Broader suites like HELM also report accuracy and calibration.
Calibration: does the model's expressed confidence match its actual accuracy? A confidently stated wrong answer does more damage than a hedged one, because the user gets no cue to double-check it.
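
Two of the checks above are easy to sketch without any model dependencies. The example below assumes you already have N sampled answers per prompt and, for calibration, a numeric confidence in [0, 1] per answer (for instance parsed from the model's output); both function names and the normalization step are illustrative assumptions. Exact string matching is a simplification, since real answers usually need semantic comparison.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Simplistic normalization; free-form answers need semantic matching.
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Average gap between stated confidence and accuracy, weighted by bin size."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


answer, agreement = self_consistency(["Paris", "paris ", "Lyon"])
ece = expected_calibration_error([0.9, 0.9], [True, False])
```

A low agreement rate flags prompts where the model is likely guessing, and a high calibration error flags a model whose stated confidence should not be shown to users as-is.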

Step 4: Evaluation infrastructure

Mention the tooling: LangSmith or Weights & Biases for trace logging and experiment tracking, RAGAS for automated RAG evaluation, and human annotation platforms (Scale AI, Surge) for building golden datasets. Run evals as a CI step, so that no model swap ships without passing the eval suite.
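
To show what an eval gate in CI might look like, here is a minimal sketch that is independent of any of the tools named above. `EvalCase`, `run_gate`, the 95% threshold, and the stand-in model are all hypothetical; a real suite would call the deployed model and use richer checks than string predicates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One regression case: a prompt plus a predicate over the model output."""

    prompt: str
    check: Callable[[str], bool]


def run_gate(
    generate: Callable[[str], str],
    cases: list[EvalCase],
    min_pass_rate: float = 0.95,  # illustrative threshold, tune per product
) -> tuple[bool, list[str]]:
    """Run every case and report whether the pass rate clears the threshold."""
    if not cases:
        raise ValueError("the regression suite is empty")
    failures = [c.prompt for c in cases if not c.check(generate(c.prompt))]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate >= min_pass_rate, failures


# Stand-in for a real model call, used only to make the sketch runnable.
def fake_model(prompt: str) -> str:
    return prompt.upper()


suite = [
    EvalCase("hi", lambda out: out == "HI"),
    EvalCase("refund policy", lambda out: "refund" in out),  # fails: output is upper-case
]
ok, failed_prompts = run_gate(fake_model, suite)
# In a CI job, a failing gate would typically end with a non-zero exit code,
# for example: sys.exit(0 if ok else 1)
```

Each production failure appended to `suite` then becomes a permanent check that the next model or prompt change has to pass.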

The strongest answers describe an evaluation flywheel: production failures → new test cases → offline regression suite → model improvement → back to production. Show you're thinking about evaluation as a continuous process, not a one-time audit.
