GenAI Systems Lab
AI Engineering 9 min read

How to Answer 'How Would You Reduce LLM Costs by 50%?' in a Senior Interview

Prompt caching, model routing, KV caching, quantization, batching, context compression: a systematic framework for attacking LLM cost, structured so you can walk through it in an interview in under 10 minutes.

This is a favourite question in senior ML engineering interviews because it tests systems thinking: can you go beyond 'use a cheaper model' and articulate a systematic cost optimization strategy? Here's how to answer it with depth in under 10 minutes.

Frame it right first (30 seconds)

State the cost breakdown you're targeting: inference cost (tokens × price/token × volume) is the dominant term for most LLM applications. Storage and training are typically smaller. Inference cost has three levers: reduce tokens, reduce price per token, or reduce volume (fewer API calls for the same user outcome).
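The cost formula above can be made concrete with a small calculator. This is a minimal sketch, not a pricing model for any real provider: the function name, the split into separate input and output prices, and every number in the example are assumptions chosen for illustration.

```python
def monthly_inference_cost(
    requests: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost = volume x tokens x price per token.

    Prices are expressed per million tokens. Splitting input and output
    prices is an assumption of this sketch, not something the formula requires.
    """
    cost_per_request = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return requests * cost_per_request


# Hypothetical baseline: 1M requests/month, 2,000 input and 500 output tokens each.
baseline = monthly_inference_cost(1_000_000, 2_000, 500, 3.0, 15.0)

# One illustrative change per lever, all numbers made up:
fewer_tokens = monthly_inference_cost(1_000_000, 1_000, 500, 3.0, 15.0)   # reduce tokens
cheaper_model = monthly_inference_cost(1_000_000, 2_000, 500, 0.5, 2.5)   # reduce price per token
fewer_calls = monthly_inference_cost(600_000, 2_000, 500, 3.0, 15.0)      # reduce volume

for label, cost in [
    ("baseline", baseline),
    ("fewer input tokens", fewer_tokens),
    ("cheaper model", cheaper_model),
    ("fewer calls", fewer_calls),
]:
    print(f"{label:>20}: ${cost:,.2f}")
```

Plugging in your own traffic numbers shows quickly which lever has the most room, which helps decide where to spend the rest of the answer.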

Lever 1: Reduce input tokens

Lever 2: Reduce output tokens

Lever 3: Model routing

Not every query needs GPT-4 or Claude Opus. Build a router that classifies query complexity and sends each query to the cheapest model tier that can handle it.
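A router of this kind might be sketched as follows. Everything here is hypothetical: the tier names, the prices, the keyword list, and the scoring heuristic are placeholders, and in practice the complexity estimate would more likely come from a trained classifier or a small model than from keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str              # hypothetical identifier, not a real model name
    price_per_mtok: float  # illustrative price per million tokens


SMALL = ModelTier("small-model", 0.5)
LARGE = ModelTier("large-model", 15.0)

# Assumed signals that a query needs multi-step reasoning.
COMPLEX_HINTS = ("prove", "refactor", "step by step", "analyze", "debug")


def estimate_complexity(query: str) -> float:
    """Return a rough complexity score in [0, 1].

    Longer queries and queries containing a reasoning hint score higher.
    This heuristic is a stand-in for a real classifier.
    """
    text = query.lower()
    score = min(len(text.split()) / 200, 1.0)
    if any(hint in text for hint in COMPLEX_HINTS):
        score += 0.5
    return min(score, 1.0)


def route(query: str, threshold: float = 0.5) -> ModelTier:
    """Send the query to the large tier only when it scores above the threshold."""
    return LARGE if estimate_complexity(query) >= threshold else SMALL


print(route("What is the capital of France?").name)
print(route("Debug this race condition and explain step by step").name)
```

The threshold is the knob worth discussing in the interview: lowering it protects quality at higher cost, raising it saves money but risks sending hard queries to a model that cannot handle them, so it should be tuned against an evaluation set rather than guessed.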

Lever 4: Caching

Lever 5: Batching and quantization

Putting it together: the audit process

Frame this as a priority order: (1) prompt caching, usually the fastest win; (2) output length constraints, trivial to implement; (3) model routing, high ROI at medium scale; (4) context compression, good for RAG and long-context apps; (5) quantization and self-hosting, for high-volume, ops-mature teams.

Say that you'd measure before optimizing: log token usage per request type, identify the expensive queries, and target the few sources that account for most of the cost. Most interviewees skip this step, and including it signals engineering maturity.
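The measurement step can be sketched as a small aggregation over request logs. This is an illustration under assumptions: the record fields (`type`, `input_tokens`, `output_tokens`), the prices, and the 80% cutoff are all hypothetical, and a real system would read these records from its own logging or billing data.

```python
from collections import defaultdict


def cost_by_request_type(records, input_price_per_mtok, output_price_per_mtok):
    """Sum the estimated cost of each request type from logged token counts."""
    totals = defaultdict(float)
    for record in records:
        totals[record["type"]] += (
            record["input_tokens"] * input_price_per_mtok
            + record["output_tokens"] * output_price_per_mtok
        ) / 1_000_000
    return dict(totals)


def top_cost_drivers(costs, share=0.8):
    """Return the most expensive request types that together reach `share` of total cost."""
    total = sum(costs.values())
    if total == 0:
        return []
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, cost in ranked:
        if running / total >= share:
            break
        selected.append(name)
        running += cost
    return selected


# Hypothetical log records; the field names are assumptions of this sketch.
logs = [
    {"type": "rag_answer", "input_tokens": 8_000, "output_tokens": 400},
    {"type": "rag_answer", "input_tokens": 8_000, "output_tokens": 400},
    {"type": "chat", "input_tokens": 500, "output_tokens": 200},
    {"type": "classify", "input_tokens": 200, "output_tokens": 5},
]

costs = cost_by_request_type(logs, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(top_cost_drivers(costs))
```

In this made-up sample the long-context request type dominates, which would point toward prompt caching or context compression before anything else.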
