How to Answer 'How Would You Reduce LLM Costs by 50%?' in a Senior Interview
Prompt caching, model routing, KV cache, quantisation, batching, context compression — a systematic framework for attacking LLM cost. Structured to walk through in an interview in under 10 minutes.
This is a favourite question at senior ML engineering interviews. It tests systems thinking — can you go beyond 'use a cheaper model' to articulate a systematic cost optimization strategy? Here's how to answer it in under 10 minutes with depth.
Frame it right first (30 seconds)
State the cost breakdown you're targeting: inference cost (tokens × price/token × volume) is the dominant term for most LLM applications. Storage and training are typically smaller. Inference cost has three levers: reduce tokens, reduce price per token, or reduce volume (fewer API calls for the same user outcome).
Lever 1: Reduce input tokens
- Prompt compression: RAGLite, LLMLingua — compress the prompt by removing low-value tokens while preserving meaning. 3–5x compression ratios with <5% quality loss on many tasks.
- Context summarization: for long conversations, summarize older turns rather than passing the full history. Compress episodic context to fit more interactions within the same budget.
- Retrieval precision: in RAG systems, passing 3 highly-relevant chunks beats passing 10 mediocre ones. Better reranking = fewer tokens to the LLM.
Lever 2: Reduce output tokens
- Explicit output length constraints: 'respond in under 200 words' in the system prompt. LLMs pad their responses to match perceived expected length without this constraint.
- Structured outputs (JSON mode): parsing structured output from free text requires reprompting on parse errors. JSON mode eliminates this retry cost.
- Speculative decoding: use a small draft model to generate candidate tokens, verified by the large model. Reduces generation latency without changing output quality.
Lever 3: Model routing
Not every query needs GPT-4 or Claude Opus. Build a router that classifies query complexity and routes to the right model tier.
- Simple queries (FAQ lookup, format conversion, templated responses): Haiku / GPT-4o-mini / Llama-8B.
- Medium complexity (multi-step reasoning, document analysis): Sonnet / GPT-4o.
- Hard queries (code architecture, ambiguous requests, high-stakes): Opus / o1 / GPT-4.
- Typical cost reduction: 40–60% with well-calibrated routing and <2% quality degradation on the complex-query subset.
Lever 4: Caching
- Prompt caching (Anthropic, OpenAI): cache the prefix of your system prompt. For applications with long system prompts (RAG context, code context), this alone reduces cost 60–80% on cached tokens.
- Semantic caching: cache responses by embedding similarity, not exact string match. Tools: GPTCache, Momento. High hit rate for FAQ-style applications.
- KV cache management: at the serving infrastructure level, ensure your KV cache hit rate is high. Reuse attention caches across requests with the same prefix.
Lever 5: Batching and quantization
- Batch inference: group requests together for GPU-efficient batch processing. Latency increases but cost per token drops significantly.
- Quantization: INT8 models run at ~80% the cost of FP16 with <1% quality degradation on most tasks. INT4 (with QLoRA-style quantization) at ~50% cost.
- Self-hosted models: for high-volume, latency-tolerant workloads, self-hosting Llama 3 or Mistral on A100s can be 5–10x cheaper than API at sufficient scale.
Putting it together: the audit process
Frame this as a priority order: (1) prompt caching — usually the fastest win. (2) output length constraints — trivial to implement. (3) model routing — high ROI at medium scale. (4) context compression — good for RAG and long-context apps. (5) quantization / self-hosting — for high-volume, ops-mature teams.
Mentioning that you'd measure before optimizing — logging token usage per request type, identifying the expensive queries, targeting the Pareto-dominant cost sources — signals engineering maturity that most interviewees skip.
Interactive lab:
- LLMLingua: Compressing Prompts for LLMs (Jiang et al., 2023)
- Anthropic prompt caching docs
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →