
How to Answer 'How Would You Reduce LLM Costs by 50%?' in a Senior Interview

Prompt caching, model routing, KV cache, quantisation, batching, context compression — a systematic framework for attacking LLM cost. Structured to walk through in an interview in under 10 minutes.

This is a favourite question at senior ML engineering interviews. It tests systems thinking — can you go beyond 'use a cheaper model' to articulate a systematic cost optimization strategy? Here's how to answer it in under 10 minutes with depth.

Frame it right first (30 seconds)

State the cost breakdown you're targeting: inference cost (tokens × price/token × volume) is the dominant term for most LLM applications. Storage and training are typically smaller. Inference cost has three levers: reduce tokens, reduce price per token, or reduce volume (fewer API calls for the same user outcome).
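As a rough illustration of the formula above, the following sketch computes monthly inference spend from tokens, price, and volume. The workload numbers and per-million-token prices are hypothetical placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_month: int
    input_tokens_per_request: int
    output_tokens_per_request: int
    input_price_per_mtok: float   # USD per million input tokens (hypothetical)
    output_price_per_mtok: float  # USD per million output tokens (hypothetical)


def monthly_cost(w: Workload) -> float:
    """Inference cost = tokens x price per token x volume."""
    # Price of one request, expressed in USD * 1e6 to delay the division.
    per_request_scaled = (
        w.input_tokens_per_request * w.input_price_per_mtok
        + w.output_tokens_per_request * w.output_price_per_mtok
    )
    return per_request_scaled * w.requests_per_month / 1_000_000


# Hypothetical baseline: 1M requests, 2,000 input and 500 output tokens each.
baseline = Workload(1_000_000, 2_000, 500, 3.0, 15.0)
# The same workload with input tokens halved (lever: reduce tokens).
trimmed = Workload(1_000_000, 1_000, 500, 3.0, 15.0)

print(f"baseline: ${monthly_cost(baseline):,.0f}")
print(f"trimmed:  ${monthly_cost(trimmed):,.0f}")
```

Each lever in the rest of the answer changes one field of this model, which makes it easy to estimate the effect of a change before building it.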

Lever 1: Reduce input tokens

Prompt compression: tools such as LLMLingua drop low-information tokens from the prompt while preserving its meaning. Compression ratios of 3–5x with under 5% quality loss are achievable on many tasks, but measure the loss on your own workload.
Context summarization: for long conversations, summarize older turns rather than passing the full history, so more of the conversation fits within the same token budget.
Retrieval precision: in RAG systems, passing 3 highly-relevant chunks beats passing 10 mediocre ones. Better reranking = fewer tokens to the LLM.
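One way to sketch context summarization: keep the most recent turns verbatim up to a token budget and replace everything older with a single summary. The 4-characters-per-token estimate and the `summarize` callable are assumptions for illustration; in practice you would use the model's tokenizer and an actual summarization call.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def compact_history(
    turns: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Keep recent turns verbatim; collapse older ones into one summary."""
    kept: List[str] = []
    used = 0
    # Walk from the newest turn to the oldest.
    for i in range(len(turns) - 1, -1, -1):
        cost = approx_tokens(turns[i])
        if used + cost > budget:
            # Everything up to and including turn i gets summarized.
            # Note: the summary's own tokens are not counted against the budget.
            return [summarize(turns[: i + 1])] + kept
        kept.insert(0, turns[i])
        used += cost
    return kept  # The whole history fit; nothing to summarize.


# Hypothetical stand-in for an LLM summarization call.
def fake_summarize(older: List[str]) -> str:
    return f"[summary of {len(older)} earlier turns]"


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(compact_history(history, budget=20, summarize=fake_summarize))
```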

Lever 2: Reduce output tokens

Explicit output length constraints: put 'respond in under 200 words' in the system prompt and set a maximum-token cap on the request. Without a constraint, LLMs tend to pad responses to match the length they think is expected.
Structured outputs (JSON mode): parsing structured data out of free text means reprompting on every parse error. JSON mode or schema-constrained output removes most of that retry cost.
Speculative decoding: a small draft model proposes candidate tokens and the large model verifies them. This does not reduce the token count; it cuts generation latency without changing output quality, so it affects cost mainly when you pay for GPU time on self-hosted serving.
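To put a number on the retry cost of free-text parsing, here is a small sketch of the expected calls per successful response. It assumes parse failures are independent and that you retry until one attempt parses; the failure rates are hypothetical.

```python
def expected_calls(parse_failure_rate: float) -> float:
    """Expected attempts per success under retry-until-parse (geometric)."""
    if not 0.0 <= parse_failure_rate < 1.0:
        raise ValueError("parse_failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - parse_failure_rate)


def retry_overhead(parse_failure_rate: float) -> float:
    """Fraction of extra spend caused by retries, relative to one call."""
    return expected_calls(parse_failure_rate) - 1.0


# Hypothetical failure rates for free-text parsing.
for rate in (0.02, 0.10, 0.20):
    print(f"failure rate {rate:.0%}: {retry_overhead(rate):.1%} extra spend")
```

Every retry also resends the full input prompt, so the overhead applies to input and output tokens alike.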

Lever 3: Model routing

Not every query needs GPT-4 or Claude Opus. Build a router that classifies query complexity and routes to the right model tier.

Simple queries (FAQ lookup, format conversion, templated responses): Haiku / GPT-4o-mini / Llama-8B.
Medium complexity (multi-step reasoning, document analysis): Sonnet / GPT-4o.
Hard queries (code architecture, ambiguous requests, high-stakes): Opus / o1 / GPT-4.
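Typical cost reduction: 40–60% with well-calibrated routing, while keeping quality degradation on the complex-query subset under 2%. Validate the router against a labelled evaluation set before quoting these numbers for your own system.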
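A minimal sketch of the routing idea, assuming a keyword heuristic in place of a trained classifier. The tier names, marker words, and prices are hypothetical; a production router would typically be a small classifier trained on labelled traffic.

```python
from typing import Dict

# Hypothetical blended prices in USD per million tokens for each tier.
TIER_PRICE: Dict[str, float] = {"small": 0.5, "medium": 3.0, "large": 15.0}

# Hypothetical markers of hard or high-stakes queries.
HARD_MARKERS = ("architecture", "design", "trade-off", "legal")
MEDIUM_MARKERS = ("analyze", "summarize", "compare")


def route(query: str) -> str:
    """Toy complexity router: returns the tier a query should go to."""
    text = query.lower()
    if any(marker in text for marker in HARD_MARKERS):
        return "large"
    if len(text.split()) > 40 or any(m in text for m in MEDIUM_MARKERS):
        return "medium"
    return "small"


def blended_price(traffic_mix: Dict[str, float]) -> float:
    """Average price per million tokens for a given share of traffic per tier."""
    return sum(TIER_PRICE[tier] * share for tier, share in traffic_mix.items())


print(route("What are your opening hours?"))
print(route("Summarize this contract clause"))
print(route("Propose an architecture for a multi-region queue"))
print(blended_price({"small": 0.6, "medium": 0.3, "large": 0.1}))
```

The saving you can claim depends on the baseline: compare the blended price against the tier that currently serves all traffic, not against the most expensive tier by default.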

Lever 4: Caching

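Prompt caching (Anthropic, OpenAI): cache the stable prefix of your prompt (system prompt, tool definitions, shared context). Cached input tokens are billed at a steep discount, roughly 50–90% off the normal input price depending on the provider, so applications with long repeated prefixes see the largest savings. Put static content first and per-request content last, because only a matching prefix hits the cache.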
Semantic caching: cache responses by embedding similarity, not exact string match. Tools: GPTCache, Momento. High hit rate for FAQ-style applications.
KV cache management: at the serving infrastructure level, ensure your KV cache hit rate is high. Reuse attention caches across requests with the same prefix.
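Semantic caching can be sketched as a nearest-neighbour lookup over query embeddings with a similarity threshold. The `toy_embed` function below is a bag-of-words stand-in so the example runs without dependencies; it is an assumption, and a real system would use an embedding model and a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # Linear scan; fine for a sketch.
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Hypothetical stand-in for an embedding model: counts of a few known words.
VOCAB = ["reset", "password", "refund", "order"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("reset password please"))  # similar query: served from cache
print(cache.get("refund my order"))        # unrelated query: cache miss
```

The threshold is the main design choice: set it too low and users get answers to a different question, so it is worth tuning against logged query pairs.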

Lever 5: Batching and quantization

Batch inference: group requests together for GPU-efficient batch processing. Latency increases but cost per token drops significantly.
Quantization: INT8 models run at ~80% of the cost of FP16 with <1% quality degradation on most tasks. INT4 (with GPTQ- or AWQ-style weight quantization) runs at ~50% of the cost, with a larger quality risk that should be evaluated per task.
Self-hosted models: for high-volume, latency-tolerant workloads, self-hosting Llama 3 or Mistral on A100s can be 5–10x cheaper than API pricing at sufficient scale.
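Two back-of-the-envelope calculations help when discussing this lever: how much memory the weights need at each precision, and what a self-hosted GPU costs per million tokens. The GPU price, throughput, and utilization below are hypothetical inputs, and the memory figure covers weights only (no KV cache or activations).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def self_hosted_price_per_mtok(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """USD per million tokens for one GPU at a given average utilization."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("need tokens_per_second > 0 and 0 < utilization <= 1")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# An 8-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(8e9, bits):.0f} GB")

# Hypothetical: $2/hour GPU, 1,000 tokens/s batched throughput, 50% utilization.
print(f"${self_hosted_price_per_mtok(2.0, 1000, 0.5):.2f} per million tokens")
```

Utilization is the term that usually decides whether self-hosting wins: an idle GPU still costs the full hourly rate, which is why the source restricts this lever to high-volume workloads.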

Putting it together: the audit process

Frame this as a priority order: (1) prompt caching, usually the fastest win; (2) output length constraints, trivial to implement; (3) model routing, high ROI at medium scale; (4) context compression, good for RAG and long-context apps; (5) quantization / self-hosting, for high-volume, ops-mature teams.

Mention that you'd measure before optimizing: log token usage per request type, identify the expensive queries, and target the few request types that account for most of the spend. This signals engineering maturity, and most interviewees skip it.
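A minimal sketch of that measurement step, assuming each logged request records a type plus input and output token counts. The field names, sample log, and prices are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cost_by_request_type(
    logs: List[Dict],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> List[Tuple[str, float]]:
    """Total spend per request type, most expensive first."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in logs:
        totals[rec["type"]] += (
            rec["input_tokens"] * input_price_per_mtok
            + rec["output_tokens"] * output_price_per_mtok
        ) / 1_000_000
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def top_contributors(ranked: List[Tuple[str, float]], share: float = 0.8) -> List[str]:
    """Smallest set of request types that covers `share` of total spend."""
    total = sum(cost for _, cost in ranked)
    picked: List[str] = []
    running = 0.0
    for name, cost in ranked:
        picked.append(name)
        running += cost
        if running >= share * total:
            break
    return picked


# Hypothetical aggregated log records.
sample_logs = [
    {"type": "rag_answer", "input_tokens": 900_000, "output_tokens": 0},
    {"type": "chat", "input_tokens": 80_000, "output_tokens": 0},
    {"type": "faq", "input_tokens": 20_000, "output_tokens": 0},
]
ranked = cost_by_request_type(sample_logs, 1.0, 4.0)
print(ranked)
print(top_contributors(ranked))
```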
