When Your LLM Is Too Slow: Diagnosing and Fixing Latency Regressions
How to tell whether latency comes from time to first token (TTFT), tokens-per-second throughput (TPS), retrieval, or the network. A step-by-step latency triage guide with the Latency Planner tool.
Your LLM feature launched and users love it. Then someone looks at the P99 latency chart and goes pale. 12 seconds. Your product manager sets a meeting. Your PM's manager sets a meeting. Everyone wants to know why it's slow and what you're going to do about it.
LLM latency diagnosis is a skill. The causes are different from regular API latency, the debugging tools are different, and the fixes require understanding what's actually happening inside the request lifecycle.
The LLM request lifecycle
A single LLM request has five sequential phases, each with its own latency budget:
| Phase | Typical time | What causes it to be slow |
|---|---|---|
| Pre-processing | 0–200ms | PII scrubbing, input validation, rate limit checks |
| Retrieval (RAG) | 100–2000ms | Embedding the query, vector search, reranking — each adds up |
| LLM network + queue | 50–500ms | Provider API overhead, cold start, queue depth under high load |
| TTFT (prefill) | 200–3000ms | Grows with input token count: longer context means slower TTFT |
| Generation (decode) | 1–30s | Proportional to output length — how many tokens are generated |
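To make the budget concrete, here is a minimal sketch that adds up one illustrative request and reports which phase dominates. The per-phase numbers are assumptions picked from inside the ranges in the table, not measurements, and `phase_shares` is a hypothetical helper name.

```python
from typing import Dict


def phase_shares(phase_ms: Dict[str, float]) -> Dict[str, float]:
    """Return each phase's fraction of the total request latency."""
    total = sum(phase_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    return {name: ms / total for name, ms in phase_ms.items()}


# Illustrative values only, chosen from the ranges in the table above.
example = {
    "preprocess": 100,
    "retrieval": 500,
    "network_queue": 200,
    "ttft_prefill": 1000,
    "decode": 5000,
}

total_ms = sum(example.values())
shares = phase_shares(example)
bottleneck = max(shares, key=shares.get)
print(f"total: {total_ms} ms, largest phase: {bottleneck} ({shares[bottleneck]:.0%})")
```

With these assumed numbers, decode accounts for roughly three quarters of the request, so trimming retrieval would barely move the total.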
Diagnosing where time is spent
You cannot fix what you cannot measure. Instrument every phase with a timer. Log them per request. Then look at your P50 and P99 breakdowns — the slow requests will tell you which phase is your bottleneck.
```python
import time
from dataclasses import dataclass


@dataclass
class LatencyTrace:
    preprocess_ms: float = 0.0
    retrieval_ms: float = 0.0
    llm_ttft_ms: float = 0.0   # includes network + queue time before the first token
    llm_total_ms: float = 0.0  # start of the LLM call to the last token


# preprocess, retrieve, stream_llm and log_latency are your own pipeline functions.
async def traced_request(query, context):
    trace = LatencyTrace()

    t0 = time.perf_counter()
    cleaned_query = preprocess(query)
    trace.preprocess_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    chunks = await retrieve(cleaned_query)
    trace.retrieval_ms = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    seen_first_token = False
    try:
        async for token in stream_llm(cleaned_query, chunks):
            if not seen_first_token:
                trace.llm_ttft_ms = (time.perf_counter() - t2) * 1000
                seen_first_token = True
            yield token
    finally:
        # Runs even if the caller stops consuming the stream early
        trace.llm_total_ms = (time.perf_counter() - t2) * 1000
        log_latency(trace)  # Send to your observability stack
```
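Once traces are being logged, the per-phase P50 and P99 breakdown can be sketched as below. This assumes each trace has been converted to a plain dict (for example with `dataclasses.asdict`) and uses a simple nearest-rank percentile; `percentile` and `phase_breakdown` are hypothetical helper names, and most observability stacks will compute this for you.

```python
import math
from typing import Dict, List

PHASES = ("preprocess_ms", "retrieval_ms", "llm_ttft_ms", "llm_total_ms")


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def phase_breakdown(traces: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Compute P50 and P99 for every phase across a batch of traces."""
    return {
        phase: {
            "p50": percentile([t[phase] for t in traces], 50),
            "p99": percentile([t[phase] for t in traces], 99),
        }
        for phase in PHASES
    }
```

A phase whose P99 is far above its P50 is usually the one to investigate first, since it is what drags the tail of the overall latency chart.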
The fixes by phase
Slow retrieval
- Switch from exact search to approximate nearest-neighbour (ANN) search, for example an HNSW index in Qdrant or Weaviate; managed services such as Pinecone use ANN indexes by default
- Cache embeddings for repeated or near-duplicate queries
- Reduce top-K — fetching 20 chunks and reranking is slower than fetching 5 without reranking
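- Move to a faster embedding model: a smaller model such as text-embedding-3-small is generally quicker and cheaper than text-embedding-3-large, with modest quality loss on many tasks. Benchmark both on your own queries before switching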
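The embedding-cache idea from the list above can be sketched as a small LRU cache keyed on a normalised query. `EmbeddingCache` and `normalize` are hypothetical names, and `embed_fn` stands in for whatever embedding call you already make. Normalisation only catches trivial near-duplicates (case and whitespace); paraphrases would need a semantic cache, which this sketch does not attempt.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(query.lower().split())


class EmbeddingCache:
    """Small LRU cache in front of an embedding function (hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], max_size: int = 10_000):
        self._embed_fn = embed_fn
        self._max_size = max_size
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> List[float]:
        normalized = normalize(query)
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(normalized)
        self._store[key] = vector
        if len(self._store) > self._max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector
```

Tracking `hits` and `misses` lets you check the hit rate before deciding whether the cache is worth keeping.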
Slow TTFT (long input context)
- Reduce context: fewer retrieved chunks, compressed conversation history, tighter system prompt
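- Enable prompt caching: if your system prompt is static and sits at the start of the prompt, providers that support caching can skip most of the prefill work for that prefix, which cuts TTFT substantially
- Consider a smaller model: models in the Haiku or GPT-4o-mini class typically process input several times faster than frontier models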
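- Parallelise independent work before the LLM call: start retrieval while checks that don't change the query (rate limits, validation) are still running. This shortens the time before prefill starts rather than prefill itself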
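Running pre-processing and retrieval concurrently can be sketched with `asyncio.gather`. This only works for checks that do not rewrite the query, since retrieval here starts from the raw input. `run_checks` and `retrieve` are hypothetical stand-ins with simulated delays, not real service calls.

```python
import asyncio
from typing import List


async def run_checks(query: str) -> bool:
    """Hypothetical stand-in for rate-limit and validation checks."""
    await asyncio.sleep(0.01)  # simulated I/O
    return bool(query.strip())


async def retrieve(query: str) -> List[str]:
    """Hypothetical stand-in for query embedding plus vector search."""
    await asyncio.sleep(0.02)  # simulated I/O
    return [f"chunk about {query}"]


async def checks_and_retrieval(query: str) -> List[str]:
    # Both coroutines run concurrently, so the wait is roughly the slower
    # of the two rather than their sum.
    ok, chunks = await asyncio.gather(run_checks(query), retrieve(query))
    if not ok:
        raise ValueError("query rejected by pre-processing checks")
    return chunks


if __name__ == "__main__":
    print(asyncio.run(checks_and_retrieval("latency")))
```

The trade-off is that retrieval work is wasted whenever a check rejects the request, so this fits best when rejections are rare.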
Slow generation (long output)
- Always stream — don't wait for the full response before showing anything
- Set max_tokens aggressively — if you only need 200 tokens, cap at 300
- Instruct the model to be concise: 'Answer in 2-3 sentences maximum'
- Use speculative decoding (supported in vLLM) for self-hosted models: it can give up to 2–3× faster generation, depending on how often the draft model's tokens are accepted
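The arithmetic behind an aggressive `max_tokens` cap can be sketched as follows. The 1.5× headroom mirrors the 200-to-300 example in the list above; the decode speed of 50 tokens per second is purely an assumption for illustration, so substitute the throughput you actually measure. Both function names are hypothetical.

```python
import math


def output_cap(expected_tokens: int, headroom: float = 1.5) -> int:
    """Pick a max_tokens value with some headroom over the expected length."""
    return math.ceil(expected_tokens * headroom)


def worst_case_decode_seconds(max_tokens: int, tokens_per_second: float) -> float:
    """Upper bound on generation time if the model uses the whole cap."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_tokens / tokens_per_second


ASSUMED_TPS = 50.0  # assumption for illustration, not a benchmark

cap = output_cap(200)
capped_s = worst_case_decode_seconds(cap, ASSUMED_TPS)
loose_s = worst_case_decode_seconds(2000, ASSUMED_TPS)
print(f"cap={cap}: worst case {capped_s:.1f}s; cap=2000: worst case {loose_s:.1f}s")
```

A cap bounds the worst case but truncates any answer that runs over it, so pair it with a conciseness instruction rather than relying on it alone.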
Streaming is the single highest-impact latency improvement for user-facing applications. It doesn't reduce total latency; it changes perceived latency. A response that streams its first token in 0.8s feels fast even if total generation takes 8s, because users start reading while the rest of the response is still being generated.