When Your LLM Is Too Slow: Diagnosing and Fixing Latency Regressions

How to identify whether latency comes from TTFT (time to first token), TPS (tokens generated per second), retrieval, or the network. A step-by-step latency triage guide with the Latency Planner tool.

Your LLM feature launched and users love it. Then someone looks at the P99 latency chart and goes pale. 12 seconds. Your product manager sets a meeting. Your PM's manager sets a meeting. Everyone wants to know why it's slow and what you're going to do about it.

LLM latency diagnosis is a skill. The causes are different from regular API latency, the debugging tools are different, and the fixes require understanding what's actually happening inside the request lifecycle.

The LLM request lifecycle

A single LLM request has five sequential phases, each with its own latency budget:

Phase	Typical time	What causes it to be slow
Pre-processing	0–200ms	PII scrubbing, input validation, rate limit checks
Retrieval (RAG)	100–2000ms	Embedding the query, vector search, reranking — each adds up
LLM network + queue	50–500ms	Provider API overhead, cold start, queue depth under high load
TTFT (prefill)	200–3000ms	Proportional to input token count — longer context = slower TTFT
Generation (decode)	1–30s	Proportional to output length — how many tokens are generated
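
To see how the phases in the table add up, here is a rough budget sketch. The `PhaseBudget` class and all the numbers in it are illustrative assumptions, not measurements. It treats decode time as output tokens divided by tokens per second (TPS), which is a simplification.

```python
from dataclasses import dataclass

@dataclass
class PhaseBudget:
    """Hypothetical per-request latency budget, one field per phase."""
    preprocess_ms: float
    retrieval_ms: float
    network_queue_ms: float
    prefill_ms: float          # the TTFT (prefill) row of the table
    output_tokens: int
    tokens_per_second: float   # decode throughput (TPS)

    def decode_ms(self) -> float:
        # Generation time grows linearly with the number of output tokens
        return self.output_tokens / self.tokens_per_second * 1000

    def first_token_ms(self) -> float:
        # Everything the user waits through before any output appears
        return (self.preprocess_ms + self.retrieval_ms
                + self.network_queue_ms + self.prefill_ms)

    def total_ms(self) -> float:
        return self.first_token_ms() + self.decode_ms()

# Made-up example values, chosen from inside the ranges in the table
budget = PhaseBudget(
    preprocess_ms=100,
    retrieval_ms=500,
    network_queue_ms=200,
    prefill_ms=1000,
    output_tokens=400,
    tokens_per_second=50,
)

print(f"first token after {budget.first_token_ms():.0f} ms")
print(f"decode takes      {budget.decode_ms():.0f} ms")
print(f"total             {budget.total_ms():.0f} ms")
```

With these example numbers, decode accounts for most of the total, which is why output length is usually the first thing to check on slow requests.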

Diagnosing where time is spent

You cannot fix what you cannot measure. Instrument every phase with a timer. Log them per request. Then look at your P50 and P99 breakdowns — the slow requests will tell you which phase is your bottleneck.

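import time
from dataclasses import dataclass

@dataclass
class LatencyTrace:
    preprocess_ms: float = 0.0
    retrieval_ms: float = 0.0
    llm_ttft_ms: float = 0.0   # measured client-side, so it includes network + queue time
    llm_total_ms: float = 0.0  # request sent -> last token received

# preprocess, retrieve, stream_llm and log_latency stand in for your own pipeline functions.
# `context` is accepted but not used in this sketch.
async def traced_request(query, context):
    trace = LatencyTrace()

    t0 = time.perf_counter()
    cleaned_query = preprocess(query)
    trace.preprocess_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    chunks = await retrieve(cleaned_query)
    trace.retrieval_ms = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    seen_first_token = False
    try:
        async for token in stream_llm(cleaned_query, chunks):
            if not seen_first_token:
                trace.llm_ttft_ms = (time.perf_counter() - t2) * 1000
                seen_first_token = True
            yield token
    finally:
        # Runs even if the caller stops reading early or the LLM call fails,
        # so slow and aborted requests still show up in the logs.
        trace.llm_total_ms = (time.perf_counter() - t2) * 1000
        log_latency(trace)  # Send to your observability stack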
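
Once traces are logged, the P50 and P99 breakdown can be computed per phase. The following is a minimal sketch, assuming each trace has been exported as a dict with the same field names as `LatencyTrace`. The sample traces are made up, and `percentile` uses the simple nearest-rank method rather than whatever your observability stack does.

```python
import math

PHASES = ("preprocess_ms", "retrieval_ms", "llm_ttft_ms", "llm_total_ms")

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def phase_breakdown(traces):
    """Return {phase: {"p50": ..., "p99": ...}} for a list of trace dicts."""
    result = {}
    for phase in PHASES:
        samples = [trace[phase] for trace in traces]
        result[phase] = {
            "p50": percentile(samples, 50),
            "p99": percentile(samples, 99),
        }
    return result

# Made-up traces: one request has a retrieval spike
traces = [
    {"preprocess_ms": 20, "retrieval_ms": 150, "llm_ttft_ms": 600, "llm_total_ms": 4000},
    {"preprocess_ms": 25, "retrieval_ms": 180, "llm_ttft_ms": 650, "llm_total_ms": 4200},
    {"preprocess_ms": 22, "retrieval_ms": 1900, "llm_ttft_ms": 700, "llm_total_ms": 4100},
    {"preprocess_ms": 30, "retrieval_ms": 160, "llm_ttft_ms": 620, "llm_total_ms": 3900},
]

breakdown = phase_breakdown(traces)

# The phase whose tail is furthest from its median is the first suspect
worst_phase = max(PHASES, key=lambda p: breakdown[p]["p99"] / breakdown[p]["p50"])
print(worst_phase, breakdown[worst_phase])
```

Comparing P99 to P50 per phase, rather than looking at totals, is what separates "everything is a bit slow" from "one phase occasionally blows up".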

The fixes by phase

Slow retrieval

Switch from exact search to approximate nearest-neighbour search (for example an HNSW index; vector databases such as Pinecone, Qdrant, and Weaviate all offer approximate search)
Cache embeddings for repeated or near-duplicate queries
Reduce top-K — fetching 20 chunks and reranking is slower than fetching 5 without reranking
Move to a faster embedding model: text-embedding-3-small can be around 5× faster than text-embedding-3-large with modest quality loss, but benchmark both on your own queries
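
The embedding cache from the list above could look something like this. It is a minimal in-process sketch: `embed_query` is a stand-in for a real embedding call, and `normalize`, `get_embedding`, and the cache size are assumptions for illustration. Normalising only catches queries that differ in case or whitespace; semantically similar queries would need a different approach.

```python
from functools import lru_cache

embed_calls = 0  # counts how often the (stand-in) embedding model is actually invoked

def embed_query(text: str) -> tuple[float, ...]:
    """Stand-in for a real embedding API call; returns a fake 2-dimensional vector."""
    global embed_calls
    embed_calls += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a cache key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=10_000)
def _cached_embedding(normalized_query: str) -> tuple[float, ...]:
    # Only reached on a cache miss
    return embed_query(normalized_query)

def get_embedding(query: str) -> tuple[float, ...]:
    return _cached_embedding(normalize(query))

a = get_embedding("How do I reset my password?")
b = get_embedding("  how do I   reset my PASSWORD? ")

print(a == b, embed_calls)
```

In a multi-process deployment the same idea would need a shared store instead of `lru_cache`, since each process keeps its own cache.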

Slow TTFT (long input context)

Reduce context: fewer retrieved chunks, compressed conversation history, tighter system prompt
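Enable prompt caching: if your system prompt is static, the provider can reuse the cached prefix and skip most of the prefill work for it
Consider a smaller model: models such as Haiku or GPT-4o-mini can process inputs roughly 3–5× faster than frontier models, depending on provider and load
Parallelise preprocessing and retrieval: start retrieval while independent pre-processing steps (such as rate limit checks) are still running. This shortens the wait before prefill begins rather than prefill itself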
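
Running independent steps concurrently can be sketched with `asyncio.gather`. The `check_rate_limit` and `retrieve` functions below are hypothetical stand-ins that sleep instead of doing real work, and the sketch assumes neither step needs the other's output. Steps that do depend on each other, such as retrieval on a cleaned query, still have to run in order.

```python
import asyncio
import time

async def check_rate_limit(user_id: str) -> bool:
    """Stand-in for a pre-processing step that does not depend on retrieval."""
    await asyncio.sleep(0.2)
    return True

async def retrieve(query: str) -> list[str]:
    """Stand-in for embedding the query plus vector search."""
    await asyncio.sleep(0.2)
    return [f"chunk about {query}"]

async def prepare(query: str, user_id: str):
    start = time.perf_counter()
    # Both coroutines run concurrently; gather returns results in argument order
    allowed, chunks = await asyncio.gather(
        check_rate_limit(user_id),
        retrieve(query),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    return allowed, chunks, elapsed_ms

allowed, chunks, elapsed_ms = asyncio.run(prepare("reset password", "user-123"))

# Sequential execution would take about 400 ms; concurrent takes about 200 ms
print(allowed, chunks, f"{elapsed_ms:.0f} ms")
```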

Slow generation (long output)

Always stream — don't wait for the full response before showing anything
Set max_tokens aggressively — if you only need 200 tokens, cap at 300
Instruct the model to be concise: 'Answer in 2-3 sentences maximum'
Use speculative decoding (supported in vLLM) for self-hosted models, which can give roughly a 2–3× generation speedup depending on the workload

Streaming is the single highest-impact improvement to perceived latency for user-facing applications. It doesn't reduce total latency; it changes how long the user waits before seeing anything. A response that streams its first token in 0.8s feels fast even if total generation takes 8s, because users start reading as soon as text appears.
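
The gap between perceived and total latency can be shown with a simulated stream. In this sketch `fake_stream` is a made-up generator that sleeps between tokens; it stands in for a real streaming LLM client, and the token count and delay are arbitrary.

```python
import asyncio
import time

async def fake_stream(n_tokens: int, delay_s: float):
    """Simulated LLM stream: yields one token every delay_s seconds."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i} "

async def measure(n_tokens: int = 20, delay_s: float = 0.02):
    start = time.perf_counter()
    first_token_s = None
    async for _token in fake_stream(n_tokens, delay_s):
        if first_token_s is None:
            # With streaming, the user sees output at this point
            first_token_s = time.perf_counter() - start
    # Without streaming, the user sees nothing until this point
    total_s = time.perf_counter() - start
    return first_token_s, total_s

first_token_s, total_s = asyncio.run(measure())

print(f"streamed: first output after {first_token_s * 1000:.0f} ms")
print(f"buffered: first output after {total_s * 1000:.0f} ms")
```

Total time is the same in both cases; only the moment the user first sees output moves.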
