AI Engineering 8 min read

When Your LLM Is Too Slow: Diagnosing and Fixing Latency Regressions

How to identify whether latency comes from time to first token (TTFT), tokens per second (TPS), retrieval, or the network. A step-by-step latency triage guide with the Latency Planner tool.

Your LLM feature launched and users love it. Then someone looks at the P99 latency chart and goes pale. 12 seconds. Your product manager sets a meeting. Your PM's manager sets a meeting. Everyone wants to know why it's slow and what you're going to do about it.

LLM latency diagnosis is a skill. The causes are different from regular API latency, the debugging tools are different, and the fixes require understanding what's actually happening inside the request lifecycle.

The LLM request lifecycle

A single LLM request has five sequential phases, each with its own latency budget:

| Phase | Typical time | What causes it to be slow |
|---|---|---|
| Pre-processing | 0–200 ms | PII scrubbing, input validation, rate limit checks |
| Retrieval (RAG) | 100–2000 ms | Embedding the query, vector search, reranking; each step adds up |
| LLM network + queue | 50–500 ms | Provider API overhead, cold start, queue depth under high load |
| TTFT (prefill) | 200–3000 ms | Proportional to input token count: longer context means slower TTFT |
| Generation (decode) | 1–30 s | Proportional to output length: how many tokens are generated |
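To see how these ranges add up, here is a small sketch that sums the low and high ends of each phase. The dictionary and function names are illustrative assumptions; the numbers are the table's typical ranges, not measurements from any particular system.

```python
# Typical per-phase ranges from the table above, in milliseconds: (low, high).
# The phase keys are illustrative names, not part of any API.
PHASE_BUDGET_MS = {
    "preprocess": (0, 200),
    "retrieval": (100, 2000),
    "llm_network_queue": (50, 500),
    "ttft_prefill": (200, 3000),
    "generation_decode": (1000, 30000),
}


def total_range_ms(budget):
    """Sum the low ends and the high ends, since the phases run sequentially."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def share_of_worst_case(budget):
    """Fraction of the worst-case total that each phase contributes."""
    _, high = total_range_ms(budget)
    return {phase: hi / high for phase, (_, hi) in budget.items()}


low, high = total_range_ms(PHASE_BUDGET_MS)
print(f"best case ~{low} ms, worst case ~{high} ms")
# best case ~1350 ms, worst case ~35700 ms

for phase, share in share_of_worst_case(PHASE_BUDGET_MS).items():
    print(f"{phase:>20}: {share:.0%} of worst case")
```

With these figures, decode accounts for roughly 84% of the worst-case total, so a 12-second P99 is well within what the table's ranges allow.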

Diagnosing where time is spent

You cannot fix what you cannot measure. Instrument every phase with a timer and log the timings per request. Then look at the P50 and P99 breakdowns: the slow requests will tell you which phase is your bottleneck.
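Once per-phase timings are being logged, the P50/P99 breakdown can be computed with a few lines. This is a minimal sketch using a nearest-rank percentile; the field names and the sample traces are hypothetical, and in practice your observability stack may already compute these for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def phase_breakdown(traces, phases):
    """Return {phase: {"p50": ..., "p99": ...}} for a list of per-request timing dicts."""
    return {
        phase: {
            "p50": percentile([t[phase] for t in traces], 50),
            "p99": percentile([t[phase] for t in traces], 99),
        }
        for phase in phases
    }


def worst_phase(breakdown, key="p99"):
    """Name of the phase with the largest value at the given percentile."""
    return max(breakdown, key=lambda phase: breakdown[phase][key])


# Hypothetical logged traces in milliseconds; real ones would come from your logs.
traces = [
    {"preprocess_ms": 20, "retrieval_ms": 300, "llm_ttft_ms": 600},
    {"preprocess_ms": 25, "retrieval_ms": 350, "llm_ttft_ms": 700},
    {"preprocess_ms": 30, "retrieval_ms": 1800, "llm_ttft_ms": 650},
    {"preprocess_ms": 22, "retrieval_ms": 320, "llm_ttft_ms": 2900},
]

breakdown = phase_breakdown(traces, ["preprocess_ms", "retrieval_ms", "llm_ttft_ms"])
for phase, stats in breakdown.items():
    print(f"{phase}: p50={stats['p50']} ms, p99={stats['p99']} ms")
print("largest P99:", worst_phase(breakdown))
```

Per-phase percentiles are computed independently, so they do not sum to the end-to-end P99. To explain one specific slow request, look at that request's own trace.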

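import time
from dataclasses import dataclass

@dataclass
class LatencyTrace:
    """Per-request timings in milliseconds."""
    preprocess_ms: float = 0.0
    retrieval_ms: float = 0.0
    llm_ttft_ms: float = 0.0   # request sent -> first token (network + queue + prefill)
    llm_total_ms: float = 0.0  # request sent -> last token

# preprocess, retrieve, stream_llm and log_latency are your own functions.
async def traced_request(query, context):
    trace = LatencyTrace()

    # Phase 1: pre-processing
    t0 = time.perf_counter()
    cleaned_query = preprocess(query)
    trace.preprocess_ms = (time.perf_counter() - t0) * 1000

    # Phase 2: retrieval
    t1 = time.perf_counter()
    chunks = await retrieve(cleaned_query)
    trace.retrieval_ms = (time.perf_counter() - t1) * 1000

    # Phases 3-5: LLM call, timed from the moment the request is sent
    t2 = time.perf_counter()
    seen_first_token = False
    try:
        async for token in stream_llm(cleaned_query, chunks):
            if not seen_first_token:
                trace.llm_ttft_ms = (time.perf_counter() - t2) * 1000
                seen_first_token = True
            yield token
    finally:
        # Runs even if the stream fails or the caller closes it early,
        # so the trace is still logged for aborted requests.
        trace.llm_total_ms = (time.perf_counter() - t2) * 1000
        log_latency(trace)  # Send to your observability stack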
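The trace above records TTFT and total LLM time, but not decode time or tokens per second (TPS). Both can be derived afterwards. The sketch below assumes you also know the number of output tokens for the request (for example from your provider's usage data); `LlmTiming` and the helper names are hypothetical and not part of the code above.

```python
from dataclasses import dataclass


@dataclass
class LlmTiming:
    ttft_ms: float       # request sent -> first token
    total_ms: float      # request sent -> last token
    output_tokens: int   # assumed to be available per request


def decode_ms(t):
    """Time spent generating after the first token arrived."""
    return t.total_ms - t.ttft_ms


def tokens_per_second(t):
    """Decode throughput: the tokens after the first one, divided by decode time."""
    d = decode_ms(t)
    if d <= 0 or t.output_tokens <= 1:
        return 0.0
    return (t.output_tokens - 1) / (d / 1000)


def dominant_llm_phase(t):
    """Rough triage: is most of the LLM time spent before or after the first token?"""
    return "ttft" if t.ttft_ms > decode_ms(t) else "decode"


timing = LlmTiming(ttft_ms=800, total_ms=8800, output_tokens=401)
print(decode_ms(timing))           # 8000
print(tokens_per_second(timing))   # 50.0
print(dominant_llm_phase(timing))  # decode
```

A request dominated by "ttft" points at input size or queueing; one dominated by "decode" points at output length or low throughput.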

The fixes by phase

Slow retrieval

Slow TTFT (long input context)

Slow generation (long output)

Streaming is the single highest-impact latency improvement for user-facing applications. It doesn't reduce total latency — it changes perceived latency. A response that streams its first token in 0.8s feels fast even if total generation takes 8s. Users read at the pace the model generates.
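The difference between perceived and total latency can be demonstrated with a simulated stream. This is a sketch only: `fake_llm_stream` is a stand-in for a real streaming client, and the delays are scaled down so it runs in a fraction of a second.

```python
import asyncio
import time


async def fake_llm_stream(n_tokens, ttft_s, per_token_s):
    """Stand-in for a streaming LLM call: waits ttft_s, then emits tokens at a fixed pace."""
    await asyncio.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(per_token_s)
        yield f"tok{i} "


async def measure(stream):
    """Return (seconds to first token, total seconds, full text) for an async token stream."""
    start = time.perf_counter()
    first = None
    parts = []
    async for token in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return first, total, "".join(parts)


ttft, total, text = asyncio.run(measure(fake_llm_stream(10, 0.05, 0.01)))
print(f"with streaming, the user sees output after {ttft:.2f}s")
print(f"without streaming, the user sees output after {total:.2f}s")
```

Total time is the same in both cases; streaming only changes when the user sees the first output.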
