
KV Cache: Why Adding One User Slows Everyone

The KV cache stores the key (K) and value (V) vectors for every previous token, so it grows linearly with sequence length. In batched inference, every user's cache lives in GPU memory at the same time. One long conversation shrinks the memory available to every other request in the batch.

It is 3am. Traffic is low — three concurrent users. One of them opened a support ticket eight months ago and has been appending to the same conversation thread ever since. Their history is 50,000 tokens. Latency for everyone spikes. The on-call engineer checks GPU utilization: normal. Checks request queue: normal. Checks memory: there it is.

To understand what happened, you need to understand what the KV cache actually is and where it lives.

During autoregressive generation, the attention mechanism needs the Key and Value vectors for every previous token to compute the next one. Without caching, generating token N requires recomputing K and V for all N-1 previous tokens on every step. Over a whole generation that work grows quadratically with sequence length, which is prohibitively slow. The KV cache solves this by keeping those vectors in GPU memory as they are computed: the prompt's K and V are stored during the first forward pass, and each new token then adds only its own K and V rather than triggering a full recompute.
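To make the saving concrete, here is a minimal sketch of one attention head decoding with and without a cache. It is a toy in pure Python, not how any real inference engine is written: the width `D`, the random weights `Wq`, `Wk`, `Wv`, and the helper names are all assumptions made for illustration.

```python
import math
import random

D = 8  # toy vector width (hypothetical; real models use thousands)
rng = random.Random(0)

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values seen so far."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project each token once and append its K and V to the cache."""
    keys, values = [], []  # this pair of lists is the KV cache
    outputs, projections = [], 0
    for x in tokens:
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
        projections += 2
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs, projections

slow_out, slow_proj = decode_without_cache(tokens)
fast_out, fast_proj = decode_with_cache(tokens)
print(slow_proj, fast_proj)  # projection counts for 6 tokens
```

Both paths produce the same attention outputs. The cached path simply trades repeated projection work for memory that grows by one K and one V vector per token, which is the cost the rest of this article is about.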

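The cache is not small. For each token in the sequence, for each layer in the model, you store two vectors of size d_model: one key and one value. A 32-layer model with d_model = 4096 using float16 (2 bytes per value) stores 32 layers × 2 vectors × 4096 values × 2 bytes = 524,288 bytes per token, roughly 0.5 MB per token. A 10,000-token conversation consumes about 5 GB of KV cache. A 50,000-token conversation consumes about 25 GB. These dimensions are illustrative and assume standard multi-head attention; models that use grouped-query or multi-query attention store fewer K and V values per token.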

Per token:
  32 layers × 2 (K + V) × 4096 dims × 2 bytes = 524,288 bytes ≈ 0.5 MB

User A: 50,000 tokens → 50,000 × 0.5 MB = 25.0 GB
User B:  2,000 tokens →  2,000 × 0.5 MB =  1.0 GB
User C:    500 tokens →    500 × 0.5 MB =  0.25 GB

GPU: 80 GB VRAM
  Model weights (70B, int8):   ~70 GB
  Remaining for KV caches:     ~10 GB

Without User A: B + C cache = 1.25 GB → fits comfortably
With    User A: all caches  = 26.25 GB → exceeds budget → eviction or queue
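The arithmetic above can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the worked example (32 layers, d_model = 4096, float16, a 10 GB cache budget); the function and variable names are illustrative. Like the text, it treats 0.5 MB × 1,000 as 0.5 GB, so the totals are rounded. In strict binary units User A's cache is closer to 24.4 GiB.

```python
def kv_bytes_per_token(n_layers, d_model, bytes_per_value=2):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return n_layers * 2 * d_model * bytes_per_value

per_token_bytes = kv_bytes_per_token(n_layers=32, d_model=4096)
per_token_mb = per_token_bytes / 2**20  # 0.5

# Sequence lengths from the scenario in the text.
users = {"A": 50_000, "B": 2_000, "C": 500}

# Same rounding convention as the text: 1 GB = 1,000 MB.
cache_gb = {name: tokens * per_token_mb / 1000 for name, tokens in users.items()}

budget_gb = 10.0  # VRAM left after model weights, per the text
total_gb = sum(cache_gb.values())
without_a_gb = total_gb - cache_gb["A"]

for name, gb in cache_gb.items():
    print(f"User {name}: {gb:.2f} GB")
print(f"Without A: {without_a_gb:.2f} GB, fits: {without_a_gb <= budget_gb}")
print(f"With A:    {total_gb:.2f} GB, fits: {total_gb <= budget_gb}")
```

Changing `n_layers`, `d_model`, or `bytes_per_value` shows how quickly the per-token cost moves with model size and precision.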

Batched inference means all active users' KV caches must coexist in GPU memory simultaneously. This is not a per-user problem. User A's 25 GB of cache does not just slow down User A: it consumes memory that would otherwise hold the KV caches of Users B and C, or let the server batch more requests together. When memory pressure hits, the system either evicts cache entries (forcing expensive recomputation), reduces batch size (reducing throughput), or queues new requests (increasing latency for everyone).
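One way to see the shared-budget effect is a toy admission check: walk the requests in arrival order and admit each one only if its cache still fits. This is a simplified sketch, not the scheduling policy of any particular serving stack. `plan_batch` is a hypothetical helper, and User D (a 9 GB cache) is an invented example added to show a long request crowding out a short one.

```python
def plan_batch(requests, budget_gb):
    """First-fit admission: admit each request whose KV cache fits the remaining budget."""
    admitted, queued, used_gb = [], [], 0.0
    for name, cache_gb in requests:
        if used_gb + cache_gb <= budget_gb:
            admitted.append(name)
            used_gb += cache_gb
        else:
            queued.append(name)  # waits, or must be evicted/offloaded to run
    return admitted, queued, used_gb

# The scenario from the text: User A alone exceeds the 10 GB budget.
article_case = plan_batch([("A", 25.0), ("B", 1.0), ("C", 0.25)], budget_gb=10.0)

# Hypothetical: a 9 GB conversation fits, but leaves no room for User C.
hypothetical_case = plan_batch([("D", 9.0), ("B", 1.0), ("C", 0.25)], budget_gb=10.0)

print(article_case)
print(hypothetical_case)
```

In the second case the smallest request in the system is the one that waits, even though it needs only 0.25 GB. Its latency is set by someone else's sequence length.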

The fix is not simple. You can cap maximum sequence length and evict the oldest tokens from very long contexts, at the cost of the model no longer seeing them. You can use PagedAttention, vLLM's approach, which stores the KV cache in fixed-size, non-contiguous memory blocks, reducing fragmentation and making eviction cheaper. You can offload cache to CPU RAM at a latency penalty. But the fundamental constraint remains: KV cache is a shared GPU resource, and one user's sequence length directly reduces what is available for everyone else in the batch.
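The paging idea can be sketched as a pool of fixed-size blocks with a per-sequence block table. The class below is a hypothetical illustration of that bookkeeping only. It is not vLLM's API, it stores no actual tensors, and the block size of 16 tokens is an assumption.

```python
class BlockPool:
    """Toy paged KV cache: sequences borrow fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}   # seq_id -> list of block ids (need not be contiguous)
        self.lengths = {}  # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the pool is exhausted."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free:
                return False
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id):
        """Return every block of a finished or evicted sequence to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = BlockPool(num_blocks=4, block_size=16)
for _ in range(40):                      # a long sequence takes 3 of the 4 blocks
    pool.append_token("long")
long_blocks = len(pool.tables["long"])

filled = all(pool.append_token("short") for _ in range(16))  # uses the last block
blocked = not pool.append_token("short")  # 17th token: no free block left

pool.release("long")                      # evicting the long sequence frees 3 blocks
resumed = pool.append_token("short")
print(long_blocks, filled, blocked, resumed)
```

Paging makes allocation and eviction cheap and avoids reserving memory a sequence never uses, but the pool is still finite. The short sequence stalls until the long one gives its blocks back.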

The 3am spike had nothing to do with request volume. It had everything to do with one conversation that had been growing for eight months.

KV cache lives in shared GPU memory, so one user with a 50K-token history does not just pay their own latency tax — they shrink the memory available for every concurrent request running in the same batch, causing system-wide latency spikes even when overall traffic is low.
