Foundations & Architecture 11 min read

KV Cache From Scratch: The Memory Trick That Makes Generation Practical

Build the KV cache: compute K and V once, append to the cache, reuse for every subsequent step. Time the difference. Calculate the memory cost for production models. Understand why GQA exists and what PagedAttention solves.

Without a KV cache, generating a 100-token response recomputes K and V for the entire sequence 100 times. With a KV cache, each K and V is computed once and reused. This is not an optimisation — it is the difference between a practical system and an unusable one. Understanding it precisely means you can calculate exactly how much memory your KV cache consumes and design systems around that constraint.

Why recomputation is wasteful

At step t you have generated tokens [t1...t_{t-1}] and predict t_t. You compute Q, K, V for all t tokens. At step t+1, the K, V vectors for t1...t_{t-1} are identical — those tokens did not change. You are recomputing them from scratch every step, wasting compute proportional to sequence length.

import numpy as np
import time

def sdpa(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

class AttentionLayer:
    def __init__(self, d_model, d_k):
        self.W_q = np.random.randn(d_model, d_k) * 0.1
        self.W_k = np.random.randn(d_model, d_k) * 0.1
        self.W_v = np.random.randn(d_model, d_k) * 0.1
        self.cache_K = None
        self.cache_V = None

    def forward_no_cache(self, x):
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v
        return sdpa(Q, K, V)

    def forward_with_cache(self, x_new):
        k_new = x_new @ self.W_k
        v_new = x_new @ self.W_v
        q_new = x_new @ self.W_q
        self.cache_K = k_new if self.cache_K is None else np.vstack([self.cache_K, k_new])
        self.cache_V = v_new if self.cache_V is None else np.vstack([self.cache_V, v_new])
        return sdpa(q_new, self.cache_K, self.cache_V)

    def reset_cache(self):
        self.cache_K = None; self.cache_V = None

d_model, d_k = 512, 64
layer = AttentionLayer(d_model, d_k)
tokens = [np.random.randn(1, d_model) for _ in range(50)]

t0 = time.perf_counter()
for t in range(1, 51):
    _ = layer.forward_no_cache(np.vstack(tokens[:t]))
no_cache = time.perf_counter() - t0

layer.reset_cache()
t0 = time.perf_counter()
for tok in tokens:
    _ = layer.forward_with_cache(tok)
cache = time.perf_counter() - t0

print(f"No cache: {no_cache*1000:.1f}ms  |  With cache: {cache*1000:.1f}ms  |  Speedup: {no_cache/cache:.1f}x")
print(f"Cache K shape: {layer.cache_K.shape}")

The memory calculation

KV cache memory per request: 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_float. For LLaMA-3-8B (32 layers, 8 KV heads from GQA, d_head=128, FP16): 2 × 32 × 8 × 128 × seq_len × 2 = 131,072 × seq_len bytes. At 4k context: 0.5GB per request. At 128k context: 16GB per request — on an 80GB A100, that is only 4 simultaneous long-context requests before VRAM is full.

This is why context length is a cost and deployment decision, not just a feature. Every token in context costs memory that could serve another request. PagedAttention in vLLM solves this: instead of pre-allocating a contiguous block for the maximum sequence length, it allocates KV cache in pages and maps them dynamically, dramatically increasing concurrent request capacity.

GQA and MQA: shrinking the cache

Multi-Query Attention uses 1 K/V head shared across all Q heads → 32× smaller KV cache. Grouped Query Attention (LLaMA-3's approach): 32 Q heads, 8 K/V heads → 4× smaller cache with less quality loss than MQA. This is a direct engineering response to the memory arithmetic above, traded against a small quality reduction.

Calculate KV cache for GPT-4 (if known), LLaMA-3-8B, and LLaMA-3-70B at 4k and 128k context. This tells you exactly which configuration fits on which GPU — the most practical constraint in production LLM deployment planning.

Efficient Memory Management for LLM Serving with PagedAttention

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →