KV Cache From Scratch: The Memory Trick That Makes Generation Practical
Build the KV cache: compute K and V once, append to the cache, reuse for every subsequent step. Time the difference. Calculate the memory cost for production models. Understand why GQA exists and what PagedAttention solves.
Without a KV cache, generating a 100-token response recomputes K and V for the entire sequence 100 times. With a KV cache, each K and V is computed once and reused. This is not an optimisation — it is the difference between a practical system and an unusable one. Understanding it precisely means you can calculate exactly how much memory your KV cache consumes and design systems around that constraint.
Why recomputation is wasteful
At step t you have generated tokens [t1...t_{t-1}] and predict t_t. You compute Q, K, V for all t tokens. At step t+1, the K, V vectors for t1...t_{t-1} are identical — those tokens did not change. You are recomputing them from scratch every step, wasting compute proportional to sequence length.
import numpy as np
import time
def sdpa(Q, K, V):
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
return w @ V
class AttentionLayer:
def __init__(self, d_model, d_k):
self.W_q = np.random.randn(d_model, d_k) * 0.1
self.W_k = np.random.randn(d_model, d_k) * 0.1
self.W_v = np.random.randn(d_model, d_k) * 0.1
self.cache_K = None
self.cache_V = None
def forward_no_cache(self, x):
Q = x @ self.W_q
K = x @ self.W_k
V = x @ self.W_v
return sdpa(Q, K, V)
def forward_with_cache(self, x_new):
k_new = x_new @ self.W_k
v_new = x_new @ self.W_v
q_new = x_new @ self.W_q
self.cache_K = k_new if self.cache_K is None else np.vstack([self.cache_K, k_new])
self.cache_V = v_new if self.cache_V is None else np.vstack([self.cache_V, v_new])
return sdpa(q_new, self.cache_K, self.cache_V)
def reset_cache(self):
self.cache_K = None; self.cache_V = None
d_model, d_k = 512, 64
layer = AttentionLayer(d_model, d_k)
tokens = [np.random.randn(1, d_model) for _ in range(50)]
t0 = time.perf_counter()
for t in range(1, 51):
_ = layer.forward_no_cache(np.vstack(tokens[:t]))
no_cache = time.perf_counter() - t0
layer.reset_cache()
t0 = time.perf_counter()
for tok in tokens:
_ = layer.forward_with_cache(tok)
cache = time.perf_counter() - t0
print(f"No cache: {no_cache*1000:.1f}ms | With cache: {cache*1000:.1f}ms | Speedup: {no_cache/cache:.1f}x")
print(f"Cache K shape: {layer.cache_K.shape}")
The memory calculation
KV cache memory per request: 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_float. For LLaMA-3-8B (32 layers, 8 KV heads from GQA, d_head=128, FP16): 2 × 32 × 8 × 128 × seq_len × 2 = 131,072 × seq_len bytes. At 4k context: 0.5GB per request. At 128k context: 16GB per request — on an 80GB A100, that is only 4 simultaneous long-context requests before VRAM is full.
This is why context length is a cost and deployment decision, not just a feature. Every token in context costs memory that could serve another request. PagedAttention in vLLM solves this: instead of pre-allocating a contiguous block for the maximum sequence length, it allocates KV cache in pages and maps them dynamically, dramatically increasing concurrent request capacity.
GQA and MQA: shrinking the cache
Multi-Query Attention uses 1 K/V head shared across all Q heads → 32× smaller KV cache. Grouped Query Attention (LLaMA-3's approach): 32 Q heads, 8 K/V heads → 4× smaller cache with less quality loss than MQA. This is a direct engineering response to the memory arithmetic above, traded against a small quality reduction.
Calculate KV cache for GPT-4 (if known), LLaMA-3-8B, and LLaMA-3-70B at 4k and 128k context. This tells you exactly which configuration fits on which GPU — the most practical constraint in production LLM deployment planning.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →