vLLM and PagedAttention: How Production LLM Serving Actually Works

Why pre-allocating contiguous KV cache blocks wastes GPU memory, how PagedAttention borrows virtual memory paging to fix it, and how continuous batching keeps GPU utilization above 80%. Plus: when to use vLLM vs TGI vs TorchServe vs Ollama.

Before vLLM, most LLM serving systems pre-allocated one contiguous block of KV cache memory per request, sized for the maximum possible sequence length. A request with a 2048-token limit reserved 2048 tokens of KV cache even if it only produced 50. As a result, only about 20-40% of the reserved KV cache memory held real token state. vLLM changed this with PagedAttention, which applies the operating-system idea of virtual memory paging to the KV cache. KV cache memory utilization rises to roughly 80-90%, and the reclaimed memory lets more requests run in each batch.

The problem with contiguous pre-allocation

Conventional attention kernels expect each sequence's keys and values to sit in one contiguous tensor. But you do not know in advance how long a sequence will be; you only discover its length as you generate. The standard approach was to allocate for the maximum possible length upfront. For a batch of 8 requests, each allocated 4096 tokens, you reserve 8 × 4096 = 32,768 token slots, even if the actual generated lengths are 100, 200, 150, and so on. The reserved-but-unused slots are wasted GPU memory that could have held other requests.
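To put rough numbers on that waste, here is a small back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the eight request lengths are assumptions chosen for illustration; only the first three lengths and the 8 × 4096 reservation come from the text above.

```python
# Hypothetical model shape and request lengths, chosen only for illustration.
N_LAYERS = 32
N_KV_HEADS = 8
D_HEAD = 128
BYTES_PER_VALUE = 2  # fp16


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies: K and V, for every layer and head."""
    return 2 * N_LAYERS * N_KV_HEADS * D_HEAD * BYTES_PER_VALUE


def preallocation_waste(actual_lengths, max_len):
    """Return (reserved_slots, used_slots, wasted_fraction) for contiguous pre-allocation."""
    reserved = len(actual_lengths) * max_len
    used = sum(actual_lengths)
    return reserved, used, 1 - used / reserved


actual_lengths = [100, 200, 150, 80, 300, 120, 250, 400]  # assumed lengths
reserved, used, wasted = preallocation_waste(actual_lengths, max_len=4096)

per_token = kv_bytes_per_token()
print(f"KV bytes per token: {per_token}")                          # 131072 (128 KiB)
print(f"Reserved: {reserved} slots = {reserved * per_token / 2**30:.1f} GiB")  # 32768 slots = 4.0 GiB
print(f"Used:     {used} slots = {used * per_token / 2**20:.1f} MiB")          # 1600 slots = 200.0 MiB
print(f"Wasted fraction: {wasted:.1%}")                            # 95.1%
```

Under these assumptions, roughly 4 GiB is reserved to hold about 200 MiB of real token state. Real workloads are less extreme, but the shape of the problem is the same.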

PagedAttention: virtual KV cache pages

PagedAttention divides the KV cache into fixed-size pages (e.g., 16 tokens per page; vLLM calls them blocks, hence "block table"). A sequence allocates pages on demand as it generates. A block table maps logical page numbers (sequence positions 0-15, 16-31, etc.) to physical page slots in GPU memory. The attention kernel is written to work with this non-contiguous layout: it follows the block table to fetch the right K and V values from wherever they are stored.
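The block-table lookup is the same address translation an OS page table performs. A minimal sketch of it in plain Python follows; the function names `locate` and `gather_positions` and the example block table are hypothetical, and the real lookup happens inside vLLM's CUDA attention kernels rather than in Python.

```python
def locate(block_table, token_pos, page_size=16):
    """Translate a logical token position into (physical_page, offset_in_page)."""
    logical_page, offset = divmod(token_pos, page_size)
    if logical_page >= len(block_table):
        raise IndexError(f"token {token_pos} is beyond the allocated pages")
    return block_table[logical_page], offset


def gather_positions(block_table, seq_len, page_size=16):
    """List the physical (page, offset) slot of every token, in sequence order."""
    return [locate(block_table, pos, page_size) for pos in range(seq_len)]


# Hypothetical block table: logical pages 0, 1, 2 live in physical pages 7, 2, 11.
block_table = [7, 2, 11]

print(locate(block_table, 0))    # (7, 0)
print(locate(block_table, 15))   # (7, 15)
print(locate(block_table, 16))   # (2, 0)
print(locate(block_table, 37))   # (11, 5)
```

Consecutive tokens 15 and 16 land in unrelated physical pages, which is exactly why the attention kernel has to be aware of the block table.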

class PagedKVCache:
    """
    Conceptual model of PagedAttention memory management.
    Real vLLM uses CUDA kernels for the actual attention computation.
    """
    def __init__(self, total_pages=1000, page_size=16, n_layers=32, d_head=128, n_heads=8):
        self.page_size   = page_size
        self.n_layers    = n_layers
        self.total_pages = total_pages
        # Physical storage: each page holds page_size K/V vectors per layer
        # Shape: (total_pages, page_size, 2, n_layers, n_heads, d_head)
        print(f"Physical KV store: {total_pages} pages × {page_size} tokens")

        self.free_pages = list(range(total_pages))   # physical page pool
        self.sequences  = {}                          # seq_id → block_table

    def allocate(self, seq_id, n_tokens):
        """Allocate pages for a new sequence on demand."""
        pages_needed = (n_tokens + self.page_size - 1) // self.page_size
        if len(self.free_pages) < pages_needed:
            raise RuntimeError("Out of KV cache memory (no free pages)")
        block_table = []
        for _ in range(pages_needed):
            physical_page = self.free_pages.pop(0)
            block_table.append(physical_page)
        self.sequences[seq_id] = block_table
        return block_table

    def extend(self, seq_id):
        """Add one more page when the sequence grows past a page boundary."""
        if not self.free_pages:
            raise RuntimeError("Out of pages — need to preempt a sequence")
        physical_page = self.free_pages.pop(0)
        self.sequences[seq_id].append(physical_page)
        return physical_page

    def free(self, seq_id):
        """Return pages to the pool when a sequence finishes."""
        pages = self.sequences.pop(seq_id, [])
        self.free_pages.extend(pages)
        return len(pages)

    def utilization(self):
        used = self.total_pages - len(self.free_pages)
        return used / self.total_pages

# ── Simulation ────────────────────────────────────────────────────────────────
import random

random.seed(42)
cache = PagedKVCache(total_pages=200, page_size=16)

# Simulate 20 requests generating different numbers of tokens
for seq_id in range(20):
    n_gen = random.randint(20, 200)
    pages_total = (n_gen + cache.page_size - 1) // cache.page_size
    try:
        # On-demand allocation: start with one page, then add a page each
        # time generation crosses a page boundary.
        bt = cache.allocate(seq_id, cache.page_size)
        for _ in range(pages_total - len(bt)):
            cache.extend(seq_id)
        if random.random() < 0.4:   # 40% of requests finish early
            freed = cache.free(seq_id)
            print(f"Seq {seq_id:2d}: generated {n_gen:3d} tokens, freed {freed} pages → utilization: {cache.utilization():.1%}")
        else:
            print(f"Seq {seq_id:2d}: generated {n_gen:3d} tokens, still running → utilization: {cache.utilization():.1%}")
    except RuntimeError as e:
        # Release any pages the failed sequence already holds so they are not leaked.
        cache.free(seq_id)
        print(f"Seq {seq_id:2d}: {e}")

Continuous batching

PagedAttention pairs naturally with continuous batching (also called iteration-level scheduling or in-flight batching). In naive batch serving, you form a batch and wait for every request in the batch to finish before starting new ones, so short requests sit idle behind the longest one. With continuous batching, the serving loop schedules one iteration (one generated token per running request) at a time. When a request finishes, its KV cache pages are freed immediately and waiting requests can join at the next iteration. GPU utilization stays high because the batch never stalls waiting for its longest sequence.
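The difference is easy to see in a toy scheduler. The sketch below counts decode iterations for both policies; it is a simplification that ignores prefill cost, memory limits, and preemption, and the request lengths and batch size are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Iterations when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iterations when a finished request's slot is refilled on the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration generates one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [10, 200, 15, 20, 180, 12, 30, 25]  # hypothetical output lengths
batch_size = 4
total_tokens = sum(lengths)

for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size)
    slot_use = total_tokens / (steps * batch_size)
    print(f"{name:10s}: {steps} iterations, slot utilization {slot_use:.1%}")
# static    : 380 iterations, slot utilization 32.4%
# continuous: 200 iterations, slot utilization 61.5%
```

In the static case the 180-token request cannot start until the 200-token request finishes. With continuous batching it takes over a slot at iteration 11, so the two long requests overlap.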

vLLM vs other serving options

TorchServe: wraps any PyTorch model in a REST API and handles model loading and basic batching, but has no LLM-specific KV cache management. Good for non-LLM models.

TGI (Hugging Face Text Generation Inference): implements continuous batching and FlashAttention. Good for Hugging Face-format models; often benchmarks somewhat behind vLLM on throughput, though the gap depends on model and workload.

vLLM: combines PagedAttention, continuous batching, and tensor parallelism, and is typically among the highest-throughput options on NVIDIA GPUs.

Ollama: wraps llama.cpp and targets local, single-user inference on CPUs, Apple Silicon, and consumer GPUs. Not designed for high-throughput multi-user serving.

The production choice: vLLM if you have NVIDIA GPUs and need maximum throughput (research, API serving). TGI if you need easy HuggingFace integration. Ollama if you are serving locally on CPU or Mac. TorchServe if you are serving non-LLM PyTorch models alongside LLMs.

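Before committing to serving infrastructure, benchmark vLLM against naive serving on your own model: run 100 concurrent requests and measure throughput (tokens/second) and p99 latency. The vLLM paper reports 2-4× higher throughput than earlier systems at similar latency, so a gain in that range at the same GPU count is a reasonable expectation, though the actual number depends on your model, prompt lengths, and traffic pattern.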
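A minimal harness for that measurement might look like the sketch below. `fake_generate` is a hypothetical stand-in that only sleeps; to benchmark a real server, swap in a coroutine that sends the prompt to your endpoint and returns the number of tokens generated. The nearest-rank percentile is one common convention, chosen here as an assumption.

```python
import asyncio
import math
import time


async def fake_generate(prompt: str) -> int:
    """Hypothetical stand-in for a real client call; returns tokens generated."""
    n_tokens = 20 + len(prompt) % 10
    await asyncio.sleep(0.001 * n_tokens)  # pretend decoding takes 1 ms per token
    return n_tokens


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def run_benchmark(generate, prompts):
    """Fire all prompts concurrently; return (tokens_per_second, p99_latency_seconds)."""
    async def timed(prompt):
        start = time.perf_counter()
        n_tokens = await generate(prompt)
        return n_tokens, time.perf_counter() - start

    wall_start = time.perf_counter()
    results = await asyncio.gather(*(timed(p) for p in prompts))
    wall = time.perf_counter() - wall_start

    total_tokens = sum(n for n, _ in results)
    latencies = [lat for _, lat in results]
    return total_tokens / wall, percentile(latencies, 99)


if __name__ == "__main__":
    prompts = [f"request {i}" for i in range(100)]
    tps, p99 = asyncio.run(run_benchmark(fake_generate, prompts))
    print(f"throughput: {tps:.0f} tokens/s, p99 latency: {p99 * 1000:.1f} ms")
```

Run the same harness against both serving stacks with identical prompts and generation settings, so the only variable is the server.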
