
LLM Inference Optimisation: Batching, Quantisation, and Speculative Decoding

How to reduce latency and cost at inference time without retraining. INT8/INT4 quantisation, continuous batching, speculative decoding explained.

LLM inference is expensive by default. A naive deployment of a 70B model may serve only around 5 requests per second, at high latency, while burning through a GPU budget fast. Inference optimisation is the discipline of extracting more performance from the same hardware: in favourable cases 10–100× more, depending on the workload and the baseline.

Quantisation: smaller numbers, faster math

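Model weights are stored as 32-bit or 16-bit floats by default. Quantisation converts them to 8-bit or 4-bit integers, plus a small number of scale factors used to map the integers back to real values. Relative to FP16, INT8 halves weight memory and INT4 quarters it. Smaller weights also mean less data to read from memory per generated token, which is usually the bottleneck during decoding, and hardware with integer matrix units can run the multiplications faster as well.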
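To make the conversion concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in plain Python. It is an illustration under simplifying assumptions, not how GPTQ or GGUF quantise: those formats use per-group scales and more careful rounding. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (assumption; real formats use per-group scales).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The quality loss in the table below comes from this rounding error accumulating across billions of weights. Finer-grained scales reduce it at the cost of a little extra storage.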

Format	Memory (7B model)	Quality loss	Use case
FP32	28 GB	None (baseline)	Training only
FP16	14 GB	Negligible	Standard inference
INT8	7 GB	1–2% on benchmarks	Production serving
INT4 (GGUF/GPTQ)	3.5 GB	3–5% on benchmarks	Edge, consumer GPU
INT2/1-bit	~1–2 GB	Significant	Research / extreme edge
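The memory column follows from simple arithmetic: parameter count times bits per weight. A small sketch, assuming weights only (no KV cache or activations) and 1 GB = 10^9 bytes:

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, bits):
    """Memory for the weights alone, with 1 GB taken as 10**9 bytes."""
    return num_params * bits / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantised files tend to be somewhat larger than this estimate, because they also store scale factors and often keep a few tensors at higher precision.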

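llama.cpp and Ollama use GGUF quantised models to run 7B–70B models on consumer hardware (MacBook, gaming PC), provided the machine has enough memory for the chosen size. A Q4_K_M quantised Llama 3 8B can run at roughly 30+ tokens/second on an M2 MacBook Pro, depending on the chip variant and context length.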

Continuous batching

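Traditional static batching groups requests into fixed batches and runs each batch until its longest sequence finishes, so slots freed by shorter sequences sit idle in the meantime. Continuous batching (used by vLLM and TGI) schedules at the level of individual decoding steps: the moment a sequence finishes, a waiting request takes its slot, which keeps GPU utilisation high under steady load.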
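The difference is easy to see in a toy simulation. The sketch below counts decoding steps for both strategies, under the simplifying assumptions that every step costs the same, prefill is free, and memory is never the limit. The function names and the workload are made up for illustration, and this is not how vLLM or TGI implement their schedulers.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step: every active sequence emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Alternating short and long outputs (hypothetical workload), two slots.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

On this workload static batching needs 32 steps and continuous batching needs 22, close to the lower bound of 20 for 40 tokens on two slots. The gain grows as output lengths become more uneven.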

Speculative decoding

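A small "draft" model generates several candidate tokens very quickly. The large "target" model then checks all of them in a single forward pass and keeps the longest prefix it agrees with, so one target pass can emit multiple tokens. With the standard acceptance rule, the output distribution is the same as decoding with the target model alone. Typical speedup: 2–3× for tasks with predictable next tokens (code, structured output).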
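The draft-then-verify loop can be sketched with toy models. This is the greedy special case, where a draft token is accepted only if it equals the target's own greedy choice. Production systems usually apply a probabilistic acceptance rule instead. The two "models" here are hypothetical stand-in functions over integer tokens, and the verification loop stands in for what would be one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding. Returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each position. A real implementation scores all
    #    k positions in a single forward pass; this loop stands in for it.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Everything matched: the same pass also yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy target model: counts upwards modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def generate(prefix, n_tokens, k=4):
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + n_tokens], rounds


tokens, rounds = generate([2], n_tokens=12)
print(tokens, rounds)
```

Each round costs one target pass but yields between 1 and k + 1 tokens, so the speedup depends on how often the draft agrees with the target. The result is always identical to what the target would have produced on its own.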

KV-cache management: the memory bottleneck

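The KV cache stores the key and value tensors for every token in the sequence, so attention does not have to recompute them at each decoding step. It grows linearly with sequence length and batch size, and for long contexts or large batches it becomes the dominant memory consumer after the weights themselves. vLLM's PagedAttention manages the KV cache like virtual memory: it splits the cache into fixed-size blocks that are allocated on demand, which removes most fragmentation and lets more sequences fit on the GPU at once. The vLLM authors report 2–4× higher throughput than earlier serving systems at comparable latency.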
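Two small sketches make this concrete: the size of the cache, and the idea of paging it. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed example configuration, and `BlockAllocator` is a hypothetical toy rather than vLLM's actual data structure. The block size of 16 tokens is likewise an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence. The factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


class BlockAllocator:
    """Toy paged KV cache: fixed-size blocks handed out only when needed."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token, full_context)

alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")
print(alloc.block_tables["a"])
```

With this shape each token costs 128 KiB of cache, and a single 8192-token sequence costs 1 GiB. Because blocks are allocated as tokens arrive, a 17-token sequence holds two blocks instead of a reservation sized for the maximum context, and at most one partly filled block per sequence is wasted.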

vLLM is a common default for self-hosted inference serving. For most teams: use a managed API (OpenAI, Anthropic, Together) until you hit $10K+/month in inference costs, then evaluate self-hosting.

