AI Engineering 12 min read

Quantization Deep Dive: GPTQ, AWQ, GGUF, and When Each Wins

Cutting model size 4–8x without destroying quality. How GPTQ calibrates weight rounding with data, how AWQ uses activation statistics to protect the weights that matter most, what GGUF enables for CPU inference, and a practical guide to choosing your quantization stack for production.

Why Quantization Matters

A 70B parameter model in float32 needs 280GB of GPU memory for its weights alone (70 billion parameters × 4 bytes each). At 4 bits per weight, that drops to ~35GB, which fits on a single 80GB A100 with room left over for the KV cache and activations. Quantization is arguably the single most impactful technique for making large models deployable, and the three dominant formats (GPTQ, AWQ, GGUF) make different tradeoffs on accuracy, speed, and hardware compatibility.
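The arithmetic behind those figures is easy to check. The sketch below counts weight storage only, in decimal gigabytes; the function name is illustrative, and real deployments also need memory for activations, the KV cache, and any per-group scale metadata.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Weight footprint of a 70B-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```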

The Core Idea: Fewer Bits Per Weight

Model weights are typically trained and stored in 32-bit or 16-bit floating point. Quantization maps them to lower precision (8-bit, 4-bit, even 2-bit), which reduces memory and can accelerate matrix multiplication on hardware and kernels with low-precision integer support. The challenge: some weights matter much more than others, and naive round-to-nearest quantization at low bit widths can badly degrade accuracy.
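As a baseline, here is a minimal sketch of the naive approach: symmetric round-to-nearest quantization with one scale for the whole tensor. The function names and the sample weights are made up for illustration; production libraries add grouping, zero points, and bit packing on top of this.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes using a single absmax-derived scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.17, 0.08, -0.61, 0.33, 0.04, -0.29, 0.70]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.3f}", f"max_err={max_err:.3f}")
```

With round-to-nearest, no individual weight moves by more than half a quantization step, so the damage depends entirely on how large that step is.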

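Key insight from the literature: outliers (rare, very large values in the weights or in the activations they multiply) cause a disproportionate share of quantization error, because a single extreme value stretches the quantization range for everything that shares its scale. GPTQ, AWQ, and the block-wise schemes used in GGUF files are different strategies for limiting that damage while still quantizing aggressively.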
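A toy example shows why one large value is so damaging. With a single scale for the whole tensor, the outlier sets the step size and every small weight rounds to zero. Giving each small group its own scale, a common mitigation, confines the damage to the outlier's group. The weights and the group size of 4 are invented for illustration; real formats tend to use larger groups.

```python
def roundtrip(weights, bits=4):
    """Quantize to signed ints with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Seven small weights and one outlier (6.0) in the same tensor.
weights = [0.04, -0.03, 0.02, -0.05, 0.03, -0.02, 0.05, 6.0]

# One scale for everything: the outlier dictates the step size.
per_tensor_err = mean_abs_error(weights, roundtrip(weights))

# One scale per group of 4: only the outlier's group suffers.
group_size = 4
grouped = []
for i in range(0, len(weights), group_size):
    grouped.extend(roundtrip(weights[i:i + group_size]))
grouped_err = mean_abs_error(weights, grouped)

print(f"per-tensor: {per_tensor_err:.4f}  grouped: {grouped_err:.4f}")
```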

GPTQ: Post-Training Quantization with Calibration Data

GPTQ quantizes one layer at a time, using a small calibration dataset to decide how each weight should be rounded. It solves a layer-wise reconstruction problem: find 4-bit weights that minimize the layer's output error on the calibration data, using approximate second-order information to compensate for the error each rounding decision introduces. This is GPU-intensive but produces high-quality 4-bit models.

High-quality 4-bit quantization for GPU deployment
Requires a calibration dataset (usually 128 samples from the training distribution)
Produces .safetensors files compatible with vLLM, HF Transformers, and TGI
Quantization takes 30 minutes to 4 hours depending on model size
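The layer-wise objective can be illustrated with a toy that is deliberately not GPTQ's algorithm: it brute-forces every round-up/round-down combination for a four-weight layer and keeps the one with the lowest output error on a few calibration inputs. GPTQ reaches a good solution far more cheaply, which is what makes it practical at billions of weights. The weights, the grid step, and the calibration inputs here are invented for illustration.

```python
import math
from itertools import product


def output_error(w_ref, w_q, calib):
    """Sum of squared differences between layer outputs over calibration inputs."""
    total = 0.0
    for x in calib:
        y_ref = sum(a * b for a, b in zip(w_ref, x))
        y_q = sum(a * b for a, b in zip(w_q, x))
        total += (y_ref - y_q) ** 2
    return total


weights = [0.24, 0.34, 0.14, 0.44]   # one output row of a tiny layer
step = 0.1                           # assumed quantization grid spacing
calib = [
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 1.1, 1.0, 0.8],
    [1.2, 0.9, 1.1, 1.0],
]

# Baseline: round every weight to its nearest grid point independently.
rtn = [round(w / step) * step for w in weights]
rtn_err = output_error(weights, rtn, calib)

# Toy search: try rounding each weight down or up, keep the best combination.
choices = [(math.floor(w / step) * step, math.ceil(w / step) * step) for w in weights]
best = min(product(*choices), key=lambda q: output_error(weights, q, calib))
best_err = output_error(weights, best, calib)

print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"calibrated rounding error: {best_err:.4f}")
```

Round-to-nearest pushes all four weights down, so their errors add up on correlated inputs. The calibrated choice rounds some weights the "wrong" way individually so that the errors cancel in the output.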

AWQ: Activation-Aware Weight Quantization

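AWQ identifies which weights are 'salient', meaning they multiply input channels whose activations have high magnitude on real inputs, and protects them. Rather than keeping those weights in higher precision, it rescales the salient channels before quantization so they lose less to rounding, and every weight is still stored at the same low bit width. The AWQ authors report that this activation-aware approach outperforms GPTQ on most of their benchmarks at 4-bit while requiring less calibration data.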

AWQ is increasingly the default for quality-first production deployments. The AutoAWQ library supports most major model families and integrates with vLLM. If you're serving on NVIDIA data-center GPUs such as the A100 or H100, AWQ is a strong default choice.
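The activation-aware idea can be sketched in a few lines: scale up the weight on a high-activation input channel before quantizing, and divide the matching activation by the same factor so the exact product is unchanged. The numbers and the hand-picked factor of 2 below are assumptions for illustration; AWQ itself searches for per-channel scales using calibration activations.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric round-to-nearest with one absmax scale, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.30, -0.85, 0.20, 1.00]   # one output row
acts = [0.1, 0.2, 8.0, 0.1]           # channel 2 carries large activations
chan_scale = [1.0, 1.0, 2.0, 1.0]     # hand-picked: boost only the salient channel

y_ref = dot(weights, acts)

# Plain quantization: the salient weight's rounding error is multiplied by 8.0.
err_plain = abs(dot(quantize_dequantize(weights), acts) - y_ref)

# Scaled quantization: quantize w * s, feed x / s, so (w * s) * (x / s) == w * x.
scaled_w = [w * s for w, s in zip(weights, chan_scale)]
scaled_x = [x / s for x, s in zip(acts, chan_scale)]
err_scaled = abs(dot(quantize_dequantize(scaled_w), scaled_x) - y_ref)

print(f"plain: {err_plain:.3f}  scaled: {err_scaled:.3f}")
```

The scaling helps here because the boosted weight stays below the group's maximum, so the quantization step is unchanged while the salient weight's relative rounding error shrinks.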

GGUF: CPU-First Quantization for Local Inference

GGUF, the successor to the older GGML file format, is the model format used by llama.cpp, the dominant CPU inference runtime. It supports a range of quantization types (Q4_K_M, Q5_K_M, Q8_0, etc.), several of which mix precisions across tensors, and is optimized for CPU and Apple Silicon inference. GGUF is what powers Ollama, LM Studio, and most local model runners.
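To give a feel for the block-wise layout these quantization types rely on, here is a simplified sketch modeled on Q8_0: weights are split into blocks of 32, and each block stores one half-precision scale plus 32 signed 8-bit codes. This is an illustration rather than llama.cpp's actual code, and the function names are made up; the K-quant types such as Q4_K_M use more elaborate nested layouts.

```python
import math
import struct

BLOCK_SIZE = 32  # weights per block in this Q8_0-style sketch


def quantize_blocks(weights):
    """Pack each block as one fp16 scale followed by 32 int8 codes (34 bytes)."""
    assert len(weights) % BLOCK_SIZE == 0
    blocks = []
    for i in range(0, len(weights), BLOCK_SIZE):
        chunk = weights[i:i + BLOCK_SIZE]
        amax = max(abs(w) for w in chunk)
        d = amax / 127 if amax > 0 else 1.0
        # Round the scale through fp16, since that is how it is stored.
        d = struct.unpack("<e", struct.pack("<e", d))[0] or 1.0
        codes = [max(-127, min(127, round(w / d))) for w in chunk]
        blocks.append(struct.pack("<e32b", d, *codes))
    return blocks


def dequantize_blocks(blocks):
    out = []
    for blob in blocks:
        d, *codes = struct.unpack("<e32b", blob)
        out.extend(c * d for c in codes)
    return out


weights = [0.5 * math.sin(i) for i in range(64)]   # synthetic stand-in weights
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)

bits_per_weight = 8 * sum(len(b) for b in blocks) / len(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{bits_per_weight} bits/weight, max error {max_err:.5f}")
```

The per-block scale is why such a format costs slightly more than its nominal bit width: 34 bytes for 32 weights works out to 8.5 bits per weight.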

Format	Best For	Hardware	Accuracy vs fp16
GPTQ	High-quality GPU inference, vLLM/TGI deployment	NVIDIA GPU	~97–99%
AWQ	Quality-first GPU inference, easiest integration	NVIDIA GPU	~98–99%
GGUF Q4_K_M	CPU/Apple Silicon local inference	CPU, M-series Mac	~95–97%
GGUF Q8_0	CPU inference with near fp16 quality	CPU, M-series Mac	~99%
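The table can be folded into a small helper for a first-pass choice. This is only an encoding of the rows above under a hypothetical function name and hypothetical hardware labels; it is a starting point, not a substitute for benchmarking on your own workload.

```python
def suggest_format(hardware: str, near_fp16_quality: bool = False) -> str:
    """First-pass format suggestion based on the comparison table above."""
    hw = hardware.lower()
    if hw == "nvidia_gpu":
        # Both target GPU serving; the table gives AWQ a slight quality edge.
        return "AWQ or GPTQ"
    if hw in {"cpu", "apple_silicon"}:
        # Q8_0 trades size for near-fp16 quality; Q4_K_M is the smaller option.
        return "GGUF Q8_0" if near_fp16_quality else "GGUF Q4_K_M"
    raise ValueError(f"no suggestion for hardware: {hardware!r}")


print(suggest_format("nvidia_gpu"))
print(suggest_format("apple_silicon"))
print(suggest_format("cpu", near_fp16_quality=True))
```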
