Quantization Deep Dive: GPTQ, AWQ, GGUF, and When Each Wins
Cutting model size 4–8x without destroying quality. How GPTQ calibrates weight rounding with data, why AWQ's activation-aware scaling often outperforms it, what GGUF enables for CPU inference, and a practical guide to choosing your quantization stack for production.
Why Quantization Matters
A 70B-parameter model in float32 needs about 280GB of GPU memory for its weights alone (70 billion × 4 bytes). At 4 bits per weight, that drops to ~35GB, small enough to fit on a single 80GB A100 with room left for the KV cache and activations. Quantization is the single most impactful technique for making large models deployable, and the three dominant formats (GPTQ, AWQ, GGUF) make different tradeoffs on accuracy, speed, and hardware compatibility.
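The arithmetic behind those figures is just parameter count times bits per weight. A minimal sketch follows; the helper name `weight_memory_gb` is illustrative, and it counts weights only, ignoring the KV cache, activations, and the per-group scale metadata that real formats add.

```python
def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70_000_000_000  # 70B parameters

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):6.1f} GB")
```

In practice the 4-bit figure lands slightly above this estimate because each group of weights also stores a scale (and sometimes a zero point).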
The Core Idea: Fewer Bits Per Weight
Model weights are typically trained in 32-bit or 16-bit floating point, and most LLM checkpoints ship as fp16 or bf16. Quantization maps those values to lower precision (8-bit, 4-bit, even 2-bit), which cuts memory and often speeds up inference, both because less data has to be moved per token and because some hardware has native low-precision integer support. The challenge: some weights matter much more than others, and naive quantization destroys accuracy.
Key insight from the literature: a small number of outliers (rare, very large values in the weights or in the activations they multiply) cause a disproportionate share of quantization error. The methods behind GPTQ, AWQ, and GGUF's quantization types are different strategies for limiting the damage from those outliers while aggressively quantizing the rest.
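To make the outlier problem concrete, here is a small NumPy sketch of naive round-to-nearest quantization with one scale for the whole tensor. The weights are synthetic and the function name `quantize_absmax` is made up for this example; it is meant only to show how a single large value stretches the quantization grid for every other weight.

```python
import numpy as np


def quantize_absmax(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest using a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    scale = np.abs(w).max() / qmax              # grid step set by the largest value
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                            # dequantized values


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)            # synthetic "ordinary" weights

w_outlier = w.copy()
w_outlier[0] = 1.0                              # one rare, very large weight

# Mean absolute error on the ordinary weights (index 1 onward) in both cases.
err_clean = np.abs(w[1:] - quantize_absmax(w)[1:]).mean()
err_outlier = np.abs(w_outlier[1:] - quantize_absmax(w_outlier)[1:]).mean()

print(f"no outlier:   {err_clean:.5f}")
print(f"with outlier: {err_outlier:.5f}")
```

With the outlier present, the grid step becomes so coarse that most ordinary weights round to zero. Using one scale per small group of weights, rather than per tensor, confines that damage to the outlier's own group.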
GPTQ: Post-Training Quantization with Calibration Data
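GPTQ quantizes one layer at a time, using a small calibration dataset to decide how each weight should be rounded. It solves a layer-wise reconstruction problem: find 4-bit weights that minimize the layer's output error on the calibration data, using approximate second-order information to adjust the not-yet-quantized weights so they compensate for the rounding error already introduced. This is GPU-intensive but produces high-quality 4-bit models.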
- High-quality 4-bit quantization for GPU deployment
- Requires a calibration dataset (usually around 128 samples of text resembling the training distribution)
- Produces .safetensors files — compatible with vLLM, HF Transformers, TGI
- Quantization takes 30min–4hrs depending on model size
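The layer-wise reconstruction idea can be sketched in NumPy. This is a simplified, un-blocked illustration of error-compensated rounding in the spirit of GPTQ, not the reference implementation: the function names, the per-row scale, the damping value, and the synthetic data are all assumptions made for the example. It quantizes one input column at a time and pushes the resulting error onto the columns that have not been quantized yet, using a Cholesky factor of the inverse Hessian built from calibration inputs.

```python
import numpy as np


def row_scale(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """One symmetric scale per output row (illustrative choice)."""
    return np.abs(W).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)


def round_to_grid(w: np.ndarray, scale: np.ndarray, bits: int = 4) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gptq_like(W: np.ndarray, X: np.ndarray, bits: int = 4, damp: float = 0.01) -> np.ndarray:
    """W: (out, in) weights, X: (in, n_samples) calibration inputs."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    scale = row_scale(W, bits)
    H = X @ X.T                                   # Hessian of the layer-wise squared error (up to a constant)
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T    # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for i in range(d):
        q = round_to_grid(W[:, i:i + 1], scale, bits)[:, 0]
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # Adjust the not-yet-quantized columns to compensate for this column's error.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q


rng = np.random.default_rng(0)
# Synthetic calibration inputs with correlated channels (8 shared factors plus noise).
X = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 512)) + 0.5 * rng.normal(size=(64, 512))
W = rng.normal(size=(32, 64))

scale = row_scale(W)
W_rtn = round_to_grid(W, scale)                   # plain round-to-nearest baseline
W_gptq = gptq_like(W, X)


def output_error(Wq: np.ndarray) -> float:
    return float(np.linalg.norm(W @ X - Wq @ X) ** 2)


print(f"round-to-nearest output error: {output_error(W_rtn):.1f}")
print(f"GPTQ-style output error:       {output_error(W_gptq):.1f}")
```

Both results live on the same 4-bit grid; the only difference is which grid point each weight lands on. Real implementations process columns in blocks, use per-group scales, and run on GPU, which is where the quantization time in the list above comes from.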
AWQ: Activation-Aware Weight Quantization
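AWQ identifies which weights are 'salient', meaning they multiply input channels that carry large activation magnitudes on real inputs, and protects them. Rather than keeping those weights in higher precision, it scales the salient channels up before quantization and folds the inverse scale into the preceding operation, so every weight is still stored in low precision while the salient ones lose less relative accuracy. The AWQ authors report that this activation-aware approach outperforms GPTQ on many benchmarks at 4-bit while requiring less calibration data.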
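A minimal NumPy sketch of that scaling idea is below. It is an illustration under stated assumptions, not the AutoAWQ implementation: the function names, per-row quantization, the grid of exponents, and the synthetic data with four high-magnitude channels are all chosen for the example. It searches over an exponent `alpha`, scales each input channel by its mean activation magnitude raised to `alpha`, quantizes, and keeps whichever setting gives the lowest output error.

```python
import numpy as np


def quantize_rows(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def awq_like(W: np.ndarray, X: np.ndarray, bits: int = 4, n_grid: int = 20):
    """W: (out, in) weights, X: (in, n_samples) calibration activations."""
    act_mag = np.abs(X).mean(axis=1)              # per-input-channel activation magnitude
    ref = W @ X                                   # full-precision layer output
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid, endpoint=False):
        s = act_mag ** alpha                      # alpha = 0 gives s = 1 (plain rounding)
        s = s / np.sqrt(s.max() * s.min())        # keep the scales centred around 1
        # Quantize the scaled weights; dividing by s afterwards gives the
        # effective weights once 1/s is folded into the preceding operation.
        W_eff = quantize_rows(W * s, bits) / s
        err = float(np.linalg.norm(ref - W_eff @ X) ** 2)
        if err < best_err:
            best_err, best_alpha, best_W = err, float(alpha), W_eff
    return best_err, best_alpha, best_W


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 512))
X[:4] *= 20.0                                     # four channels with large activations
W = rng.normal(size=(32, 64))

err_rtn = float(np.linalg.norm(W @ X - quantize_rows(W) @ X) ** 2)
err_awq, best_alpha, W_awq = awq_like(W, X)

print(f"round-to-nearest output error: {err_rtn:.1f}")
print(f"AWQ-style output error:        {err_awq:.1f} (alpha = {best_alpha:.2f})")
```

Because `alpha = 0` is part of the search grid, the selected result can never be worse than plain rounding on the calibration data. Whether the gain carries over to held-out inputs depends on how representative the calibration activations are.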
AWQ is increasingly the default for quality-first production deployments. The AutoAWQ library supports most major model families and integrates with vLLM. If you're deploying on A100/H100 with GPU inference, AWQ is a strong first choice, though it is worth benchmarking against GPTQ on your own workload.
GGUF: CPU-First Quantization for Local Inference
GGUF (the successor to the older GGML file format) is the model format used by llama.cpp, the dominant CPU inference runtime. It supports a family of block-wise quantization types (Q4_K_M, Q5_K_M, Q8_0, etc.), some of which mix precisions across tensors, and is optimized for CPU and Apple Silicon inference. GGUF is what powers Ollama, LM Studio, and most local model runners.
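Q8_0 is the simplest of these types: weights are grouped into blocks of 32, and each block stores one fp16 scale plus one int8 value per weight. The sketch below mirrors that idea in NumPy. It illustrates block-wise quantization only; it is not llama.cpp's code or byte layout, and the function names are made up for this example. The K-quant types such as Q4_K_M use more elaborate nested block schemes that are not shown here.

```python
import numpy as np

BLOCK = 32  # weights per block


def quantize_q8_0_like(w: np.ndarray):
    """Return (per-block fp16 scales, int8 values) for a 1-D weight array."""
    blocks = w.astype(np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = (amax / 127.0).astype(np.float16)                   # one scale per block
    d_safe = np.where(d == 0, 1.0, d).astype(np.float32)    # avoid division by zero
    q = np.clip(np.round(blocks / d_safe), -127, 127).astype(np.int8)
    return d, q


def dequantize(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=4096).astype(np.float32)

d, q = quantize_q8_0_like(w)
w_hat = dequantize(d, q)

bits_per_weight = (BLOCK * 8 + 16) / BLOCK    # int8 values plus one fp16 scale per block
max_err = float(np.abs(w - w_hat).max())

print(f"bits per weight: {bits_per_weight}")
print(f"max abs error:   {max_err:.5f}")
```

Because every block carries its own scale, a large value only coarsens the grid for the other 31 weights in its block, which is one reason Q8_0 stays close to fp16 quality.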
| Format | Best For | Hardware | Approx. Quality Retained vs fp16 |
|---|---|---|---|
| GPTQ | High-quality GPU inference, vLLM/TGI deployment | NVIDIA GPU | ~97–99% |
| AWQ | Quality-first GPU inference, easiest integration | NVIDIA GPU | ~98–99% |
| GGUF Q4_K_M | CPU/Apple Silicon local inference | CPU, M-series Mac | ~95–97% |
| GGUF Q8_0 | CPU inference with near fp16 quality | CPU, M-series Mac | ~99% |