AI Engineering 11 min read

ANN Algorithms Explained: IVF vs HNSW vs PQ — When to Use Which

Flat, IVF, HNSW, and Product Quantization: the recall-latency-memory tradeoffs for every corpus size. Benchmarked with FAISS on 100k vectors. The index selection decision framework that replaces guessing.

Every vector database — Pinecone, Qdrant, Weaviate, pgvector, FAISS — is built on approximate nearest neighbour algorithms. The algorithm you choose determines recall, latency, memory footprint, and whether you can update the index online. Understanding the tradeoffs at the algorithm level is what separates engineers who configure vector DBs from engineers who can choose between them and predict where each will fail.

The three families

Flat (exact): store all vectors, compute every distance at query time. O(N×d) per query. Perfect recall. Not practical above ~1M vectors. Use for: offline evaluation, small corpora, ground-truth computation. IVF (Inverted File): partition vectors into clusters (Voronoi cells) via k-means. At query time, search only the nprobe closest clusters. Recall vs. latency controlled by nprobe. Use for: medium-to-large corpora, acceptable training cost, tunable recall. HNSW (Hierarchical Navigable Small World): a multilayer graph. Higher layers are sparse (long-range connections), lower layers are dense (fine-grained). Search starts at top, greedy descent to bottom. No training step. Use for: production retrieval, best recall/latency tradeoff for <1B vectors.

# pip install faiss-cpu numpy
import numpy as np
import faiss
import time

np.random.seed(42)
N, d, k = 100_000, 128, 10

# Generate random corpus and queries
corpus  = np.random.randn(N, d).astype(np.float32)
faiss.normalize_L2(corpus)
queries = np.random.randn(100, d).astype(np.float32)
faiss.normalize_L2(queries)

# Ground truth via exact flat search
flat_index = faiss.IndexFlatIP(d)
flat_index.add(corpus)
t0 = time.perf_counter()
gt_scores, gt_ids = flat_index.search(queries, k)
flat_time = time.perf_counter() - t0
print(f"Flat (exact):    {flat_time*1000:.1f}ms for 100 queries  Recall@10: 100%  (ground truth)")

# IVF — train required
n_clusters = 256
ivf_index = faiss.IndexIVFFlat(faiss.IndexFlatIP(d), d, n_clusters, faiss.METRIC_INNER_PRODUCT)
ivf_index.train(corpus)
ivf_index.add(corpus)

for nprobe in [8, 32, 64]:
    ivf_index.nprobe = nprobe
    t0 = time.perf_counter()
    ivf_scores, ivf_ids = ivf_index.search(queries, k)
    ivf_time = time.perf_counter() - t0
    # Measure recall@10
    recall = np.mean([len(set(ivf_ids[i]) & set(gt_ids[i])) / k for i in range(len(queries))])
    print(f"IVF nprobe={nprobe:3d}:  {ivf_time*1000:.1f}ms  Recall@10: {recall:.2%}  Speedup: {flat_time/ivf_time:.1f}x")

# HNSW — no training required
hnsw_index = faiss.IndexHNSWFlat(d, 32)   # M=32 connections per node
hnsw_index.hnsw.efSearch = 64             # search budget
hnsw_index.add(corpus)
t0 = time.perf_counter()
hnsw_scores, hnsw_ids = hnsw_index.search(queries, k)
hnsw_time = time.perf_counter() - t0
recall = np.mean([len(set(hnsw_ids[i]) & set(gt_ids[i])) / k for i in range(len(queries))])
print(f"HNSW M=32 ef=64: {hnsw_time*1000:.1f}ms  Recall@10: {recall:.2%}  Speedup: {flat_time/hnsw_time:.1f}x")

HNSW: the graph structure

Each vector is a node. At the bottom layer, every node connects to M nearest neighbours (M=32 is a common default). At each higher layer, a fraction of nodes are promoted (roughly e^-layer). Top-layer nodes serve as entry points for coarse navigation; bottom-layer nodes serve fine-grained retrieval. Search: start at a random top-layer node, greedily navigate to the nearest node, descend a layer, repeat. Build time is O(N × M × log(N)). Query time is O(log(N) × M × efSearch) where efSearch is the beam width.

M controls the connectivity. Higher M = higher recall, more memory (each node stores M connections per layer). efSearch controls the search budget per query — more expansive search = higher recall, higher latency. efConstruction (build-time parameter) controls recall quality of the index: higher efC = better index structure, slower build. These three parameters are the HNSW tuning surface.

Product Quantization: billion-scale compression

For billion-scale corpora, storing full float32 vectors is not feasible. PQ splits each d-dimensional vector into M sub-vectors of d/M dimensions each. Each sub-vector is quantised to one of 256 codewords (1 byte). A 128-d float32 vector (512 bytes) becomes 8 bytes — 64× compression. Distance computation uses lookup tables over sub-vector distances. Recall loss is 3-8% compared to exact. IVF+PQ combines cluster-based filtering with per-cluster quantisation — the standard configuration for billion-scale production retrieval.

Decision framework: < 100k vectors → IndexFlatIP (exact). 100k-10M → IndexHNSWFlat (no training, best recall/latency). 10M-1B → IndexIVFFlat (if memory allows) or IndexIVFPQ (memory-constrained). > 1B → distributed IVF+PQ across multiple GPUs/nodes. Always validate with recall@k vs. your latency budget before committing to an index type.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →