FAISS From Scratch: How Approximate Nearest Neighbour Search Works
Build exact flat search, then an IVF index that searches only the closest clusters. Measure speedup and recall. Understand HNSW and Product Quantization at the right level of abstraction — the algorithm, not the API.
Every semantic search system eventually runs a nearest neighbour query: given a query vector, find the most similar vectors in a corpus. For 10 documents, compute 10 similarities and take the max. For 100 million documents, you need approximate nearest neighbour search. FAISS made this practical. Understanding its internals means understanding the core retrieval tradeoff: recall vs. latency.
Exact search: the baseline
import numpy as np
import time
def exact_search(query, corpus, k=5):
q_n = query / (np.linalg.norm(query) + 1e-9)
c_n = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-9)
sims = c_n @ q_n
top_k = np.argpartition(sims, -k)[-k:]
return top_k[np.argsort(sims[top_k])[::-1]]
class IVFIndex:
def __init__(self, n_clusters=64):
self.n_clusters = n_clusters
self.centroids = None
self.cells = {}
def train(self, corpus, n_iter=20):
N, d = corpus.shape
self.centroids = corpus[np.random.choice(N, self.n_clusters, replace=False)].copy()
for _ in range(n_iter):
dists = np.linalg.norm(corpus[:, None] - self.centroids[None], axis=2)
assigns = dists.argmin(axis=1)
new_c = np.zeros_like(self.centroids)
counts = np.zeros(self.n_clusters)
for i, c in enumerate(assigns):
new_c[c] += corpus[i]; counts[c] += 1
mask = counts > 0
new_c[mask] /= counts[mask, None]
if np.allclose(new_c, self.centroids, atol=1e-4):
break
self.centroids = new_c
def add(self, corpus):
dists = np.linalg.norm(corpus[:, None] - self.centroids[None], axis=2)
for i, c in enumerate(dists.argmin(axis=1)):
self.cells.setdefault(c, []).append((i, corpus[i]))
def search(self, query, k=5, nprobe=8):
centroid_dists = np.linalg.norm(self.centroids - query, axis=1)
probe_cells = np.argpartition(centroid_dists, nprobe)[:nprobe]
candidates = []
for c in probe_cells:
for idx, vec in self.cells.get(c, []):
sim = float(vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-9))
candidates.append((sim, idx))
candidates.sort(reverse=True)
return [idx for _, idx in candidates[:k]]
N, d = 10_000, 128
corpus = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)
t0 = time.perf_counter(); exact = exact_search(query, corpus, k=5); exact_t = time.perf_counter() - t0
ivf = IVFIndex(); ivf.train(corpus); ivf.add(corpus)
t0 = time.perf_counter(); approx = ivf.search(query, k=5, nprobe=8); ivf_t = time.perf_counter() - t0
print(f"Exact: {exact_t*1000:.1f}ms IVF: {ivf_t*1000:.1f}ms Speedup: {exact_t/ivf_t:.1f}x")
print(f"Recall@5: {len(set(exact)&set(approx))}/5")
HNSW and PQ: the production choices
IVF partitions vectors into clusters. nprobe controls how many clusters to search — more probes = higher recall, more latency. HNSW (Hierarchical Navigable Small World) uses a multi-layer navigable graph. Search starts at the top (coarsest) level, greedily follows edges to the nearest neighbour, descends to finer levels. HNSW often beats IVF on recall-at-latency for datasets up to ~100M vectors and does not require a training step.
Product Quantization (PQ) compresses vectors: split each 128-d float32 vector (512 bytes) into 8 sub-vectors, encode each with a 256-entry codebook (1 byte per sub-vector). Result: 8 bytes instead of 512 bytes — 64× compression. IVF + PQ enables billion-scale retrieval on a single machine. The tradeoff: slightly lower recall versus uncompressed IVF.
Index selection guide
IndexFlatIP: exact, for evaluation or <100k vectors. IndexIVFFlat: approximate, no compression, good for <10M vectors. IndexHNSWFlat: graph-based, often best recall/latency, no training step needed, ~100M vectors. IndexIVFPQ: billions of vectors, accepts recall trade-off for massive memory savings. Understanding the algorithms (which you now do) makes the FAISS decision guide interpretable rather than arbitrary.
Install faiss-cpu and compare IndexFlatIP vs IndexIVFFlat vs IndexHNSWFlat on 100k random vectors. Time 100 search queries. Measure Recall@10 vs. exact. Plot recall vs. latency — this is the ANN-Benchmarks methodology, and the results will tell you which index wins for your production constraints.
- FAISS: A Library for Efficient Similarity Search — Johnson et al. (2019)
- FAISS wiki — index selection guide
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →