AI Engineering 12 min read

FAISS From Scratch: How Approximate Nearest Neighbour Search Works

Build exact flat search, then an IVF index that searches only the closest clusters. Measure speedup and recall. Understand HNSW and Product Quantization at the right level of abstraction — the algorithm, not the API.

Every semantic search system eventually runs a nearest neighbour query: given a query vector, find the most similar vectors in a corpus. For 10 documents, compute 10 similarities and take the max. For 100 million documents, you need approximate nearest neighbour search. FAISS made this practical. Understanding its internals means understanding the core retrieval tradeoff: recall vs. latency.

Exact search: the baseline

import numpy as np
import time

def exact_search(query, corpus, k=5):
    q_n = query  / (np.linalg.norm(query)  + 1e-9)
    c_n = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-9)
    sims = c_n @ q_n
    top_k = np.argpartition(sims, -k)[-k:]
    return top_k[np.argsort(sims[top_k])[::-1]]

class IVFIndex:
    def __init__(self, n_clusters=64):
        self.n_clusters = n_clusters
        self.centroids = None
        self.cells = {}

    def train(self, corpus, n_iter=20):
        N, d = corpus.shape
        self.centroids = corpus[np.random.choice(N, self.n_clusters, replace=False)].copy()
        for _ in range(n_iter):
            dists = np.linalg.norm(corpus[:, None] - self.centroids[None], axis=2)
            assigns = dists.argmin(axis=1)
            new_c = np.zeros_like(self.centroids)
            counts = np.zeros(self.n_clusters)
            for i, c in enumerate(assigns):
                new_c[c] += corpus[i]; counts[c] += 1
            mask = counts > 0
            new_c[mask] /= counts[mask, None]
            if np.allclose(new_c, self.centroids, atol=1e-4):
                break
            self.centroids = new_c

    def add(self, corpus):
        dists = np.linalg.norm(corpus[:, None] - self.centroids[None], axis=2)
        for i, c in enumerate(dists.argmin(axis=1)):
            self.cells.setdefault(c, []).append((i, corpus[i]))

    def search(self, query, k=5, nprobe=8):
        centroid_dists = np.linalg.norm(self.centroids - query, axis=1)
        probe_cells = np.argpartition(centroid_dists, nprobe)[:nprobe]
        candidates = []
        for c in probe_cells:
            for idx, vec in self.cells.get(c, []):
                sim = float(vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-9))
                candidates.append((sim, idx))
        candidates.sort(reverse=True)
        return [idx for _, idx in candidates[:k]]

N, d = 10_000, 128
corpus = np.random.randn(N, d).astype(np.float32)
query  = np.random.randn(d).astype(np.float32)

t0 = time.perf_counter(); exact = exact_search(query, corpus, k=5); exact_t = time.perf_counter() - t0
ivf = IVFIndex(); ivf.train(corpus); ivf.add(corpus)
t0 = time.perf_counter(); approx = ivf.search(query, k=5, nprobe=8); ivf_t = time.perf_counter() - t0
print(f"Exact: {exact_t*1000:.1f}ms  IVF: {ivf_t*1000:.1f}ms  Speedup: {exact_t/ivf_t:.1f}x")
print(f"Recall@5: {len(set(exact)&set(approx))}/5")

HNSW and PQ: the production choices

IVF partitions vectors into clusters. nprobe controls how many clusters to search — more probes = higher recall, more latency. HNSW (Hierarchical Navigable Small World) uses a multi-layer navigable graph. Search starts at the top (coarsest) level, greedily follows edges to the nearest neighbour, descends to finer levels. HNSW often beats IVF on recall-at-latency for datasets up to ~100M vectors and does not require a training step.

Product Quantization (PQ) compresses vectors: split each 128-d float32 vector (512 bytes) into 8 sub-vectors, encode each with a 256-entry codebook (1 byte per sub-vector). Result: 8 bytes instead of 512 bytes — 64× compression. IVF + PQ enables billion-scale retrieval on a single machine. The tradeoff: slightly lower recall versus uncompressed IVF.

Index selection guide

IndexFlatIP: exact, for evaluation or <100k vectors. IndexIVFFlat: approximate, no compression, good for <10M vectors. IndexHNSWFlat: graph-based, often best recall/latency, no training step needed, ~100M vectors. IndexIVFPQ: billions of vectors, accepts recall trade-off for massive memory savings. Understanding the algorithms (which you now do) makes the FAISS decision guide interpretable rather than arbitrary.

Install faiss-cpu and compare IndexFlatIP vs IndexIVFFlat vs IndexHNSWFlat on 100k random vectors. Time 100 search queries. Measure Recall@10 vs. exact. Plot recall vs. latency — this is the ANN-Benchmarks methodology, and the results will tell you which index wins for your production constraints.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →