Image Embeddings and Visual Search: Building Production Image Retrieval

How CLIP embeddings power reverse image search, product discovery, and content moderation. The full pipeline: embed → index → ANN search → rerank. What breaks at scale and how to fix it.

Most large-scale image applications (reverse image search, e-commerce product discovery, content moderation, duplicate detection, medical image retrieval) run on the same core technology: image embeddings and approximate nearest neighbor (ANN) search. Understanding this pipeline end to end is what separates engineers who build these systems from engineers who only use them.

What Image Embeddings Are

An image embedding is a fixed-length vector that encodes the semantic content of an image. Two images of the same car model will have nearby embeddings. An image of a car and an image of a truck will sit closer together than an image of a car and an image of a flower. The geometry of the embedding space encodes visual and semantic similarity.
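
"Nearby" is usually measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors (real CLIP embeddings have hundreds of dimensions, and these numbers are invented purely for illustration) to show the car/truck/flower relationship described above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not output from any real model.
car = [0.9, 0.8, 0.1, 0.0]
truck = [0.8, 0.9, 0.2, 0.1]
flower = [0.1, 0.0, 0.9, 0.8]

car_truck = cosine_similarity(car, truck)    # high: vectors point the same way
car_flower = cosine_similarity(car, flower)  # low: vectors are nearly orthogonal
```

With these toy values, `car_truck` comes out far higher than `car_flower`, which is the property a search index relies on.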

CLIP embeddings are a dominant choice for production visual search because they're cross-modal: you can embed a text query ('red sports car') and find the nearest-neighbor images in a CLIP embedding index, with no need for human-tagged keywords or alt text. The semantic alignment between image and text embeddings is CLIP's core property.

The Full Pipeline

Embed: run every image in the catalog through the embedding model (CLIP ViT-L/14 @ 336px is the standard choice). Store each image as a 768-dimensional float32 vector.
Index: build an ANN index over the embedded vectors. HNSW (Hierarchical Navigable Small World) is the common default: query time grows roughly logarithmically with corpus size in practice, and the recall/speed tradeoff is tunable.
Query: embed the query (text or image) with the same model. Run ANN search to find the k nearest neighbors in the index.
Rerank: optionally run a more expensive cross-encoder reranker over the top-k candidates to improve precision. Important for high-stakes retrieval (medical, legal).
Return: return the images corresponding to the top-ranked embeddings.
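
The embed, index, and query steps above can be sketched in a few lines. This is a minimal illustration under loud assumptions: `embed` is a hypothetical lookup of invented 3-dimensional vectors standing in for the CLIP model, and `BruteForceIndex` is an exact search standing in for an ANN index such as HNSW. Only the shape of the pipeline is meant to carry over.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Stand-in for the embedding model: made-up vectors, not real CLIP output.
# A real system would run the same model over catalog images and queries.
FAKE_EMBEDDINGS = {
    "red_car.jpg": [0.9, 0.1, 0.0],
    "blue_truck.jpg": [0.7, 0.3, 0.1],
    "rose.jpg": [0.0, 0.2, 0.9],
    "query: red sports car": [1.0, 0.0, 0.1],
}

def embed(item):
    return normalize(FAKE_EMBEDDINGS[item])

class BruteForceIndex:
    """Exact nearest-neighbor search; a placeholder for an ANN index."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(vector)

    def search(self, query_vector, k):
        scored = [
            (sum(q * x for q, x in zip(query_vector, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(reverse=True)  # highest cosine similarity first
        return [(item_id, score) for score, item_id in scored[:k]]

# Embed + index the catalog, then embed the query with the same function.
index = BruteForceIndex()
for image in ["red_car.jpg", "blue_truck.jpg", "rose.jpg"]:
    index.add(image, embed(image))

results = index.search(embed("query: red sports car"), k=2)
top_ids = [item_id for item_id, _ in results]
```

Swapping the brute-force class for a real ANN library changes the `add` and `search` internals but not the calling code, which is why the index is kept behind a small interface here.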

Indexing at Scale

Scale	Index Choice	Typical Stack
<1M images	HNSW in-memory (Faiss, Qdrant)	Single machine, fast iteration
1M–100M images	IVF-PQ, or HNSW over quantized vectors	Qdrant, Weaviate, Pinecone (managed vector DB)
>100M images	Distributed HNSW / ScaNN	Google ScaNN, custom Faiss sharding
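
The scale tiers follow largely from memory arithmetic. The sketch below estimates raw vector storage for 768-dimensional float32 embeddings and compares it with a product-quantized code. The 64-byte code size is an assumption chosen for illustration; actual PQ settings are a tuning decision, and graph links, IDs, and metadata add overhead that is not counted here.

```python
def index_size_gib(num_vectors, bytes_per_vector):
    """Storage for the vectors alone, in GiB (no graph or metadata overhead)."""
    return num_vectors * bytes_per_vector / 2**30

DIM = 768
FLOAT32_BYTES = 4
raw_bytes = DIM * FLOAT32_BYTES  # 3072 bytes per uncompressed vector
pq_bytes = 64                    # assumed PQ code size, not a recommendation

for n in (1_000_000, 100_000_000):
    print(
        f"{n:>11,} vectors: raw {index_size_gib(n, raw_bytes):7.1f} GiB, "
        f"PQ {index_size_gib(n, pq_bytes):5.1f} GiB"
    )
```

Uncompressed vectors take about 2.9 GiB at 1M images and about 286 GiB at 100M, which is why the middle and upper tiers lean on quantization or sharding.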

What Breaks at Scale

Embedding drift: if you update your embedding model (e.g. CLIP → SigLIP), all existing embeddings become incompatible. You must re-embed the entire corpus. At 100M images, this is a significant engineering operation.
Recall degradation with quantization: IVF-PQ quantization (needed for memory efficiency at scale) trades recall for memory. Tune nprobe (the number of IVF cells searched per query) to hit your recall target; higher values raise recall but also raise latency.
Distribution shift in the corpus: as your catalog grows, the embedding space becomes denser. Queries that worked well at 1M items may miss relevant results at 100M. Monitor recall on a golden eval set as the corpus grows.
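Cross-modal gap: CLIP text and image embeddings share one space but are not perfectly aligned. Text vectors and image vectors tend to cluster apart, so similarity scores from text-to-image queries are not directly comparable with scores from image-to-image queries, and a threshold or ranking tuned on one can underperform on the other. Evaluate both query modalities separately.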
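
The nprobe and golden-eval-set points above can be made concrete with a toy experiment. The sketch below is a deliberately simplified inverted-file index (no product quantization, centroids sampled from the data instead of trained with k-means) over random vectors; `ToyIVF`, `recall_at_k`, and the sizes used are all illustrative assumptions, not any library's API. It measures mean recall@10 against exact search for several nprobe values.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(vectors, query, k):
    order = sorted(range(len(vectors)), key=lambda i: dist2(vectors[i], query))
    return order[:k]

class ToyIVF:
    """Toy inverted-file index: each vector is bucketed by its nearest centroid."""

    def __init__(self, vectors, num_cells, rng):
        self.vectors = vectors
        # Centroids are sampled data points; real systems train them (k-means).
        self.centroids = rng.sample(vectors, num_cells)
        self.cells = [[] for _ in range(num_cells)]
        for i, v in enumerate(vectors):
            self.cells[self._nearest_cells(v, 1)[0]].append(i)

    def _nearest_cells(self, v, n):
        order = sorted(
            range(len(self.centroids)), key=lambda c: dist2(self.centroids[c], v)
        )
        return order[:n]

    def search(self, query, k, nprobe):
        # Only vectors in the nprobe closest cells are considered.
        candidates = [
            i for c in self._nearest_cells(query, nprobe) for i in self.cells[c]
        ]
        candidates.sort(key=lambda i: dist2(self.vectors[i], query))
        return candidates[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = random.Random(0)
data = [[rng.random() for _ in range(8)] for _ in range(2000)]
queries = [[rng.random() for _ in range(8)] for _ in range(20)]  # "golden" set
ivf = ToyIVF(data, num_cells=32, rng=rng)

def mean_recall(nprobe, k=10):
    scores = [
        recall_at_k(ivf.search(q, k, nprobe), exact_top_k(data, q, k))
        for q in queries
    ]
    return sum(scores) / len(scores)

recall_by_nprobe = {n: mean_recall(n) for n in (1, 4, 32)}
```

Recall never decreases as nprobe grows, and probing all 32 cells reduces to exact search. In production the same measurement would be rerun on a fixed eval set as the corpus grows, so a drop in recall shows up before users notice it.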

Reranking for Precision

ANN search optimizes for recall — retrieving relevant items. It doesn't optimize for ranking precision. For applications where the top-3 results matter (product recommendations, medical image retrieval), add a reranking step: retrieve top-100 candidates by ANN, run a cross-encoder (a model that jointly encodes query + candidate) on each, return top-k by reranker score.
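
The retrieve-then-rerank flow can be sketched as a small function. Everything below the function is a hypothetical stand-in so the example runs without a model: `fake_ann_search` pretends to be the first-stage index, and `fake_rerank_score` is a toy word-overlap scorer standing in for a cross-encoder.

```python
def retrieve_then_rerank(query, ann_search, rerank_score, num_candidates=100, k=3):
    """Two-stage retrieval: cheap candidate generation, then expensive rescoring."""
    candidates = ann_search(query, num_candidates)
    rescored = sorted(
        candidates, key=lambda item: rerank_score(query, item), reverse=True
    )
    return rescored[:k]

# Hypothetical catalog, listed in the (imperfect) order the first stage returns.
CATALOG = ["red sedan", "red sports car", "blue sports car", "red rose", "sports bag"]

def fake_ann_search(query, n):
    # Stand-in for an ANN index lookup.
    return CATALOG[:n]

def fake_rerank_score(query, item):
    # Toy "cross-encoder": counts query words that appear in the candidate.
    return len(set(query.split()) & set(item.split()))

top = retrieve_then_rerank("red sports car", fake_ann_search, fake_rerank_score, k=2)
```

The expensive scorer runs only on the candidate set (100 items by default), so its cost is bounded regardless of corpus size. That bound is the reason the two stages are kept separate.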

For production visual search: never use cosine similarity as your final ranking signal without at least A/B testing a reranker. ANN recall at k=100 is typically 90%+, but precision at k=5 can be much lower. Rerankers can improve top-5 precision by 15–30%, though the gain depends on the domain and should be measured on your own eval set.

Content Moderation Use Case

CLIP embeddings are also used for content moderation: embed a known-bad image (e.g. an item of CSAM already confirmed by hash matching), then find all uploads within cosine distance 0.05 of that embedding. This 'semantic similarity' moderation catches near-duplicates, resizes, crops, and format-converted versions that exact hash matching misses. Major platforms run this pipeline at billions of images per day.
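
The distance check itself is simple. The sketch below flags uploads whose embedding falls within cosine distance 0.05 of a reference embedding. The vectors are made-up 3-dimensional placeholders, `flag_near_duplicates` is a hypothetical helper, and a linear scan stands in for the ANN range query a real system would use. In practice the threshold would need tuning against labeled data.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def flag_near_duplicates(reference, uploads, max_distance=0.05):
    """Return ids of uploads whose embedding lies within max_distance of reference."""
    return [
        upload_id
        for upload_id, vector in uploads.items()
        if cosine_distance(reference, vector) <= max_distance
    ]

# Invented placeholder vectors; real ones would come from the embedding model.
reference = [0.6, 0.8, 0.0]
uploads = {
    "resized_copy": [0.61, 0.79, 0.02],  # nearly the same direction
    "unrelated": [0.0, 0.1, 0.99],       # nearly orthogonal
}
flagged = flag_near_duplicates(reference, uploads)
```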

