GenAI Systems Lab Open interactive version →
AI Engineering 11 min read

Inverted Index From Scratch: Build a Search Engine in 100 Lines

The data structure at the heart of every search engine. Build it: tokenize, invert, store TF, compute BM25 scores at query time. Then extend to a positional index for phrase queries. The Lucene/Elasticsearch segment model explained.

Every search engine — from Elasticsearch to Lucene to the search inside your own application — is built on an inverted index. It is one of the most important data structures in computer science for information retrieval. The word 'inverted' refers to the inversion: instead of document → words (a forward index), it maps word → documents. This inversion makes term lookup O(1) instead of O(N), which is what makes text search at scale possible.

Building the inverted index

import math
import re
from collections import defaultdict, Counter

class InvertedIndex:
    def __init__(self):
        self.postings   = defaultdict(dict)    # term → {doc_id: tf}
        self.doc_lengths = {}                  # doc_id → total token count
        self.N           = 0                   # number of documents
        self.docs        = {}                  # doc_id → original text

    def _tokenize(self, text):
        return re.findall(r'[a-z]+', text.lower())

    def add_document(self, doc_id, text):
        tokens = self._tokenize(text)
        tf = Counter(tokens)
        for term, count in tf.items():
            self.postings[term][doc_id] = count       # store raw TF
        self.doc_lengths[doc_id] = len(tokens)
        self.docs[doc_id] = text
        self.N += 1

    def df(self, term):
        """Document frequency: how many docs contain this term."""
        return len(self.postings.get(term, {}))

    def tfidf_score(self, term, doc_id):
        """TF-IDF score for a term in a document."""
        tf = self.postings.get(term, {}).get(doc_id, 0)
        if tf == 0:
            return 0.0
        # Log-normalised TF × IDF
        tf_norm = 1 + math.log(tf)
        idf     = math.log(self.N / (self.df(term) + 1)) + 1
        return tf_norm * idf

    def bm25_score(self, term, doc_id, k1=1.5, b=0.75):
        """BM25 score."""
        tf  = self.postings.get(term, {}).get(doc_id, 0)
        if tf == 0:
            return 0.0
        df  = self.df(term)
        idf = math.log((self.N - df + 0.5) / (df + 0.5) + 1)
        avgdl = sum(self.doc_lengths.values()) / self.N
        dl    = self.doc_lengths[doc_id]
        tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
        return idf * tf_norm

    def search(self, query, k=5, method="bm25"):
        terms  = self._tokenize(query)
        scores = defaultdict(float)
        for term in terms:
            for doc_id in self.postings.get(term, {}):
                if method == "bm25":
                    scores[doc_id] += self.bm25_score(term, doc_id)
                else:
                    scores[doc_id] += self.tfidf_score(term, doc_id)
        ranked = sorted(scores.items(), key=lambda x: -x[1])[:k]
        return [(doc_id, round(score, 3), self.docs[doc_id][:80]) for doc_id, score in ranked]

    def stats(self):
        print(f"Documents: {self.N}  |  Unique terms: {len(self.postings)}")
        print(f"Avg doc length: {sum(self.doc_lengths.values())/self.N:.1f} tokens")


# ── Build index ───────────────────────────────────────────────────────────────
idx = InvertedIndex()
documents = {
    "d1": "BM25 is a probabilistic retrieval function used in search engines and information retrieval",
    "d2": "Dense retrieval uses neural embeddings to find semantically similar documents in vector space",
    "d3": "Hybrid search combines BM25 sparse retrieval with dense vector search using reciprocal rank fusion",
    "d4": "The inverted index maps each term to a list of documents containing that term with frequencies",
    "d5": "BM25 parameters k1 and b control term frequency saturation and length normalisation respectively",
    "d6": "Information retrieval systems must balance precision and recall for effective document search",
    "d7": "Dense embeddings capture semantic similarity while sparse BM25 captures exact lexical matches",
}
for doc_id, text in documents.items():
    idx.add_document(doc_id, text)
idx.stats()

print("
Search: 'BM25 retrieval parameters'")
for doc_id, score, text in idx.search("BM25 retrieval parameters", k=3):
    print(f"  [{doc_id}] score={score}  {text}...")

print("
Postings for term 'retrieval':", list(idx.postings.get("retrieval", {}).keys()))

Positional index and phrase queries

The basic inverted index stores (doc_id, tf) per term. A positional index stores (doc_id, [position_1, position_2, ...]) — every position where the term appears. With positions, you can answer phrase queries: 'information retrieval' requires 'information' and 'retrieval' to appear consecutively. The algorithm: find documents where both terms appear, then check if any position of 'information' is immediately followed by a position of 'retrieval'.

The cost: positional indexes are 2-4× larger than non-positional indexes. On a corpus with average document length 200 tokens, each posting stores ~200 positions instead of 1. Elasticsearch and Lucene use positional indexes by default. For most production search, the phrase query capability is worth the storage cost.

Index updates and merging

Real indexes are built with segment-based architecture (Lucene's model): new documents go into a small in-memory index (buffer). When the buffer is full, it is flushed to disk as a new segment. At query time, search all segments and merge results. Periodically, small segments are merged into larger ones. Deletion: mark documents as deleted (a delete bitmap), filter results at query time, and physically remove on the next merge. This segment model enables low-latency writes without rewriting the full index.

Extend this to a fielded index: create separate postings for title and body. Implement BM25F (BM25 with field weights) where title matches score higher than body matches. Then implement a positional index and test phrase queries. Building the full Lucene-style index from scratch takes about 200 lines and gives you a deep model of what Elasticsearch is doing when you call it.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →