AI Engineering 11 min read

BM25 From Scratch: Why the 40-Year-Old Formula Still Wins Half Your Queries

50 lines of Python, no dependencies. Build BM25 from the IDF formula up, tune k1 and b, run it head-to-head against dense retrieval on the same queries, and understand exactly when each one wins — and why hybrid search needs both.

BM25 is 40 years old. It predates neural networks, embeddings, and transformers. It has no learned parameters. It fits in 50 lines of Python. It still beats dense retrieval on a large fraction of real-world queries — and if you are building a RAG system without it, you are probably leaving recall on the table.

Most engineers who use BM25 treat it as a black box: 'sparse retrieval, handles keywords, pass.' This post makes you build it from scratch. Once you have, you will understand exactly when it wins, when it loses, and what the k1 and b parameters actually do — which is what an interviewer is actually asking when they say 'explain hybrid search.'

What BM25 is computing

BM25 scores a document against a query by combining two signals: how often a query term appears in the document (term frequency, with saturation), and how rare the term is across the whole corpus (inverse document frequency). The IDF component is the insight: a document that contains the word 'the' matches every query that uses 'the', which is useless. A document that contains 'reciprocal rank fusion' matches a very specific query, which is highly informative.

The saturation is the other key insight. TF-IDF scores grow linearly with term frequency — a document that says 'transformer' 20 times scores twice as high as one that says it 10 times. BM25 saturates: after a term appears a few times, additional occurrences add diminishing returns. This makes BM25 less gameable and more robust to repetitive documents.

The implementation

import math
from collections import Counter

class BM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        """
        k1: term frequency saturation. Higher = TF matters more, no saturation.
            k1=0 reduces BM25 to IDF-only. Typical range: 1.2–2.0.
        b:  length normalisation. b=1 fully normalises by doc length.
            b=0 ignores length entirely. Typical range: 0.5–0.9.
        """
        self.k1 = k1
        self.b = b
        self.corpus = [doc.lower().split() for doc in corpus]
        self.N = len(self.corpus)
        self.avgdl = sum(len(d) for d in self.corpus) / self.N
        # df[term] = number of documents containing term
        self.df = {}
        for doc in self.corpus:
            for term in set(doc):
                self.df[term] = self.df.get(term, 0) + 1

    def _score(self, query_terms, doc_idx):
        doc = self.corpus[doc_idx]
        tf = Counter(doc)
        dl = len(doc)
        score = 0.0
        for term in query_terms:
            if term not in self.df:
                continue
            # IDF: rare terms get high weight, common terms near zero
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            # TF with saturation (k1) and length normalisation (b)
            tf_norm = (tf[term] * (self.k1 + 1)) / (
                tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            )
            score += idf * tf_norm
        return score

    def retrieve(self, query, k=3):
        terms = query.lower().split()
        scores = [(i, self._score(terms, i)) for i in range(self.N)]
        ranked = sorted(scores, key=lambda x: -x[1])[:k]
        return [(self.corpus[i], round(s, 3)) for i, s in ranked if s > 0]


corpus = [
    "BM25 is a probabilistic retrieval function used in search engines",
    "Dense retrieval uses neural embeddings to find semantically similar documents",
    "BM25 outperforms dense retrieval on exact keyword and rare term queries",
    "Hybrid search combines BM25 sparse retrieval with dense vector search using RRF",
    "Vector databases store embeddings for approximate nearest neighbour search",
    "The k1 parameter in BM25 controls how quickly term frequency saturates",
    "The b parameter controls length normalisation — b=0.75 is the standard default",
    "Reciprocal rank fusion merges ranked lists from multiple retrievers without score normalisation",
]

bm25 = BM25(corpus)
for query in ["BM25 keyword exact", "neural embedding similarity", "k1 parameter"]:
    print(f"\nQuery: '{query}'")
    for doc, score in bm25.retrieve(query, k=2):
        print(f"  [{score}] {' '.join(doc[:8])}...")

What to watch when you run this

Query 'BM25 keyword exact': the document containing all three terms scores highest. The word 'exact' is relatively rare in the corpus, so its IDF is high — it does significant work in separating documents. Query 'neural embedding similarity': BM25 returns the dense retrieval document, not because it understood semantics, but because the words 'neural' and 'embedding' appear in the right document. This is BM25's strength and its limit in one example.

Now try a query like 'how do transformers learn meaning?' — BM25 returns nothing or near-zero scores because 'transformers', 'learn', and 'meaning' do not appear in the corpus. Dense retrieval would return a semantically relevant document even with vocabulary mismatch. This is the exact complementarity that hybrid search exploits.

Experiment: set k1=0 and observe how every result now ignores term frequency entirely — pure IDF scoring. Then set b=0 and watch short documents suddenly outrank long ones. Understanding these parameters is what lets you tune BM25 for domain-specific corpora rather than accepting defaults.

When BM25 beats dense retrieval

Exact product codes (SKU-A4821), proper nouns, rare technical terms, legal citations, numeric identifiers. Dense retrieval embeds 'SKU-A4821' into a point in vector space near other product codes — but cosine similarity between 'SKU-A4821' and 'SKU-B3920' may be high because they look structurally similar. BM25 treats them as distinct strings. If the user wants the exact SKU, BM25 wins cleanly.

BM25 also wins when the corpus is small (fewer than ~10k documents), when queries are highly specific, and when low latency matters more than semantic coverage. Dense retrieval requires encoding the query through a neural network at query time. BM25 is a lookup and a few multiplications — microseconds vs. milliseconds.

What RRF does when you combine them

Hybrid search gets its power from Reciprocal Rank Fusion, not from averaging scores. You cannot average BM25 scores and cosine similarities directly — they have completely different scales and distributions. RRF converts each ranked list into reciprocal ranks (1/rank) and sums them. Document ranked 1st by BM25 gets score 1/(1+60), document ranked 2nd gets 1/(2+60), and so on. The 60 is a constant that smooths the curve. This lets you combine any two ranked lists regardless of their original score distributions.

Hybrid search in the RAG Lab →: Configure BM25 weight vs. dense retrieval in Scenario 3 and watch how result sets change on the same query.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →