GenAI Systems Lab

Text Preprocessing for Search: Tokenization, Stemming, and When to Stop

How a BM25 preprocessing pipeline works: tokenization, case-folding, stopwords (including the specific cases where you must NOT remove them), stemming vs lemmatization, and field-specific pipeline decisions.

LLM tokenization and search tokenization solve different problems. An LLM tokenizer (BPE, WordPiece) converts text into subword token IDs optimised for a neural network: it cares about vocabulary size, coverage of unseen words, and producing model input. A search tokenizer converts text into terms optimised for index lookups: it cares about matching, recall, and what a user actually typed. Applying LLM tokenization logic to a BM25 index is a common mistake.
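To make the contrast concrete, here is a minimal sketch that compares a word-level search tokenizer with a toy greedy longest-match subword splitter. The vocabulary `TOY_VOCAB` and both function names are invented for this illustration; a real BPE or WordPiece tokenizer learns its vocabulary from data and behaves differently in detail.

```python
import re

# Hypothetical subword vocabulary, invented for this illustration only.
TOY_VOCAB = {"token", "ization", "search", "ing", "s"}

def search_tokenize(text):
    """Word-level, lowercased terms: the unit a BM25 index stores."""
    return re.findall(r"\w+", text.lower())

def toy_subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split into vocabulary pieces (simplified, WordPiece-like)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

text = "Tokenization"
word_terms = search_tokenize(text)
subword_pieces = toy_subword_tokenize(text.lower())
print("search terms:  ", word_terms)
print("subword pieces:", subword_pieces)
```

If the subword pieces were indexed as BM25 terms, a query for 'token' would match every document containing 'tokenization', which is rarely what the user meant.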

What the search preprocessing pipeline actually does

A standard search preprocessing pipeline runs in order: tokenize the text into terms, case-fold to lowercase, remove or handle stopwords, and apply stemming or lemmatization. Each step trades recall for precision or precision for recall in specific ways. Understanding each step means knowing when to deviate from the defaults.

import re

# ─── 1. Tokenisation ─────────────────────────────────────────────────────────
def tokenize(text):
    """Case-fold to lowercase, then extract word tokens; hyphenated compounds stay whole."""
    return re.findall(r"\b\w+(?:-\w+)*\b", text.lower())

# ─── 2. Stopwords ────────────────────────────────────────────────────────────
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "be", "been", "being",
    "have", "has", "had", "do", "does", "did", "will", "would", "could",
    "should", "may", "might", "must", "can", "shall", "of", "in", "on",
    "at", "by", "for", "with", "about", "against", "to", "from", "up", "down",
    "and", "but", "or", "nor", "so", "yet", "both", "either", "neither",
    "not", "no", "only", "own", "same", "than", "too", "very",
    "just", "because", "as", "until", "while", "that", "this", "these", "those"
}

def remove_stopwords(tokens, custom_keep=None):
    """Drop stopwords, except any listed in custom_keep (e.g. negations)."""
    keep = custom_keep or set()
    return [t for t in tokens if t not in STOPWORDS or t in keep]

# ─── 3. Stemming (Porter-lite) ───────────────────────────────────────────────
def stem(word):
    """Minimal Porter rules — not complete, but illustrates the idea."""
    if word.endswith("ing") and len(word) > 6:
        return word[:-3]
    if word.endswith("tion") and len(word) > 6:
        return word[:-4]
    if word.endswith("ness") and len(word) > 6:
        return word[:-4]
    if word.endswith("ies") and len(word) > 5:
        return word[:-3] + "y"
    if word.endswith("es") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and len(word) > 4 and not word.endswith("ss"):
        return word[:-1]
    return word

# ─── 4. Lemmatization (rule-based toy version) ───────────────────────────────
LEMMA_MAP = {
    "running": "run", "runs": "run", "ran": "run",
    "better": "good", "best": "good",
    "children": "child", "mice": "mouse", "geese": "goose",
    "studies": "study", "studied": "study", "studying": "study",
}
def lemmatize(word):
    return LEMMA_MAP.get(word, word)

# ─── Full pipeline ────────────────────────────────────────────────────────────
def preprocess(text, use_stemming=False, use_lemmatization=False, remove_stops=True):
    tokens = tokenize(text)
    if remove_stops:
        tokens = remove_stopwords(tokens)
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    elif use_lemmatization:
        tokens = [lemmatize(t) for t in tokens]
    return tokens

# ─── Demo ─────────────────────────────────────────────────────────────────────
docs = [
    "The machine learning model is running well and improving",
    "Deep learning models run faster on GPUs",
    "Children are studying machine learning at universities",
    "The study of artificial intelligence is growing quickly",
]

print("Pipeline comparison:")
for doc in docs[:2]:
    print(f"\nOriginal:     {doc}")
    print(f"Tokenized:    {preprocess(doc, remove_stops=False)}")
    print(f"No stops:     {preprocess(doc)}")
    print(f"Stemmed:      {preprocess(doc, use_stemming=True)}")
    print(f"Lemmatized:   {preprocess(doc, use_lemmatization=True)}")

Stopwords: when NOT to remove them

The naive rule is: remove stopwords because they carry no meaning. This is wrong in several important cases. Phrase queries: 'to be or not to be' consists entirely of stopwords, so removing them leaves nothing to search for. Procedural queries: in 'how to cook pasta', 'how' and 'to' are often stopped, but 'how' is what signals that the user wants instructions. Named entities: 'who' is a stopword, but The Who is a band, and so is The The. 'It' is a pronoun, but also a Stephen King novel, and once case-folded it collides with the acronym in 'IT support'.
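The cases above can be sketched in a few lines. This is a minimal illustration, not a production policy: the short `STOPWORDS` list, the `keep` parameter, and the fall-back-to-unfiltered-tokens rule in `preprocess_query` are all assumptions made for the example.

```python
import re

# Small illustrative stopword list; a real list would be longer.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "is", "it", "how", "who"}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def strip_stopwords(tokens, keep=frozenset()):
    """Drop stopwords unless they appear in the per-query keep set."""
    return [t for t in tokens if t not in STOPWORDS or t in keep]

def preprocess_query(query):
    """Assumed policy: if removal would leave nothing, keep the original tokens."""
    tokens = tokenize(query)
    filtered = strip_stopwords(tokens)
    return filtered if filtered else tokens

# Naive removal wipes out the whole phrase query.
naive_hamlet = strip_stopwords(tokenize("To be or not to be"))
# The fallback keeps an all-stopword query searchable.
hamlet = preprocess_query("To be or not to be")
band = preprocess_query("The The")
# A keep set preserves the intent word in a procedural query.
pasta = strip_stopwords(tokenize("How to cook pasta"), keep={"how"})

print(naive_hamlet)
print(hamlet)
print(band)
print(pasta)
```

Note that the index must also retain these stopwords for the fallback to find anything, so the decision has to be made at indexing time as well as at query time.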

Production systems typically use a context-aware stopword list: a short list tuned to the corpus, with exceptions for phrases and known entities. BM25 naturally reduces the weight of high-frequency terms (via IDF), so you do not need aggressive stopword removal to fix BM25 quality. Aggressive stopword removal is more useful for reducing index size than for improving relevance.
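A quick calculation shows why IDF already does most of the work. The sketch below uses the IDF form found in Lucene-style BM25, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The corpus size and document frequencies are made-up numbers chosen only to show the scale of the effect.

```python
import math

def bm25_idf(num_docs, doc_freq):
    """BM25 IDF in the Lucene-style form: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical corpus statistics, for illustration only.
N = 1_000_000
idf_the = bm25_idf(N, 950_000)   # a term that appears in 95% of documents
idf_rare = bm25_idf(N, 120)      # a term that appears in 120 documents

print(f"idf('the')  = {idf_the:.4f}")
print(f"idf(rare)   = {idf_rare:.4f}")
print(f"ratio       = {idf_rare / idf_the:.0f}x")
```

A term present in nearly every document contributes almost nothing to the score, so leaving it in the index costs space but does little harm to ranking.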

Stemming vs lemmatization: the actual tradeoff

Stemming is fast (a handful of suffix-stripping rules), conflates forms over-aggressively ('university' and 'universal' both reduce to 'univers' under the Porter stemmer), and makes no attempt at linguistic correctness. The approach carries over to other languages, but each language needs its own rule set. Lemmatization uses a dictionary and grammatical rules to reduce words to their canonical form: 'running' → 'run', 'ran' → 'run', 'better' → 'good'. It is linguistically correct but slower (dictionary lookup, often with part-of-speech tagging), and requires language-specific resources.
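The difference shows up even with toy implementations. The suffix list and lemma dictionary below are invented for this sketch: `toy_stem` is not the Porter algorithm, and `toy_lemmatize` stands in for a real dictionary-backed lemmatizer.

```python
# Toy suffix stripper: NOT the Porter algorithm, just enough rules to show over-merging.
SUFFIXES = ("ities", "ity", "ally", "al", "ing", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary keyed by surface form (a stand-in for a real lexicon).
LEMMAS = {
    "ran": "run",
    "running": "run",
    "better": "good",
    "universities": "university",
}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["university", "universal", "universities", "running", "ran"]:
    print(f"{w:13} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The stemmer merges 'university' and 'universal' into one index term and misses the irregular form 'ran'; the dictionary keeps the two words apart and maps 'ran' to 'run', at the cost of needing an entry for every form.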

In practice: use stemming for high-throughput search over large corpora where you need low latency and are willing to accept some over-merging. Use lemmatization for precision-critical search (legal, medical) where merging unrelated words, such as 'university' and 'universal', would be a problem. For semantic search with dense embeddings, you often skip both entirely, because the embedding model is generally expected to handle morphological variants itself.

Field-specific pipeline decisions

E-commerce: preserve hyphens (model-number), do NOT remove 'for' and 'compatible with' (intent), and do NOT let the stemmer mangle acronyms and model codes (for example, reducing 'USB' to 'US'). Medical: lemmatize to canonical forms, do NOT remove negation words ('not', 'no', 'without' carry clinical meaning). Legal: preserve all stopwords, case-fold with care (proper nouns), no stemming. Code search: tokenise on camelCase and snake_case boundaries, do NOT lowercase everything (identifiers are case-sensitive in most languages, Python included).
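One way to express these decisions is a per-field configuration plus a dedicated identifier splitter for code search. Everything named here (`FIELD_CONFIG`, `analyze`, `split_identifier`, the short stopword list) is hypothetical, and disabling lowercasing for legal text is a simplification of "case-fold with care".

```python
import re

BASE_STOPWORDS = {"the", "a", "of", "is", "no", "not", "without", "for", "with"}
NEGATIONS = {"no", "not", "without"}

# Hypothetical per-field settings; names and values are illustrative.
FIELD_CONFIG = {
    "medical": {"lowercase": True, "stopwords": BASE_STOPWORDS - NEGATIONS},
    "legal": {"lowercase": False, "stopwords": set()},
    "ecommerce": {"lowercase": True, "stopwords": BASE_STOPWORDS - {"for", "with"}},
}

def analyze(text, field):
    """Tokenize with the settings for one field; hyphenated terms stay whole."""
    cfg = FIELD_CONFIG[field]
    if cfg["lowercase"]:
        text = text.lower()
    tokens = re.findall(r"\w+(?:-\w+)*", text)
    return [t for t in tokens if t.lower() not in cfg["stopwords"]]

def split_identifier(identifier):
    """Split camelCase and snake_case identifiers into parts, preserving case."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

print(analyze("No evidence of fracture", "medical"))
print(analyze("Case for USB-C cable", "ecommerce"))
print(analyze("The State of Ohio", "legal"))
print(split_identifier("parseHTTPResponse"))
print(split_identifier("max_retry_count"))
```

Keeping the settings in data rather than in code makes it easier to review each field's choices and to test them separately.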

The preprocessing pipeline is not a set-and-forget configuration. Every domain has different requirements. The rule is: understand each step well enough to know which to skip, modify, or configure per field. The defaults will be wrong for something in every corpus.

Implement a controlled experiment: run BM25 retrieval on the same 1000 queries with and without stemming, and measure Recall@10 and Precision@10. As a rough expectation, stemming tends to improve recall by around 2-5% and reduce precision by around 1-3%, though the exact numbers depend on the corpus. Then run the same experiment with aggressive stopword removal vs. removing only the 20 most frequent stopwords. You will likely find that the aggressive list hurts navigational queries disproportionately.
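The measurement side of that experiment is small. The sketch below assumes each pipeline's results are available as a dict from query ID to a ranked list of document IDs, and that relevance judgments are a dict from query ID to a set of relevant IDs. The toy runs and judgments are placeholders that only exercise the functions; they are not results from any real corpus.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k positions filled by relevant documents."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def mean_metric(metric, runs, qrels, k=10):
    """Average a metric over every query that has relevance judgments."""
    scores = [metric(runs.get(q, []), rel, k) for q, rel in qrels.items()]
    return sum(scores) / len(scores)

# Placeholder data: two queries, top-3 results per pipeline variant.
qrels = {"q1": {"d1", "d2"}, "q2": {"d7"}}
run_plain = {"q1": ["d1", "d5", "d6"], "q2": ["d8", "d9", "d3"]}
run_stemmed = {"q1": ["d1", "d2", "d6"], "q2": ["d8", "d7", "d3"]}

for name, run in [("plain", run_plain), ("stemmed", run_stemmed)]:
    r = mean_metric(recall_at_k, run, qrels, k=3)
    p = mean_metric(precision_at_k, run, qrels, k=3)
    print(f"{name:8} Recall@3={r:.3f}  Precision@3={p:.3f}")
```

To compare stopword lists, produce one run per list and pass it through the same functions; breaking the averages down by query type is what exposes the effect on navigational queries.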
