Query Understanding: Intent, Spelling Correction, and Expansion

The transformations between raw user text and a structured retrieval plan: intent classification (navigational/transactional/informational), edit-distance spelling correction, synonym expansion, and LLM-based query rewriting. When each step helps and when it hurts.

The query is what the user typed; the intent is what they actually want, and the two often differ. 'Apple' could mean the company, the fruit, or the Beatles' record label, depending on context. 'Python tutorial' could come from a beginner or from an advanced user. The query understanding pipeline is the set of transformations between raw text and a structured retrieval plan: intent classification, spelling correction, query expansion, and rewriting. Getting this right often has more impact than tuning the retrieval algorithm.

Spelling correction

Edit-distance correction finds the dictionary word with the minimum edit distance to the query term. It is fast and covers most typos, but it is context-free: it ignores the surrounding words, so it cannot tell whether 'form' should have been 'from' when both are valid words. Statistical correction uses a language model or an n-gram model to pick the most probable correction given the query context; it is context-aware but slower. A common production heuristic: auto-correct only when confidence is above a threshold; otherwise show 'Did you mean: ...' with the option to keep the original query.

import re

# ─── 1. Query classification (intent) ────────────────────────────────────────
# In production: use a fine-tuned classifier or embedding similarity to templates
# \b word boundaries prevent substring hits such as "how" in "show" or "cost" in "costume"
NAVIGATIONAL_PATTERNS = [r'\b(login|homepage|official site|contact|careers|support)\b']
TRANSACTIONAL_PATTERNS = [r'\b(buy|price|cost|discount|order|shipping|checkout)\b']
INFORMATIONAL_PATTERNS = [r'\b(what|how|why|explain|define|tutorial|example|guide)\b']

def classify_intent(query):
    q = query.lower()
    if any(re.search(p, q) for p in NAVIGATIONAL_PATTERNS):
        return "navigational"
    if any(re.search(p, q) for p in TRANSACTIONAL_PATTERNS):
        return "transactional"
    if any(re.search(p, q) for p in INFORMATIONAL_PATTERNS):
        return "informational"
    return "unclassified"

# ─── 2. Spelling correction (edit distance) ───────────────────────────────────
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0]*(n+1) for _ in range(m+1)]
    for i in range(m+1): dp[i][0] = i
    for j in range(n+1): dp[0][j] = j
    for i in range(1, m+1):
        for j in range(1, n+1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[m][n]

VOCABULARY = {"machine", "learning", "deep", "neural", "network", "transformer",
              "retrieval", "embedding", "attention", "tokenization", "language",
              "model", "training", "inference", "gradient", "python", "pytorch"}

def correct_query(query, max_edit=2):
    corrected = []
    for word in query.lower().split():
        if word in VOCABULARY:
            corrected.append(word)
            continue
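        # Cheap length filter first: words whose lengths differ by more than max_edit cannot be within max_edit edits
        candidates = [(w, edit_distance(word, w)) for w in VOCABULARY if abs(len(w)-len(word)) <= max_edit]
        candidates = [(w, d) for w, d in candidates if d <= max_edit]
        if candidates:
            # Break distance ties alphabetically: VOCABULARY is a set, so its iteration order is not stable across runs
            best = min(candidates, key=lambda x: (x[1], x[0]))
            corrected.append(best[0])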
        else:
            corrected.append(word)   # unknown, keep as is
    return " ".join(corrected)

# ─── 3. Query expansion (synonym and related terms) ───────────────────────────
SYNONYMS = {
    "fast":      ["quick", "rapid", "efficient", "low-latency"],
    "retrieve":  ["fetch", "get", "search", "query"],
    "embedding": ["vector", "representation", "encoding"],
    "llm":       ["large language model", "language model", "GPT", "Claude"],
}

def expand_query(query, max_expansions=2):
    words = query.lower().split()
    expansions = []
    for word in words:
        if word in SYNONYMS:
            expansions.extend(SYNONYMS[word][:max_expansions])
    if expansions:
        return query + " " + " ".join(expansions)
    return query

# ── Pipeline ─────────────────────────────────────────────────────────────────
test_queries = [
    "buy pytorch tutoral for machien learinng",
    "how to train a transformar model",
    "what is the diffrence between embedding and vector",
    "fast llm retrieval python",
]

for q in test_queries:
    corrected = correct_query(q)
    intent    = classify_intent(corrected)   # classify the corrected text so misspelled trigger words still match
    expanded  = expand_query(corrected)
    print(f"Original:  {q}")
    print(f"Intent:    {intent}")
    print(f"Corrected: {corrected}")
    print(f"Expanded:  {expanded[:100]}")
    print()
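
The threshold heuristic from the start of this section (auto-correct only when confident, otherwise ask) can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary, the function name `decide_correction`, both threshold values, and the use of 1 minus normalized edit distance as a confidence score are all invented for the example. A real system would more likely derive confidence from a language model or from query logs.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Illustrative vocabulary (assumption); a real one would come from the index or query logs.
VOCAB = ["learning", "machine", "retrieval", "transformer"]

def decide_correction(word, auto_threshold=0.85, suggest_threshold=0.6):
    """Return (action, term); action is 'keep', 'auto_correct' or 'did_you_mean'."""
    if word in VOCAB:
        return ("keep", word)
    # Crude confidence proxy (assumption): 1 - edit distance / length of the longer word.
    scored = [(1 - edit_distance(word, v) / max(len(word), len(v)), v) for v in VOCAB]
    confidence, best = max(scored)
    if confidence >= auto_threshold:
        return ("auto_correct", best)      # silently rewrite the term
    if confidence >= suggest_threshold:
        return ("did_you_mean", best)      # search the original, offer the suggestion
    return ("keep", word)                  # nothing close enough, leave the term alone

for w in ["transformar", "machien", "zzz"]:
    print(w, "->", decide_correction(w))
```

Note that plain Levenshtein distance counts a transposition such as 'machien' as two edits, which is why that term falls into the 'Did you mean' band here rather than being auto-corrected.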

Query rewriting: the LLM approach

Modern retrieval systems use LLMs to rewrite a query into multiple variants: decomposing multi-hop queries into sub-queries, expanding abbreviations, and adding explicit context. HyDE (Hypothetical Document Embeddings) goes further: an LLM generates a hypothetical document that would answer the query, and the embedding of that document, rather than of the query itself, is used for retrieval. The hypothesis is that the hypothetical document's embedding lies closer to relevant documents than the embedding of the short original query does.
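
The HyDE flow can be sketched in a few lines. Everything concrete here is an assumption made for illustration: `generate_hypothetical_doc` is a hard-coded placeholder for an LLM call, `embed` uses toy bag-of-words counts in place of a dense embedding model, and the two documents are invented. The point is only the shape of the pipeline: generate, embed the generated text, retrieve with that embedding.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a dense embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def generate_hypothetical_doc(query):
    # Placeholder for an LLM call (assumption): a real system would prompt a
    # model to write a passage that answers `query`.
    return ("gradient clipping rescales gradients whose norm exceeds a threshold "
            "to prevent exploding gradients during training")

# Invented two-document corpus, for illustration only.
DOCS = [
    "clipping rescales gradients when the norm exceeds a threshold during training",
    "how to clip an image in a photo editor",
]

def retrieve(text, docs):
    # Return the document whose vector is closest to the vector of `text`.
    q = embed(text)
    return max(docs, key=lambda d: cosine(q, embed(d)))

query = "why clip gradients"
print("direct:", retrieve(query, DOCS))
print("HyDE:  ", retrieve(generate_hypothetical_doc(query), DOCS))
```

In this contrived corpus the short query shares only one token with each document, while the hypothetical document shares most of its vocabulary with the relevant one. Real gains depend on the quality of the generated text and of the embedding model, and each HyDE query pays for an extra LLM call.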

When query expansion hurts

Aggressive synonym expansion reduces precision. Expanding 'bank' to both 'river bank' and 'financial institution' retrieves documents for both senses, which returns irrelevant results whenever the user meant only one of them. A common production heuristic: expand only when the original query retrieves few results (the low-recall case), and only with high-confidence synonyms (for example, word-embedding similarity > 0.85). Do not expand navigational queries at all, because the user wants a specific destination.
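
A gated version of expansion that follows these three rules might look like the sketch below. The synonym table, its similarity scores, the function name `maybe_expand`, and the `min_results` cutoff of 10 are hypothetical; only the 0.85 similarity threshold comes from the heuristic above.

```python
# Hypothetical synonym table: each entry pairs a synonym with a similarity
# score (for example, similarity between word embeddings). Values are invented.
SYNONYM_SCORES = {
    "fast": [("quick", 0.91), ("rapid", 0.88), ("efficient", 0.72)],
    "bank": [("river bank", 0.55), ("financial institution", 0.60)],
}

def maybe_expand(query, intent, num_results, min_results=10, min_similarity=0.85):
    """Expand `query` only when all gating conditions hold."""
    # Navigational queries target one destination, so never expand them.
    if intent == "navigational":
        return query
    # Expand only in the low-recall case: the original query found too little.
    if num_results >= min_results:
        return query
    # Keep only high-confidence synonyms.
    extra = [syn
             for word in query.lower().split()
             for syn, score in SYNONYM_SCORES.get(word, [])
             if score > min_similarity]
    return " ".join([query] + extra)

print(maybe_expand("fast retrieval", "informational", num_results=3))
print(maybe_expand("fast retrieval", "informational", num_results=50))
print(maybe_expand("bank login", "navigational", num_results=0))
```

With this table, the ambiguous 'bank' senses score below the threshold and are never appended, which is the behavior the precision argument calls for.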

A/B test each component of your query understanding pipeline separately, adding them one at a time. Start with spelling correction alone and measure precision@5. Then add intent classification and check whether navigational queries, where a single result should be returned, improve. Add query expansion last and measure recall@10. Each component should improve its target metric without degrading the others. This incremental approach is common practice in search A/B testing.
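
For reference, the two metrics named above can be computed as in this short sketch. The document IDs, relevance judgments, and the two rankings are made-up illustrative data, not measurements.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical judgments and rankings for a single query (illustrative only).
relevant = {"d1", "d3", "d5"}
control = ["d1", "d7", "d8", "d9", "d4", "d3", "d2", "d6", "d0", "d10"]
variant = ["d1", "d3", "d8", "d9", "d4", "d5", "d2", "d6", "d0", "d10"]

for name, ranking in [("control", control), ("variant", variant)]:
    p5 = precision_at_k(ranking, relevant, 5)
    r10 = recall_at_k(ranking, relevant, 10)
    print(f"{name}: precision@5={p5:.2f} recall@10={r10:.2f}")
```

In an actual experiment these values would be averaged over many queries per arm and compared with a significance test before shipping a component.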
