Foundations & Architecture 10 min read

From N-Grams to Neural Language Models: The Arc Before LLMs

Bigrams, trigrams, smoothing, and why counting failed. Bengio's 2003 neural language model was the first proof that distributed word representations could outperform any count-based system. This is the history that makes transformers legible — not as a sudden invention but as the latest step in a thirty-year trajectory.

**Start here — no prerequisites needed.** After this post you'll understand why counting word sequences works for language, why it fails at scale, and what problem word vectors solved. This is the foundation all of modern NLP builds on.

Before transformers, before neural networks, language models were built from counting. The n-gram era lasted thirty years and built every autocomplete, every spell-checker, every early speech recogniser. Understanding it is not nostalgia — it reveals exactly what problem neural language models solved, and why that solution was so disruptive.

An n-gram language model estimates the probability of the next word given the previous n-1 words. For n=2 (bigram), you ask: given the word 'machine', what word comes next? For n=3 (trigram), you ask: given 'learning' followed by 'is', what word follows? You answer by counting: look through a training corpus, find every time the history appeared, count how often each continuation appeared, divide. That ratio is your probability estimate.

The chain rule and the Markov assumption

Formally, the probability of a sentence w1, w2, ..., wn is P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|w1..wn-1). This is exact — it follows from the chain rule of probability. The problem is that conditioning on the full history P(wn|w1..wn-1) requires having seen that exact sequence in training. With even a 10-word sentence, the history is essentially unique. You will never have enough data.

The n-gram approximation truncates this: instead of conditioning on all previous words, condition only on the last n-1. This is the Markov assumption — history beyond a fixed window does not matter. It is wrong in principle (the sentence 'The bank by the river...' needs context from further back to know whether 'bank' means finance or geography), but it worked surprisingly well in practice because most local transitions are highly predictable.

Maximum likelihood estimation and the sparsity problem

from collections import defaultdict, Counter
import math

def train_bigram(corpus_text):
    words = corpus_text.lower().split()
    unigram = Counter(words)
    bigram = defaultdict(Counter)
    
    for w1, w2 in zip(words[:-1], words[1:]):
        bigram[w1][w2] += 1
    
    # Maximum likelihood: P(w2 | w1) = count(w1, w2) / count(w1)
    probs = {}
    for w1, continuations in bigram.items():
        total = sum(continuations.values())
        probs[w1] = {w2: count / total for w2, count in continuations.items()}
    
    return probs, unigram

def score_sentence(sentence, probs, unigram, vocab_size, alpha=0.01):
    words = sentence.lower().split()
    log_prob = 0.0
    for w1, w2 in zip(words[:-1], words[1:]):
        # Laplace (add-alpha) smoothing to handle unseen bigrams
        count_w1w2 = probs.get(w1, {}).get(w2, 0) * unigram.get(w1, 0)
        count_w1 = unigram.get(w1, 0)
        p = (count_w1w2 + alpha) / (count_w1 + alpha * vocab_size)
        log_prob += math.log(p)
    return log_prob

corpus = """the dog saw the cat the cat ran the dog chased the cat the dog barked"""
probs, unigram = train_bigram(corpus)
vocab_size = len(unigram)

sentences = [
    "the dog chased the cat",   # likely — seen pattern
    "the cat chased the dog",   # plausible — words seen, bigram less common
    "cat the dog barked ran",   # unlikely — weird order
]
for s in sentences:
    score = score_sentence(s, probs, unigram, vocab_size)
    print(f"  {score:.2f}  |  {s}")

Maximum likelihood means P(w2|w1) = count(w1,w2) / count(w1). With a small corpus this immediately breaks: any bigram you did not see in training gets probability zero. Assign zero to any word in a sentence and the whole sentence has probability zero. You cannot compare sentences.

Smoothing is the fix: add a small constant (alpha) to every count, spreading probability mass to unseen events. Laplace smoothing (alpha=1) is simple but too aggressive. Kneser-Ney smoothing — which estimates how likely a word is to appear in novel contexts rather than how frequently it appears overall — became the standard because it respects the actual structure of the problem. 'Francisco' is common in 'San Francisco' but nearly never starts a new context; Kneser-Ney captures this. Standard n-gram systems used Kneser-Ney through most of the 2000s.

Why n-grams failed — and the exact problem neural language models solved

Two fundamental limits. First, the curse of dimensionality: with a vocabulary of 100k words, there are 10^10 possible bigrams. You will see a tiny fraction. No amount of smoothing fixes the fact that your estimates for unseen bigrams are garbage. Second, no generalisation: the bigram model that learns from 'the cat sat' knows nothing about 'the dog sat'. The word 'cat' and the word 'dog' are completely unrelated tokens with independent probability tables. There is no notion of similarity.

In 2003, Yoshua Bengio published 'A Neural Probabilistic Language Model'. The key insight: instead of a discrete lookup table, represent words as continuous vectors in a learned space. Words that appear in similar contexts (cat, dog, bird) get pushed to similar regions. The probability model is a neural network that takes these continuous representations as input. Now, statistics learned from 'the cat ate' transfer to 'the dog ate' because cat and dog land near each other in embedding space. This is the jump that made modern NLP possible.

Every architecture since — word2vec, GloVe, ELMo, BERT, GPT — is a variation on this idea. Bengio's contribution was to show that distributed representations and prediction could be learned jointly, and that this jointly learned system dramatically outperformed the best count-based models.

What this means for how you read modern LLMs

A GPT model is still, at its core, a language model: predict the next token given the previous tokens. The objective is the same as the bigram model above. What changed is the model that computes those probabilities. Instead of a lookup table + smoothing, it is a transformer with billions of parameters that learned to compress the distributional statistics of the training corpus into its weights. Perplexity — the standard LLM quality metric — is directly derived from n-gram model evaluation: lower perplexity means the model is less surprised by the test data, and thus has learned a better probability distribution. When an LLM paper reports perplexity, it is reporting on the same fundamental question the bigram model asked: how well does the model predict the next word?

Implement: run the bigram code above on a real text (Project Gutenberg novels, Wikipedia dumps). Observe how perplexity changes as you increase n from 2 to 3 to 4. Watch Kneser-Ney smoothing outperform Laplace on unseen test data. This gives you intuition for why larger context windows help, and why you always need some form of regularisation.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →