Foundations & Architecture 10 min read

BPE Tokenization From Scratch: The Algorithm That Decides How LLMs See Text

Byte Pair Encoding in 40 lines of Python. Train merge rules on a corpus, watch morphological structure emerge from statistics, tokenize out-of-vocabulary words. The algorithm behind tiktoken, WordPiece, and SentencePiece — made legible.

Every token an LLM processes was produced by a tokenizer. The tokenizer determines how text maps to integers, how much of the context window a sentence occupies, and whether rare words are handled gracefully or fragmented into noise. Yet most engineers who work with LLMs have never implemented the algorithm that produces these tokens. Byte Pair Encoding is 30 lines of code. This is those 30 lines.

BPE was originally a data compression algorithm (Gage, 1994). Sennrich et al. repurposed it for NLP in 2015. The insight: start with individual characters as your vocabulary, then repeatedly merge the most frequent pair of adjacent symbols. After enough merges, you have a vocabulary of subword units that covers common words whole while handling rare words by decomposing them into frequent subwords.

The algorithm

Step 1: Split every word in the corpus into characters, with a special end-of-word symbol. 'low' becomes ['l', 'o', 'w', '</w>']. Step 2: Count every adjacent pair in the character-split corpus. Step 3: Merge the most frequent pair everywhere. Step 4: Update the vocabulary. Step 5: Repeat until the vocabulary reaches the target size. The merge rules learned during training are applied in the same order to new text at inference time.

from collections import Counter, defaultdict
import re

def get_vocab(corpus_text):
    vocab = Counter()
    for word in corpus_text.split():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] += 1
    return vocab

def get_stats(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i+1])] += freq
    return pairs

def merge_vocab(best_pair, vocab):
    new_vocab = {}
    bigram = re.escape(' '.join(best_pair))
    p = re.compile(r'(?<!S)' + bigram + r'(?!S)')
    for word_tuple, freq in vocab.items():
        word_str = ' '.join(word_tuple)
        merged = p.sub(''.join(best_pair), word_str)
        new_vocab[tuple(merged.split())] = freq
    return new_vocab

def bpe_train(corpus_text, num_merges=30):
    vocab = get_vocab(corpus_text)
    merge_rules = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merge_rules.append(best)
        if i < 5:
            print(f"  Merge {i+1}: {best[0]!r} + {best[1]!r} → {''.join(best)!r}  (freq={pairs[best]})")
    return vocab, merge_rules

def tokenize_bpe(word, merge_rules):
    tokens = list(word) + ["</w>"]
    for pair in merge_rules:
        i = 0
        new_tokens = []
        while i < len(tokens):
            if i < len(tokens)-1 and (tokens[i], tokens[i+1]) == pair:
                new_tokens.append(''.join(pair))
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

corpus = """low lower lowest new newer newest wide wider widest
            low frequency low power low latency low rank adapter
            learning machine learning deep learning representation learning"""
print("Training BPE:")
vocab, rules = bpe_train(corpus, num_merges=25)
print("
Tokenizing:")
for word in ["lowest", "lower", "newer", "learning", "unlowered"]:
    print(f"  {word:15s} → {tokenize_bpe(word, rules)}")

What you should observe

After training, 'low' becomes a single token. 'lower' becomes ['low', 'er</w>']. 'lowest' becomes ['low', 'est</w>']. The suffix 'er' and 'est' become subword units because they are frequent across many words — the morphological structure of English emerging from statistics alone, with no linguistic knowledge given to the algorithm.

'unlowered' did not appear in the corpus. The algorithm tokenizes it using the merge rules it learned: ['un', 'low', 'er', 'ed</w>']. This is BPE's key property: out-of-vocabulary words are always representable, just with more tokens. Compare to fixed vocabularies where unseen words map to a single [UNK], discarding all information.

From BPE to tiktoken and HuggingFace tokenizers

OpenAI's tiktoken is BPE at scale: GPT-4o uses ~100k tokens, trained on a corpus orders of magnitude larger. WordPiece (BERT) is similar but selects pairs by likelihood rather than frequency. SentencePiece (LLaMA, T5, Mistral) operates on raw Unicode without whitespace pre-tokenization — which matters for Chinese, Japanese, Arabic.

The vocabulary size tradeoff: large vocabulary (100k tokens) → most common words are single tokens, short sequences, low compute cost per word. Small vocabulary (32k) → more subword splits, longer sequences, higher cost — but potentially better generalisation to rare words. This architectural decision is made before a single training step and persists for the model's entire lifetime.

Train BPE with num_merges=10, 30, 100 on the same corpus. Count average tokens per word at each vocabulary size. This directly shows how vocabulary size affects context window efficiency — a core production cost variable. Then compare your tokenization to tiktoken on the same sentences.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →