Tokenization: Why 'cat' and 'cats' Are Different to an LLM
BPE, WordPiece, SentencePiece — what tokenizers do, why it matters for prompting, and how token counts drive your inference bill.
LLMs don't read words. They read tokens. Before a single character of your prompt reaches the model, a tokenizer has already broken it into a sequence of integer IDs — and those IDs are all the model ever sees.
Understanding tokenization is not optional for anyone building with LLMs. It affects your costs, your prompting strategy, why certain languages behave differently, and why models sometimes count wrong.
[Video: Andrej Karpathy — Let's build the GPT Tokenizer (deep-dive into BPE from scratch)]
What is a token?
A token is a subword unit — roughly 3–4 characters of English text on average. Tokens are not words, letters, or sentences. They are the chunks that a tokenizer learned to split text into during training.
- Common short words are usually 1 token: "the", "is", "a", "in"
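- Longer common words often split into 2 or more pieces: "tokenization" → "token" + "ization"
- Short numbers are often 1 token ("42"); longer numbers may split into several chunks, depending on the tokenizer
- Punctuation is usually its own token; a leading space is typically attached to the word that follows (" world")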
- Unknown or rare words split into many tokens: "biostratigraphically" = 6+ tokens
1 token ≈ 4 characters in English. 1,000 tokens ≈ 750 words. Non-English text typically uses 2–5× more tokens per word. Code and JSON vary widely — Python is efficient, SQL less so.
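The rules of thumb above can be turned into a rough estimator. This is a back-of-the-envelope sketch, not a tokenizer: the function names are made up for illustration, and the constants (4 characters per token, 0.75 words per token) are just the heuristics from this section, which only hold approximately for English prose.

```python
# Rough heuristics from this section; real counts depend on the tokenizer.
CHARS_PER_TOKEN = 4       # ~4 characters of English per token
WORDS_PER_TOKEN = 0.75    # ~750 words per 1,000 tokens


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its character length."""
    if not text:
        return 0
    return max(1, round(len(text) / CHARS_PER_TOKEN))


def estimate_words(token_count: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


if __name__ == "__main__":
    prompt = "Summarize the following support ticket in two sentences."
    print(estimate_tokens(prompt), "tokens (estimated)")
    print(estimate_words(1000), "words per 1,000 tokens (estimated)")
```

Use an estimate like this for quick sizing only; for billing or context-limit decisions, count with the actual tokenizer of the model you call.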
How BPE tokenizers work
Most production LLMs use Byte Pair Encoding (BPE) or a variant. The algorithm is simple:
- Start with every individual character (or byte, in byte-level BPE) as its own token
- Find the most frequent pair of adjacent tokens in the training corpus
- Merge that pair into a new single token
- Repeat until the vocabulary reaches its target size (e.g., 100K tokens)
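The merge loop described in the steps above can be sketched in a few lines. This is a toy, character-level version for illustration only: the function names (`train_bpe`, `merge_pair`, `encode`) are hypothetical, the corpus is a handful of words, and production tokenizers add byte-level handling, pre-tokenization rules, and far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair into a new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["cat", "cat", "cat", "cats", "cats", "dog"]
    merges = train_bpe(corpus, num_merges=3)
    for word in ["cat", "cats", "dog"]:
        print(word, "->", encode(word, merges))
```

With this toy corpus, three merges are enough for "cat" and "cats" to each become a single, distinct token, while the rarer "dog" stays split into characters.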
The result is a vocabulary where common English words and subwords are single tokens, while rare combinations stay split. GPT-4's tokenizer (cl100k_base) has a vocabulary of roughly 100K tokens. Claude's vocabulary is reportedly of a similar order of magnitude.
Illustrative splits and IDs (exact pieces and values depend on the tokenizer):
"Hello, world!" → ["Hello", ",", " world", "!"] → [15496, 11, 995, 0]
"tokenization" → ["token", "ization"] → [3642, 1634]
"cats" → ["cats"] → [34111]
"cat" → ["cat"] → [9246] ← different token entirely
Why "cat" and "cats" are different to a model
To a human, "cat" and "cats" are the same word in singular and plural forms. To an LLM, they may have completely different token IDs with no structural relationship.
The model learns their relationship from co-occurrence patterns in training data, not from any built-in understanding of morphology. LLMs still generalise across singular and plural forms, but they do so because the training statistics link the two tokens, not because a grammar rule is encoded anywhere.
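A toy illustration of why the IDs carry no structure: each ID is only a row index into an embedding table, and before training those rows are unrelated random vectors. The IDs below are the illustrative ones from this article, and the table size, seeding scheme, and function names are assumptions made for the sketch.

```python
import math
import random

# Illustrative IDs from this article; real IDs depend on the tokenizer.
VOCAB = {"cat": 9246, "cats": 34111}
EMBEDDING_DIM = 8  # tiny on purpose; real models use hundreds or thousands


def init_embedding(token_id: int, dim: int = EMBEDDING_DIM) -> list:
    """Return a random vector for `token_id`, like an untrained embedding row."""
    rng = random.Random(token_id)  # seeded per ID so the demo is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(u, v) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    cat = init_embedding(VOCAB["cat"])
    cats = init_embedding(VOCAB["cats"])
    # Nothing about the integers 9246 and 34111 makes these rows similar;
    # any similarity has to be learned from data during training.
    print(f"cosine(cat, cats) before training: {cosine(cat, cats):.3f}")
```

Whatever value this prints is an accident of the random initialisation, which is the point: the link between "cat" and "cats" exists only after training has pulled the two rows together.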
It also explains some well-known quirks: models count letters poorly ("how many r's in strawberry?") because they see token IDs, not individual characters. "Strawberry" may be tokenized as ["Straw", "berry"] — two tokens — so character-level counting requires the model to reason about subword structure, which it hasn't directly practiced.
Asking a model to count characters? Add "think step by step, spelling out each character individually" to the prompt. This pushes the model to decompose each token into letters before counting, which often improves accuracy.
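The sketch below shows why spelling out helps, using the ["Straw", "berry"] split assumed earlier in this section (a real tokenizer may split the word differently). Once the pieces are expanded into individual characters, counting is trivial; the function name is hypothetical.

```python
def count_letter(pieces, letter):
    """Spell out every character of every token piece, then count `letter`."""
    spelled = [ch for piece in pieces for ch in piece]
    count = sum(1 for ch in spelled if ch.lower() == letter.lower())
    return count, spelled


if __name__ == "__main__":
    pieces = ["Straw", "berry"]  # assumed split, taken from the text above
    count, spelled = count_letter(pieces, "r")
    print(" ".join(spelled))     # S t r a w b e r r y
    print(f"r appears {count} times")  # r appears 3 times
```

The model sees only the two piece IDs, so the spelled-out line is exactly the intermediate step that the prompt asks it to write down.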
Tokens and language inequality
BPE vocabularies are trained on text corpora that are overwhelmingly English. The result: non-English text is systematically over-fragmented, so the same meaning costs more tokens.
| Language | "hello how are you" (approx tokens) |
|---|---|
| English | 4 |
| French (Bonjour comment allez-vous) | 6 |
| Hindi (नमस्ते आप कैसे हैं) | 10–15 |
| Arabic (مرحبا كيف حالك) | 8–12 |
| Japanese (こんにちは、お元気ですか) | 12–18 |
This creates real cost and latency inequality for non-English applications. It also means multilingual models effectively have a smaller "usable" context window when processing non-English text.
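One contributing factor can be checked without any tokenizer. Assuming a byte-level BPE (one that starts from UTF-8 bytes), scripts that need more bytes per character start from a longer sequence before any merges are applied. The sketch below measures bytes per character for the phrases in the table; it is a proxy for the starting point, not a token count.

```python
# Phrases copied from the table above.
PHRASES = {
    "English": "hello how are you",
    "French": "Bonjour comment allez-vous",
    "Hindi": "नमस्ते आप कैसे हैं",
    "Arabic": "مرحبا كيف حالك",
    "Japanese": "こんにちは、お元気ですか",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    for language, phrase in PHRASES.items():
        n_bytes = len(phrase.encode("utf-8"))
        ratio = utf8_bytes_per_char(phrase)
        print(f"{language:9s} chars={len(phrase):3d} bytes={n_bytes:3d} bytes/char={ratio:.2f}")
```

English and French stay at one byte per character, while the Hindi, Arabic, and Japanese phrases need two or three. Fewer learned merges for those scripts then compound the gap.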
Practical implications for builders
- Pricing is per token — verbose prompts cost more, especially with long system prompts repeated across millions of requests
- Prompt caching works at the token level: identical prefix tokens across requests can be cached, and the cached portion is typically billed at a steep discount (up to roughly 90%, depending on the provider)
- Max context windows are measured in tokens, not words: 200K tokens ≈ 150K words ≈ a long novel
- Token budget matters in RAG — each retrieved chunk consumes tokens, reducing space for history and the answer
- Special tokens exist for chat structure: <|im_start|>, [INST], <|user|> — these consume tokens too
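The RAG budget point above is simple arithmetic, sketched here. The function name and all the numbers in the example are hypothetical; plug in counts measured with your model's real tokenizer.

```python
def remaining_tokens(context_window, system_prompt, retrieved_chunks, reserved_for_answer):
    """Tokens left for conversation history after fixed costs are subtracted.

    All arguments are token counts; `retrieved_chunks` is a list with one
    count per retrieved chunk.
    """
    used = system_prompt + sum(retrieved_chunks) + reserved_for_answer
    return context_window - used


if __name__ == "__main__":
    left = remaining_tokens(
        context_window=8_000,         # hypothetical model limit
        system_prompt=1_500,
        retrieved_chunks=[800] * 5,   # five chunks of 800 tokens each
        reserved_for_answer=1_000,
    )
    print(f"Tokens left for history: {left}")  # Tokens left for history: 1500
```

A negative result means the request will not fit, so retrieve fewer or shorter chunks, or trim the history, before calling the model.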
How to check token counts
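import tiktoken  # pip install tiktoken
# Look up the tokenizer that matches the target model
enc = tiktoken.encoding_for_model("gpt-4")
text = "Your prompt text here"
tokens = enc.encode(text)  # list of integer token IDs
print(f"Token count: {len(tokens)}")
# $0.01 per 1K tokens is a placeholder rate; substitute your provider's actual price
PRICE_PER_1K = 0.01
print(f"Estimated cost at ${PRICE_PER_1K}/1K: ${len(tokens) / 1000 * PRICE_PER_1K:.4f}")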
Always estimate token counts before production. A seemingly small system prompt change — adding a few paragraphs of context — can 10× your costs on high-volume endpoints.
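To see how a prompt change scales with volume, here is a small sketch. The price, request volume, and prompt sizes are placeholder assumptions (the $0.01 per 1K rate matches the example above), and the function name is hypothetical.

```python
def monthly_prompt_cost(tokens_per_request, requests_per_day, price_per_1k, days=30):
    """Cost of sending `tokens_per_request` input tokens on every request for `days` days."""
    return tokens_per_request / 1000 * price_per_1k * requests_per_day * days


if __name__ == "__main__":
    PRICE_PER_1K = 0.01           # placeholder rate in dollars
    REQUESTS_PER_DAY = 1_000_000  # hypothetical high-volume endpoint

    before = monthly_prompt_cost(200, REQUESTS_PER_DAY, PRICE_PER_1K)
    after = monthly_prompt_cost(2_000, REQUESTS_PER_DAY, PRICE_PER_1K)
    print(f"200-token system prompt:   ${before:,.0f}/month")
    print(f"2,000-token system prompt: ${after:,.0f}/month")
    print(f"Increase: {after / before:.0f}x")
```

Under these assumptions, growing the system prompt from 200 to 2,000 tokens moves the monthly bill from about $60,000 to about $600,000 for that portion of the input alone.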
Tokenization is the single most underrated source of bugs in LLM applications. Engineers think they're debugging the model — they're actually debugging the tokenizer.
- BPE: Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016)
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo & Richardson, 2018)
- Tokenization Is More Than Compression (Schmidt et al., 2024)