GenAI Systems Lab, AI Engineering

What Actually Happens During Pretraining

Data curation at trillion-token scale, tokenizer training, architecture decisions, the compute budget equation, and what the loss curve tells you. The upstream of everything else in LLMs.

Every LLM you have ever used is the product of a pretraining run that happened before you saw it. Fine-tuning, RLHF, prompt engineering — all of it operates on a foundation that was laid during pretraining. Understanding what actually happens during that process changes how you think about model behavior, failure modes, and limitations.

Pretraining is the process of training a neural network on a massive text corpus to predict the next token. Everything else — instruction following, reasoning, helpfulness — is learned on top of this foundation.

Stage 1: Data curation

The most important decisions in pretraining are made before a single gradient is computed. Data quality determines model quality more than architecture or scale. The major labs spend enormous effort on data pipelines — filtering, deduplication, quality scoring, and mixing.
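As a rough illustration of two of the pipeline steps named above (filtering and deduplication), here is a minimal sketch. The function names, the thresholds (`min_words`, `max_symbol_ratio`), and the sample documents are all hypothetical; production pipelines typically add near-duplicate detection and learned quality classifiers, which this sketch does not attempt.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_quality_filter(text: str, min_words: int = 5,
                          max_symbol_ratio: float = 0.3) -> bool:
    """Toy heuristic (thresholds are assumptions): drop very short documents
    and documents whose non-space characters are mostly non-alphanumeric."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def curate(docs):
    """Keep documents that pass the filter, retaining the first copy of each exact duplicate."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "$$$ ### !!! @@@ %%% ^^^",                         # mostly symbols
    "Too short.",
    "Pretraining corpora are filtered and deduplicated before training.",
]
cleaned = curate(docs)
print(len(cleaned))  # 2
```

Hashing the normalised text only catches exact duplicates; it is the cheapest first pass, not a substitute for fuzzy matching.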

Data contamination: if your benchmark test set appears in the pretraining corpus, your eval results are inflated. This is a major reason public benchmarks lose their value quickly: once published, they get scraped into web crawls such as Common Crawl, and later model versions can memorise them.
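One common way to look for contamination is to check whether benchmark items share long word n-grams with the corpus. The sketch below assumes whitespace tokenisation and an 8-gram window; both are illustrative choices, and the function names and sample strings are hypothetical.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n: int = 8) -> float:
    """Fraction of benchmark items that share at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(benchmark_items)


benchmark = [
    "What is the capital of France and when was it founded?",
    "Compute the derivative of x squared with respect to x.",
]
corpus = [
    "forum post: what is the capital of france and when was it founded? answer: Paris",
    "unrelated text about cooking pasta with tomato sauce and basil leaves",
]
rate = contamination_rate(benchmark, corpus)
print(rate)  # 0.5
```

An in-memory set only works for small samples; at corpus scale the same idea is usually implemented with hashed n-grams or Bloom filters.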

Stage 2: Tokenizer training

Before the model trains, you need a tokenizer — a mapping from raw text to integers. The tokenizer is trained separately on a sample of the pretraining corpus and then frozen. Tokenizer choice has significant downstream effects.

| Algorithm | Used by | Key property | Weakness |
| --- | --- | --- | --- |
| BPE (Byte-Pair Encoding) | GPT-2/3/4, Llama, Mistral | Iteratively merges the most frequent adjacent symbol pairs. Compact; byte-level variants handle unknown characters via byte fallback. | Word boundaries vary: the same word tokenises differently with and without a leading space. |
| WordPiece | BERT, DistilBERT | Chooses merges that maximise the likelihood of the training data. Better for morphologically rich languages. | Slower tokenisation; vocabulary tied to the training language. |
| SentencePiece (BPE/Unigram) | Llama 2, T5, PaLM | Language-agnostic: treats input as a raw character stream, whitespace included, with no language-specific pre-tokenisation. Works across scripts. | Slightly larger vocabularies; more complex implementation. |
| tiktoken (cl100k_base) | GPT-4 (GPT-4o uses the larger o200k_base encoding) | Byte-level BPE with a vocabulary of about 100K. Better multilingual coverage than the GPT-3 tokenizer. | Larger vocabulary means a larger embedding table. |
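To make the BPE merge procedure concrete, here is a toy character-level trainer. It is a sketch only: production tokenizers operate on bytes, apply pre-tokenisation rules, and are heavily optimised. The function names and the tiny word list are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, each time picking the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # word as tuple of symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenise a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the training words, yet it splits into two learned subwords. That is the practical appeal of subword vocabularies: unseen words decompose into known pieces instead of mapping to an unknown token.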

Vocabulary size is a key decision: a larger vocabulary means fewer tokens per document (cheaper training, and more text fits in a fixed context window) but a larger embedding matrix. Most frontier models use vocabularies of roughly 32K–128K tokens, and some recent ones go to 200K or more.
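The embedding-matrix side of this trade-off is simple arithmetic: the table has one row of width `d_model` per vocabulary entry. The sketch below assumes a hidden size of 4096 purely for illustration, and the `tied` flag models whether the output projection shares weights with the input embedding.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding, plus the output projection if it is untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model


# d_model=4096 is an assumed hidden size, not taken from any specific model.
for vocab_size in (32_000, 128_000):
    print(vocab_size, embedding_params(vocab_size, d_model=4096))
# 32000 131072000
# 128000 524288000
```

For a small model these hundreds of millions of parameters are a large share of the total; for a very large model they are close to a rounding error, which is one reason bigger models can afford bigger vocabularies.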

Stage 3: Architecture decisions

By 2024, the architecture decisions for frontier models had largely converged on a decoder-only transformer with a few key modifications.

The total parameter count is guided by the Chinchilla scaling laws, which give the optimal model size for a given compute budget; the number of layers, attention heads, and hidden dimension are then chosen to hit that count. The Chinchilla finding (2022) was that most models were undertrained: you get better performance by training a smaller model on more tokens rather than a larger model on fewer tokens.

Chinchilla optimal: for a given compute budget C (measured in FLOPs), with training compute approximated as C ≈ 6ND, the optimal token count is D ≈ 20N, which gives an optimal model size N ≈ (C / 120)^0.5 ≈ 0.09 × C^0.5. In practice, inference costs push labs to train smaller models on more tokens than Chinchilla optimal: training cost is paid once, while inference cost is paid on every one of millions of requests.
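As a worked example of the compute budget arithmetic, the sketch below assumes the common approximation C ≈ 6ND for training FLOPs together with the D ≈ 20N rule of thumb. The function name is hypothetical, and the sample budget is chosen only because it lands near a 70B-parameter model.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (N, D) assuming C = 6 * N * D and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = chinchilla_optimal(5.76e23)
print(f"N = {n:.2e} params, D = {d:.2e} tokens")
# N = 6.93e+10 params, D = 1.39e+12 tokens
```

Raising `tokens_per_param` above 20 models the over-training regime described above: the same budget buys a smaller model trained on more tokens, which is cheaper to serve.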

Stage 4: The training run

The training objective is simple: given a sequence of tokens, predict the next one. Cross-entropy loss over the vocabulary. But executing this at scale is a massive engineering problem.
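The objective itself fits in a few lines. The sketch below uses plain Python lists in place of tensors, and the function names and toy inputs are hypothetical. It also shows a useful sanity check: a model that predicts uniformly over a vocabulary of size V has loss ln(V), which is roughly where a freshly initialised model should start.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where the logits at position t predict token_ids[t + 1]."""
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


vocab_size = 4
token_ids = [0, 2, 1, 3]
# All-zero logits give a uniform distribution over the vocabulary.
uniform_logits = [[0.0] * vocab_size for _ in range(len(token_ids) - 1)]
loss = next_token_loss(uniform_logits, token_ids)
print(round(loss, 4))  # 1.3863, i.e. ln(4)
```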

Reading the loss curve

The training loss is a direct window into model quality. Understanding what it tells you is a practical skill.

| Pattern | What it means | Action |
| --- | --- | --- |
| Smooth monotonic decrease | Normal healthy training | Monitor for slowing; may need LR adjustment |
| Loss spike followed by recovery | Learning rate too high, or bad batch (corrupted data) | Check LR schedule; add data quality filters; gradient clipping |
| Loss plateau early | Model undertrained: not enough compute | Train longer or increase LR |
| Loss divergence (NaN/Inf) | Numerical instability: LR too high or bad data | Reduce LR, add gradient clipping, check data pipeline |
| Loss drops then flattens permanently | Data exhausted: seen all unique patterns | Add more data or accept the floor |
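Two of these patterns, spikes and divergence, are easy to flag automatically. The sketch below is one possible monitor, not a standard tool: the window size and the 1.5× spike threshold are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def classify_step(history, loss, window=50, spike_factor=1.5):
    """Label a new loss value as 'diverged', 'spike', or 'ok' given recent healthy losses."""
    if math.isnan(loss) or math.isinf(loss):
        return "diverged"
    recent = history[-window:]
    if recent:
        baseline = sum(recent) / len(recent)
        if loss > spike_factor * baseline:
            return "spike"
    return "ok"


def scan(losses, **kwargs):
    """Classify each loss in sequence, keeping spikes out of the running baseline."""
    labels, history = [], []
    for loss in losses:
        label = classify_step(history, loss, **kwargs)
        labels.append(label)
        if label == "ok":
            history.append(loss)
    return labels


labels = scan([4.0, 3.5, 3.2, 6.0, 3.1, float("nan")], window=3)
print(labels)  # ['ok', 'ok', 'ok', 'spike', 'ok', 'diverged']
```

In a real run the `spike` label would typically trigger logging of the offending batch, and `diverged` would trigger a rollback to the last good checkpoint.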

Validation loss (on held-out data) rising or stalling while training loss keeps falling signals memorisation. Some gap is expected, but a large and growing one means the model is overfitting to the training data.

What pretraining gives you — and what it doesn't

A pretrained model is a probability distribution over next tokens, conditioned on the preceding context. It has absorbed enormous amounts of world knowledge, linguistic patterns, and reasoning structures. But it has not learned to be helpful, safe, or instruction-following. That comes in post-training (SFT → RLHF/DPO).

The limitations baked in during pretraining cannot be fully fixed in post-training. A model that never saw code during pretraining cannot be made into a capable coding assistant by fine-tuning alone. A model trained on data with a 2024 cutoff knows nothing about 2025 events and will tend to hallucinate about them unless the facts are supplied in its context. Pretraining is the foundation, and everything downstream is constrained by it.

When debugging unexpected model behavior, ask: could this be a pretraining artifact? Biases, knowledge cutoff issues, and reasoning limitations are usually pretraining constraints, not fine-tuning failures. The fix often requires a different base model, not a better prompt.
