
What Actually Happens During Pretraining

Data curation at trillion-token scale, tokenizer training, architecture decisions, the compute budget equation, and what the loss curve tells you. The upstream of everything else in LLMs.

Every LLM you have ever used is the product of a pretraining run that happened before you saw it. Fine-tuning, RLHF, prompt engineering — all of it operates on a foundation that was laid during pretraining. Understanding what actually happens during that process changes how you think about model behavior, failure modes, and limitations.

Pretraining is the process of training a neural network on a massive text corpus to predict the next token. Everything else — instruction following, reasoning, helpfulness — is learned on top of this foundation.

Stage 1: Data curation

The most important decisions in pretraining are made before a single gradient is computed. Data quality determines model quality more than architecture or scale. The major labs spend enormous effort on data pipelines — filtering, deduplication, quality scoring, and mixing.

Scale: frontier models train on 1–15 trillion tokens. GPT-3 used 300B tokens. Llama 3 used 15T. The trend is more data, not just bigger models.
Sources: Common Crawl (web), books, code (GitHub), Wikipedia, scientific papers, forums. The mix matters — code improves reasoning even for text tasks.
Filtering: heuristic quality filters (token count, punctuation ratio, stop word ratio), classifier-based quality scoring (trained on curated positive examples), language ID filters, safety filters removing CSAM and violent content.
Deduplication: near-duplicate removal using MinHash LSH. Without deduplication, the model memorises repeated content and generalises worse. Llama 3 reported dedup reduced dataset size by ~30%.
Mixing recipe: the ratio of web/books/code/math in training data is a hyperparameter. More code → better reasoning. More math → better quantitative tasks. Labs don't publish exact recipes.
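
The deduplication step above can be sketched in a few lines. This is a minimal pure-Python illustration of MinHash similarity estimation, not any lab's pipeline: the function names, the 3-word shingles, and the 64 hash functions are assumptions made for the example, and it omits the LSH banding step that lets real systems avoid comparing every pair of documents.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Illustrative documents (made up for this example).
doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "gradient clipping limits the norm of parameter updates during unstable training phases"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated documents
```

A pipeline built on this idea would drop one document from every pair whose estimated similarity exceeds a chosen threshold; the threshold itself is another tunable assumption.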

Data contamination: if your benchmark test set appears in the pretraining corpus, your eval results are inflated. This is why new benchmarks become useless within months — they get scraped into Common Crawl and subsequent model versions memorise them.
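
A rough way to check for this kind of leakage is n-gram overlap between benchmark items and corpus documents. The sketch below is illustrative only: whitespace tokenisation, the default n of 8, and the names `word_ngrams` and `contamination_rate` are assumptions, and a real check would normalise punctuation and run against an index rather than an in-memory set.

```python
def word_ngrams(text, n):
    """Set of lowercased word n-grams in a string (empty if the text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    flagged = [item for item in benchmark_items if word_ngrams(item, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0

# Made-up example data: the first benchmark question appears inside a scraped forum post.
corpus = ["forum post: what is the capital of france and when was it founded? asking for a quiz"]
benchmark = [
    "What is the capital of France and when was it founded?",
    "Which planet in the solar system has the most known moons?",
]
print(contamination_rate(benchmark, corpus, n=5))  # 0.5
```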

Stage 2: Tokenizer training

Before the model trains, you need a tokenizer — a mapping from raw text to integers. The tokenizer is trained separately on a sample of the pretraining corpus and then frozen. Tokenizer choice has significant downstream effects.

Algorithm	Used by	Key property	Weakness
BPE (Byte-Pair Encoding)	GPT-2/3/4, Llama, Mistral	Merges frequent byte pairs iteratively. Compact, handles unknown tokens via byte fallback.	Word boundaries vary — same word tokenises differently with/without leading space.
WordPiece	BERT, DistilBERT	Maximises likelihood of training data. Better for morphologically rich languages.	Slower tokenisation, vocabulary tied to training language.
SentencePiece (BPE/Unigram)	Llama 2, T5, PaLM	Language-agnostic: trains directly on raw Unicode text and treats whitespace as an ordinary symbol, so no language-specific pre-tokenisation is needed. Works across scripts.	Slightly larger vocabularies, more complex implementation.
tiktoken (cl100k_base)	GPT-3.5, GPT-4	Byte-level BPE with a ~100K vocabulary. Better multilingual coverage than the GPT-3 tokenizer. GPT-4o moved to the larger o200k_base encoding.	Larger vocabulary means larger embedding table.
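
To make the BPE row concrete, here is a toy trainer. It is a simplified sketch: it works on characters rather than bytes, has no pre-tokenisation or byte fallback, and breaks ties between equally frequent pairs arbitrarily (by sort order). The function names and the tiny word list are assumptions for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in a symbol sequence with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges by repeatedly merging the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # word as a tuple of symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Tokenise a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=2)
print(merges)                    # [('s', 't'), ('e', 'st')]
print(encode("newest", merges))  # ['n', 'e', 'w', 'est']
```

The merge list is the tokenizer: once training stops it is frozen, and every later document is encoded by replaying the same merges in the same order.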

Vocabulary size is a key decision: larger vocabularies mean fewer tokens per document (cheaper training, longer context in tokens) but larger embedding matrices. Most frontier models use 32K–128K vocabulary.
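
The embedding cost of a larger vocabulary is simple arithmetic. The sketch below assumes a hypothetical hidden dimension of 4096 and treats weight tying between the input embedding and the output projection as an option; the function name is made up for the example.

```python
def embedding_params(vocab_size, hidden_dim, tied=True):
    """Parameters in the token embedding table, doubled if the output projection is untied."""
    table = vocab_size * hidden_dim
    return table if tied else 2 * table

for vocab_size in (32_000, 128_000):
    print(vocab_size, embedding_params(vocab_size, hidden_dim=4096))
# 32000 131072000
# 128000 524288000
```

At this width, going from 32K to 128K entries adds roughly 0.4B parameters to a tied embedding table, which is a large share of a small model and a small share of a very large one.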

Stage 3: Architecture decisions

By 2024, architecture decisions for frontier models had largely converged on a decoder-only transformer with a few key modifications:

Rotary Positional Embeddings (RoPE) — replaces absolute positional embeddings. Enables better length generalisation and context extension.
Grouped Query Attention (GQA) — reduces KV cache memory by sharing key/value heads across query heads. Used in Llama 3, Mistral, Gemma.
SwiGLU activation — replaces ReLU in the FFN. Empirically better performance, slightly more parameters.
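RMSNorm — simpler than LayerNorm: it rescales by the root mean square without subtracting the mean, with comparable quality at lower cost. Pre-norm (normalising the input to each attention and FFN sub-layer) is now standard.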
No bias terms — most modern LLMs remove bias from linear layers. Small speedup, no quality loss.
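
The memory saving from GQA in the list above can be estimated directly. This is back-of-the-envelope arithmetic under stated assumptions: the configuration (32 layers, head dimension 128, 8,192-token context, 2 bytes per value for BF16) is illustrative rather than taken from any published model card, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Size of the KV cache: one key and one value vector per layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 2 ** 30
# Multi-head attention: every one of 32 query heads has its own key/value head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped query attention: 32 query heads share 8 key/value heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / GIB)  # 4.0
print(gqa / GIB)  # 1.0
```

The cache shrinks in proportion to the number of key/value heads, so sharing each key/value head across four query heads cuts per-sequence cache memory by 4×.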

Total parameter count and training-token count are guided by the Chinchilla scaling laws, which give the optimal model size for a given compute budget; the split into layers, attention heads, and hidden dimension is then chosen to hit that parameter count. The Chinchilla finding (2022) was that most models were undertrained: for a fixed compute budget, you get better performance by training a smaller model on more tokens than a larger model on fewer tokens.

Chinchilla optimal: training compute is approximately C ≈ 6ND FLOPs for N parameters and D tokens, and the optimal ratio is D ≈ 20N, so C ≈ 120N² and N ≈ (C/120)^0.5 ≈ 0.09 × C^0.5. In practice, inference costs push labs to train smaller models on more tokens than Chinchilla optimal: training cost is paid once, but inference cost is paid on every one of millions of requests.
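
As a worked example, the sketch below sizes a model from a compute budget. It relies on two rules of thumb that should be read as assumptions: training compute of about 6 FLOPs per parameter per token (C ≈ 6ND), and the 20-tokens-per-parameter ratio quoted above. The function name and the example budget are illustrative.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for parameters N and tokens D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget of 5.76e23 FLOPs.
n_params, n_tokens = chinchilla_optimal(5.76e23)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
# about 6.9e10 parameters and 1.4e12 tokens
```

A budget of this size lands near a 70B-parameter model trained on roughly 1.4T tokens. Raising `tokens_per_param` models the inference-driven choice to train a smaller model for longer.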

Stage 4: The training run

The training objective is simple: given a sequence of tokens, predict the next one. Cross-entropy loss over the vocabulary. But executing this at scale is a massive engineering problem.
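
The objective itself fits in a few lines. This is a minimal pure-Python sketch of next-token cross-entropy using a numerically stable log-sum-exp; real training code computes the same quantity on batched tensors. The function names and the tiny 4-token vocabulary are assumptions for the example.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy at one position: -log softmax(logits)[target_id]."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]

def sequence_loss(logits_per_position, token_ids):
    """Mean loss over a sequence; the logits at position t are scored against token t + 1."""
    losses = [next_token_loss(logits, target)
              for logits, target in zip(logits_per_position, token_ids[1:])]
    return sum(losses) / len(losses)

# A model that knows nothing assigns equal logits to all 4 tokens: loss = ln(4).
uniform = sequence_loss([[0.0] * 4, [0.0] * 4], token_ids=[0, 1, 2])
print(uniform, math.log(4))

# A model that is confident in the correct token drives the loss toward 0.
confident = next_token_loss([10.0, 0.0, 0.0, 0.0], target_id=0)
print(confident)
```

The uniform case is a useful reference point: a freshly initialised model should start near ln(vocabulary size), and everything the loss curve shows afterwards is improvement on that baseline.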

Hardware: frontier runs use thousands of GPUs/TPUs. Llama 3 405B used ~16,000 H100s for ~77 days.
Parallelism: tensor parallelism (split each layer's weight matrices across GPUs), pipeline parallelism (assign consecutive groups of layers to different GPUs), data parallelism (replicate model, split data). 3D parallelism combines all three.
Precision: BF16 for forward/backward pass, FP32 for master weights and optimizer state (AdamW). Mixed precision roughly halves activation memory and speeds up matrix multiplies, with minimal quality loss.
Learning rate schedule: warmup for ~1% of training steps, then cosine decay to 10% of peak LR. Peak LR falls as models grow; typical values are around 3e-4 for 7B and 1e-4 to 1.5e-4 for 70B.
Batch size: large batches (4M–16M tokens per step) for training stability. Gradient accumulation to fit large batches in GPU memory.
Checkpointing: save every few thousand steps. Resume from checkpoint on hardware failures (common at this scale).
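
Two of the settings above, the learning rate schedule and gradient accumulation, can be sketched as follows. The linear warmup shape, the function names, and the example numbers (4,096-token sequences, 64 data-parallel ranks) are assumptions for illustration; the 1% warmup and the 10% floor come from the list above.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine

def grad_accum_steps(global_batch_tokens, seq_len, micro_batch_size, data_parallel_ranks):
    """Micro-steps to accumulate before each optimizer step to reach the global batch size."""
    tokens_per_micro_step = seq_len * micro_batch_size * data_parallel_ranks
    return math.ceil(global_batch_tokens / tokens_per_micro_step)

total = 100_000
for step in (0, 999, 50_000, 100_000):
    print(step, lr_at_step(step, total))

# A 4M-token global batch with one 4,096-token sequence per GPU on 64 data-parallel ranks.
print(grad_accum_steps(4 * 2 ** 20, seq_len=4096, micro_batch_size=1, data_parallel_ranks=64))  # 16
```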

Reading the loss curve

The training loss is a direct window into model quality. Understanding what it tells you is a practical skill.

Pattern	What it means	Action
Smooth monotonic decrease	Normal healthy training	Monitor for slowing — may need LR adjustment
Loss spike followed by recovery	Learning rate too high, or bad batch (corrupted data)	Check LR schedule; add data quality filters; gradient clipping
Loss plateau early	Optimisation has stalled: LR too low, or too little model capacity or compute for the data	Increase LR, train longer, or scale up the model
Loss divergence (NaN/Inf)	Numerical instability — LR too high or bad data	Reduce LR, add gradient clipping, check data pipeline
Loss drops then flattens permanently	Data exhausted — seen all unique patterns	Add more data or accept the floor

Validation loss (on held-out data) diverging from training loss signals memorisation. A small gap is expected, but a large or growing gap means the model is overfitting to the training data.
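
Some of these patterns can be flagged automatically from logged losses. The sketch below is a simple heuristic, not a standard tool: the 50-step trailing window, the 1.5× spike threshold, and the function names are all assumptions that would need tuning for a real run.

```python
import math

def find_loss_spikes(losses, window=50, threshold=1.5):
    """Return step indices where the loss is NaN or exceeds threshold x the trailing-window mean."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if math.isnan(losses[i]) or losses[i] > threshold * baseline:
            spikes.append(i)
    return spikes

def train_val_gap(train_loss, val_loss):
    """Validation minus training loss; a large or growing value suggests overfitting."""
    return val_loss - train_loss

# Synthetic flat loss curve with one injected spike at step 60.
losses = [2.0] * 100
losses[60] = 5.0
print(find_loss_spikes(losses))  # [60]
print(train_val_gap(train_loss=2.0, val_loss=2.3))
```

A trailing mean keeps the detector relative to the current loss level, so the same threshold works early in training, when the loss is high, and late, when it is low.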

What pretraining gives you — and what it doesn't

A pretrained model is a probability distribution over next tokens, conditioned on the preceding context. It has absorbed enormous amounts of world knowledge, linguistic patterns, and reasoning structures. But it has not learned to be helpful, safe, or instruction-following. That comes in post-training (SFT → RLHF/DPO).

The limitations baked in during pretraining cannot be fully fixed in post-training. A model that never saw code during pretraining cannot be made into a coding assistant by fine-tuning alone. A model trained on data with a 2024 cutoff will hallucinate about 2025 events regardless of how you prompt it. Pretraining is the foundation — everything downstream is constrained by it.

When debugging unexpected model behavior, ask: could this be a pretraining artifact? Biases, knowledge cutoff issues, reasoning limitations — these are usually pretraining constraints, not fine-tuning failures. The fix requires a different base model, not a better prompt.
