Pre-training Data Decisions: Deduplication, Quality Filtering, Domain Mixing, and Chinchilla
MinHash LSH near-dedup vs. semantic dedup. The perplexity filter paradox. Domain mixing ratios by capability. Chinchilla scaling laws: ~20 tokens per parameter is the compute-optimal ratio. Why modern practice trains smaller models longer.
**Readable after any earlier step.** After this post you'll understand that model capability is decided before a single weight is trained — what goes into the corpus, why data quality beats data quantity, and how deduplication and filtering shape what the model knows.
The pre-training data pipeline is where much of a large language model's capability is determined, before architecture and before scale. Research Engineers who understand these decisions can explain why two models with similar parameter counts can perform very differently on the same benchmark.
Deduplication
Duplicate data inflates the nominal token count without adding information, encourages memorization, and hurts generalization. Near-deduplication (not just exact matching) is essential: C4, The Pile, and RedPajama all contain substantial near-duplicate content that was only later identified.
- Exact dedup: hash-based (MD5, SHA). Fast, catches only byte-identical documents.
- Near-dedup: MinHash LSH, which approximates Jaccard similarity over shingles. Catches lightly edited copies, reformatted content, and scraped mirrors. Because it measures lexical overlap, it misses true paraphrases that share few exact phrases.
- Semantic dedup: embedding similarity. Expensive, catches conceptual duplicates (including paraphrases), and is mostly used at smaller scales.
- Train-test contamination: deduplicate your training set against benchmark test sets. Benchmark performance is meaningless otherwise.
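The train-test contamination check in the list above can be sketched with plain word n-gram overlap. This is a minimal illustration, not a production decontamination pipeline: the function names and the window size `n=8` are assumptions made for this example, and real pipelines typically tokenize and normalize text more carefully.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(test_examples, n=8):
    """Collect every n-gram that appears in any benchmark test example."""
    index = set()
    for example in test_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(doc, benchmark_index, n=8):
    """A document is flagged if it shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(benchmark_index)

def decontaminate(train_docs, test_examples, n=8):
    """Drop training documents that overlap with benchmark test examples."""
    index = build_benchmark_index(test_examples, n)
    return [d for d in train_docs if not is_contaminated(d, index, n)]

# Hypothetical data for illustration only.
test_set = ["What is the capital city of France and when was it founded"]
train_set = [
    "Trivia dump: what is the capital city of France and when was it founded ? Answer: Paris",
    "The weather in the mountains changes quickly during late autumn and early winter months",
]
clean = decontaminate(train_set, test_set)
```

Here `clean` keeps only the weather sentence, because the trivia document contains an 8-word window copied from the test question. A smaller `n` flags more documents (including false positives on common phrases); a larger `n` is stricter about what counts as leakage.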
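# MinHash for near-dedup (simplified)
from datasketch import MinHash, MinHashLSH

def shingle(text, k=5):
    # Word-level k-shingles. A document shorter than k words becomes a single
    # shingle, so short documents do not all collapse to the same empty signature.
    tokens = text.split()
    if len(tokens) < k:
        return {' '.join(tokens)} if tokens else set()
    return {' '.join(tokens[i:i+k]) for i in range(len(tokens) - k + 1)}

def minhash_doc(text, num_perm=128):
    # Build a MinHash signature from the document's shingles
    m = MinHash(num_perm=num_perm)
    for s in shingle(text):
        m.update(s.encode('utf8'))
    return m

# Add to LSH index, query for near-duplicates
# Jaccard threshold ~0.8 catches most near-dups
lsh = MinHashLSH(threshold=0.8, num_perm=128)
# lsh.insert(doc_id, minhash_doc(text))      # index a document under a unique key
# lsh.query(minhash_doc(other_text))         # returns keys of candidate near-duplicates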
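To get a feel for what a Jaccard threshold means, it can help to compute the exact Jaccard similarity that MinHash estimates. The sketch below uses only the standard library; the helper names and the synthetic 20-word document are assumptions made for this illustration.

```python
def word_shingles(text, k=5):
    """Set of word-level k-shingles (short documents become one shingle)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A synthetic 20-word document with all-distinct words.
words = [f"w{i}" for i in range(20)]
original = " ".join(words)

# Same document with only the last word changed.
tail_edit = " ".join(words[:-1] + ["x"])

# Same document with one word changed in the middle.
middle_edit = " ".join(words[:10] + ["x"] + words[11:])

sim_tail = jaccard(word_shingles(original), word_shingles(tail_edit))      # 15/17 ≈ 0.88
sim_middle = jaccard(word_shingles(original), word_shingles(middle_edit))  # 11/21 ≈ 0.52
```

With 5-word shingles, a single edit in the middle of a short document touches five shingles and can push the pair below a 0.8 threshold, while the same edit at the end touches only one. Shingle size and threshold therefore need to be tuned together, and very short documents are the hardest case.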
Quality Filtering
Raw web crawl data is mostly low-quality. Quality filtering removes content that will hurt model quality: spam, SEO content, boilerplate, incoherent text, adult content.
- Heuristic filters: minimum token count, maximum repetition ratio, language ID (fastText), and a perplexity filter (low perplexity under a reference LM is treated as a proxy for higher quality).
- Classifier-based: train a binary classifier on curated high-quality text vs. random web content. GPT-2's WebText used Reddit karma on outbound links as a quality proxy, and GPT-3 then trained a classifier that used WebText as its high-quality reference when filtering Common Crawl.
- The C4 pipeline relied on a handful of simple rules, including keeping only lines that end in terminal punctuation, deduplicating three-sentence spans, and removing pages that contain 'lorem ipsum'.
- Over-filtering is a real risk: you can filter out minority languages, technical content, and rare domains.
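The heuristic filters above can be sketched in a few lines of standard-library Python. The thresholds (`min_tokens=20`, `max_repetition=0.2`) and function names are illustrative assumptions, not values taken from any published pipeline; the terminal-punctuation and 'lorem ipsum' rules follow the C4-style rules described in the list.

```python
from collections import Counter

TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_lines(text):
    """Keep only lines that end in terminal punctuation."""
    return [ln.strip() for ln in text.splitlines()
            if ln.strip().endswith(TERMINAL_PUNCT)]

def repetition_ratio(tokens):
    """Fraction of all tokens taken up by the single most common token."""
    if not tokens:
        return 1.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

def passes_heuristics(text, min_tokens=20, max_repetition=0.2):
    """Return True if the document survives every heuristic filter."""
    if "lorem ipsum" in text.lower():
        return False
    tokens = " ".join(clean_lines(text)).split()
    if len(tokens) < min_tokens:
        return False
    return repetition_ratio(tokens) <= max_repetition

# Hypothetical documents for illustration.
good = (
    "The optimizer updates each weight after every batch of training examples.\n"
    "Learning rate warmup helps stabilize the first few thousand steps of a run.\n"
    "Checkpoints are saved every hour so that a crashed job can resume quickly."
)
spam = "buy now " * 30 + "."
nav_menu = "Home\nAbout\nContact\nLogin"
placeholder = good + "\nLorem ipsum dolor sit amet."

results = {name: passes_heuristics(doc) for name, doc in
           [("good", good), ("spam", spam), ("nav_menu", nav_menu),
            ("placeholder", placeholder)]}
```

Only `good` passes. Note that the same rules would also reject a code file or a table-heavy technical page, since neither ends its lines in sentence punctuation, which is one concrete way over-filtering removes technical content.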
The perplexity filter paradox: filtering for low-perplexity under a small reference LM biases toward text that looks like what the reference LM was trained on. You can accidentally filter out high-quality content in domains underrepresented in the reference model.
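A toy version of this effect can be reproduced with a unigram model standing in for the reference LM. This is only a sketch under strong simplifying assumptions: the add-one-smoothed unigram model, the tiny shopping-style reference corpus, and both test sentences are invented for this example, and a real filter would use a proper language model.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Count word frequencies; returns (counts, total_tokens)."""
    counts = Counter(corpus.lower().split())
    return counts, sum(counts.values())

def perplexity(text, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("cannot score an empty document")
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))

# Hypothetical reference corpus: shopping-style web text only.
reference = ("the best deals on shoes click here for the best prices "
             "on shoes and bags the best offers are here")
counts, total = train_unigram(reference)

web_like = "click here for the best deals on bags"
math_like = "the eigenvalues of a hermitian operator are real"

ppl_web = perplexity(web_like, counts, total)
ppl_math = perplexity(math_like, counts, total)

# A "keep if perplexity is below the cutoff" rule, with the cutoff set
# between the two scores, keeps the ad copy and drops the math sentence.
cutoff = (ppl_web + ppl_math) / 2
kept = [d for d in (web_like, math_like) if perplexity(d, counts, total) < cutoff]
```

The math sentence scores worse purely because its vocabulary is absent from the reference corpus, not because it is lower quality. One common mitigation is to apply perplexity thresholds per domain or per language rather than globally.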
Domain Mixing
Different data sources contribute different capabilities. Web crawl gives broad coverage. Code is commonly credited with improving structured reasoning. Books improve long-range coherence. Academic papers improve factual and technical knowledge. The mix ratio across these domains shapes the model's capability profile.
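One way to reason about a domain mix is to convert the ratios into token budgets and check how many times each source would be repeated. The sketch below does that and also shows weighted sampling of domains per batch. All numbers (the 60/20/10/10 mix, the 1T-token budget, and the available-token counts) are hypothetical values chosen for illustration, not a recommended recipe.

```python
import random

def tokens_per_domain(weights, total_tokens):
    """Split a total token budget across domains in proportion to weights."""
    norm = sum(weights.values())
    return {name: total_tokens * w / norm for name, w in weights.items()}

def epochs_per_domain(weights, total_tokens, available_tokens):
    """How many passes over each source the mix implies."""
    budget = tokens_per_domain(weights, total_tokens)
    return {name: budget[name] / available_tokens[name] for name in weights}

def sample_domains(weights, n, seed=0):
    """Draw n domain names with probability proportional to weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical mix (in percent) and hypothetical unique tokens per source.
mix = {"web": 60, "code": 20, "books": 10, "papers": 10}
available = {"web": 5e12, "code": 1e11, "books": 5e10, "papers": 5e10}

epochs = epochs_per_domain(mix, total_tokens=1e12, available_tokens=available)
batch_domains = sample_domains(mix, n=1000)
```

With these made-up numbers, web text is seen for about 0.12 of an epoch while code, books, and papers are each repeated twice. Up-weighting a small, high-value source always comes at the price of repetition, which is why the mix and the deduplicated size of each source have to be decided together.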
Chinchilla Scaling Laws
The Chinchilla paper (Hoffmann et al., 2022) revised the compute-optimal scaling recipe. The finding: for a given compute budget, you should train a smaller model on more tokens than previous wisdom suggested. The optimal ratio is approximately 20 tokens per parameter.
- Previous practice (GPT-3 era): scale parameters aggressively and train on roughly 300B tokens regardless of model size.
- Chinchilla: a 70B model trained on 1.4T tokens outperforms the 280B Gopher model trained on 300B tokens at the same compute.
- Practical constraint: inference cost. Smaller models trained longer are cheaper to serve. Llama 2 7B was trained on 2T tokens, roughly 285 tokens per parameter and far past the compute-optimal ratio, which trades extra training compute for cheaper inference.
- The law breaks down at very long training: eventually you run out of high-quality data, and repeated data hurts.
# Chinchilla optimal allocation
# N* = optimal params, D* = optimal tokens
# Given compute budget C = 6 * N * D (approx)
# Hoffmann et al. finding:
# N* ∝ C^0.5 (scale params with sqrt of compute)
# D* ∝ C^0.5 (scale tokens with sqrt of compute)
# → D* / N* ≈ 20 (train ~20 tokens per parameter)
# Practical implication:
# Budget = 10^23 FLOPs, so 6 * N * (20 * N) = 10^23 and N = sqrt(10^23 / 120)
# Optimal: ~29B params, ~580B tokens
# Not: ~100B params, ~170B tokens (GPT-3 style, under 2 tokens per parameter)
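The allocation rule can be turned into a small calculator. This is a sketch that assumes the two approximations stated above, C ≈ 6·N·D and D ≈ 20·N; the function name and default arguments are choices made for this example, and the fitted constants in Hoffmann et al. differ slightly depending on the estimation method.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Solve C = 6 * N * D with D = 20 * N for N (params) and D (tokens)."""
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 10^23 FLOP budget.
n_star, d_star = chinchilla_optimal(1e23)

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens
# corresponds to 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
chinchilla_flops = 6 * 70e9 * 1.4e12
n_check, d_check = chinchilla_optimal(chinchilla_flops)
```

For 10^23 FLOPs this gives roughly 29B parameters and 580B tokens. Feeding in Chinchilla's own compute budget recovers about 70B parameters and 1.4T tokens, which is a quick way to confirm the two approximations are mutually consistent.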
What the Research Engineer Interview Asks
Frontier lab RE interviews often include a system design component: 'You have a 10^23 FLOP budget and want to build the best 7B model you can for coding + reasoning. Walk me through your data pipeline decisions.' They want to hear: dedup strategy, quality filter rationale, domain mix with justification, Chinchilla-aware compute allocation.
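For the interview scenario, the model size is fixed at 7B, so the compute budget determines the token count rather than the other way round. The sketch below works through that arithmetic using the same C ≈ 6·N·D approximation; the function names and the 1.5T figure for available unique tokens are hypothetical, introduced only to illustrate the data-repetition question.

```python
def tokens_for_fixed_model(compute_flops, n_params, flops_per_param_token=6.0):
    """Tokens you can afford to train on when the model size is fixed."""
    return compute_flops / (flops_per_param_token * n_params)

def overtraining_factor(n_tokens, n_params, optimal_ratio=20.0):
    """How far past the ~20 tokens/param ratio the run goes."""
    return (n_tokens / n_params) / optimal_ratio

def epochs_needed(n_tokens, unique_tokens):
    """Passes over the corpus needed to reach the token budget."""
    return n_tokens / unique_tokens

budget_flops = 1e23
model_params = 7e9

train_tokens = tokens_for_fixed_model(budget_flops, model_params)
factor = overtraining_factor(train_tokens, model_params)

# Hypothetical: 1.5T unique tokens survive dedup and quality filtering.
epochs = epochs_needed(train_tokens, unique_tokens=1.5e12)
```

This works out to about 2.4T training tokens, around 340 tokens per parameter, or roughly 17 times the compute-optimal ratio. A strong answer acknowledges that the run is deliberately over-trained for inference cost, then asks whether enough deduplicated, high-quality code and reasoning data exists to fill that budget without heavy repetition.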