GenAI Systems Lab
Foundations & Architecture 12 min read

Pre-training Data Decisions: Deduplication, Quality Filtering, Domain Mixing, and Chinchilla

MinHash LSH near-dedup vs. semantic dedup. The perplexity filter paradox. Domain mixing ratios by capability. Chinchilla scaling laws: ~20 tokens per parameter is the compute-optimal ratio. Why modern practice trains smaller models longer.

**Readable after any earlier step.** After this post you'll understand that model capability is decided before a single weight is trained — what goes into the corpus, why data quality beats data quantity, and how deduplication and filtering shape what the model knows.

The pre-training data pipeline is where large language model capability is actually determined, before architecture and before scale. Research Engineers who understand these decisions can explain why two models with the same parameter count can perform very differently on the same benchmark.

Deduplication

Duplicate data inflates the nominal token count without adding new information, encourages memorization, and hurts generalization. Exact-match dedup is not enough. Near-deduplication is essential: C4, The Pile, and RedPajama were all later found to contain substantial near-duplicate content.

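# MinHash LSH for near-dedup (simplified)
from datasketch import MinHash, MinHashLSH

def shingle(text, k=5):
    """Return the set of word-level k-shingles for a document."""
    tokens = text.split()
    if len(tokens) < k:
        # Short documents would otherwise produce no shingles at all
        return {' '.join(tokens)} if tokens else set()
    return {' '.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_doc(text, num_perm=128):
    """Build a MinHash signature from the document's shingles."""
    m = MinHash(num_perm=num_perm)
    for s in shingle(text):
        m.update(s.encode('utf8'))
    return m

# Add to LSH index, query for near-duplicates.
# A Jaccard threshold of ~0.8 is a common starting point; tune it per corpus.
def dedup(docs, threshold=0.8, num_perm=128):
    """docs maps doc_id -> text; returns the ids kept after near-dedup."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs.items():
        m = minhash_doc(text, num_perm)
        if not lsh.query(m):  # no indexed doc looks like a near-duplicate
            lsh.insert(doc_id, m)
            kept.append(doc_id)
    return kept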
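An LSH threshold is soft, not a hard cutoff. The sketch below uses the standard banding analysis for MinHash LSH: with `bands` bands of `rows` hashes each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^rows)^bands. The 16 x 8 split of 128 permutations is an illustrative assumption; libraries typically pick the split for you from the requested threshold.

```python
def candidate_probability(jaccard, bands, rows):
    """Probability that two docs share at least one identical LSH band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands

def approx_threshold(bands, rows):
    """Similarity where the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)

# Hypothetical split of 128 permutations into 16 bands of 8 rows
bands, rows = 16, 8
for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"Jaccard {s:.1f} -> candidate probability {p:.3f}")
print(f"approximate threshold: {approx_threshold(bands, rows):.3f}")
```

More rows per band push the effective threshold up and make the curve steeper. More bands pull it down and catch more pairs, at the cost of more false candidates to verify.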

Quality Filtering

Raw web crawl data is mostly low-quality. Quality filtering removes content that will hurt model quality: spam, SEO content, boilerplate, incoherent text, adult content.

The perplexity filter paradox: filtering for low-perplexity under a small reference LM biases toward text that looks like what the reference LM was trained on. You can accidentally filter out high-quality content in domains underrepresented in the reference model.
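A toy illustration of the paradox, assuming a unigram model with add-one smoothing as a stand-in for the small reference LM. Real pipelines use a stronger model, and the texts and threshold here are made up for the example. A coherent sentence from an unseen domain scores worse than a trivial in-domain one simply because its vocabulary is unfamiliar.

```python
import math
from collections import Counter

def train_unigram(reference_text):
    """Count tokens in the reference corpus (the 'reference LM')."""
    counts = Counter(reference_text.lower().split())
    return counts, sum(counts.values())

def perplexity(text, counts, total):
    """Unigram perplexity with add-one smoothing; unseen tokens share one extra slot."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    vocab_size = len(counts) + 1
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))

def keep(text, counts, total, max_ppl):
    """Perplexity filter: keep documents the reference model finds unsurprising."""
    return perplexity(text, counts, total) <= max_ppl

# Hypothetical reference corpus and candidate documents
counts, total = train_unigram("the cat sat on the mat the dog sat on the rug")
in_domain = "the dog sat on the mat"
out_of_domain = "enzymes cleave peptide bonds in the cell"

ppl_in = perplexity(in_domain, counts, total)
ppl_out = perplexity(out_of_domain, counts, total)
print(f"in-domain perplexity: {ppl_in:.1f}, out-of-domain perplexity: {ppl_out:.1f}")
```

With a cutoff between the two scores, the filter keeps the pet sentence and drops the biochemistry sentence, even though nothing about the second is low quality.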

Domain Mixing

Different data sources contribute different capabilities. Web crawl gives broad coverage. Code is widely reported to help reasoning, books help long-range coherence, and academic papers add dense factual and technical content. The domain mix ratio shapes the model's capability profile.
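One way to make a mix ratio concrete is to turn it into per-domain token budgets and check how many passes each source would need. The sketch below does that. The weights, source sizes, and budget are invented for illustration and are not a recommended recipe.

```python
import math

def allocate_tokens(mix_weights, total_tokens):
    """Split a total token budget across domains in proportion to their weights."""
    total_weight = sum(mix_weights.values())
    return {d: total_tokens * w / total_weight for d, w in mix_weights.items()}

def epochs_per_domain(allocation, available_tokens):
    """Passes over each source implied by the allocation (above 1.0 means repetition)."""
    return {d: allocation[d] / available_tokens[d] for d in allocation}

# Hypothetical mix weights and post-filtering source sizes, in tokens
mix = {"web": 0.60, "code": 0.20, "books": 0.10, "papers": 0.10}
available = {"web": 5e12, "code": 2e11, "books": 5e10, "papers": 1e11}
budget = 1e12

allocation = allocate_tokens(mix, budget)
epochs = epochs_per_domain(allocation, available)
for domain in mix:
    print(f"{domain}: {allocation[domain]:.2e} tokens, {epochs[domain]:.2f} epochs")
```

In this made-up setup the web source is heavily subsampled while the books source is seen twice. That is the kind of trade-off a mix ratio implies: upweighting a small, high-value source means repeating it.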

Chinchilla Scaling Laws

The Chinchilla paper (Hoffmann et al., 2022) revised the compute-optimal scaling recipe. The finding: for a given compute budget, you should train a smaller model on more tokens than previous wisdom suggested. The optimal ratio is approximately 20 tokens per parameter.

# Chinchilla compute-optimal allocation
# N = parameters, D = training tokens, C = training compute in FLOPs
# Approximation: C ≈ 6 * N * D

# Hoffmann et al. finding (both exponents are approximately 0.5):
# N* ∝ C^0.5  (scale params with sqrt of compute)
# D* ∝ C^0.5  (scale tokens with sqrt of compute)
# → D* / N* ≈ 20  (train ~20 tokens per parameter)

# Practical implication, substituting D = 20 * N into C ≈ 6 * N * D:
# C ≈ 120 * N^2  →  N* ≈ sqrt(C / 120)
# Budget = 10^23 FLOPs
# Optimal: ~29B params, ~580B tokens
# Not: ~100B params, ~170B tokens (GPT-3-style ratio of ~1.7 tokens per parameter)
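The allocation rule can be turned into a small calculator. This is a sketch under the two approximations used above, C ≈ 6 * N * D and D ≈ 20 * N. The function name and defaults are illustrative and do not come from the paper.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Return (params, tokens) that spend compute_flops at the given tokens/param ratio."""
    # C = 6 * N * D with D = ratio * N gives C = 6 * ratio * N^2
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens
chinchilla_budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_optimal(chinchilla_budget)
print(f"params: {n_params:.3g}, tokens: {n_tokens:.3g}")
```

Because both quantities grow with the square root of compute, a 100x larger budget implies roughly 10x the parameters and 10x the tokens.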

What the Research Engineer Interview Asks

Frontier lab RE interviews often include a system design component: 'You have a 10^23 FLOP budget and want to build the best 7B model you can for coding + reasoning. Walk me through your data pipeline decisions.' They want to hear: dedup strategy, quality filter rationale, domain mix with justification, Chinchilla-aware compute allocation.
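The interview prompt fixes the model size, so the budget determines the token count, not the other way around. A rough sketch using the same C ≈ 6 * N * D approximation shows that a 7B model at 10^23 FLOPs lands far past the ~20 tokens-per-parameter point. A strong answer acknowledges this as deliberate over-training, the "smaller model trained longer" pattern, often motivated by cheaper inference. The helper names are illustrative.

```python
def tokens_for_fixed_model(compute_flops, n_params, flops_per_param_token=6.0):
    """Tokens you can afford when the parameter count is fixed in advance."""
    return compute_flops / (flops_per_param_token * n_params)

def overtraining_factor(compute_flops, n_params, optimal_ratio=20.0):
    """How many times past the compute-optimal tokens/param ratio this run trains."""
    tokens = tokens_for_fixed_model(compute_flops, n_params)
    return (tokens / n_params) / optimal_ratio

budget = 1e23      # FLOPs, from the interview prompt
n_params = 7e9     # fixed 7B model

tokens = tokens_for_fixed_model(budget, n_params)
factor = overtraining_factor(budget, n_params)
print(f"tokens: {tokens:.3g}, tokens/param: {tokens / n_params:.0f}, factor: {factor:.1f}x")
```

This works out to roughly 2.4T tokens, about 340 tokens per parameter. The dedup, filtering, and mixing steps therefore have to deliver that many useful tokens or justify repeating data.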
