Continued Pretraining: When Fine-Tuning Isn't Deep Enough
When domain adaptation requires continued pretraining on unlabelled text rather than supervised fine-tuning. Medical, legal, and code domains — what it takes and when it's worth it.
Instruction fine-tuning teaches a model how to respond. Continued pretraining teaches a model what to know. The distinction matters when your domain is genuinely out-of-distribution from the base model's pretraining data — where the vocabulary, concepts, and reasoning patterns of your domain are so specialised that no amount of instruction tuning on labelled examples will fully close the gap.
Medicine. Law. Highly specialised scientific domains. Proprietary internal codebases with unique conventions. These are the domains where continued pretraining on unlabelled text becomes the right tool.
What continued pretraining actually does
Continued pretraining runs the standard language modelling objective (predict the next token) on a large corpus of domain-specific text — without any instruction-response structure. The model doesn't learn to answer questions; it learns the statistical patterns, terminology, and reasoning structures of the domain.
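To make the objective concrete, here is a minimal sketch of the next-token loss on a toy sequence. The function name `next_token_loss`, the three-token vocabulary, and the hand-written logits are illustrative assumptions; a real run would take logits from the model and compute the loss in a training framework.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy for next-token prediction.

    logits[t] is the score vector over the vocabulary after the model has
    seen token_ids[: t + 1]; the target at position t is token_ids[t + 1].
    """
    targets = token_ids[1:]  # every token except the first is a prediction target
    total = 0.0
    for scores, target in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # negative log-probability of the target
    return total / len(targets)

# Toy data (assumed): a 3-token vocabulary and a 3-token sequence.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = next_token_loss(uniform_logits, token_ids)

# Logits that favour the correct next token give a lower loss.
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
better_loss = next_token_loss(confident_logits, token_ids)
```

Note that there are no labels here beyond the text itself: the targets are simply the input shifted by one position, which is why raw domain text is enough.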
After continued pretraining, you still need instruction fine-tuning on top to teach the model how to use that knowledge in response to instructions. Continued pretraining → instruction fine-tuning is the standard two-stage pipeline for deep domain adaptation.
Continued pretraining changes what the model knows. Instruction fine-tuning changes how the model responds. You often need both. The order is always: continued pretraining first, then instruction fine-tuning on top of the domain-adapted base.
When continued pretraining is worth it
- Your domain has specialised vocabulary that the base model tokenises inefficiently (medical and legal Latin, scientific notation, identifiers from proprietary codebases)
- Your domain requires multi-step reasoning patterns not present in general instruction data
- You have a large corpus of unlabelled domain text (10B+ tokens ideally) but limited labelled examples
- Instruction fine-tuning on labelled data has reached a quality ceiling that more data doesn't improve
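One rough way to check the first criterion is to compare tokens per word on general text and on domain text. The sketch below assumes a `tokenize` callable that maps a string to a list of tokens; `toy_tokenize` is a stand-in written for this example, not a real tokenizer, and in practice you would pass the base model's own tokenizer instead.

```python
from typing import Callable, Iterable, List

def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = 0
    tokens = 0
    for text in texts:
        words += len(text.split())
        tokens += len(tokenize(text))
    return tokens / words if words else 0.0

# Hypothetical stand-in tokenizer: known words stay whole, unknown words
# are split into 4-character chunks, mimicking subword fragmentation.
KNOWN_WORDS = frozenset({"the", "patient", "has", "a", "fever"})

def toy_tokenize(text: str, chunk: int = 4) -> List[str]:
    pieces: List[str] = []
    for word in text.lower().split():
        if word in KNOWN_WORDS:
            pieces.append(word)
        else:
            pieces.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return pieces

general_ratio = tokens_per_word(["the patient has a fever"], toy_tokenize)
domain_ratio = tokens_per_word(["hepatosplenomegaly thrombocytopenia"], toy_tokenize)
```

A domain ratio far above the general ratio suggests the base model sees your terminology as fragments, which is one signal (not proof) that the domain is out-of-distribution.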
When it's not worth it
- You have <1B tokens of domain text: the compute cost is high relative to the knowledge gain
- Your domain is well-represented in the base model's pretraining data (general web text, common programming languages, mainstream science)
- Your quality gap is about format or style rather than domain knowledge: instruction fine-tuning alone will close it
- You don't have the infrastructure for multi-GPU multi-day training runs
Data requirements and preparation
Continued pretraining data is unlabelled: just raw text from your domain. Quality still matters enormously. A 10B-token corpus of high-quality medical literature will typically produce a better model than 50B tokens of scraped web text that happens to mention medical topics.
- Source selection: peer-reviewed papers, authoritative reference materials, high-quality domain documentation
- Deduplication: aggressive deduplication at the document level — duplicated text teaches the model to memorise, not generalise
- Length filtering: remove very short documents (metadata, headers) and very long ones that likely contain non-domain text
- Data mixing: mix domain text with a fraction (~10–20%) of general text to prevent catastrophic forgetting of general capabilities
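The deduplication, length filtering, and mixing steps above can be sketched as follows. Everything here is an assumption for illustration: the function names, the word-count thresholds, and the toy documents. The deduplication shown is exact-match after whitespace and case normalisation; near-duplicate detection is more aggressive and is not shown. Mixing is done by document count for simplicity, whereas a real pipeline would more likely balance by token count.

```python
import hashlib
import random

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalisation."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def length_filter(docs, min_words=5, max_words=50_000):
    """Drop very short and very long documents (thresholds are placeholders)."""
    return [d for d in docs if min_words <= len(d.split()) <= max_words]

def mix(domain_docs, general_docs, general_fraction=0.15, seed=0):
    """Add general documents so they make up roughly general_fraction of the mix."""
    n_general = round(len(domain_docs) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_docs))
    rng = random.Random(seed)
    mixed = domain_docs + rng.sample(general_docs, n_general)
    rng.shuffle(mixed)
    return mixed

# Toy corpus (assumed): 17 distinct notes, one case/whitespace duplicate, one header.
domain_raw = [f"Clinical note {i}: patient presented with symptoms and was treated."
              for i in range(17)]
domain_raw += ["Clinical note 0:  PATIENT presented with symptoms and was treated.",
               "Page 3"]
general_raw = [f"General article {i} about everyday topics and recent news."
               for i in range(10)]

domain_clean = length_filter(deduplicate(domain_raw))
corpus = mix(domain_clean, general_raw, general_fraction=0.15)
```

Fixing the random seed keeps the mix reproducible across runs, which makes it easier to compare training runs that differ only in the mixing fraction.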
Continued pretraining vs. RAG
| Aspect | Continued Pretraining | RAG |
|---|---|---|
| Knowledge type | Statistical patterns, reasoning structures, vocabulary | Specific factual claims, citations |
| Update cost | High — requires retraining | Low — update the index |
| Knowledge freshness | Static until next training run | Can be updated in real time |
| Hallucination risk | Not reduced for facts outside the training corpus | Reduced for facts present in the retrieved documents |
| Best for | Deep domain vocabulary + reasoning | Dynamic factual knowledge + citation |