
Continued Pretraining: When Fine-Tuning Isn't Deep Enough

Some domain adaptation problems call for continued pretraining on unlabelled text rather than supervised fine-tuning. This article covers what that takes in medical, legal, and code domains, and when it is worth it.

Instruction fine-tuning teaches a model how to respond. Continued pretraining teaches a model what to know. The distinction matters when your domain is genuinely out-of-distribution from the base model's pretraining data — where the vocabulary, concepts, and reasoning patterns of your domain are so specialised that no amount of instruction tuning on labelled examples will fully close the gap.

Medicine. Law. Highly specialised scientific domains. Proprietary internal codebases with unique conventions. These are the domains where continued pretraining on unlabelled text becomes the right tool.

What continued pretraining actually does

Continued pretraining runs the standard language modelling objective (predict the next token) on a large corpus of domain-specific text — without any instruction-response structure. The model doesn't learn to answer questions; it learns the statistical patterns, terminology, and reasoning structures of the domain.
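To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. The function name `next_token_loss` and the list-of-lists input format are illustrative assumptions; a real training run would compute the same quantity inside a deep learning framework.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per input position.
    token_ids: the token ids of the same sequence.
    """
    if len(logits) != len(token_ids):
        raise ValueError("need one logit vector per token")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a target")

    total = 0.0
    # Position t predicts token t+1, so the last position has no target.
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]
        # Numerically stable log-sum-exp gives the softmax normaliser.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
    return total / (len(token_ids) - 1)


# A model with no preference over a 4-token vocabulary scores ln(4) per token.
uniform = [[0.0] * 4 for _ in range(3)]
print(next_token_loss(uniform, [0, 1, 2]))
```

Every token in the corpus is a target here. There is no notion of a prompt or an answer, which is what separates this stage from instruction fine-tuning.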

After continued pretraining, you still need instruction fine-tuning on top to teach the model how to use that knowledge in response to instructions. Continued pretraining → instruction fine-tuning is the standard two-stage pipeline for deep domain adaptation.
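The two stages also differ in how training examples are built. The sketch below is one common way to do it, not the only one: raw documents are packed into fixed-size blocks for continued pretraining, while instruction examples mask the prompt so that only the response contributes to the loss. The helper names are hypothetical. The value `-100` matches the default `ignore_index` of PyTorch's cross-entropy loss, and masking the prompt is a widespread convention rather than a requirement.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pack_for_pretraining(docs, eos_id, block_size):
    """Join tokenised documents with an EOS separator and cut fixed-size blocks.

    Every token is a training target. A trailing partial block is dropped.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def sft_example(prompt_ids, response_ids, eos_id):
    """Build (input_ids, labels) for one instruction example.

    Prompt positions are masked, so only the response and EOS are scored.
    """
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return input_ids, labels


print(pack_for_pretraining([[1, 2, 3], [4, 5]], eos_id=0, block_size=3))
# [[1, 2, 3], [0, 4, 5]]
print(sft_example([7, 8], [9], eos_id=0))
# ([7, 8, 9, 0], [-100, -100, 9, 0])
```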

Continued pretraining changes what the model knows. Instruction fine-tuning changes how the model responds. You often need both, in that order: continued pretraining first, then instruction fine-tuning on top of the domain-adapted base.

When continued pretraining is worth it

Your domain has specialised vocabulary that the base model tokenises inefficiently (medical Latin, legal Latin, scientific notation, identifiers from proprietary codebases)
Your domain requires multi-step reasoning patterns not present in general instruction data
You have a large corpus of unlabelled domain text (10B+ tokens ideally) but limited labelled examples
Instruction fine-tuning on labelled data has reached a quality ceiling that more data doesn't improve
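The first criterion above can be checked with a rough measurement: the average number of tokens per whitespace-separated word. The sketch below assumes you pass in your base model's tokeniser as a callable; `toy_tokenize` is a hypothetical stand-in used only so the example runs on its own. Tokenising word by word ignores how real tokenisers handle whitespace, so treat the result as an approximation.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens per whitespace-separated word.

    tokenize: any callable that maps a string to a list of tokens.
    """
    words = 0
    tokens = 0
    for text in texts:
        for word in text.split():
            words += 1
            tokens += len(tokenize(word))
    if words == 0:
        raise ValueError("no words to measure")
    return tokens / words


def toy_tokenize(word, piece_len=4):
    """Hypothetical tokeniser: cuts a word into fixed 4-character pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


print(tokens_per_word(["the cat sat"], toy_tokenize))         # 1.0
print(tokens_per_word(["hepatosplenomegaly"], toy_tokenize))  # 5.0
```

A ratio that is clearly higher on a sample of your domain text than on general text is one signal that the base model sees your vocabulary as fragments rather than as terms.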

When it's not worth it

You have <1B tokens of domain text: the compute cost is high relative to the knowledge gain
Your domain is well-represented in the base model's pretraining data (general web text, common programming languages, mainstream science)
Your quality ceiling is format/style rather than domain knowledge: instruction fine-tuning alone will close that gap
You don't have the infrastructure for multi-GPU multi-day training runs
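A back-of-envelope estimate helps with the compute and infrastructure points above. The sketch uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The model size, the 312 TFLOP/s peak (the published dense BF16 figure for an NVIDIA A100), and the 40% utilisation are all assumptions; substitute your own numbers.

```python
def estimate_gpu_days(n_params, n_tokens, peak_flops_per_gpu, utilisation):
    """Rough training cost in GPU-days using the 6 * params * tokens rule of thumb."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    total_flops = 6 * n_params * n_tokens
    effective_flops_per_second = peak_flops_per_gpu * utilisation
    seconds = total_flops / effective_flops_per_second
    return seconds / 86_400  # seconds per day


# Assumed scenario: 7B-parameter model, 10B domain tokens, A100-class GPUs.
gpu_days = estimate_gpu_days(
    n_params=7e9,
    n_tokens=10e9,
    peak_flops_per_gpu=312e12,
    utilisation=0.4,
)
print(round(gpu_days, 1))      # roughly 39 GPU-days
print(round(gpu_days / 8, 1))  # roughly 5 days of wall-clock time on 8 GPUs
```

Under these assumptions, even a single pass over a 10B-token corpus is a multi-GPU, multi-day job, before counting evaluation runs, restarts, and the general-text mix.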

Data requirements and preparation

Continued pretraining data is unlabelled: just raw text from your domain. Quality still matters enormously. A 10B-token corpus of high-quality medical literature will typically produce a better model than 50B tokens of scraped web text that happens to mention medical topics.

Source selection: peer-reviewed papers, authoritative reference materials, high-quality domain documentation
Deduplication: aggressive deduplication at the document level — duplicated text teaches the model to memorise, not generalise
Length filtering: remove very short documents (metadata, headers) and unusually long ones, which are more likely to contain non-domain text
Data mixing: mix domain text with a fraction (~10–20%) of general text to prevent catastrophic forgetting of general capabilities
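The deduplication, length filtering, and mixing steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: all function names are hypothetical, word counts stand in for token counts, and hashing only catches documents that are identical after whitespace and case normalisation. Near-duplicate detection needs a fuzzy method such as MinHash, which is not shown.

```python
import hashlib
import random


def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_and_filter(docs, min_words, max_words):
    """Drop documents outside the length bounds, then drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        n_words = len(doc.split())
        if n_words < min_words or n_words > max_words:
            continue
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


def mix_with_general(domain_docs, general_docs, general_fraction, seed=0):
    """Add general text until it makes up about general_fraction of all words."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    domain_words = sum(len(d.split()) for d in domain_docs)
    budget = domain_words * general_fraction / (1 - general_fraction)

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    pool = list(general_docs)
    rng.shuffle(pool)

    picked, used = [], 0
    for doc in pool:
        if used >= budget:
            break
        picked.append(doc)
        used += len(doc.split())

    mixed = list(domain_docs) + picked
    rng.shuffle(mixed)
    return mixed


raw = ["a b c d e", "A  b c d e", "x", "p q r s t"]
clean = dedup_and_filter(raw, min_words=3, max_words=10)
print(clean)  # ['a b c d e', 'p q r s t']
```

Calling `mix_with_general(clean, general_docs, general_fraction=0.2)` would then add roughly one general word for every four domain words. The right fraction within the 10–20% range is something to tune against a general-capability benchmark rather than fix in advance.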

Continued pretraining vs. RAG

Aspect	Continued Pretraining	RAG
Knowledge type	Statistical patterns, reasoning structures, vocabulary	Specific factual claims, citations
Update cost	High — requires retraining	Low — update the index
Knowledge freshness	Static until next training run	Can be updated in real time
Hallucination risk	Not reduced for facts outside the training corpus	Reduced for facts present in the retrieved documents
Best for	Deep domain vocabulary + reasoning	Dynamic factual knowledge + citation

