Building Instruction Tuning Datasets: Quality Over Everything
The data decisions that determine whether your fine-tuned model improves or degrades. Format design, diversity requirements, quality filtering, synthetic data generation, and the LIMA principle in practice.
The most common reason fine-tuned models fail in production isn't the training algorithm, the learning rate, or the base model choice. It's the dataset. Specifically: a dataset that contains enough examples to reduce training loss, but not enough quality or diversity to produce a model that generalises to real production inputs.
Dataset curation for instruction tuning is a discipline in itself. The LIMA paper showed that a model fine-tuned on just 1,000 carefully curated examples, with no RLHF at all, can produce responses competitive with RLHF-trained models. The flip side: 10,000 poorly curated examples can produce a model worse than your baseline. Quality is not a nice-to-have; it is the primary variable.
Anatomy of an instruction tuning example
Every example in an instruction tuning dataset is a (system, instruction, response) triple, though the system prompt is sometimes omitted or baked into the instruction. The structure matters as much as the content:
{
  "system": "You are a helpful assistant that answers questions about company policy.",
  "instruction": "Can employees expense meals during remote work days?",
  "response": "Yes. As of January 2024, remote employees may expense up to ₹1,800 per day for meals while working remotely. Receipts are required for amounts above ₹500, and claims must be submitted within 30 days."
}
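A structural check on every record catches malformed examples before they reach training. The following is a minimal sketch, assuming the dataset is stored as JSON Lines with the three field names shown above; `validate_example` and `load_jsonl` are hypothetical helper names, and treating `system` as optional is an assumption based on the note that it is sometimes omitted.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")  # "system" is treated as optional (assumption)
ALLOWED_FIELDS = {"system", "instruction", "response"}


def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one example (empty list means it passed)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = example.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    system = example.get("system")
    if system is not None and not isinstance(system, str):
        problems.append("system must be a string when present")
    unknown = set(example) - ALLOWED_FIELDS
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems


def load_jsonl(lines):
    """Parse JSON Lines, yielding (line_number, example_or_None, problems)."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            yield number, None, [f"invalid JSON: {err.msg}"]
            continue
        if not isinstance(example, dict):
            yield number, None, ["record is not a JSON object"]
            continue
        yield number, example, validate_example(example)
```

Passing an open file handle to `load_jsonl` works because it only needs an iterable of lines. Records with a non-empty problem list can be logged and dropped, or sent back for repair.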
The five data quality dimensions
- Correctness: every response must be factually accurate for your domain. Wrong answers teach the model to be confidently wrong. Human review of a sample is non-negotiable.
- Format consistency: all responses must follow exactly the same structure, tone, and length guidelines. Inconsistency is learned as a pattern.
- Instruction diversity: if 80% of your examples are the same type of question, the model overfits to that type. Ensure coverage across all instruction types your production system will receive.
- Difficulty distribution: include easy, medium, and hard examples. A dataset skewed toward simple examples produces a model that struggles with complex production queries.
- Edge case coverage: identify the failure modes in your baseline model and create training examples that specifically address them.
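Instruction diversity and difficulty distribution can be measured if each example carries metadata labels. The sketch below assumes hypothetical `category` and `difficulty` fields that you would add during annotation; they are not part of the triple format above. The 0.5 cutoff is purely illustrative.

```python
from collections import Counter


def label_shares(examples, key):
    """Return {label: fraction of examples} for a metadata field such as "category"."""
    counts = Counter(ex.get(key, "unlabelled") for ex in examples)
    total = sum(counts.values())
    if not total:
        return {}
    return {label: count / total for label, count in counts.items()}


def dominant_labels(examples, key, max_share=0.5):
    """List labels whose share of the dataset exceeds max_share (illustrative cutoff)."""
    shares = label_shares(examples, key)
    return sorted(label for label, share in shares.items() if share > max_share)
```

Running `dominant_labels(dataset, "category")` and `label_shares(dataset, "difficulty")` before training gives a quick view of whether one instruction type or difficulty band is crowding out the rest.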
Synthetic data generation — and its limits
Generating synthetic training data with GPT-4 or Claude is faster and cheaper than human annotation. The typical approach: write a few seed examples, then prompt the model to generate variations at scale. This works well for format and style tasks, and reasonably well for straightforward instruction following.
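The seed-and-expand approach can be sketched as a prompt builder. This example only assembles the text you would send to a generator model; it makes no API call, and the prompt wording, the function name `build_generation_prompt`, and its parameters are all illustrative assumptions rather than a recommended template.

```python
import json
import random


def build_generation_prompt(seed_examples, n_new=5, n_shots=3, rng=None):
    """Build a few-shot prompt asking a generator model for new examples.

    A random subset of seeds is shown each time so that repeated calls
    do not all anchor on the same few examples.
    """
    rng = rng or random.Random()
    shots = rng.sample(seed_examples, k=min(n_shots, len(seed_examples)))
    shot_text = "\n".join(json.dumps(shot, ensure_ascii=False) for shot in shots)
    return (
        "Here are examples of the instruction/response pairs we need, "
        "one JSON object per line:\n"
        f"{shot_text}\n\n"
        f"Write {n_new} new examples in the same JSON format, one per line. "
        "Vary the topic, phrasing, and difficulty; do not copy the examples above."
    )
```

Rotating the seeds between calls is one simple way to push back against the formulaic output that generator models tend to produce.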
The limits of synthetic data: it inherits the biases and failure modes of the generator model. If you're fine-tuning a model to be better than GPT-4 at a specific task, training it on GPT-4's outputs creates a ceiling. And synthetic data tends to be more formulaic than human-written examples — less stylistic diversity, less natural variation in phrasing.
Never fine-tune a model on its own unfiltered outputs (self-distillation without quality filtering). This creates a feedback loop in which the model reinforces its own errors. If you do use the target model to generate training data, filter aggressively with a separate evaluation model or human review.
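A filtering gate for generated examples might look like the sketch below. The `judge` argument stands in for whatever separate evaluation model or human rating you use; it is a hypothetical callable that returns a score between 0 and 1, and the 0.8 threshold is an arbitrary placeholder.

```python
from typing import Callable, Iterable


def filter_generated(
    examples: Iterable[dict],
    judge: Callable[[dict], float],
    min_score: float = 0.8,
):
    """Split examples into (kept, rejected) using an external judge's score.

    The score is attached to each record so rejected items can be audited later.
    """
    kept, rejected = [], []
    for example in examples:
        score = judge(example)
        scored = {**example, "judge_score": score}
        if score >= min_score:
            kept.append(scored)
        else:
            rejected.append(scored)
    return kept, rejected
```

Keeping the rejected pile, rather than discarding it, makes it possible to check whether the judge itself is too strict or too lenient.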
Dataset filtering checklist
- Remove near-duplicates: use MinHash or cosine similarity on embeddings, and drop one example from any pair whose similarity exceeds 0.9.
- Length filter: remove examples with unusually short responses (likely low quality) or unusually long ones (likely templated).
- Quality classifier: train a simple classifier on a human-rated sample to score the rest of the dataset
- Toxicity/safety filter: run every response through a safety classifier before using as training data
- Human spot-check: manually review 5% of the final dataset before training
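Three of the checklist steps can be sketched with the standard library alone. Note the substitution: instead of MinHash or embeddings, this example computes exact Jaccard similarity over word shingles, which is the quantity MinHash approximates and is only practical for small datasets because it compares every pair. The word-count bounds in `length_filter` are placeholder assumptions; the 0.9 threshold and 5% sample come from the checklist.

```python
import random


def shingles(text, k=3):
    """Set of k-word shingles from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(examples, threshold=0.9):
    """Keep an example only if it is not too similar to any already-kept one."""
    kept, kept_shingles = [], []
    for ex in examples:
        sh = shingles(ex["instruction"] + " " + ex["response"])
        if all(jaccard(sh, other) <= threshold for other in kept_shingles):
            kept.append(ex)
            kept_shingles.append(sh)
    return kept


def length_filter(examples, min_words=5, max_words=400):
    """Drop examples whose response word count falls outside the bounds."""
    return [
        ex for ex in examples
        if min_words <= len(ex["response"].split()) <= max_words
    ]


def spot_check_sample(examples, fraction=0.05, seed=0):
    """Pick a reproducible random sample for manual review (at least one item)."""
    if not examples:
        return []
    size = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, size)
```

For datasets beyond a few thousand examples, the pairwise loop in `drop_near_duplicates` becomes the bottleneck, which is where MinHash with locality-sensitive hashing or an embedding index earns its place.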
How much data do you need
For format and style adaptation: 500–2,000 high-quality examples is usually sufficient. For domain knowledge adaptation: 2,000–10,000 examples. For deep specialisation on complex tasks: 10,000–50,000. Beyond 50,000 supervised examples, the marginal return per example drops sharply — at that scale, consider whether continued pretraining on unlabelled domain text might be more efficient.