GenAI Systems Lab Open interactive version →
AI Engineering 11 min read

Building Instruction Tuning Datasets: Quality Over Everything

The data decisions that determine whether your fine-tuned model improves or degrades. Format design, diversity requirements, quality filtering, synthetic data generation, and the LIMA principle in practice.

The most common reason fine-tuned models fail in production isn't the training algorithm, the learning rate, or the base model choice. It's the dataset. Specifically: a dataset that contains enough examples to reduce training loss, but not enough quality or diversity to produce a model that generalises to real production inputs.

Dataset curation for instruction tuning is a discipline in itself. The LIMA paper showed that 1,000 carefully curated examples can match RLHF-heavy approaches. The corollary: 10,000 poorly curated examples can produce a model worse than your baseline. Quality is not a nice-to-have — it is the primary variable.

Anatomy of an instruction tuning example

Every example in an instruction tuning dataset is a (system_prompt, instruction, response) triple — though system_prompt is sometimes omitted or baked into the instruction. The structure matters as much as the content:

{
  "system": "You are a helpful assistant that answers questions about company policy.",
  "instruction": "Can employees expense meals during remote work days?",
  "response": "Yes. As of January 2024, remote employees may expense up to ₹1,800 per day for meals while working remotely. Receipts are required for amounts above ₹500, and claims must be submitted within 30 days."
}

The five data quality dimensions

Synthetic data generation — and its limits

Generating synthetic training data with GPT-4 or Claude is faster and cheaper than human annotation. The typical approach: write a few seed examples, then prompt the model to generate variations at scale. This works well for format and style tasks, and reasonably well for straightforward instruction following.

The limits of synthetic data: it inherits the biases and failure modes of the generator model. If you're fine-tuning a model to be better than GPT-4 at a specific task, training it on GPT-4's outputs creates a ceiling. And synthetic data tends to be more formulaic than human-written examples — less stylistic diversity, less natural variation in phrasing.

Never fine-tune a model on its own outputs (self-distillation without quality filtering). This creates a feedback loop where the model reinforces its own errors. If you use the target model to generate training data, filter aggressively with a separate evaluation model or human review.

Dataset filtering checklist

How much data do you need

For format and style adaptation: 500–2,000 high-quality examples is usually sufficient. For domain knowledge adaptation: 2,000–10,000 examples. For deep specialisation on complex tasks: 10,000–50,000. Beyond 50,000 supervised examples, the marginal return per example drops sharply — at that scale, consider whether continued pretraining on unlabelled domain text might be more efficient.

Explore fine-tuning approaches →: See how dataset quality and size affect fine-tuned model performance.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →