
DPO in Practice: Building Preference Datasets and Running Alignment Training

Beyond the theory — how to actually collect preference data, format (prompt, chosen, rejected) triplets, run DPO training with PEFT, and evaluate the result against your baseline.

DPO (Direct Preference Optimization) is theoretically simpler than RLHF — one fine-tuning pass, no reward model, no PPO. But in practice, the data collection step is where most teams stumble. The quality of your preference dataset determines almost everything about the final model.

This post is the practical guide: how to collect preference data, how to format DPO triplets, how to configure and run DPO training, and how to evaluate the result.

What a DPO dataset looks like

Each training example is a triplet: a prompt, the response you prefer (chosen), and the response you want the model to move away from (rejected).

{
  "prompt": "Explain the difference between RAG and fine-tuning.",
  "chosen": "RAG and fine-tuning solve different problems. RAG retrieves external documents at inference time — it's ideal when knowledge needs to stay current or be cited. Fine-tuning updates model weights — it's better for teaching consistent format, tone, or task-specific reasoning patterns. Neither replaces the other.",
  "rejected": "RAG is better than fine-tuning because it uses real documents. Fine-tuning is when you train the model again. You should use RAG for most things."
}
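
Before training, it helps to check that every record really has this shape. The sketch below is one possible validator, not part of any library. It assumes the records are stored as JSON Lines (one object per line), which the example above does not specify, and `validate_record` and `load_preferences` are hypothetical helper names.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical sides carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def load_preferences(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text (assumed format) and keep only valid records."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if not validate_record(record):
            records.append(record)
    return records
```

Dropping invalid records silently is a design choice for brevity; in a real pipeline you would probably log what was rejected and why.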

How to collect preference data

Method 1: Human annotation of model outputs

Sample production prompts (or create diverse synthetic prompts). Generate 2–4 responses per prompt using the target model (or a stronger model). Have human annotators rank the responses. This is the highest-quality approach but expensive — budget ~$0.50–2.00 per example with skilled annotators.
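
Annotators produce rankings, but DPO needs pairs. A minimal sketch of the conversion, assuming each ranking is a list ordered best-first (`ranking_to_pairs` is a hypothetical helper, not a library function):

```python
from itertools import combinations


def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[dict]:
    """Turn one best-first ranking into (prompt, chosen, rejected) records.

    Every higher-ranked response is paired as 'chosen' against every
    lower-ranked response for the same prompt.
    """
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs
```

A ranking of 4 responses yields 6 pairs this way. Adjacent ranks often differ only slightly, so keeping just the best-versus-worst pair is a reasonable alternative if you want a cleaner signal at the cost of fewer examples.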

Method 2: AI-generated preference pairs

Use a strong model (GPT-4, Claude) to generate both chosen and rejected responses for your prompts. The chosen response should reflect ideal behaviour; the rejected response should exhibit the specific failure mode you're trying to fix (too verbose, wrong format, incorrect hedging, etc.). This is faster and cheaper than human annotation, but the quality of your chosen responses is capped by the generator model.
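
One way to keep the rejected response targeted is to name the failure mode explicitly in the generation request. The sketch below only builds the two request strings; it deliberately does not call any model API. The template wording and the `build_pair_requests` name are illustrative assumptions.

```python
def build_pair_requests(prompt: str, failure_mode: str) -> dict:
    """Build the two generation requests for one synthetic preference pair.

    Illustrative templates only: send each string to whichever generator
    model you use and store the two replies as chosen and rejected.
    """
    chosen_request = (
        "Answer the following prompt as well as you can.\n\n"
        f"Prompt: {prompt}"
    )
    rejected_request = (
        "Answer the following prompt, but introduce exactly one flaw: "
        f"{failure_mode}. Keep everything else as good as a strong answer.\n\n"
        f"Prompt: {prompt}"
    )
    return {"chosen_request": chosen_request, "rejected_request": rejected_request}
```

For example, `build_pair_requests("Explain the difference between RAG and fine-tuning.", "the answer is far too verbose")` returns both requests for a pair that targets verbosity.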

Method 3: Contrastive pairs from existing data

If you have an existing dataset with quality labels (thumbs up/down from production users, human ratings), pair a high-rated and a low-rated response to the same prompt and use them as the (chosen, rejected) sides of a DPO example.
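
A rough sketch of that conversion for thumbs up/down logs. The row schema (`prompt`, `response`, `rating` with +1 or -1) is an assumption for illustration; adapt it to whatever your logging actually stores.

```python
from collections import defaultdict


def logs_to_pairs(feedback_rows: list[dict]) -> list[dict]:
    """Pair liked and disliked responses that share the same prompt.

    Each row is assumed to look like:
    {"prompt": str, "response": str, "rating": +1 or -1}
    Rows with rating 0 are ignored.
    """
    liked = defaultdict(list)
    disliked = defaultdict(list)
    for row in feedback_rows:
        if row["rating"] > 0:
            liked[row["prompt"]].append(row["response"])
        elif row["rating"] < 0:
            disliked[row["prompt"]].append(row["response"])

    pairs = []
    for prompt, good_responses in liked.items():
        # Prompts with only positive or only negative feedback yield no pairs.
        for good in good_responses:
            for bad in disliked.get(prompt, []):
                pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs
```

Exact-match grouping on the prompt string is the simplest option; production prompts that differ only in whitespace or casing would need normalizing first.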

The most important DPO data quality rule: chosen and rejected responses should differ in exactly the dimension you want to improve. If they differ in multiple ways (quality AND length AND tone), the model learns a confounded signal. Make the contrast clean and specific.
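
Length is the confound that is easiest to measure. The sketch below reports how often chosen responses are longer than rejected ones and flags pairs with a large length gap. It counts whitespace-separated words, and the `max_ratio=1.5` threshold is an arbitrary assumption, not a recommended value.

```python
from statistics import mean


def length_bias_report(pairs: list[dict], max_ratio: float = 1.5) -> dict:
    """Summarize chosen/rejected length ratios across a preference dataset."""
    if not pairs:
        return {"mean_ratio": 0.0, "chosen_longer_fraction": 0.0, "flagged": []}

    ratios = []
    flagged = []
    for index, pair in enumerate(pairs):
        chosen_len = len(pair["chosen"].split())
        rejected_len = max(len(pair["rejected"].split()), 1)  # avoid dividing by zero
        ratio = chosen_len / rejected_len
        ratios.append(ratio)
        # Flag pairs where either side is much longer than the other.
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append(index)

    return {
        "mean_ratio": mean(ratios),
        "chosen_longer_fraction": sum(r > 1 for r in ratios) / len(ratios),
        "flagged": flagged,
    }
```

If `chosen_longer_fraction` is close to 1.0 and length is not the dimension you are targeting, the model may simply learn to write longer answers.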

DPO training configuration

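from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder paths (assumptions): point these at your SFT checkpoint and preference file
model_name = "path/to/your-sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

# LoRA keeps VRAM manageable: only the small adapter weights are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
    task_type="CAUSAL_LM",
)

dpo_config = DPOConfig(
    output_dir="dpo-output",     # placeholder checkpoint directory
    beta=0.1,                    # KL penalty strength; start here, tune 0.05–0.5
    learning_rate=5e-5,          # lower than a typical LoRA SFT rate; DPO is more sensitive
    num_train_epochs=1,          # DPO overfits quickly; rarely go above 2
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16 per device
    max_length=1024,             # prompt + response token limit
    max_prompt_length=512,       # prompt token limit; newer TRL releases may drop this argument
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # with peft_config, TRL uses the base model with adapters disabled as the frozen reference
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases name this argument `tokenizer`
    peft_config=lora_config,
)
trainer.train()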

The beta hyperparameter

Beta controls the KL penalty — how far the trained model is allowed to drift from the reference model. Low beta (0.05) = aggressive preference learning, higher risk of capability degradation. High beta (0.5) = conservative, stays close to the reference model but weaker preference signal. Start at 0.1 and adjust based on evaluation results.
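
To see where beta enters, here is the standard DPO loss for a single preference pair written in plain Python. It is a sketch for intuition, not the TRL implementation: it takes the summed log-probabilities of each response as plain floats, and the argument names are my own.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    # How much more (or less) likely each response is under the trained
    # model than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model equals the reference, the margin is 0 and the loss is log(2). Beta scales the margin, so with a high beta a small shift away from the reference is already enough to drive the loss down, while a low beta requires a much larger shift for the same reduction.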

Evaluating DPO-trained models

Win rate: generate responses to held-out prompts from both baseline and DPO model, have a judge (GPT-4 or human) pick the better response. Target >60% win rate vs. baseline.
Capability regression: run the DPO model on general benchmarks (MMLU, HumanEval). DPO should not significantly degrade general capability. If it does, beta is probably too low, or the learning rate or epoch count is too high.
Format/safety metrics: measure the specific failure modes you targeted with the preference data.
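
The win-rate check can be sketched as follows. LLM judges tend to favour whichever response appears first, so this version asks the judge twice with the order swapped and only counts a win when both verdicts agree. The `judge` callable is a hypothetical stand-in for your GPT-4 or human judging step; it is assumed to return "a", "b", or "tie".

```python
from typing import Callable


def win_rate(
    prompts: list[str],
    baseline_outputs: list[str],
    dpo_outputs: list[str],
    judge: Callable[[str, str, str], str],
) -> dict:
    """Compare DPO outputs against baseline outputs on held-out prompts.

    judge(prompt, response_a, response_b) must return "a", "b", or "tie".
    """
    wins = losses = ties = 0
    for prompt, base, tuned in zip(prompts, baseline_outputs, dpo_outputs):
        first = judge(prompt, tuned, base)   # DPO response shown as A
        second = judge(prompt, base, tuned)  # DPO response shown as B
        if first == "a" and second == "b":
            wins += 1
        elif first == "b" and second == "a":
            losses += 1
        else:
            ties += 1  # explicit tie, or verdict flipped with position

    decided = wins + losses
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate": wins / decided if decided else 0.0,
    }
```

Here the win rate is computed over decided comparisons only. Whether ties should count against the DPO model is a judgment call, so report the tie count alongside the rate.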

DPO models have a known tendency toward mode collapse if trained too aggressively or for too many epochs. The model learns to always produce responses that look like 'chosen' responses — but loses diversity and can fail on out-of-distribution prompts. Monitor response diversity alongside quality metrics.
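
One cheap diversity signal is distinct-n: the fraction of n-grams across a set of responses that are unique. The sketch below uses lowercased whitespace tokens rather than the model's tokenizer, which is a simplification.

```python
def distinct_n(responses: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across all responses (0.0 if there are none)."""
    total = 0
    unique = set()
    for text in responses:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Compute it on the same held-out prompts for the baseline and the DPO model. A sharp drop after training suggests the responses are converging on a single template, even if the win rate looks good.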

