Production Fine-Tuning Case Study: From Zero to Deployed Domain Expert
End-to-end walkthrough of fine-tuning a 7B model for a legal document extraction task. Dataset curation decisions, SFT vs DPO sequencing, evaluation harness setup, LoRA rank selection, quantization for serving, and what broke in production that didn't show up in eval.
The Task: Legal Clause Extraction at Scale
A legal tech company needed to extract 47 specific clause types from commercial contracts: NDAs, MSAs, and vendor agreements. GPT-4 reached 94% accuracy but cost $0.40/contract, which at their volume of 50,000 contracts/month comes to $20,000/month. The goal: match 94% accuracy at under $0.02/contract using a fine-tuned 7B model.
Stage 1: Dataset Curation (2 weeks)
The hardest part. They started with 1,200 human-annotated contracts from their legal team. Key decisions: (1) stratify by contract type — don't over-represent NDAs, (2) include negative examples — contracts where the clause is absent, (3) normalize output format strictly — JSON with null for absent clauses, not 'N/A' or empty string.
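Decision (3) is the easiest to automate. Below is a minimal sketch of a label normalizer, assuming annotations arrive as a Python dict per contract. The `ABSENT_MARKERS` set and the `normalize_label` name are illustrative, not the team's actual code.

```python
import json

# Hypothetical spellings annotators might use for "clause absent" (an assumption).
ABSENT_MARKERS = {"", "n/a", "na", "none", "null", "not present"}


def normalize_label(raw: dict, clause_types: list[str]) -> str:
    """Return a canonical JSON string with exactly one key per clause type."""
    normalized = {}
    for clause in clause_types:
        value = raw.get(clause)  # a missing key is treated as an absent clause
        if isinstance(value, str):
            value = value.strip()
            if value.lower() in ABSENT_MARKERS:
                value = None  # serialized as JSON null
        normalized[clause] = value
    # sort_keys keeps the key order identical across training examples
    return json.dumps(normalized, sort_keys=True)
```

Emitting every clause type on every example, including the absent ones, also gives the model the negative examples from decision (2) for free.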
Lesson learned: 200 carefully curated examples outperformed 2,000 auto-labeled examples from GPT-4. The auto-labeling introduced systematic errors on edge cases that the model then learned to replicate.
Stage 2: SFT with LoRA
Base model: Mistral 7B Instruct v0.2. LoRA config: rank=64, alpha=128, target modules=[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. Training: 3 epochs, batch size 4, gradient accumulation 8. Hardware: 2x A100 80GB. Training time: 6 hours.
# LoRA config that worked
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                # adapter rank
    lora_alpha=128,      # scaling factor (alpha / r = 2)
    # adapt every attention and MLP projection
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",         # leave bias terms frozen
    task_type="CAUSAL_LM",
)
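To get a feel for what rank 64 on all seven projections means, the trainable parameter count can be worked out by hand. This is a back-of-the-envelope sketch, assuming the published Mistral 7B dimensions (hidden size 4096, 8 key/value heads of dimension 128, MLP width 14336, 32 layers). Each adapted linear layer adds `rank * (in_features + out_features)` weights.

```python
# Mistral 7B dimensions, taken from the public model config (assumption:
# the Instruct v0.2 checkpoint uses the same shapes as the base model).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336
NUM_LAYERS = 32

# (in_features, out_features) for each targeted projection
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds an (in x rank) and a (rank x out) matrix per adapted layer."""
    per_layer = sum(rank * (fan_in + fan_out)
                    for fan_in, fan_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


print(f"{lora_trainable_params(64):,}")  # 167,772,160
```

That is roughly 2% of the base model's weights, which is why the adapter can train on two A100s while the 7B backbone stays frozen.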
Stage 3: DPO for Format Compliance
SFT got to 89% accuracy but produced malformed JSON 12% of the time: the model would add commentary, miss closing braces, or use inconsistent null representations. They ran one round of DPO with 400 preference pairs: chosen = valid JSON output, rejected = the malformed output from the same prompt. DPO raised format compliance from 88% to 99.2%.
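Building those pairs is mostly a filtering job. The sketch below assumes the "chosen" side comes from the human-annotated reference JSON (the write-up does not say where the valid outputs came from) and uses the `prompt` / `chosen` / `rejected` layout common to preference-tuning libraries. Function names are illustrative.

```python
import json


def is_valid_extraction(text: str) -> bool:
    """True only if the entire output parses as a single JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def build_preference_pairs(samples):
    """samples: iterable of (prompt, reference_json, model_output) tuples."""
    pairs = []
    for prompt, reference, output in samples:
        if is_valid_extraction(output):
            continue  # well-formed output, nothing to correct
        pairs.append({"prompt": prompt, "chosen": reference, "rejected": output})
    return pairs
```

Because `json.loads` rejects trailing commentary and unclosed braces alike, the same check doubles as the format-compliance metric reported above.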
Stage 4: Evaluation Harness
The evaluation harness compared extracted clause values against ground truth using: exact match for boolean fields, fuzzy string match (>0.85 similarity) for text fields, and structured diff for date/party fields. They held out 150 contracts as a test set — never used in training or DPO.
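A per-field scorer along those lines might look like the sketch below. Several details are assumptions: `difflib.SequenceMatcher` stands in for whatever similarity metric the team used, dates are assumed to be ISO strings, party fields are assumed to be lists of names, and the `field_types` mapping is hypothetical.

```python
from datetime import date
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # from the write-up; the metric itself is an assumption


def match_boolean(pred, gold) -> bool:
    # Exact match, and reject 1 == True style coincidences
    return type(pred) is type(gold) and pred == gold


def match_text(pred, gold) -> bool:
    if pred is None or gold is None:
        return pred is gold  # an absent clause must be predicted as absent
    ratio = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return ratio > FUZZY_THRESHOLD


def match_date(pred, gold) -> bool:
    if pred is None or gold is None:
        return pred is gold
    try:
        return date.fromisoformat(pred) == date.fromisoformat(gold)
    except (TypeError, ValueError):
        return False  # unparseable dates count as wrong


def match_parties(pred, gold) -> bool:
    if pred is None or gold is None:
        return pred is gold
    # Order-insensitive, case-insensitive comparison of party names
    return {p.strip().casefold() for p in pred} == {g.strip().casefold() for g in gold}


MATCHERS = {"boolean": match_boolean, "text": match_text,
            "date": match_date, "party": match_parties}


def score_contract(pred: dict, gold: dict, field_types: dict) -> float:
    """Fraction of fields in field_types that the prediction gets right."""
    correct = sum(MATCHERS[kind](pred.get(name), gold.get(name))
                  for name, kind in field_types.items())
    return correct / len(field_types)
```

Treating `None` as a value that must match exactly means hallucinated content for an absent clause is scored as an error rather than silently ignored.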
What Broke in Production
- Distribution shift: contracts from a new client used British English legal formatting — accuracy dropped to 81% until they added 50 UK-format examples and re-fine-tuned
- Token length: contracts longer than 8K tokens required chunking — the chunking strategy (where to split) mattered enormously for clause extraction accuracy
- Hallucinated clauses: the model occasionally invented clause content for absent fields. This was fixed by adding explicit 'clause is absent, return null' examples to the DPO preference set
- Inference cost: GPTQ 4-bit quantization maintained 93.1% accuracy (vs 94.2% for the fp16 model) at 60% of the fp16 serving cost
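On the token-length item: the write-up does not say which chunking strategy won, so the following is only one plausible approach, sketched under assumptions. It packs whole paragraphs greedily under a token budget so that a clause is never cut mid-sentence. The whitespace-based `count_tokens` default is a stand-in for the model's real tokenizer.

```python
def chunk_contract(text: str, max_tokens: int = 8000,
                   count_tokens=lambda s: len(s.split())) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    count_tokens is a hypothetical hook; the default counts whitespace-separated
    words and should be replaced with the serving model's tokenizer.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if this paragraph would overflow the budget
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    # Note: a single paragraph longer than max_tokens becomes its own oversized chunk
    return chunks
```

A real pipeline would likely split on section headings rather than blank lines and merge per-chunk extractions afterwards, since a clause type can appear in any chunk.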
Final Results
93.1% accuracy at $0.018/contract (vs $0.40 for GPT-4): a 22x cost reduction. Quantization cost 1.1pp relative to the fp16 fine-tuned model (94.2%), and the final number sits about 0.9pp below GPT-4's 94%, a tradeoff the legal team considered acceptable. The fine-tuned model ran on a single A100 with GPTQ quantization, serving 400 contracts/hour.
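The headline numbers are easy to sanity-check from the figures quoted in this write-up. The short calculation below uses only those figures; the monthly totals and GPU hours are derived, not reported by the team.

```python
CONTRACTS_PER_MONTH = 50_000
GPT4_COST = 0.40            # $ per contract
FINETUNED_COST = 0.018      # $ per contract
THROUGHPUT_PER_HOUR = 400   # contracts/hour on a single A100

gpt4_monthly = CONTRACTS_PER_MONTH * GPT4_COST
finetuned_monthly = CONTRACTS_PER_MONTH * FINETUNED_COST
cost_ratio = GPT4_COST / FINETUNED_COST
gpu_hours = CONTRACTS_PER_MONTH / THROUGHPUT_PER_HOUR

print(f"GPT-4:      ${gpt4_monthly:,.0f}/month")       # GPT-4:      $20,000/month
print(f"Fine-tuned: ${finetuned_monthly:,.0f}/month")  # Fine-tuned: $900/month
print(f"Reduction:  {cost_ratio:.1f}x")                # Reduction:  22.2x
print(f"GPU time:   {gpu_hours:.0f} hours/month")      # GPU time:   125 hours/month
```

At 125 GPU hours a month, a single A100 covers the full volume with plenty of headroom for spikes.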