Writing PRDs for AI Features: A Framework for Product Managers
What makes AI PRDs different — uncertainty ranges, fallback behaviour, eval criteria, human-in-the-loop decisions, and what 'done' looks like for an AI feature.
AI features break traditional product specification. The core problem: traditional PRDs assume deterministic systems, while AI features are probabilistic. The "feature" is a statistical distribution of outputs, not a defined behaviour. This changes almost everything about how you write the spec.
What's different about AI PRDs
| Traditional PRD | AI PRD |
|---|---|
| "The feature does X" | "The feature does X in Y% of cases, degrades gracefully in Z%" |
| Success = function works | Success = eval metrics above threshold on golden test set |
| Bugs are binary (fixed/not fixed) | Quality is a continuous distribution that shifts with data |
| Rollback = revert code | Rollback = revert model or prompt version |
| "Done" is clear | "Done" requires ongoing monitoring and eval gates |
The AI PRD template
- Problem statement: what user need does this solve? What's the baseline (no AI) experience?
- AI approach: RAG / fine-tuning / prompting / agent — and why not the alternatives
- Input/output spec: what goes in, what comes out, what's the acceptable output distribution
- Evaluation criteria: specific metrics with thresholds (e.g. faithfulness ≥ 0.9, time to first token (TTFT) < 500 ms) that define "done"
- Failure modes: what does a bad output look like? How often is each kind of failure acceptable, and which kinds are never acceptable?
- Fallback behaviour: what happens when the model fails, is slow, or returns low-confidence output
- Human-in-the-loop: which decisions require human approval? What escalation paths exist?
- Data requirements: what data is needed for eval? For fine-tuning? Who labels it?
- Monitoring plan: what signals indicate degradation post-launch? Who owns the alert?
The most important section most AI PRDs are missing: fallback behaviour. What does the user experience when the model fails? "Show an error message" is not an answer. Good AI PMs design the failure path as carefully as the success path.
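One way to make the failure path concrete in a spec is to write down which condition triggers which fallback. The sketch below is illustrative only: `call_model`, `baseline`, the 3-second timeout, and the 0.7 confidence floor are hypothetical names and values, not part of any particular library or product. It assumes the model call returns a `(text, confidence)` pair and that the baseline is the existing no-AI experience.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Reply:
    text: str
    source: str  # "model", or the name of the fallback path that produced the text


def answer_with_fallback(
    call_model: Callable[[str], Tuple[str, float]],  # hypothetical: returns (text, confidence)
    baseline: Callable[[str], str],                  # hypothetical: the no-AI experience
    query: str,
    timeout_s: float = 3.0,       # assumed latency budget
    min_confidence: float = 0.7,  # assumed confidence floor
) -> Reply:
    """Return the model's answer, or the baseline when the model fails,
    is too slow, or reports low confidence."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_model, query)
    try:
        text, confidence = future.result(timeout=timeout_s)
    except FutureTimeout:
        return Reply(baseline(query), "fallback_timeout")
    except Exception:
        return Reply(baseline(query), "fallback_error")
    finally:
        # Do not block the user on a slow call; the worker finishes in the background.
        pool.shutdown(wait=False)
    if confidence < min_confidence:
        return Reply(baseline(query), "fallback_low_confidence")
    return Reply(text, "model")
```

Recording `source` on every reply means the same data can later feed the monitoring plan, for example by tracking how often each fallback path fires.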
Writing evaluation criteria
Eval criteria must be specific, measurable, and agreed on before engineering starts. Vague criteria like "responses should be accurate" cause scope disputes at launch. Good criteria look like:
- Faithfulness score ≥ 0.90 on the golden test set (measured by RAGAS)
- P99 end-to-end latency ≤ 3,000ms under 100 concurrent users
- Hallucination rate ≤ 2% on the held-out evaluation set (outputs flagged by a natural language inference, NLI, check)
- Human preference rate ≥ 70% over the baseline (no-AI) experience in A/B test
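Criteria like these are only unambiguous if the spec also fixes how each number is computed. As a minimal sketch, assuming per-request latencies and per-output boolean flags have already been collected, the two helpers below use the nearest-rank definition of a percentile (one of several common definitions) and a plain fraction for rates. The function names are illustrative, not taken from any evaluation library.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest observed value such that at
    least pct% of observations are less than or equal to it."""
    if not values:
        raise ValueError("cannot take a percentile of an empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rate(flags: Sequence[bool]) -> float:
    """Fraction of items flagged True, e.g. outputs flagged as hallucinations."""
    if not flags:
        raise ValueError("cannot compute a rate over an empty sample")
    return sum(flags) / len(flags)
```

With these, the latency criterion reads `percentile_nearest_rank(latencies_ms, 99) <= 3000` and the hallucination criterion reads `rate(hallucination_flags) <= 0.02`, where `latencies_ms` and `hallucination_flags` stand in for whatever the eval pipeline records.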
The AI launch gate
Define a binary launch gate: a set of criteria that must all pass before the feature ships. This replaces intuition-based "looks good to me" sign-offs with objective thresholds. The eval pipeline runs automatically and blocks launch if any criterion fails.
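A launch gate of this kind can be sketched as a list of thresholds plus a check that reports every failing criterion. The metric names, the `Criterion` type, and the `launch_gate` function below are illustrative assumptions; the threshold values mirror the example criteria in the previous section. A missing measurement is treated as a failure, on the assumption that an unmeasured criterion should never pass by default.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    metric: str
    op: str          # ">=" or "<="
    threshold: float


_OPS = {">=": operator.ge, "<=": operator.le}

# Thresholds taken from the example criteria above; metric names are hypothetical.
LAUNCH_CRITERIA = [
    Criterion("faithfulness", ">=", 0.90),
    Criterion("p99_latency_ms", "<=", 3000),
    Criterion("hallucination_rate", "<=", 0.02),
    Criterion("human_preference_rate", ">=", 0.70),
]


def launch_gate(
    criteria: List[Criterion], measured: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Return (passed, failures). The gate passes only if every criterion is met."""
    failures = []
    for c in criteria:
        if c.metric not in measured:
            failures.append(f"{c.metric}: missing measurement")
            continue
        value = measured[c.metric]
        if not _OPS[c.op](value, c.threshold):
            failures.append(f"{c.metric}: {value} fails {c.op} {c.threshold}")
    return (not failures, failures)
```

In a CI job, the eval pipeline would call `launch_gate(LAUNCH_CRITERIA, measured)` and exit non-zero when `passed` is false, printing `failures` so the sign-off discussion starts from specific numbers.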