The Incident Room: How to Respond to LLM Production Failures
A playbook for LLM incidents: how to triage, isolate, mitigate, and run a post-mortem, and what makes AI incidents different from traditional software incidents.
It's 11pm. Your on-call phone buzzes. The AI feature is producing wrong answers at scale. Users are seeing it. You have three engineers in a Slack thread and no runbook.
This post is the runbook you should have written before that moment arrived. AI production incidents are different from regular software incidents — they're probabilistic, hard to reproduce, and often don't have a clear fix. But the process for handling them can be prepared in advance.
Step 1: Contain immediately
Before you understand the problem, reduce the blast radius. In your first 10 minutes, disable the AI feature or route to a fallback: cached responses, a simpler model, or a plain 'sorry, this feature is temporarily unavailable' message. User-facing wrong answers are worse than no answers. Don't debug with users watching.
The most common incident mistake: spending the first hour trying to understand *why* instead of containing *what*. Contain first. Understand second.
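To make the containment path concrete, here is a minimal sketch of a kill switch in front of the model call. Everything in it is an assumption for illustration: `AIFeatureGate`, `known_good_cache`, and `FALLBACK_MESSAGE` are hypothetical names, and in a real system the `enabled` flag would live in whatever feature-flag store you already use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical static message shown when no better fallback exists.
FALLBACK_MESSAGE = "Sorry, this feature is temporarily unavailable."


@dataclass
class AIFeatureGate:
    """Kill switch in front of a model call (illustrative helper, not a real library)."""

    call_model: Callable[[str], str]
    enabled: bool = True
    # Answers vetted before the incident; never written to while serving traffic,
    # so bad outputs produced during the incident cannot leak into the fallback.
    known_good_cache: Dict[str, str] = field(default_factory=dict)

    def answer(self, query: str) -> str:
        if self.enabled:
            return self.call_model(query)
        # Feature disabled: prefer a vetted cached answer, else the static message.
        return self.known_good_cache.get(query, FALLBACK_MESSAGE)


# Usage: flip the flag first, investigate afterwards.
gate = AIFeatureGate(
    call_model=lambda q: "live answer for: " + q,  # stand-in for the real model call
    known_good_cache={"opening hours": "We are open 9 to 5."},
)
gate.enabled = False
print(gate.answer("opening hours"))
print(gate.answer("something else"))
```

The point of the design is that disabling is a single state change with no deploy, so whoever is on call can do it in the first 10 minutes.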
Step 2: Characterise the failure
- When did it start? (Check your monitoring — did something change around that time?)
- What fraction of requests are affected? (10%? 100%? A specific user segment?)
- What does the failure look like? (Wrong answers? Errors? Refusals? Latency spikes?)
- Is it reproducible? (Pick 3 failing examples and try to reproduce them manually; run each more than once, since the same input can produce different outputs)
- What changed recently? (Prompt update? Model version? RAG index refresh? New traffic pattern?)
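The first two questions can often be answered straight from request logs. The sketch below assumes each log record is a dict with `timestamp`, `segment`, and `failed` keys; that schema and both function names are hypothetical, so adapt them to whatever your tracing actually records.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Optional


def failure_rate_by_segment(records: Iterable[dict]) -> Dict[str, float]:
    """Fraction of failed requests per user segment."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["segment"]] += 1
        if record["failed"]:
            failures[record["segment"]] += 1
    return {segment: failures[segment] / totals[segment] for segment in totals}


def first_failure_time(records: Iterable[dict]) -> Optional[datetime]:
    """Earliest timestamp among failed requests, or None if nothing failed."""
    failed_times = [r["timestamp"] for r in records if r["failed"]]
    return min(failed_times) if failed_times else None


# Made-up example records, for illustration only.
logs = [
    {"timestamp": datetime(2024, 1, 1, 22, 0), "segment": "free", "failed": False},
    {"timestamp": datetime(2024, 1, 1, 22, 30), "segment": "free", "failed": True},
    {"timestamp": datetime(2024, 1, 1, 22, 45), "segment": "pro", "failed": False},
]
print(failure_rate_by_segment(logs))
print(first_failure_time(logs))
```

A rate that is high in one segment and near zero elsewhere points at something specific to that segment (a prompt variant, a document collection, a traffic pattern) rather than a global model or provider problem.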
Step 3: The AI incident decision tree
| Symptom | Most likely cause | First investigation step |
|---|---|---|
| Wrong answers, consistent pattern | Prompt regression or model change | Roll back last prompt change; check if model version changed |
| Wrong answers, random subset | Retrieval quality degradation | Check retrieved chunks for the failing queries — are they relevant? |
| Stale/outdated information | Document index not updated | Check last index update time; verify source document dates |
| Increased refusals | Model safety update or prompt trigger | Test affected prompts directly; check if model version changed |
| Latency spike | Provider issue, rate limits, or context growth | Check provider status page; check average context length |
| Errors / exceptions | API change, schema mismatch, or rate limit | Check API changelog; look at error codes in traces |
| Hallucinations on specific topics | Training data gap or retrieval miss | Check if those topics are covered in your knowledge base |
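If you want the table available from a chat bot or CLI during an incident, it can be encoded as a plain lookup. This is only a sketch: the symptom keys and the `first_step` function are hypothetical names, and the text is copied from the table above rather than adding any new diagnosis logic.

```python
from typing import Dict, Tuple

# symptom key -> (most likely cause, first investigation step), taken from the table.
TRIAGE: Dict[str, Tuple[str, str]] = {
    "wrong_answers_consistent": (
        "Prompt regression or model change",
        "Roll back last prompt change; check if model version changed",
    ),
    "wrong_answers_random": (
        "Retrieval quality degradation",
        "Check retrieved chunks for the failing queries",
    ),
    "stale_information": (
        "Document index not updated",
        "Check last index update time; verify source document dates",
    ),
    "increased_refusals": (
        "Model safety update or prompt trigger",
        "Test affected prompts directly; check if model version changed",
    ),
    "latency_spike": (
        "Provider issue, rate limits, or context growth",
        "Check provider status page; check average context length",
    ),
    "errors": (
        "API change, schema mismatch, or rate limit",
        "Check API changelog; look at error codes in traces",
    ),
    "topic_hallucinations": (
        "Training data gap or retrieval miss",
        "Check if those topics are covered in your knowledge base",
    ),
}


def first_step(symptom: str) -> str:
    """Return the first investigation step for a known symptom key."""
    if symptom not in TRIAGE:
        return "No match in the table: go back to Step 2 and characterise further."
    cause, step = TRIAGE[symptom]
    return f"{step} (likely cause: {cause})"


print(first_step("latency_spike"))
```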
Step 4: Fix or mitigate
The fix depends on the cause. But in an active incident, your goal is not the perfect fix — it's the fastest safe mitigation. Acceptable mitigations: roll back the prompt to the previous version (this is why you version prompts), switch to a backup model, disable the specific feature or query type that's failing, or add a disclaimer to affected responses.
The proper fix happens after the incident is contained. Don't try to fix root cause while users are affected.
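Rolling back a prompt is only fast if earlier versions are stored somewhere you can switch to without a deploy. A minimal in-memory sketch of that idea follows; `PromptRegistry` is a hypothetical class, and a real setup would more likely keep versions in git or a config store.

```python
from typing import List


class PromptRegistry:
    """Keeps every published prompt so the previous one can be restored (illustrative only)."""

    def __init__(self, initial_prompt: str) -> None:
        self._versions: List[str] = [initial_prompt]
        self._active_index = 0

    @property
    def active(self) -> str:
        return self._versions[self._active_index]

    @property
    def active_version(self) -> int:
        # Versions are numbered from 1 for readability.
        return self._active_index + 1

    def publish(self, prompt: str) -> int:
        """Store a new prompt, make it active, and return its version number."""
        self._versions.append(prompt)
        self._active_index = len(self._versions) - 1
        return self.active_version

    def rollback(self) -> int:
        """Reactivate the version before the current one and return its number."""
        if self._active_index == 0:
            raise RuntimeError("No earlier prompt version to roll back to")
        self._active_index -= 1
        return self.active_version


registry = PromptRegistry("You are a helpful support assistant.")
registry.publish("You are a helpful support assistant. Always answer in one sentence.")
registry.rollback()  # incident mitigation: back to version 1
print(registry.active_version, registry.active)
```

Note that rollback keeps the bad version in history instead of deleting it, which is what you want for the post-mortem.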
Step 5: Post-mortem
Every AI incident deserves a blameless post-mortem within 48 hours. The five questions: What happened? When did it start? Why didn't we catch it before it hit users? What made it hard to diagnose? What changes prevent recurrence? The last question is the only one that matters for next time.
The most valuable post-mortem output is not the root cause analysis — it's the new eval example. Every production incident should produce at least one new eval case that would have caught it. Your eval set should grow every time you have an incident.
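One way to turn an incident into an eval case is to record the failing input, the bad output, and a check that a correct answer must satisfy. The sketch below uses simple substring checks; the `EvalCase` fields, the `passes` and `to_jsonl` helpers, and the refund example are all hypothetical, and many teams would use a judge model instead of string matching.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvalCase:
    incident_id: str
    input: str
    bad_output: str              # what production actually returned
    must_contain: List[str]      # phrases a correct answer needs
    must_not_contain: List[str]  # phrases that mark the known failure


def passes(case: EvalCase, output: str) -> bool:
    """Case-insensitive substring check against the recorded expectations."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def to_jsonl(case: EvalCase) -> str:
    """Serialise the case as one JSON line, ready to append to an eval file."""
    return json.dumps(asdict(case))


# Made-up incident, for illustration only.
case = EvalCase(
    incident_id="INC-001",
    input="What is the refund window?",
    bad_output="Refunds are accepted within 90 days.",
    must_contain=["30 days"],
    must_not_contain=["90 days"],
)
print(passes(case, case.bad_output))
print(passes(case, "Refunds are accepted within 30 days."))
```

A useful sanity check when adding a case: it should fail on the output that caused the incident and pass on the corrected one.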
Build the runbook before you need it
- Document your fallback path: exactly what to do when the AI feature goes down (who approves the disable? where's the flag?)
- Identify your top 5 most likely failure modes and write detection + mitigation steps for each
- Set up alerts for: error rate, p99 latency, model output quality (a judge scoring a sample of responses), cost anomalies
- Run a fire drill: before launch, simulate an incident and walk through the response process
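As a starting point for the alerts item in the checklist above, here is a rough sketch of threshold checks over one monitoring window. The `Window` fields, the alert names, and every number in `THRESHOLDS` are assumptions for illustration, not recommended values; tune them against your own baseline.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical thresholds; replace with values derived from your own traffic.
THRESHOLDS: Dict[str, float] = {
    "error_rate": 0.02,
    "p99_latency_ms": 5000,
    "min_judge_score": 0.8,
    "max_cost_usd": 50.0,
}


@dataclass
class Window:
    latencies_ms: Sequence[float]
    errors: int
    requests: int
    judge_scores: Sequence[float]  # 0..1 scores from a judge on sampled responses
    cost_usd: float


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile."""
    if not values:
        raise ValueError("p99 of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def fired_alerts(w: Window, t: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of all alerts whose threshold is crossed in this window."""
    alerts: List[str] = []
    if w.requests and w.errors / w.requests > t["error_rate"]:
        alerts.append("error_rate")
    if w.latencies_ms and p99(w.latencies_ms) > t["p99_latency_ms"]:
        alerts.append("p99_latency")
    if w.judge_scores and sum(w.judge_scores) / len(w.judge_scores) < t["min_judge_score"]:
        alerts.append("output_quality")
    if w.cost_usd > t["max_cost_usd"]:
        alerts.append("cost")
    return alerts


window = Window(latencies_ms=[800, 9000], errors=10, requests=100,
                judge_scores=[0.5, 0.6], cost_usd=75.0)
print(fired_alerts(window))
```

The output-quality check is the one traditional monitoring lacks: the service can return HTTP 200 with low latency and still be producing wrong answers.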