
The Incident Room: How to Respond to LLM Production Failures

A playbook for LLM incidents: how to triage, isolate, mitigate, and run a post-mortem, and what makes AI incidents different from traditional software incidents.

It's 11pm. Your on-call phone buzzes. The AI feature is producing wrong answers at scale. Users are seeing it. You have three engineers in a Slack thread and no runbook.

This post is the runbook you should have written before that moment arrived. AI production incidents are different from regular software incidents — they're probabilistic, hard to reproduce, and often don't have a clear fix. But the process for handling them can be prepared in advance.

Step 1: Contain immediately

Before you understand the problem, reduce the blast radius. In your first 10 minutes, disable the AI feature or route traffic to a fallback: cached responses, a simpler model, or a 'sorry, this feature is temporarily unavailable' message. User-facing wrong answers are worse than no answers. Don't debug with users watching.

The most common incident mistake: spending the first hour trying to understand *why* instead of containing *what*. Contain first. Understand second.
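
As a minimal sketch of the containment step, assuming a simple in-process flag: the names `KillSwitch`, `answer`, and `FALLBACK_MESSAGE` are hypothetical, and a real system would more likely read the flag from a feature-flag or config service so that it can be flipped without a deploy.

```python
from typing import Callable, Dict, Optional

# Hypothetical fallback text, taken from the example wording above.
FALLBACK_MESSAGE = "Sorry, this feature is temporarily unavailable."


class KillSwitch:
    """In-memory on/off flag for the AI feature (illustrative only)."""

    def __init__(self) -> None:
        self.ai_enabled = True

    def disable(self) -> None:
        self.ai_enabled = False


def answer(
    query: str,
    switch: KillSwitch,
    call_model: Callable[[str], str],
    cache: Optional[Dict[str, str]] = None,
) -> str:
    """Return the model's answer, or a fallback while the feature is disabled."""
    if switch.ai_enabled:
        return call_model(query)
    # Feature is contained: prefer a cached response, else the static message.
    if cache and query in cache:
        return cache[query]
    return FALLBACK_MESSAGE
```

Once `switch.disable()` is called, no request reaches the model, so debugging can proceed without users seeing new wrong answers.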

Step 2: Characterise the failure

Step 3: The AI incident decision tree

| Symptom | Most likely cause | First investigation step |
| --- | --- | --- |
| Wrong answers, consistent pattern | Prompt regression or model change | Roll back last prompt change; check if model version changed |
| Wrong answers, random subset | Retrieval quality degradation | Check retrieved chunks for the failing queries: are they relevant? |
| Stale/outdated information | Document index not updated | Check last index update time; verify source document dates |
| Increased refusals | Model safety update or prompt trigger | Test affected prompts directly; check if model version changed |
| Latency spike | Provider issue, rate limits, or context growth | Check provider status page; check average context length |
| Errors / exceptions | API change, schema mismatch, or rate limit | Check API changelog; look at error codes in traces |
| Hallucinations on specific topics | Training data gap or retrieval miss | Check if those topics are covered in your knowledge base |
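
One way to make the table usable at 11pm is to keep it as data next to the on-call tooling. The sketch below encodes the rows above as a lookup; the symptom keys and the `Triage` and `triage` names are made up for illustration, while the causes and first steps are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    likely_cause: str
    first_step: str


# Keys are hypothetical short names for the symptoms in the table.
DECISION_TREE = {
    "wrong_answers_consistent": Triage(
        "Prompt regression or model change",
        "Roll back last prompt change; check if model version changed",
    ),
    "wrong_answers_random": Triage(
        "Retrieval quality degradation",
        "Check retrieved chunks for the failing queries",
    ),
    "stale_information": Triage(
        "Document index not updated",
        "Check last index update time; verify source document dates",
    ),
    "increased_refusals": Triage(
        "Model safety update or prompt trigger",
        "Test affected prompts directly; check if model version changed",
    ),
    "latency_spike": Triage(
        "Provider issue, rate limits, or context growth",
        "Check provider status page; check average context length",
    ),
    "errors": Triage(
        "API change, schema mismatch, or rate limit",
        "Check API changelog; look at error codes in traces",
    ),
    "topic_hallucinations": Triage(
        "Training data gap or retrieval miss",
        "Check if those topics are covered in your knowledge base",
    ),
}


def triage(symptom: str) -> Triage:
    """Look up the most likely cause and first step for a known symptom."""
    try:
        return DECISION_TREE[symptom]
    except KeyError:
        known = ", ".join(sorted(DECISION_TREE))
        raise ValueError(f"Unknown symptom {symptom!r}; known: {known}") from None
```

For example, `triage("latency_spike").first_step` returns the provider-status and context-length check from the table.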

Step 4: Fix or mitigate

The fix depends on the cause. But in an active incident, your goal is not the perfect fix — it's the fastest safe mitigation. Acceptable mitigations: roll back the prompt to the previous version (this is why you version prompts), switch to a backup model, disable the specific feature or query type that's failing, add a disclaimer to affected responses.

The proper fix happens after the incident is contained. Don't try to fix root cause while users are affected.
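
Rolling back a prompt is only fast if earlier versions are kept somewhere addressable. Here is a minimal sketch, assuming prompts live in an in-memory registry; `PromptRegistry` is a hypothetical class, and in practice the history might sit in version control or a database.

```python
from typing import List


class PromptRegistry:
    """Keeps every published prompt so the active one can be rolled back."""

    def __init__(self) -> None:
        self._versions: List[str] = []
        self._active = -1  # index of the active version; -1 means none yet

    def publish(self, template: str) -> int:
        """Store a new prompt version and make it active. Returns its index."""
        self._versions.append(template)
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self) -> int:
        """Make the previous version active. History is never deleted."""
        if self._active <= 0:
            raise RuntimeError("No earlier prompt version to roll back to")
        self._active -= 1
        return self._active

    @property
    def active(self) -> str:
        if self._active < 0:
            raise RuntimeError("No prompt has been published")
        return self._versions[self._active]


# Usage: publish two versions, then roll back during an incident.
registry = PromptRegistry()
registry.publish("You are a support assistant. Answer from the context.")
registry.publish("You are a support assistant. Be brief.")
registry.rollback()  # the first prompt is active again
```

Keeping the bad version in the history, rather than deleting it, means it is still available for the post-incident investigation.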

Step 5: Post-mortem

Every AI incident deserves a blameless post-mortem within 48 hours. The five questions: What happened? When did it start? Why didn't we catch it before it hit users? What made it hard to diagnose? What changes prevent recurrence? The last question is the only one that matters for next time.

The most valuable post-mortem output is not the root cause analysis — it's the new eval example. Every production incident should produce at least one new eval case that would have caught it. Your eval set should grow every time you have an incident.
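
Turning an incident into an eval case can be as small as appending one record to a file and re-running the set. The sketch below assumes a JSONL eval file and a plain substring check; `EvalCase`, `append_case`, `load_cases`, and `run_evals` are hypothetical names, and real evals often need a richer grader than `must_contain`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    incident_id: str   # which incident produced this case
    query: str         # the input that failed in production
    must_contain: str  # substring a correct answer has to include


def append_case(path: str, case: EvalCase) -> None:
    """Add one case to the JSONL eval set, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")


def load_cases(path: str) -> List[EvalCase]:
    """Read every non-empty line of the JSONL file back into EvalCase objects."""
    with open(path, encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]


def run_evals(cases: List[EvalCase], call_model: Callable[[str], str]) -> List[EvalCase]:
    """Return the cases whose answer lacks the required substring."""
    return [
        case
        for case in cases
        if case.must_contain.lower() not in call_model(case.query).lower()
    ]
```

An empty list from `run_evals` means every past incident's case passes against the current prompt and model; any returned case is a regression to investigate before shipping.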

Build the runbook before you need it

LLM observability setup: Configure monitoring and alerting for AI production systems.
