
The Incident Room: How to Respond to LLM Production Failures

A playbook for LLM incidents: how to triage, isolate, mitigate, and run a post-mortem, and what makes AI incidents different from traditional software incidents.

It's 11pm. Your on-call phone buzzes. The AI feature is producing wrong answers at scale. Users are seeing it. You have three engineers in a Slack thread and no runbook.

This post is the runbook you should have written before that moment arrived. AI production incidents are different from regular software incidents — they're probabilistic, hard to reproduce, and often don't have a clear fix. But the process for handling them can be prepared in advance.

Step 1: Contain immediately

Before you understand the problem, reduce the blast radius. In your first 10 minutes, disable the AI feature or route to a fallback: cached responses, a simpler model, or a 'sorry, this feature is temporarily unavailable' message. Wrong answers shown to users are worse than no answers. Don't debug with users watching.

The most common incident mistake: spending the first hour trying to understand *why* instead of containing *what*. Contain first. Understand second.
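As a rough illustration of the containment path, here is a minimal sketch of a kill switch with a fallback chain. The names `answer`, `ai_enabled`, `call_model`, and `cache` are hypothetical; in a real system the flag would probably live in your feature-flag service rather than being passed as an argument.

```python
from typing import Callable, Dict

FALLBACK_MESSAGE = "Sorry, this feature is temporarily unavailable."


def answer(
    query: str,
    ai_enabled: bool,
    call_model: Callable[[str], str],
    cache: Dict[str, str],
) -> str:
    """Return an answer, preferring safe fallbacks when the AI path is off or fails."""
    if ai_enabled:
        try:
            return call_model(query)
        except Exception:
            # Fall through to the fallback path instead of surfacing the error.
            pass
    # Fallback order: a cached response if one exists, otherwise a static notice.
    return cache.get(query, FALLBACK_MESSAGE)
```

With `ai_enabled=False`, the model is never called, so flipping one flag is enough to stop new wrong answers from reaching users.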

Step 2: Characterise the failure

When did it start? (Check your monitoring — did something change around that time?)
What fraction of requests are affected? (10%? 100%? A specific user segment?)
What does the failure look like? (Wrong answers? Errors? Refusals? Latency spikes?)
Is it reproducible? (Pick 3 failing examples and try to reproduce them manually)
What changed recently? (Prompt update? Model version? RAG index refresh? New traffic pattern?)
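Two of these questions (when it started and what fraction is affected) can often be answered straight from request logs. The sketch below assumes each log record is a dict with hypothetical `ts`, `segment`, and `failed` fields; your own schema, and how you decide a request "failed", will differ.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional


def failure_rate_by(records: Iterable[Dict[str, Any]], key: str) -> Dict[Any, float]:
    """Fraction of failed requests for each distinct value of `key`."""
    totals: Dict[Any, int] = defaultdict(int)
    failures: Dict[Any, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["failed"]:
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


def first_failure_time(records: Iterable[Dict[str, Any]]) -> Optional[float]:
    """Timestamp of the earliest failed request, or None if nothing failed."""
    times = [record["ts"] for record in records if record["failed"]]
    return min(times) if times else None
```

Grouping by segment, model version, or prompt version tends to show quickly whether the failure is global or confined to one slice of traffic.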

Step 3: The AI incident decision tree

Symptom	Most likely cause	First investigation step
Wrong answers, consistent pattern	Prompt regression or model change	Roll back last prompt change; check if model version changed
Wrong answers, random subset	Retrieval quality degradation	Check retrieved chunks for the failing queries — are they relevant?
Stale/outdated information	Document index not updated	Check last index update time; verify source document dates
Increased refusals	Model safety update or prompt trigger	Test affected prompts directly; check if model version changed
Latency spike	Provider issue, rate limits, or context growth	Check provider status page; check average context length
Errors / exceptions	API change, schema mismatch, or rate limit	Check API changelog; look at error codes in traces
Hallucinations on specific topics	Training data gap or retrieval miss	Check if those topics are covered in your knowledge base
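If you want the table above available to tooling (for example, a chat bot that answers "what do I check first?"), one option is to keep it as data. This is only a sketch: the symptom keys are hypothetical labels, and the causes and first steps are copied from the table.

```python
from typing import Dict, NamedTuple


class Triage(NamedTuple):
    likely_cause: str
    first_step: str


# One entry per table row; keys are made-up identifiers for each symptom.
DECISION_TREE: Dict[str, Triage] = {
    "wrong_answers_consistent": Triage(
        "Prompt regression or model change",
        "Roll back last prompt change; check if model version changed",
    ),
    "wrong_answers_random": Triage(
        "Retrieval quality degradation",
        "Check retrieved chunks for the failing queries",
    ),
    "stale_information": Triage(
        "Document index not updated",
        "Check last index update time; verify source document dates",
    ),
    "increased_refusals": Triage(
        "Model safety update or prompt trigger",
        "Test affected prompts directly; check if model version changed",
    ),
    "latency_spike": Triage(
        "Provider issue, rate limits, or context growth",
        "Check provider status page; check average context length",
    ),
    "errors": Triage(
        "API change, schema mismatch, or rate limit",
        "Check API changelog; look at error codes in traces",
    ),
    "topic_hallucinations": Triage(
        "Training data gap or retrieval miss",
        "Check if those topics are covered in your knowledge base",
    ),
}


def triage(symptom: str) -> Triage:
    """Look up the likely cause and first step for a known symptom key."""
    if symptom not in DECISION_TREE:
        raise ValueError(f"Unknown symptom {symptom!r}; known: {sorted(DECISION_TREE)}")
    return DECISION_TREE[symptom]
```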

Step 4: Fix or mitigate

The fix depends on the cause. But in an active incident, your goal is not the perfect fix but the fastest safe mitigation. Acceptable mitigations: roll back the prompt to the previous version (this is why you version prompts), switch to a backup model, disable the specific feature or query type that's failing, or add a disclaimer to affected responses.

The proper fix happens after the incident is contained. Don't try to fix root cause while users are affected.
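Prompt rollback is only fast if earlier versions are kept somewhere you can switch to. The following is a minimal in-memory sketch; `PromptRegistry` is a hypothetical class, and a real setup would more likely store versions in git or a database.

```python
from typing import List


class PromptRegistry:
    """Keeps every published prompt so the active one can be rolled back."""

    def __init__(self, initial: str) -> None:
        self._versions: List[str] = [initial]
        self._active = 0  # index of the version currently served

    @property
    def active(self) -> str:
        return self._versions[self._active]

    def publish(self, prompt: str) -> int:
        """Store a new prompt, make it active, and return its version number."""
        self._versions.append(prompt)
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self) -> int:
        """Re-activate the version just before the active one."""
        if self._active == 0:
            raise RuntimeError("No earlier prompt version to roll back to")
        self._active -= 1
        return self._active
```

Because old versions are never deleted, a rollback is a pointer change rather than a redeploy, which is what you want at 11pm.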

Step 5: Post-mortem

Every AI incident deserves a blameless post-mortem within 48 hours. The five questions: What happened? When did it start? Why didn't we catch it before it hit users? What made it hard to diagnose? What changes prevent recurrence? The last question is the only one that matters for next time.

The most valuable post-mortem output is not the root cause analysis — it's the new eval example. Every production incident should produce at least one new eval case that would have caught it. Your eval set should grow every time you have an incident.
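One lightweight way to make "every incident adds an eval case" a habit is to append cases to a JSONL file during the post-mortem. The sketch below is an assumption about format, not a standard: the field names (`incident_id`, `input`, `bad_output`, `expectation`) and the helper functions are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def add_eval_case(
    path: str,
    incident_id: str,
    user_input: str,
    bad_output: str,
    expectation: str,
) -> Dict[str, str]:
    """Append one regression case derived from an incident to a JSONL eval set."""
    case = {
        "incident_id": incident_id,
        "input": user_input,
        "bad_output": bad_output,    # what the system actually said
        "expectation": expectation,  # what a correct answer must satisfy
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return case


def load_eval_cases(path: str) -> List[Dict[str, str]]:
    """Read all eval cases back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the incident ID on each case makes it easy to trace a failing eval back to the post-mortem that produced it.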

Build the runbook before you need it

Document your fallback path: exactly what to do when the AI feature goes down (who approves the disable? where's the flag?)
Identify your top 5 most likely failure modes and write detection + mitigation steps for each
Set up alerts for: error rate, P99 latency, model output quality (via an LLM judge run on a sample of responses), cost anomalies
Run a fire drill: before launch, simulate an incident and walk through the response process
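The four alert signals in the list above can be sketched as a single check over a window of recent traffic. Everything here is illustrative: the function names, the `limits` keys, and the threshold values are assumptions, and in practice these rules would usually live in your monitoring system rather than in application code.

```python
from typing import Dict, List, Sequence


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("p99 requires at least one value")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def check_alerts(
    latencies_ms: Sequence[float],
    error_count: int,
    request_count: int,
    judge_scores: Sequence[float],
    cost_usd: float,
    baseline_cost_usd: float,
    limits: Dict[str, float],
) -> List[str]:
    """Return the names of the alerts that fire for one monitoring window."""
    alerts: List[str] = []
    if request_count and error_count / request_count > limits["max_error_rate"]:
        alerts.append("error_rate")
    if latencies_ms and p99(latencies_ms) > limits["max_p99_ms"]:
        alerts.append("latency_p99")
    # judge_scores: quality scores from a judge run on a sample of responses.
    if judge_scores and sum(judge_scores) / len(judge_scores) < limits["min_judge_score"]:
        alerts.append("output_quality")
    if baseline_cost_usd > 0 and cost_usd / baseline_cost_usd > limits["max_cost_ratio"]:
        alerts.append("cost_anomaly")
    return alerts
```

A fire drill is a good time to feed this kind of check synthetic bad data and confirm that each alert actually reaches a human.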
