LLMOps: What Production AI Actually Needs That Tutorials Skip
Observability, prompt versioning, latency budgets, cost tracking, model routers, A/B testing, and rollback strategies. The full production checklist.
Every production AI system needs the same set of infrastructure that tutorial content skips. This is the checklist. If you can't check every box, your system is not production-ready — it's a demo that's somehow in production.
Before you deploy
- ✓ Eval pipeline: offline evaluation on ≥100 golden examples with defined pass/fail thresholds
- ✓ Prompt versioning: prompts checked into version control, not hardcoded strings
- ✓ Cost estimate: monthly cost projection at expected QPS — reviewed and approved
- ✓ Latency SLA: P50 and P99 targets defined, measured in staging, not guessed
- ✓ Fallback path: clear degraded mode (simpler model, cached response, or graceful error)
- ✓ Rate limiting: per-user and per-session token limits to prevent runaway costs
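Two of the items above, the cost estimate and the per-user token limit, reduce to a few lines of arithmetic and bookkeeping. The sketch below is illustrative only: the function and class names, the 30-day month, and the prices in the usage note are assumptions, not figures from any provider.

```python
from dataclasses import dataclass, field

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def project_monthly_cost(qps, input_tokens, output_tokens,
                         usd_per_1k_input, usd_per_1k_output):
    """Projected monthly spend in USD at a steady request rate."""
    requests = qps * SECONDS_PER_MONTH
    cost_per_request = (input_tokens / 1000 * usd_per_1k_input
                        + output_tokens / 1000 * usd_per_1k_output)
    return requests * cost_per_request


@dataclass
class TokenBudget:
    """Per-user token cap for one budgeting window (hypothetical helper)."""
    limit: int
    used: dict = field(default_factory=dict)  # user_id -> tokens spent so far

    def try_spend(self, user_id: str, tokens: int) -> bool:
        spent = self.used.get(user_id, 0)
        if spent + tokens > self.limit:
            return False  # over budget: caller should take the fallback path
        self.used[user_id] = spent + tokens
        return True
```

For example, with placeholder prices of $0.001 per 1k input tokens and $0.002 per 1k output tokens, `project_monthly_cost(2, 1000, 500, 0.001, 0.002)` projects roughly $10,368 per month at 2 QPS. A real limiter would also reset `used` at the end of each window and persist it outside process memory.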
Observability (what to instrument)
- Every LLM call: trace ID, model, latency (TTFT + total), token counts, cost, feature, user ID
- Quality signals: thumbs up/down, explicit ratings, task completion flags
- Retrieval metrics (for RAG): chunks retrieved, reranker scores, context utilisation rate
- Agent metrics: steps per task, tool call distribution, success/failure/timeout rates
- Cost alerts: daily/monthly spend alerts at 50%, 80%, 100% of budget
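As a minimal sketch of the per-call instrumentation and the budget alerts listed above, the record below carries the fields named for every LLM call, and a small helper reports which of the 50%, 80%, and 100% thresholds were crossed between two spend readings. The names `LLMCallRecord`, `new_trace_id`, and `crossed_thresholds` are hypothetical, and where the JSON line is sent (log pipeline, tracing backend) is left open.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass
class LLMCallRecord:
    """One structured log entry per LLM call."""
    trace_id: str
    model: str
    feature: str
    user_id: str
    ttft_ms: float      # time to first token
    total_ms: float     # total request latency
    input_tokens: int
    output_tokens: int
    cost_usd: float

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def new_trace_id() -> str:
    return uuid.uuid4().hex


ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # fractions of budget, from the checklist


def crossed_thresholds(previous_spend, current_spend, budget):
    """Thresholds crossed between two spend readings, so each fires once."""
    return [t for t in ALERT_THRESHOLDS
            if previous_spend < t * budget <= current_spend]
```

Comparing against the previous reading, rather than only the current one, keeps an alert from firing again on every check once spend has passed a threshold.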
Prompt management
Prompts are code. They have versions, they cause regressions, and they need to be deployed safely. At minimum: store prompts in version control with semantic versioning, run your eval suite before promoting a new prompt version, and maintain the ability to roll back to a previous prompt in under 5 minutes.
The most common LLMOps failure: a well-intentioned prompt tweak that ships without running evals and degrades the model's behaviour on edge cases that weren't manually tested. Eval gates before promotion are non-negotiable.
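One way to make the eval gate and the fast rollback concrete is a small registry that refuses to promote a prompt version unless its eval pass rate clears a threshold, and that keeps the promotion history so the previous version can be restored in one call. This is a sketch under assumptions: `PromptRegistry`, its methods, and the 0.95 default threshold are hypothetical, and eval results are simplified to one boolean per golden example.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    min_pass_rate: float = 0.95                   # hypothetical threshold
    versions: dict = field(default_factory=dict)  # version -> prompt text
    history: list = field(default_factory=list)   # promoted versions, oldest first

    def register(self, version: str, text: str) -> None:
        self.versions[version] = text

    def promote(self, version: str, eval_results: list) -> bool:
        """Promote only if the eval pass rate meets the threshold."""
        if version not in self.versions:
            raise KeyError(version)
        if not eval_results:
            raise ValueError("no eval results; refusing to promote")
        pass_rate = sum(eval_results) / len(eval_results)
        if pass_rate < self.min_pass_rate:
            return False  # gate failed: the active version is unchanged
        self.history.append(version)
        return True

    def rollback(self) -> str:
        """Drop the active version and return the one now active."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def active(self):
        return self.history[-1] if self.history else None
```

Because promotion is impossible without eval results, a prompt tweak cannot reach production by skipping the suite, and rollback is a lookup rather than a redeploy.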
Ongoing operations
| Cadence | What to review |
|---|---|
| Daily | Cost vs. budget, error rate, P99 latency, flagged outputs |
| Weekly | Quality signal trends, eval score vs. baseline, top failure patterns |
| Monthly | Full eval suite run, prompt performance review, model upgrade consideration |
| Quarterly | RAG index freshness audit, eval set expansion, cost optimisation review |
We had no eval pipeline. We had no prompt versioning. We shipped. Costs went up 40% after a well-intentioned system prompt rewrite that nobody tested. Good intentions aren't a deployment strategy.
Model upgrade strategy
When a new model version drops, don't assume it's a drop-in replacement. Always run your full eval suite against the new model before promoting, compare on your tail distribution (not just average quality), and check latency and cost deltas. A model that's 10% better on average but 30% worse on your P99 tail is not an upgrade.
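The tail comparison can be sketched as a check on per-example eval scores: a candidate model counts as an upgrade only if its mean improves and its worst cases do not regress. The names `tail_mean` and `is_upgrade`, the choice of the worst 10% as the tail, and the zero-regression default are all assumptions for illustration; latency and cost deltas would need their own checks.

```python
import statistics


def tail_mean(scores, fraction=0.1):
    """Mean of the worst `fraction` of scores (at least one example)."""
    k = max(1, int(len(scores) * fraction))
    return statistics.mean(sorted(scores)[:k])


def is_upgrade(baseline, candidate, max_tail_regression=0.0):
    """True if the mean improves and the tail does not get worse."""
    mean_delta = statistics.mean(candidate) - statistics.mean(baseline)
    tail_delta = tail_mean(candidate) - tail_mean(baseline)
    return mean_delta > 0 and tail_delta >= -max_tail_regression
```

With a baseline of ten examples all scoring 0.8, a candidate scoring 1.0 on nine and 0.2 on one has the higher mean (0.92) but is rejected, because its worst case dropped from 0.8 to 0.2.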