AI Engineering

LLMOps: What Production AI Actually Needs That Tutorials Skip

Observability, prompt versioning, latency budgets, cost tracking, model routers, A/B testing, and rollback strategies. The full production checklist.

Every production AI system needs the same set of infrastructure that tutorial content skips. This is the checklist. If you can't check every box, your system is not production-ready — it's a demo that's somehow in production.

Before you deploy

✓ Eval pipeline: offline evaluation on ≥100 golden examples with defined pass/fail thresholds
✓ Prompt versioning: prompts checked into version control, not hardcoded strings
✓ Cost estimate: monthly cost projection at expected QPS — reviewed and approved
✓ Latency SLA: P50 and P99 targets defined, measured in staging, not guessed
✓ Fallback path: clear degraded mode (simpler model, cached response, or graceful error)
✓ Rate limiting: per-user and per-session token limits to prevent runaway costs
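
The rate-limiting item above can be sketched as a sliding-window token budget keyed by user or session ID. This is a minimal in-memory illustration, not a production limiter: the class name `TokenBudget`, its parameters, and the example limits are all assumptions, and a real deployment would likely keep this state in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class TokenBudget:
    """Sliding-window token limit per key (a user ID or a session ID)."""

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        # key -> deque of (timestamp, tokens) usage events, oldest first
        self._events = defaultdict(deque)

    def _used(self, key: str, now: float) -> int:
        events = self._events[key]
        # Drop usage that has aged out of the window.
        while events and now - events[0][0] >= self.window_seconds:
            events.popleft()
        return sum(tokens for _, tokens in events)

    def try_consume(self, key: str, tokens: int, now: Optional[float] = None) -> bool:
        """Record the usage and return True if it fits the budget, else return False."""
        now = time.monotonic() if now is None else now
        if self._used(key, now) + tokens > self.max_tokens:
            return False
        self._events[key].append((now, tokens))
        return True


# Hypothetical limit: 1,000 tokens per user per 60 seconds.
budget = TokenBudget(max_tokens=1000, window_seconds=60)
```

Calling `budget.try_consume(user_id, estimated_tokens)` before each request, and routing to the fallback path when it returns False, ties this item to the degraded-mode item above it.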

Observability (what to instrument)

Every LLM call: trace ID, model, latency (time to first token and total), input and output token counts, cost, calling feature, user ID
Quality signals: thumbs up/down, explicit ratings, task completion flags
Retrieval metrics (for RAG): chunks retrieved, reranker scores, context utilisation rate
Agent metrics: steps per task, tool call distribution, success/failure/timeout rates
Cost alerts: daily/monthly spend alerts at 50%, 80%, 100% of budget
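
As a rough sketch of the per-call record and the budget alerts listed above: the field names, the `PRICES` table, and the model names are hypothetical placeholders, not any provider's real rates or any tracing library's schema.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens as (input, output); use your provider's rates.
PRICES = {"small-model": (0.50, 1.50), "large-model": (5.00, 15.00)}


@dataclass
class LLMCallRecord:
    trace_id: str
    model: str
    feature: str        # which product feature made the call
    user_id: str
    ttft_ms: float      # time to first token
    total_ms: float     # total request latency
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000


def crossed_thresholds(spend_before, spend_after, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions crossed when spend moves from spend_before to spend_after."""
    return [t for t in thresholds if spend_before < t * budget <= spend_after]
```

Checking `crossed_thresholds` after each spend update means each alert fires once, at the moment the threshold is crossed, rather than on every call once spend is above it.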

Prompt management

Prompts are code. They have versions, they cause regressions, and they need to be deployed safely. At minimum: store prompts in version control with semantic versioning, run your eval suite before promoting a new prompt version, and maintain the ability to roll back to a previous prompt in under 5 minutes.

The most common LLMOps failure: a well-intentioned prompt tweak that ships without running evals and degrades the model's behaviour on edge cases that weren't manually tested. Eval gates before promotion are non-negotiable.
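
One way to picture an eval gate with rollback is the sketch below. It is an in-memory stand-in for prompts kept in version control; `PromptRegistry`, the `run_evals` callback, and the pass-rate threshold are illustrative assumptions rather than a reference to any particular tool.

```python
from typing import Callable, Dict, List, Optional


class PromptRegistry:
    """In-memory stand-in for versioned prompts with an eval gate on promotion."""

    def __init__(self, run_evals: Callable[[str], float], min_pass_rate: float):
        self.run_evals = run_evals            # returns a pass rate in [0, 1] for a prompt text
        self.min_pass_rate = min_pass_rate
        self.versions: Dict[str, str] = {}    # semantic version -> prompt text
        self.history: List[str] = []          # promoted versions, oldest first

    def register(self, version: str, text: str) -> None:
        if version in self.versions:
            raise ValueError(f"version {version} already exists; versions are immutable")
        self.versions[version] = text

    def promote(self, version: str) -> bool:
        """Promote only if the eval suite clears the threshold."""
        score = self.run_evals(self.versions[version])
        if score < self.min_pass_rate:
            return False
        self.history.append(version)
        return True

    @property
    def active(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def rollback(self) -> Optional[str]:
        """Revert to the previously promoted version without re-running evals."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier promoted version to roll back to")
        self.history.pop()
        return self.active
```

Rollback skips the eval run on purpose: the earlier version already passed the gate when it was promoted, which is what keeps the revert fast.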

Ongoing operations

Cadence	What to review
Daily	Cost vs. budget, error rate, P99 latency, flagged outputs
Weekly	Quality signal trends, eval score vs. baseline, top failure patterns
Monthly	Full eval suite run, prompt performance review, model upgrade consideration
Quarterly	RAG index freshness audit, eval set expansion, cost optimisation review

We had no eval pipeline. We had no prompt versioning. We shipped. Costs went up 40% after a well-intentioned system prompt rewrite that nobody tested. Good intentions aren't a deployment strategy.

Model upgrade strategy

When a new model version drops, don't assume it's a drop-in replacement. Always run your full eval suite against the new model before promoting, compare on your tail distribution (not just average quality), and check latency and cost deltas. A model that's 10% better on average but 30% worse on your P99 tail is not an upgrade.
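
A minimal sketch of the average-versus-tail comparison, assuming you have per-example eval scores (higher is better) for both models on the same golden set. The function names, the 10th-percentile choice for the tail, and the promotion rule are assumptions for illustration; the same shape applies to latency or cost deltas with the comparison direction flipped.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_models(baseline_scores, candidate_scores, tail_pct=10, max_tail_drop=0.0):
    """Compare per-example eval scores on the mean and on the low tail."""
    report = {
        "mean_delta": mean(candidate_scores) - mean(baseline_scores),
        "tail_delta": percentile(candidate_scores, tail_pct)
        - percentile(baseline_scores, tail_pct),
    }
    # Promote only if the average improves and the tail does not regress past the tolerance.
    report["promote"] = report["mean_delta"] > 0 and report["tail_delta"] >= -max_tail_drop
    return report
```

A candidate that scores higher on most examples but collapses on a few will show a positive `mean_delta` and a negative `tail_delta`, and the rule above declines to promote it.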
