LLM Observability: What to Log, Trace, and Alert On
Prompt/response logging, latency tracing, cost tracking, quality signals, and alert thresholds. What a production-grade LLM monitoring stack looks like.
You can't improve what you can't see. Traditional application monitoring tracks errors and latency. LLM observability tracks something harder: whether the model is doing the right thing. These are different problems and require different tools.
The four pillars of LLM observability
- Traces: the full prompt, response, tool calls, and step sequence for each request — your debugging foundation
- Metrics: latency (TTFT, total), token counts, cost, throughput — your operational dashboard
- Quality signals: feedback, ratings, hallucination rates, task completion — your model health
- Alerts: threshold-based triggers when latency spikes, error rates climb, or quality drops below SLA
What to log on every LLM call
{
  "trace_id": "tr_abc123",
  "timestamp": "2025-05-19T10:23:01Z",
  "model": "gpt-4o-mini",
  "prompt_tokens": 412,
  "completion_tokens": 88,
  "latency_ms": 820,
  "cost_usd": 0.00062,
  "user_id": "u_789",
  "session_id": "s_456",
  "feature": "rag_qa",
  "prompt_hash": "sha256:...",
  "response_hash": "sha256:...",
  "thumbs_up": null,
  "flagged": false
}
The prompt_hash and response_hash fields hold a hash rather than the full text, to save storage. The thumbs_up and flagged fields are quality signals, filled in asynchronously once user feedback arrives.
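A minimal sketch of how such a record might be assembled in Python. The helper names `sha256_tag` and `build_call_record` are hypothetical, and the prices in `PRICE_PER_1M_USD` are placeholder assumptions rather than any provider's published rates, so swap in your own.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Assumed prices in USD per 1M tokens; replace with your provider's actual rates.
PRICE_PER_1M_USD = {"gpt-4o-mini": {"prompt": 0.15, "completion": 0.60}}


def sha256_tag(text: str) -> str:
    """Return a 'sha256:<hex>' digest so the raw text need not be stored."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_call_record(model, prompt, response, prompt_tokens, completion_tokens,
                      latency_ms, user_id, session_id, feature):
    """Build one log record per LLM call, mirroring the fields listed above."""
    price = PRICE_PER_1M_USD[model]
    cost = (prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]) / 1_000_000
    return {
        "trace_id": "tr_" + uuid.uuid4().hex[:12],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(cost, 6),
        "user_id": user_id,
        "session_id": session_id,
        "feature": feature,
        "prompt_hash": sha256_tag(prompt),
        "response_hash": sha256_tag(response),
        # Quality signals start empty and are updated after user feedback.
        "thumbs_up": None,
        "flagged": False,
    }
```

Hashing keeps records small and lets you spot repeated prompts, but a hash cannot be reversed, so anything you need to replay for debugging still has to live in your trace store.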
Latency breakdown: what to measure
| Metric | What it measures | Typical SLA |
|---|---|---|
| TTFT | Time to first token — perceived responsiveness | < 500ms |
| TPS | Tokens per second — generation speed | > 30 tok/s |
| E2E latency | Total wall-clock time | < 3s for chat |
| Retrieval latency | Vector DB query time (RAG only) | < 100ms |
TTFT matters more than total latency for UX. Users tolerate slow generation if the first token appears quickly — it signals the response has started. Stream your responses and optimise TTFT first.
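As a rough illustration, the three client-side numbers from the table can be captured by wrapping any streaming iterator. This sketch assumes each yielded chunk counts as one token, which is only an approximation for real APIs, and `measure_stream` and `fake_stream` are hypothetical names; `fake_stream` stands in for a real streaming response.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream and report TTFT, end-to-end latency, and tokens/sec."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived
        n_tokens += 1
    end = time.perf_counter()

    if first_token_at is None:  # the stream produced nothing
        return {"ttft_ms": None, "e2e_ms": (end - start) * 1000,
                "tokens": 0, "tps": None}

    gen_seconds = end - first_token_at
    # TPS covers generation only: tokens after the first, over time since the first.
    tps = (n_tokens - 1) / gen_seconds if n_tokens > 1 and gen_seconds > 0 else None
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "e2e_ms": (end - start) * 1000,
        "tokens": n_tokens,
        "tps": tps,
    }


def fake_stream(n: int = 5, first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming API response: slow first token, then steady output."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

Separating TTFT from generation speed shows whether a slow response comes from queueing and prompt processing or from decoding itself.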
Tooling landscape
- LangSmith (LangChain) — best-in-class tracing for LangChain/LangGraph apps, tight IDE integration
- Arize Phoenix — open-source, strong on evals and drift detection, good for non-LangChain stacks
- Helicone — lightweight proxy-based logging, works with any OpenAI-compatible API
- Langfuse — open-source alternative to LangSmith, self-hostable, good for EU data residency
- Weave (W&B) — strongest if you're already using Weights & Biases for ML experiment tracking
Why standard APM misses LLM problems
Datadog and New Relic track errors, latency, and throughput — the right signals for deterministic services. LLM systems fail differently. Latency can be within SLA and error rate can be zero while the model is confidently answering questions wrong. Semantic drift, hallucination rate, and prompt regression are invisible to infrastructure monitoring. You need a separate observability layer that understands what a good response looks like.
- Standard APM gap: no concept of response quality — a 200 OK with a hallucinated answer looks identical to a correct one
- Standard APM gap: no prompt versioning — a prompt change that degrades quality 20% shows no signal in error rate
- Standard APM gap: no cost-per-feature attribution — you can't see which product surface is burning your token budget
- LLM-specific need: quality sampling — run LLM-as-judge over 5-10% of production traffic to catch regressions before users notice
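The cost-per-feature gap in the list above narrows once every call record carries a feature tag and a cost. A minimal sketch, assuming records shaped like the logging example earlier in this article; `cost_by_feature` is a hypothetical helper, and a real stack would usually run this as a query in the observability platform instead.

```python
from collections import defaultdict


def cost_by_feature(records):
    """Sum cost_usd per 'feature' tag; return (feature, total) pairs, highest first."""
    totals = defaultdict(float)
    for rec in records:
        # Untagged calls are grouped together so they stay visible in the report.
        totals[rec.get("feature", "untagged")] += rec["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = [
        {"feature": "rag_qa", "cost_usd": 0.002},
        {"feature": "summarize", "cost_usd": 0.001},
        {"feature": "rag_qa", "cost_usd": 0.003},
        {"cost_usd": 0.0005},
    ]
    for feature, total in cost_by_feature(sample):
        print(f"{feature}: ${total:.4f}")
```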
Most LLM quality problems are invisible in standard metrics. Catching them requires a separate eval pipeline running in parallel with your inference pipeline.
The production eval pipeline
The pattern that works at scale: sample 5-10% of production traffic, run each (prompt, response) pair through an LLM-as-judge evaluator on a background queue, store scores in your observability platform, and alert when a rolling window average drops below your quality threshold. This gives you a real-time quality signal without blocking the inference path. Typical judge dimensions: factual accuracy, instruction following, format compliance, appropriate hedging.
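The sampling and alerting steps of that pattern can be sketched as follows. Everything here is illustrative: `should_sample`, `RollingQualityMonitor`, and `process_trace` are hypothetical names, the 0.8 threshold and window sizes are assumed values, and `judge` is a placeholder where the LLM-as-judge call would go. The sketch also assumes the trace carries the full prompt and response text, since a judge cannot score hashes.

```python
import hashlib
from collections import deque


def should_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16)
    return bucket / 2**32 < rate


class RollingQualityMonitor:
    """Keep the last `window` judge scores and flag a mean below `threshold`."""

    def __init__(self, window: int = 200, threshold: float = 0.8, min_samples: int = 20):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def add(self, score: float) -> bool:
        """Record a score in [0, 1]; return True when an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples for a stable average
        return sum(self.scores) / len(self.scores) < self.threshold


def judge(prompt: str, response: str) -> float:
    """Placeholder for an LLM-as-judge call returning a score in [0, 1]."""
    raise NotImplementedError("call your judge model here")


def process_trace(trace: dict, monitor: RollingQualityMonitor,
                  judge_fn=judge, rate: float = 0.05) -> bool:
    """Background-queue worker step: sample, score, and report alert state."""
    if not should_sample(trace["trace_id"], rate):
        return False
    return monitor.add(judge_fn(trace["prompt"], trace["response"]))
```

Hashing the trace id instead of calling a random number generator makes the sampling decision reproducible, so a rerun of the eval job scores the same traces. The `min_samples` guard keeps a handful of early low scores from paging anyone.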