GenAI Systems Lab
AI Engineering 9 min read

LLM Observability: What to Log, Trace, and Alert On

Prompt/response logging, latency tracing, cost tracking, quality signals, and alert thresholds. What a production-grade LLM monitoring stack looks like.

You can't improve what you can't see. Traditional application monitoring tracks errors and latency. LLM observability tracks something harder: whether the model is doing the right thing. These are different problems and require different tools.

The four pillars of LLM observability

What to log on every LLM call

{
  "trace_id":          "tr_abc123",
  "timestamp":         "2025-05-19T10:23:01Z",
  "model":             "gpt-4o-mini",
  "prompt_tokens":     412,
  "completion_tokens": 88,
  "latency_ms":        820,
  "cost_usd":          0.00062,
  "user_id":           "u_789",
  "session_id":        "s_456",
  "feature":           "rag_qa",
  "prompt_hash":       "sha256:...",
  "response_hash":     "sha256:...",
  "thumbs_up":         null,
  "flagged":           false
}

Two notes on this record. prompt_hash and response_hash store a hash of the text rather than the full text, to save storage. thumbs_up and flagged are quality signals that are filled in asynchronously, after user feedback arrives.
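As a minimal sketch of how a record like this might be assembled at call time, the Python below builds one from the raw prompt, response, and token counts. The helper names (sha256_tag, build_log_record), the trace ID format, and the prices in PRICE_PER_M_TOKENS are illustrative assumptions, not real provider rates or part of any particular SDK.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Hypothetical per-million-token prices in USD. Replace with your provider's real rates.
PRICE_PER_M_TOKENS = {"example-model": {"prompt": 0.50, "completion": 1.50}}


def sha256_tag(text: str) -> str:
    """Return the SHA-256 digest of text in 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_log_record(model, prompt, response, prompt_tokens, completion_tokens,
                     latency_ms, user_id, session_id, feature):
    """Assemble one log record per LLM call, storing hashes instead of full text."""
    price = PRICE_PER_M_TOKENS[model]
    cost = (prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]) / 1_000_000
    return {
        "trace_id": "tr_" + uuid.uuid4().hex[:12],  # assumed ID format
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(cost, 8),
        "user_id": user_id,
        "session_id": session_id,
        "feature": feature,
        "prompt_hash": sha256_tag(prompt),
        "response_hash": sha256_tag(response),
        # Quality signals start empty and are updated later from user feedback.
        "thumbs_up": None,
        "flagged": False,
    }


record = build_log_record(
    model="example-model",
    prompt="What is our refund policy?",
    response="Refunds are available within 30 days.",
    prompt_tokens=412,
    completion_tokens=88,
    latency_ms=820,
    user_id="u_789",
    session_id="s_456",
    feature="rag_qa",
)
print(json.dumps(record, indent=2))
```

Hashing keeps the record small and still lets you group identical prompts, but it means you cannot replay or inspect the text later. If you need that, one option is to store full text for a sampled subset only.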

Latency breakdown: what to measure

| Metric | What it measures | Typical SLA |
|---|---|---|
| TTFT | Time to first token (perceived responsiveness) | < 500 ms |
| TPS | Tokens per second (generation speed) | > 30 tok/s |
| E2E latency | Total wall-clock time | < 3 s for chat |
| Retrieval latency | Vector DB query time (RAG only) | < 100 ms |

TTFT matters more than total latency for UX. Users tolerate slow generation if the first token appears quickly — it signals the response has started. Stream your responses and optimise TTFT first.
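To make these metrics concrete, here is a rough sketch of measuring TTFT, TPS, and E2E latency around a streamed response. The fake_token_stream generator is a stand-in for a real streaming client, and measure_stream and sla_violations are hypothetical helper names. TPS is computed here over the generation phase only (tokens after the first, divided by the time after the first token), which is one reasonable definition among several.

```python
import time
from typing import Iterable, Iterator, List


def fake_token_stream(n_tokens: int, first_delay_s: float,
                      per_token_s: float) -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical; swap in your client)."""
    time.sleep(first_delay_s)  # simulates queueing plus prompt processing
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulates per-token generation time
        yield f"tok{i} "


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream and return TTFT, TPS, and E2E latency."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()

    e2e_ms = (end - start) * 1000
    if first_token_at is None:  # the stream produced no tokens
        return {"ttft_ms": None, "tps": None, "e2e_ms": e2e_ms, "tokens": 0}

    gen_time = end - first_token_at
    tps = (n_tokens - 1) / gen_time if n_tokens > 1 and gen_time > 0 else None
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tps": tps,
        "e2e_ms": e2e_ms,
        "tokens": n_tokens,
    }


def sla_violations(m: dict) -> List[str]:
    """Compare measurements against the chat targets in the table above."""
    out = []
    if m["ttft_ms"] is not None and m["ttft_ms"] >= 500:
        out.append("ttft")
    if m["tps"] is not None and m["tps"] <= 30:
        out.append("tps")
    if m["e2e_ms"] >= 3000:
        out.append("e2e")
    return out


metrics = measure_stream(fake_token_stream(5, first_delay_s=0.05, per_token_s=0.01))
print(metrics, sla_violations(metrics))
```

In a real system the timer should start when the request leaves your service, not when the first chunk arrives, so that network and queueing time count toward TTFT.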

Tooling landscape

Why standard APM misses LLM problems

Datadog and New Relic track errors, latency, and throughput — the right signals for deterministic services. LLM systems fail differently. Latency can be within SLA and error rate can be zero while the model is confidently answering questions wrong. Semantic drift, hallucination rate, and prompt regression are invisible to infrastructure monitoring. You need a separate observability layer that understands what a good response looks like.

Because these quality failures never show up in standard metrics, quality monitoring requires a separate eval pipeline running in parallel with your inference pipeline.

The production eval pipeline

The pattern that works at scale: sample 5-10% of production traffic, run each (prompt, response) pair through an LLM-as-judge evaluator on a background queue, store scores in your observability platform, and alert when a rolling window average drops below your quality threshold. This gives you a real-time quality signal without blocking the inference path. Typical judge dimensions: factual accuracy, instruction following, format compliance, appropriate hedging.
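The pattern above can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the QualityMonitor class, its default window size, threshold, and min_samples are assumptions, and stub_judge stands in for a real LLM-as-judge call that would return a score between 0 and 1. The judge runs synchronously here for brevity; in production it would run on a background queue so it never blocks inference.

```python
import random
from collections import deque
from typing import Callable, Optional


class QualityMonitor:
    """Sample traffic, score it with a judge, and alert on a rolling average."""

    def __init__(self, judge: Callable[[str, str], float], sample_rate: float = 0.05,
                 window: int = 200, threshold: float = 0.8, min_samples: int = 20,
                 rng: Optional[random.Random] = None):
        self.judge = judge
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.rng = rng or random.Random()

    def maybe_evaluate(self, prompt: str, response: str) -> Optional[float]:
        """Score the pair if it falls in the sample; return None otherwise."""
        if self.rng.random() >= self.sample_rate:
            return None
        score = self.judge(prompt, response)
        self.scores.append(score)
        return score

    def rolling_average(self) -> Optional[float]:
        if not self.scores:
            return None
        return sum(self.scores) / len(self.scores)

    def should_alert(self) -> bool:
        """Alert only once enough samples exist to make the average meaningful."""
        if len(self.scores) < self.min_samples:
            return False
        return self.rolling_average() < self.threshold


def stub_judge(prompt: str, response: str) -> float:
    """Hypothetical judge: a real one would call an evaluator model."""
    return 0.4 if "not sure" in response.lower() else 0.9


monitor = QualityMonitor(stub_judge, sample_rate=0.10, rng=random.Random(0))
for i in range(1000):
    monitor.maybe_evaluate(f"question {i}", "Here is a grounded answer.")
print(len(monitor.scores), monitor.rolling_average(), monitor.should_alert())
```

The min_samples guard is a design choice to avoid paging on the first few low scores after a deploy. A real judge would typically return one score per dimension (factual accuracy, instruction following, format compliance, appropriate hedging), each tracked in its own window.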
