
LLM Observability: What to Log, Trace, and Alert On

Prompt/response logging, latency tracing, cost tracking, quality signals, and alert thresholds. What a production-grade LLM monitoring stack looks like.

You can't improve what you can't see. Traditional application monitoring tracks errors and latency. LLM observability tracks something harder: whether the model is doing the right thing. These are different problems and require different tools.

The four pillars of LLM observability

Traces: the full prompt, response, tool calls, and step sequence for each request — your debugging foundation
Metrics: latency (TTFT, total), token counts, cost, throughput — your operational dashboard
Quality signals: feedback, ratings, hallucination rates, task completion — your model health
Alerts: threshold-based triggers that fire when latency spikes, error rates climb, or quality scores drop below your agreed threshold

What to log on every LLM call

{
  "trace_id": "tr_abc123",
  "timestamp": "2025-05-19T10:23:01Z",
  "model": "gpt-4o-mini",
  "prompt_tokens": 412,
  "completion_tokens": 88,
  "latency_ms": 820,
  "cost_usd": 0.00062,
  "user_id": "u_789",
  "session_id": "s_456",
  "feature": "rag_qa",
  "prompt_hash": "sha256:...",
  "response_hash": "sha256:...",
  "thumbs_up": null,
  "flagged": false
}

prompt_hash and response_hash store a hash of the text, not the full text, to save storage. thumbs_up and flagged are quality signals, filled in asynchronously after user feedback arrives.
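As a rough illustration of how such a record might be assembled, here is a minimal Python sketch. The function names, the "example-model" key, and the prices in PRICES_PER_MILLION are all hypothetical placeholders, not real provider rates; substitute your provider's current pricing and your own ID scheme.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Hypothetical prices in USD per million tokens; NOT real provider rates.
PRICES_PER_MILLION = {"example-model": {"prompt": 0.50, "completion": 1.50}}


def sha256_tag(text: str) -> str:
    """Return a 'sha256:<hex digest>' tag for the given text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_log_record(model, prompt, response, prompt_tokens,
                     completion_tokens, latency_ms, user_id, session_id,
                     feature):
    """Assemble one per-call log record with the fields listed above."""
    price = PRICES_PER_MILLION[model]
    cost = (prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]) / 1_000_000
    return {
        "trace_id": "tr_" + uuid.uuid4().hex[:12],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(cost, 6),
        "user_id": user_id,
        "session_id": session_id,
        "feature": feature,
        "prompt_hash": sha256_tag(prompt),
        "response_hash": sha256_tag(response),
        # Quality signals start empty and are updated once feedback arrives.
        "thumbs_up": None,
        "flagged": False,
    }
```

Computing cost at log time, rather than later from token counts, keeps the record correct even after prices change.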

Latency breakdown: what to measure

Metric	What it measures	Typical SLA
TTFT	Time to first token — perceived responsiveness	< 500ms
TPS	Tokens per second — generation speed	> 30 tok/s
E2E latency	Total wall-clock time	< 3s for chat
Retrieval latency	Vector DB query time (RAG only)	< 100ms

TTFT matters more than total latency for UX. Users tolerate slow generation if the first token appears quickly — it signals the response has started. Stream your responses and optimise TTFT first.
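The metrics in the table can be captured by timing a streamed response. The sketch below assumes the client exposes the response as a plain iterable of tokens or chunks; measure_stream is a hypothetical helper, and TPS is computed here over the tokens after the first one, which is one reasonable definition among several.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and return TTFT, TPS, and end-to-end latency."""
    start = clock()
    first_at: Optional[float] = None
    last_at: Optional[float] = None
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at  # arrival time of the first token
        count += 1
    end = clock()

    ttft_ms = None if first_at is None else (first_at - start) * 1000
    tps = None
    if count > 1 and last_at > first_at:
        # Generation speed, excluding the wait for the first token.
        tps = (count - 1) / (last_at - first_at)
    return {
        "ttft_ms": ttft_ms,
        "tps": tps,
        "e2e_ms": (end - start) * 1000,
        "tokens": count,
    }


def fake_stream(n: int, delay_s: float):
    """Stand-in for a real streaming client, for local experiments only."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(5, 0.01)))
```

Start the clock before the request is sent, not when the stream object is returned, otherwise network and queueing time drop out of TTFT.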

Tooling landscape

LangSmith (LangChain) — best-in-class tracing for LangChain/LangGraph apps, tight IDE integration
Arize Phoenix — open-source, strong on evals and drift detection, good for non-LangChain stacks
Helicone — lightweight proxy-based logging, works with any OpenAI-compatible API
Langfuse — open-source alternative to LangSmith, self-hostable, good for EU data residency
Weave (W&B) — strongest if you're already using Weights & Biases for ML experiment tracking

Why standard APM misses LLM problems

Datadog and New Relic track errors, latency, and throughput — the right signals for deterministic services. LLM systems fail differently. Latency can be within SLA and error rate can be zero while the model is confidently answering questions wrong. Semantic drift, hallucination rate, and prompt regression are invisible to infrastructure monitoring. You need a separate observability layer that understands what a good response looks like.

Standard APM gap: no concept of response quality — a 200 OK with a hallucinated answer looks identical to a correct one
Standard APM gap: no prompt versioning — a prompt change that degrades quality 20% shows no signal in error rate
Standard APM gap: no cost-per-feature attribution — you can't see which product surface is burning your token budget
LLM-specific need: quality sampling — run LLM-as-judge over 5-10% of production traffic to catch regressions before users notice
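Cost-per-feature attribution follows directly from per-call logging, provided each record carries a feature tag and a cost. A minimal sketch, assuming records shaped like the log example earlier in this article (cost_by_feature is a hypothetical helper name):

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_feature(records: Iterable[dict]) -> Dict[str, float]:
    """Sum cost_usd per feature, returned with the most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record["feature"]] += record["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    sample = [
        {"feature": "rag_qa", "cost_usd": 0.5},
        {"feature": "summarise", "cost_usd": 0.25},
        {"feature": "rag_qa", "cost_usd": 0.25},
    ]
    print(cost_by_feature(sample))
```

The same grouping works for a prompt version field, if you log one, which is how a quality or cost regression can be tied back to a specific prompt change.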

Most LLM quality problems are invisible in standard metrics, so quality monitoring requires a separate eval pipeline running in parallel with your inference pipeline.

The production eval pipeline

The pattern that works at scale: sample 5-10% of production traffic, run each (prompt, response) pair through an LLM-as-judge evaluator on a background queue, store the scores in your observability platform, and alert when a rolling-window average drops below your quality threshold. This gives you a near-real-time quality signal without blocking the inference path. Typical judge dimensions: factual accuracy, instruction following, format compliance, and appropriate hedging.
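The sampling and rolling-window logic might look like the sketch below. QualityMonitor is a hypothetical class, the judge is assumed to be any callable that returns a score between 0 and 1 (in practice a call to an LLM-as-judge), and the default sample rate, window size, and threshold are illustrative values. The background queue is left out: observe is meant to be called from a worker, not from the request path.

```python
import random
from collections import deque
from typing import Callable, Optional


class QualityMonitor:
    """Sample traffic, score it with a judge, and track a rolling average."""

    def __init__(self, judge: Callable[[str, str], float],
                 sample_rate: float = 0.05, window: int = 200,
                 threshold: float = 0.8, min_samples: int = 20,
                 rng: Optional[random.Random] = None):
        self.judge = judge
        self.sample_rate = sample_rate
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest scores fall out
        self.rng = rng or random.Random()

    def observe(self, prompt: str, response: str) -> Optional[float]:
        """Score the pair if sampled; return its score, else None."""
        if self.rng.random() >= self.sample_rate:
            return None
        score = self.judge(prompt, response)
        self.scores.append(score)
        return score

    def rolling_average(self) -> Optional[float]:
        if not self.scores:
            return None
        return sum(self.scores) / len(self.scores)

    def should_alert(self) -> bool:
        """Alert only once enough samples exist to trust the average."""
        if len(self.scores) < self.min_samples:
            return False
        return self.rolling_average() < self.threshold
```

The min_samples guard is a design choice to avoid paging on the first few low scores after a deploy; tune it together with the window size.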
