AI Engineering 12 min read

Observability for Agent Systems: Traces, Cost, Latency, and What to Alert On

Why logs alone fail for agents. Complete trace anatomy for multi-step tasks. OpenTelemetry span attributes for LLM calls and tool invocations. Cost observability per task. Latency attribution. Alert thresholds that catch problems before users do.

Prerequisites: agent architecture basics, basic logging concepts. After this post you will be able to design an observability stack for an agent system: trace anatomy, span attributes, cost tracking, latency profiling, and alert thresholds.

Standard application observability — request logs, error rates, latency percentiles — is not enough for agent systems. An agent doesn't have one request. It has a chain of LLM calls, tool invocations, retrieval operations, and retries, each of which can fail silently or produce subtly wrong outputs.

When an agent produces a bad result, 'the model hallucinated' is not a diagnosis. Observability is what tells you whether the retriever returned wrong context, the prompt was malformed, the tool call failed silently, or the LLM made a bad planning decision.

Why Logs Alone Fail

Agent tasks span multiple operations over seconds or minutes. A single log line per operation gives you isolated events, not causally connected traces.

The failure may be in the context, not the code. A tool returned data. The LLM misread it. The log shows a successful tool call, so the real problem is invisible.

Costs compound invisibly. Five LLM calls per task at 4k tokens each is not obvious from request logs. Cost observability requires per-call token accounting.

Latency attribution is impossible without traces. 'The agent is slow' could mean slow retrieval, slow LLM inference, slow tool APIs, or excessive retries. Logs don't tell you which.

Trace Anatomy for Agent Systems

Every agent task should produce one distributed trace with child spans. A complete trace looks like:

Root span: the full agent task. Attributes: task_id, user_id, task_type, total_duration, total_cost, final_status.

Planning span: the LLM call that produces the plan or next action. Attributes: model, prompt_tokens, completion_tokens, latency, temperature.

Tool call span (one per tool): tool_name, arguments (masked if sensitive), result_size, latency, success/failure, retry_count.

Retrieval span (if RAG): query, top_k, retrieval_latency, result_count, reranker_used.

LLM generation span: model, prompt_tokens (including retrieved context), completion_tokens, latency, finish_reason.

Error spans: exception type, stack trace, which step failed, whether it was recoverable.

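import hashlib
import json

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer('agent-system')

# `llm` and `execute_tool` are placeholders for your own planner client
# and tool dispatcher; they are not part of OpenTelemetry.

def run_agent_task(task_id: str, user_query: str):
    # Root span: one per agent task. Every span opened below becomes its child.
    with tracer.start_as_current_span('agent.task') as root:
        root.set_attribute('task.id', task_id)
        root.set_attribute('task.query_length', len(user_query))

        # Planning span: the LLM call that decides which tools to run.
        with tracer.start_as_current_span('agent.plan') as plan_span:
            plan, tokens = llm.plan(user_query)
            plan_span.set_attribute('llm.prompt_tokens', tokens.prompt)
            plan_span.set_attribute('llm.completion_tokens', tokens.completion)
            plan_span.set_attribute('llm.cost_usd', tokens.cost())

        for step in plan.steps:
            # One span per tool invocation.
            with tracer.start_as_current_span(f'tool.{step.tool}') as tool_span:
                tool_span.set_attribute('tool.name', step.tool)
                # Built-in hash() is salted per process for strings, so use a
                # stable digest that can be compared across runs and hosts.
                canonical_args = json.dumps(step.args, sort_keys=True, default=str)
                args_digest = hashlib.sha256(canonical_args.encode('utf-8')).hexdigest()
                tool_span.set_attribute('tool.args_hash', args_digest)
                try:
                    result = execute_tool(step)
                    # len() measures size (characters or items), not tokens.
                    tool_span.set_attribute('tool.result_size', len(result))
                except Exception as e:
                    # Mark the span as failed, then let the error propagate;
                    # the context manager records the exception event by default.
                    tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                    raise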
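
The trace anatomy above calls for tool arguments to be masked if sensitive. A minimal sketch of one way to do that before attaching arguments to a span, using only the standard library. The `SENSITIVE_KEYS` deny-list and the helper names `mask_args` and `args_attribute` are assumptions for illustration, and the masking only looks at top-level keys. The result is serialized to a JSON string because OpenTelemetry attribute values must be primitives (or sequences of primitives), not dicts.

```python
import json

# Hypothetical deny-list; adjust to the argument names your tools actually use.
SENSITIVE_KEYS = {"password", "api_key", "token", "ssn", "email"}


def mask_args(args: dict) -> dict:
    """Return a copy of args with sensitive top-level values redacted."""
    masked = {}
    for key, value in args.items():
        if key.lower() in SENSITIVE_KEYS:
            masked[key] = "[REDACTED]"
        else:
            masked[key] = value
    return masked


def args_attribute(args: dict) -> str:
    """Serialize masked args to a JSON string suitable for a span attribute."""
    return json.dumps(mask_args(args), sort_keys=True, default=str)
```

Inside a tool span this could be used as `tool_span.set_attribute('tool.args', args_attribute(step.args))`. A deny-list is the simplest option; an allow-list of known-safe keys is stricter if your tools accept free-form arguments.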

Cost Observability

Agent costs are not predictable from request volume. A single task can make 1 LLM call or 15, depending on the query complexity and tool failures.

Track cost per span, not per request. Aggregate to per-task, per-user, per-task-type.

Set per-task cost budgets. If a task exceeds $0.50 in LLM calls, something is wrong: likely a retry loop or an unexpectedly long retrieved context.

Alert on p95 cost, not average. Average cost hides runaway tasks. The worst 5% of tasks tell you where your cost problems live.

Break down by model. If you're mixing GPT-4o and GPT-4o-mini, track them separately. Misrouting simple queries to the expensive model is a common budget leak.
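
The accounting described above can be sketched in a few functions: price each LLM call from its token counts, roll calls up to a per-task total, take a nearest-rank p95, and flag tasks over budget. The model names and prices in `PRICE_PER_MTOK` are hypothetical placeholders, not real provider rates, and `LLMCall` is an assumed record shape; in practice these fields would come from your span attributes.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICE_PER_MTOK = {
    "large-model": {"prompt": 2.50, "completion": 10.00},
    "small-model": {"prompt": 0.15, "completion": 0.60},
}


@dataclass
class LLMCall:
    task_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of a single LLM call in USD."""
    price = PRICE_PER_MTOK[call.model]
    return (call.prompt_tokens * price["prompt"]
            + call.completion_tokens * price["completion"]) / 1_000_000


def cost_per_task(calls) -> dict:
    """Sum per-call cost into a {task_id: usd} mapping."""
    totals = defaultdict(float)
    for call in calls:
        totals[call.task_id] += call_cost(call)
    return dict(totals)


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty sequence")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def over_budget(task_costs: dict, budget_usd: float = 0.50) -> list:
    """Task ids whose total LLM cost exceeds the per-task budget."""
    return sorted(t for t, cost in task_costs.items() if cost > budget_usd)
```

Calling `percentile(cost_per_task(calls).values(), 95)` gives the p95 task cost to alert on. The same roll-up keyed by `call.model` instead of `call.task_id` gives the per-model breakdown.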

Latency Observability

Measure wall-clock time per span. LLM inference latency scales with output tokens, so long responses cost more time.

Track time-to-first-token separately from total generation time. Users perceive TTFT as 'the AI is thinking.' Total time is the actual bottleneck.

Retrieval latency should be under 200ms for synchronous agents. If it's higher, you need caching, ANN index tuning, or async prefetching.

Tool API latency is outside your control but inside your trace. Know your p99 for each external tool. If a tool's p99 is 3s, it becomes the bottleneck on complex tasks.
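
Two of the measurements above are easy to get wrong, so here is a rough sketch of each. `measure_stream` times a streaming response and reports time-to-first-token separately from total time; `latency_breakdown` turns per-span durations into each category's share of the task. Both function names and the `(category, duration)` input shape are assumptions, and the share calculation assumes steps run sequentially (with parallel tool calls, summed span time exceeds wall-clock time).

```python
import time
from collections import defaultdict


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a streamed response; return (text, ttft_seconds, total_seconds).

    ttft_seconds is None if the stream yields nothing.
    """
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: this is what the user perceives as "thinking".
            ttft = clock() - start
        parts.append(chunk)
    total = clock() - start
    return "".join(parts), ttft, total


def latency_breakdown(spans) -> dict:
    """Map each span category to its share (0..1) of summed span time.

    spans: iterable of (category, duration_seconds) pairs, for example
    ("retrieval", 0.18) or ("llm", 1.4).
    """
    totals = defaultdict(float)
    for category, duration in spans:
        totals[category] += duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {category: 0.0 for category in totals}
    return {category: t / grand_total for category, t in totals.items()}
```

With a breakdown in hand, `max(shares, key=shares.get)` names the category to optimize first, which is the answer logs alone cannot give you. The `clock` parameter exists so the timing logic can be exercised deterministically in tests.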

What to Alert On

Task failure rate > 2%: anything above this is a systemic problem.

Retry rate > 15%: your tools or LLM calls are flaky. A high retry rate also means cost overrun.

p95 task cost > 2x baseline: something in the pipeline is consuming more tokens than expected. Use p95 rather than the mean here, because the mean hides runaway tasks.

Tool error rate per tool: a specific tool failing often means it is down or its schema is wrong.

p95 task latency > SLA: catch this before users complain.

Planning loop depth > 5: the agent is stuck in a reasoning loop and needs a loop termination guard.
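
As a rough illustration, the fixed thresholds above can be evaluated over a window of aggregated metrics like this. `WindowStats`, `evaluate_alerts`, and the alert names are hypothetical; the retry rate is assumed to mean retried operations divided by all attempted LLM and tool calls; and `task_cost_usd` is whichever cost statistic you alert on (the Cost Observability section recommends p95). Per-tool error rates are left out because the text gives no threshold for them.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates over one evaluation window (for example, the last 15 minutes)."""
    tasks: int
    failed_tasks: int
    operations: int           # LLM and tool calls attempted
    retried_operations: int   # operations that needed at least one retry
    task_cost_usd: float      # the cost statistic being alerted on
    p95_latency_s: float
    max_plan_depth: int


def evaluate_alerts(stats: WindowStats,
                    baseline_cost_usd: float,
                    latency_sla_s: float) -> list:
    """Return the names of the alerts that fire for this window."""
    alerts = []
    if stats.tasks and stats.failed_tasks / stats.tasks > 0.02:
        alerts.append("task_failure_rate")
    if stats.operations and stats.retried_operations / stats.operations > 0.15:
        alerts.append("retry_rate")
    if stats.task_cost_usd > 2 * baseline_cost_usd:
        alerts.append("task_cost")
    if stats.p95_latency_s > latency_sla_s:
        alerts.append("p95_latency")
    if stats.max_plan_depth > 5:
        alerts.append("planning_loop_depth")
    return alerts
```

In production these rules would more likely live in your metrics backend's alerting configuration; writing them out as code mainly helps pin down the definitions (what counts as a retry, which cost statistic) before you encode them there.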

Senior framing: observability is not instrumentation you add after the system breaks. It is the mechanism by which you know the system is working before it visibly breaks. Ship with traces from day one. The cost of retroactively adding traces to a production agent system is always higher than building them in.
