GenAI Systems Lab

Observability for Agent Systems: Traces, Cost, Latency, and What to Alert On

Why logs alone fail for agents. Complete trace anatomy for multi-step tasks. OpenTelemetry span attributes for LLM calls and tool invocations. Cost observability per task. Latency attribution. Alert thresholds that catch problems before users do.

Prerequisites: agent architecture basics, basic logging concepts. After this post you will be able to design an observability stack for an agent system: trace anatomy, span attributes, cost tracking, latency profiling, and alert thresholds.

Standard application observability — request logs, error rates, latency percentiles — is not enough for agent systems. An agent doesn't have one request. It has a chain of LLM calls, tool invocations, retrieval operations, and retries, each of which can fail silently or produce subtly wrong outputs.

When an agent produces a bad result, 'the model hallucinated' is not a diagnosis. Observability is what tells you whether the retriever returned wrong context, the prompt was malformed, the tool call failed silently, or the LLM made a bad planning decision.

Why Logs Alone Fail

Trace Anatomy for Agent Systems

Every agent task should produce one distributed trace: a root span for the task, with a child span for each LLM call and tool invocation. Instrumentation that produces such a trace looks like:

import hashlib

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer('agent-system')

# `llm` and `execute_tool` stand in for your own planner client and tool dispatcher.
def run_agent_task(task_id: str, user_query: str):
    # Root span: one per agent task; every LLM and tool span nests under it.
    with tracer.start_as_current_span('agent.task') as root:
        root.set_attribute('task.id', task_id)
        root.set_attribute('task.query_length', len(user_query))

        # Planning call: record token usage and cost on the span itself.
        with tracer.start_as_current_span('agent.plan') as plan_span:
            plan, tokens = llm.plan(user_query)
            plan_span.set_attribute('llm.prompt_tokens', tokens.prompt)
            plan_span.set_attribute('llm.completion_tokens', tokens.completion)
            plan_span.set_attribute('llm.cost_usd', tokens.cost())

        for step in plan.steps:
            with tracer.start_as_current_span(f'tool.{step.tool}') as tool_span:
                tool_span.set_attribute('tool.name', step.tool)
                # Built-in hash() is salted per process for strings, so use a
                # stable digest that matches identical args across runs.
                args_hash = hashlib.sha256(str(step.args).encode()).hexdigest()[:16]
                tool_span.set_attribute('tool.args_hash', args_hash)
                try:
                    result = execute_tool(step)
                    # len() measures result size (characters or items), not tokens.
                    tool_span.set_attribute('tool.result_length', len(result))
                except Exception as e:
                    tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                    raise

Cost Observability

Agent costs are not predictable from request volume. A single task can make 1 LLM call or 15, depending on the query complexity and tool failures.
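
Because the number of LLM calls varies per task, cost is easier to reason about when rolled up per trace rather than per request. The following is a minimal sketch, assuming exported spans have been flattened into records that carry a trace ID and the `llm.cost_usd` attribute set in the trace code above. The `SpanRecord` shape and the `cost_per_task` helper are hypothetical and not part of OpenTelemetry, and the numbers are illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical flattened view of an exported span (not an OpenTelemetry type)."""
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)


def cost_per_task(spans):
    """Sum llm.cost_usd and count LLM calls per trace (one trace == one agent task)."""
    totals = defaultdict(lambda: {"llm_calls": 0, "cost_usd": 0.0})
    for span in spans:
        cost = span.attributes.get("llm.cost_usd")
        if cost is None:
            continue  # spans without an LLM cost attribute (root, tools) are skipped
        totals[span.trace_id]["llm_calls"] += 1
        totals[span.trace_id]["cost_usd"] += cost
    return dict(totals)


# Illustrative data: task t1 needed one LLM call, task t2 needed three.
spans = [
    SpanRecord("t1", "agent.plan", {"llm.cost_usd": 0.25}),
    SpanRecord("t1", "tool.search"),
    SpanRecord("t2", "agent.plan", {"llm.cost_usd": 0.25}),
    SpanRecord("t2", "agent.plan", {"llm.cost_usd": 0.5}),
    SpanRecord("t2", "agent.plan", {"llm.cost_usd": 0.25}),
]
report = cost_per_task(spans)
```

Grouping by trace ID rather than by request makes the spread visible: two tasks that look identical in request logs can differ several-fold in cost.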

Latency Observability
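
Latency attribution asks which spans account for a task's wall-clock time. One way to sketch it, assuming the spans of a single trace are available as (name, start, end) tuples in seconds and that child spans run sequentially, as in the trace code above. The tuple shape and the `latency_breakdown` helper are hypothetical stand-ins for whatever your trace backend exports.

```python
from collections import defaultdict


def latency_breakdown(spans, root_name="agent.task"):
    """Return each child span name's share of the root span's duration.

    `spans` holds (name, start_s, end_s) tuples for ONE trace. Assumes child
    spans do not overlap; with parallel tool calls the shares can exceed 1.
    """
    root = next(s for s in spans if s[0] == root_name)
    total = root[2] - root[1]
    by_name = defaultdict(float)
    for name, start, end in spans:
        if name != root_name:
            by_name[name] += end - start
    shares = {name: duration / total for name, duration in by_name.items()}
    # Time covered by no child span: orchestration overhead, queueing, and so on.
    shares["(unattributed)"] = 1.0 - sum(by_name.values()) / total
    return shares


# Illustrative timings only.
spans = [
    ("agent.task", 0.0, 8.0),
    ("agent.plan", 0.0, 2.0),
    ("tool.search", 2.0, 6.0),
    ("tool.search", 6.0, 7.0),
]
shares = latency_breakdown(spans)
```

Keeping an explicit unattributed bucket is a deliberate choice in this sketch: a large value there suggests time is being spent in code that has no span yet.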

What to Alert On

Senior framing: observability is not instrumentation you add after the system breaks. It is how you know the system is working before it visibly breaks. Ship with traces from day one: retrofitting traces onto a production agent system typically costs more than building them in.
