
Tracing Agent Loops: How to Debug Step-by-Step Execution

What a step trace reveals, how to spot loops, wrong tool calls, and hallucinated observations — and how to use the Agent Loop Simulator to reproduce failures.

An agent produced a wrong answer. You need to find out why. The agent took 14 steps, called 6 different tools, and made 4 LLM calls. Where did it go wrong? Without tracing, this is archaeology. With tracing, it's a 5-minute investigation.

What a trace needs to capture

Every LLM call: inputs (messages array), outputs (response text), model, latency, token counts, cost
Every tool call: tool name, arguments, response, latency, any errors
Agent state at each step: what the agent 'knows' (the context it has accumulated so far) vs. what it's reasoning about at that step
Branching decisions: when the agent chose between multiple actions, what it chose and why
The full span tree: parent-child relationships between spans (LLM call → tool call → LLM call)
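
The fields above could be collected in a record like the one below. This is a minimal, standard-library sketch rather than an established schema: `SpanRecord`, its field names, and `render_tree` are hypothetical names chosen for illustration, and the sample values are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SpanRecord:
    """One node in the trace tree (hypothetical schema, for illustration only)."""
    span_id: str
    kind: str                        # "agent_step", "llm_call", or "tool_call"
    name: str
    parent_id: Optional[str] = None  # None marks a root span
    latency_ms: float = 0.0
    attributes: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def render_tree(spans: list[SpanRecord]) -> list[str]:
    """Return one indented line per span, with children listed under their parent."""
    children: dict[Optional[str], list[SpanRecord]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            flag = " [ERROR]" if s.error else ""
            lines.append(f"{'  ' * depth}{s.kind}:{s.name} ({s.latency_ms:.0f} ms){flag}")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up sample data: one agent step containing an LLM call and a failed tool call.
spans = [
    SpanRecord("1", "agent_step", "step_1", latency_ms=950),
    SpanRecord("2", "llm_call", "plan", parent_id="1", latency_ms=700,
               attributes={"input_tokens": 412, "output_tokens": 58}),
    SpanRecord("3", "tool_call", "search", parent_id="1", latency_ms=250,
               attributes={"arguments": {"query": "refund policy"}},
               error="timeout"),
]
print("\n".join(render_tree(spans)))
```

Keeping `parent_id` on every record is what makes the tree reconstructible later, even if spans arrive out of order or from different processes.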

OpenTelemetry for agents

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent")

# call_llm and execute_tool are placeholders for your own LLM client and tool dispatcher.
async def agent_step(step_num, messages, available_tools):
    # One parent span per step; the LLM and tool spans below nest under it.
    # The span name stays constant and the step number goes in an attribute,
    # which keeps span names low-cardinality and easy to aggregate.
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("step_number", step_num)
        span.set_attribute("message_count", len(messages))

        with tracer.start_as_current_span("llm_call") as llm_span:
            response = await call_llm(messages, available_tools)
            llm_span.set_attribute("model", response.model)
            llm_span.set_attribute("input_tokens", response.usage.input_tokens)
            llm_span.set_attribute("output_tokens", response.usage.output_tokens)

        result = None
        if response.tool_use:
            with tracer.start_as_current_span("tool_call") as tool_span:
                tool_span.set_attribute("tool_name", response.tool_use.name)
                # Attribute values must be primitives, so the input is stringified.
                tool_span.set_attribute("tool_input", str(response.tool_use.input))
                try:
                    result = await execute_tool(response.tool_use)
                    tool_span.set_attribute("tool_result_length", len(str(result)))
                except Exception as e:
                    # Mark the span as failed, then re-raise so the caller still sees the error.
                    tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                    raise

        # Return the tool result as well so the caller can append it to messages.
        return response, result

LangSmith for higher-level tracing

For teams using LangChain, LangSmith provides automatic tracing with a visual UI. Every chain, agent step, LLM call, and tool invocation is captured in a tree view. You can replay any trace, compare traces across runs, and annotate specific steps with feedback.

For teams not using LangChain, Langfuse and Arize Phoenix offer similar capabilities with a simpler SDK. Both support the OpenTelemetry standard, so you're not locked into a specific provider.

Debugging checklist for a failed agent run

Step 1: Find the step where the agent's reasoning diverged from the correct path — look for the first wrong inference
Step 2: Check the tool call that preceded the wrong inference — did the tool return unexpected data?
Step 3: Check the context at the divergence step — was the original task still in context, or had it been pushed too far back?
Step 4: Check for hallucinated tool arguments — did the agent invent parameters that don't exist?
Step 5: Look at the model's stated reasoning — does it match its actions? Inconsistency reveals the point of confusion
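
Parts of this checklist can be automated before you read the trace by hand. The sketch below assumes each step has been exported as a plain dict with `tool`, `args`, and optional `error` keys, and that you know each tool's declared parameter names. `TOOL_SCHEMAS`, `find_suspect_steps`, and the step format are all hypothetical. It flags undeclared (possibly hallucinated) arguments, tool errors, and repeated identical calls, which often indicate a loop.

```python
import json

# Hypothetical tool schemas: tool name -> set of parameter names it accepts.
TOOL_SCHEMAS = {
    "search": {"query", "limit"},
    "get_order": {"order_id"},
}


def find_suspect_steps(steps, schemas, max_repeats=2):
    """Return (step_index, reason) pairs for steps worth inspecting first."""
    findings = []
    seen = {}  # (tool, canonical JSON of args) -> number of calls so far
    for i, step in enumerate(steps):
        tool, args = step.get("tool"), step.get("args", {})
        if tool is None:
            continue  # pure LLM step, no tool call to check

        if tool not in schemas:
            findings.append((i, f"unknown tool '{tool}'"))
        else:
            extra = sorted(set(args) - schemas[tool])
            if extra:
                findings.append((i, f"undeclared arguments {extra}"))

        if step.get("error"):
            findings.append((i, f"tool error: {step['error']}"))

        # The same tool with identical arguments, repeated, suggests a loop.
        key = (tool, json.dumps(args, sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] == max_repeats + 1:
            findings.append((i, f"'{tool}' called {seen[key]} times with identical arguments"))
    return findings


# Made-up trace: one undeclared argument and one repeated search.
steps = [
    {"tool": "search", "args": {"query": "order 1182"}},
    {"tool": "get_order", "args": {"order_id": "1182", "include_history": True}},
    {"tool": "search", "args": {"query": "order 1182"}},
    {"tool": None},
    {"tool": "search", "args": {"query": "order 1182"}},
]
for index, reason in find_suspect_steps(steps, TOOL_SCHEMAS):
    print(f"step {index}: {reason}")
```

A check like this only narrows the search. Whether a flagged step is the actual point of divergence still requires reading the model's reasoning at that step.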
