Tracing Agent Loops: How to Debug Step-by-Step Execution
What a step trace reveals, how to spot loops, wrong tool calls, and hallucinated observations — and how to use the Agent Loop Simulator to reproduce failures.
An agent produced a wrong answer. You need to find out why. The agent took 14 steps, called 6 different tools, and made 4 LLM calls. Where did it go wrong? Without tracing, this is archaeology. With tracing, it's a 5-minute investigation.
What a trace needs to capture
- Every LLM call: inputs (messages array), outputs (response text), model, latency, token counts, cost
- Every tool call: tool name, arguments, response, latency, any errors
- Agent state at each step: what the agent 'knows' vs. what it's reasoning about
- Branching decisions: when the agent chose between multiple actions, what it chose and why
- The full span tree: parent-child relationships between spans (LLM call → tool call → LLM call)
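The fields above can be sketched as a small record type. This is a minimal sketch, assuming the trace is stored as a flat list of span records linked by `parent_id`. The names `SpanRecord` and `render_tree`, the field layout, and the sample values are illustrative and not taken from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """One node in the trace: an agent step, an LLM call, or a tool call."""
    span_id: str
    parent_id: Optional[str]  # None marks a root span
    kind: str                 # "agent_step", "llm_call", or "tool_call"
    name: str
    latency_ms: float
    attributes: dict = field(default_factory=dict)  # tokens, arguments, errors, ...


def render_tree(spans: list[SpanRecord]) -> list[str]:
    """Return the span tree as indented lines, children in recorded order."""
    children: dict[Optional[str], list[SpanRecord]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{s.kind}:{s.name} ({s.latency_ms:.0f} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up sample data: one agent step with an LLM call and a tool call under it.
spans = [
    SpanRecord("s1", None, "agent_step", "step_1", 1300.0),
    SpanRecord("s2", "s1", "llm_call", "plan", 900.0,
               {"input_tokens": 512, "output_tokens": 48}),
    SpanRecord("s3", "s1", "tool_call", "search", 350.0,
               {"arguments": {"query": "weather"}}),
]
print("\n".join(render_tree(spans)))
```

Keeping the parent link on every record is what lets you rebuild the LLM call → tool call → LLM call chain later, even if spans arrive out of order from different workers.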
OpenTelemetry for agents
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent")

# call_llm and execute_tool are the application's own async helpers (not shown).
async def agent_step(step_num, messages, available_tools):
    # One parent span per agent step; the LLM and tool spans nest under it.
    with tracer.start_as_current_span(f"agent_step_{step_num}") as span:
        span.set_attribute("step_number", step_num)
        span.set_attribute("message_count", len(messages))

        # Child span for the LLM call: record the model and token usage.
        with tracer.start_as_current_span("llm_call") as llm_span:
            response = await call_llm(messages, available_tools)
            llm_span.set_attribute("model", response.model)
            llm_span.set_attribute("input_tokens", response.usage.input_tokens)
            llm_span.set_attribute("output_tokens", response.usage.output_tokens)

        # Sibling span for the tool call, only when the model requested one.
        if response.tool_use:
            with tracer.start_as_current_span("tool_call") as tool_span:
                tool_span.set_attribute("tool_name", response.tool_use.name)
                tool_span.set_attribute("tool_input", str(response.tool_use.input))
                try:
                    result = await execute_tool(response.tool_use)
                    tool_span.set_attribute("tool_result_length", len(str(result)))
                except Exception as e:
                    # Mark the span as failed so the error is visible in the trace.
                    tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                    raise

        return response
LangSmith for higher-level tracing
For teams using LangChain, LangSmith provides automatic tracing with a visual UI. Every chain, agent step, LLM call, and tool invocation is captured in a tree view. You can replay any trace, compare traces across runs, and annotate specific steps with feedback.
For teams not using LangChain, Langfuse and Arize Phoenix offer similar capabilities through lightweight SDKs. Both accept OpenTelemetry trace data, so your instrumentation is not tied to a single vendor.
Debugging checklist for a failed agent run
- Step 1: Find the step where the agent's reasoning diverged from the correct path — look for the first wrong inference
- Step 2: Check the tool call that preceded the wrong inference — did the tool return unexpected data?
- Step 3: Check the context at the divergence step — was the original task still in context, or had it been pushed too far back?
- Step 4: Check for hallucinated tool arguments — did the agent invent parameters that don't exist?
- Step 5: Look at the model's stated reasoning — does it match its actions? Inconsistency reveals the point of confusion
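Two of these checks lend themselves to automation: hallucinated tool arguments (Step 4) and repeated identical tool calls, which usually indicate a loop. The sketch below assumes each tool call in the trace has been reduced to a `(tool_name, arguments)` pair and that each tool's allowed parameter names are known. The functions `find_unknown_arguments` and `find_repeated_calls`, the sample tools, and the threshold of 3 are hypothetical and not part of any tracing SDK.

```python
import json
from collections import Counter

ToolCall = tuple[str, dict]


def find_unknown_arguments(calls: list[ToolCall], schemas: dict[str, set[str]]):
    """Flag calls to unknown tools or with parameters the tool does not define.

    Returns (step_index, tool_name, sorted_unknown_parameter_names) per
    offending call. An unknown tool is reported with all its arguments.
    """
    problems = []
    for i, (tool, args) in enumerate(calls):
        allowed = schemas.get(tool, set())
        unknown = sorted(set(args) - allowed)
        if tool not in schemas or unknown:
            problems.append((i, tool, unknown))
    return problems


def find_repeated_calls(calls: list[ToolCall], threshold: int = 3):
    """Return tool calls issued `threshold` or more times with identical arguments."""
    # Serialize arguments with sorted keys so key order does not hide a repeat.
    counts = Counter((tool, json.dumps(args, sort_keys=True)) for tool, args in calls)
    return {key: n for key, n in counts.items() if n >= threshold}


# Made-up sample data for illustration.
schemas = {"search": {"query", "limit"}, "fetch": {"url"}}
calls = [
    ("search", {"query": "q3 revenue"}),
    ("search", {"query": "q3 revenue"}),
    ("fetch", {"url": "https://example.com", "timeout": 5}),  # "timeout" is not defined
    ("search", {"query": "q3 revenue"}),
]

print(find_unknown_arguments(calls, schemas))  # [(2, 'fetch', ['timeout'])]
print(find_repeated_calls(calls))  # {('search', '{"query": "q3 revenue"}'): 3}
```

Neither check replaces reading the trace. They only narrow down which steps to open first.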
Trace agent loops: step through agent execution traces in the Agents module.
- LangSmith: LLM Application Observability — LangChain
- OpenTelemetry for LLMs — OpenLLMetry
- Arize Phoenix: ML Observability for LLMs