How to Evaluate Multi-Step AI Agents
Why standard LLM evals break for agents. Task completion rate, tool call accuracy, trajectory quality, and building an agent eval harness.
Standard LLM evals measure one thing: did the model produce a good response to this prompt? That works fine for a chatbot or a summariser. It breaks completely for agents.
An agent might take 12 steps to complete a task, call 6 tools, and produce an answer. Evaluating only the final answer misses most of what actually happened, including where things went wrong. You can get the right answer for the wrong reasons, or the wrong answer despite doing everything correctly up to the last step.
The core problem: per-turn accuracy is a local metric. Agent quality is a global metric. If each step of a 12-step chain is correct 90% of the time and errors are independent, the chain produces a correct final answer only about 28% of the time (0.9^12 ≈ 0.28). Evaluate at the trajectory level, not the step level.
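The compounding arithmetic is easy to check for other chain lengths. The sketch below assumes every step succeeds independently with the same probability, which real agents violate (self-correction raises the end-to-end rate, cascading errors lower it), so treat the numbers as a rough guide only. The function name is illustrative.

```python
def chain_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** num_steps


# How fast a 90%-per-step agent degrades as the task gets longer.
for steps in (1, 3, 6, 12):
    rate = chain_success_rate(0.9, steps)
    print(f"{steps:>2} steps at 90% each -> {rate:.0%} end-to-end")
```

Under these assumptions the end-to-end rate falls from 90% at one step to roughly 73%, 53%, and 28% at 3, 6, and 12 steps.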
The 5 dimensions of agent evaluation
| Dimension | What it measures | How to measure | Target threshold |
|---|---|---|---|
| Task completion rate | Did the agent accomplish the end goal? | Human or LLM-as-judge binary: success/fail | > 85% for production |
| Tool call precision | When the agent called a tool, was it the right one with right args? | Compare to golden trace | > 90% |
| Tool call recall | Did the agent call all the tools it needed? | Check required tools vs. called tools | > 95% |
| Trajectory efficiency | Did the agent take the shortest sensible path? | Steps taken vs. optimal steps | < 1.5x optimal |
| Graceful failure rate | When it couldn't complete, did it fail cleanly? | Audit failure-case outputs | > 80% clean failures |
Hallucination in reasoning is a sixth dimension worth tracking separately: did the agent fabricate tool outputs, misquote its own previous observations, or reason from premises it invented? This is subtler than factual hallucination and harder to catch without trace logging.
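Tool call precision and recall from the table can be computed directly from a run and its golden trace. This is a minimal sketch, assuming each step is a dict with `tool` and `tool_args` keys and that a call counts as correct only when it matches a golden call exactly; in practice you may want a looser argument comparison (for example, normalising query strings). The helper names are hypothetical.

```python
import json


def _call_key(step: dict) -> tuple:
    # Canonical, hashable form of a call: tool name plus sorted-key JSON args.
    return (step["tool"], json.dumps(step["tool_args"], sort_keys=True))


def tool_precision(run_steps: list, golden_steps: list) -> float:
    """Share of the agent's calls that match a golden call (tool and args)."""
    if not run_steps:
        return 0.0
    golden = {_call_key(s) for s in golden_steps}
    correct = sum(1 for s in run_steps if _call_key(s) in golden)
    return correct / len(run_steps)


def tool_recall(run_steps: list, golden_steps: list) -> float:
    """Share of required tools (by name) that the agent called at least once."""
    required = {s["tool"] for s in golden_steps}
    if not required:
        return 1.0
    called = {s["tool"] for s in run_steps}
    return len(required & called) / len(required)
```

An agent that makes one extra search with a bad query before following the golden path would score 2/3 on precision and 1.0 on recall here: it called everything it needed, but not everything it called was needed.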
Why per-turn accuracy misses agent quality
Imagine a 6-step agent task. The agent makes a wrong tool call at step 3 but then self-corrects. The final answer is correct. Per-turn accuracy would score step 3 as a failure. But the real signal is: the agent caught its own error. That is actually good behaviour you want to preserve.
Conversely, an agent might produce a correct-looking final answer because it hallucinated an observation at step 4 and got lucky. Per-turn accuracy scores the final step highly. Trajectory eval catches the hallucination.
Never use per-turn LLM-as-judge scoring as your primary agent eval. It creates incentives for agents that look good locally while failing globally. Always have at least one end-to-end task completion metric.
Building an agent eval harness
Step 1: Trace logging
Every agent run must produce a complete, structured trace: step number, thought text (if ReAct), tool name, tool arguments, raw tool output, and the model's observation. Without this, debugging and evaluation are both impossible.
```json
{
  "run_id": "run_abc123",
  "task": "Find Q3 revenue for Acme Corp and compare to Q2",
  "steps": [
    {
      "step": 1,
      "thought": "I need to search for Acme Corp Q3 revenue",
      "tool": "search",
      "tool_args": {"query": "Acme Corp Q3 2024 revenue"},
      "tool_output": "Acme Corp reported Q3 2024 revenue of $4.2B...",
      "observation": "Q3 revenue is $4.2B"
    }
  ],
  "final_answer": "Q3 revenue was $4.2B, up 12% from Q2's $3.75B",
  "success": true,
  "steps_taken": 4,
  "optimal_steps": 3
}
```
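One way to produce records in this shape is a small trace object that the agent loop writes to after every step. The sketch below uses only the standard library; the `Trace` and `Step` classes and the `log_step` method are hypothetical names, not part of any agent framework.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Step:
    step: int
    thought: str
    tool: str
    tool_args: dict
    tool_output: str
    observation: str


@dataclass
class Trace:
    run_id: str
    task: str
    steps: list = field(default_factory=list)
    final_answer: Optional[str] = None
    success: Optional[bool] = None
    optimal_steps: Optional[int] = None

    def log_step(self, thought, tool, tool_args, tool_output, observation):
        # Step numbers are 1-based and assigned in call order.
        number = len(self.steps) + 1
        self.steps.append(
            Step(number, thought, tool, tool_args, tool_output, observation)
        )

    def to_json(self) -> str:
        record = asdict(self)  # recursively converts the nested Step objects
        record["steps_taken"] = len(self.steps)
        return json.dumps(record, indent=2)
```

Deriving `steps_taken` from the logged steps, rather than storing it separately, keeps the count from drifting out of sync with the trace.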
Step 2: Golden trace construction
For each eval task, construct a golden trace: the correct sequence of tool calls with correct arguments and the correct final answer. This is labour-intensive — plan 30–60 minutes per golden example for complex tasks — but it is the only reliable ground truth you have.
Start with 50 golden examples covering: easy tasks, multi-hop tasks, tool chaining, ambiguous input, and tasks where the agent should refuse. Weight toward your real production traffic pattern.
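Weighting toward production traffic while still covering every category can be done with a simple proportional split. The sketch below assumes you can count production requests per category; the category names and counts in the usage note are made up for illustration, and the per-category minimum of 3 is an arbitrary choice.

```python
def allocate_examples(traffic_counts: dict, total: int = 50, minimum: int = 3) -> dict:
    """Split `total` golden examples across categories in proportion to
    traffic, guaranteeing each category at least `minimum` examples."""
    reserved = minimum * len(traffic_counts)
    if reserved > total:
        raise ValueError("total is too small for the per-category minimum")
    remaining = total - reserved
    weight_sum = sum(traffic_counts.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {
        cat: minimum + remaining * n // weight_sum
        for cat, n in traffic_counts.items()
    }
    # Give leftover slots to the categories with the largest remainders.
    leftover = total - sum(counts.values())
    by_remainder = sorted(
        traffic_counts,
        key=lambda cat: remaining * traffic_counts[cat] % weight_sum,
        reverse=True,
    )
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts
```

With hypothetical counts of 400 easy, 250 multi-hop, 200 tool-chaining, 100 ambiguous, and 50 refusal requests, this yields 17, 12, 10, 6, and 5 examples respectively.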
Step 3: Replay testing
Replay testing means running a previous trace with mocked tool outputs — the tools return the same responses as the original run. This makes tests deterministic: if the agent's behaviour changes on the same inputs, you have a regression. Replay testing is to agents what unit tests are to functions.
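```python
from collections import defaultdict, deque


def replay_test(golden_trace, agent):
    # Queue recorded outputs per tool so repeated calls to the same tool
    # replay in their original order instead of overwriting each other.
    recorded = defaultdict(deque)
    for step in golden_trace["steps"]:
        recorded[step["tool"]].append(step["tool_output"])

    def make_mock(name):
        # Each call returns the next recorded output for this tool.
        return lambda args: recorded[name].popleft()

    mock_tools = {name: make_mock(name) for name in recorded}
    # agent.run, result.steps and result.answer_matches stand in for
    # whatever interface your agent framework exposes.
    result = agent.run(task=golden_trace["task"], tools=mock_tools)
    called = [s["tool"] for s in result.steps]
    golden = [s["tool"] for s in golden_trace["steps"]]
    assert called == golden, f"Tool sequence drift: {called} vs {golden}"
    assert result.answer_matches(golden_trace["final_answer"], threshold=0.85)
```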
Step 4: Human annotation protocol
For tasks where LLM-as-judge is unreliable, human annotation is the ground truth. Define a clear rubric: task completion (pass/fail), reasoning quality (1–3), and failure mode (tool error / hallucination / loop / gave up / off-task). Aim for inter-annotator agreement above 0.8, measured as Cohen's kappa, before trusting the scores.
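Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. A minimal pure-Python sketch for two annotators labelling the same runs follows; the function name is illustrative, and libraries such as scikit-learn provide an equivalent if you already depend on them.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick label k independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Two annotators who agree on 9 of 10 pass/fail labels can still land below the bar: with a 6/4 and a 7/3 pass/fail split, kappa is about 0.78, so 90% raw agreement is not the same as kappa above 0.8.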
Concrete metrics with thresholds
| Metric | Formula | Green | Yellow | Red (block ship) |
|---|---|---|---|---|
| Task completion rate | success / total runs | > 85% | 70–85% | < 70% |
| Tool precision | correct tool calls / total calls | > 90% | 80–90% | < 80% |
| Unnecessary steps | (actual - optimal) / optimal | < 50% | 50–100% | > 100% |
| Reasoning hallucination | hallucinated obs / total steps | < 2% | 2–5% | > 5% |
| Graceful failure rate | clean fails / total fails | > 80% | 60–80% | < 60% |
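The table translates into a small gate you can run in CI. The sketch below encodes the thresholds exactly as listed, treats boundary values (for example, exactly 85% completion) as yellow, and blocks shipping if any metric is red. Metric keys and function names are illustrative; values are fractions, so 85% is `0.85`.

```python
THRESHOLDS = {
    # metric: (green bound, red bound, higher_is_better)
    "task_completion_rate": (0.85, 0.70, True),
    "tool_precision": (0.90, 0.80, True),
    "unnecessary_steps": (0.50, 1.00, False),
    "reasoning_hallucination": (0.02, 0.05, False),
    "graceful_failure_rate": (0.80, 0.60, True),
}


def classify(metric: str, value: float) -> str:
    green, red, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Flip signs so the comparisons below work for both directions.
        value, green, red = -value, -green, -red
    if value > green:
        return "green"
    if value < red:
        return "red"
    return "yellow"


def ship_blocked(metrics: dict) -> bool:
    """True if any metric falls in the red band."""
    return any(classify(name, value) == "red" for name, value in metrics.items())
```

For the example trace above, 4 steps against an optimal 3 gives (4 - 3) / 3 ≈ 0.33 unnecessary steps, which classifies as green.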
Real failure modes to test for
- Tool argument hallucination: the agent calls the right tool with fabricated arguments (e.g. a made-up product ID)
- Observation fabrication: the agent writes an observation that contradicts the actual tool output
- Premature termination: the agent declares success before completing all required steps
- Infinite tool loops: the agent calls the same tool with the same arguments 3+ times in a row
- Context bleed: information from an earlier step incorrectly substitutes into a later one
- Wrong delegation: in multi-agent systems, the orchestrator routes to the wrong sub-agent
- Silent truncation: the final answer drops information from the last observation because context was nearly full
Build a red team eval set specifically for failure modes, separate from your main eval suite. Include tasks designed to trigger each known failure. Run it on every model upgrade — a new model that scores higher on your main eval but fails more red-team cases is not a safe upgrade.
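Some of these failure modes can be flagged mechanically from the trace before a human or judge model looks at it. The sketch below covers two of them, assuming the step format from the trace example: a loop detector for repeated identical calls, and a crude observation check that flags numbers in the observation that never appear in the raw tool output. The second is a heuristic only; it misses non-numeric fabrication and will flag legitimately derived numbers, so use it to prioritise review rather than to score.

```python
import json
import re


def has_tool_loop(steps: list, max_repeats: int = 3) -> bool:
    """True if the same tool is called with identical arguments
    max_repeats or more times in a row."""
    run_length, previous = 0, None
    for step in steps:
        key = (step["tool"], json.dumps(step["tool_args"], sort_keys=True))
        run_length = run_length + 1 if key == previous else 1
        if run_length >= max_repeats:
            return True
        previous = key
    return False


_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(step: dict) -> list:
    """Numbers in the observation that never appear in the tool output."""
    seen = set(_NUMBER.findall(step["tool_output"]))
    return [n for n in _NUMBER.findall(step["observation"]) if n not in seen]
```

On the example step, `unsupported_numbers` returns an empty list because 4.2 appears in the tool output; if the observation claimed $4.8B instead, it would return `["4.8"]`.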
LLM-as-judge for agent evals
LLM-as-judge works for agents if you judge at the right level. Instead of scoring individual steps, ask: given this task and this complete trace, did the agent succeed? If not, at which step did it go wrong and why? Provide the full trace, use a strong judge model, and ask for structured output: {success, failure_step, failure_type, reasoning}.
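A trajectory-level judge needs two pieces of glue: a prompt that includes the whole trace, and a parser that rejects malformed verdicts. The sketch below deliberately leaves out the model call, since that depends on your provider. The prompt wording is an assumption, and the set of allowed failure types is borrowed from the human annotation rubric above.

```python
import json

FAILURE_TYPES = {"tool_error", "hallucination", "loop", "gave_up", "off_task"}
REQUIRED_FIELDS = {"success", "failure_step", "failure_type", "reasoning"}


def build_judge_prompt(trace: dict) -> str:
    return (
        "You are grading an AI agent run. Given the task and the complete "
        "trace, decide whether the agent succeeded. If it failed, identify "
        "the first step where it went wrong and why.\n"
        'Reply with JSON only: {"success": bool, "failure_step": int or null, '
        '"failure_type": string or null, "reasoning": string}\n\n'
        f"Task: {trace['task']}\n"
        f"Trace:\n{json.dumps(trace['steps'], indent=2)}\n"
        f"Final answer: {trace['final_answer']}"
    )


def parse_verdict(raw: str, num_steps: int) -> dict:
    """Parse the judge's reply and check it against the expected schema."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = REQUIRED_FIELDS - set(verdict)
    if missing:
        raise ValueError(f"judge reply is missing fields: {sorted(missing)}")
    if not isinstance(verdict["success"], bool):
        raise ValueError("success must be a boolean")
    if verdict["success"]:
        return verdict
    step = verdict["failure_step"]
    if type(step) is not int or not 1 <= step <= num_steps:
        raise ValueError("failure_step must point at a step in the trace")
    if verdict["failure_type"] not in FAILURE_TYPES:
        raise ValueError(f"unknown failure_type: {verdict['failure_type']!r}")
    return verdict
```

Rejecting verdicts that point at a nonexistent step or an unknown failure type catches a judge that is itself hallucinating, which is worth counting as its own error category.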
Validate your judge model against human annotations on 100 examples. It should agree with the human label on more than 85% of them before you trust it at scale.
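The validation check itself is a few lines. The sketch below assumes pass/fail labels stored as booleans in matching order, and splits disagreements by direction, since a judge that passes runs humans failed is usually the more dangerous error. Function names are illustrative.

```python
def judge_agreement(human: list, judge: list) -> dict:
    """Agreement rate plus a breakdown of the two disagreement types."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Judge said pass, human said fail: failures slipping through.
        "false_pass": sum(j and not h for h, j in pairs),
        # Judge said fail, human said pass: good runs flagged needlessly.
        "false_fail": sum(h and not j for h, j in pairs),
    }


def judge_is_trusted(human, judge, min_examples=100, min_agreement=0.85) -> bool:
    stats = judge_agreement(human, judge)
    return len(human) >= min_examples and stats["agreement"] > min_agreement
```

Because the bar is strictly greater than 85%, a judge that matches humans on exactly 85 of 100 examples does not pass.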