GenAI Systems Lab
Agents & Tool Use 10 min read

How to Evaluate Multi-Step AI Agents

Why standard LLM evals break for agents. Task completion rate, tool call accuracy, trajectory quality, and building an agent eval harness.

Standard LLM evals measure one thing: did the model produce a good response to this prompt? That works fine for a chatbot or a summariser. It breaks completely for agents.

An agent might take 12 steps to complete a task, call 6 tools, and produce an answer. Evaluating only the final answer misses most of what actually happened, including where things went wrong. You can get the right answer for the wrong reasons, or the wrong answer despite doing everything correctly up to the last step.

The core problem: per-turn accuracy is a local metric, while agent quality is a global one. If each of 12 steps is correct 90% of the time, and the steps are independent with no chance to recover, the whole chain ends in a correct final answer only about 28% of the time (0.9^12 ≈ 0.28). Evaluate at the trajectory level, not the step level.
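The compounding arithmetic can be checked in a few lines. This is a minimal sketch that assumes every step must succeed and that step outcomes are independent; real agents can self-correct, so treat the number as an intuition rather than a prediction. The function name `chain_success_rate` is illustrative and not part of any library.

```python
def chain_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption: steps are independent and there is no self-correction.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# How quickly a 90%-per-step agent degrades as tasks get longer.
for steps in (1, 6, 12):
    print(steps, round(chain_success_rate(0.9, steps), 2))
```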

The 5 dimensions of agent evaluation

| Dimension | What it measures | How to measure | Target threshold |
| --- | --- | --- | --- |
| Task completion rate | Did the agent accomplish the end goal? | Human or LLM-as-judge binary: success/fail | > 85% for production |
| Tool call precision | When the agent called a tool, was it the right one with right args? | Compare to golden trace | > 90% |
| Tool call recall | Did the agent call all the tools it needed? | Check required tools vs. called tools | > 95% |
| Trajectory efficiency | Did the agent take the shortest sensible path? | Steps taken vs. optimal steps | < 1.5x optimal |
| Graceful failure rate | When it couldn't complete, did it fail cleanly? | Audit failure-case outputs | > 80% clean failures |
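Tool call precision and recall from the table above can be computed directly from a logged trace and its golden trace. The sketch below is one possible reading of those definitions: precision counts calls whose tool name and arguments both match a golden call, and recall counts required tool names that were called at least once. The function names and the exact matching rule are assumptions, not a standard.

```python
import json
from collections import Counter


def _call_key(step: dict) -> tuple:
    # Canonical JSON so argument key order does not affect matching.
    return (step["tool"], json.dumps(step.get("tool_args", {}), sort_keys=True))


def tool_call_precision(actual_steps: list, golden_steps: list) -> float:
    """Fraction of the agent's calls that match a golden call (tool and args)."""
    if not actual_steps:
        return 0.0
    actual = Counter(_call_key(s) for s in actual_steps)
    golden = Counter(_call_key(s) for s in golden_steps)
    matched = sum((actual & golden).values())  # multiset intersection
    return matched / len(actual_steps)


def tool_call_recall(actual_steps: list, golden_steps: list) -> float:
    """Fraction of required tools (by name) the agent called at least once."""
    required = {s["tool"] for s in golden_steps}
    if not required:
        return 1.0
    called = {s["tool"] for s in actual_steps}
    return len(required & called) / len(required)


golden_steps = [
    {"tool": "search", "tool_args": {"query": "Acme Corp Q3 2024 revenue"}},
    {"tool": "search", "tool_args": {"query": "Acme Corp Q2 2024 revenue"}},
    {"tool": "calculator", "tool_args": {"expression": "4.2 / 3.75"}},
]
# The agent made one extra search with vague arguments.
actual_steps = [
    golden_steps[0],
    {"tool": "search", "tool_args": {"query": "Acme revenue"}},
    golden_steps[1],
    golden_steps[2],
]
print(tool_call_precision(actual_steps, golden_steps))
print(tool_call_recall(actual_steps, golden_steps))
```

Exact argument matching is strict; for free-text arguments such as search queries, a looser comparison may be more appropriate.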

Hallucination in reasoning is a sixth dimension worth tracking separately: did the agent fabricate tool outputs, misquote its own previous observations, or reason from premises it invented? This is subtler than factual hallucination and harder to catch without trace logging.
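One narrow slice of this can be checked mechanically once traces are logged. The sketch below is a crude heuristic that is not from the original article: it flags steps whose observation contains a number that never appears in the raw tool output. It assumes each step carries `tool_output` and `observation` strings, and it will miss non-numeric fabrications entirely.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(step: dict) -> list:
    """Numbers in the observation that never appear in the raw tool output."""
    seen = set(_NUMBER.findall(step["tool_output"]))
    return [n for n in _NUMBER.findall(step["observation"]) if n not in seen]


def flag_suspect_steps(trace: dict) -> list:
    """Step numbers whose observations cite numbers the tool never returned."""
    return [s["step"] for s in trace["steps"] if unsupported_numbers(s)]


trace = {
    "steps": [
        {
            "step": 1,
            "tool_output": "Acme Corp reported Q3 2024 revenue of $4.2B...",
            "observation": "Q3 revenue is $4.2B",
        },
        {
            "step": 2,
            "tool_output": "Acme Corp reported Q2 2024 revenue of $3.75B...",
            "observation": "Q2 revenue is $3.9B",  # misquoted figure
        },
    ]
}
print(flag_suspect_steps(trace))
```

A flagged step is a prompt for review rather than proof of hallucination, since legitimate unit conversions or rounding also introduce new numbers.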

Why per-turn accuracy misses agent quality

Imagine a 6-step agent task. The agent makes a wrong tool call at step 3 but then self-corrects. The final answer is correct. Per-turn accuracy would score step 3 as a failure. But the real signal is: the agent caught its own error. That is actually good behaviour you want to preserve.

Conversely, an agent might produce a correct-looking final answer because it hallucinated an observation at step 4 and got lucky. Per-turn accuracy scores the final step highly. Trajectory eval catches the hallucination.

Never use per-turn LLM-as-judge scoring as your primary agent eval. It creates incentives for agents that look good locally while failing globally. Always have at least one end-to-end task completion metric.

Building an agent eval harness

Step 1: Trace logging

Every agent run must produce a complete, structured trace: step number, thought text (if ReAct), tool name, tool arguments, raw tool output, and the model's observation. Without this, debugging and evaluation are both impossible.

{
  "run_id": "run_abc123",
  "task": "Find Q3 revenue for Acme Corp and compare to Q2",
  "steps": [
    {
      "step": 1,
      "thought": "I need to search for Acme Corp Q3 revenue",
      "tool": "search",
      "tool_args": {"query": "Acme Corp Q3 2024 revenue"},
      "tool_output": "Acme Corp reported Q3 2024 revenue of $4.2B...",
      "observation": "Q3 revenue is $4.2B"
    }
  ],
  "final_answer": "Q3 revenue was $4.2B, up 12% from Q2's $3.75B",
  "success": true,
  "steps_taken": 4,
  "optimal_steps": 3
}
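A small logger is enough to produce traces in this shape. The following is a minimal sketch using only the standard library; the class names `Step` and `Trace` and the `log_step` method are hypothetical, and a real harness would also record timestamps, errors, and the optimal step count from the golden trace.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    step: int
    thought: str
    tool: str
    tool_args: Dict[str, Any]
    tool_output: str
    observation: str


@dataclass
class Trace:
    run_id: str
    task: str
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None
    success: Optional[bool] = None

    def log_step(self, thought, tool, tool_args, tool_output, observation) -> None:
        # Step numbers are assigned here so callers cannot skip or repeat one.
        self.steps.append(
            Step(len(self.steps) + 1, thought, tool, tool_args, tool_output, observation)
        )

    def to_json(self) -> str:
        record = asdict(self)  # converts nested Step dataclasses to dicts
        record["steps_taken"] = len(self.steps)
        return json.dumps(record, indent=2)


trace = Trace(run_id="run_abc123", task="Find Q3 revenue for Acme Corp and compare to Q2")
trace.log_step(
    thought="I need to search for Acme Corp Q3 revenue",
    tool="search",
    tool_args={"query": "Acme Corp Q3 2024 revenue"},
    tool_output="Acme Corp reported Q3 2024 revenue of $4.2B...",
    observation="Q3 revenue is $4.2B",
)
print(trace.to_json())
```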

Step 2: Golden trace construction

For each eval task, construct a golden trace: the correct sequence of tool calls with correct arguments and the correct final answer. This is labour-intensive — plan 30–60 minutes per golden example for complex tasks — but it is the only reliable ground truth you have.

Start with 50 golden examples covering: easy tasks, multi-hop tasks, tool chaining, ambiguous input, and tasks where the agent should refuse. Weight toward your real production traffic pattern.
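A quick coverage check helps confirm the golden set actually spans those five categories. This sketch assumes each golden example carries a `category` field with one of the labels below; both the field name and the label spellings are illustrative choices, not part of any fixed schema.

```python
from collections import Counter

REQUIRED_CATEGORIES = {
    "easy",
    "multi_hop",
    "tool_chaining",
    "ambiguous_input",
    "should_refuse",
}


def coverage_report(golden_examples: list) -> dict:
    """Count golden examples per category and list categories with none."""
    counts = Counter(example["category"] for example in golden_examples)
    missing = sorted(REQUIRED_CATEGORIES - set(counts))
    return {"total": len(golden_examples), "counts": dict(counts), "missing": missing}


golden_examples = [
    {"task": "Look up Acme Corp Q3 revenue", "category": "easy"},
    {"task": "Compare Acme Q3 revenue to Q2", "category": "multi_hop"},
    {"task": "Delete all customer records", "category": "should_refuse"},
]
print(coverage_report(golden_examples))
```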

Step 3: Replay testing

Replay testing means running a previous trace with mocked tool outputs — the tools return the same responses as the original run. This makes tests deterministic: if the agent's behaviour changes on the same inputs, you have a regression. Replay testing is to agents what unit tests are to functions.

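from collections import defaultdict, deque

def replay_test(golden_trace, agent):
    # Queue recorded outputs per tool so repeated calls to the same tool
    # replay in their original order instead of overwriting each other.
    recorded = defaultdict(deque)
    for step in golden_trace["steps"]:
        recorded[step["tool"]].append(step["tool_output"])

    def make_mock(name):
        # Each call returns the next recorded output for this tool;
        # an extra, unrecorded call raises IndexError.
        return lambda args: recorded[name].popleft()

    mock_tools = {name: make_mock(name) for name in list(recorded)}
    # agent.run and result.answer_matches are the harness's own interfaces.
    result = agent.run(task=golden_trace["task"], tools=mock_tools)

    called = [s["tool"] for s in result.steps]
    golden = [s["tool"] for s in golden_trace["steps"]]
    assert called == golden, f"Tool sequence drift: {called} vs {golden}"
    assert result.answer_matches(golden_trace["final_answer"], threshold=0.85)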

Step 4: Human annotation protocol

For tasks where LLM-as-judge is unreliable, human annotation is the ground truth. Define a clear rubric: task completion (pass/fail), reasoning quality (1–3), and failure mode (tool error / hallucination / loop / gave up / off-task). Aim for inter-annotator agreement of Cohen's kappa > 0.8 before trusting the scores.
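Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. The sketch below implements the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), in plain Python; the function name is illustrative, and it treats labels as unordered categories, which suits the pass/fail and failure-mode fields better than the 1–3 reasoning score.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: observed fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (75%), but because chance agreement is 50%, kappa is only 0.5, well short of the 0.8 target.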

Concrete metrics with thresholds

| Metric | Formula | Green | Yellow | Red (block ship) |
| --- | --- | --- | --- | --- |
| Task completion rate | success / total runs | > 85% | 70–85% | < 70% |
| Tool precision | correct tool calls / total calls | > 90% | 80–90% | < 80% |
| Unnecessary steps | (actual - optimal) / optimal | < 50% | 50–100% | > 100% |
| Reasoning hallucination | hallucinated obs / total steps | < 2% | 2–5% | > 5% |
| Graceful failure rate | clean fails / total fails | > 80% | 60–80% | < 60% |
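These thresholds translate naturally into a release gate. The sketch below encodes the table as data and classifies each metric; metric key names and the function names are assumptions, values are fractions rather than percentages, and a value sitting exactly on a boundary is treated as yellow.

```python
# metric: (green_bound, red_bound, higher_is_better)
THRESHOLDS = {
    "task_completion_rate": (0.85, 0.70, True),
    "tool_precision": (0.90, 0.80, True),
    "unnecessary_steps": (0.50, 1.00, False),
    "reasoning_hallucination": (0.02, 0.05, False),
    "graceful_failure_rate": (0.80, 0.60, True),
}


def classify(metric: str, value: float) -> str:
    """Return 'green', 'yellow', or 'red' for one metric value."""
    green, red, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Flip signs so the same comparisons work for lower-is-better metrics.
        value, green, red = -value, -green, -red
    if value > green:
        return "green"
    if value < red:
        return "red"
    return "yellow"


def release_gate(metrics: dict) -> dict:
    """Classify every metric; any red metric blocks the release."""
    statuses = {name: classify(name, value) for name, value in metrics.items()}
    return {"blocked": "red" in statuses.values(), "statuses": statuses}


run_metrics = {
    "task_completion_rate": 0.88,
    "tool_precision": 0.84,
    "unnecessary_steps": 0.33,  # e.g. 4 steps taken vs. 3 optimal
    "reasoning_hallucination": 0.06,
    "graceful_failure_rate": 0.75,
}
print(release_gate(run_metrics))
```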

Real failure modes to test for

Build a red team eval set specifically for failure modes, separate from your main eval suite. Include tasks designed to trigger each known failure. Run it on every model upgrade — a new model that scores higher on your main eval but fails more red-team cases is not a safe upgrade.
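The upgrade rule above can be expressed as a simple comparison between the current and candidate models. This is a sketch under stated assumptions: each model's results are summarised as a main-suite pass rate plus the set of red-team case IDs it failed, and the field names, case IDs, and the exact "safe" rule are illustrative.

```python
def compare_upgrade(current: dict, candidate: dict) -> dict:
    """Compare two models on the main suite and the red-team suite.

    Each argument: {"main_pass_rate": float, "red_team_failures": set of case IDs}.
    """
    newly_failing = sorted(candidate["red_team_failures"] - current["red_team_failures"])
    main_not_worse = candidate["main_pass_rate"] >= current["main_pass_rate"]
    more_red_team_failures = (
        len(candidate["red_team_failures"]) > len(current["red_team_failures"])
    )
    return {
        "main_not_worse": main_not_worse,
        "newly_failing_red_team_cases": newly_failing,
        # Assumed rule: a higher main score never offsets extra red-team failures.
        "safe_upgrade": main_not_worse and not more_red_team_failures,
    }


current_model = {"main_pass_rate": 0.82, "red_team_failures": {"loop_01"}}
candidate_model = {"main_pass_rate": 0.88, "red_team_failures": {"loop_01", "halluc_03"}}
print(compare_upgrade(current_model, candidate_model))
```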

LLM-as-judge for agent evals

LLM-as-judge works for agents if you judge at the right level. Instead of scoring individual steps, ask: given this task and this complete trace, did the agent succeed? If not, at which step did it go wrong and why? Provide the full trace, use a strong judge model, and ask for structured output: {success, failure_step, failure_type, reasoning}.
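A trajectory-level judge needs two pieces: a prompt that includes the full trace, and a strict parser for the structured verdict. The sketch below shows both without calling any model; the prompt wording and function names are assumptions, and the raw judge response would come from whatever model client you use.

```python
import json

REQUIRED_FIELDS = {"success", "failure_step", "failure_type", "reasoning"}


def build_judge_prompt(task: str, trace: dict) -> str:
    """Assemble a trajectory-level judging prompt from the task and full trace."""
    return (
        "You are evaluating a complete AI agent run.\n"
        f"Task: {task}\n"
        f"Full trace (JSON):\n{json.dumps(trace, indent=2)}\n\n"
        "Did the agent succeed? If not, at which step did it go wrong and why?\n"
        'Reply with JSON only, using the keys "success", "failure_step", '
        '"failure_type", and "reasoning".'
    )


def parse_verdict(raw: str) -> dict:
    """Parse and validate the judge's JSON verdict; raise ValueError if malformed."""
    verdict = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    missing = REQUIRED_FIELDS - set(verdict)
    if missing:
        raise ValueError(f"verdict is missing fields: {sorted(missing)}")
    if not isinstance(verdict["success"], bool):
        raise ValueError("'success' must be true or false")
    if not verdict["success"] and verdict["failure_step"] is None:
        raise ValueError("a failed run must name the failure step")
    return verdict


# A hand-written response standing in for real judge output.
raw_response = (
    '{"success": false, "failure_step": 3, "failure_type": "hallucination", '
    '"reasoning": "Observation at step 3 cites a figure absent from the tool output."}'
)
print(parse_verdict(raw_response)["failure_step"])
```

Rejecting malformed verdicts outright, instead of guessing at missing fields, keeps judge errors visible in the eval results.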

Validate your judge model against human annotations on 100 examples. It should agree with humans > 85% of the time before you trust it at scale.
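The validation check itself is a simple agreement rate over paired labels. This sketch assumes binary success labels from humans and from the judge on the same runs; the function names and default arguments are illustrative, with the defaults taken from the figures above.

```python
def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of runs where the judge's label equals the human label."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def judge_is_trusted(
    human_labels: list,
    judge_labels: list,
    min_examples: int = 100,
    min_agreement: float = 0.85,
) -> bool:
    """True only with enough examples and agreement above the threshold."""
    if len(human_labels) < min_examples:
        return False
    return agreement_rate(human_labels, judge_labels) > min_agreement


# Synthetic labels for illustration: the judge disagrees on 10 of 100 runs.
human = [True] * 60 + [False] * 40
judge = [True] * 55 + [False] * 5 + [False] * 35 + [True] * 5
print(agreement_rate(human, judge), judge_is_trusted(human, judge))
```

Raw agreement can look high when almost every run succeeds, so it is worth also checking agreement separately on the failed runs.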
