GenAI Systems Lab Open interactive version →
Evaluation 12 min read

Testing Agentic Systems: Deterministic, Probabilistic, Trajectory, and Red Team

Why unit tests are not enough for agents. Mock tool strategies. Golden test cases. Scenario-based evaluation and trajectory testing. LLM-as-judge calibration for agent behavior. Red teaming: prompt injection via tool results, goal hijacking, resource exhaustion.

Prerequisites: agent architecture basics, evaluation metrics basics. After this post you will be able to design a complete testing strategy for an agentic system: deterministic unit tests, mock tool strategies, scenario-based evaluation, trajectory testing, and LLM-as-judge patterns.

Testing an agent is not like testing an API. An API has deterministic outputs for deterministic inputs. An agent has probabilistic planning, multi-step tool use, variable context, and failure modes that only appear across multiple turns.

The mistake is applying standard software testing discipline to agents and calling it done. The right approach has two layers: deterministic tests for the parts you can control, and probabilistic evaluation for the parts you cannot.

Layer 1: Deterministic Testing

Mock your tools. Test your scaffolding. Verify your schemas.

# Mock tool testing pattern
import pytest
from unittest.mock import patch, MagicMock

def test_agent_sends_email_once_on_success():
    mock_email = MagicMock(return_value={'status': 'sent', 'message_id': 'msg-123'})
    
    with patch('agent.tools.email_api.send', mock_email):
        result = agent.run_task('Send welcome email to customer-456')
    
    # Verify tool called exactly once — idempotency matters
    assert mock_email.call_count == 1
    assert result.status == 'completed'
    
def test_agent_does_not_retry_write_on_failure():
    mock_email = MagicMock(side_effect=TimeoutError('API timeout'))
    
    with patch('agent.tools.email_api.send', mock_email):
        result = agent.run_task('Send welcome email to customer-456')
    
    # Write tool failed: agent should escalate, NOT retry automatically
    assert mock_email.call_count == 1  # not 3
    assert result.status == 'escalated'

Layer 2: Probabilistic Evaluation

For the parts you cannot make deterministic — LLM planning decisions, generated responses, multi-turn behavior — you need scenario-based evaluation, not unit tests.

LLM-as-Judge for Agent Behavior

Some agent behaviors cannot be evaluated with rules. 'Did the agent explain the cancellation clearly?' requires semantic judgment. LLM-as-judge handles this — but with known limitations.

Red Teaming Agents

Before production, run adversarial tests specifically designed to find agent failure modes:

The fundamental testing principle for agents: test the behavior of the system under realistic conditions, not the output of a single function under ideal conditions. An agent that passes all unit tests can still fail spectacularly on a real user task. Scenario tests, trajectory evaluation, and red teaming are not optional extras — they are the core of agent quality assurance.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →