
Designing an Agent System for Production: State, Tools, and Failure Handling

How to design an agent that doesn't spiral. State management, tool contracts, human-in-the-loop gates, reliability budgets, and rollback strategies.

Designing a single-agent demo is easy. Designing an agent system that ships to production — one that handles failures gracefully, doesn't accrue runaway costs, stays on task, and can be debugged when it breaks — is a fundamentally different problem. This is the architecture guide for production agent systems.

An agent that works 95% of the time isn't production-ready. An agent that fails gracefully 100% of the time is.

When to build an agent vs. a pipeline

Not every multi-step AI workflow needs an agent. Agents introduce non-determinism, failure cascades, and debugging complexity. Use an agent when the task requires dynamic tool selection (you can't hardcode the order), when recovery from failures requires judgment, or when the task has unbounded branching that a fixed pipeline can't handle. For everything else, a deterministic pipeline with LLM steps is cheaper, faster, and easier to test.

| Use case | Agent? | Why |
| --- | --- | --- |
| Extract structured fields from a document | No (pipeline) | Fixed steps, deterministic output |
| Customer support that may need to look up orders, policies, or escalate | Yes | Dynamic tool selection based on query type |
| Summarise 50 documents into a report | No (map-reduce pipeline) | Fixed structure, parallelisable |
| Debug a failing CI pipeline by reading logs, forming hypotheses, running fixes | Yes | Requires judgment, unknown number of steps |
| Classify and route incoming support tickets | No (classifier + router) | Fixed categories, no iteration needed |
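
To make the "classifier + router" row concrete, here is a minimal sketch of a deterministic pipeline with a single LLM step. Everything in it is an assumption for illustration: `classify_ticket` is a keyword stub standing in for one LLM classification call, and the category and queue names are hypothetical.

```python
from typing import Callable

# Hypothetical mapping from category label to destination queue.
ROUTES: dict[str, str] = {
    "billing": "billing-queue",
    "shipping": "logistics-queue",
    "other": "general-queue",
}

def classify_ticket(text: str) -> str:
    """Stand-in for one LLM classification call (keyword stub so the sketch runs)."""
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    if "delivery" in lowered or "tracking" in lowered:
        return "shipping"
    return "other"

def route_ticket(text: str, classify: Callable[[str], str] = classify_ticket) -> str:
    """Fixed two-step pipeline: classify once, then look up the route."""
    label = classify(text)
    if label not in ROUTES:
        # The model may return a label outside the fixed set; fall back deterministically.
        label = "other"
    return ROUTES[label]
```

There is no loop and no tool selection here: the control flow is fixed, so the only non-deterministic part is the label itself, which is easy to test in isolation.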

Architecture patterns

Pattern 1: Single agent with tools

The simplest production agent: one LLM, a tool registry, an agentic loop. Suitable for most use cases. Limitations: context fills with tool results over long runs; single point of failure; no parallelism.

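# Assumes llm(), AgentResult, ToolResult, and ValidationError are defined elsewhere.
# llm() is a thin wrapper around your model client that returns an object with
# stop_reason, content, text, and tool_calls attributes.
class ProductionAgent: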
    def __init__(self, tools, system_prompt, max_steps=25):
        self.tools = {t.name: t for t in tools}
        self.system_prompt = system_prompt
        self.max_steps = max_steps

    def run(self, task: str) -> AgentResult:
        messages = [{"role": "user", "content": task}]
        steps = 0
        trace = []

        while steps < self.max_steps:
            response = llm(self.system_prompt, messages)
            trace.append({"step": steps, "response": response})

            if response.stop_reason == "end_turn":
                return AgentResult(success=True, output=response.text, trace=trace)

            if response.stop_reason == "tool_use":
                tool_results = []
                for tool_call in response.tool_calls:
                    # Validate before executing
                    result = self._execute_tool(tool_call)
                    tool_results.append(result)
                    trace.append({"step": steps, "tool": tool_call.name, "result": result})
                messages.append({"role": "assistant", "content": response.content})
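                messages.append({"role": "user", "content": tool_results})
            else:
                # Any other stop reason (e.g. truncated output) would repeat the same
                # request until max_steps, so fail fast instead
                return AgentResult(success=False, error=f"unexpected_stop_reason: {response.stop_reason}", trace=trace)

            steps += 1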

        return AgentResult(success=False, error="max_steps_exceeded", trace=trace)

    def _execute_tool(self, tool_call):
        tool = self.tools.get(tool_call.name)
        if not tool:
            return ToolResult(error=f"Unknown tool: {tool_call.name}")
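        try:
            validated = tool.schema.validate(tool_call.input)
            return tool.execute(validated)
        except ValidationError as e:
            return ToolResult(error=f"Invalid arguments: {e}")
        except Exception as e:
            # A crashing tool is reported back to the model instead of killing the run
            return ToolResult(error=f"Tool {tool_call.name} failed: {e}")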

Pattern 2: Supervisor + subagents

An orchestrator agent receives the task, decomposes it, and delegates to specialised subagents. Each subagent has a narrower set of tools and a focused system prompt. The orchestrator synthesises results. This is the right pattern when: different subtasks need different specialisations, subtasks can run in parallel, or the task naturally decomposes into independent work streams.

In the supervisor pattern, the orchestrator should never have write/action tools — only read tools and the ability to spawn subagents. The subagents hold the action capability. This limits blast radius: a misbehaving orchestrator can't directly take destructive actions.
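
One way to enforce the read-only rule is at registry construction time rather than in the prompt. The sketch below is an assumption about how that could look, not a reference implementation: `Tool`, `ToolRegistry`, the `writes` flag, and the two example tools are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    fn: Callable[..., object]
    writes: bool  # True if the tool changes external state

class ToolRegistry:
    """Registry that refuses write tools unless explicitly allowed."""

    def __init__(self, tools: list[Tool], allow_writes: bool):
        rejected = [t.name for t in tools if t.writes and not allow_writes]
        if rejected:
            raise PermissionError(f"write tools not allowed here: {rejected}")
        self._tools = {t.name: t for t in tools}

    def call(self, name: str, **kwargs):
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name].fn(**kwargs)

# Hypothetical tools: one read, one write.
get_order = Tool("get_order", lambda order_id: {"id": order_id, "status": "shipped"}, writes=False)
refund_order = Tool("refund_order", lambda order_id: f"refunded {order_id}", writes=True)

# The orchestrator only ever sees read tools; the subagent holds the action capability.
orchestrator_tools = ToolRegistry([get_order], allow_writes=False)
refund_subagent_tools = ToolRegistry([get_order, refund_order], allow_writes=True)
```

Because the check runs when the registry is built, a misconfigured orchestrator fails at startup instead of at the moment it tries a destructive action.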

Pattern 3: Specialised agents + message bus

For large-scale systems: individual specialised agents (research agent, writer agent, editor agent, validation agent) communicate via a message queue. No central orchestrator — each agent subscribes to relevant message types and publishes outputs. Highly scalable but significantly more complex to debug and coordinate.
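
The publish/subscribe flow can be sketched in-process to show the shape of the pattern. This is a toy under stated assumptions: a real deployment would use an actual message queue, and `MessageBus`, the message type names, and the three agent functions (plain functions standing in for LLM-backed agents) are hypothetical.

```python
from collections import defaultdict, deque

class MessageBus:
    """In-memory stand-in for a message queue with typed topics."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()
        self.log = []  # every delivered message, in order, for debugging

    def subscribe(self, msg_type, handler):
        self._subscribers[msg_type].append(handler)

    def publish(self, msg_type, payload):
        self._queue.append((msg_type, payload))

    def run(self, max_messages=100):
        """Deliver queued messages until the queue drains or the cap is hit."""
        processed = 0
        while self._queue and processed < max_messages:
            msg_type, payload = self._queue.popleft()
            self.log.append((msg_type, payload))
            for handler in self._subscribers[msg_type]:
                handler(payload)
            processed += 1
        return processed

bus = MessageBus()

def research_agent(topic):
    bus.publish("research.done", f"notes on {topic}")

def writer_agent(notes):
    bus.publish("draft.done", f"draft from {notes}")

def editor_agent(draft):
    bus.publish("final", draft.upper())

bus.subscribe("task.new", research_agent)
bus.subscribe("research.done", writer_agent)
bus.subscribe("draft.done", editor_agent)

bus.publish("task.new", "agent safety")
bus.run()
```

The `max_messages` cap and the delivery log are there because, with no central orchestrator, a message loop between agents is otherwise hard to notice and hard to reconstruct after the fact.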

Tool design — the most overlooked component

The quality of your tools often determines agent performance more than the quality of your LLM. A well-designed tool is narrow, composable, and has excellent error messages. A poorly designed tool has ambiguous parameters, broad scope, and returns opaque errors that the model can't recover from.

| Tool design principle | Good example | Bad example |
| --- | --- | --- |
| Narrow scope | get_order_status(order_id) | do_database_operation(query, type, table) |
| Typed parameters | date: ISO8601 string, required | date: string (any format) |
| Actionable errors | "Order #1234 not found. Valid format: #NNNN" | "Error: null pointer exception" |
| Idempotent by default | update_ticket_status(id, status): safe to retry | send_email(to, body): each call fires an email |
| Dry-run mode | archive_records(ids, dry_run=False) | archive_records(ids): no preview |

State management

Long-running agents need persistent state that survives context window limits and can be resumed after failures. Three levels of state to manage:

import json
import sqlite3
import time
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AgentState:
    task_id: str
    original_task: str
    steps_completed: int
    notes: dict         # agent-written scratchpad (must be JSON-serialisable)
    status: str         # running | paused | completed | failed

class StateManager:
    def __init__(self, db_path="agent_state.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS states (
            task_id TEXT PRIMARY KEY, data TEXT, updated_at REAL
        )""")

    def save(self, state: AgentState):
        # Timestamp is taken in Python because SQLite's unixepoch() needs SQLite 3.38+
        self.db.execute("INSERT OR REPLACE INTO states VALUES (?, ?, ?)",
            (state.task_id, json.dumps(asdict(state)), time.time()))
        self.db.commit()

    def load(self, task_id: str) -> Optional[AgentState]:
        # Returns None when no checkpoint exists for this task
        row = self.db.execute("SELECT data FROM states WHERE task_id=?", (task_id,)).fetchone()
        return AgentState(**json.loads(row[0])) if row else None
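
To show how persisted state supports resuming after a failure, here is a self-contained sketch of a checkpoint-and-resume loop. It uses plain dicts and an in-memory SQLite database to stay short; `run_task`, `crash_at`, and `make_step` are hypothetical names, and the "steps" are ordinary functions standing in for agent steps.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS states (task_id TEXT PRIMARY KEY, data TEXT)")
    return db

def save(db, state):
    db.execute("INSERT OR REPLACE INTO states VALUES (?, ?)",
               (state["task_id"], json.dumps(state)))
    db.commit()

def load(db, task_id):
    row = db.execute("SELECT data FROM states WHERE task_id=?", (task_id,)).fetchone()
    return json.loads(row[0]) if row else None

def run_task(db, task_id, steps, crash_at=None):
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    state = load(db, task_id) or {
        "task_id": task_id, "steps_completed": 0, "notes": {}, "status": "running",
    }
    for i in range(state["steps_completed"], len(steps)):
        if crash_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state["notes"][f"step_{i}"] = steps[i]()
        state["steps_completed"] = i + 1
        save(db, state)  # checkpoint only after the step has finished
    state["status"] = "completed"
    save(db, state)
    return state

calls = []

def make_step(name):
    def step():
        calls.append(name)
        return name.upper()
    return step

db = open_store()
steps = [make_step(n) for n in "abc"]
try:
    run_task(db, "t1", steps, crash_at=2)   # first attempt dies before step 2
except RuntimeError:
    pass
final = run_task(db, "t1", steps)            # second attempt resumes at step 2
```

Each step runs exactly once across both attempts. Note that a crash in the middle of a step would cause that step to be retried on resume, which is one reason idempotent tools matter.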

Safety and control mechanisms

A production agent without control mechanisms is not a product — it's a liability. These are non-negotiable:
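
Two of the mechanisms named in this article's introduction, reliability budgets and human-in-the-loop gates, can be sketched as follows. This is an illustrative assumption rather than a prescribed design: `RunBudget`, `gated_call`, the `DESTRUCTIVE` set, and the `approve` callback (which in practice would block on a human decision) are hypothetical.

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when a run goes over its step or token budget."""

@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        """Record one step and its token usage; raise once either limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget exceeded ({self.steps}/{self.max_steps})")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded ({self.tokens}/{self.max_tokens})")

# Hypothetical names of tools that must never run without sign-off.
DESTRUCTIVE = {"delete_account", "issue_refund"}

def gated_call(tool_name, args, execute, approve):
    """Run a tool, but route destructive tools through an approval callback first."""
    if tool_name in DESTRUCTIVE and not approve(tool_name, args):
        return {"ok": False, "error": f"{tool_name} was rejected by a human reviewer"}
    return {"ok": True, "result": execute(**args)}
```

Calling `budget.charge(...)` once per loop iteration turns runaway cost into a hard failure the caller can handle, and the rejection message from `gated_call` is returned as data so the agent can explain the outcome instead of retrying blindly.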

Observability for agents

Traditional request/response observability is not enough for agents, because a single task fans out into many model calls and tool calls. You need trace-level observability: a hierarchical view of every step in a task run, with timing, token counts, and tool call details at each level. OpenTelemetry with a span-per-step model is a common approach. Tools like Langfuse, Phoenix, and LangSmith visualise agent traces natively.
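
The span-per-step idea can be illustrated without any dependencies. The sketch below is a standard-library stand-in, not the OpenTelemetry API: `Tracer`, its `span` method, and the attribute names and token counts are hypothetical, chosen only to show the hierarchy of task, step, and tool call.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Minimal hierarchical tracer: each span records its parent, attributes, and duration."""

    def __init__(self):
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans, outermost first

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attrs": dict(attrs),
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - record["start"]
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("task", task_id="t1"):
    with tracer.span("step-0", input_tokens=812, output_tokens=64):
        with tracer.span("tool:get_order_status") as tool_span:
            tool_span["attrs"]["ok"] = True
```

Inner spans finish first, so `tracer.spans` lists the tool call, then the step, then the task. A production system would emit the same structure through a real tracing SDK so that existing backends can render it.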

The two most important agent metrics in production: task success rate (end-to-end — did the agent complete its goal?) and cost per task (total tokens used across all steps and subagents). If you can only instrument two things, instrument these.
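
Both metrics fall out of per-run records. A minimal sketch, assuming each run record carries a success flag and token totals already summed across steps and subagents; the field names and the per-1k-token prices are placeholders, not real pricing.

```python
def summarise_runs(runs, usd_per_1k_input, usd_per_1k_output):
    """Compute end-to-end task success rate and average cost per task."""
    if not runs:
        raise ValueError("no runs to summarise")
    total_cost = sum(
        r["input_tokens"] / 1000 * usd_per_1k_input
        + r["output_tokens"] / 1000 * usd_per_1k_output
        for r in runs
    )
    successes = sum(1 for r in runs if r["success"])
    return {
        "task_success_rate": successes / len(runs),
        "cost_per_task_usd": total_cost / len(runs),
    }

# Hypothetical run records and placeholder prices.
runs = [
    {"success": True, "input_tokens": 10_000, "output_tokens": 2_000},
    {"success": False, "input_tokens": 30_000, "output_tokens": 2_000},
]
summary = summarise_runs(runs, usd_per_1k_input=1.0, usd_per_1k_output=2.0)
```

Failed runs are included in the cost average on purpose: a task that burns tokens and then fails is exactly the case these two numbers are meant to expose together.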

Testing agent systems

Agents are hard to unit test because they're non-deterministic. The pragmatic approach: deterministic integration tests with mocked tools (test that the right tools are called in the right order for known inputs), end-to-end eval with a golden task set (N tasks with defined acceptance criteria — pass if the final output meets criteria), and chaos testing (inject tool failures at random steps — verify graceful recovery).
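
The first and third approaches can be sketched with a scripted model and mocked tools. Everything here is a simplified assumption: `run_agent` is a tiny loop rather than a full agent, `scripted_llm` replaces the model with fixed rules so the test is deterministic, and `flaky` injects a fixed number of failures (instead of random ones) so the chaos test is reproducible.

```python
def run_agent(llm, tools, task, max_steps=5):
    """Tiny agent loop: llm(history) returns ("call", tool_name, args) or ("final", text)."""
    history = [("task", task)]
    calls = []
    for _ in range(max_steps):
        action = llm(history)
        if action[0] == "final":
            return {"success": True, "output": action[1], "calls": calls}
        _, name, args = action
        calls.append(name)
        try:
            result = tools[name](**args)
        except Exception as e:
            result = f"ERROR: {e}"  # tool failures are fed back to the model
        history.append((name, result))
    return {"success": False, "output": None, "calls": calls}

def scripted_llm(history):
    """Deterministic stand-in for the model: look up the order, retry on error."""
    last_kind, last_value = history[-1]
    if last_kind == "task" or str(last_value).startswith("ERROR"):
        return ("call", "get_order_status", {"order_id": "#1234"})
    return ("final", f"Order status: {last_value}")

def flaky(tool, fail_times):
    """Wrap a tool so its first fail_times calls raise (injected failure)."""
    remaining = {"n": fail_times}
    def wrapped(**kwargs):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TimeoutError("injected failure")
        return tool(**kwargs)
    return wrapped

def mock_status(order_id):
    return "shipped"

def test_right_tools_in_right_order():
    result = run_agent(scripted_llm, {"get_order_status": mock_status}, "where is #1234?")
    assert result["calls"] == ["get_order_status"]
    assert result["output"] == "Order status: shipped"

def test_recovers_from_injected_failures():
    tools = {"get_order_status": flaky(mock_status, fail_times=2)}
    result = run_agent(scripted_llm, tools, "where is #1234?")
    assert result["success"] and len(result["calls"]) == 3

def test_fails_gracefully_when_tool_never_recovers():
    tools = {"get_order_status": flaky(mock_status, fail_times=10)}
    result = run_agent(scripted_llm, tools, "where is #1234?")
    assert result["success"] is False and len(result["calls"]) == 5
```

The last test is the important one for production: when recovery is impossible, the run ends with a clean failure result inside its step limit rather than an unhandled exception.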
