GenAI Systems Lab
Agents & Tool Use 12 min read

Building Reliable Agents: Loops, Tools, and Failure Modes

How to structure agent loops, tool calling patterns, retry logic, and the five failure modes that kill agent reliability in production.

Most agent demos work. Most agents in production fail. The gap is reliability — and reliability comes from how you structure the loop, handle failures, and constrain the action space. Here's what I've learned.

The Agent Loop

A reliable agent loop has four components: observe (what's the current state?), plan (what action should I take?), act (execute the action), and verify (did it work?). Most agents skip verify — they generate an action, execute it, and move to the next step regardless of outcome. That's where cascading failures start.

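import asyncio

# `llm` and `build_state` are supplied by the application; they are not defined here.
async def agent_loop(task, tools, max_steps=10):
    memory = []
    for step in range(max_steps):
        # Observe: assemble the task plus everything seen so far
        state = build_state(task, memory)
        # Plan: ask the model for the next action
        action = await llm.generate(state, tools)
        if action.type == "final_answer":
            return action.content
        # Act: execute with a timeout; every failure (including an unknown
        # tool name) becomes a structured error instead of crashing the loop
        try:
            result = await asyncio.wait_for(tools[action.name](**action.args), timeout=30)
        except asyncio.TimeoutError:
            result = {"error": "Tool timed out"}
        except Exception as e:
            result = {"error": str(e)}
        # Verify + remember: record whether the step succeeded so the next
        # planning call can see the outcome
        ok = not (isinstance(result, dict) and "error" in result)
        memory.append({"action": action, "result": result, "ok": ok})
    return "Max steps reached: partial result"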
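
The loop above turns each tool failure into an error result but never retries. One way to add retry logic for transient failures is sketched below. It assumes the tool is idempotent, and the names `call_with_retry` and `TransientToolError`, along with the backoff parameters, are hypothetical and chosen for illustration.

```python
import asyncio
import random


class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a rate limit)."""


async def call_with_retry(tool, args, *, attempts=3, timeout=30.0, base_delay=0.5):
    """Call an async tool, retrying timeouts and transient errors with backoff.

    Returns the tool's result, or a structured error dict once attempts run out.
    Only safe for idempotent tools, since a timed-out call may still have run.
    """
    last_error = "no attempts made"
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(tool(**args), timeout=timeout)
        except asyncio.TimeoutError:
            last_error = "Tool timed out"
        except TransientToolError as e:
            last_error = str(e)
        except Exception as e:
            # Anything else is treated as permanent: retrying will not help.
            return {"error": str(e), "retryable": False}
        if attempt < attempts - 1:
            # Exponential backoff with jitter so retries do not synchronize.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    return {"error": last_error, "retryable": True, "attempts": attempts}
```

Separating retryable from permanent errors matters here: retrying a bad argument only burns steps, while retrying a rate limit usually succeeds.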

Tool Design

Good tools are narrow, idempotent, and return structured errors. Bad tools are broad (do_everything), have side effects that can't be safely retried, and return vague error messages. The model will misuse broad tools. Narrow tools force it to get each call right before it can make progress, which is a feature, not a bug: a narrow call is much easier to validate before it runs.
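
To make those three properties concrete, here is a minimal sketch of a narrow tool. The refund domain, the `RefundTool` class, the `idempotency_key` parameter, and the error codes are all assumptions made for illustration, and the in-memory dict stands in for whatever durable store a real system would use.

```python
from dataclasses import dataclass, field


def _error(code, message):
    """Build a structured error the model (and your logs) can act on."""
    return {"ok": False, "error": {"code": code, "message": message}}


@dataclass
class RefundTool:
    """Hypothetical narrow tool: refund one order, at most once per key."""
    max_amount_cents: int = 50_000
    _done: dict = field(default_factory=dict)  # idempotency_key -> prior result

    def __call__(self, order_id, amount_cents, idempotency_key):
        # Validate arguments before any side effect; narrow inputs make this easy.
        if not isinstance(order_id, str) or not order_id:
            return _error("invalid_order_id", "order_id must be a non-empty string")
        if (isinstance(amount_cents, bool) or not isinstance(amount_cents, int)
                or not 0 < amount_cents <= self.max_amount_cents):
            return _error("invalid_amount",
                          f"amount_cents must be an integer in 1..{self.max_amount_cents}")
        # Idempotency: a repeated key returns the original result, no second refund.
        if idempotency_key in self._done:
            return self._done[idempotency_key]
        result = {"ok": True, "order_id": order_id, "refunded_cents": amount_cents}
        self._done[idempotency_key] = result
        return result
```

Because the tool does exactly one thing, every argument can be checked against a tight rule, and a retried call with the same key is harmless.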

The Five Failure Modes

Observability Is Not Optional

Log every step: the full prompt, the generated action, the tool inputs and outputs, latency, and token count. Without this, debugging a failed agent run is guesswork. Use structured logging — not print statements — so you can trace specific run IDs across steps. Build a replay tool: given a run ID, reconstruct exactly what the agent saw and did at each step.
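
A minimal sketch of what that could look like with the standard library: one JSON line per step, keyed by run ID, plus a replay function that rebuilds a run from those lines. The field names and the `log_step`, `new_run_id`, and `replay` helpers are hypothetical; a production setup would likely ship these records to a log store instead of reading raw lines.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")


def new_run_id():
    """Generate a unique ID to tag every step of one agent run."""
    return uuid.uuid4().hex


def log_step(run_id, step, prompt, action, tool_input, tool_output, latency_ms, tokens):
    """Emit one JSON line per step so runs can be filtered by run_id."""
    record = {
        "run_id": run_id, "step": step, "ts": time.time(),
        "prompt": prompt, "action": action,
        "tool_input": tool_input, "tool_output": tool_output,
        "latency_ms": latency_ms, "tokens": tokens,
    }
    # default=str keeps logging from failing on values JSON cannot encode.
    logger.info(json.dumps(record, default=str))
    return record


def replay(lines, run_id):
    """Rebuild the ordered steps of one run from JSON log lines."""
    steps = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the log
        if isinstance(record, dict) and record.get("run_id") == run_id:
            steps.append(record)
    return sorted(steps, key=lambda r: r.get("step", 0))
```

Calling `replay(open("agent.log"), run_id)` (with a hypothetical log file path) would then return what the agent saw and did at each step, in order.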

Production rule: never deploy an agent without a max_steps limit, per-tool timeout, total cost budget, and a human escalation path. Any agent that can spend money or modify data needs all four.
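
As a rough illustration of how the four limits can live in one place, the sketch below tracks steps and spend for a run and raises when either limit is exceeded. `RunBudget`, `EscalateToHuman`, and the default values are hypothetical; how cost per step is measured and what escalation actually does depend on your system.

```python
from dataclasses import dataclass


class EscalateToHuman(Exception):
    """Raised when a limit is hit; the caller hands the run to a person."""


@dataclass
class RunBudget:
    max_steps: int = 10
    max_cost_usd: float = 1.00
    tool_timeout_s: float = 30.0  # pass this to asyncio.wait_for for each tool call
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, step_cost_usd):
        """Record one step and its cost; escalate as soon as a limit is exceeded."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise EscalateToHuman(f"step limit {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise EscalateToHuman(f"cost budget ${self.max_cost_usd:.2f} exceeded")
```

The agent loop would call `budget.charge(...)` once per step and catch `EscalateToHuman` at the top level, so hitting a limit ends in a handoff, not a silent partial result.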
