
Building Reliable Agents: Loops, Tools, and Failure Modes

How to structure agent loops, tool calling patterns, retry logic, and the five failure modes that kill agent reliability in production.

Most agent demos work. Most agents in production fail. The gap is reliability — and reliability comes from how you structure the loop, handle failures, and constrain the action space. Here's what I've learned.

The Agent Loop

A reliable agent loop has four components: observe (what's the current state?), plan (what action should I take?), act (execute the action), and verify (did it work?). Most agents skip verify — they generate an action, execute it, and move to the next step regardless of outcome. That's where cascading failures start.

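import asyncio

# Assumes `llm` (a model client with an async generate()) and `build_state`
# are defined elsewhere, and that every entry in `tools` is an async callable.
async def agent_loop(task, tools, max_steps=10):
    memory = []
    for step in range(max_steps):
        # Observe: assemble the task plus everything done so far
        state = build_state(task, memory)
        # Plan: ask the model for the next action
        action = await llm.generate(state, tools)
        if action.type == "final_answer":
            return action.content
        # Act: execute the tool with a per-call timeout
        if action.name not in tools:
            result = {"error": f"Unknown tool: {action.name}"}
        else:
            try:
                result = await asyncio.wait_for(tools[action.name](**action.args), timeout=30)
            except asyncio.TimeoutError:
                result = {"error": "Tool timed out"}
            except Exception as e:
                result = {"error": str(e)}
        # Verify + remember: flag failures so the next step can react to them
        failed = isinstance(result, dict) and "error" in result
        memory.append({"action": action, "result": result, "failed": failed})
    return "Max steps reached: partial result"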
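
One way to make the verify step explicit is a small function that inspects each tool result before the loop moves on. The sketch below is a minimal illustration, not a definitive implementation: `Verdict`, `verify_result`, and `expect_keys` are hypothetical names, and it assumes failures arrive as a dict with an "error" key, as in the loop above.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    ok: bool
    feedback: str  # text to show the model on the next step


def verify_result(result, expect_keys=()):
    """Check a tool result before the loop moves on.

    Assumes failures come back as a dict with an "error" key.
    `expect_keys` is a hypothetical per-tool list of fields that a
    successful result must contain.
    """
    if not isinstance(result, dict):
        return Verdict(False, f"Expected a dict result, got {type(result).__name__}")
    if "error" in result:
        return Verdict(False, f"Tool failed: {result['error']}")
    missing = [k for k in expect_keys if k not in result]
    if missing:
        return Verdict(False, f"Result is missing fields: {missing}")
    return Verdict(True, "ok")
```

Storing `verdict.feedback` in memory next to the raw result gives the model a plain statement of what went wrong instead of leaving it to infer failure from the payload.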

Tool Design

Good tools are narrow, idempotent, and return structured errors. Bad tools are broad (do_everything), have side effects that can't be safely retried, and return vague error messages. The model will misuse broad tools. Narrow tools force it to make the right call before it can make progress, which is a feature, not a bug: a narrow call is much easier to validate.
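
To make those three properties concrete, here is a small sketch of one read tool and one write tool. Everything in it is an assumption for illustration: the `ORDERS` store, the tool names, the error shape, and the idempotency-key convention are hypothetical, not taken from any particular framework.

```python
ORDERS = {"ord_1": "shipped"}  # hypothetical stand-in for a real data store
_REFUNDS = {}                  # idempotency_key -> refund record


def tool_error(code, message, retryable=False):
    """Uniform error shape so the model and the loop can react to failures."""
    return {"ok": False, "error": {"code": code, "message": message, "retryable": retryable}}


def get_order_status(order_id):
    """Narrow, read-only tool: looks up one order and nothing else."""
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        return tool_error("invalid_argument", "order_id must look like 'ord_<id>'")
    if order_id not in ORDERS:
        return tool_error("not_found", f"No order with id {order_id}")
    return {"ok": True, "order_id": order_id, "status": ORDERS[order_id]}


def refund_order(order_id, idempotency_key):
    """Write tool made safe to retry: the same key never refunds twice."""
    if idempotency_key in _REFUNDS:
        return _REFUNDS[idempotency_key]  # replay the earlier outcome
    if order_id not in ORDERS:
        return tool_error("not_found", f"No order with id {order_id}")
    record = {"ok": True, "order_id": order_id, "refund_number": len(_REFUNDS) + 1}
    _REFUNDS[idempotency_key] = record
    return record
```

Because `refund_order` replays the stored record for a repeated key, a retry after a timeout cannot issue a second refund, and the error `code` gives the loop something more actionable than a free-text message.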

The Five Failure Modes

Infinite loops: Agent repeats the same action because it doesn't detect it's not making progress. Fix: track action history, detect repetition, break loop after N identical actions.
Tool misuse: Wrong arguments, wrong tool for the task. Fix: strict JSON schema validation, retry with error feedback (max 2 retries).
Context overflow: Long tasks exhaust the context window. Fix: summarize memory periodically, keep only recent + relevant steps.
Hallucinated tool calls: Model invents tools that don't exist. Fix: constrained generation or an explicit 'available tools only' instruction.
Goal drift: Agent pursues sub-goals that diverge from the original task. Fix: re-inject the original task at every step, add a 'goal check' before executing actions.
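
Two of these fixes are simple enough to sketch directly: repetition detection for infinite loops, and a pre-execution check that covers both tool misuse and hallucinated tool names. The helper names and the `schemas` layout below are hypothetical, and the type check is a simplified stand-in for full JSON Schema validation.

```python
import json


def is_stuck(history, action_name, action_args, max_repeats=3):
    """Record this action; return True if the last `max_repeats` actions are identical."""
    key = (action_name, json.dumps(action_args, sort_keys=True, default=str))
    history.append(key)
    recent = history[-max_repeats:]
    return len(recent) == max_repeats and len(set(recent)) == 1


def check_tool_call(tools, schemas, name, args):
    """Return an error string to feed back to the model, or None if the call looks valid.

    `schemas` maps tool name -> {"required": {arg_name: python_type}}.
    """
    if name not in tools:
        return f"Unknown tool '{name}'. Available tools: {sorted(tools)}"
    required = schemas.get(name, {}).get("required", {})
    for arg, expected in required.items():
        if arg not in args:
            return f"Missing required argument '{arg}' for tool '{name}'"
        if not isinstance(args[arg], expected):
            return f"Argument '{arg}' must be of type {expected.__name__}"
    return None
```

In a loop, a non-None return from `check_tool_call` would be appended to memory as the step's result rather than executing anything, with the retry count capped as described above.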

Observability Is Not Optional

Log every step: the full prompt, the generated action, the tool inputs and outputs, latency, and token count. Without this, debugging a failed agent run is guesswork. Use structured logging — not print statements — so you can trace specific run IDs across steps. Build a replay tool: given a run ID, reconstruct exactly what the agent saw and did at each step.
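
A minimal sketch of what that can look like with only the standard library: one JSON line per step, keyed by run ID, plus a replay helper that rebuilds a run from those lines. The field names and the `log_step` and `replay` helpers are assumptions for illustration; a production setup would more likely ship these records to a log store.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")


def new_run_id():
    """Generate a unique ID that ties all steps of one run together."""
    return uuid.uuid4().hex


def log_step(run_id, step, prompt, action, tool_input, tool_output, latency_ms, tokens):
    """Emit one JSON line for a step and return the record."""
    record = {
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "prompt": prompt,
        "action": action,
        "tool_input": tool_input,
        "tool_output": tool_output,
        "latency_ms": latency_ms,
        "tokens": tokens,
    }
    # default=str keeps logging from failing on non-JSON-serializable values
    logger.info(json.dumps(record, default=str))
    return record


def replay(log_lines, run_id):
    """Rebuild one run from JSON log lines, ordered by step number."""
    records = (json.loads(line) for line in log_lines)
    return sorted((r for r in records if r["run_id"] == run_id), key=lambda r: r["step"])
```

Given the log lines and a run ID, `replay` returns the steps in order, which is the starting point for showing exactly what the agent saw and did.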

Production rule: never deploy an agent without a max_steps limit, per-tool timeout, total cost budget, and a human escalation path. Any agent that can spend money or modify data needs all four.
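
As a rough sketch of how those four limits might live in one place, the class below tracks steps and spend and raises when either budget is exceeded. The names `RunLimits` and `EscalateToHuman`, the default values, and the idea of passing a per-step cost in dollars are all assumptions; `tool_timeout_s` is only carried here and is meant to be passed to something like `asyncio.wait_for`.

```python
from dataclasses import dataclass


class EscalateToHuman(Exception):
    """Raised when a limit is hit and a person should take over."""


@dataclass
class RunLimits:
    max_steps: int = 10
    tool_timeout_s: float = 30.0  # per-tool timeout, enforced by the caller
    max_cost_usd: float = 1.00
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, step_cost_usd):
        """Record one step and its cost; raise once a limit is exceeded."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise EscalateToHuman(f"step limit {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise EscalateToHuman(f"cost budget ${self.max_cost_usd:.2f} exceeded")
```

Calling `limits.charge(...)` once per loop iteration and catching `EscalateToHuman` at the top level gives a single place to hand the run to a person instead of letting it continue.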
