Tool Use in Production Agents: Idempotency, Side Effects, and Audit Trails

How to classify tool risk (read vs write vs destructive), design idempotency correctly, set retry strategy per tool type, build audit logs for compliance, and prevent an agent from sending the email twice.

Prerequisites: agent architecture basics, function calling basics. After this post you will be able to design production-safe tool use patterns: classify tool risk, enforce idempotency, handle failures without data corruption, and build audit trails.

Tool use is where agents stop being text generators and start affecting the real world. Sending emails, updating records, triggering workflows, calling payment APIs — every tool call that has a side effect is a place where a bug doesn't just produce a wrong answer. It produces a wrong action.

Most tutorials show you how to make tool calls work. This post is about making them safe at production scale.

Read vs Write: The Most Important Classification

The first thing you do with any tool is classify it by risk:

Read tools: query_database(), get_customer(), search_docs(). These are safe to retry, safe to run multiple times, safe to call without human approval.

Write tools: send_email(), update_record(), trigger_payment(), post_to_slack(). These change state. Running them twice causes real problems.

Destructive tools: delete_record(), cancel_order(), revoke_access(). These need human-in-the-loop gates or explicit confirmation steps.
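One way to make this classification enforceable is to register every tool with a risk class before the agent can call it. The sketch below is a minimal illustration, not a prescribed design: the Risk enum, the TOOL_RISK registry, and the helper functions are hypothetical names, and defaulting unknown tools to the most restrictive class is an assumption about what you would want.

```python
from enum import Enum


class Risk(Enum):
    READ = "read"                # no side effects; safe to retry
    WRITE = "write"              # changes state; needs an idempotency key
    DESTRUCTIVE = "destructive"  # hard to undo; needs human approval


# Hypothetical registry: every tool the agent can call is classified up front.
TOOL_RISK = {
    "query_database": Risk.READ,
    "get_customer": Risk.READ,
    "search_docs": Risk.READ,
    "send_email": Risk.WRITE,
    "update_record": Risk.WRITE,
    "trigger_payment": Risk.WRITE,
    "post_to_slack": Risk.WRITE,
    "delete_record": Risk.DESTRUCTIVE,
    "cancel_order": Risk.DESTRUCTIVE,
    "revoke_access": Risk.DESTRUCTIVE,
}


def classify(tool_name: str) -> Risk:
    # Assumption: an unregistered tool gets the most restrictive class.
    return TOOL_RISK.get(tool_name, Risk.DESTRUCTIVE)


def requires_approval(tool_name: str) -> bool:
    return classify(tool_name) is Risk.DESTRUCTIVE


def safe_to_auto_retry(tool_name: str) -> bool:
    return classify(tool_name) is Risk.READ
```

With a registry like this, the executor can refuse to auto-retry anything that is not a read and can route destructive calls to an approval step without relying on the LLM to remember the rule.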

Interview trap: 'Just retry failed tool calls.' This is the most common wrong answer. Retrying a read is safe. Retrying a write that already succeeded sends the email twice, charges the card twice, updates the record twice. Idempotency design is the answer, not naive retries.

Idempotency Design

Idempotency means that calling a tool multiple times with the same arguments has the same effect on the system as calling it once. For write tools, you have to design this in. It doesn't happen automatically.

Idempotency key: every write operation gets a unique key generated before the call. If the call is retried, the same key is sent. The server uses the key to detect duplicate requests and return the original response without re-executing.

Generate the key before calling, not after. If generation happens inside the tool, a failed call creates no key and a retry generates a new one, which defeats the purpose.

Store the key with the agent state. If the agent crashes and restarts, it must retry with the original key, not generate a new one.

Example: send_email(to=..., subject=..., idempotency_key='task-123-email-confirmation'). If the API was already called with this key, it returns the original response.

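import uuid

class AgentTask:
    def __init__(self, email_api):
        # email_api: assumed client whose send() accepts an idempotency_key
        self.email_api = email_api
        # operation_id -> idempotency_key. Persist this dict with the agent
        # state so a restarted agent reuses the original keys.
        self.tool_keys = {}

    def get_or_create_key(self, operation_id: str) -> str:
        # Key is created BEFORE the call and reused on every retry.
        # Keyed by logical operation, not tool name alone, so two different
        # emails in the same task never share a key.
        if operation_id not in self.tool_keys:
            self.tool_keys[operation_id] = str(uuid.uuid4())
        return self.tool_keys[operation_id]

    def send_email(self, to: str, body: str, purpose: str):
        key = self.get_or_create_key(f'send_email:{purpose}')
        return self.email_api.send(to=to, body=body, idempotency_key=key)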
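A client-side key only helps if the receiving side deduplicates on it. The sketch below shows that server behaviour in its simplest form. FakeEmailServer is a hypothetical in-memory stand-in, not any real email provider's API, and a production version would need a durable store and an atomic check-and-insert so two concurrent retries cannot both execute.

```python
class FakeEmailServer:
    """Hypothetical in-memory email API that honours idempotency keys."""

    def __init__(self):
        self.responses = {}  # idempotency_key -> original response
        self.sent = []       # every email actually delivered

    def send(self, to: str, body: str, idempotency_key: str) -> dict:
        # Duplicate request: return the stored response, do not re-execute.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        self.sent.append((to, body))
        response = {"status": "sent", "message_id": len(self.sent)}
        self.responses[idempotency_key] = response
        return response


server = FakeEmailServer()
key = "task-123-email-confirmation"
first = server.send("a@example.com", "Your order shipped.", idempotency_key=key)
retry = server.send("a@example.com", "Your order shipped.", idempotency_key=key)
```

Here the retry returns the same response object as the first call and server.sent still holds a single email, which is exactly the property the agent relies on when it retries with the original key.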

Tool Schema Design

The LLM decides which tool to call and what arguments to pass based on the schema. A bad schema produces wrong calls:

Descriptions must be precise about side effects. 'Send email to customer' is wrong. 'Send a one-time transactional email to a customer. Cannot be undone. Requires customer_id and template_id.' is right.

Argument names should be unambiguous. customer_id, not id. email_template_id, not template.

Enumerate allowed values. If status can only be 'active', 'paused', or 'cancelled', say so in the schema. Don't let the LLM guess.

Mark required vs optional explicitly. The LLM will hallucinate arguments for optional fields if you're not clear.
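To make these rules concrete, here is a sketch of a tool definition written in the JSON Schema style that most function-calling APIs accept, plus a small check that runs before the tool executes. The tool name update_subscription_status, its fields, and validate_args are all hypothetical, and the outer envelope of a tool definition differs between providers.

```python
# Hypothetical tool definition; field names are illustrative only.
UPDATE_SUBSCRIPTION_TOOL = {
    "name": "update_subscription_status",
    "description": (
        "Set the status of one customer subscription. Changes billing state "
        "immediately. Requires customer_id and status."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "Internal customer ID."},
            "status": {"type": "string", "enum": ["active", "paused", "cancelled"]},
            "reason": {"type": "string", "description": "Optional free-text note."},
        },
        "required": ["customer_id", "status"],
        "additionalProperties": False,
    },
}


def validate_args(tool: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    schema = tool["parameters"]
    props = schema["properties"]
    errors = [f"missing required argument: {name}"
              for name in schema["required"] if name not in args]
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown argument: {name}")
        elif "enum" in props[name] and value not in props[name]["enum"]:
            errors.append(f"{name} must be one of {props[name]['enum']}")
    return errors
```

Validating arguments in your own code, rather than trusting the model to follow the schema, means a hallucinated field or an out-of-range status is rejected before any side effect happens.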

Timeout and Retry Strategy Per Tool Type

Read tools: retry up to 3 times with exponential backoff. Fast timeouts (2–5s) are acceptable: if the data isn't available quickly, it's probably not needed for the current step.

Write tools: one attempt. On failure, surface the error to the planner. Do not auto-retry. Let the planner decide whether to retry (with the original idempotency key) or escalate.

Destructive tools: require explicit confirmation before calling. On failure, do not retry automatically. Log the failure and escalate to human review.

External APIs: treat timeouts as unknown state, not failure. The call may have succeeded but the response was lost. Check idempotency key status before deciding to retry.
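The read and write policies above can be sketched as two small wrappers. This is an illustration under assumptions, not a reference implementation: call_read_tool, call_write_tool, and the shape of the returned status dict are hypothetical, and the sleep function is injectable only so the example runs without waiting.

```python
import time


def call_read_tool(fn, *, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry a side-effect-free call with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 0.5s, then 1s


def call_write_tool(fn, idempotency_key):
    """One attempt only; report the outcome to the planner, never retry here."""
    try:
        return {"status": "ok", "result": fn(idempotency_key)}
    except TimeoutError:
        # Unknown state: the call may have succeeded. The planner should
        # check the key's status before deciding to retry.
        return {"status": "unknown", "idempotency_key": idempotency_key}
    except Exception as exc:
        return {"status": "failed", "error": str(exc),
                "idempotency_key": idempotency_key}


# Usage: a read that fails twice, then succeeds on the third attempt.
attempts_seen = []

def flaky_read():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("transient")
    return "rows"

delays = []
rows = call_read_tool(flaky_read, sleep=delays.append)

# Usage: a write that times out is reported as unknown, not retried.
def slow_send(key):
    raise TimeoutError("no response from email API")

outcome = call_write_tool(slow_send, "task-123-email-confirmation")
```

Returning a distinct "unknown" status keeps the timeout case separate from a definite failure, so the planner can look up the idempotency key instead of blindly resending.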

Audit Logging

In enterprise environments, tool calls must be auditable. This is not optional for compliance.

Log before calling: the intent (what tool, what arguments, which agent session).

Log after: the result, latency, status, token cost of the LLM decision that triggered the call.

Log the reasoning: store the LLM's reasoning trace that led to the tool call. This is your paper trail when something goes wrong.

Do not log secrets or PII in plain text. Mask sensitive fields before writing to logs.

Store logs outside the agent's memory. If the agent crashes, the audit trail must survive.
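A minimal sketch of these logging rules, assuming an append-only JSON Lines file as the external store. AuditLog, mask, and the SENSITIVE_FIELDS set are hypothetical names. Masking here is by field name only, so a reasoning trace that quotes PII would need its own scrubbing step.

```python
import json
import time
import uuid

# Assumed list of argument names that must never appear in plain text.
SENSITIVE_FIELDS = {"email", "to", "card_number", "api_key"}


def mask(args: dict) -> dict:
    """Replace sensitive values before they reach the log."""
    return {k: "***" if k in SENSITIVE_FIELDS else v for k, v in args.items()}


class AuditLog:
    """Append-only JSON Lines log kept outside the agent's memory."""

    def __init__(self, path: str):
        self.path = path

    def _write(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def log_intent(self, session_id, tool, args, reasoning) -> str:
        # Written BEFORE the tool runs, so a crash mid-call still leaves a trace.
        call_id = str(uuid.uuid4())
        self._write({"event": "intent", "call_id": call_id, "ts": time.time(),
                     "session_id": session_id, "tool": tool,
                     "args": mask(args), "reasoning": reasoning})
        return call_id

    def log_result(self, call_id, status, latency_ms, token_cost, summary) -> None:
        self._write({"event": "result", "call_id": call_id, "ts": time.time(),
                     "status": status, "latency_ms": latency_ms,
                     "token_cost": token_cost, "result": summary})
```

An intent record with no matching result record for the same call_id is itself a signal: the call started and its outcome is unknown, which is the case the idempotency key exists to resolve.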

Production reality: the first time an agent sends an email to the wrong customer, or triggers a payment twice, you will need the audit log to understand exactly what the LLM decided, why it decided it, and what context it had. Build the audit trail before you need it.
