
Backend APIs for Agent Services: Async Endpoints, Streaming, and Webhook Patterns

Why standard REST breaks for agents. 202 + polling vs webhooks vs SSE. Request deduplication via idempotency keys. Readiness probes that check model availability. Token-based rate limiting.

Prerequisites: REST API basics, async programming. After this post you will know how to design backend APIs that handle agent workloads correctly: async endpoints, streaming output, webhook callbacks, idempotency at the API layer, and health probes that validate model availability.

Standard REST patterns assume requests complete in milliseconds. Agent tasks take 10–120 seconds. Streaming LLM output requires a different transport. Long-running tasks need job-based async patterns. These are not edge cases — they are the normal operating mode for any production agent service.

Getting the API layer wrong is expensive: timeouts cascade into duplicate agent runs, users see false errors, and retried write operations cause silent data corruption. Every API design decision here has a downstream consequence.

Sync vs Async Endpoint Design

The first decision: should the endpoint return a result, or a job handle?

Synchronous (block and wait): only works for tasks under ~5 seconds. An API gateway timeout (typically 29–60s) terminates the connection before a long agent run completes.

202 + job polling: return a job ID immediately with HTTP 202. The client polls GET /jobs/{id}/status. Standard pattern for any task over ~5 seconds.

Webhook callback: return 202 immediately and POST to a caller-provided callback URL when done. Preferred for server-to-server integrations because it eliminates polling entirely.

SSE (Server-Sent Events): hold the HTTP connection open and stream events as the agent produces them. Ideal for interactive UIs where users watch the agent work in real time.
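
From the caller's side, the 202 + polling option might look roughly like the sketch below. It is not tied to any HTTP library: fetch_status is a hypothetical callable that performs the GET and returns the decoded JSON body, and the status names "succeeded" and "failed" are assumptions rather than part of any fixed contract.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"succeeded", "failed"}  # assumed status names


def poll_until_done(
    fetch_status: Callable[[], Dict],
    timeout_s: float = 180.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll a job until it reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() + delay > deadline:
            raise TimeoutError("job did not finish before the client deadline")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

Backing off between polls keeps a fleet of waiting clients from hammering the status endpoint while a 60-second agent run is still in progress.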

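# FastAPI async job pattern
from typing import Optional
from uuid import uuid4

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
job_store = {}  # idempotency key -> job record. Production: Redis or database

class TaskRequest(BaseModel):
    # Illustrative schema (an assumption): adapt the fields to your agent
    prompt: str
    idempotency_key: Optional[str] = None

async def run_agent_task(job_id: str, request: TaskRequest) -> None:
    ...  # Placeholder: run the agent and persist status/result under job_id

@app.post('/agent/tasks', status_code=202)
async def create_task(request: TaskRequest, background_tasks: BackgroundTasks):
    job_id = str(uuid4())
    # Without a client-supplied key the job ID is used, so no deduplication happens
    idempotency_key = request.idempotency_key or job_id

    # Deduplication check before queuing
    if idempotency_key in job_store:
        existing = job_store[idempotency_key]
        return {'job_id': existing['job_id'], 'status': 'already_accepted'}

    job_store[idempotency_key] = {'job_id': job_id, 'status': 'pending'}
    background_tasks.add_task(run_agent_task, job_id, request)
    return {'job_id': job_id, 'status': 'pending', 'poll_url': f'/agent/tasks/{job_id}'}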
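
For the webhook variant, the background task would finish by POSTing the result to the caller's callback URL. The sketch below shows one way to do that with retries, using only the standard library. The payload shape, the retry counts, and the function names deliver_webhook and post_json are assumptions for illustration; the send parameter exists so the delivery logic can be exercised without a network.

```python
import json
import time
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout_s: float = 10.0) -> int:
    """POST a JSON payload and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises an OSError subclass on network failures and 4xx/5xx responses
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def deliver_webhook(
    callback_url: str,
    job_id: str,
    result: dict,
    send: Callable[[str, dict], int] = post_json,
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver the job result; return True once the receiver answers 2xx."""
    payload = {"job_id": job_id, "status": "succeeded", "result": result}
    for attempt in range(max_attempts):
        try:
            if 200 <= send(callback_url, payload) < 300:
                return True
        except OSError:
            pass  # treat as a failed attempt and retry
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False
```

Because a retry can follow a delivery whose acknowledgement was lost, the receiver may see the same payload twice. Including job_id in the body lets it deduplicate on its side.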

Streaming Agent Output with SSE

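SSE is the right transport for streaming agent output to a browser. It is an ordinary HTTP response with the text/event-stream content type that the server keeps open and writes to incrementally (chunked transfer encoding under HTTP/1.1). One non-obvious requirement: reverse proxies such as nginx buffer responses by default, so buffering has to be switched off for events to reach the client as they are produced. Note also that the browser's EventSource API only issues GET requests, so a POST endpoint like the one below has to be consumed with fetch and a streamed response body instead.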

from fastapi.responses import StreamingResponse
import json

@app.post('/agent/stream')
async def stream_agent(request: TaskRequest):
    async def event_generator():
        # run_agent_streaming (defined elsewhere) is assumed to be an async generator
        # yielding steps shaped like {type: 'thought'|'action'|'result', content: str}
        async for step in run_agent_streaming(request):
            yield f'data: {json.dumps(step)}\n\n'
        yield 'data: {"type": "done"}\n\n'

    return StreamingResponse(
        event_generator(),
        media_type='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'X-Accel-Buffering': 'no'  # Critical: disables nginx buffering
        }
    )
# Without X-Accel-Buffering: no, nginx accumulates the stream
# and flushes it all at once — defeating the purpose of streaming.
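
On the consuming side, a non-browser client (a test harness or another service, for instance) needs to split the stream back into events. The following is a minimal parser sketch for the data: frames emitted above, assuming each event carries a JSON payload. It ignores the event:, id:, and retry: fields, and the name parse_sse is illustrative.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE event."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch
```

Feeding it the decoded lines of the HTTP response body yields the step dictionaries in order, ending with the {"type": "done"} sentinel.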

Request Deduplication at the API Layer

Network retries happen. A client sends a task request, the response is lost in transit, the client retries. Without deduplication, two identical agent tasks run in parallel — both complete, both bill the user, both write to the CRM.

The client generates a UUID before the first request and sends it in an Idempotency-Key header or in the request body.

The API server checks a fast store (Redis) before accepting the task.

If the key exists, the server returns the original job ID and status instead of starting a second run.

Idempotency keys should expire after 24–72 hours so that legitimate re-requests are possible after that window.

Do not make idempotency checking optional. A system that works correctly only when clients behave correctly is not production-grade.
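
The check-then-accept step with expiry can be sketched as a small store. This is an in-memory, single-process stand-in written for illustration. The class name IdempotencyStore and its claim method are hypothetical, and a shared deployment would need the same semantics from an atomic operation in the shared store (in Redis, SET with the NX and EX options is the usual fit).

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for a 'set if absent, with expiry' key store."""

    def __init__(self, ttl_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        # key -> (job_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def claim(self, key: str, job_id: str) -> Tuple[str, bool]:
        """Return (job_id, created); created is False for a live duplicate key."""
        now = self._clock()
        existing = self._entries.get(key)
        if existing is not None and existing[1] > now:
            return existing[0], False  # duplicate: hand back the original job
        self._entries[key] = (job_id, now + self._ttl_s)
        return job_id, True
```

A request handler would call claim with the client's key and a freshly generated job ID, and only queue the agent task when created is True.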

Readiness Probes That Check Model Availability

Standard readiness probes return 200 if the server process is running. For LLM services this is insufficient — the process can be alive while the model is still loading, or the inference backend is degraded.

Liveness probe: checks the server process is alive and not deadlocked. Simple HTTP 200 on /health. It should trigger a container restart only when the process is genuinely stuck.

Readiness probe: for an LLM service, readiness means model weights are loaded, the tokenizer is initialized, and a test tokenizer call succeeds. Until readiness passes, K8s keeps the pod out of the load balancer.

Do not run a full inference call in a probe. At a 5-second probe interval that is 12 LLM calls per minute per pod, which is a non-trivial cost.
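
The split between the two probes can be sketched as two plain functions that a /health and a /ready route would call. ModelState is a hypothetical holder for whatever the service loads at startup, and the (status code, body) return shape is an assumption chosen to keep the example framework-free.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelState:
    """Hypothetical holder for what the service loads at startup."""
    weights_loaded: bool = False
    tokenize: Optional[Callable[[str], List[int]]] = None


def check_liveness() -> Tuple[int, dict]:
    # Liveness only proves the process can still answer a request
    return 200, {"status": "alive"}


def check_readiness(state: ModelState) -> Tuple[int, dict]:
    """Cheap readiness check: no inference call, just a tokenizer smoke test."""
    if not state.weights_loaded:
        return 503, {"ready": False, "reason": "model weights not loaded"}
    if state.tokenize is None:
        return 503, {"ready": False, "reason": "tokenizer not initialized"}
    try:
        tokens = state.tokenize("ping")
    except Exception as exc:
        return 503, {"ready": False, "reason": f"tokenizer call failed: {exc}"}
    if not tokens:
        return 503, {"ready": False, "reason": "tokenizer returned no tokens"}
    return 200, {"ready": True}
```

Returning 503 with a reason keeps the pod out of rotation and also leaves a readable trace in probe logs while the model is still loading.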

Rate Limiting for AI Endpoints

Standard rate limiting counts requests per minute. For AI endpoints, request count is the wrong unit — one request may consume 50x the tokens of another.

Token-based rate limiting: track tokens consumed per user per window. Gate requests at token budget exhaustion, not request count.

Tier-based limits: free tier gets X tokens/day; paid gets Y. Enforce at API middleware, not in agent logic.

Concurrency limits: cap simultaneous agent tasks per user. One user with 10 concurrent tasks starves all others.

Cost-based back-pressure: if total infra cost exceeds budget, reject with 429 + Retry-After rather than queuing indefinitely.
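
A token budget and a concurrency cap can share one small gate in middleware. The sketch below uses a fixed window and in-memory state purely for illustration. The class name TokenBudgetLimiter, the idea of charging an estimate up front and reconciling afterwards, and the 1-second retry hint for concurrency rejections are all assumptions, and a multi-instance service would keep these counters in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBudgetLimiter:
    """Fixed-window token budget plus a per-user concurrency cap."""

    def __init__(self, tokens_per_window: int, window_s: float = 60.0,
                 max_concurrent: int = 2,
                 clock: Callable[[], float] = time.monotonic):
        self._tokens_per_window = tokens_per_window
        self._window_s = window_s
        self._max_concurrent = max_concurrent
        self._clock = clock
        self._usage: Dict[str, Tuple[float, int]] = {}  # user -> (window_start, used)
        self._active: Dict[str, int] = {}  # user -> running task count

    def try_start(self, user: str, estimated_tokens: int) -> Tuple[bool, float]:
        """Return (allowed, retry_after_s); a rejection maps to 429 + Retry-After."""
        now = self._clock()
        start, used = self._usage.get(user, (now, 0))
        if now - start >= self._window_s:
            start, used = now, 0  # the previous window has ended
        if used + estimated_tokens > self._tokens_per_window:
            return False, max(0.0, start + self._window_s - now)
        if self._active.get(user, 0) >= self._max_concurrent:
            return False, 1.0  # arbitrary short hint: a slot frees when a task ends
        self._usage[user] = (start, used + estimated_tokens)
        self._active[user] = self._active.get(user, 0) + 1
        return True, 0.0

    def finish(self, user: str, estimated_tokens: int, actual_tokens: int) -> None:
        """Release the slot and replace the estimate with the actual token count."""
        self._active[user] = max(0, self._active.get(user, 0) - 1)
        start, used = self._usage.get(user, (self._clock(), 0))
        self._usage[user] = (start, max(0, used - estimated_tokens + actual_tokens))
```

Tier-based limits fall out of the same structure by constructing one limiter per tier with a different tokens_per_window.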

Senior framing: the API layer exists to protect both the user (no duplicate tasks, streaming that doesn't drop) and the infrastructure (no runaway queues, no burst overload). Design it as a contract: what guarantees does your API make to callers, and what invariants does it enforce to protect downstream systems?
