GenAI Systems Lab
Production & LLMOps

Backend APIs for Agent Services: Async Endpoints, Streaming, and Webhook Patterns

Why standard REST breaks for agents. 202 + polling vs webhooks vs SSE. Request deduplication via idempotency keys. Readiness probes that check model availability. Token-based rate limiting.

Prerequisites: REST API basics, async programming. After this post you will know how to design backend APIs that handle agent workloads correctly: async endpoints, streaming output, webhook callbacks, idempotency at the API layer, and health probes that validate model availability.

Standard REST patterns assume requests complete in milliseconds. Agent tasks take 10–120 seconds. Streaming LLM output requires a different transport. Long-running tasks need job-based async patterns. These are not edge cases — they are the normal operating mode for any production agent service.

Getting the API layer wrong is expensive: timeouts cascade into duplicate agent runs, users see false errors, and retried write operations cause silent data corruption. Every API design decision here has a downstream consequence.

Sync vs Async Endpoint Design

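The first decision: should the endpoint return a result or a job handle? If a task can outlive the timeout of any client, proxy, or load balancer in the request path, return 202 Accepted with a job handle immediately and run the work in the background.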

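# FastAPI async job pattern
from typing import Optional
from uuid import uuid4

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
job_store = {}  # Production: Redis or database

class TaskRequest(BaseModel):
    # Assumed request shape; the real schema is not shown in this post
    prompt: str
    idempotency_key: Optional[str] = None

async def run_agent_task(job_id: str, request: TaskRequest) -> None:
    ...  # Placeholder: run the agent and record the outcome for job_id

@app.post('/agent/tasks', status_code=202)
async def create_task(request: TaskRequest, background_tasks: BackgroundTasks):
    job_id = str(uuid4())
    # Without a client-supplied key, every request is treated as a new job
    idempotency_key = request.idempotency_key or job_id

    # Deduplication check before queuing
    if idempotency_key in job_store:
        existing = job_store[idempotency_key]
        poll_url = f"/agent/tasks/{existing['job_id']}"
        return {'job_id': existing['job_id'], 'status': 'already_accepted', 'poll_url': poll_url}

    job_store[idempotency_key] = {'job_id': job_id, 'status': 'pending'}
    background_tasks.add_task(run_agent_task, job_id, request)
    return {'job_id': job_id, 'status': 'pending', 'poll_url': f'/agent/tasks/{job_id}'}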
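The endpoint above hands back a poll_url, so the caller needs a polling loop. Here is a minimal client-side sketch with exponential backoff. The function name poll_until_done, the injected fetch_status callable, and the terminal states 'succeeded' and 'failed' are assumptions for illustration; the post itself only shows the 'pending' status.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; adapt to whatever the job runner actually records.
TERMINAL_STATES = {"succeeded", "failed"}


def poll_until_done(
    fetch_status: Callable[[], Dict],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    timeout: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the job reaches a terminal state.

    fetch_status is a hypothetical callable that performs GET <poll_url>
    and returns the decoded JSON body as a dict with a 'status' field.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError("job did not finish before the polling deadline")
        sleep(delay)
        # Back off so long-running jobs do not hammer the status endpoint.
        delay = min(delay * 2, max_delay)
```

Injecting sleep keeps the loop testable without real waiting. The cap on the delay bounds how stale the client's view can get.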

Streaming Agent Output with SSE

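Server-Sent Events (SSE) is a good fit for streaming agent output to a browser. The server holds a single HTTP response open and writes events to it as they occur (on HTTP/1.1 this uses chunked transfer encoding). One non-obvious requirement: reverse proxies such as nginx buffer responses by default, so buffering must be switched off for the stream.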

from fastapi.responses import StreamingResponse
import json

# Reuses app and TaskRequest from the async job example.
# run_agent_streaming is an async generator defined elsewhere.
# Note: the browser EventSource API only issues GET requests, so a POST
# stream like this one has to be read with fetch() and a stream reader.
@app.post('/agent/stream')
async def stream_agent(request: TaskRequest):
    async def event_generator():
        async for step in run_agent_streaming(request):
            # step: {type: 'thought'|'action'|'result', content: str}
            yield f'data: {json.dumps(step)}\n\n'
        yield 'data: {"type": "done"}\n\n'

    return StreamingResponse(
        event_generator(),
        media_type='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'X-Accel-Buffering': 'no'  # Critical: disables nginx buffering
        }
    )
# Without X-Accel-Buffering: no, nginx buffers the stream and delivers
# it in large bursts, which defeats the purpose of streaming.
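The generator above builds each frame by hand, which works because json.dumps never emits raw newlines. A small formatter makes the wire format explicit and also handles multi-line strings, event names, and ids. This is a sketch: format_sse and heartbeat are hypothetical helper names, not part of FastAPI.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event frame.

    Non-string data is JSON-encoded. Each line of the payload gets its own
    'data:' field, and a blank line terminates the event.
    """
    payload = data if isinstance(data, str) else json.dumps(data)
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    for line in payload.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def heartbeat() -> str:
    """Comment frame (leading colon) that clients ignore.

    Sending one periodically can keep idle connections from being closed
    by intermediaries during long agent steps.
    """
    return ": keep-alive\n\n"
```

Inside event_generator, `yield format_sse(step, event=step['type'])` would replace the hand-built f-string.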

Request Deduplication at the API Layer

Network retries happen. A client sends a task request, the response is lost in transit, the client retries. Without deduplication, two identical agent tasks run in parallel — both complete, both bill the user, both write to the CRM.
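One way to sketch the deduplication step is an idempotency store that maps a client-supplied key to the job it created. The class below is an in-memory illustration under stated assumptions: the names IdempotencyStore and IdempotencyConflict, the 24-hour TTL, and the payload fingerprint check are choices made for this example. A production version would likely live in Redis or a database, as the earlier job_store comment suggests.

```python
import hashlib
import json
import threading
import time
from typing import Callable, Dict, Tuple
from uuid import uuid4


class IdempotencyConflict(Exception):
    """Same key reused with a different request payload."""


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # key -> (payload fingerprint, job_id, expires_at)
        self._entries: Dict[str, Tuple[str, str, float]] = {}

    def claim(self, key: str, payload: dict) -> Tuple[str, bool]:
        """Return (job_id, created). created is False for a replayed request."""
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        now = self._clock()
        # The lock makes check-and-insert atomic across threads.
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[2] > now:
                if entry[0] != fingerprint:
                    raise IdempotencyConflict(key)
                return entry[1], False
            job_id = str(uuid4())
            self._entries[key] = (fingerprint, job_id, now + self._ttl)
            return job_id, True
```

The handler would queue the agent task only when created is True, and return the existing job handle otherwise. Rejecting a reused key with a different payload guards against a client bug silently returning the wrong job.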

Readiness Probes That Check Model Availability

Standard readiness probes return 200 if the server process is running. For LLM services this is insufficient — the process can be alive while the model is still loading, or the inference backend is degraded.
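A readiness check along these lines could first confirm the model is loaded and then run a cheap inference probe with a short timeout. The sketch below is framework-agnostic and returns a status code and body. The names readiness, model_loaded, and probe_inference are hypothetical, and the 2-second default is an arbitrary choice for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def readiness(
    model_loaded: Callable[[], bool],
    probe_inference: Callable[[], Awaitable[None]],
    probe_timeout: float = 2.0,
) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, body) for a readiness endpoint.

    model_loaded reports whether weights or clients are initialized.
    probe_inference performs a tiny request against the inference backend.
    """
    if not model_loaded():
        return 503, {"status": "not_ready", "reason": "model_loading"}
    try:
        await asyncio.wait_for(probe_inference(), timeout=probe_timeout)
    except asyncio.TimeoutError:
        return 503, {"status": "not_ready", "reason": "inference_timeout"}
    except Exception:
        return 503, {"status": "not_ready", "reason": "inference_error"}
    return 200, {"status": "ready"}
```

Returning 503 takes the instance out of rotation without restarting it. That restart behavior belongs to the liveness probe, which should stay a plain process check.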

Rate Limiting for AI Endpoints

Standard rate limiting counts requests per minute. For AI endpoints, request count is the wrong unit — one request may consume 50x the tokens of another.
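Token-based limiting can be sketched as a token bucket whose cost per request is the estimated LLM token count rather than 1. The class below is an in-memory illustration. TokenBudgetLimiter and its parameters are hypothetical names, and the sketch assumes the caller can estimate a request's token usage up front.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBudgetLimiter:
    """Per-client token bucket measured in LLM tokens, not requests."""

    def __init__(self, tokens_per_minute: int, burst: Optional[int] = None,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = tokens_per_minute / 60.0  # refill rate in tokens/second
        self._capacity = float(burst if burst is not None else tokens_per_minute)
        self._clock = clock
        # client_id -> (available tokens, time of last refill)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def try_consume(self, client_id: str, estimated_tokens: int) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        if estimated_tokens > self._capacity:
            raise ValueError("request exceeds the maximum budget and can never be admitted")
        now = self._clock()
        available, last = self._buckets.get(client_id, (self._capacity, now))
        # Refill in proportion to elapsed time, capped at bucket capacity.
        available = min(self._capacity, available + (now - last) * self._rate)
        if estimated_tokens > available:
            self._buckets[client_id] = (available, now)
            return False, (estimated_tokens - available) / self._rate
        self._buckets[client_id] = (available - estimated_tokens, now)
        return True, 0.0
```

A rejected request would typically map to HTTP 429 with the second return value in a Retry-After header. Actual usage can differ from the estimate, so a fuller version might reconcile the bucket once the real token count is known.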

Senior framing: the API layer exists to protect both the user (no duplicate tasks, streaming that doesn't drop) and the infrastructure (no runaway queues, no burst overload). Design it as a contract: what guarantees does your API make to callers, and what invariants does it enforce to protect downstream systems?
