GenAI Systems Lab
AI Engineering 10 min read

How I'd Build a Voice AI Pipeline

STT → LLM → TTS architecture, latency budget breakdown, streaming strategies, and where it actually breaks in production.

Voice AI has a deceptively simple stack: speech-to-text → LLM → text-to-speech. The hard part is doing it fast enough that it feels like a conversation and reliable enough that it doesn't break in a noisy environment. Here's how I'd build it.

The Latency Budget

Human conversation feels natural under 500ms end-to-end. Your budget: STT ~150ms, LLM first token ~200ms, TTS first audio chunk ~100ms, network ~50ms. Every component needs to stream — you can't batch any stage.

| Stage | Target Latency | Tool Options | Bottleneck Risk |
|---|---|---|---|
| STT | 100–200ms | Deepgram Nova, Whisper Streaming, AssemblyAI | Accuracy on accents |
| LLM | 150–300ms TTFT | GPT-4o, Gemini Flash, Groq | Context length, streaming setup |
| TTS | 80–150ms | ElevenLabs, Cartesia, PlayHT | Voice cloning quality |
| Network | 50–100ms | WebSockets (required) | Regional latency |
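To make the budget concrete, here is a minimal sketch that sums the per-stage targets and flags any stage that runs over. The stage names, the helper functions, and the measured numbers in the usage note are illustrative assumptions, not output from a real system.

```python
# Per-stage targets in milliseconds, taken from the budget described above.
BUDGET_MS = {
    "stt": 150,
    "llm_first_token": 200,
    "tts_first_chunk": 100,
    "network": 50,
}
TARGET_MS = 500  # end-to-end target for a natural-feeling conversation


def total_latency(stages: dict) -> int:
    """Sum the per-stage latencies to get the end-to-end figure."""
    return sum(stages.values())


def over_budget(measured: dict, budget: dict = BUDGET_MS) -> dict:
    """Return each stage that exceeds its target, mapped to the overrun in ms."""
    return {
        name: measured[name] - limit
        for name, limit in budget.items()
        if measured.get(name, 0) > limit
    }
```

Feeding it hypothetical measurements such as `{"stt": 180, "llm_first_token": 190, "tts_first_chunk": 100, "network": 70}` would report STT and network as the stages to work on, even though the LLM came in under target.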

Streaming Is Non-Negotiable

Every stage must stream. STT streams partial transcripts as the user speaks. The LLM can start generating before the final transcript arrives by working from interim results, at the cost of discarding work when the transcript changes. TTS synthesizes and plays audio as the LLM generates tokens, sentence by sentence, rather than waiting for the full response. Without streaming, your latency is the sum of every stage's full processing time. With streaming, it is roughly the sum of each stage's time to first output, which is what the budget above measures.
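The effect of streaming can be illustrated with chained async generators. This is a toy sketch with stand-in stages (no real STT, LLM, or TTS calls); `source`, `stage`, and `run_pipeline` are hypothetical names. It only records the order in which each stage touches each chunk.

```python
import asyncio


async def source(n_chunks):
    """Stand-in for incoming audio: yields chunk indices one at a time."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield i


async def stage(name, upstream, events):
    """Stand-in for a streaming stage: forwards each chunk as soon as it arrives."""
    async for chunk in upstream:
        events.append((name, chunk))
        yield chunk


async def run_pipeline(n_chunks=3):
    events = []
    stt = stage("stt", source(n_chunks), events)
    llm = stage("llm", stt, events)
    tts = stage("tts", llm, events)
    played = [chunk async for chunk in tts]
    return events, played
```

Running `asyncio.run(run_pipeline())` shows the first chunk passing through all three stages before the second chunk enters STT, so the first audio is ready after one chunk of work per stage rather than after every stage has finished its full input.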

Turn Detection

The hardest problem in voice AI isn't latency — it's knowing when the user has finished speaking. VAD (Voice Activity Detection) detects silence, but silence isn't always the end of a turn. Build a two-stage system: VAD triggers a 500ms silence window, then a small classifier decides if it's a natural pause (within a sentence) or a turn end. Without this, your agent constantly interrupts users mid-thought.
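A minimal sketch of the two-stage idea, assuming one VAD decision per 20ms audio frame. `TurnDetector`, `looks_complete`, and the trailing-word list are hypothetical; in particular `looks_complete` is a crude heuristic standing in for the small trained classifier, not a recommendation for how to build one.

```python
from dataclasses import dataclass


@dataclass
class TurnDetector:
    window_ms: int = 500   # silence window from the text above
    frame_ms: int = 20     # assumed VAD frame size
    silence_ms: int = 0

    def silence_window_elapsed(self, is_speech: bool) -> bool:
        """Stage 1: feed one VAD decision per frame; True once the window has elapsed."""
        self.silence_ms = 0 if is_speech else self.silence_ms + self.frame_ms
        return self.silence_ms >= self.window_ms


def looks_complete(transcript: str) -> bool:
    """Stage 2 placeholder: treat a trailing connective or filler as a mid-sentence pause."""
    words = transcript.strip().lower().rstrip(".?!,").split()
    if not words:
        return False
    return words[-1] not in {"and", "but", "so", "or", "because", "um", "uh"}


def is_turn_end(detector: TurnDetector, is_speech: bool, transcript: str) -> bool:
    """End the turn only when both the silence window and the classifier agree."""
    return detector.silence_window_elapsed(is_speech) and looks_complete(transcript)
```

With these assumptions, "I want to book a flight and" followed by silence keeps the agent waiting, while "I want to book a flight" followed by 500ms of silence ends the turn.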

Interruption Handling

Users interrupt. When the user speaks while the AI is talking, you need to: (1) detect the interruption via VAD, (2) stop TTS audio immediately, (3) cancel the pending LLM generation, (4) process the new user input. This requires coordinated cancellation across all three pipeline stages — a complexity that's easy to underestimate.
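One way to get coordinated cancellation is to run the whole response (LLM generation plus TTS playback) inside a single `asyncio` task, so that a barge-in cancels everything in one place. The sketch below assumes that design; `Responder` and its `_respond` loop are hypothetical stand-ins, and the VAD that triggers `interrupt()` is left to the caller.

```python
import asyncio


class Responder:
    """Owns the in-flight response so an interruption can cancel it in one place."""

    def __init__(self):
        self.task = None
        self.played = []  # stand-in for audio chunks sent to the client

    async def _respond(self, text):
        # Stand-in for streaming LLM generation and TTS playback.
        for i in range(100):
            self.played.append(f"{text}:{i}")
            await asyncio.sleep(0.01)

    async def interrupt(self):
        """Steps 2 and 3: stop playback and cancel pending generation."""
        if self.task and not self.task.done():
            self.task.cancel()
            try:
                await self.task  # wait until the cancellation has taken effect
            except asyncio.CancelledError:
                pass
        self.task = None

    async def start(self, text):
        """Step 4: cancel whatever is running, then respond to the new input."""
        await self.interrupt()
        self.task = asyncio.create_task(self._respond(text))
```

Awaiting the cancelled task before starting the next one guarantees that no audio from the old response is emitted after the new one begins. A real client would also need to flush audio already buffered on the device.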

The Architecture

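# Simplified voice pipeline. stt, llm, tts, buffer, and websocket are
# placeholders for whichever providers and transport you choose;
# is_turn_end is the turn detector described above.
async def voice_pipeline(audio_stream, stt, llm, tts, buffer, websocket, is_turn_end):
    async for transcript in stt.stream(audio_stream):  # streaming STT
        if not is_turn_end(transcript):
            continue  # keep listening until the user has finished the turn
        async for token in llm.stream(transcript.text):  # streaming LLM
            sentence = buffer.add(token)
            if sentence:  # sentence boundary reached
                audio = await tts.synthesize(sentence)  # synthesize one sentence at a time
                await websocket.send(audio)
        tail = buffer.flush()  # hypothetical helper: returns leftover text with no end punctuation
        if tail:
            await websocket.send(await tts.synthesize(tail))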
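The `buffer` object in the pipeline is not defined in the snippet. One possible shape for it, as a sketch: accumulate tokens and release text whenever it ends in sentence punctuation. `SentenceBuffer` and its `flush` method are assumptions for illustration, and the boundary rule is deliberately naive (it will split on "3.5" or "Dr." if the period arrives at the end of a token).

```python
class SentenceBuffer:
    """Accumulates streamed tokens and releases text at sentence boundaries."""

    END = (".", "!", "?")

    def __init__(self):
        self._text = ""

    def add(self, token: str):
        """Append a token; return a full sentence if one just completed, else None."""
        self._text += token
        if self._text.rstrip().endswith(self.END):
            return self.flush()
        return None

    def flush(self):
        """Return whatever text is pending (or None) and reset the buffer."""
        sentence, self._text = self._text.strip(), ""
        return sentence or None
```

Calling `flush()` once the LLM stream ends matters because models do not always finish on punctuation, and without it the last fragment would never be spoken.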

Where It Breaks

