How I'd Build a Voice AI Pipeline
STT → LLM → TTS architecture, latency budget breakdown, streaming strategies, and where it actually breaks in production.
Voice AI has a deceptively simple stack: speech-to-text → LLM → text-to-speech. The hard part is doing it fast enough that it feels like a conversation and reliable enough that it doesn't break in a noisy environment. Here's how I'd build it.
The Latency Budget
A reply feels conversational when audio starts within roughly 500ms of the user finishing their turn. A workable budget: STT ~150ms, LLM first token ~200ms, TTS first audio chunk ~100ms, network ~50ms. Hitting it requires every component to stream; batching any single stage blows the budget.
| Stage | Target Latency | Tool Options | Bottleneck Risk |
|---|---|---|---|
| STT | 100–200ms | Deepgram Nova, Whisper Streaming, AssemblyAI | Accuracy on accents |
| LLM | 150–300ms TTFT | GPT-4o, Gemini Flash, Groq | Context length, streaming setup |
| TTS | 80–150ms | ElevenLabs, Cartesia, PlayHT | Voice cloning quality |
| Network | 50–100ms | WebSockets or WebRTC (a persistent connection is required) | Regional latency |
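To make the budget concrete, here is a minimal sketch that sums each stage's time to first output and reports the headroom. The numbers come from the budget and table above; `check_budget` and the constant names are illustrative, not part of any library.

```python
# Illustrative latency-budget check; all names here are hypothetical.
BUDGET_MS = 500  # end-to-end target from the text above

# Time to first output per stage, in milliseconds (the nominal budget).
NOMINAL_MS = {"stt": 150, "llm_first_token": 200, "tts_first_chunk": 100, "network": 50}

# Upper bounds from the table: every stage having a bad day at once.
WORST_MS = {"stt": 200, "llm_first_token": 300, "tts_first_chunk": 150, "network": 100}


def check_budget(stage_ms, budget_ms=BUDGET_MS):
    """Return (total, headroom); negative headroom means the budget is blown."""
    total = sum(stage_ms.values())
    return total, budget_ms - total


for name, stages in (("nominal", NOMINAL_MS), ("worst", WORST_MS)):
    total, headroom = check_budget(stages)
    print(f"{name}: total={total}ms headroom={headroom}ms")
```

The nominal figures leave no headroom at all, and the table's upper bounds overshoot the target by half, which is why tail latency deserves as much attention as the averages.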
Streaming Is Non-Negotiable
Every stage must stream. STT emits partial transcripts while the user is still speaking. The LLM can begin working from interim results before the final transcript arrives. TTS synthesizes and plays audio sentence by sentence as the LLM generates tokens, rather than waiting for the full response. Without streaming, end-to-end latency is the sum of every stage's full processing time. With streaming, time to first audio shrinks to roughly the sum of each stage's time to first output, because each downstream stage starts as soon as the one before it emits anything.
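Sentence-by-sentence TTS needs something between the LLM and the synthesizer that turns a token stream into complete sentences. The sketch below is one simple way to do that; `SentenceBuffer` is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is an assumption that will mis-split abbreviations such as "Dr. Smith".

```python
import re
from typing import Optional


class SentenceBuffer:
    """Accumulates LLM tokens and releases text one sentence at a time."""

    # Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    _BOUNDARY = re.compile(r"[.!?]\s")

    def __init__(self) -> None:
        self._text = ""

    def add(self, token: str) -> Optional[str]:
        """Append a token; return a finished sentence if one is ready, else None."""
        self._text += token
        match = self._BOUNDARY.search(self._text)
        if match is None:
            return None
        sentence = self._text[: match.end()].strip()
        self._text = self._text[match.end():]
        return sentence

    def flush(self) -> Optional[str]:
        """Return whatever is left once the LLM stream has ended."""
        rest, self._text = self._text.strip(), ""
        return rest or None


buf = SentenceBuffer()
tokens = ["Sure", ".", " I", " can", " help", " with", " that", "!", " What", " city", "?"]
spoken = [s for t in tokens if (s := buf.add(t))]
tail = buf.flush()
print(spoken, tail)
```

A sentence is only released once the token after the punctuation arrives, so the final sentence always has to be collected with `flush()` when the stream ends.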
Turn Detection
The hardest problem in voice AI isn't latency — it's knowing when the user has finished speaking. VAD (Voice Activity Detection) detects silence, but silence isn't always the end of a turn. Build a two-stage system: VAD triggers a 500ms silence window, then a small classifier decides if it's a natural pause (within a sentence) or a turn end. Without this, your agent constantly interrupts users mid-thought.
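The two-stage idea can be sketched as follows. Everything here is illustrative: the 20ms frame size, the 2-second hard cap on silence, and `looks_complete` (a word-list heuristic standing in for the small classifier) are assumptions, and a production system would use a trained model for stage two.

```python
from dataclasses import dataclass
from typing import Callable

FRAME_MS = 20             # assumed VAD frame size
SILENCE_WINDOW_MS = 500   # silence window from the text above
MAX_SILENCE_MS = 2000     # assumed hard cap so the agent never waits forever

_DANGLING = {"and", "but", "so", "because", "or", "to", "the", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Stand-in classifier: a trailing filler or conjunction means 'still thinking'."""
    words = transcript.lower().rstrip(".!?, ").split()
    return bool(words) and words[-1] not in _DANGLING


@dataclass
class TurnDetector:
    is_complete: Callable[[str], bool] = looks_complete
    silence_ms: int = 0

    def update(self, frame_is_speech: bool, transcript: str) -> bool:
        """Feed one VAD frame; return True when the turn has ended."""
        if frame_is_speech:
            self.silence_ms = 0            # stage 1: any speech resets the window
            return False
        self.silence_ms += FRAME_MS
        if self.silence_ms < SILENCE_WINDOW_MS:
            return False
        # Stage 2: the window has elapsed, so ask the classifier (or give up waiting).
        if self.silence_ms >= MAX_SILENCE_MS or self.is_complete(transcript):
            self.silence_ms = 0
            return True
        return False                       # natural pause: keep listening


detector = TurnDetector()
ended = [detector.update(False, "book a table for two") for _ in range(25)]
print(ended.index(True))  # index of the frame at which the turn ended
```

With a transcript that trails off, such as "I want to fly to", the same detector keeps listening past the 500ms window and only gives up at the hard cap.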
Interruption Handling
Users interrupt. When the user speaks while the AI is talking, you need to: (1) detect the interruption via VAD, (2) stop TTS audio immediately, (3) cancel the pending LLM generation, (4) process the new user input. This requires coordinated cancellation across all three pipeline stages — a complexity that's easy to underestimate.
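One way to get coordinated cancellation is to run the whole response (LLM generation, synthesis, playback) as a single `asyncio` task, so that cancelling the task unwinds every stage at once. The sketch below simulates that with sleeps; `ResponseController` and its methods are hypothetical names, and a real pipeline would also need to flush audio already queued on the client.

```python
import asyncio
from typing import List, Optional


class ResponseController:
    """Owns the in-flight response so one call cancels LLM and TTS work together."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self.played: List[str] = []

    async def _respond(self, sentences: List[str]) -> None:
        # Stand-in for: stream LLM tokens -> synthesize -> send audio.
        for sentence in sentences:
            await asyncio.sleep(0.05)      # pretend generation + synthesis take time
            self.played.append(sentence)

    def start(self, sentences: List[str]) -> None:
        self._task = asyncio.create_task(self._respond(sentences))

    async def interrupt(self) -> None:
        """Steps 2 and 3: stop playback and cancel pending generation."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task           # wait until the task has unwound
            except asyncio.CancelledError:
                pass
        self._task = None


async def demo() -> List[str]:
    ctl = ResponseController()
    ctl.start(["One.", "Two.", "Three.", "Four."])
    await asyncio.sleep(0.12)              # step 1: VAD fires mid-response
    await ctl.interrupt()
    ctl.start(["New answer."])             # step 4: handle the new user input
    await asyncio.sleep(0.2)
    return ctl.played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Awaiting the cancelled task before starting the next response matters: it guarantees the old response cannot emit another audio chunk after the new one has begun.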
The Architecture
```python
# Simplified voice pipeline. `stt`, `llm`, `tts`, `buffer`, `websocket`, and
# `is_turn_end` stand in for your own clients and helpers.
async def voice_pipeline(audio_stream):
    async for transcript in stt.stream(audio_stream):         # streaming STT
        if is_turn_end(transcript):
            async for token in llm.stream(transcript.text):   # streaming LLM
                sentence = buffer.add(token)
                if sentence:                                   # sentence boundary reached
                    audio = await tts.synthesize(sentence)     # TTS, one sentence at a time
                    await websocket.send(audio)
```
Where It Breaks
- Noisy environments: STT accuracy drops. Use noise-robust models (Deepgram Nova-3) and consider a denoising preprocessor.
- Accents and domain vocabulary: Fine-tune STT on your domain or use custom vocabulary hints.
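- Latency spikes: P99 matters more than P50, because a single slow LLM response breaks the conversational feel. Set a timeout on time to first token and fall back to a short filler phrase or a faster model when it fires.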
- Context management: Voice conversations can be long. Summarize older turns, keep recent turns verbatim — same as text multi-turn.
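The timeout-plus-fallback advice for latency spikes can be sketched with `asyncio.wait_for`. The 300ms ceiling, the filler phrase, and the `fast_llm` / `slow_llm` stand-ins are assumptions for illustration; in practice `generate` would be whatever coroutine returns the first chunk of your LLM's response.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK = "One moment."        # hypothetical filler phrase
FIRST_CHUNK_TIMEOUT_S = 0.3     # assumed ceiling on time to first output


async def respond_with_deadline(
    generate: Callable[[str], Awaitable[str]],
    prompt: str,
    timeout_s: float = FIRST_CHUNK_TIMEOUT_S,
) -> str:
    """Return the model's first chunk, or a canned phrase if it is too slow."""
    try:
        return await asyncio.wait_for(generate(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow call at this point.
        return FALLBACK


async def fast_llm(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return "Hi there."


async def slow_llm(prompt: str) -> str:
    await asyncio.sleep(1.0)    # simulated P99 spike
    return "Hi there."


async def demo():
    fast = await respond_with_deadline(fast_llm, "hello")
    slow = await respond_with_deadline(slow_llm, "hello", timeout_s=0.05)
    return fast, slow


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because `wait_for` cancels the slow call, the fallback replaces that response rather than racing it; retrying against a faster model would be a separate call made after the fallback has been sent.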