
How I'd Build a Voice AI Pipeline

STT → LLM → TTS architecture, latency budget breakdown, streaming strategies, and where it actually breaks in production.

Voice AI has a deceptively simple stack: speech-to-text → LLM → text-to-speech. The hard part is doing it fast enough that it feels like a conversation and reliable enough that it doesn't break in a noisy environment. Here's how I'd build it.

The Latency Budget

A reply feels conversational when the first audio arrives within roughly 500ms of the user finishing their turn. Your budget: STT ~150ms, LLM first token ~200ms, TTS first audio chunk ~100ms, network ~50ms. Every component needs to stream, because a single batched stage adds its full processing time to the total.

Stage	Target Latency	Tool Options	Bottleneck Risk
STT	100–200ms	Deepgram Nova, Whisper Streaming, AssemblyAI	Accuracy on accents
LLM	150–300ms TTFT	GPT-4o, Gemini Flash, Groq	Context length, streaming setup
TTS	80–150ms	ElevenLabs, Cartesia, PlayHT	Voice cloning quality
Network	50–100ms	WebSockets or WebRTC (persistent connection required)	Regional latency
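As a quick sanity check on the numbers above, here is a small sketch that adds up the per-stage targets and the table's ranges. The dictionary names are just labels chosen for this example; the values are copied from the budget paragraph and the table.

```python
# Per-stage latency in milliseconds.
# Point targets come from the budget paragraph; (low, high) ranges from the table.
TARGET_MS = {"stt": 150, "llm_ttft": 200, "tts_first_chunk": 100, "network": 50}
RANGE_MS = {
    "stt": (100, 200),
    "llm_ttft": (150, 300),
    "tts_first_chunk": (80, 150),
    "network": (50, 100),
}

target_total = sum(TARGET_MS.values())
best_case = sum(low for low, _ in RANGE_MS.values())
worst_case = sum(high for _, high in RANGE_MS.values())

print(target_total, best_case, worst_case)  # 500 380 750
```

The point targets land exactly on 500ms, but the upper ends of the ranges add up to 750ms, so a pipeline where every stage runs at the slow end of its range misses the budget.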

Streaming Is Non-Negotiable

Every stage must stream. STT emits partial transcripts as the user speaks. The LLM can start generating from interim results before the final transcript arrives, at the cost of discarding that work if the transcript changes. TTS synthesizes and plays audio as the LLM generates tokens, sentence by sentence, instead of waiting for the full response. Without streaming, your latency is the sum of every stage's full processing time. With streaming, it is closer to the sum of each stage's time to first output, because each stage starts as soon as the previous one emits something usable.
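To make the difference concrete, here is a rough back-of-the-envelope model. It is only a sketch: the `first_output_ms` values reuse the budget targets, while the `full_output_ms` values are made-up illustrative numbers, and the model ignores details such as the LLM needing a whole sentence before TTS can start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first usable chunk
    full_output_ms: int   # time until the stage has emitted everything


# Illustrative numbers only (full_output_ms values are assumptions).
STAGES = [
    Stage("stt", 150, 300),
    Stage("llm", 200, 1500),
    Stage("tts", 100, 1200),
]


def batch_latency_ms(stages):
    """Each stage waits for the previous one to finish completely."""
    return sum(s.full_output_ms for s in stages)


def streaming_latency_ms(stages):
    """Each stage starts as soon as the previous one emits its first chunk."""
    return sum(s.first_output_ms for s in stages)


print(batch_latency_ms(STAGES))      # 3000
print(streaming_latency_ms(STAGES))  # 450
```

Under these assumed numbers, the user waits about three seconds for the first audio in the batched case and under half a second in the streamed case.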

Turn Detection

The hardest problem in voice AI isn't latency; it's knowing when the user has finished speaking. VAD (Voice Activity Detection) detects silence, but silence isn't always the end of a turn. Build a two-stage system: first, VAD waits for a silence window of about 500ms; then a small classifier decides whether it was a natural pause (within a sentence) or a turn end. Without this, your agent constantly interrupts users mid-thought. Keep in mind that the silence window is spent before the response pipeline even starts, so tune it as low as the classifier's accuracy allows.
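The two-stage idea can be sketched as a small state machine. Everything here is hypothetical: `TurnDetector` and `naive_is_complete` are names invented for this example, the 20ms frame size is an assumption, and the word-list heuristic is only a stand-in for a trained classifier.

```python
from typing import Callable


class TurnDetector:
    """Two-stage end-of-turn detection: a silence window, then a classifier."""

    def __init__(self, is_complete: Callable[[str], bool],
                 silence_ms: int = 500, frame_ms: int = 20):
        self.is_complete = is_complete  # stage 2: pause vs. turn end
        self.silence_ms = silence_ms    # stage 1: required silence
        self.frame_ms = frame_ms        # duration of one VAD frame (assumed)
        self._silent_ms = 0

    def push(self, is_speech: bool, transcript_so_far: str) -> bool:
        """Feed one VAD frame; return True when the turn has ended."""
        if is_speech:
            self._silent_ms = 0
            return False
        self._silent_ms += self.frame_ms
        if self._silent_ms < self.silence_ms:
            return False
        # Silence window elapsed: restart it and ask the classifier.
        self._silent_ms = 0
        return self.is_complete(transcript_so_far)


def naive_is_complete(text: str) -> bool:
    """Stand-in classifier: a trailing connective or filler suggests a pause."""
    words = text.lower().rstrip(" .,!?").split()
    dangling = {"and", "but", "so", "because", "um", "uh", "the", "to"}
    return bool(words) and words[-1] not in dangling


frames = [True] * 5 + [False] * 25  # 100 ms of speech, then 500 ms of silence

detector = TurnDetector(naive_is_complete)
print(any(detector.push(f, "I want to book a flight and") for f in frames))  # False

detector = TurnDetector(naive_is_complete)
print(any(detector.push(f, "I want to book a flight") for f in frames))  # True
```

When the classifier reports a pause, the window restarts, so the detector checks again after another stretch of silence instead of firing on every subsequent frame.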

Interruption Handling

Users interrupt. When the user speaks while the AI is talking, you need to: (1) detect the interruption via VAD, (2) stop TTS audio immediately, (3) cancel the pending LLM generation, (4) process the new user input. This requires coordinated cancellation across all three pipeline stages — a complexity that's easy to underestimate.
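One way to get coordinated cancellation is to run the whole response (LLM streaming plus TTS playback) inside a single `asyncio` task and cancel that task on barge-in. The sketch below assumes that design; `ResponseController` and `speak` are hypothetical names, and the `sleep` call stands in for real generation and synthesis.

```python
import asyncio
from typing import Optional


class ResponseController:
    """Owns the in-flight response task so it can be cancelled on barge-in."""

    def __init__(self):
        self._task: Optional[asyncio.Task] = None

    async def start(self, coro) -> None:
        await self.cancel()  # never run two responses at once
        self._task = asyncio.create_task(coro)

    async def cancel(self) -> None:
        task, self._task = self._task, None
        if task is None or task.done():
            return
        task.cancel()
        try:
            await task  # wait until cleanup inside the task has run
        except asyncio.CancelledError:
            pass


async def speak(sentences, played, first_played):
    """Hypothetical response: stands in for LLM streaming plus TTS playback."""
    try:
        for sentence in sentences:
            await asyncio.sleep(0.05)  # pretend to generate and synthesize
            played.append(sentence)
            first_played.set()
    except asyncio.CancelledError:
        played.append("<stopped>")  # a real system would flush queued audio here
        raise


async def demo():
    played, first_played = [], asyncio.Event()
    controller = ResponseController()
    await controller.start(speak(["one", "two", "three"], played, first_played))
    await first_played.wait()  # user barges in after the first sentence
    await controller.cancel()
    return played


print(asyncio.run(demo()))  # ['one', '<stopped>']
```

Because cancellation propagates as `CancelledError` through every `await` inside the task, the LLM stream and the TTS stream stop together, and the `except` block is the single place to discard buffered audio.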

The Architecture

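# Simplified voice pipeline. stt, llm, tts, websocket, is_turn_end, and
# SentenceBuffer are placeholders for your own clients and helpers.
async def voice_pipeline(audio_stream):
    async for transcript in stt.stream(audio_stream):  # streaming STT
        if not is_turn_end(transcript):
            continue  # user is still speaking; keep listening
        buffer = SentenceBuffer()  # fresh buffer for each turn
        async for token in llm.stream(transcript.text):  # streaming LLM
            sentence = buffer.add(token)  # a full sentence, or None
            if sentence:  # sentence boundary reached
                await speak(sentence)
        tail = buffer.flush()  # trailing text with no closing punctuation
        if tail:
            await speak(tail)

async def speak(text):
    async for chunk in tts.stream(text):  # streaming TTS: send audio as it is synthesized
        await websocket.send(chunk)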
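The `buffer.add(token)` call in the pipeline is doing real work: it has to turn a token stream into sentence-sized chunks for TTS. Here is one minimal way such a buffer might look. The class name, the regex, and the `flush` method are assumptions made for this sketch, not part of any library.

```python
import re


class SentenceBuffer:
    """Accumulates streamed tokens and releases text one sentence at a time."""

    # A sentence ends at '.', '!' or '?' followed by whitespace, so "3.5"
    # stays intact. Abbreviations such as "Dr. Smith" will still split.
    _BOUNDARY = re.compile(r"[.!?]\s+")

    def __init__(self):
        self._text = ""

    def add(self, token: str):
        """Append a token; return one complete sentence if ready, else None."""
        self._text += token
        match = self._BOUNDARY.search(self._text)
        if match is None:
            return None
        sentence = self._text[:match.end()].strip()
        self._text = self._text[match.end():]
        return sentence

    def flush(self):
        """Return whatever is left at the end of the stream, or None."""
        rest, self._text = self._text.strip(), ""
        return rest or None


buffer = SentenceBuffer()
tokens = ["Sure", ".", " I", " can", " help", "!", " What", " time", "?"]
sentences = [s for s in (buffer.add(t) for t in tokens) if s]
print(sentences)       # ['Sure.', 'I can help!']
print(buffer.flush())  # What time?
```

Note that a sentence is only released once the token after the punctuation arrives, which is why the final sentence has to come out through `flush()` when the LLM stream ends.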

Where It Breaks

Noisy environments: STT accuracy drops. Use noise-robust models (Deepgram Nova-3) and consider a denoising preprocessor.
Accents and domain vocabulary: Fine-tune STT on your domain or use custom vocabulary hints.
Latency spikes: P99 matters more than P50. One slow LLM response breaks the conversation feel. Set a timeout on the first token and fall back to something fast, such as a smaller model or a short filler phrase.
Context management: Voice conversations can be long. Summarize older turns, keep recent turns verbatim — same as text multi-turn.
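For the latency-spike case, a timeout on the first token is straightforward to sketch with `asyncio.wait_for`. The `respond` wrapper, the fallback phrase, and the two fake LLM generators below are hypothetical; a production system might retry against a faster model instead of speaking a canned line.

```python
import asyncio

FALLBACK = "Sorry, give me a second."


async def respond(token_stream, timeout_s: float = 0.5, fallback: str = FALLBACK):
    """Yield tokens from token_stream, or a fallback if the first token is late."""
    iterator = token_stream.__aiter__()
    try:
        first = await asyncio.wait_for(iterator.__anext__(), timeout_s)
    except asyncio.TimeoutError:
        yield fallback  # the slow request has been cancelled by wait_for
        return
    except StopAsyncIteration:
        return  # the stream was empty
    yield first
    async for token in iterator:
        yield token


async def fast_llm():
    yield "Hello"
    yield " there"


async def slow_llm():
    await asyncio.sleep(0.2)  # simulated latency spike
    yield "late answer"


async def collect(stream):
    return [token async for token in stream]


print(asyncio.run(collect(respond(fast_llm(), timeout_s=0.05))))  # ['Hello', ' there']
print(asyncio.run(collect(respond(slow_llm(), timeout_s=0.05))))  # ['Sorry, give me a second.']
```

The timeout applies only to the first token, since that is what the user perceives as the response delay; once tokens are flowing, the stream is passed through unchanged.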
