
Audio and Speech Models: How Whisper and the Speech Pipeline Work

From raw audio waveform to text: log-mel spectrograms, the encoder-decoder transformer architecture, why Whisper is so robust to accents and noise, and how to deploy speech-to-text in production at low cost.

Speech-to-text is one of the oldest AI problems. For decades it was solved with highly engineered pipelines: acoustic models, language models, pronunciation dictionaries, and beam search decoders, each component tuned separately. OpenAI's Whisper (2022) replaced all of that with a single end-to-end transformer trained on 680,000 hours of audio. The result was a model markedly more robust to accents, background noise, and domain-specific vocabulary than earlier systems trained on narrower corpora.

The Input Representation: Log-Mel Spectrograms

Raw audio is a 1D waveform: pressure over time, which Whisper resamples to 16,000 Hz. A 30-second clip is 480,000 samples, too long and too low-level a sequence to feed to a transformer directly. Whisper first converts audio to a log-mel spectrogram: compute the short-time Fourier transform (STFT) over 25ms windows with a 10ms hop, bin the resulting power spectrum into 80 mel-scale frequency bands (128 in large-v3), which compress the high frequencies humans are less sensitive to, then take the log.
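
The steps in that paragraph can be sketched in plain Python. This is a minimal illustration, not Whisper's actual preprocessing: it uses a naive DFT instead of an FFT, the HTK mel formula (an assumption; Whisper's own filterbank differs slightly), no edge padding, and none of the final normalization. All function names here are made up for the example.

```python
import cmath
import math

SAMPLE_RATE = 16_000
WIN = 400     # 25 ms window at 16 kHz
HOP = 160     # 10 ms hop at 16 kHz
N_MELS = 80

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edges: each filter spans three consecutive edge frequencies.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank

def log_mel(samples):
    """Return a list of frames, each holding N_MELS log10 mel energies."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / WIN) for n in range(WIN)]
    twiddle = [cmath.exp(-2j * math.pi * i / WIN) for i in range(WIN)]
    bank = mel_filterbank(N_MELS, WIN, SAMPLE_RATE)
    frames = []
    for start in range(0, len(samples) - WIN + 1, HOP):
        x = [samples[start + n] * hann[n] for n in range(WIN)]
        # Power spectrum of the windowed frame (naive DFT, positive bins only).
        power = [
            abs(sum(x[n] * twiddle[(k * n) % WIN] for n in range(WIN))) ** 2
            for k in range(WIN // 2 + 1)
        ]
        mel = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([math.log10(max(e, 1e-10)) for e in mel])
    return frames
```

Feeding it a tenth of a second of a 440 Hz sine tone yields a handful of frames whose energy concentrates in a low mel band. In practice this step would be done with NumPy or an audio library, since the naive DFT is far too slow for real workloads.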

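The result is a 2D representation: time on the X axis, mel frequency on the Y axis, log-magnitude as the value. A 30-second audio clip becomes an 80×3000 spectrogram. Rather than being treated as a 2D image, it is processed as a sequence of 3,000 frames with 80 channels each: a convolutional stem of two 1D convolutions over time (the second with stride 2) reduces it to 1,500 positions before the transformer encoder.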
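
The 80×3000 figure follows directly from the sample rate and hop size. A quick check of the arithmetic, assuming one frame per hop (edge padding makes this exact) and a stride-2 convolution ahead of the encoder as in the released models:

```python
SAMPLE_RATE = 16_000            # Hz
CLIP_SECONDS = 30
WINDOW_MS, HOP_MS = 25, 10
N_MELS = 80

n_samples = SAMPLE_RATE * CLIP_SECONDS        # samples in one 30-second clip
window = SAMPLE_RATE * WINDOW_MS // 1000      # STFT window length in samples
hop = SAMPLE_RATE * HOP_MS // 1000            # STFT hop length in samples
n_frames = n_samples // hop                   # one spectrogram column per hop
spectrogram_shape = (N_MELS, n_frames)
encoder_positions = n_frames // 2             # after the stride-2 convolution

print(n_samples, window, hop)                 # 480000 400 160
print(spectrogram_shape, encoder_positions)   # (80, 3000) 1500
```

Each encoder position therefore covers 20ms of audio, which is why the encoder sequence length stays constant no matter how much speech the clip contains.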

Mel scale: a perceptual frequency scale where equal distances represent equal perceived pitch differences. Humans perceive a difference at low frequencies (200 Hz vs 400 Hz) more acutely than the same 200 Hz gap at high frequencies (4000 Hz vs 4200 Hz). The mel scale compresses the high end accordingly.
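
To make the compression concrete, here is a small sketch using one common definition of the mel scale (the HTK formula). Other variants exist and give slightly different numbers, so treat the formula as an assumption, not as the one Whisper uses internally.

```python
import math

def hz_to_mel(hz):
    """HTK-style mel formula: roughly linear at low Hz, logarithmic at high Hz."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

# The same 200 Hz gap, measured at the low end and at the high end.
low_gap = hz_to_mel(400) - hz_to_mel(200)
high_gap = hz_to_mel(4200) - hz_to_mel(4000)

print(f"200 -> 400 Hz spans {low_gap:.0f} mel")
print(f"4000 -> 4200 Hz spans {high_gap:.0f} mel")
```

Under this formula the low-frequency gap covers several times more mel distance than the high-frequency one, so a mel filterbank spends more of its 80 bands where the ear resolves pitch best.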

The Architecture: Encoder-Decoder Transformer

Whisper uses a standard encoder-decoder transformer (the architecture from the original 'Attention Is All You Need' paper). The encoder processes the log-mel spectrogram with a 2-layer CNN stem (for local feature extraction), adds sinusoidal positional encodings, and applies a stack of transformer encoder blocks. The decoder is an autoregressive transformer that generates text tokens one at a time, attending to the encoder output via cross-attention.
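
"One token at a time" is easiest to see as a loop. The sketch below shows greedy autoregressive decoding with a stand-in for the model: `next_token_scores` is a hypothetical callable representing the decoder plus its cross-attention over the encoder output, and the default length limit is arbitrary. Whisper's real decoding also supports beam search and temperature fallback, which are left out here.

```python
def greedy_decode(next_token_scores, prompt, eot="<|endoftext|>", max_tokens=50):
    """Repeatedly append the highest-scoring next token until end-of-text."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        # The model sees everything generated so far and scores each candidate.
        scores = next_token_scores(tokens)   # dict: token -> score
        best = max(scores, key=scores.get)
        if best == eot:
            break
        tokens.append(best)
    return tokens[len(prompt):]              # generated tokens only


# Toy stand-in model that "transcribes" a fixed script, for illustration.
SCRIPT = ["hello", " world", "<|endoftext|>"]

def toy_model(tokens):
    position = len(tokens) - 1               # prompt below has one token
    return {SCRIPT[position]: 1.0, "<unk>": 0.0}

print(greedy_decode(toy_model, ["<|startoftranscript|>"]))
```

Because every step depends on all previous tokens, decoding cost grows with transcript length, while the encoder runs only once per 30-second window.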

Input: 30-second audio chunks (shorter clips are zero-padded).
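Encoder: CNN stem + transformer encoder → a fixed-length sequence of 1,500 hidden states (one per 20ms of audio), not a single context vector.
Decoder: generates transcription tokens autoregressively. The sequence opens with <|startoftranscript|>, followed by a language token (e.g., <|en|>) and a task token.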
Special tokens handle: language detection, task (transcription vs. translation), timestamp prediction, no-speech detection.
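
The input and special-token conventions above can be sketched as two small helpers. The token strings match the ones Whisper's tokenizer defines, with the sequence starting at <|startoftranscript|> in the released models. The helper names are made up for this example, and a real implementation works with token IDs instead of strings.

```python
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 30 * SAMPLE_RATE

def fit_to_30s(samples, length=CHUNK_SAMPLES):
    """Zero-pad short clips and truncate long ones to exactly `length` samples."""
    clipped = list(samples[:length])
    return clipped + [0.0] * (length - len(clipped))

def decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Special-token prefix that conditions the decoder before any text appears."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder interleaves timestamp tokens with text.
        prompt.append("<|notimestamps|>")
    return prompt

print(decoder_prompt())
print(decoder_prompt("fr", "translate", timestamps=True))
print(len(fit_to_30s([0.1] * 100)))
```

Steering the model through this prefix is what lets one set of weights handle language identification, transcription, and translation without separate heads.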

Why Whisper is Robust

Whisper was trained on 680,000 hours of audio paired with transcripts collected from the internet, a deliberately diverse dataset covering many speakers, recording conditions, and noise levels across 99 languages. Earlier ASR systems were typically trained on clean read-speech corpora (like LibriSpeech) and degraded sharply on audio that did not match that distribution. Whisper's robustness comes primarily from the scale and diversity of its training data, not from architectural innovation.

Whisper Model Sizes

Model	Parameters	Relative Speed (vs. large)	Best Use Case
tiny	39M	~32×	Real-time transcription on low-power devices
base	74M	~16×	Fast transcription with acceptable accuracy
small	244M	~6×	Balanced accuracy/speed
medium	769M	~2×	High accuracy, production batch transcription
large-v3	1.55B	1×	Maximum accuracy, offline processing
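
Parameter counts translate into a rough lower bound on memory. The sketch below multiplies the counts from the table by bytes per weight; it covers weights only, so activations, the decoder's attention cache, and framework overhead come on top. The helper name is invented for the example.

```python
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v3": 1.55e9,
}
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}

def weight_gb(model, dtype):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS[model] * BYTES_PER_WEIGHT[dtype] / 1e9

for name in PARAMS:
    sizes = ", ".join(f"{d}: {weight_gb(name, d):.2f} GB" for d in BYTES_PER_WEIGHT)
    print(f"{name:9s} {sizes}")
```

This is one reason INT8 quantization matters for deployment: it halves the weight footprint of large-v3 relative to FP16, which can decide whether the model fits on a smaller GPU.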

Production Deployment Considerations

Whisper processes audio in 30-second chunks — long-form transcription requires chunking with overlap to handle words cut at chunk boundaries.
Hallucination on silence: Whisper will sometimes generate plausible text on silent or near-silent audio. Always run VAD (Voice Activity Detection) before Whisper to filter silent chunks.
For real-time streaming: Whisper itself is not a streaming model; it processes complete 30-second windows. Running Faster-Whisper (CTranslate2 backend) with sliding-window inference is a common production approach for near-real-time transcription.
Punctuation and capitalization: Whisper outputs punctuation and capitalization, which most traditional ASR systems don't. This matters for downstream NLP processing.
Word timestamps: Whisper can output per-word timestamps in addition to text — useful for subtitle generation and speaker diarization alignment.
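
The first two points, overlapping chunks and silence filtering, can be sketched together. The energy threshold below is a crude stand-in for a real VAD model, and the 5-second overlap and -45 dB threshold are illustrative values, not recommendations from Whisper's authors.

```python
import math

SAMPLE_RATE = 16_000

def chunk_spans(n_samples, chunk_s=30.0, overlap_s=5.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices of overlapping windows over the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    spans, start = [], 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break
        start += step
    return spans

def rms_dbfs(samples):
    """RMS level in dB relative to full scale, for samples in [-1, 1]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def speech_spans(samples, threshold_db=-45.0, **chunking):
    """Drop chunks whose overall level is below the threshold (crude VAD)."""
    return [
        (a, b)
        for a, b in chunk_spans(len(samples), **chunking)
        if rms_dbfs(samples[a:b]) > threshold_db
    ]

# 70 seconds of audio -> windows starting every 25 seconds.
print([(a // SAMPLE_RATE, b // SAMPLE_RATE) for a, b in chunk_spans(70 * SAMPLE_RATE)])
```

The overlapping regions get transcribed twice, so a real pipeline also needs a merge step that deduplicates text across chunk boundaries, typically by aligning on timestamps.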

For production ASR: Faster-Whisper (a reimplementation on the CTranslate2 inference engine) is reported to be up to 4× faster than the original PyTorch implementation at the same accuracy, and it uses less memory. For high-volume batch transcription, use large-v3 via Faster-Whisper with INT8 quantization. Cost: on the order of $0.0001 per minute of audio in raw compute, depending on GPU pricing and utilization, which is well below typical per-minute pricing for cloud ASR APIs.
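
A minimal sketch of that setup, assuming the faster-whisper package is installed and a CUDA GPU is available. The `transcribe_file` and `cost_per_audio_minute` helpers are invented for this example, and the GPU price and speed in the usage line are hypothetical inputs chosen only to show how a figure near $0.0001 per minute can arise.

```python
def transcribe_file(path, model_size="large-v3"):
    """Transcribe one file with INT8 weights and the built-in VAD filter."""
    # Imported lazily so the cost helper below works without the package.
    from faster_whisper import WhisperModel  # pip install faster-whisper

    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
    segments, info = model.transcribe(path, vad_filter=True, word_timestamps=True)
    # `segments` is a generator: decoding happens as it is consumed.
    return info.language, [(seg.start, seg.end, seg.text) for seg in segments]

def cost_per_audio_minute(gpu_usd_per_hour, realtime_factor):
    """Compute cost per minute of audio from GPU price and speed vs. real time."""
    audio_minutes_per_gpu_hour = 60.0 * realtime_factor
    return gpu_usd_per_hour / audio_minutes_per_gpu_hour

# Hypothetical inputs: a $0.50/hour GPU processing audio at 80x real time.
print(f"${cost_per_audio_minute(0.50, 80):.6f} per audio minute")
```

The cost formula makes the sensitivity explicit: halving throughput or doubling the GPU price doubles the per-minute cost, so it is worth measuring the real-time factor on your own audio before quoting a number.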
