GenAI Systems Lab
AI Engineering

Audio and Speech Models: How Whisper and the Speech Pipeline Work

From raw audio waveform to text: log-mel spectrograms, the encoder-decoder transformer architecture, why Whisper is so robust to accents and noise, and how to deploy speech-to-text in production at low cost.

Speech-to-text is one of the oldest AI problems. For decades it was solved with heavily engineered pipelines (acoustic models, language models, pronunciation dictionaries, beam search decoders), with each component tuned separately. OpenAI's Whisper (2022) replaced that stack with a single end-to-end transformer trained on 680,000 hours of audio. The resulting model is markedly more robust to accents, background noise, and domain-specific vocabulary than earlier systems trained on narrow, clean corpora.

The Input Representation: Log-Mel Spectrograms

Raw audio is a 1D waveform: air pressure over time, sampled at 16,000 Hz for speech. At that rate a 30-second clip is 480,000 samples, which is too long and too low-level to feed to a transformer directly. Whisper first converts audio to a log-mel spectrogram in three steps: compute the short-time Fourier transform (STFT) over 25 ms windows (400 samples) with a 10 ms hop (160 samples), bin the resulting frequency magnitudes into 80 mel-scale frequency bands (which compress the high frequencies humans are less sensitive to; large-v3 uses 128 bands), then take the log.
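The three steps above (windowed STFT, mel binning, log) can be sketched in pure Python. This is a minimal illustration, not Whisper's actual preprocessing: it assumes the HTK-style mel formula and unnormalized triangular filters, uses a naive DFT instead of an FFT, and skips the padding and value normalization a real implementation applies. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # Hz
WIN = 400             # 25 ms window at 16 kHz
HOP = 160             # 10 ms hop at 16 kHz
N_MELS = 80


def hz_to_mel(f):
    # HTK-style formula (an assumption; other mel variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_edges_hz(n_mels, sample_rate):
    """n_mels + 2 band edges, evenly spaced in mel from 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2)
    return [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]


def mel_filterbank(n_mels, n_fft, sample_rate):
    """One triangular filter per mel band, as weights over the FFT bins."""
    edges = mel_edges_hz(n_mels, sample_rate)
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def power_spectrum(frame):
    """Naive O(n^2) DFT power spectrum; a real pipeline would use an FFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def log_mel_spectrogram(samples):
    """Return N_MELS rows, each holding one log-energy value per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / WIN) for t in range(WIN)]
    bank = mel_filterbank(N_MELS, WIN, SAMPLE_RATE)
    frames = []
    for start in range(0, len(samples) - WIN + 1, HOP):
        frame = [s * w for s, w in zip(samples[start:start + WIN], window)]
        power = power_spectrum(frame)
        frames.append([math.log10(max(1e-10, sum(w * p for w, p in zip(row, power))))
                       for row in bank])
    return [list(band) for band in zip(*frames)]  # transpose to mel x time
```

Feeding it a short 440 Hz sine, for example `log_mel_spectrogram([math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)])`, yields 80 rows whose strongest band is one of the filters covering 440 Hz. Without padding, 800 samples give only 3 frames; implementations that pad the signal produce one frame per hop.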

The result is a 2D representation: time on the X axis, mel frequency on the Y axis, log-magnitude as the value. A 30-second audio clip becomes an 80×3000 spectrogram. Although it looks like an image, Whisper does not process it as one: the 80 mel bands act as input channels to a small stem of 1D convolutions over time, and the stem's output is fed to the transformer encoder.
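The 80×3000 figure follows directly from the hop size, and the arithmetic is easy to check. The sketch below assumes one frame per 10 ms hop (which is what a padded STFT gives) and also accounts for the stride of 2 in the second convolution of Whisper's stem, which halves the number of time steps the encoder sees. The function names are illustrative.

```python
SAMPLE_RATE = 16_000  # Hz


def spectrogram_shape(seconds, n_mels=80, hop_ms=10):
    """(mel bands, frames) for a clip, assuming one frame per hop."""
    hop_samples = SAMPLE_RATE * hop_ms // 1000   # 160 samples
    n_frames = seconds * SAMPLE_RATE // hop_samples
    return n_mels, n_frames


def encoder_positions(n_frames, conv_stride=2):
    """Time steps left after the strided convolution in the stem."""
    return n_frames // conv_stride


n_mels, n_frames = spectrogram_shape(30)
print(n_mels, n_frames, encoder_positions(n_frames))
```

For a 30-second clip this gives 80 bands, 3000 frames, and 1500 encoder positions, so each encoder position covers 20 ms of audio.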

Mel scale: a perceptual frequency scale where equal distances represent equal perceived pitch differences. Humans hear the same 200 Hz gap far more clearly at low frequencies (200 Hz vs 400 Hz) than at high frequencies (4000 Hz vs 4200 Hz). The mel scale compresses the high end accordingly.
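To put numbers on that example, here is one widely used Hz-to-mel formula (the HTK-style variant; other variants exist, and this is not necessarily the exact one a given Whisper implementation uses), applied to the two pairs of frequencies above. Values are rounded.

```latex
m(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)

\begin{aligned}
m(400) - m(200)   &\approx 509 - 283   = 226 \ \text{mel} \\
m(4200) - m(4000) &\approx 2193 - 2146 = 47 \ \text{mel}
\end{aligned}
```

The same 200 Hz gap spans roughly five times as many mels at the low end as at the high end, which is why evenly spaced mel bands devote more resolution to low frequencies.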

The Architecture: Encoder-Decoder Transformer

Whisper uses the standard encoder-decoder transformer architecture from the original 'Attention Is All You Need' paper. The encoder processes the log-mel spectrogram with a 2-layer convolutional stem (for local feature extraction), adds sinusoidal positional encodings, and then applies a stack of transformer encoder blocks. The decoder is an autoregressive transformer that generates text tokens one at a time, attending to the encoder output via cross-attention.
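The autoregressive loop is the part that most often causes confusion, so here is a minimal sketch of greedy decoding. It is not Whisper's decoding code: `next_token_logits` stands in for a trained decoder with cross-attention, `toy_decoder` is a hypothetical stub that simply copies token IDs out of a fake "encoder output", and the token IDs and vocabulary size are made up.

```python
def greedy_decode(next_token_logits, encoder_output, prompt, eot, max_new_tokens=50):
    """Generate tokens one at a time, always taking the highest-scoring one.

    next_token_logits(tokens, encoder_output) must return one score per
    vocabulary entry for the next position.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens, encoder_output)
        best = max(range(len(logits)), key=lambda i: logits[i])
        if best == eot:          # end-of-transcript token: stop generating
            break
        tokens.append(best)      # feed the choice back in on the next step
    return tokens[len(prompt):]


# Hypothetical toy setup: 10-token vocabulary, made-up special token IDs.
VOCAB_SIZE, SOT, EOT = 10, 0, 1


def toy_decoder(tokens, encoder_output):
    """Stub decoder: 'reads' the next token ID straight from encoder_output."""
    position = len(tokens) - 1   # tokens generated so far (prompt is [SOT])
    target = encoder_output[position] if position < len(encoder_output) else EOT
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


print(greedy_decode(toy_decoder, [5, 7, 3], prompt=[SOT], eot=EOT))
```

In a real model each step re-runs the decoder over the tokens so far while cross-attending to the fixed encoder output; production decoders typically add beam search or sampling on top of this basic loop.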

Why Whisper is Robust

Whisper was trained on 680,000 hours of audio from the internet, a deliberately diverse dataset including YouTube videos, podcasts, lectures, phone calls, and noisy recordings in 99 languages. Earlier ASR systems were trained on clean read-speech corpora (like LibriSpeech) and degraded sharply on anything that didn't match that distribution. Whisper's robustness comes mainly from the scale and diversity of its training data rather than from architectural innovation.

Whisper Model Sizes

| Model | Parameters | Relative Speed | Best Use Case |
|---|---|---|---|
| tiny | 39M | ~32× | Real-time transcription on low-power devices |
| base | 74M | ~16× | Fast transcription with acceptable accuracy |
| small | 244M | ~6× | Balanced accuracy/speed |
| medium | 769M | ~2× | High accuracy, production batch transcription |
| large-v3 | 1.55B | 1× (baseline) | Maximum accuracy, offline processing |
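When choosing a size, a quick estimate of weight memory helps rule out models that will not fit on the target device. The sketch below multiplies the parameter counts from the table by bytes per parameter. It covers weights only (no activations, decoding cache, or quantization overhead) and uses 1 GB = 10^9 bytes, so treat the results as rough lower bounds.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Parameter counts taken from the table above.
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v3": 1.55e9,
}


def weight_memory_gb(model, dtype="float16"):
    """Approximate size of the model weights in GB (weights only)."""
    return PARAMS[model] * BYTES_PER_PARAM[dtype] / 1e9


for name in PARAMS:
    print(f"{name:9s} fp16 ~{weight_memory_gb(name):.2f} GB, "
          f"int8 ~{weight_memory_gb(name, 'int8'):.2f} GB")
```

By this estimate large-v3 needs about 3.1 GB for float16 weights and about half that with INT8, while tiny fits in well under 100 MB at float16.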

Production Deployment Considerations

For production ASR, Faster-Whisper (a reimplementation built on the CTranslate2 inference engine) runs up to 4× faster than the original PyTorch implementation at the same accuracy. For high-volume batch transcription, use large-v3 via Faster-Whisper with INT8 quantization. Compute cost can then be as low as ~$0.0001 per minute of audio, depending on GPU pricing and throughput, which is far below typical per-minute pricing for cloud ASR APIs.
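A minimal sketch of that setup, plus the arithmetic behind a per-minute cost figure, might look like the following. The `WhisperModel` call follows the usage shown in the faster-whisper README (a CUDA GPU and the `faster-whisper` package are assumed). The GPU price and real-time factor in the example are hypothetical numbers chosen for illustration, not measurements.

```python
def cost_per_audio_minute(gpu_usd_per_hour, realtime_factor):
    """Compute cost in USD per minute of audio.

    realtime_factor is how many seconds of audio are transcribed per
    wall-clock second on the GPU.
    """
    gpu_usd_per_minute = gpu_usd_per_hour / 60.0
    return gpu_usd_per_minute / realtime_factor


def transcribe(path, model_size="large-v3"):
    """Transcribe one audio file with faster-whisper using INT8 weights."""
    # Imported lazily so the cost helper works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
    segments, info = model.transcribe(path, beam_size=5)
    # segments is a generator; transcription runs as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)


# Hypothetical inputs: a $0.60/hour GPU running at 100x real time.
print(f"${cost_per_audio_minute(0.60, 100.0):.4f} per audio minute")
```

With those illustrative inputs the result is $0.0001 per audio minute; measuring the real-time factor on your own hardware and audio is the only way to get a trustworthy number.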
