Why Transformers Won: The RNN to LSTM to Transformer Arc

The full architectural progression as a systems story: where RNNs and LSTMs hit their ceiling, why attention solved both parallelism and long-range dependencies, and why decoder-only models dominate at scale.

The problem with processing sequences one step at a time

Before transformers, the dominant architecture for sequences was the Recurrent Neural Network. RNNs process tokens one at a time: the hidden state from token N is passed forward to token N+1, which updates it and passes it to N+2. The entire history of the sequence is compressed into a single fixed-size hidden state vector.
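
The recurrence described above can be sketched in a few lines. This is a minimal illustration in dependency-free Python, not a production implementation: the names `rnn_forward`, `W_xh`, and `W_hh` and the random weights are assumptions made for the example. It shows the two properties that matter here: the loop over time steps is strictly sequential, and the state has the same size whether the sequence has 5 tokens or 500.

```python
import math
import random

def rnn_forward(tokens, W_xh, W_hh, hidden_size):
    """Run a minimal tanh RNN over a sequence of input vectors.

    Returns the final hidden state. Step t cannot start until step t-1 ends.
    """
    h = [0.0] * hidden_size  # fixed-size state, regardless of sequence length
    for x in tokens:  # strictly sequential loop over time steps
        h = [
            math.tanh(
                sum(W_xh[i][j] * x[j] for j in range(len(x)))
                + sum(W_hh[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

random.seed(0)
input_size, hidden_size = 3, 4
# Random (untrained) weights, for illustration only.
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]

short_seq = [[random.random() for _ in range(input_size)] for _ in range(5)]
long_seq = [[random.random() for _ in range(input_size)] for _ in range(500)]

h_short = rnn_forward(short_seq, W_xh, W_hh, hidden_size)
h_long = rnn_forward(long_seq, W_xh, W_hh, hidden_size)
print(len(h_short), len(h_long))  # prints: 4 4
```

Whatever the model needs to remember about 500 tokens has to fit in the same four numbers as for 5 tokens.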

This design has two fundamental problems that compound at scale. The first is the vanishing gradient: backpropagating through hundreds of sequential steps multiplies the gradient by one factor per step, and when those factors are smaller than one the product shrinks exponentially (when they are larger than one, it explodes instead). The model struggles to learn long-range dependencies: by the time the error signal from token 100 travels back to token 1, it is often too small to drive learning.
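
A scalar toy model makes the shrinkage concrete. The sketch below assumes a one-unit RNN, h_t = tanh(w * h_{t-1} + x), with hand-picked values for `w`, `h0`, and `x`. The function name `gradient_through_time` is made up for this example. The gradient of the final state with respect to the initial state is a product with one factor per step, and each factor here is below 0.9.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.1):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
    return grad

for steps in (1, 10, 100):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Since every factor is at most 0.9, the gradient after 100 steps is bounded by 0.9 to the power 100, which is below 1e-4. Real RNNs multiply Jacobian matrices rather than scalars, but the same compounding applies.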

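The second is that sequential processing cannot be parallelized across time. Token N+1 must wait for token N, so training on a 10,000-token sequence requires 10,000 dependent steps, one after another. Hardware can parallelize across a batch and within each step, but it cannot shorten that chain. This is a hard ceiling on throughput regardless of how many cores are available.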

LSTM: the partial fix

Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997) addressed the vanishing gradient with a gating mechanism: a cell state that flows across time steps through a gated additive update rather than repeated multiplication and squashing, controlled by learned gates (input, forget, output; the forget gate was added to the original design a few years later). The forget gate can learn to keep important information alive across many steps instead of letting it decay.
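
The gating idea can be sketched with a single-unit LSTM step. This is an illustrative sketch, not a trained model: the parameter dictionary `params` is hand-picked so that the forget gate is saturated open and the input gate is nearly shut, and the names `lstm_step` and `params` are assumptions for the example. It uses the standard update, new cell state = forget gate * old cell state + input gate * candidate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM. `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))  # how much of the old cell state to keep
    i = sigmoid(pre("input"))   # how much new information to write
    o = sigmoid(pre("output"))  # how much of the cell state to expose
    g = math.tanh(pre("cand"))  # candidate value to write

    c = f * c_prev + i * g      # gated additive update of the cell state
    h = o * math.tanh(c)
    return h, c

# Hand-picked (not learned) parameters: forget gate open, input gate shut.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "cand": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # pretend the cell stored the value 1.0 at step 0
for _ in range(200):
    h, c = lstm_step(x=0.3, h_prev=h, c_prev=c, p=params)
print(c)  # still above 0.99 after 200 steps
```

With the forget gate near 1, the stored value survives 200 steps almost unchanged. The loop is still sequential, though, and the state is still a fixed size.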

LSTMs were genuinely better. They powered state-of-the-art results in machine translation, speech recognition, and text generation through the mid-2010s. But they did not solve parallelism: they were still sequential. And they still compressed the entire past into a fixed-size state. Over sequences of several hundred tokens, even LSTMs struggled with long-range dependencies.

The architectural insight that LSTM still missed: the bottleneck is not how well you carry forward information — it is that you must compress everything into a single vector. The fix is not a better compression mechanism. It is to stop compressing.

Attention: the key insight

Attention was first introduced as an add-on to RNN-based translation models (Bahdanau et al., 2014). 'Attention Is All You Need' (Vaswani et al., 2017) made it the core of the architecture and dropped recurrence entirely, replacing sequential compression with direct access. Instead of reading from a hidden state that must carry the entire history, each token can directly attend to any other token in the sequence, weighted by learned relevance.
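
The core computation is scaled dot-product attention: softmax(Q K^T / sqrt(d)) V. The sketch below is written in dependency-free Python for readability. Real implementations use batched tensor operations, and a real layer would first apply learned projections to produce Q, K, and V, which are omitted here. The toy vectors in `X` are made up for the example.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    scores = [
        [sum(q[j] * k[j] for j in range(d)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    out = [
        [sum(w[t] * V[t][c] for t in range(len(V))) for c in range(len(V[0]))]
        for w in weights
    ]
    return out, weights

# Three toy token vectors used as queries, keys, and values at once.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
print(len(weights), len(weights[0]))  # N x N weight matrix, prints: 3 3
```

Each output row depends only on the inputs, never on another output row, so all rows can be computed at the same time.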

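Two consequences follow immediately. First, parallelism: during training, all attention computations across the sequence can run simultaneously, because each position depends on the full input rather than on the previous step's output. This maps well onto GPU hardware and makes training far faster in practice. (Generation at inference time is still one token at a time.) Second, long-range dependencies: token 1 and token 10,000 can influence each other directly. The path between them is a single attention step per layer rather than thousands of recurrent steps, so the gradient does not decay with distance.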

The cost is quadratic memory and compute in sequence length — the attention matrix is N×N. This is why long context remains expensive. The transformer did not make sequences free; it made them parallelizable and made long-range dependencies tractable.
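
A back-of-the-envelope calculation shows how quickly the N×N matrix grows. The sketch assumes 2 bytes per value (16-bit floats) and counts a single attention matrix for one head in one layer, fully materialized in memory. The function name `attention_matrix_bytes` is made up for this example. Some implementations avoid storing the full matrix, but the amount of computation remains quadratic.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=2):
    """Bytes for one N x N attention score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value

for n in (1_000, 10_000, 100_000):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"N={n:>7}: {mib:,.1f} MiB")
```

Each 10x increase in sequence length costs 100x the memory for this matrix, before multiplying by the number of heads and layers.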

Encoder-only, decoder-only, encoder-decoder

The original transformer had two components: an encoder that builds a representation of the input, and a decoder that generates the output attending to the encoder's representation. Different tasks use different parts.

Encoder-only (BERT, RoBERTa): reads the full input bidirectionally, so every token attends to every other token. Best for classification, NER, semantic similarity. Sees the full context; not designed to generate text token by token.
Decoder-only (GPT series, Llama, Mistral): generates tokens left-to-right. Each token attends only to previous tokens (causal masking). The standard architecture for chat, completion, and agentic tasks. The architecture that scaled.
Encoder-decoder (T5, BART, mT5): full encoder over the input, then a decoder that attends to both the encoder output and its own previously generated tokens. Best for translation, summarization, and other tasks where a distinct input is mapped to a distinct output.
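
In terms of the attention computation, the difference between bidirectional and causal attention is which entries of the N×N matrix are allowed. A minimal sketch, with the hypothetical helper names `bidirectional_mask` and `causal_mask`:

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In practice the disallowed entries are set to a large negative score before the softmax, so they receive effectively zero weight.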

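Why decoder-only won at scale: the causal language modeling pretraining objective, predict the next token, is self-supervised over any text corpus. No labels are needed, and every position in every sequence yields a training signal. The original encoder-decoder transformer was trained on paired translation data. Later encoder-decoder models such as T5 and BART are also pretrained on unlabeled text, but through denoising objectives that first construct a corrupted input and a target to recover. Next-token prediction needs no such construction and matches how the model is used at generation time. Internet-scale text is almost entirely unlabeled, so the simplest objective that consumes it directly had the advantage.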
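
To see why no labels are needed, note that raw text supplies its own targets: each prefix is an input and the token that follows is the answer. The sketch below uses a whitespace split as a stand-in tokenizer, which is an assumption for illustration only. The function name `next_token_pairs` is also made up for the example.

```python
def next_token_pairs(token_ids):
    """Turn one unlabeled sequence into (context, target) training examples."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical whitespace "tokenizer", for illustration only.
text = "attention is all you need"
tokens = text.split()
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Five tokens yield four training examples with no annotation step. With causal masking, a decoder-only model computes the loss for all of these prefixes in a single forward pass.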

Why this history matters for engineers

Understanding the architectural arc makes specific engineering decisions easier to reason about. BERT-family models are often still a strong choice for retrieval and classification, because bidirectional attention lets every token's representation draw on the full input. Decoder-only models are the right choice for generation, instruction following, and agents. Encoder-decoder models remain competitive for structured generation tasks where the input and output are clearly separated.

The scaling story is decoder-only: frontier models with published architectures, such as Llama 3, are decoder-only, and GPT-4, Claude, and Gemini are generally understood to be as well. The reason is not that the architecture is theoretically superior. It is that the pretraining objective scales with data, and data is the binding constraint. An architecture whose pretraining depends on labeled data is at a structural disadvantage against one that can train on raw internet text at a far larger scale.
