Attention Is All You Need: The Paper That Built the AI Era
The 2017 Transformer paper that replaced RNNs, enabled parallel training, and became the foundation every LLM is built on. What it actually proposed and why it worked.
In 2017, the dominant approach to sequence modelling was recurrent neural networks — LSTMs and GRUs that processed text one token at a time. They had two fundamental problems: they couldn't be parallelised during training, and they struggled to retain information across long sequences.
A team of researchers from Google Brain, Google Research, and the University of Toronto published 'Attention Is All You Need'. It proposed replacing recurrence entirely with self-attention: an architecture that could process every token in a sequence in parallel, relate any token to any other regardless of distance, and scale to sizes RNNs never reached. Today's best-known LLMs, including GPT-4, Claude, Gemini, and Llama, are all Transformer-based.
[Video: Andrej Karpathy — Let's build GPT from scratch (implementing the Transformer paper step by step)]
The core problem: sequence modelling before Transformers
RNNs process sequences step by step. To represent token 512, the network has to pass information through tokens 1 to 511, and each step can corrupt or lose earlier information. LSTMs improved this with gating, but the hidden state at each step still depends on the previous one, so training cannot be parallelised across the positions of a sequence.
The bottleneck wasn't parameter count or data — it was architecture. Sequential computation meant you couldn't use GPU parallelism effectively. The Transformer solved this by making every token attend to every other token simultaneously.
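To make the contrast concrete, here is a minimal NumPy sketch. The sizes, the random weights, and the names rnn_forward and pairwise_scores are illustrative assumptions, not anything taken from the paper. The recurrent loop needs each hidden state before it can compute the next, while the pairwise scores come out of a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4                     # toy sizes, chosen only for illustration
x = rng.normal(size=(seq_len, d))     # one embedding vector per token

# Hypothetical recurrent weights (random, untrained).
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def rnn_forward(x):
    """Simple tanh RNN: h[t] depends on h[t-1], so steps run one after another."""
    h = np.zeros(d)
    states = []
    for x_t in x:                     # seq_len dependent steps
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def pairwise_scores(x):
    """All token-to-token scores from one matrix product, no step-to-step dependency."""
    return x @ x.T / np.sqrt(d)       # shape (seq_len, seq_len)

rnn_states = rnn_forward(x)
scores = pairwise_scores(x)
```

The loop body is cheap, but it cannot start step t until step t-1 has finished. The matrix product has no such ordering constraint, which is what lets a GPU compute every position at once.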
What self-attention actually computes
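For each token, the mechanism computes three vectors through learned linear projections: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what should I pass forward?). Attention weights = softmax(Q·Kᵀ / √d_k), where d_k is the dimension of the Key vectors; dividing by √d_k keeps large dot products from saturating the softmax. Output = weights · V, a weighted sum of Values across all positions.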
The result: in a trained model, 'bank' in 'river bank' can attend heavily to 'river', while 'bank' in 'bank account' can attend heavily to 'account'. Same mechanism, same weights, and context-dependent meaning without any explicit disambiguation rule.
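The computation described above can be sketched in a few lines of NumPy. This is a single-head toy version under stated assumptions: the projection matrices W_q, W_k, and W_v are random rather than learned, and the sizes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4               # toy sizes (assumed)
X = rng.normal(size=(seq_len, d_model))       # token embeddings

# Random stand-ins for the learned Query, Key, and Value projections.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row i of weights says how much token i draws from every other token. In a trained model, those rows are where patterns like the 'bank' example would show up.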
Multi-head attention
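The paper runs attention several times in parallel (eight heads in the base model), each head with its own learned projection matrices. Heads are free to specialise without explicit supervision. In practice different heads often pick up different patterns, such as syntactic relationships, semantic similarity, or coreference, though the division is rarely that clean. The head outputs are concatenated and projected back to the model dimension.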
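A rough sketch of the split, attend, concatenate, project sequence follows. It assumes d_model is divisible by the number of heads and uses random, untrained weights; the function and variable names are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads             # assumes d_model % num_heads == 0

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)

    # Concatenate the heads along the feature axis, then project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4        # toy sizes (assumed)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each slice head_weights[h] is a separate attention pattern over the same tokens, which is what gives heads room to specialise.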
Positional encoding
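Self-attention on its own has no notion of order. It is permutation-equivariant: shuffling the input tokens just shuffles the outputs the same way, so without extra information the model cannot tell 'dog bites man' from 'man bites dog'. The paper added sinusoidal positional encodings to each token embedding. Many modern LLMs instead use RoPE (Rotary Position Embedding) or ALiBi (Attention with Linear Biases), which encode relative position inside the attention computation and tend to generalise better to lengths beyond training.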
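The sinusoidal scheme from the paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch, assuming an even d_model and with a function name chosen for illustration:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) table of sinusoidal encodings; d_model must be even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# The table is added to the token embeddings before the first layer:
# x = token_embeddings + pe[:token_embeddings.shape[0]]
```

Each position gets a unique pattern of values in [-1, 1], and low dimensions oscillate quickly while high dimensions change slowly, a little like the digits of a counter.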
Architecture variants
| Architecture | Examples | Best For |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, embeddings, NLU |
| Decoder-only | GPT-4, Claude, Llama | Text generation, reasoning, chat |
| Encoder-decoder | T5, BART, original Transformer | Translation, summarisation, seq2seq |
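One concrete difference between these rows is the attention mask. Encoder-only models let every token attend in both directions, while decoder-only models apply a causal mask so a token can only attend to itself and earlier positions. The sketch below illustrates this with random scores; the function name and sizes are assumptions for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw scores into attention weights, optionally hiding future positions."""
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal: position i must not see positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))              # toy scores for a 4-token sequence

bidirectional = attention_weights(scores, causal=False)   # encoder-style
causal = attention_weights(scores, causal=True)           # decoder-style
```

With the causal mask, the first token can only attend to itself, and everything above the diagonal is exactly zero. That is what allows a decoder to be trained on next-token prediction over a whole sequence in one pass.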
The Transformer's most important property isn't self-attention — it's that self-attention is parallelisable. This turned LLM training into a matrix multiplication problem that GPU clusters could exploit fully. Every modernisation since (RoPE, FlashAttention, SwiGLU) is an engineering improvement on the same mathematical foundation from this 2017 paper.