
Attention Is All You Need: The Paper That Built the AI Era

The 2017 Transformer paper that replaced RNNs, enabled parallel training, and became the foundation every LLM is built on. What it actually proposed and why it worked.

In 2017, the dominant approach to sequence modelling was recurrent neural networks: LSTMs and GRUs that processed text one token at a time. They had two fundamental problems: the computation for a sequence couldn't be parallelised across time steps during training, and they struggled to retain information across long sequences.

A team of researchers, most of them at Google Brain and Google Research, published 'Attention Is All You Need'. It proposed replacing recurrence entirely with self-attention: an architecture that could process every token simultaneously, relate any token to any other regardless of distance, and scale to sizes RNNs never could. Nearly every major LLM today, including GPT-4, Claude, Gemini, and Llama, is built on the Transformer.

[Video: Andrej Karpathy — Let's build GPT from scratch (implementing the Transformer paper step by step)]

The core problem: sequence modelling before Transformers

RNNs process sequences step by step. To understand token 512, the network has to pass information through tokens 1 through 511, and each step can corrupt or lose earlier information. LSTMs improved this with gating, but every step still depended on the previous hidden state, so the work for a single sequence couldn't be spread across time steps in parallel.

The bottleneck wasn't parameter count or data — it was architecture. Sequential computation meant you couldn't use GPU parallelism effectively. The Transformer solved this by making every token attend to every other token simultaneously.
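
The contrast can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's code: the sizes, the random weights, and names such as W_x, W_h, and rnn_states are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))          # one embedding vector per token

# Recurrent update: step t needs the hidden state from step t-1,
# so the loop over positions has to run in order.
W_x = rng.normal(size=(d, d)) * 0.1        # hypothetical input weights
W_h = rng.normal(size=(d, d)) * 0.1        # hypothetical recurrent weights
h = np.zeros(d)
states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    states.append(h)
rnn_states = np.stack(states)              # shape (seq_len, d)

# Attention-style comparison: every pair of positions is scored in a
# single matrix product, and no row depends on any other row.
scores = x @ x.T                           # shape (seq_len, seq_len)
```

The loop has a chain of seq_len dependent steps, while the matrix product has none, which is the property GPUs can exploit.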

What self-attention actually computes

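For each token, the mechanism computes three vectors through learned linear projections of the token's embedding: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what should I pass forward?). Attention weights = softmax(Q·Kᵀ / √d_k), where d_k is the dimension of the key vectors; dividing by √d_k keeps the dot products from growing so large that the softmax saturates. Output = attention weights · V, a weighted sum of Values across all positions.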
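
A minimal NumPy sketch of this computation, assuming a single sequence without batching or masking. Q is multiplied by the transpose of K so that every query is compared with every key. The function names, the sizes, and the random projection matrices W_q, W_k, and W_v are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum first for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights            # weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))    # toy token embeddings
# Hypothetical projections; in a trained model these are learned.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row i of weights says how much token i draws from every position, and row i of output is the corresponding blend of Value vectors.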

The result: 'bank' in 'river bank' can attend heavily to 'river', while 'bank' in 'bank account' can attend heavily to 'account'. Same mechanism, same weights: context-dependent meaning without any explicit disambiguation rule.

Multi-head attention

The paper ran attention multiple times in parallel, each time with different learned weight matrices (eight heads in the base model). Heads can specialise without explicit supervision: for example, one may track syntactic relationships, another semantic similarity, another coreference. Outputs are concatenated and projected back to the model dimension.
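
The split, attend, concatenate, project sequence can be sketched as follows. This is a simplified single-sequence version under stated assumptions: the per-head projections are packed into full-width matrices and sliced by reshaping, a common implementation choice, and the names (multi_head_attention, W_o) and sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads          # assumes d_model divides evenly

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)
    # Concatenate the heads, then project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
# Hypothetical weights; a trained model would learn these.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head produces its own seq_len by seq_len weight matrix, so different heads are free to focus on different relationships between the same tokens.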

Positional encoding

Self-attention on its own is permutation-equivariant: reordering the input tokens simply reorders the outputs in the same way, so the mechanism has no built-in notion of word order. The paper added sinusoidal positional encodings to each token embedding. Many modern LLMs instead use RoPE (Rotary Position Embedding) or ALiBi (Attention with Linear Biases), which aim to generalise better to lengths beyond those seen in training.
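
The paper's sinusoidal scheme uses sin(pos / 10000^(2i/d_model)) for even dimensions and the matching cosine for odd dimensions. A minimal sketch, assuming an even d_model; the function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up.
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)

# The encoding is added to the token embeddings before the first layer.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))                # toy embeddings
inputs = embeddings + pe
```

Each position gets a distinct pattern of values in [-1, 1], which gives the otherwise order-blind attention layers a signal about where each token sits.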

Architecture variants

Architecture	Examples	Best For
Encoder-only	BERT, RoBERTa	Classification, embeddings, NLU
Decoder-only	GPT-4, Claude, Llama	Text generation, reasoning, chat
Encoder-decoder	T5, BART, original Transformer	Translation, summarisation, seq2seq
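
One concrete difference between the first two rows is the attention mask: encoder-only models such as BERT let every token see the whole sequence, while decoder-only models apply a causal mask so each token sees only itself and earlier positions. A minimal sketch of that difference, with illustrative names and random inputs (encoder-decoder models additionally use cross-attention, which is not shown):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K, causal):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Mask positions j > i so token i cannot attend to later tokens.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

bidirectional = attention_weights(Q, K, causal=False)  # encoder-style
causal = attention_weights(Q, K, causal=True)          # decoder-style
```

In the causal case the weight matrix is lower-triangular, which is what lets a decoder-only model be trained to predict each next token without seeing it.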

The Transformer's most important property isn't self-attention — it's that self-attention is parallelisable. This turned LLM training into a matrix multiplication problem that GPU clusters could exploit fully. Every modernisation since (RoPE, FlashAttention, SwiGLU) is an engineering improvement on the same mathematical foundation from this 2017 paper.
