Attention From Scratch: The 30 Lines That Explain Everything
Stop reading about Q, K, V matrices and implement them. 30 lines of NumPy, no abstractions, no PyTorch. You will understand what attention is actually computing — and why the sqrt(d_k) denominator exists — better than most people who have read the original paper.
**Prerequisite: Step 1 (NLP Origins).** After this post you'll be able to explain self-attention from first principles (Q, K, V, dot products, softmax), the mechanism at the heart of every LLM. No matrix algebra beyond matrix multiplication is needed; the intuition comes first.
Every explanation of attention eventually says the same thing: Query, Key, Value matrices, scaled dot-product, softmax, weighted sum. The words are correct. They do not actually explain what is happening or why the mechanism works. The only way to get that is to implement it yourself, watch the numbers, and trace what each line is doing to the representation.
This post is 30 lines of NumPy. No PyTorch, no abstractions. Just the raw computation. If you run this in a Colab tonight and spend 20 minutes poking at the output, you will understand attention better than most people who have read the Transformer paper.
What attention is actually solving
Before RNNs were replaced, the fundamental problem with sequence models was that every token had to pass its information forward through a fixed-size hidden state. The word at position 1 in a 500-word document had to survive 499 compression steps to influence the output at position 500. It usually did not survive intact.
Attention eliminates this by letting every token look directly at every other token in the sequence — simultaneously, in parallel. No chain of handoffs. Token 1 and token 500 communicate directly. The mechanism that controls how much each token attends to each other token is what we are building below.
The implementation
import numpy as np
np.random.seed(42)
seq_len, d_model = 6, 8 # 6 tokens, 8-dim embeddings
# Simulated token embeddings — in a real model these come from an embedding table
X = np.random.randn(seq_len, d_model)
# Learned projection matrices — random here, trained in a real model
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
# Project input into Q, K, V spaces
Q = X @ W_Q # (6, 8) — what each token is searching for
K = X @ W_K # (6, 8) — what each token is advertising about itself
V = X @ W_V # (6, 8) — what each token will contribute if attended to
# Scaled dot-product: how well does each Q match each K?
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (6, 6) — raw attention logits
# Softmax: turn logits into a probability distribution over positions
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True)) # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)
weights = softmax(scores) # (6, 6) — each row sums to 1.0
# Weighted sum of values: the actual attended output
output = weights @ V # (6, 8) — new representation for each token
print("Attention weight matrix (row i = how token i distributes attention):")
print(np.round(weights, 3))
print("\nRow 0 sums to:", round(weights[0].sum(), 6))
print("Output shape:", output.shape)
What you should look at in the output
The weight matrix is the interesting part. Row 0 tells you how token 0 distributes its attention across all 6 tokens. Each row is a probability distribution — all values positive, each row sums to 1. A value of 0.6 in position (0, 3) means token 0 is taking 60% of its new representation from token 3's value vector.
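To make the "60% from token 3" reading concrete, here is a minimal sketch that rebuilds token 0's output by hand as a weighted average of the value vectors and compares it with `weights @ V`. It regenerates its own random inputs so it runs standalone; the seed and the name `manual_row0` are illustrative assumptions, not part of the listing above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # assumed seed; any value works
seq_len, d_model = 6, 8
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
weights = softmax(Q @ K.T / np.sqrt(d_model))
output = weights @ V

# Rebuild token 0's output explicitly: sum over j of weights[0, j] * V[j]
manual_row0 = np.zeros(d_model)
for j in range(seq_len):
    manual_row0 += weights[0, j] * V[j]

print("Row 0 weights:", np.round(weights[0], 3))
print("Matches weights @ V:", np.allclose(manual_row0, output[0]))
```

The matrix product is just this loop done for every row at once, which is why each output row is a blend of value vectors and never anything outside their span.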
In a trained model, these weights encode learned linguistic relationships. Token 'it' attends heavily to the noun it refers to. Token 'bank' distributes attention differently in 'river bank' vs 'savings bank'. The model learns which tokens should attend to which because gradients flow back through the attention weights into the projection matrices W_Q, W_K, and W_V during training.
With random weights (as above), the pattern has no linguistic meaning; it is just the geometry of random projections. But the mechanics are the same scaled dot-product attention that GPT-style models run in every layer, for every token.
The sqrt(d_k) denominator — why it is there
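Try removing the division by sqrt(d_k) and printing the weights again. In this toy setup (d_k = 8, projections scaled by 0.1) the change is modest, but the effect grows with dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so the logits grow like sqrt(d_k). Large logits make softmax extremely sharp, essentially a hard argmax where one weight is near 1.0 and the rest near 0. The model loses its ability to attend to multiple positions simultaneously, and the gradients through a saturated softmax become very small. Dividing by sqrt(d_k) keeps the logits in a range where softmax is still smooth and gradients flow cleanly during training.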
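The effect is easier to see at larger dimensions than the toy d_k = 8. The sketch below is an illustration under stated assumptions: it draws queries and keys directly from a unit normal (skipping the projections), and the `stats` dictionary and the chosen dimensions are hypothetical. It compares the spread of the logits and the average largest attention weight with and without scaling.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # assumed seed
seq_len = 6
stats = {}  # d_k -> (raw std, scaled std, mean peak weight unscaled, mean peak weight scaled)

for d_k in (8, 64, 512):
    # Unit-variance queries and keys stand in for projected embeddings
    Q = rng.standard_normal((seq_len, d_k))
    K = rng.standard_normal((seq_len, d_k))

    raw = Q @ K.T                 # unscaled logits
    scaled = raw / np.sqrt(d_k)   # scaled logits

    peak_raw = softmax(raw).max(axis=-1).mean()
    peak_scaled = softmax(scaled).max(axis=-1).mean()
    stats[d_k] = (raw.std(), scaled.std(), peak_raw, peak_scaled)

    print(f"d_k={d_k:4d}  logit std raw={raw.std():6.2f} scaled={scaled.std():5.2f}  "
          f"mean peak weight raw={peak_raw:.3f} scaled={peak_scaled:.3f}")
```

As d_k grows, the unscaled logits spread out and the largest weight in each row should drift toward 1, while the scaled version stays comparatively flat.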
Try this in Colab: add a causal mask before softmax. Set scores[i, j] = -1e9 for all j > i. Run softmax again. Every row now only attends to positions at or before it — this is exactly how GPT models are trained. Token i cannot see the future.
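One way to write that mask, as a minimal sketch: build a boolean upper-triangular matrix with `np.triu` and use `np.where` to overwrite the future positions. The names `future`, `masked_scores`, and `causal_weights` are illustrative, and the inputs are fresh random matrices so the block runs standalone.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # assumed seed
seq_len, d_k = 6, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)

# True strictly above the diagonal, i.e. wherever j > i (a future position)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Replace future logits with a very negative number so softmax gives them ~0 weight
masked_scores = np.where(future, -1e9, scores)
causal_weights = softmax(masked_scores)

print(np.round(causal_weights, 3))
```

The printed matrix should be lower-triangular: row 0 puts all its weight on itself, and each later row spreads weight only over itself and earlier positions.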
What multi-head attention adds
Single-head attention learns one set of Q/K/V projections. Multi-head attention runs h independent attention heads in parallel, each with different projection matrices of width d_model / h, then concatenates the outputs. (The original Transformer also applies a final output projection W_O to the concatenation; the version here leaves it out to stay minimal.) The intuition: one head might learn to attend to syntactic subjects, another to semantic similarity, another to positional proximity. Concatenating them gives the model richer composite representations than any single head could learn.
def multi_head_attention(X, num_heads=4):
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq = np.random.randn(d_model, d_head) * 0.1
        Wk = np.random.randn(d_model, d_head) * 0.1
        Wv = np.random.randn(d_model, d_head) * 0.1
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)
        w = softmax(scores)
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1) # (seq_len, d_model)
out = multi_head_attention(X, num_heads=4)
print("Multi-head output shape:", out.shape) # (6, 8) — same as input
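The loop above is the clearest way to read multi-head attention, but implementations usually compute all heads in one batched operation. The following is a sketch of that variant under stated assumptions: the function name `multi_head_attention_batched`, the `rng` argument, and the random `W_O` output projection are illustrative additions, with `W_O` following the original Transformer design rather than the loop version above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention_batched(X, num_heads=4, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # One full-width projection per role; random here, trained in a real model
    W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores)                            # softmax over key positions
    heads = weights @ V                                  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), then mix them with W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

X_demo = np.random.default_rng(1).standard_normal((6, 8))
out_demo, head_weights = multi_head_attention_batched(X_demo, num_heads=4)
print("Output shape:", out_demo.shape)
print("Per-head weight shape:", head_weights.shape)
```

Slicing one wide projection into heads is equivalent to giving each head its own narrow matrix, and returning `head_weights` lets you print each head's attention pattern separately.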
What to build next in Colab
Once this runs: add positional encoding (a sine/cosine pattern added to X before the projections; this is how the model knows token order). Then add a residual connection (add X back to the attention output). Then add a feed-forward layer after attention (two linear layers with a ReLU in between), with its own residual connection. Put layer normalization around each sub-layer and you have one Transformer block. Stack 12 of them, with the causal mask from the exercise above, and you have the depth of GPT-2 small.
The thing to notice at each step: none of these components are magical. Each one solves a specific, identifiable problem. Positional encoding gives the model order information, residual connections keep gradients from vanishing in deep stacks, and the FFN adds a per-token nonlinearity between attention steps. Understanding why each piece exists is what makes the whole thing click.
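As a starting point for those exercises, here is a hedged sketch of the pieces wired together: sinusoidal positions, single-head attention, a residual connection, and a two-layer ReLU feed-forward network. The helper names, the hidden width `d_ff = 4 * d_model`, and the zero biases are assumptions for illustration; layer normalization is deliberately omitted, so this is a simplified block, not a full Transformer layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # 0, 2, 4, ... (d_model must be even)
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(X, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied to each token independently
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)  # assumed seed
seq_len, d_model = 6, 8
d_ff = 4 * d_model              # assumed hidden width, a common convention
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model)

H = X + sinusoidal_positions(seq_len, d_model)   # inject token order
H = H + self_attention(H, W_Q, W_K, W_V)         # attention plus residual
block_out = feed_forward(H, W1, b1, W2, b2)      # FFN applied after attention

print("Block output shape:", block_out.shape)
```

Because the output has the same shape as the input, the block can be applied repeatedly, which is what stacking layers amounts to.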
Transformer architecture in Concepts →: The txarch module walks through the full Transformer block — attention, FFN, residuals, LayerNorm — with interactive visualisations.