Why Q and K Had to Exist: The Naming Decision at the Core of Self-Attention
Q and K are separate learned projections of the same embedding because relevance is asymmetric: what is searching differs from what is available to be found. Scoring raw embeddings against each other yields symmetric scores in which each token's match with itself tends to dominate.
A team is debugging why the token "agreed" attends heavily to "surgeon" six positions away, while nearly ignoring "the" sitting one position to its left. Attention weights are printed. The numbers are real. The question is why the mechanism works this way at all — why the thing that measures relevance needs two separate components instead of one.
The naive design would be to score every token against every other token by computing the dot product of their raw embeddings. Token A's vector dotted with token B's vector gives a scalar — do that for all pairs, normalize, and you have attention weights. It is simple. It is also broken from the first forward pass.
The problem is that an embedding vector is optimized to encode what a token means, not to encode two different things simultaneously: what this token is looking for, and what this token has to offer. These are not the same question. "surgeon" encodes a semantic cluster around medical expertise and action. But whether "surgeon" is a good match for what "agreed" is searching for depends on the syntactic and semantic context of the verb — a dimension that the raw embedding of "agreed" was never trained to express directly.
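Worse: raw dot-product scores are symmetric, so "agreed" would score "surgeon" exactly as "surgeon" scores "agreed", even though what a verb needs from its subject is not what a subject needs from its verb. And if you score a token against itself using its raw embedding, the result is the vector's squared magnitude. By the Cauchy-Schwarz inequality, no other token with an embedding of equal or smaller norm can score higher, so when embedding norms are comparable, every token becomes its own best match by default. Attention concentrates on the diagonal: each position attends mostly to itself and gives little weight to everything else.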
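The failure of the naive design can be checked numerically. The sketch below is only an illustration: it uses random Gaussian vectors as stand-ins for real embeddings (an assumption, since trained embeddings are not random), and the sentence length of 6 is arbitrary. It scores every vector against every other with a raw dot product, then checks two things: that the score table is symmetric, and that each row's largest score is the token's score against itself.

```python
import random

random.seed(0)
d_model = 512   # embedding width, matching the example in this article
n_tokens = 6    # arbitrary sentence length for the demo

# Stand-in embeddings: random Gaussian vectors, NOT real trained embeddings.
X = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Naive scoring: raw embedding against raw embedding, for all pairs.
S = [[dot(X[i], X[j]) for j in range(n_tokens)] for i in range(n_tokens)]

# Symmetry: score(i, j) equals score(j, i), so "who is searching" is lost.
symmetric = all(
    abs(S[i][j] - S[j][i]) < 1e-9
    for i in range(n_tokens)
    for j in range(n_tokens)
)

# Self-dominance: in each row, the largest score is on the diagonal.
self_wins = all(
    max(range(n_tokens), key=lambda j: S[i][j]) == i
    for i in range(n_tokens)
)

print("scores are symmetric:", symmetric)
print("every token's top match is itself:", self_wins)
```

With vectors of comparable norm, the diagonal entries sit near the squared norm while the off-diagonal entries are much smaller, which is the self-match problem in miniature.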
The constraint this creates is exact: you need two different linear projections of each embedding, one that encodes what this token is searching for and one that encodes what this token offers to others that are searching. The first is produced by a learned weight matrix W_Q, the second by a learned weight matrix W_K. Both are applied to the same input embedding. The result is a Query vector and a Key vector for each token, living in a lower-dimensional projection space where dot products measure relevance along the axes that actually matter for the task. Because W_Q and W_K are different matrices, the score of A's query against B's key no longer has to equal the score of B's query against A's key.
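As a rough sketch of the two projections, the snippet below builds Query and Key vectors from the same inputs and compares a pair of scores in both directions. Everything numeric here is an assumption for illustration: the embeddings and the matrices W_Q and W_K are random rather than trained, and the helpers rand_matrix and matmul are hypothetical names written for this example.

```python
import math
import random

random.seed(0)
d_model, d_k, n_tokens = 512, 64, 4   # sizes from the article; 4 tokens is arbitrary

def rand_matrix(rows, cols, scale):
    """Random Gaussian matrix, used as a stand-in for learned values."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    """Plain-Python matrix product A @ B."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]

X = rand_matrix(n_tokens, d_model, 1.0)                    # token embeddings
W_Q = rand_matrix(d_model, d_k, 1.0 / math.sqrt(d_model))  # untrained stand-in
W_K = rand_matrix(d_model, d_k, 1.0 / math.sqrt(d_model))  # untrained stand-in

Q = matmul(X, W_Q)   # one 64-dim query per token: "what am I looking for?"
K = matmul(X, W_K)   # one 64-dim key per token:   "what do I offer?"

# scores[i][j]: how well token j's key matches token i's query.
scores = [
    [sum(q * k for q, k in zip(Q[i], K[j])) for j in range(n_tokens)]
    for i in range(n_tokens)
]

print("Q shape:", len(Q), "x", len(Q[0]))
print("score(0 -> 1):", scores[0][1])
print("score(1 -> 0):", scores[1][0])
```

The two printed scores differ, which is the point: once queries and keys come from different matrices, the score table is free to be asymmetric. What the asymmetry encodes is decided by training, not by this random initialization.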
Token embeddings (d_model = 512):
    "agreed"  → x_a   [512 floats: encodes verb semantics]
    "surgeon" → x_s   [512 floats: encodes profession semantics]

Project through learned matrices (d_k = 64):
    Q_agreed  = x_a · W_Q   [64 floats: "what am I looking for?"]
    K_surgeon = x_s · W_K   [64 floats: "what do I offer?"]

Score:
    score = Q_agreed · K_surgeon / sqrt(64)
          = 3.71   ← high: "agreed" is searching for an agent noun
    score = Q_agreed · K_the / sqrt(64)
          = 0.12   ← low: "the" offers nothing on that axis

Softmax over all scores → attention weights → weighted sum of V vectors
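The last step, from scores to an output vector, can be sketched for the single query "agreed". This is a toy under stated assumptions: the sentence is cut down to three tokens, the scaled scores 3.71 and 0.12 are taken from the example above, the self-score of 1.0 is a made-up placeholder, and the 2-dimensional value vectors in V are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "surgeon", "agreed"]

# Already-scaled scores for the query "agreed".
# 0.12 and 3.71 come from the worked example; 1.0 is an invented placeholder.
scaled_scores = [0.12, 3.71, 1.0]

# Hypothetical value vectors, one per token (illustrative only).
V = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

weights = softmax(scaled_scores)

# Output for "agreed": the attention-weighted sum of the value vectors.
output = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]

for tok, w in zip(tokens, weights):
    print(f"{tok:>8}: {w:.3f}")
print("output:", [round(o, 3) for o in output])
```

Because the softmax exponentiates the scores, the gap between 3.71 and 0.12 becomes a much larger gap in weight, so the output for "agreed" is pulled almost entirely toward the value vector of "surgeon".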
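The division by sqrt(d_k), 8.0 in this case, is not cosmetic. If the components of the Query and Key vectors are roughly independent with zero mean and unit variance, their dot product has variance d_k, so its typical magnitude grows like sqrt(d_k). Without the scaling, a large d_k pushes the softmax into saturation, where one weight is close to 1 and gradients vanish. Dividing by sqrt(d_k) brings the scores back to roughly unit variance and keeps the distribution in a learnable regime.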
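The growth of the dot product with d_k is easy to measure empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of any trained model; dot_std is a hypothetical helper written for this example. It estimates the standard deviation of the unscaled dot product for several values of d_k and shows what remains after dividing by sqrt(d_k).

```python
import math
import random
import statistics

random.seed(0)

def dot_std(d_k, n_samples=2000):
    """Estimate the standard deviation of q . k for random unit-variance q, k."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

results = {}
for d_k in (16, 64, 256):
    raw = dot_std(d_k)
    results[d_k] = raw
    scaled = raw / math.sqrt(d_k)   # spread of the scores after the 1/sqrt(d_k) factor
    print(f"d_k={d_k:>3}  unscaled std={raw:6.2f}  scaled std={scaled:.2f}")
```

The unscaled spread should track sqrt(d_k), roughly 4, 8, and 16 for these three sizes, while the scaled spread stays near 1 regardless of d_k. That stability is what keeps the softmax input in a range where no single weight saturates by default.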
This is why Q and K exist as separate projections rather than as a single vector. Relevance has two asymmetric sides, what is searching and what is available to be found, and a single vector scored against itself and its neighbors cannot encode both: the scores come out symmetric, and each token's score against itself tends to dominate its row.
Q and K are two different learned views of the same token embedding: one asks 'what am I looking for?' and the other asks 'what do I have?' — separating them is what makes attention measure relevance instead of just self-similarity.