Word2Vec From Scratch: How Words Became Vectors
The distributional hypothesis, skip-gram with negative sampling, and a full NumPy implementation. Build the vectors, run the analogies, understand why king - man + woman ≈ queen. The ancestor of every modern embedding model.
In 2013, Mikolov et al. published two papers that changed NLP. The core idea fits in one sentence: words that appear in similar contexts have similar meanings — so train a neural network to predict context from word (or word from context), and the hidden layer weights become the word vectors. The training objective is a proxy. The embeddings are the prize.
This post implements skip-gram with negative sampling from scratch in NumPy. You will end up with word vectors where king - man + woman ≈ queen, where similar words cluster in space, and where you can do arithmetic on meaning. Understanding the mechanics explains why modern embedding models work the way they do.
The distributional hypothesis
'You shall know a word by the company it keeps.' — J.R. Firth, 1957. The intuition: 'cat' and 'dog' appear in similar sentences — 'my ___ ate the food', 'the ___ ran outside', 'I pet my ___'. If you can find a vector space where words with similar contexts land nearby, you have captured semantic similarity without any explicit labelling.
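The intuition can be checked by hand before any training. The sketch below counts which words appear near 'cat', 'dog', and 'bank' in a handful of made-up sentences and compares the resulting context sets. The sentences, the `context_counts` and `overlap` helpers, and the window size are all illustrative assumptions, not part of word2vec itself.

```python
from collections import Counter, defaultdict

# Toy sentences echoing the templates in the text (hypothetical data).
sentences = [
    "my cat ate the food",
    "my dog ate the food",
    "the cat ran outside",
    "the dog ran outside",
    "i pet my cat",
    "i pet my dog",
    "the bank approved the loan",
]

def context_counts(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][words[j]] += 1
    return contexts

def overlap(c1, c2):
    """Jaccard overlap between two sets of context words."""
    s1, s2 = set(c1), set(c2)
    return len(s1 & s2) / len(s1 | s2)

ctx = context_counts(sentences)
cat_dog = overlap(ctx["cat"], ctx["dog"])
cat_bank = overlap(ctx["cat"], ctx["bank"])
print("cat/dog :", cat_dog)
print("cat/bank:", cat_bank)
```

In this toy data 'cat' and 'dog' share every context word, while 'cat' and 'bank' share only 'the'. Word2vec learns a dense, low-dimensional version of the same signal.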
Word2Vec operationalises this. The skip-gram model takes a word (the 'center word') and tries to predict its surrounding words (the 'context words'). During training, the network adjusts the vectors for the center word and context words so the prediction improves. Words that appear in the same contexts end up with similar vectors because each of them is pulled toward the same set of context vectors.
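As a rough illustration of what "predict its surrounding words" means in practice, the sketch below turns one sentence into (center, context) training pairs using a symmetric window. The helper name `skipgram_pairs` is an assumption made for this example.

```python
def skipgram_pairs(words, window=2):
    """Return every (center, context) pair within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs

pairs = skipgram_pairs("the king rules the kingdom".split(), window=2)
for center, context in pairs:
    print(center, "->", context)
```

Each pair becomes one positive training example. Words near the edges of a sentence simply produce fewer pairs.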
Architecture: skip-gram with negative sampling
Two embedding matrices: W_in (center word embeddings, shape vocab×d) and W_out (context word embeddings, shape vocab×d). For each (center, context) training pair: take the center word's row from W_in, take the context word's row from W_out, compute their dot product, push it through sigmoid to get a probability, and maximise it. That is the positive signal.
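A minimal sketch of the positive signal, assuming a toy vocabulary of 10 words, 4 dimensions, and arbitrary indices 3 and 7 for the center and context words. The learning rate and random initialisation are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4  # toy vocabulary size and embedding dimension (assumed)
W_in = rng.normal(scale=0.1, size=(V, d))   # center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(W_in, W_out, center_idx, context_idx):
    """sigmoid(center row of W_in · context row of W_out)."""
    return sigmoid(W_in[center_idx] @ W_out[context_idx])

center_idx, context_idx = 3, 7  # hypothetical word indices
p = pair_probability(W_in, W_out, center_idx, context_idx)

# One gradient-ascent step on log(p): both rows move toward each other.
lr = 0.5
err = 1.0 - p  # label is 1 for a real pair
v_c = W_in[center_idx].copy()
u_o = W_out[context_idx].copy()
W_out[context_idx] += lr * err * v_c
W_in[center_idx] += lr * err * u_o

p_after = pair_probability(W_in, W_out, center_idx, context_idx)
print("before:", p, "after:", p_after)
```

With small random initial vectors the probability starts near 0.5, and a single step nudges it upward.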
The full softmax over the vocabulary, P(context|center) = exp(center·context) / Σ_w exp(center·w), where the sum runs over every word w in the vocabulary, is too expensive with a 100k-word vocabulary. Negative sampling replaces it with a cheaper binary objective: for each real (center, context) pair, sample k random 'noise' words and minimise their probability. With k=5, each update touches 6 output vectors instead of 100k. The objective becomes: maximise sigmoid(center·context) for the real pair while minimising sigmoid(center·noise) for each sampled noise word.
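The negative-sampling objective for a single pair can be written out and checked numerically. The sketch below, with randomly generated vectors standing in for real embeddings, computes the loss, its analytic gradient with respect to the center vector, and a finite-difference estimate of the same gradient. The function names are assumptions for this example. The analytic gradient is a sum of (score - label) times each output vector, which is why the `err = (label - score)` term appears in the update rule of the implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_pos, U_neg):
    """-log sigmoid(u_pos·v_c) - sum_k log sigmoid(-u_k·v_c)."""
    pos = np.log(sigmoid(u_pos @ v_c))
    neg = np.log(sigmoid(-(U_neg @ v_c))).sum()
    return -(pos + neg)

def grad_center(v_c, u_pos, U_neg):
    """Analytic gradient of the loss with respect to the center vector."""
    g = (sigmoid(u_pos @ v_c) - 1.0) * u_pos                  # positive pair, label 1
    g += (sigmoid(U_neg @ v_c)[:, None] * U_neg).sum(axis=0)  # negatives, label 0
    return g

rng = np.random.default_rng(0)
d, k = 8, 5  # embedding dimension and number of negatives (k=5 as in the text)
v_c = rng.normal(size=d)
u_pos = rng.normal(size=d)
U_neg = rng.normal(size=(k, d))

# Central finite differences along each coordinate axis.
eps = 1e-6
numeric = np.array([
    (neg_sampling_loss(v_c + eps * e, u_pos, U_neg)
     - neg_sampling_loss(v_c - eps * e, u_pos, U_neg)) / (2 * eps)
    for e in np.eye(d)
])
analytic = grad_center(v_c, u_pos, U_neg)
print("max abs difference:", np.max(np.abs(numeric - analytic)))
```

Only k + 1 rows of the output matrix enter the computation, whereas the softmax denominator would need all of them.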
import numpy as np
from collections import Counter
import random

class Word2Vec:
    def __init__(self, vocab, d=50, lr=0.025, n_neg=5):
        self.vocab = vocab
        self.w2i = {w: i for i, w in enumerate(vocab)}
        V = len(vocab)
        # Two matrices: center-word embeddings and context-word embeddings
        self.W_in = (np.random.rand(V, d) - 0.5) / d  # center
        self.W_out = np.zeros((V, d))  # context
        self.lr = lr
        self.n_neg = n_neg
        # Negative-sampling distribution; uniform until set_freq is called
        self.freq = np.ones(V) / V

    def set_freq(self, word_counts):
        counts = np.array([word_counts.get(w, 1) for w in self.vocab], dtype=float)
        # Raise to the 3/4 power: common words are sampled less often than
        # their raw frequency, as in the original paper
        self.freq = counts ** 0.75
        self.freq /= self.freq.sum()

    def _sigmoid(self, x):
        # Clip to avoid overflow in exp for large-magnitude scores
        return 1 / (1 + np.exp(-np.clip(x, -10, 10)))

    def train_pair(self, center_idx, context_idx):
        # Sample negative examples, then drop any draw that hits the real context word
        neg_indices = np.random.choice(
            len(self.vocab), size=self.n_neg, p=self.freq, replace=True
        )
        neg_indices = neg_indices[neg_indices != context_idx]
        # Center embedding
        h = self.W_in[center_idx]  # shape (d,)
        # Gradient accumulator for center word
        dh = np.zeros_like(h)
        # Positive example has label 1 (want sigmoid(h · ctx) → 1); negatives have label 0
        for idx, label in [(context_idx, 1)] + [(n, 0) for n in neg_indices]:
            score = self._sigmoid(h @ self.W_out[idx])
            err = label - score
            # Accumulate the center update first, using the output vector before it changes
            dh += self.lr * err * self.W_out[idx]
            # Update context (or negative) embedding
            self.W_out[idx] += self.lr * err * h
        # Update center embedding
        self.W_in[center_idx] += dh

    def similarity(self, w1, w2):
        # Cosine similarity between two center-word vectors
        v1 = self.W_in[self.w2i[w1]]
        v2 = self.W_in[self.w2i[w2]]
        return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9))

    def most_similar(self, word, k=5):
        qi = self.w2i[word]
        v = self.W_in[qi]
        norms = np.linalg.norm(self.W_in, axis=1, keepdims=True) + 1e-9
        sims = (self.W_in / norms) @ (v / (np.linalg.norm(v) + 1e-9))
        sims[qi] = -np.inf  # exclude the query word itself
        top = np.argsort(sims)[::-1][:k]
        return [(self.vocab[i], round(float(sims[i]), 3)) for i in top]
# ── Tiny demo ────────────────────────────────────────────────────────────────
sentences = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man wore a crown",
    "the woman wore a crown",
    "the king and the queen married",
    "the dog chased the cat",
    "the cat ran from the dog",
    "the dog barked at the cat",
    "the puppy chased the kitten",
] * 50  # repeat to get enough signal

# Build vocab
all_words = [w for s in sentences for w in s.split()]
counts = Counter(all_words)
vocab = list(counts.keys())
w2v = Word2Vec(vocab, d=30, lr=0.05, n_neg=5)
w2v.set_freq(counts)

# Training: sliding window, window_size=2
window = 2
for epoch in range(5):
    random.shuffle(sentences)
    for sentence in sentences:
        words = sentence.split()
        for i, center in enumerate(words):
            ci = w2v.w2i[center]
            ctx_range = range(max(0, i-window), min(len(words), i+window+1))
            for j in ctx_range:
                if j != i:
                    w2v.train_pair(ci, w2v.w2i[words[j]])

# Results
print("Most similar to 'king':", w2v.most_similar("king"))
print("Most similar to 'dog':", w2v.most_similar("dog"))
print("Similarity (king, queen):", w2v.similarity("king", "queen"))
print("Similarity (king, dog):", w2v.similarity("king", "dog"))
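The 3/4 power used in `set_freq` is easier to appreciate with numbers. The sketch below compares the raw unigram distribution with the smoothed one for three hypothetical word counts spanning two orders of magnitude.

```python
import numpy as np

counts = np.array([1000.0, 100.0, 10.0])  # hypothetical counts: common, medium, rare

unigram = counts / counts.sum()

smoothed = counts ** 0.75
smoothed /= smoothed.sum()

for name, u, s in zip(["common", "medium", "rare"], unigram, smoothed):
    print(f"{name:7s} raw={u:.4f} smoothed={s:.4f}")
```

The rare word's chance of being drawn as a negative rises from roughly 0.9% to roughly 2.6%, while the common word's share drops. The ordering is preserved, only the gap shrinks.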
What you should see
With 50 repetitions of the tiny corpus, the vectors are noisy but directionally correct, and exact numbers vary between runs because nothing is seeded. 'King' and 'queen' should have higher cosine similarity than 'king' and 'dog'. 'Dog' and 'cat' should be close. With a real corpus (Wikipedia, Common Crawl), the geometry becomes far more reliable: king - man + woman lands at or within a few nearest neighbours of queen, once the three query words themselves are excluded from the search. Relationships like country-capital, comparative-superlative, and verb-past-tense show up as roughly constant vector offsets in this space.
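A minimal sketch of the analogy query, assuming embeddings are available as a dict from word to NumPy vector. The 2-D vectors here are hand-built so the answer is known in advance (axis 0 loosely stands for royalty, axis 1 for gender). They are not trained embeddings, and the `analogy` helper is a name chosen for this example.

```python
import numpy as np

def analogy(vectors, a, b, c, k=3):
    """Words closest (by cosine) to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    words = list(vectors)
    idx = {w: i for i, w in enumerate(words)}
    M = np.stack([vectors[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit-length rows
    target = M[idx[b]] - M[idx[a]] + M[idx[c]]
    target = target / np.linalg.norm(target)
    sims = M @ target
    for w in (a, b, c):
        sims[idx[w]] = -np.inf  # query words would otherwise dominate the result
    top = np.argsort(sims)[::-1][:k]
    return [(words[i], float(sims[i])) for i in top]

# Hand-built toy vectors (hypothetical, for illustration only).
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "dog":   np.array([-1.0, 0.1]),
}

result = analogy(toy, "man", "king", "woman", k=1)
print(result)
```

Excluding the query words matters: without it, the nearest neighbour of king - man + woman is often king itself.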
The magic is not in the architecture — it is in the training objective. By predicting context, the network is forced to compress the co-occurrence structure of the language into a fixed-dimensional vector. That structure is meaning.
From word2vec to modern embeddings
Word2vec produces static embeddings: 'bank' always has the same vector regardless of whether you mean the river bank or the financial institution. GloVe (2014) improved on word2vec by optimising directly on the word-word co-occurrence matrix rather than predicting context window by window. Both are context-free.
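For contrast with the window-by-window updates above, here is a rough sketch of a GloVe-style objective: a weighted least-squares fit of dot products plus biases to the log of co-occurrence counts. The weighting function and its defaults (x_max = 100, alpha = 0.75) follow the GloVe paper as commonly described, but treat them, the function name, and the toy matrix as assumptions of this example.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Sum over nonzero X[i, j] of f(X_ij) * (w_i·w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)          # only observed co-occurrences contribute
    x = X[i, j]
    weight = np.minimum((x / x_max) ** alpha, 1.0)
    pred = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j]
    return float((weight * (pred - np.log(x)) ** 2).sum())

# Toy symmetric co-occurrence matrix for a 2-word vocabulary (hypothetical counts).
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
d = 3
W = np.zeros((2, d))
W_ctx = np.zeros((2, d))

loss_untrained = glove_loss(W, W_ctx, np.zeros(2), np.zeros(2), X)

# Biases alone can fit this tiny matrix exactly: b_i + b~_j = log 2.
loss_fitted = glove_loss(W, W_ctx, np.full(2, np.log(2.0)), np.zeros(2), X)

print(loss_untrained, loss_fitted)
```

The whole corpus is summarised once into X, and training then iterates over matrix entries rather than over text windows.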
ELMo (2018) changed this: it ran a bidirectional LSTM over the whole sentence and produced a different embedding for each word based on its context. 'Bank' in 'river bank' and 'bank account' got different vectors. This contextual embedding idea is what BERT and GPT took and scaled up: the transformer replaced the LSTM, and pre-training on massive corpora gave the embeddings the depth that made them useful across a wide range of downstream tasks. Word2vec is the ancestor. Understanding it makes the whole arc from word vectors to GPT legible.
Run this with a real corpus: download the text8 dataset (Wikipedia subset, ~100MB) and train for 5 epochs with d=100. Expect roughly 10 minutes on CPU for an optimised implementation; the per-pair Python loop above is likely to be far slower. Then query most_similar('paris') and verify that 'france', 'london', 'berlin' appear. Verify that the analogy man:woman :: king:queen holds by computing king_vec - man_vec + woman_vec and finding the nearest word other than the three inputs.
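A possible starting point for the corpus preparation, assuming text8 has been downloaded and unpacked to a local file of whitespace-separated lowercase words. The helper names, the `"text8"` path, and the `min_count` cutoff are assumptions of this sketch. The short token list at the bottom only demonstrates the helpers without needing the download.

```python
from collections import Counter
from pathlib import Path

def load_tokens(path):
    """Read a whitespace-tokenised corpus file into a list of words."""
    return Path(path).read_text(encoding="utf-8").split()

def build_vocab(tokens, min_count=5):
    """Keep words seen at least `min_count` times, most frequent first."""
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    vocab = sorted(kept, key=lambda w: (-kept[w], w))
    return vocab, kept

def encode(tokens, vocab):
    """Map tokens to integer ids, dropping out-of-vocabulary words."""
    w2i = {w: i for i, w in enumerate(vocab)}
    return [w2i[t] for t in tokens if t in w2i]

# With the real file (hypothetical local path):
#   tokens = load_tokens("text8")
#   vocab, counts = build_vocab(tokens, min_count=5)

# Small stand-in corpus so the helpers can be tried without the download.
tokens = "the cat sat on the mat the end".split()
vocab, counts = build_vocab(tokens, min_count=2)
ids = encode(tokens, vocab)
print(vocab, ids)
```

The returned `vocab` and `counts` have the shapes that `Word2Vec(vocab, ...)` and `set_freq(counts)` expect. Dropping rare words keeps the embedding matrices and the negative-sampling distribution to a manageable size.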
- Efficient Estimation of Word Representations in Vector Space — Mikolov et al. (2013)
- GloVe: Global Vectors for Word Representation — Pennington et al. (2014)