
Positional Encodings: Sinusoidal vs RoPE vs ALiBi — What Changed and Why

Sinusoidal adds position vectors. RoPE rotates Q and K to encode relative positions. ALiBi adds a distance penalty to attention scores. Implemented from scratch. Why RoPE won, how YaRN enables context extension, and which encoding to use for your architecture.

Self-attention has no inherent notion of position. The sentences 'the dog bit the man' and 'the man bit the dog' produce identical sets of token embeddings: every word embedding is the same regardless of where the word appears. Without positional encoding, the model cannot distinguish these sentences. Every modern transformer uses some form of positional information. How that information is injected has changed substantially since the original transformer paper, and the choice has real production implications for context length.
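The claim can be checked numerically. The sketch below is a minimal single-head, unmasked self-attention in NumPy with random weights (the function name `self_attention` and all shapes are illustrative assumptions, not taken from any library). It shows that reordering the input tokens only reorders the output rows, so nothing in the computation itself depends on where a token sits.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                 # 5 token embeddings, no position added
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([4, 3, 2, 1, 0])            # reverse the token order
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the input rows permutes the output rows in exactly the same way
order_is_invisible = np.allclose(out[perm], out_perm)
print(order_is_invisible)
```

A causal mask does introduce some order dependence, but it only tells a token which tokens come before it, not how far away they are.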

Sinusoidal positional encoding: the original

Vaswani et al. (2017) added position-specific vectors to the token embeddings before the first attention layer. The vectors use sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). This produces a unique pattern for each position. Different dimension pairs oscillate at different frequencies, with wavelengths running from 2π up to 10000·2π, so position 1 and position 1000 can look alike in any single dimension but differ across the full vector. The rationale: for any fixed offset k, PE(pos+k) is a linear function of PE(pos) (each sin/cos pair is rotated by an angle that depends only on k), which the authors hypothesised would let the model learn to attend by relative position.
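The linearity property follows from the angle-addition identities, and it can be verified directly. The sketch below restates the sinusoidal formula above in NumPy and builds a block-diagonal matrix (the helper name `shift_matrix` is an assumption introduced for illustration) that maps PE(pos) to PE(pos+k) for every pos at once.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos)."""
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    m = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(a + b) =  cos(b) sin(a) + sin(b) cos(a)
        # cos(a + b) = -sin(b) sin(a) + cos(b) cos(a)
        m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return m

pe = sinusoidal_pe(64, 16)
m = shift_matrix(5, 16)

# The same matrix shifts every position by 5: it depends on k, not on pos
shift_holds = np.allclose(pe[5:], pe[:-5] @ m.T)
print(shift_holds)
```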

import numpy as np
import torch

# ─── 1. Sinusoidal (Vaswani 2017) ────────────────────────────────────────────
def sinusoidal_pe(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i   = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    div = 10000 ** (i / d_model)
    pe[:, 0::2] = np.sin(pos / div)                 # even dims
    pe[:, 1::2] = np.cos(pos / div)                 # odd dims
    return pe

# ─── 2. RoPE — Rotary Position Embedding (Su et al., 2022) ──────────────────
def rotate_half(x):
    """Split the last dim into halves (a, b) and return (-b, a): [x1..xn] → [-x(n/2+1)..-xn, x1..x(n/2)]"""
    x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, seq_len, d_head, theta=10000.0):
    """Apply RoPE to Q and K tensors. Position is baked into Q@K products."""
    device = q.device
    i      = torch.arange(0, d_head, 2, dtype=torch.float32, device=device)
    freqs  = 1.0 / (theta ** (i / d_head))                   # (d_head/2,)
    pos    = torch.arange(seq_len, dtype=torch.float32, device=device)
    angle  = torch.outer(pos, freqs)                          # (seq_len, d_head/2)
    cos    = torch.cos(angle).unsqueeze(0).unsqueeze(0)       # (1, 1, seq_len, d_head/2)
    sin    = torch.sin(angle).unsqueeze(0).unsqueeze(0)
    # Expand to full d_head by repeating
    cos_full = torch.cat([cos, cos], dim=-1)
    sin_full = torch.cat([sin, sin], dim=-1)
    # Rotate Q and K
    q_rot = q * cos_full + rotate_half(q) * sin_full
    k_rot = k * cos_full + rotate_half(k) * sin_full
    return q_rot, k_rot

# ─── 3. ALiBi — Attention with Linear Biases (Press et al., 2022) ──────────
def alibi_bias(n_heads, max_seq_len):
    """ALiBi adds a fixed negative bias proportional to distance, per head."""
    slopes = torch.tensor(
        [2 ** (-(8/n_heads) * (i+1)) for i in range(n_heads)], dtype=torch.float32
    )   # different slope per head
    # Position difference matrix: rel[i, j] = j - i, which equals -|i - j| for past keys (j <= i)
    pos   = torch.arange(max_seq_len)
    rel   = pos[None, :] - pos[:, None]                       # (seq_len, seq_len)
    bias  = slopes[:, None, None] * rel[None, :, :]           # (n_heads, seq, seq)
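    # Keep the bias for past positions only (lower triangle). Future positions are set to 0 here,
    # so a causal LM still needs its usual -inf attention mask on top of this bias.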
    causal_mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
    return bias * causal_mask

# ── Comparison ────────────────────────────────────────────────────────────────
print("1. Sinusoidal PE — position 0 vs position 10:")
pe = sinusoidal_pe(20, 8)
print(f"   pos 0:  {pe[0].round(3)}")
print(f"   pos 10: {pe[10].round(3)}")
print(f"   dot(pos_0, pos_10) = {pe[0] @ pe[10]:.3f}  (less similar = positions distinguishable)")

print("\n2. RoPE — modifies Q,K before attention, not the embedding:")
B, H, T, dh = 1, 4, 10, 32
q = torch.randn(B, H, T, dh)
k = torch.randn(B, H, T, dh)
q_rot, k_rot = apply_rope(q, k, T, dh)
print(f"   Q shape unchanged: {q_rot.shape}  (position is in the rotation, not added)")

print("\n3. ALiBi — attention bias matrix (negative, lower-triangular):")
bias = alibi_bias(4, 6)
print("   Head 0 bias (causal), values scaled x10:")
print((bias[0] * 10).round(decimals=1).numpy())   # torch takes decimals as a keyword argument

RoPE: position in the rotation, not the embedding

Rotary Position Embedding (Su et al., 2022) does not add a positional vector to the token embedding. Instead, it rotates the Q and K vectors in the attention computation: dimensions are grouped into pairs, and each pair is rotated by an angle proportional to the token's position. When you compute Q·Kᵀ between position i and position j, the two rotations combine so that the score depends on position only through the offset (i-j). This is the critical advantage: RoPE encodes relative positions, not absolute ones. Plain RoPE still degrades beyond its training length, but because position lives entirely in the rotation angles, those angles can be rescaled after training, which is what YaRN and other RoPE scaling techniques for context extension do.
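The relative-position property is easy to verify numerically. The sketch below is a single-vector NumPy version of the rotation, using the same half-split layout as `rotate_half` in the listing above (the function name `rope` and the test positions are illustrative assumptions). Two query/key pairs with the same offset should give the same score, wherever they sit in the sequence.

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    """Rotate one vector x (even length d) to position pos, half-split layout."""
    d = x.shape[-1]
    freqs = 1.0 / (theta ** (np.arange(0, d, 2) / d))    # (d/2,)
    angle = pos * freqs
    cos = np.concatenate([np.cos(angle), np.cos(angle)])
    sin = np.concatenate([np.sin(angle), np.sin(angle)])
    x1, x2 = x[: d // 2], x[d // 2:]
    rotated_half = np.concatenate([-x2, x1])
    return x * cos + rotated_half * sin

rng = np.random.default_rng(0)
q, k = rng.normal(size=32), rng.normal(size=32)

score_a = rope(q, 7) @ rope(k, 3)        # positions 7 and 3, offset 4
score_b = rope(q, 107) @ rope(k, 103)    # positions 107 and 103, offset 4
score_c = rope(q, 7) @ rope(k, 4)        # positions 7 and 4, offset 3

same_offset_same_score = bool(np.isclose(score_a, score_b))
different_offset_differs = not np.isclose(score_a, score_c)
print(same_offset_same_score, different_offset_differs)
```

Because each rotation preserves vector length, RoPE changes only how Q and K align with each other, not their norms.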

LLaMA, Mistral, Qwen, Gemma, and most modern open-weight models use RoPE. It is the current default for new architectures because of its relative-position property and empirically superior performance at long context.

ALiBi: no learned positions, just distance penalties

Attention with Linear Biases (Press et al., 2022) adds a fixed negative bias to attention scores, proportional to the distance between tokens: bias = -slope × |i - j|. Different heads use different slopes, fixed in advance rather than learned. No position vectors, no rotation: just a penalty that grows with distance. A model trained with ALiBi can be run at inference time on sequences longer than its training context without fine-tuning, because the bias is a simple distance function with no learned parameters and is defined for any sequence length.
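To make the "defined for any sequence length" point concrete, here is a minimal NumPy sketch of the causal bias. It assumes a power-of-two head count and the geometric slope schedule from the ALiBi paper; the lengths 16 and 64 are arbitrary illustrative values. The bias for the longer sequence contains the bias for the shorter one unchanged, so nothing has to be retrained or resized.

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """Causal ALiBi bias: -slope * (i - j) for keys j <= query i, 0 elsewhere."""
    # Geometric slopes, assuming n_heads is a power of two
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]             # dist[i, j] = i - j
    # Clamp future positions to distance 0; they still need a -inf causal mask
    return -slopes[:, None, None] * np.maximum(dist, 0)

short = alibi_bias(n_heads=8, seq_len=16)
long = alibi_bias(n_heads=8, seq_len=64)

# The longer bias extends the shorter one; the top-left block is identical
prefix_matches = np.allclose(long[:, :16, :16], short)

# One step back costs exactly one slope: 2^(-1) for head 0 with 8 heads
head0_slope = -short[0, 1, 0]
print(prefix_matches, head0_slope)
```

The bias is added to the pre-softmax scores, so a head with a steep slope effectively attends locally while a head with a shallow slope can still reach distant tokens.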

The practical production question: which positional encoding allows the best context extension? RoPE with YaRN scaling currently wins: LLaMA models can be extended from 8k to 128k+ context with RoPE scaling, at a fraction of the cost that retraining a sinusoidal-PE model for the longer context would require. If you are serving a model and want to serve longer contexts cheaply, finding out which PE it was trained with is the first step.
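The idea behind RoPE scaling can be sketched with its simplest variant, linear position interpolation. This is not YaRN itself, which, as far as I understand it, goes further by treating frequency bands differently and adjusting the attention temperature. The lengths below are hypothetical, and `rope_angles` is a name introduced here for illustration. Dividing positions by the extension factor keeps every rotation angle inside the range the model saw during training.

```python
import numpy as np

def rope_angles(seq_len, d_head, theta=10000.0, scale=1.0):
    """Rotation angle per (position, frequency); scale > 1 compresses positions."""
    freqs = 1.0 / (theta ** (np.arange(0, d_head, 2) / d_head))
    pos = np.arange(seq_len) / scale               # linear position interpolation
    return np.outer(pos, freqs)                    # (seq_len, d_head/2)

train_len, target_len = 8192, 32768                # hypothetical context lengths
scale = target_len / train_len                     # 4x extension

naive = rope_angles(target_len, 64)                # unscaled: new, unseen angles
interp = rope_angles(target_len, 64, scale=scale)  # scaled: squeezed into old range

# The fastest frequency is 1.0, so the largest angle equals the largest position
naive_out_of_range = bool(naive.max() >= train_len)
interp_in_range = bool(interp.max() < train_len)
print(naive_out_of_range, interp_in_range)
```

The trade-off is resolution: neighbouring tokens are now only 1/scale of a position apart, which is why scaled models are usually fine-tuned briefly at the new length.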
