Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.
Consider these two sentences: "The dog bit the man" and "The man bit the dog." Same words, completely different meanings, because word order matters. Humans take this for granted, but transformers have a surprising blind spot: the core attention mechanism treats both sentences identically. To a transformer without position information, "The dog bit the man" and "man the bit dog The" look the same.
This is called permutation invariance, and it means we need a way to tell the model where each word appears in a sequence. That's exactly what positional encoding does, and the choice of method directly determines how well a model handles long documents, conversations, and code. (If you haven't read about the attention mechanism yet, start there. It's the foundation this article builds on.)
💡 Key insight: Positional encoding injects word-order information into an architecture that would otherwise be "order-blind." In this article, you'll learn why position info is needed, how the original approach works, and why modern models use a clever rotation trick called RoPE.
Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Reading the formula: compare every query with all keys ($QK^\top$), scale by $\sqrt{d_k}$ ($d_k$ is the key vector dimension) for stable softmax, convert scores to attention weights, then mix value vectors using those weights.
This operation is set-based: if you shuffle the input token order, the output is simply shuffled in the same way. The attention weights depend on content, not position.[1]
Without positional information, these two sentences produce identical attention patterns:
```text
"The cat sat on the mat"
"mat the on sat cat The"
```
But meaning depends entirely on word order! We need a mechanism to inject position information.
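Permutation invariance is easy to verify numerically. The sketch below uses a bare attention step with identity projections (no learned weights, a simplification for brevity): shuffling the input tokens merely shuffles the output rows the same way.

```python
import torch
import torch.nn.functional as F

def attention(x: torch.Tensor) -> torch.Tensor:
    """Plain self-attention with identity Q/K/V projections (for illustration)."""
    scores = x @ x.transpose(-2, -1) / (x.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 8)      # 6 "tokens", 8-dim embeddings
perm = torch.randperm(6)   # shuffle the token order

out_original = attention(x)
out_shuffled = attention(x[perm])

# The shuffled input produces the original output, shuffled identically:
print(torch.allclose(out_shuffled, out_original[perm], atol=1e-6))  # True
```

No arrangement of the same embeddings changes what any token "sees," which is exactly why position must be injected explicitly.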
| Approach | How It Works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Sinusoidal (Vaswani), Learned (GPT-2, BERT) |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Attention scores | RoPE (Llama, Gemma, Qwen) |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi (BLOOM, MPT) |
For position $pos$ and dimension index $i$ (where $d_{model}$ is the embedding dimension):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
In plain terms: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds on a clock), while high dimensions oscillate slowly (like hours). Together they create a unique pattern for every position, much like how a clock's hour, minute, and second hands uniquely identify each moment.
Why sinusoids? Because $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = M_k \cdot PE_{pos}$$

where $M_k$ is a rotation matrix that depends only on the offset $k$, not on absolute position. This means the model can, in principle, learn to attend based on relative position.[1]
Here is how we can generate these multi-frequency sinusoidal positional encodings. The function takes the maximum sequence length and embedding dimension as inputs, and returns a tensor of positional encodings that can be added directly to the token embeddings.
```python
import torch
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> torch.Tensor:
    """Generate sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

    # Compute the frequency for each dimension pair
    # Low dimensions → high frequency (local position)
    # High dimensions → low frequency (coarse position)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

    return pe  # (max_len, d_model)

# Usage: simply ADD to token embeddings
# x = token_embeddings + pe[:seq_len]
```
| Dimension pair | Wavelength | What it captures |
|---|---|---|
| $i = 0$ | $2\pi \approx 6$ tokens | Very fine-grained position |
| $i = d_{model}/4$ | $\approx 630$ tokens | Medium-scale structure |
| $i = d_{model}/2 - 1$ | $\approx 63{,}000$ tokens | Very coarse position |
The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.
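The wavelength of dimension pair $i$ is $2\pi \cdot 10000^{2i/d_{model}}$, which can be checked directly (a quick sketch, assuming $d_{model} = 512$):

```python
import math

d_model = 512
# Wavelength of dimension pair i is 2π · 10000^(2i/d_model)
for i in [0, d_model // 4, d_model // 2 - 1]:
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i:3d}: wavelength ≈ {wavelength:,.0f} tokens")
```

The first pair cycles every ~6 tokens while the last pair takes tens of thousands of tokens to complete one cycle, spanning the multiple scales described above.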
Instead of fixed sinusoids, models like BERT and GPT-2 use learned position embeddings: a trainable matrix $W_{pos} \in \mathbb{R}^{L_{max} \times d_{model}}$ where each position gets its own optimizable vector.
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{max} \times d_{model}$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | Beyond $L_{max}$ (in theory) | ❌ Hard limit at $L_{max}$ |
| Performance | Strong baseline | Comparable; no significant win |
💡 Key insight: Vaswani et al. (2017) tested both and found "nearly identical results."[1] Both have been superseded by relative position methods for modern LLMs.
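A learned position table is only a few lines of PyTorch. This minimal sketch (the `LearnedPositionalEmbedding` class name is illustrative, not from any particular library) also makes the hard length limit visible:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """GPT-2 style learned absolute positions (minimal sketch)."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # L_max × d_model params
        self.max_len = max_len

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        seq_len = token_emb.shape[1]
        # Hard limit: there is simply no row in the table beyond max_len
        assert seq_len <= self.max_len, "no extrapolation beyond max_len"
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)  # broadcast over batch

x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
pe = LearnedPositionalEmbedding(max_len=32, d_model=64)
print(pe(x).shape)  # same shape as the input embeddings
```

The `assert` is the whole extrapolation story: position 33 has no embedding, so a 4K-trained model cannot even index position 4097.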
💡 Design Choice: "Why do transformers add positional encodings to embeddings instead of concatenating them?"
A common intuition is that concatenating the positional vector to the token embedding (creating a $2d_{model}$-dimensional vector) would perfectly preserve both pieces of information. However, concatenation would increase the embedding dimension from $d_{model}$ to $2d_{model}$, roughly quadrupling the parameter count and compute of every square projection layer downstream. Addition elegantly keeps the dimension constant.

But how does the model avoid confusing content with position when they are summed into the exact same values? Empirically, addition works well because the embedding space is extremely high-dimensional (typically $d_{model} = 768$ or more). In high-dimensional spaces, random vectors are nearly orthogonal.

During training, the learned projections naturally separate the content and position signals into different, nearly orthogonal subspaces of the same $d_{model}$-dimensional space. The model learns to route the position information primarily into the attention score calculations ($Q$ and $K$), while preserving the semantic content in the value vectors ($V$).
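The near-orthogonality claim is easy to sanity-check: sample many random "content"/"position" vector pairs at a typical embedding dimension and measure their cosine similarity (a sketch, assuming Gaussian vectors as a stand-in for embeddings):

```python
import torch

torch.manual_seed(0)
d = 768  # a typical embedding dimension

# Many random "content" / "position" vector pairs
content = torch.randn(1000, d)
position = torch.randn(1000, d)
sims = torch.nn.functional.cosine_similarity(content, position, dim=1)

print(f"mean |cos similarity|: {sims.abs().mean():.3f}")  # close to 0
```

At $d = 768$ the typical similarity is a few percent, so a summed vector still carries both signals in almost-disjoint directions.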
RoPE[2] is used by Llama, Gemma, Qwen, Mistral, Phi, and DeepSeek: virtually every modern open LLM.
Core insight: Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.
For a 2D subspace at position $m$, apply rotation:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
Reading the formula: take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position (further in the sequence = more rotation) and the frequency (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"
The frequencies are $\theta_i = 10000^{-2i/d}$ (the same $10000$ base as the sinusoidal encoding).
The key property: When computing the attention score $q_m^\top k_n$, the rotation matrices combine:

$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$$
Why this matters: the attention score between a query at position and a key at position depends only on the gap between them, not on where they sit in the sequence. Token 5 attending to token 3 produces the same positional effect as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures distance, not address.
The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.
```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) → x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```
💡 Key insight: Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.
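The dancer analogy can be verified on a single 2D pair. This sketch uses an explicit rotation matrix (rather than the complex-number trick) and an arbitrary frequency of 0.1: the dot product for offset 2 is identical at positions 5/3 and 1005/1003.

```python
import math
import torch

def rotate(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]]) @ v

theta = 0.1  # rotation frequency for this 2D pair (illustrative)
q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])

# Same offset of 2, very different absolute positions:
score_a = rotate(q, 5 * theta) @ rotate(k, 3 * theta)
score_b = rotate(q, 1005 * theta) @ rotate(k, 1003 * theta)
print(torch.allclose(score_a, score_b, atol=1e-4))  # True
```

Only the angle difference $(n - m)\theta$ survives the dot product, which is the relative-position property in miniature.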
| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Content-position mixing | Yes (additive) | Yes (additive) | No (multiplicative) |
| Value vectors | Modified by position | Modified by position | Unmodified ✅ |
| Extra parameters | 0 | $L_{max} \times d_{model}$ | 0 |
💡 Key insight: RoPE only modifies Q and K. The value vectors remain position-free. This is elegant because V carries the actual content that gets aggregated; only the routing (which tokens to attend to) should be position-aware.
An important theoretical result: RoPE introduces a natural long-range decay in attention scores. The expected dot product between "similar" tokens decays as their distance increases, formalized as:[3]

$$\mathbb{E}\!\left[q_m^\top k_n\right] \propto \sum_{i=0}^{d/2-1} \cos\big((m - n)\,\theta_i\big)$$

Where $m$ and $n$ are token positions, and $\theta_i$ is the rotation frequency for dimension pair $i$.
As $|m - n|$ grows, this sum decays, creating an inductive bias favoring local dependencies. This is typically desirable (nearby tokens are usually more relevant), but it means vanilla RoPE struggles at very long contexts.
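The decay is easy to plot or probe numerically. This sketch sums the cosines directly, assuming a head dimension of 128 (an arbitrary but common choice):

```python
import torch

def rope_decay(max_dist: int, d: int = 128, base: float = 10000.0) -> torch.Tensor:
    """Sum of cos((m-n)·theta_i) over the d/2 dimension pairs, per distance."""
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # (d/2,) frequencies
    dist = torch.arange(max_dist).float().unsqueeze(1)    # (max_dist, 1)
    return torch.cos(dist * theta).sum(dim=1)             # (max_dist,)

decay = rope_decay(4096)
print(decay[0].item())                     # 64.0 — maximal: all cosines are 1 at distance 0
print(decay[64].item() < decay[0].item())  # True — strictly smaller at any nonzero distance
```

The curve is oscillatory rather than monotone, but its envelope shrinks with distance, which is the local-attention bias described above.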
The base frequency $\theta$ controls this decay rate. Men et al. (2024) proved that the base frequency places an upper bound on the achievable context length: supporting a longer context requires a correspondingly larger base.[3]
| Base $\theta$ | Safe context length | Used by |
|---|---|---|
| 10,000 | ~4K | Original RoPE, Gemma 2 |
| 500,000 | ~128K | Llama 3.1 |
| 1,000,000 | ~200K+ | Llama 4, some custom models |
⚠️ Warning: Simply increasing the base frequency gives "superficial" long-context capability: perplexity looks good, but retrieval accuracy degrades. This is why RoPE scaling methods (PI, NTK-aware scaling, YaRN) are needed for genuine long-context support.
ALiBi doesn't modify embeddings at all. Instead, it adds a linear bias directly to attention scores:

$$\text{softmax}\!\left(\frac{q_i^\top k_j}{\sqrt{d_k}} - m \cdot |i - j|\right)$$
💡 Key insight: Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works exactly like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.
Reading the formula: this is the normal attention formula with one addition: before applying softmax, we add a distance penalty $-m \cdot |i - j|$. Since the penalty is negative and grows with distance (closer tokens incur a smaller penalty), nearby tokens keep their original scores while distant tokens get pushed down. The slope $m$ controls how aggressively: a steep slope makes the model short-sighted, a gentle slope lets it see further.
where:

- $m$ is a fixed, head-specific slope (not learned)
- $|i - j|$ is the distance between query position $i$ and key position $j$

Head slopes: For $n$ heads, $m_h = 2^{-8h/n}$ for $h = 1, \dots, n$
| Head | Slope ($n = 8$) | Effect |
|---|---|---|
| Head 1 | $m_1 = 1/2$ | Strong local attention |
| Head 4 | $m_4 = 1/16$ | Moderate range |
| Head 8 | $m_8 = 1/256$ | Very long-range |
Different heads get different slopes through a fixed geometric sequence, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope).
The following snippet shows how to construct the ALiBi bias tensor. It takes the number of attention heads and maximum sequence length as inputs, computing a geometric sequence of slopes and a distance matrix. The output is a bias matrix that is added directly to the attention scores before the softmax step.
```python
import torch

def build_alibi_bias(
    n_heads: int,
    max_len: int,
) -> torch.Tensor:
    """Build ALiBi attention bias matrix."""
    # Geometric slopes: 1/2^(8/n_heads), 1/2^(16/n_heads), ...
    slopes = torch.pow(2, -8 * torch.arange(1, n_heads + 1) / n_heads)

    # Distance matrix: -|i - j|
    positions = torch.arange(max_len)
    distance = -(positions.unsqueeze(0) - positions.unsqueeze(1)).abs()

    # slope × distance → (n_heads, max_len, max_len)
    alibi = slopes.unsqueeze(1).unsqueeze(1) * distance.unsqueeze(0)

    return alibi  # Add this to attention scores before softmax
```
ALiBi was trained on 1K tokens but tested on 2K–11K tokens without quality degradation.[4] Why?
An implementation detail worth noting is how ALiBi interacts with standard softmax optimizations. ALiBi biases can produce very negative values for distant tokens. In practice, softmax is computed using the max-shift trick to prevent numerical overflow:

$$\text{softmax}(x)_j = \frac{e^{\,x_j - \max_k x_k}}{\sum_k e^{\,x_k - \max_k x_k}}, \qquad x_j = s_j + b_j$$

Where $s_j$ is the raw attention score and $b_j$ is the ALiBi distance bias for token $j$.

When computing this in hardware, the very large negative biases applied to distant tokens mean that their shifted exponent values ($e^{\,x_j - \max_k x_k}$) become effectively zero. This causes them to drop out of the attention computation entirely.
Because of this property, ALiBi provides a natural, hardware-friendly "soft sliding window" without requiring explicit boolean masking or truncation of the sequence. The model naturally learns to ignore distant context simply because the math forces those attention weights to zero.
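A quick sketch of this soft-window effect, using a uniform raw score and a single steep-slope head (illustrative values, last query position):

```python
import torch

seq_len, slope = 64, 1.0  # steep slope → strongly local head (illustrative)
j = torch.arange(seq_len)
scores = torch.zeros(seq_len)                 # pretend raw q·k scores are uniform
bias = -slope * (seq_len - 1 - j).float()     # ALiBi bias for the last query position
weights = torch.softmax(scores + bias, dim=-1)

# Distant tokens' exponentials underflow to effectively zero:
print(weights[:40].sum().item())  # ≈ 0: first 40 tokens contribute almost nothing
print(weights[-1].item())         # most of the mass sits on the nearest token
```

No mask was applied, yet the head behaves as if one were: the bias alone carves out the local window.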
💡 Key insight: Imagine learning to swim in a 25-meter pool (your training context length). Sinusoidal encoding is like a swimmer who panics the instant they're in a 50-meter pool, with total failure. ALiBi is like a swimmer who learned proper stroke technique and can handle a 100-meter pool with no extra training. RoPE with scaling (YaRN) is like a swimmer who does a few warm-up laps in the bigger pool (fine-tuning) and then performs beautifully.
This is a critical practical challenge in production systems:
"Your model was trained on 4K tokens. How does it perform at 32K?"
| Method | Train length | Max usable length | Extrapolation |
|---|---|---|---|
| Sinusoidal | 4K | ~4K | ❌ Catastrophic |
| Learned | 4K | 4K exactly | ❌ No extrapolation |
| RoPE (vanilla) | 4K | ~6K | ⚠️ Degrades gradually |
| ALiBi | 4K | ~11K | ✅ Graceful |
| RoPE + NTK | 4K | ~32K | ✅ With scaling |
| RoPE + YaRN | 4K | ~128K | ✅ Best scaling |
When vanilla RoPE fails at long contexts, three scaling methods exist:
1. Position Interpolation (PI): Scale positions down to fit within the training range[5]

$$m' = \frac{m}{s}, \qquad s = \frac{L_{new}}{L_{train}}$$
Reading the formula: if you trained on 4K tokens but want to handle 16K, you shrink all positions by 4× so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. The model sees familiar territory, but it needs a few hundred steps of fine-tuning to adjust to the compressed spacing.
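The position remapping is a one-liner; this sketch (hypothetical helper name) shows the 4K → 16K example from above:

```python
import torch

def interpolate_positions(seq_len: int, train_len: int) -> torch.Tensor:
    """Position Interpolation: compress positions into the trained range."""
    scale = max(seq_len / train_len, 1.0)
    return torch.arange(seq_len).float() / scale

pos = interpolate_positions(seq_len=16384, train_len=4096)
print(pos[-1].item())  # 4095.75 — the last token lands back inside the trained 0..4095 range
```

These fractional positions feed the same RoPE rotations as before, which is why only a brief fine-tune is needed to adapt to the tighter spacing.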
2. NTK (Neural Tangent Kernel)-aware Scaling: Scale the frequency base instead of positions[6]

$$\theta' = \theta \cdot s^{\,d/(d-2)}$$

In plain terms: instead of squishing the positions, we slow down the rotation frequencies so that longer contexts still produce "reasonable" rotation angles. The scaling factor equals how much you want to extend ($s = L_{new}/L_{train}$). This works without fine-tuning for moderate extensions because we're changing the ruler rather than the positions.
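A sketch of the base adjustment (hypothetical helper name; head dimension 128 assumed for the example):

```python
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware scaling: stretch the frequency base instead of the positions."""
    return base * scale ** (dim / (dim - 2))

# Extending a 4K model to 16K (scale = 4) with head dim 128:
new_base = ntk_scaled_base(10000.0, scale=4.0, dim=128)
print(round(new_base))  # ≈ 41,000 — slower rotations, positions untouched
```

The scaled base is then passed wherever the model precomputes its RoPE frequencies (e.g. the `theta` argument of a `precompute_freqs_cis`-style function).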
3. YaRN (Yet another RoPE extensioN): Combines NTK-style frequency scaling with an attention temperature adjustment[7]

$$\text{softmax}\!\left(\frac{q^\top k}{t \cdot \sqrt{d_k}}\right), \qquad \sqrt{1/t} = 0.1 \ln s + 1$$

where $t$ is a temperature factor. Achieves the best extrapolation results and is used to extend Llama to 128K context.
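The temperature term is a small closed-form correction; this sketch computes the logit multiplier $1/t$ using the $\sqrt{1/t} = 0.1 \ln s + 1$ rule reported in the YaRN paper (helper name is illustrative):

```python
import math

def yarn_logit_multiplier(scale: float) -> float:
    """YaRN's attention logit multiplier 1/t, with sqrt(1/t) = 0.1·ln(s) + 1."""
    return (0.1 * math.log(scale) + 1.0) ** 2

# Extending 4K → 128K is a scale factor of 32:
mult = yarn_logit_multiplier(32.0)
print(f"attention logits scaled by {mult:.3f}")  # a modest sharpening of attention
```

At scale 1 the multiplier is exactly 1 (no change), and it grows only logarithmically, so the adjustment stays gentle even at large extensions.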
| Model | Method | Context length | Notes |
|---|---|---|---|
| GPT-4o | Learned absolute | 128K | Massive training budget makes this viable |
| Llama 3.1/4 | RoPE (scaled) | 128K+ | |
| Gemma 2 | RoPE | 8K | Standard |
| Mistral | RoPE + Sliding Window | 32K | Combines RoPE with local attention |
| BLOOM | ALiBi | 2K | Designed for multilingual, shorter contexts |
| MPT | ALiBi | 65K | Extended via ALiBi extrapolation |
| Qwen 2.5 | RoPE (YaRN) | 128K | YaRN scaling for long context |
| DeepSeek V3 | RoPE (YaRN) | 128K | Extended from 4K training |
The industry consensus (2025–2026): RoPE with frequency base scaling or YaRN is dominant. ALiBi is elegant but less used in the latest models. Learned positions only work if you can afford massive compute (GPT-4 scale).
"Why not concatenate positional encodings?" Concatenation increases the dimension from $d_{model}$ to $2d_{model}$, roughly quadrupling the parameter count and compute of the square projection layers. Addition works because high-dimensional spaces allow the model to learn orthogonal subspaces for content vs. position.
"RoPE Relative Derivation" At position $m$, the rotated query is $q_m = R_m q$. At position $n$, the rotated key is $k_n = R_n k$. Their dot product is $q_m^\top k_n = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$. Reading the formula: after rotation, the dot product depends only on the relative offset $n - m$, not on the absolute positions. This is why RoPE naturally encodes relative position information.
"Why does Llama use $\theta = 500{,}000$?" Higher base frequencies slow down the cosine decay of $\cos((m-n)\theta_i)$, allowing the model to maintain meaningful attention scores at larger distances. Men et al. (2024) proved that the base frequency bounds the maximum effective context length: $\theta = 10{,}000$ supports ~4K, while $\theta = 500{,}000$ supports ~128K.[3]
1. Vaswani, A., et al. (2017). Attention Is All You Need.
2. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
3. Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.
4. Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.
5. Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.
6. bloc97 (2023). NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
7. Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.