Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.
Consider these two sentences: "The dog bit the man" and "The man bit the dog." They use the same words but have completely different meanings because word order matters. Humans get this instantly, but transformers have a surprising blind spot: the core attention mechanism is "order-blind" and treats both sentences identically. To a transformer without position information, "The dog bit the man" and "man the bit dog The" look the same.
This is called permutation equivariance, and it means we need a way to tell the model where each word appears in a sequence. That's exactly what positional encoding does, and the choice of method directly determines how well a Large Language Model (LLM) handles long documents, conversations, and code. If you haven't read about the attention mechanism yet, start there. It's the foundation this article builds on.
💡 Key insight: Positional encoding injects word-order information into an architecture that would otherwise be "order-blind." In this article, you'll learn why position info is needed, how the original approach works, and why modern models use a clever rotation trick called RoPE.
Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Compare every query with all keys ($QK^\top$), scale by $\sqrt{d_k}$ ($d_k$ is the key vector dimension) for stable softmax, convert scores to attention weights, then mix value vectors using those weights.
This operation is set-based: if you shuffle the input token order, the output is simply shuffled in the same way. The attention weights depend on content, not position.[1]
Without positional information, transformers can't distinguish between different word orders, which can completely alter the meaning of a sentence. For example, these two sequences contain identical tokens but have very different semantics. When processed by a position-blind transformer, they produce identical attention patterns:
```text
"The cat sat on the mat"
"mat the on sat cat The"
```
But meaning depends entirely on word order! We need a mechanism to inject position information.
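Permutation equivariance is easy to verify empirically. The toy sketch below (illustrative only; random vectors stand in for embeddings, and we take Q = K = V for simplicity) shows that shuffling the input tokens merely shuffles the attention output rows:

```python
import torch

def attention(x: torch.Tensor) -> torch.Tensor:
    """Plain scaled dot-product self-attention with no position info (Q = K = V = x)."""
    d_k = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d_k**0.5
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 16)          # 6 "tokens", 16 dims
perm = torch.randperm(6)

out = attention(x)
out_shuffled = attention(x[perm])

# Shuffling the input just shuffles the output rows: position carries no signal
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))  # True
```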
| Approach | How It Works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Sinusoidal (Vaswani), Learned (GPT-2, BERT) |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Attention scores | RoPE (Llama, Gemma, Qwen) |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi (BLOOM, MPT) |
💡 Key insight: Think of positional encoding like reading a clock. A clock uses multiple hands moving at different speeds (seconds, minutes, hours) to uniquely identify any moment in a 12-hour period. Sinusoidal positional encoding does the exact same thing for words in a sequence: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds), while high dimensions oscillate slowly (like hours).
For position $pos$ and dimension pair index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
The combination of these sine and cosine waves creates a unique pattern for every position.
Because $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = R_k \cdot PE_{pos}$$

where $R_k$ is a (block-diagonal) rotation matrix that depends only on offset $k$, not absolute position. This means the model can, in principle, learn to attend based on relative position.[1]
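This shift property can be checked numerically. The sketch below (my own illustration; the dimension pair and offset are arbitrary choices) builds the rotation matrix for a fixed offset k and verifies it maps the encoding at any position to the encoding at position + k:

```python
import math
import torch

theta = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of dimension pair i=3, d_model=64 (arbitrary)
k = 7                                  # fixed offset

# Rotation matrix that depends only on the offset k, not the position
R = torch.tensor([
    [math.cos(k * theta),  math.sin(k * theta)],
    [-math.sin(k * theta), math.cos(k * theta)],
])

for pos in (0, 11, 500):
    pe_pos   = torch.tensor([math.sin(pos * theta), math.cos(pos * theta)])
    pe_shift = torch.tensor([math.sin((pos + k) * theta), math.cos((pos + k) * theta)])
    # Same R works at every position: the mapping is position-independent
    assert torch.allclose(R @ pe_pos, pe_shift, atol=1e-5)
```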
Here's how we can generate these multi-frequency sinusoidal positional encodings using PyTorch. The function takes the maximum sequence length and embedding dimension as inputs to compute the required frequencies. It returns a tensor of positional encodings that can be added directly to the token embeddings before they enter the attention layers.
```python
import torch
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> torch.Tensor:
    """Generate sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

    # Compute the frequency for each dimension pair
    # Low dimensions → high frequency (local position)
    # High dimensions → low frequency (coarse position)
    # θ_i = 10000^(-2i/d_model), implemented via exp(-2i/d_model * ln(10000))
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

    return pe  # (max_len, d_model)

# Usage: simply ADD to token embeddings
# x = token_embeddings + pe[:seq_len]
```
| Dimension pair | Wavelength (d=512) | What it captures |
|---|---|---|
| $i = 0$ | $2\pi \approx 6$ tokens | Very fine-grained local position |
| $i = 128$ | $\approx 630$ tokens | Medium-scale structure |
| $i = 255$ | $\approx 60{,}000$ tokens | Very coarse global position |
The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.
Instead of fixed sinusoids, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 use learned position embeddings: a trainable matrix $\mathbf{P} \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ where each position gets its own optimizable vector.
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{\max} \times d_{\text{model}}$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | ✅ Beyond $L_{\max}$ (in theory) | ❌ Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |
💡 Key insight: Vaswani et al. (2017) tested both and found "nearly identical results."[1] Both have been superseded by relative position methods for modern LLMs.
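For completeness, a learned position table is only a few lines of PyTorch. This is an illustrative sketch (the module and parameter names are my own, not taken from any particular model); note the hard failure beyond max_len:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable position table, GPT-2/BERT style (illustrative sketch)."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # max_len x d_model extra params
        self.max_len = max_len

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        seq_len = token_embeddings.shape[1]
        if seq_len > self.max_len:
            # The hard limit: no vectors exist for positions beyond max_len
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.max_len}")
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)

pe = LearnedPositionalEmbedding(max_len=1024, d_model=64)
x = torch.randn(2, 10, 64)
print(pe(x).shape)  # torch.Size([2, 10, 64])
```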
💡 Key insight: Why do transformers add positional encodings to embeddings instead of concatenating them?
A common intuition is that concatenating the positional vector to the token embedding (creating a $2d$-dimensional vector) would perfectly preserve both pieces of information. However, concatenation would increase the embedding dimension from $d$ to $2d$, doubling the input dimension and parameter count of every downstream projection layer and inflating the cost of all subsequent computation. Addition elegantly keeps the dimension constant.
But how does the model avoid confusing content with position when they are summed into the exact same values? Empirically, addition works well because the embedding space is extremely high-dimensional (typically hundreds to thousands of dimensions). In high-dimensional spaces, random vectors are nearly orthogonal.
During training, the learned projections naturally separate the content and position signals into different, nearly orthogonal subspaces of the same $d$-dimensional space. The model learns to route the position information primarily into the attention score calculations ($Q$ and $K$), while the value vectors ($V$) preserve the semantic content.
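You can see the near-orthogonality claim directly. The snippet below is a toy check (random Gaussian vectors stand in for a content embedding and a position embedding) measuring the cosine similarity between two random 4096-dimensional vectors:

```python
import torch

torch.manual_seed(0)
d = 4096  # a typical LLM embedding width

content = torch.randn(d)
position = torch.randn(d)

cos_sim = torch.nn.functional.cosine_similarity(content, position, dim=0)
# In high dimensions, random vectors are nearly orthogonal: |cos| ~ 1/sqrt(d) ≈ 0.016
print(abs(cos_sim.item()) < 0.1)  # True
```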
RoPE (Rotary Position Embeddings) is used by Llama, Gemma, Qwen, Mistral, Phi, and DeepSeek: virtually every modern open LLM.[2]
Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.
💡 Key insight: Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: a dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.
For a 2D subspace at position $m$, apply the rotation:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
Take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position (further in the sequence = more rotation) and the frequency (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"
The frequencies are $\theta_i = 10000^{-2i/d}$, i.e., base $10000$ (the same as sinusoidal encoding).
When computing $q_m^\top k_n$, the rotation matrices combine:

$$R(m\theta_i)^\top R(n\theta_i) = R\big((n - m)\theta_i\big)$$
The attention score between a query at position and a key at position depends only on the gap between them, not on where they sit in the sequence. Token 5 attending to token 3 produces the same positional effect as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures distance, not address.
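This relative-position property can be verified numerically. The sketch below is a simplified single-vector version of RoPE (the helper name rope_rotate is my own); it rotates a query and key at positions 5/3 and again at 105/103, then compares the two dot products:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: int, theta: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs of x by pos * theta_i (single-vector RoPE sketch)."""
    d = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))
    angles = pos * freqs
    rot = torch.polar(torch.ones_like(angles), angles)   # e^{i * pos * theta_i}
    x_complex = torch.view_as_complex(x.reshape(-1, 2))
    return torch.view_as_real(x_complex * rot).flatten()

torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)

score_a = rope_rotate(q, 5)   @ rope_rotate(k, 3)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 103)

# Same gap (2) -> same positional effect, regardless of absolute position
print(torch.allclose(score_a, score_b, atol=1e-2))  # True
```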
The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.
```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) → x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast to (1, seq_len, 1, d_k/2)
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```
| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Content-position mixing | Yes (additive) | Yes (additive) | No (multiplicative) |
| Value vectors | Modified by position | Modified by position | Unmodified ✅ |
| Extra parameters | 0 | $L_{\max} \times d_{\text{model}}$ | 0 |
💡 Key insight: RoPE only modifies Q and K. The value vectors remain position-free. This is elegant because V carries the actual content that gets aggregated; only the routing (which tokens to attend to) should be position-aware.
An important theoretical result: RoPE introduces a natural long-range decay in attention scores. The expected dot product between "similar" tokens decays as their distance increases, formalized as:[3]

$$B(m - n) = \sum_{i=0}^{d/2 - 1} \cos\big((m - n)\,\theta_i\big)$$

where $m$ and $n$ are token positions, $\theta_i$ is the rotation frequency for dimension pair $i$, and $B$ is the attention score as a function of relative distance.
As $|m - n|$ grows, this sum decays, creating an inductive bias favoring local dependencies. This is typically desirable (nearby tokens are usually more relevant), but it means vanilla RoPE struggles at very long contexts. The base frequency $\theta_{\text{base}}$ controls this decay rate. Men et al. (2024) proved that it places a theoretical lower bound on the achievable context length: supporting a longer context length $L$ requires a correspondingly larger base.[3] This means doubling your target context requires roughly quadrupling the base frequency.
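You can probe this decay directly. The sketch below (the helper rope_decay is my own, implementing the cosine sum above) evaluates the envelope at distance 0 and at a long distance for two bases:

```python
import torch

def rope_decay(m: int, base: float, d: int = 128) -> float:
    """Decay envelope B(m) = sum_i cos(m * theta_i) for RoPE frequencies."""
    theta = base ** (-2.0 * torch.arange(0, d // 2).float() / d)
    return torch.cos(m * theta).sum().item()

# At distance 0 the envelope is maximal (all cosines equal 1)
print(rope_decay(0, base=10_000))  # 64.0 for d=128
# At long distances the envelope shrinks; compare a small vs. large base
print(rope_decay(4096, base=10_000), rope_decay(4096, base=500_000))
```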
| Base $\theta_{\text{base}}$ | Estimated safe context | Used by |
|---|---|---|
| 10,000 | ~4K | Original RoPE, Gemma |
| 100,000 | ~32K | Some custom models |
| 500,000 | ~64-128K | Llama 3.1 |
⚠️ Warning: Simply increasing the base frequency gives "superficial" long-context capability: perplexity looks good, but retrieval accuracy degrades. This is why RoPE scaling methods (PI, NTK-aware scaling, YaRN) are needed for genuine long-context support.
ALiBi (Attention with Linear Biases) doesn't modify embeddings at all. Instead, it subtracts a linear bias, one that increases with token distance, directly from the attention scores:

$$\text{softmax}\!\left(q_i K^\top + m \cdot \big[{-(i-1)}, \ldots, -2, -1, 0\big]\right)$$
💡 Key insight: Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works exactly like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.
This is the normal attention formula with one modification: before applying softmax, we subtract a distance penalty $m \cdot |i - j|$, where $m$ is a head-specific slope and $|i - j|$ is the distance between tokens. Since this term grows with distance, nearby tokens keep their original scores while distant tokens get pushed down. The slope controls how aggressively: a steep slope makes the model short-sighted, a gentle slope lets it see further.
For $n$ heads, slopes follow a geometric sequence that starts at $2^{-8/n}$ with ratio $2^{-8/n}$, giving slopes $m_h = 2^{-8h/n}$ for $h = 1, \ldots, n$.
For 8 heads ($n = 8$): $m_h \in \{\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots, \tfrac{1}{256}\}$.
| Head | Slope | Effect |
|---|---|---|
| Head 1 | $2^{-1} = 0.5$ | Strong local attention |
| Head 4 | $2^{-4} \approx 0.06$ | Moderate range |
| Head 8 | $2^{-8} \approx 0.004$ | Very long-range |
Different heads get different slopes through a fixed geometric sequence, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope).
The following snippet shows how to construct the ALiBi bias tensor. It takes the number of attention heads and maximum sequence length as inputs, computing a geometric sequence of slopes and a distance matrix. The output is a bias matrix that's added directly to the attention scores before the softmax step.
```python
import torch

def build_alibi_bias(
    n_heads: int,
    max_len: int,
) -> torch.Tensor:
    """Build ALiBi attention bias matrix."""
    # Geometric slopes: 2^(-8/n), 2^(-16/n), ... for each head
    slopes = torch.pow(2, -8 * torch.arange(1, n_heads + 1).float() / n_heads)

    # Distance matrix: -|i - j|
    positions = torch.arange(max_len)
    distance = -(positions.unsqueeze(0) - positions.unsqueeze(1)).abs()

    # slope × distance → (n_heads, max_len, max_len)
    alibi = slopes.unsqueeze(1).unsqueeze(1) * distance.unsqueeze(0)

    return alibi  # Add this to attention scores before softmax
```
ALiBi was trained on 1K tokens but tested on 2K-11K tokens without quality degradation.[4] Why?
ALiBi interacts elegantly with standard softmax optimizations. Its biases can produce very negative values for distant tokens, and in practice softmax is computed with the max-shift trick to prevent numerical overflow:
$$\text{softmax}(x)_j = \frac{e^{x_j - \max_k x_k}}{\sum_{j'} e^{x_{j'} - \max_k x_k}}, \qquad x_j = s_j + b_j$$

where $s_j$ is the raw attention score and $b_j$ is the ALiBi distance bias for token $j$.
When computing this in hardware, the very large negative biases applied to distant tokens mean that their shifted exponentials $e^{x_j - \max_k x_k}$ become effectively zero. This causes them to drop out of the attention computation entirely.
Because of this property, ALiBi provides a natural, hardware-friendly "soft sliding window" without requiring explicit boolean masking or truncation of the sequence. The model naturally learns to ignore distant context simply because the math forces those attention weights to zero.
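The effect is easy to demonstrate. The toy sketch below (my own illustration: uniform content scores, a single steep-sloped head, viewed from the last query token) shows distant tokens receiving vanishing attention weight:

```python
import torch

seq_len, slope = 64, 0.5              # steep slope: a short-sighted head
scores = torch.zeros(seq_len)         # uniform content scores, to isolate the bias effect
positions = torch.arange(seq_len)
bias = -slope * (seq_len - 1 - positions).float()  # ALiBi penalty for the last query token

weights = torch.softmax(scores + bias, dim=-1)

# Distant tokens are exponentially suppressed: an effective "soft sliding window"
print(f"nearest token weight: {weights[-1].item():.3f}")
print(f"farthest token weight: {weights[0].item():.2e}")
```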
💡 Key insight: Imagine learning to swim in a 25-meter pool (your training context length). Sinusoidal encoding is like a swimmer who panics the instant they're in a 50-meter pool, with total failure. ALiBi is like a swimmer who learned proper stroke technique and can handle a 100-meter pool with no extra training. RoPE with scaling (YaRN) is like a swimmer who does a few warm-up laps in the bigger pool (fine-tuning) and then performs beautifully.
This is a critical practical challenge in production systems:
"Your model was trained on 4K tokens. How does it perform at 32K?"
| Method | Train length | Max usable length | Extrapolation |
|---|---|---|---|
| Sinusoidal | 4K | ~4K | ❌ Catastrophic |
| Learned | 4K | 4K exactly | ❌ No extrapolation |
| RoPE (vanilla) | 4K | ~6K | ⚠️ Degrades gradually |
| ALiBi | 4K | ~11K | ✅ Graceful |
| RoPE + NTK | 4K | ~32K | ✅ With scaling |
| RoPE + YaRN | 4K | ~128K | ✅ Best scaling |
When vanilla RoPE fails at long contexts, three scaling methods exist:
Position Interpolation (PI) scales position indices down to fit within the training range.[5]
If you trained on 4K tokens but want to handle 16K, you shrink all position indices by 4x so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. Critically, the rotation frequencies $\theta_i$ themselves stay unchanged; only the position index that multiplies $\theta_i$ gets rescaled. The model needs a few hundred steps of fine-tuning to adjust to the compressed spacing.
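A minimal sketch of this remapping (the helper name is my own):

```python
import torch

def interpolated_positions(seq_len: int, train_len: int) -> torch.Tensor:
    """Position Interpolation sketch: compress indices into the trained range."""
    scale = max(seq_len / train_len, 1.0)   # 4x extension -> divide positions by 4
    return torch.arange(seq_len).float() / scale

pos = interpolated_positions(seq_len=16_384, train_len=4_096)
print(pos[-1].item())  # 4095.75: position 16383 lands inside the trained 0..4095 range
```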
NTK-aware scaling adjusts the frequency base instead of the positions, trading off between low- and high-frequency components.[6]
$$\theta'_{\text{base}} = \theta_{\text{base}} \cdot \alpha^{d/(d-2)}$$

where $\alpha$ is the scale factor and $d$ is the full hidden dimension (e.g., 4096).
Instead of squishing the positions, we slow down the rotation frequencies so that longer contexts still produce reasonable rotation angles. The scaling factor $\alpha$ equals how much you want to extend the context. This works without fine-tuning for moderate extensions because we're changing the ruler rather than the positions.
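A minimal sketch of the base adjustment (the helper name is my own, assuming the $\alpha^{d/(d-2)}$ form of NTK-aware scaling given above):

```python
def ntk_scaled_base(base: float, alpha: float, dim: int) -> float:
    """NTK-aware scaling sketch: stretch the frequency base, leave positions alone."""
    return base * alpha ** (dim / (dim - 2))

# Hypothetical setup: 4x context extension for a model with hidden dimension 4096
new_base = ntk_scaled_base(10_000.0, alpha=4.0, dim=4096)
print(round(new_base))  # just above 40000, i.e., slightly more than base * alpha
```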
YaRN combines NTK scaling with an attention temperature adjustment.[6] It achieves the best quality for large extensions (4K to 128K) and requires fine-tuning.
The temperature scaling adjusts the attention softmax to $\text{softmax}\!\big(\frac{q^\top k}{t\sqrt{d_k}}\big)$, where $\frac{1}{\sqrt{t}} = 0.1\ln(s) + 1$ for context-extension factor $s$. This preserves attention patterns at longer contexts. NTK-by-parts interpolates high-frequency and low-frequency components differently, preventing fine details from getting blurred during extension.
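A minimal sketch of the temperature computation (the helper name is my own, assuming the $1/\sqrt{t} = 0.1\ln(s) + 1$ rule from the YaRN paper):

```python
import math

def yarn_temperature(scale: float) -> float:
    """YaRN attention temperature sketch: 1 / sqrt(t) = 0.1 * ln(s) + 1."""
    return 1.0 / (0.1 * math.log(scale) + 1.0) ** 2

# Extending 4K -> 128K is a 32x scale factor
t = yarn_temperature(32.0)
print(f"{t:.3f}")  # t < 1, which mildly sharpens the logits q.k / (t * sqrt(d))
```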
| Model family / era | Method | Typical context behavior | Notes |
|---|---|---|---|
| GPT-2 / BERT era | Learned absolute positions | Limited to training window | Strong baseline historically, weak extrapolation |
| Modern decoder-only open models (Llama, Qwen, Mistral families) | RoPE | Usually extended with scaling methods | Dominant choice in current decoder-only LLM design |
| Long-context decoder variants | RoPE + scaling (NTK, YaRN, interpolation) | Better long-range extrapolation | Common when pushing models far beyond pretraining length |
| BLOOM / early MPT-style variants | ALiBi | Good extrapolation without retraining | Simple and effective, but less common now |
| Sliding-window architectures | RoPE + local/sliding attention | Efficient long-context serving | Pairs rotary positions with sparse attention patterns |
RoPE dominates across most modern decoder-only LLMs, especially in open-weight model families where the architecture is published and inspectable. Different variants of RoPE handle context extension: scaling the base frequency, Position Interpolation, NTK-aware methods, or YaRN. ALiBi remains important historically and still appears in some long-context variants, but it is no longer the default choice for new large decoder models. Learned absolute positions, once common in GPT-2 and BERT, are now mostly associated with earlier transformer generations.
Concatenation increases the embedding dimension from $d$ to $2d$, doubling the input size of all projection matrices ($W_Q$, $W_K$, $W_V$) and doubling their parameter counts. Addition keeps the dimension constant, and the model learns to separate content and position into different subspaces within the same $d$-dimensional space.
At position $m$, each 2D pair of the query is rotated by angle $m\theta_i$. At position $n$, each pair of the key is rotated by $n\theta_i$.
After rotation, the dot product depends only on the relative offset $m - n$, not on absolute positions. This is why RoPE naturally encodes relative position information.
Higher base frequencies slow down the cosine decay of $B(m - n)$, allowing the model to maintain meaningful attention scores at larger distances. Men et al. (2024) proved that the base frequency bounds the maximum effective context length: $\theta_{\text{base}} = 10{,}000$ supports ~4K tokens, while $\theta_{\text{base}} = 500{,}000$ supports ~128K.[3]
[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
[3] Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.
[4] Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.
[5] Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.
[6] Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.