Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.
Consider these two sentences: "The dog bit the man" and "The man bit the dog." They use the same words but have completely different meanings because word order matters. Humans get this instantly, but transformers have a surprising blind spot: the core attention mechanism is "order-blind" and treats both sentences identically. To a transformer without position information, "The dog bit the man" and "man the bit dog The" look the same.
This is called permutation equivariance, and it means we need a way to tell the model where each word appears in a sequence. That's exactly what positional encoding does, and the choice of method directly determines how well a Large Language Model (LLM) handles long documents, conversations, and code. If you haven't read about the attention mechanism yet, start there. It's the foundation this article builds on.
💡 Key insight: Positional encoding injects word-order information into an architecture that would otherwise be "order-blind." In this article, you'll learn why position info is needed, how the original approach works, and why modern models use a clever rotation trick called RoPE.
Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Compare every query with all keys ($QK^\top$), scale by $\sqrt{d_k}$ ($d_k$ is the key vector dimension) for stable softmax, convert scores to attention weights, then mix value vectors using those weights.
This operation is set-based: if you shuffle the input token order, the output is simply shuffled in the same way. The attention weights depend on content, not position.[1]
Without positional information, transformers can't distinguish between different word orders, which can completely alter the meaning of a sentence. For example, these two sequences contain identical tokens but have very different semantics. When processed by a position-blind transformer, they produce identical attention patterns:
```text
"The cat sat on the mat"
"mat the on sat cat The"
```
But meaning depends entirely on word order! We need a mechanism to inject position information.
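Permutation equivariance is easy to verify empirically. The toy sketch below (illustrative only; random vectors stand in for embeddings, and we take Q = K = V for simplicity) shows that shuffling the input tokens merely shuffles the attention output rows:

```python
import torch

def attention(x: torch.Tensor) -> torch.Tensor:
    """Plain scaled dot-product self-attention with no position info (Q = K = V = x)."""
    d_k = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d_k**0.5
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 16)          # 6 "tokens", 16 dims
perm = torch.randperm(6)

out = attention(x)
out_shuffled = attention(x[perm])

# Shuffling the input just shuffles the output rows: position carries no signal
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))  # True
```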
| Approach | How It Works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Sinusoidal (Vaswani), Learned (GPT-2, BERT) |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Attention scores | RoPE (Llama, Gemma, Qwen) |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi (BLOOM, MPT) |
💡 Key insight: Think of positional encoding like reading a clock. A clock uses multiple hands moving at different speeds (seconds, minutes, hours) to uniquely identify any moment in a 12-hour period. Sinusoidal positional encoding does the exact same thing for words in a sequence: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds), while high dimensions oscillate slowly (like hours).
For position $pos$ and dimension pair index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
The combination of these sine and cosine waves creates a unique pattern for every position.
Because $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = R_k \cdot PE_{pos}$$

where $R_k$ is a (block-diagonal) rotation matrix that depends only on offset $k$, not absolute position. This means the model can, in principle, learn to attend based on relative position.[1]
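This shift property can be checked numerically. The sketch below (my own illustration; the dimension pair and offset are arbitrary choices) builds the rotation matrix for a fixed offset k and verifies it maps the encoding at any position to the encoding at position + k:

```python
import math
import torch

theta = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of dimension pair i=3, d_model=64 (arbitrary)
k = 7                                  # fixed offset

# Rotation matrix that depends only on the offset k, not the position
R = torch.tensor([
    [math.cos(k * theta),  math.sin(k * theta)],
    [-math.sin(k * theta), math.cos(k * theta)],
])

for pos in (0, 11, 500):
    pe_pos   = torch.tensor([math.sin(pos * theta), math.cos(pos * theta)])
    pe_shift = torch.tensor([math.sin((pos + k) * theta), math.cos((pos + k) * theta)])
    # Same R works at every position: the mapping is position-independent
    assert torch.allclose(R @ pe_pos, pe_shift, atol=1e-5)
```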
Here's how we can generate these multi-frequency sinusoidal positional encodings using PyTorch. The function takes the maximum sequence length and embedding dimension as inputs to compute the required frequencies. It returns a tensor of positional encodings that can be added directly to the token embeddings before they enter the attention layers.
```python
import torch
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> torch.Tensor:
    """Generate sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

    # Compute the frequency for each dimension pair
    # Low dimensions → high frequency (local position)
    # High dimensions → low frequency (coarse position)
    # θ_i = 10000^(-2i/d_model), implemented via exp(-2i/d_model * ln(10000))
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

    return pe  # (max_len, d_model)

# Usage: simply ADD to token embeddings
# x = token_embeddings + pe[:seq_len]
```
| Dimension pair | Wavelength (d=512) | What it captures |
|---|---|---|
| $i = 0$ | $2\pi \approx 6$ tokens | Very fine-grained local position |
| $i = 128$ | $\approx 630$ tokens | Medium-scale structure |
| $i = 255$ | $\approx 60{,}000$ tokens | Very coarse global position |
The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.
Instead of fixed sinusoids, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 use learned position embeddings: a trainable matrix $\mathbf{P} \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$ where each position gets its own optimizable vector.
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{\max} \times d_{\text{model}}$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | ✅ Beyond $L_{\max}$ (in theory) | ❌ Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |
💡 Key insight: Vaswani et al. (2017) tested both and found "nearly identical results."[1] Both have been superseded by relative position methods for modern LLMs.
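For completeness, a learned position table is only a few lines of PyTorch. This is an illustrative sketch (the module and parameter names are my own, not taken from any particular model); note the hard failure beyond max_len:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable position table, GPT-2/BERT style (illustrative sketch)."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # max_len x d_model extra params
        self.max_len = max_len

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        seq_len = token_embeddings.shape[1]
        if seq_len > self.max_len:
            # The hard limit: no vectors exist for positions beyond max_len
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.max_len}")
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)

pe = LearnedPositionalEmbedding(max_len=1024, d_model=64)
x = torch.randn(2, 10, 64)
print(pe(x).shape)  # torch.Size([2, 10, 64])
```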
💡 Key insight: Why do transformers add positional encodings to embeddings instead of concatenating them?
A common intuition is that concatenating the positional vector to the token embedding (creating a $2d$-dimensional vector) would perfectly preserve both pieces of information. However, concatenation would increase the embedding dimension from $d$ to $2d$, doubling the input dimension and parameter count of every downstream projection layer and inflating the cost of all subsequent computation. Addition elegantly keeps the dimension constant.
But how does the model avoid confusing content with position when they are summed into the exact same values? Empirically, addition works well because the embedding space is extremely high-dimensional (typically hundreds to thousands of dimensions). In high-dimensional spaces, random vectors are nearly orthogonal.
During training, the learned projections naturally separate the content and position signals into different, nearly orthogonal subspaces of the same $d$-dimensional space. The model learns to route the position information primarily into the attention score calculations ($Q$ and $K$), while the value vectors ($V$) preserve the semantic content.
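You can see the near-orthogonality claim directly. The snippet below is a toy check (random Gaussian vectors stand in for a content embedding and a position embedding) measuring the cosine similarity between two random 4096-dimensional vectors:

```python
import torch

torch.manual_seed(0)
d = 4096  # a typical LLM embedding width

content = torch.randn(d)
position = torch.randn(d)

cos_sim = torch.nn.functional.cosine_similarity(content, position, dim=0)
# In high dimensions, random vectors are nearly orthogonal: |cos| ~ 1/sqrt(d) ≈ 0.016
print(abs(cos_sim.item()) < 0.1)  # True
```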
RoPE (Rotary Position Embeddings) is used by Llama, Gemma, Qwen, Mistral, Phi, and DeepSeek: virtually every modern open LLM.[2]
Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.
💡 Key insight: Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: a dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.
For a 2D subspace at position $m$, apply the rotation:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
Take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position (further in the sequence = more rotation) and the frequency (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"
The frequencies are $\theta_i = 10000^{-2i/d}$, i.e., base $10000$ (the same as sinusoidal encoding).
When computing $q_m^\top k_n$, the rotation matrices combine:

$$R(m\theta_i)^\top R(n\theta_i) = R\big((n - m)\theta_i\big)$$
The attention score between a query at position and a key at position depends only on the gap between them, not on where they sit in the sequence. Token 5 attending to token 3 produces the same positional effect as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures distance, not address.
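This relative-position property can be verified numerically. The sketch below is a simplified single-vector version of RoPE (the helper name rope_rotate is my own); it rotates a query and key at positions 5/3 and again at 105/103, then compares the two dot products:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: int, theta: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs of x by pos * theta_i (single-vector RoPE sketch)."""
    d = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))
    angles = pos * freqs
    rot = torch.polar(torch.ones_like(angles), angles)   # e^{i * pos * theta_i}
    x_complex = torch.view_as_complex(x.reshape(-1, 2))
    return torch.view_as_real(x_complex * rot).flatten()

torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)

score_a = rope_rotate(q, 5)   @ rope_rotate(k, 3)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 103)

# Same gap (2) -> same positional effect, regardless of absolute position
print(torch.allclose(score_a, score_b, atol=1e-2))  # True
```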
The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.
```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) → x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast to (1, seq_len, 1, d_k/2)
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```
| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Content-position mixing | Yes (additive) | Yes (additive) | No (multiplicative) |
| Value vectors | Modified by position | Modified by position | Unmodified ✅ |
| Extra parameters | 0 | $L_{\max} \times d_{\text{model}}$ | 0 |
💡 Key insight: RoPE only modifies Q and K. The value vectors remain position-free. This is elegant because V carries the actual content that gets aggregated; only the routing (which tokens to attend to) should be position-aware.
An important theoretical result: RoPE introduces a natural long-range decay in attention scores. The expected dot product between "similar" tokens decays as their distance increases, formalized as:[3]

$$B(m - n) = \sum_{i=0}^{d/2 - 1} \cos\big((m - n)\,\theta_i\big)$$

where $m$ and $n$ are token positions, $\theta_i$ is the rotation frequency for dimension pair $i$, and $B$ is the attention score as a function of relative distance.
As $|m - n|$ grows, this sum decays, creating an inductive bias favoring local dependencies. This is typically desirable (nearby tokens are usually more relevant), but it means vanilla RoPE struggles at very long contexts. The base frequency $\theta_{\text{base}}$ controls this decay rate. Men et al. (2024) proved that it places a theoretical lower bound on the achievable context length: supporting a longer context length $L$ requires a correspondingly larger base.[3] This means doubling your target context requires roughly quadrupling the base frequency.
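You can probe this decay directly. The sketch below (the helper rope_decay is my own, implementing the cosine sum above) evaluates the envelope at distance 0 and at a long distance for two bases:

```python
import torch

def rope_decay(m: int, base: float, d: int = 128) -> float:
    """Decay envelope B(m) = sum_i cos(m * theta_i) for RoPE frequencies."""
    theta = base ** (-2.0 * torch.arange(0, d // 2).float() / d)
    return torch.cos(m * theta).sum().item()

# At distance 0 the envelope is maximal (all cosines equal 1)
print(rope_decay(0, base=10_000))  # 64.0 for d=128
# At long distances the envelope shrinks; compare a small vs. large base
print(rope_decay(4096, base=10_000), rope_decay(4096, base=500_000))
```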
| Base $\theta_{\text{base}}$ | Estimated safe context | Used by |
|---|---|---|
| 10,000 | ~4K | Original RoPE, Gemma |
| 100,000 | ~32K | Some custom models |
| 500,000 | ~64-128K | Llama 3.1 |
⚠️ Warning: Simply increasing the base frequency gives "superficial" long-context capability: perplexity looks good, but retrieval accuracy degrades. This is why RoPE scaling methods (PI, NTK-aware scaling, YaRN) are needed for genuine long-context support.
ALiBi (Attention with Linear Biases) doesn't modify embeddings at all. Instead, it subtracts a linear bias, one that increases with token distance, directly from the attention scores:

$$\text{softmax}\!\left(q_i K^\top + m \cdot \big[{-(i-1)}, \ldots, -2, -1, 0\big]\right)$$
💡 Key insight: Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works exactly like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.
This is the normal attention formula with one modification: before applying softmax, we subtract a distance penalty $m \cdot |i - j|$, where $m$ is a head-specific slope and $|i - j|$ is the distance between tokens. Since this term grows with distance, nearby tokens keep their original scores while distant tokens get pushed down. The slope controls how aggressively: a steep slope makes the model short-sighted, a gentle slope lets it see further.
For $n$ heads, slopes follow a geometric sequence that starts at $2^{-8/n}$ with ratio $2^{-8/n}$, giving slopes $m_h = 2^{-8h/n}$ for $h = 1, \ldots, n$.
For 8 heads ($n = 8$): $m_h \in \{\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots, \tfrac{1}{256}\}$.
| Head | Slope | Effect |
|---|---|---|
| Head 1 | $2^{-1} = 0.5$ | Strong local attention |
| Head 4 | $2^{-4} \approx 0.06$ | Moderate range |
| Head 8 | $2^{-8} \approx 0.004$ | Very long-range |
Different heads get different slopes through a fixed geometric sequence, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope).
The following snippet shows how to construct the ALiBi bias tensor. It takes the number of attention heads and maximum sequence length as inputs, computing a geometric sequence of slopes and a distance matrix. The output is a bias matrix that's added directly to the attention scores before the softmax step.
```python
import torch

def build_alibi_bias(
    n_heads: int,
    max_len: int,
) -> torch.Tensor:
    """Build ALiBi attention bias matrix."""
    # Geometric slopes: 2^(-8/n), 2^(-16/n), ... for each head
    slopes = torch.pow(2, -8 * torch.arange(1, n_heads + 1).float() / n_heads)

    # Distance matrix: -|i - j|
    positions = torch.arange(max_len)
    distance = -(positions.unsqueeze(0) - positions.unsqueeze(1)).abs()

    # slope × distance → (n_heads, max_len, max_len)
    alibi = slopes.unsqueeze(1).unsqueeze(1) * distance.unsqueeze(0)

    return alibi  # Add this to attention scores before softmax
```
ALiBi was trained on 1K tokens but tested on 2K-11K tokens without quality degradation.[4] Why?
ALiBi interacts elegantly with standard softmax optimizations. Its biases can produce very negative values for distant tokens, and in practice softmax is computed with the max-shift trick to prevent numerical overflow:
$$\text{softmax}(x)_j = \frac{e^{x_j - \max_k x_k}}{\sum_{j'} e^{x_{j'} - \max_k x_k}}, \qquad x_j = s_j + b_j$$

where $s_j$ is the raw attention score and $b_j$ is the ALiBi distance bias for token $j$.
When computing this in hardware, the very large negative biases applied to distant tokens mean that their shifted exponentials $e^{x_j - \max_k x_k}$ become effectively zero. This causes them to drop out of the attention computation entirely.
Because of this property, ALiBi provides a natural, hardware-friendly "soft sliding window" without requiring explicit boolean masking or truncation of the sequence. The model naturally learns to ignore distant context simply because the math forces those attention weights to zero.
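The effect is easy to demonstrate. The toy sketch below (my own illustration: uniform content scores, a single steep-sloped head, viewed from the last query token) shows distant tokens receiving vanishing attention weight:

```python
import torch

seq_len, slope = 64, 0.5              # steep slope: a short-sighted head
scores = torch.zeros(seq_len)         # uniform content scores, to isolate the bias effect
positions = torch.arange(seq_len)
bias = -slope * (seq_len - 1 - positions).float()  # ALiBi penalty for the last query token

weights = torch.softmax(scores + bias, dim=-1)

# Distant tokens are exponentially suppressed: an effective "soft sliding window"
print(f"nearest token weight: {weights[-1].item():.3f}")
print(f"farthest token weight: {weights[0].item():.2e}")
```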
💡 Key insight: Imagine learning to swim in a 25-meter pool (your training context length). Sinusoidal encoding is like a swimmer who panics the instant they're in a 50-meter pool, with total failure. ALiBi is like a swimmer who learned proper stroke technique and can handle a 100-meter pool with no extra training. RoPE with scaling (YaRN) is like a swimmer who does a few warm-up laps in the bigger pool (fine-tuning) and then performs beautifully.
This is a critical practical challenge in production systems:
"Your model was trained on 4K tokens. How does it perform at 32K?"
| Method | Train length | Max usable length | Extrapolation |
|---|---|---|---|
| Sinusoidal | 4K | ~4K | ❌ Catastrophic |
| Learned | 4K | 4K exactly | ❌ No extrapolation |
| RoPE (vanilla) | 4K | ~6K | ⚠️ Degrades gradually |
| ALiBi | 4K | ~11K | ✅ Graceful |
| RoPE + NTK | 4K | ~32K | ✅ With scaling |
| RoPE + YaRN | 4K | ~128K | ✅ Best scaling |
When vanilla RoPE fails at long contexts, three scaling methods exist:
Position Interpolation (PI) scales position indices down to fit within the training range.[5]
If you trained on 4K tokens but want to handle 16K, you shrink all position indices by 4x so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. Critically, the rotation frequencies $\theta_i$ themselves stay unchanged; only the position index that multiplies $\theta_i$ gets rescaled. The model needs a few hundred steps of fine-tuning to adjust to the compressed spacing.
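A minimal sketch of this remapping (the helper name is my own):

```python
import torch

def interpolated_positions(seq_len: int, train_len: int) -> torch.Tensor:
    """Position Interpolation sketch: compress indices into the trained range."""
    scale = max(seq_len / train_len, 1.0)   # 4x extension -> divide positions by 4
    return torch.arange(seq_len).float() / scale

pos = interpolated_positions(seq_len=16_384, train_len=4_096)
print(pos[-1].item())  # 4095.75: position 16383 lands inside the trained 0..4095 range
```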
NTK-aware scaling adjusts the frequency base instead of the positions, trading off between low- and high-frequency components.[6]
$$\theta'_{\text{base}} = \theta_{\text{base}} \cdot \alpha^{d/(d-2)}$$

where $\alpha$ is the scale factor and $d$ is the full hidden dimension (e.g., 4096).
Instead of squishing the positions, we slow down the rotation frequencies so that longer contexts still produce reasonable rotation angles. The scaling factor $\alpha$ equals how much you want to extend the context. This works without fine-tuning for moderate extensions because we're changing the ruler rather than the positions.
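A minimal sketch of the base adjustment (the helper name is my own, assuming the $\alpha^{d/(d-2)}$ form of NTK-aware scaling given above):

```python
def ntk_scaled_base(base: float, alpha: float, dim: int) -> float:
    """NTK-aware scaling sketch: stretch the frequency base, leave positions alone."""
    return base * alpha ** (dim / (dim - 2))

# Hypothetical setup: 4x context extension for a model with hidden dimension 4096
new_base = ntk_scaled_base(10_000.0, alpha=4.0, dim=4096)
print(round(new_base))  # just above 40000, i.e., slightly more than base * alpha
```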
YaRN combines NTK scaling with an attention temperature adjustment.[6] It achieves the best quality for large extensions (4K to 128K) and requires fine-tuning.
The temperature scaling adjusts the attention softmax to $\text{softmax}\!\big(\frac{q^\top k}{t\sqrt{d_k}}\big)$, where $\frac{1}{\sqrt{t}} = 0.1\ln(s) + 1$ for context-extension factor $s$. This preserves attention patterns at longer contexts. NTK-by-parts interpolates high-frequency and low-frequency components differently, preventing fine details from getting blurred during extension.
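A minimal sketch of the temperature computation (the helper name is my own, assuming the $1/\sqrt{t} = 0.1\ln(s) + 1$ rule from the YaRN paper):

```python
import math

def yarn_temperature(scale: float) -> float:
    """YaRN attention temperature sketch: 1 / sqrt(t) = 0.1 * ln(s) + 1."""
    return 1.0 / (0.1 * math.log(scale) + 1.0) ** 2

# Extending 4K -> 128K is a 32x scale factor
t = yarn_temperature(32.0)
print(f"{t:.3f}")  # t < 1, which mildly sharpens the logits q.k / (t * sqrt(d))
```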
| Model family / era | Method | Typical context behavior | Notes |
|---|---|---|---|
| GPT-2 / BERT era | Learned absolute positions | Limited to training window | Strong baseline historically, weak extrapolation |
| Modern decoder-only open models (Llama, Qwen, Mistral families) | RoPE | Usually extended with scaling methods | Dominant choice in current decoder-only LLM design |
| Long-context decoder variants | RoPE + scaling (NTK, YaRN, interpolation) | Better long-range extrapolation | Common when pushing models far beyond pretraining length |
| BLOOM / early MPT-style variants | ALiBi | Good extrapolation without retraining | Simple and effective, but less common now |
| Sliding-window architectures | RoPE + local/sliding attention | Efficient long-context serving | Pairs rotary positions with sparse attention patterns |
RoPE dominates across most modern decoder-only LLMs, especially in open-weight model families where the architecture is published and inspectable. Different variants of RoPE handle context extension: scaling the base frequency, Position Interpolation, NTK-aware methods, or YaRN. ALiBi remains important historically and still appears in some long-context variants, but it is no longer the default choice for new large decoder models. Learned absolute positions, once common in GPT-2 and BERT, are now mostly associated with earlier transformer generations.
Concatenation increases the embedding dimension from $d$ to $2d$, doubling the input size of all projection matrices ($W_Q$, $W_K$, $W_V$) and doubling their parameter counts. Addition keeps the dimension constant, and the model learns to separate content and position into different subspaces within the same $d$-dimensional space.
At position $m$, each 2D pair of the query is rotated by angle $m\theta_i$. At position $n$, each pair of the key is rotated by $n\theta_i$.
After rotation, the dot product depends only on the relative offset $m - n$, not on absolute positions. This is why RoPE naturally encodes relative position information.
Higher base frequencies slow down the cosine decay of $B(m - n)$, allowing the model to maintain meaningful attention scores at larger distances. Men et al. (2024) proved that the base frequency bounds the maximum effective context length: $\theta_{\text{base}} = 10{,}000$ supports ~4K tokens, while $\theta_{\text{base}} = 500{,}000$ supports ~128K.[3]
[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
[3] Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.
[4] Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.
[5] Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.
[6] Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.