Understand why transformers need position information, how sinusoidal encodings work, how RoPE and ALiBi encode relative position, and why long-context extrapolation needs careful evaluation.
Vision Transformers showed why image patches need position. Text has the same problem: "The package left the warehouse" and "The warehouse left the package" use nearly the same words, but the order changes the meaning.
Self-attention has a surprising blind spot: by itself, it has no built-in notion of "first", "second", or "third" token. If you permute the input sequence, the outputs permute in the same way. This is called permutation equivariance, and it means the model needs a separate signal for where each token sits.
Positional encoding injects word-order information into an architecture that would otherwise be order-blind. In this article, you'll learn why position information is needed, how the original sinusoidal approach works, and how relative methods like RoPE and ALiBi change long-context behavior.
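Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Compare every query with all keys ($QK^\top$), scale by $\sqrt{d_k}$ (where $d_k$ is the key vector dimension) for stable softmax, convert scores to attention weights, then mix value vectors using those weights.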
This unmasked operation is permutation equivariant: if you shuffle the input token order, the output is shuffled in the same way. The attention weights depend on content, not position.[1]
Without positional information, transformers can't distinguish between different word orders, which can completely alter the meaning of a sentence. For example, these two sequences contain identical tokens but have very different semantics. When processed by a position-blind transformer, they produce the same attention pattern up to the same row/column permutation:
```text
"The package left the warehouse"
"warehouse the left package The"
```

Meaning depends on word order, so the model needs a mechanism to inject position information. A causal mask also adds directional structure in decoder-only models, but it doesn't replace positional encoding: the mask tells a token which earlier slots are visible, not how far away those slots are.
This tiny attention implementation verifies the property directly. It permutes input rows, recomputes attention, and gets exactly the permuted original output.
```python
import math

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
permutation = [2, 0, 1]

def attention(rows: list[list[float]]) -> list[list[float]]:
    # Unmasked self-attention where Q = K = V = rows and d_k = 2.
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(2) for k in rows]
        for q in rows
    ]
    outputs = []
    for row in scores:
        normalizer = sum(math.exp(value) for value in row)
        weights = [math.exp(value) / normalizer for value in row]
        outputs.append([
            sum(weight * value[column] for weight, value in zip(weights, rows))
            for column in range(2)
        ])
    return outputs

baseline = attention(tokens)
shuffled = attention([tokens[index] for index in permutation])
expected = [baseline[index] for index in permutation]

same_output_up_to_permutation = all(
    abs(actual - target) < 1e-12
    for actual_row, target_row in zip(shuffled, expected)
    for actual, target in zip(actual_row, target_row)
)
print(same_output_up_to_permutation)
print([round(row[0], 3) for row in baseline])
print([round(row[0], 3) for row in shuffled])
```

```text
True
[0.802, 0.599, 0.752]
[0.752, 0.802, 0.599]
```

Here is the map of approaches we will explore, so you know where each method fits before we walk through it step by step.
| Approach | How It Works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Original Transformer, BERT, GPT-2[1][2][3] |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Q/K projections | RoFormer, Llama family[4][5] |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi paper[6] |
The methods enter the transformer at different points. The three approaches in the table above are alternatives, not steps that every model applies together.
Think of positional encoding like reading a clock. A clock uses multiple hands moving at different speeds (seconds, minutes, hours) to uniquely identify any moment in a 12-hour period. Sinusoidal positional encoding does the same thing for words in a sequence: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds), while high dimensions oscillate slowly (like hours).
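For position $pos$ and dimension index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$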
The combination of these sine and cosine waves creates a unique pattern for every position. Each pair of dimensions acts like one clock hand spinning at a particular speed. Position 0 starts every hand at 0 or 1. As you move down the sequence, each hand turns a little more, and the exact angle of every hand together forms a fingerprint that no other position shares.
For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = M_k \, PE_{pos}$$

where $M_k$ is a rotation matrix that depends only on offset $k$, not absolute position $pos$. This means the model can, in principle, learn to attend based on relative position.[1]
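The shift property can be checked numerically on a single (sin, cos) pair. The sketch below uses hypothetical helper names (`pe_pair`, `shift_by_offset`) and an illustrative frequency of 0.1. It applies the angle-addition identities and confirms that moving from `pos` to `pos + k` needs only the offset `k`, never `pos` itself.

```python
import math

def pe_pair(pos: int, frequency: float) -> tuple[float, float]:
    """Return the (sin, cos) pair for one frequency at one position."""
    angle = pos * frequency
    return math.sin(angle), math.cos(angle)

def shift_by_offset(
    pair: tuple[float, float], k: int, frequency: float
) -> tuple[float, float]:
    """Apply the 2x2 matrix that depends only on the offset k."""
    sin_pos, cos_pos = pair
    sin_k, cos_k = math.sin(k * frequency), math.cos(k * frequency)
    # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
    # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
    return (
        sin_pos * cos_k + cos_pos * sin_k,
        cos_pos * cos_k - sin_pos * sin_k,
    )

frequency = 0.1  # illustrative value, not tied to any model
offset = 3
max_error = 0.0
for pos in [0, 7, 250]:
    shifted = shift_by_offset(pe_pair(pos, frequency), offset, frequency)
    direct = pe_pair(pos + offset, frequency)
    max_error = max(
        max_error,
        abs(shifted[0] - direct[0]),
        abs(shifted[1] - direct[1]),
    )

print(max_error < 1e-12)
```

The same matrix is reused for every starting position, which is the sense in which the transformation depends only on the offset. Whether a trained model actually exploits this structure is an empirical question.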
Here's how we can generate these multi-frequency sinusoidal positional encodings with plain Python. The function takes the maximum sequence length and embedding dimension as inputs to compute the required frequencies. It returns a table of positional encodings that can be added directly to token embeddings before they enter attention layers.
```python
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> list[list[float]]:
    """Generate sinusoidal positional encodings."""
    assert d_model % 2 == 0
    pe: list[list[float]] = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i is the even dimension index, so this is 10000^(-i / d_model).
            frequency = math.exp(-(math.log(10000.0) * i) / d_model)
            angle = pos * frequency
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        pe.append(row)

    return pe  # (max_len, d_model)

pe = sinusoidal_positional_encoding(max_len=8, d_model=8)
print(len(pe), len(pe[0]))
print([round(x, 3) for x in pe[0]])
print(round(pe[2][0], 3))
print(round(pe[2][1], 3))
```

```text
8 8
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
0.909
-0.416
```

For the fastest pair (dimensions 0 and 1), position 2 means angle 2 radians, so we get $\sin(2) \approx 0.909$ and $\cos(2) \approx -0.416$. For the slowest pair (dimensions 6 and 7), position 2 means angle 0.002 radians, so the values stay extremely close to 0 and 1. This is the multi-scale fingerprint in action.
| Dimension pair | Wavelength (d=512) | What it captures |
|---|---|---|
| (0, 1) | ≈ 6 tokens | Very fine-grained local position |
| (64, 65) | ≈ 20 tokens | Medium-scale structure |
| (510, 511) | ≈ 60,611 tokens | Very coarse global position |
The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.
```python
import math

d_model = 512
for first_dimension in [0, 64, 510]:
    angular_frequency = 10000 ** (-first_dimension / d_model)
    wavelength = 2 * math.pi / angular_frequency
    print(first_dimension, round(wavelength))
```

```text
0 6
64 20
510 60611
```
Instead of fixed sinusoids, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 (Generative Pre-trained Transformer 2) use learned position embeddings: a trainable matrix of shape $L_{\max} \times d_{\text{model}}$, where $L_{\max}$ is the maximum number of positions and each position gets its own optimizable vector.[2][3]
| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{\max} \times d_{\text{model}}$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | Beyond the training length (in theory) | Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |
Vaswani et al. (2017) tested both and found "nearly identical results."[1] Later architectures also explore relative-position methods, which insert position into attention in different ways.
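The hard length limit of a learned table can be illustrated with a toy lookup. This is a minimal sketch with hypothetical helper names (`make_learned_position_table`, `add_positions`). The Gaussian initialization with standard deviation 0.02 is an assumption for illustration, and no training is performed.

```python
import random

def make_learned_position_table(
    max_positions: int, d_model: int, seed: int = 0
) -> list[list[float]]:
    """Create one trainable row per position (randomly initialized here)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(d_model)]
        for _ in range(max_positions)
    ]

def add_positions(
    token_embeddings: list[list[float]], table: list[list[float]]
) -> list[list[float]]:
    """Add the row for each position; fail when no row exists."""
    if len(token_embeddings) > len(table):
        raise ValueError("sequence is longer than the learned position table")
    return [
        [t + p for t, p in zip(token, table[pos])]
        for pos, token in enumerate(token_embeddings)
    ]

table = make_learned_position_table(max_positions=4, d_model=2)
param_count = len(table) * len(table[0])

short = add_positions([[1.0, 0.0]] * 3, table)
try:
    add_positions([[1.0, 0.0]] * 5, table)
    too_long_rejected = False
except ValueError:
    too_long_rejected = True

print(param_count, len(short), too_long_rejected)
```

The table contributes `max_positions * d_model` parameters, and a five-token input has no fifth row to look up. A sinusoidal formula can still be evaluated at that position, although being computable says nothing about quality there.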
A common intuition is that concatenating the positional vector to the token embedding (creating a $2d_{\text{model}}$-dimensional vector) would perfectly preserve both pieces of information. However, concatenation would either force the whole transformer to run at width $2d_{\text{model}}$ or require an extra projection layer to get back to $d_{\text{model}}$. If you keep the model width at $2d_{\text{model}}$, square matrices like $W_Q$, $W_K$, and $W_V$ become roughly 4x larger. Addition keeps the hidden width unchanged and lets downstream learned projections use the combined representation.
Addition doesn't guarantee that content and position stay neatly separated. It keeps the hidden width fixed and gives later learned projections one combined representation to use. Whether that works well is empirical: Vaswani et al. report nearly identical results for their sinusoidal and learned-position variants.[1]
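The 4x figure is plain arithmetic on square projection matrices. The sketch below uses an illustrative width of 512 and a hypothetical helper `square_projection_params`. It counts weights only and ignores biases.

```python
def square_projection_params(width: int) -> int:
    """Weights in one square projection such as W_Q, W_K, or W_V."""
    return width * width

d_model = 512  # illustrative width, not taken from a specific model

added = square_projection_params(d_model)             # position added
concatenated = square_projection_params(2 * d_model)  # position concatenated

print(added, concatenated, concatenated // added)
```

This prints `262144 1048576 4`: doubling the width quadruples each square projection, which is the cost that addition avoids.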
| Limitation | Why it matters |
|---|---|
| Poor extrapolation | Quality often degrades for positions beyond the training length. |
| Implicit relative position | The model must learn relative offsets indirectly from absolute vectors. |
| Content-position mixing | Adding position vectors to embeddings mixes content and position in the same stream. |
RoPE (Rotary Position Embeddings) was introduced in RoFormer and is used in Llama-family models described in Meta's Llama 3 report.[4][5]
Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.
Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: a dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.
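For a 2D subspace at position $m$, apply rotation:

$$R_m \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

Take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position (further in the sequence = more rotation) and the frequency (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"
The frequency base follows the same pattern as sinusoidal encoding, but production RoPE usually applies it over the per-head rotary dimension, not the full model width:

$$\theta_i = 10000^{-2i/d_{\text{rot}}}, \qquad i = 0, 1, \ldots, d_{\text{rot}}/2 - 1$$

If every head dimension is rotated, $d_{\text{rot}} = d_k$ (the per-head dimension). Some architectures rotate only part of each head, so the config's rotary dimension is the value that matters.
When computing $q_m^\top k_n$, the rotation matrices combine:

$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n \, k = q^\top R_{n-m} \, k$$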
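The per-pair frequency schedule described above can be tabulated in a few lines. This is a minimal sketch with a hypothetical helper `rotary_frequencies`. The head size of 8 and the partial rotary dimension of 4 are illustrative values, not taken from any particular model config.

```python
def rotary_frequencies(rotary_dim: int, base: float = 10000.0) -> list[float]:
    """One frequency per 2D pair: base^(-2i / rotary_dim)."""
    assert rotary_dim % 2 == 0, "rotary dimension must be even"
    return [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

head_dim = 8  # illustrative per-head dimension

# Case 1: every head dimension is rotated, so rotary_dim == head_dim.
full = rotary_frequencies(rotary_dim=head_dim)

# Case 2: only the first 4 dimensions of each head are rotated.
partial = rotary_frequencies(rotary_dim=4)

print(len(full), [round(f, 4) for f in full])
print(len(partial), [round(f, 4) for f in partial])
```

The schedule is computed from the rotary dimension, so rotating only part of a head changes both the number of pairs and how quickly the frequencies fall off.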
Toy numerical example (2D subspace). Suppose a query vector pair $q = (0.8, 0.6)$ at position 5 with frequency $\theta = 0.5$ for simplicity, and a key pair $k = (0.7, 0.5)$ at position 2:

```python
import math

def rotate_2d(vec: list[float], pos: int, theta: float) -> list[float]:
    """Apply RoPE rotation to a 2D pair."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return [x * c - y * s, x * s + y * c]

q = [0.8, 0.6]  # original query pair
k = [0.7, 0.5]  # original key pair

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

theta = 0.5
# Same offset (3) at two different absolute positions.
score_a = dot(rotate_2d(q, 5, theta), rotate_2d(k, 2, theta))
score_b = dot(rotate_2d(q, 105, theta), rotate_2d(k, 102, theta))

print(round(score_a, 6), round(score_b, 6))
```

```text
0.040884 0.040884
```

Running this shows that the interaction depends only on the offset, exactly as the math predicts. This relative-position property is useful, but on its own it provides no proof that a model works beyond its trained context window.
For fixed query and key content, the attention score between a query at position $m$ and a key at position $n$ depends only on the signed relative offset $m - n$ between them, not on their absolute positions. Token 5 attending to token 3 produces the same positional structure as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures offset, not absolute address.
Picture a single 2D pair where both the query and the key start as the same unit vector, for example $(1, 0)$. Let the rotation frequency for this pair be $\theta$ radians.
Query at position $m$, key at position $n$: relative distance $m - n$. After rotation, the dot product equals $\cos((m - n)\theta)$.
Query at position $m + 100$, key at position $n + 100$: relative distance is still $m - n$. After rotation, the dot product is again $\cos((m - n)\theta)$.
The absolute positions changed by 100, but the positional part of the attention score stayed identical because only the distance between the tokens matters. If the distance were a somewhat larger $d$ with $d\theta < \pi$, this particular pair would score $\cos(d\theta)$, which is smaller. Avoid generalizing that single comparison into a monotonic distance penalty: cosine phases can rise again at larger offsets, and a real head sums many frequency pairs.
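Both points, shift invariance and the lack of a monotonic penalty, can be checked by summing several frequency pairs. The sketch below is a toy under stated assumptions: every pair holds the unit vector `(1, 0)`, the four frequencies are illustrative values for an 8-dimensional head, and `head_score` is a hypothetical helper name.

```python
import math

def rotate_pair(
    vec: tuple[float, float], pos: int, theta: float
) -> tuple[float, float]:
    """Rotate one 2D pair by the angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def head_score(query_pos: int, key_pos: int, thetas: list[float]) -> float:
    """Sum the rotated dot products over all frequency pairs."""
    total = 0.0
    for theta in thetas:
        q = rotate_pair((1.0, 0.0), query_pos, theta)
        k = rotate_pair((1.0, 0.0), key_pos, theta)
        total += q[0] * k[0] + q[1] * k[1]
    return total

thetas = [1.0, 0.1, 0.01, 0.001]  # illustrative frequencies

near = head_score(5, 2, thetas)         # offset 3
shifted = head_score(105, 102, thetas)  # offset 3, both moved by 100
farther = head_score(8, 2, thetas)      # offset 6

print(abs(near - shifted) < 1e-9)  # same offset, same score
print(farther > near)              # larger offset, yet a larger score
```

Both lines print `True`. In this toy, the offset-6 score exceeds the offset-3 score because the fastest pair has nearly completed a full turn, which is exactly the kind of rebound that rules out reading RoPE as a built-in distance penalty.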
The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.
```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,  # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,  # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) -> x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)

# --- Verify the relative-distance property on tiny tensors ---
d_k = 4
freqs = precompute_freqs_cis(dim=d_k, max_seq_len=4)
# q and k are the same vector [1, 0, 1, 0] placed at different positions
q = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])  # (1, 1, 1, 4) at pos 0
k_far = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])  # same vector at pos 2
q_rot, _ = apply_rotary_emb(q, k_far, freqs[0:1])
_, k_rot = apply_rotary_emb(q, k_far, freqs[2:3])
print("Dot product at distance 2:", (q_rot * k_rot).sum().item())

# Now shift both by +1: q at pos 1, k at pos 3
q2 = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])
k2 = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])
q_rot2, _ = apply_rotary_emb(q2, k2, freqs[1:2])
_, k_rot2 = apply_rotary_emb(q2, k2, freqs[3:4])
print("Dot product at distance 2 (shifted):", (q_rot2 * k_rot2).sum().item())
# Both prints agree up to float32 rounding, because only relative distance matters.
```

```text
Dot product at distance 2: 0.5836532115936279
Dot product at distance 2 (shifted): 0.5836531519889832
```

| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Where position enters | Input embedding sum | Input embedding sum | Q/K rotation |
| Direct RoPE transform on V | Not applicable | Not applicable | None |
| Extra parameters | 0 | (max positions) × $d_{\text{model}}$ | 0 |
RoPE directly rotates Q and K, not V. This places its relative-position operation in attention routing rather than applying the same rotation to aggregated values. In deeper layers, V still comes from hidden states that already include context mixed by earlier position-aware attention.
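RoPE gives attention a structured function of relative offset. Farther tokens aren't guaranteed to receive smaller positional contributions. For a single pair of identical unit vectors, with the query at position $m$, the key at position $n$, and pair frequency $\theta$, the positional score is:

$$\cos\big((m - n)\,\theta\big)$$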
This quantity oscillates as distance changes. A full head combines multiple frequencies and content-dependent coefficients, so its score can also rise or fall across offsets.
```python
import math

theta = 0.5
distances = [1, 2, 4, 6, 8]
scores = [math.cos(distance * theta) for distance in distances]
print(distances)
print([round(score, 3) for score in scores])
print(scores[-1] > scores[-2])
```

```text
[1, 2, 4, 6, 8]
[0.878, 0.54, -0.416, -0.99, -0.654]
True
```

The RoFormer paper analyzes a long-term decay bound aggregated over RoPE frequencies, and Men et al. (2024) study how the RoPE base affects long-context retrieval behavior.[4][7] Those analyses motivate careful base selection and evaluation. They remain distinct from ALiBi's explicit linear distance penalty.
Important caution: simply increasing the RoPE base can give "superficial" long-context capability in the evaluations reported by Men et al.: perplexity may look good while retrieval accuracy degrades.[7] A changed RoPE configuration therefore needs retrieval and task evaluation at its deployed length.
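One way to see what a base change does geometrically is to look at the slowest-rotating pair. The sketch below assumes the standard schedule `base^(-2i / rotary_dim)`, a hypothetical helper `longest_wavelength`, and an illustrative rotary dimension of 128. It only measures geometry, not model quality.

```python
import math

def longest_wavelength(base: float, rotary_dim: int) -> float:
    """Wavelength in tokens of the slowest pair (i = rotary_dim / 2 - 1)."""
    slowest_frequency = base ** (-(rotary_dim - 2) / rotary_dim)
    return 2 * math.pi / slowest_frequency

rotary_dim = 128  # illustrative per-head rotary dimension

for base in [10_000.0, 100_000.0, 1_000_000.0]:
    # A larger base stretches the slowest pair over more tokens.
    print(int(base), round(longest_wavelength(base, rotary_dim)))
```

A longer wavelength means that pair does not wrap around within the window, but that is a statement about angles only. Per the caution above, usable context length still has to be established with retrieval and task evaluations.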
ALiBi (Attention with Linear Biases) doesn't modify embeddings at all. Instead, it adds a head-specific linear bias directly to attention scores:

$$\text{score}_h(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h\,(i - j), \qquad j \le i$$

Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.
This is the normal attention formula with one modification: before applying softmax, we add a negative bias that gets more negative the farther back a key token sits from the current query. In a causal decoder, the current token gets zero penalty, one token back gets $-m_h$, two tokens back gets $-2m_h$, and so on. A steep slope makes the head short-sighted. A gentle slope lets it see further.
Here, $m_h$ is the slope for head $h$. Future positions ($j > i$) still get removed by the causal mask.
For $n$ heads, the paper uses a fixed geometric schedule. For power-of-two head counts, this reduces to a clean sequence like $2^{-8/n}, 2^{-16/n}, \ldots, 2^{-8}$.
For 8 heads ($n = 8$): $\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots, \tfrac{1}{256}$.
| Head | Slope | Effect |
|---|---|---|
| Head 1 | $\tfrac{1}{2}$ | Strong local attention |
| Head 4 | $\tfrac{1}{16}$ | Moderate range |
| Head 8 | $\tfrac{1}{256}$ | Very long-range |
Different heads get different slopes through a fixed schedule, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope). For non-power-of-two head counts, the reference implementation uses a recursive workaround so slopes remain well spread.[6]
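For power-of-two head counts, the geometric schedule fits in a few lines of plain Python. This is a minimal sketch with a hypothetical helper name, `alibi_slopes_power_of_two`. It covers only the power-of-two case and does not reproduce the reference workaround for other head counts.

```python
def alibi_slopes_power_of_two(n_heads: int) -> list[float]:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    # Bit trick: a positive power of two has exactly one bit set.
    assert n_heads > 0 and n_heads & (n_heads - 1) == 0
    start = 2 ** (-8 / n_heads)
    return [start ** (head + 1) for head in range(n_heads)]

slopes = alibi_slopes_power_of_two(8)

# Heads are 1-indexed in the table above, 0-indexed in the list.
print(slopes[0], slopes[3], slopes[7])
```

This prints `0.5 0.0625 0.00390625`, that is, 1/2 for head 1, 1/16 for head 4, and 1/256 for head 8.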
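Picture the token "left" at position 3 attending back to earlier words in "The package left the warehouse." The raw dot-product score for "package" at position 1 is 2.5. Head 4 has slope $m_4 = \tfrac{1}{16}$. The distance is $3 - 1 = 2$, so the penalty is $2 \times \tfrac{1}{16} = 0.125$, and the score entering softmax is $2.5 - 0.125 = 2.375$.
For "The" at position 0, the distance is 3, so the penalty is $3 \times \tfrac{1}{16} = 0.1875$, and the score entering softmax is its raw score minus $0.1875$. The farther token is slightly quieter, but not silenced. With Head 1 ($m_1 = \tfrac{1}{2}$), the same distances would produce penalties of $1.0$ and $1.5$, making the head far more myopic.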
```python
def alibi_score(raw_score: float, query_pos: int, key_pos: int, slope: float) -> float:
    # Distance counts how many tokens back the key sits from the query.
    distance = max(query_pos - key_pos, 0)
    return raw_score - slope * distance

raw = 4.0
query_pos = 10
key_pos = 6

head_1 = alibi_score(raw, query_pos, key_pos, slope=1 / 2)
head_8 = alibi_score(raw, query_pos, key_pos, slope=1 / 256)

print(head_1, round(head_8, 3))
```

```text
2.0 3.984
```

The following snippet shows how to construct the ALiBi bias tensor for a causal decoder. It follows the reference slope helper, including non-power-of-two head counts, and bakes the causal mask into the returned bias matrix.
```python
import math
import torch

def get_alibi_slopes_power_of_two(n_heads: int) -> torch.Tensor:
    start = 2 ** (-2 ** -(math.log2(n_heads) - 3))
    ratio = start
    return torch.tensor(
        [start * (ratio ** head) for head in range(n_heads)],
        dtype=torch.float32,
    )

def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    assert n_heads > 0
    if math.log2(n_heads).is_integer():
        return get_alibi_slopes_power_of_two(n_heads)

    closest_power_of_two = 2 ** math.floor(math.log2(n_heads))
    return torch.cat(
        [
            get_alibi_slopes_power_of_two(closest_power_of_two),
            get_alibi_slopes(2 * closest_power_of_two)[0::2][
                : n_heads - closest_power_of_two
            ],
        ]
    )

def build_alibi_bias(
    n_heads: int,
    seq_len: int,
) -> torch.Tensor:
    """Build ALiBi logits bias for a causal decoder."""
    slopes = get_alibi_slopes(n_heads)

    q_pos = torch.arange(seq_len)[:, None]
    k_pos = torch.arange(seq_len)[None, :]

    # Causal distance: current token = 0, one step back = 1, etc.
    relative = (q_pos - k_pos).clamp(min=0).to(torch.float32)
    bias = -slopes[:, None, None] * relative

    causal_mask = k_pos > q_pos
    bias = bias.masked_fill(causal_mask.unsqueeze(0), float("-inf"))

    return bias  # Add directly to attention logits before softmax

# --- Inspect a tiny bias matrix for 4 heads and 4 tokens ---
bias = build_alibi_bias(n_heads=4, seq_len=4)
print(bias.shape)
print(bias[0, 3, :])
print(bias[-1, 3, :])
# Notice how distant tokens get more negative values, and the last head is the most forgiving.
```

```text
torch.Size([4, 4, 4])
tensor([-0.7500, -0.5000, -0.2500, -0.0000])
tensor([-0.0117, -0.0078, -0.0039, -0.0000])
```

In one original-paper experiment, a 1.3 billion parameter ALiBi model is trained at length 1024 and evaluated at length 2048, matching the perplexity of a sinusoidal model trained at length 2048 while using less training time and memory in that setup.[6] The mechanism has several useful properties:
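ALiBi interacts elegantly with standard softmax optimizations. ALiBi biases can produce very negative values for distant tokens. In practice, softmax is computed using the max-shift trick to prevent numerical overflow:

$$\text{softmax}(z)_j = \frac{\exp\left(z_j - \max_k z_k\right)}{\sum_l \exp\left(z_l - \max_k z_k\right)}, \qquad z_j = s_j + b_j$$

where $s_j$ is the raw attention score and $b_j$ is the ALiBi distance bias for token $j$.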
When computing this in finite precision, very negative ALiBi biases can make distant tokens' shifted exponent values extremely small. Conceptually ALiBi is still a soft distance-dependent recency bias, not a hard window. The causal mask remains separate, and every distant past weight remains mathematically possible.
```python
import math

def softmax(logits: list[float]) -> list[float]:
    # Max-shift trick: subtract the largest logit before exponentiating.
    maximum = max(logits)
    shifted = [math.exp(value - maximum) for value in logits]
    total = sum(shifted)
    return [value / total for value in shifted]

raw_scores = [2.0, 2.0, 2.0]
distances = [2, 1, 0]
slope = 0.5
biased_scores = [
    score - slope * distance
    for score, distance in zip(raw_scores, distances)
]
weights = softmax(biased_scores)

print([round(weight, 3) for weight in weights])
print(weights[0] > 0)
print("future token uses causal mask separately")
```

```text
[0.186, 0.307, 0.506]
True
future token uses causal mask separately
```

This is a critical practical challenge in production systems:
"Your model was trained on 4K tokens. How does it perform at 32K?"
| Method | Mechanism beyond training length | Reported evidence and required check |
|---|---|---|
| Sinusoidal | Formula can compute unseen positions | Press et al. report sharp perplexity degradation for their WikiText-103 setup; evaluate your task. |
| Learned absolute | No row exists past the configured maximum length without changing the table | Extend or replace the table, then train and evaluate. |
| RoPE (unchanged) | Relative rotations continue at unseen offsets | Chen et al. report poor direct extension for their Llama experiments; evaluate retrieval and task quality. |
| ALiBi | Same linear logit-bias rule applies at new distances | Press et al. report 1024 to 2048 extrapolation in their setup. |
| RoPE with PI or YaRN | Modify the rotary geometry, typically with adaptation | Chen et al. report PI up to 32K; Peng et al. report YaRN up to 128K in their evaluated models. |
These are results from different experiments, not a universal ranking. Context length is a model-and-evaluation claim, not a property proved by selecting a positional formula.[6][8][9]
When adapting a RoPE-based model for longer contexts, these methods illustrate different interventions:
Position Interpolation (PI): scale position indices down to fit within the training range.[8]
If you trained on 4K tokens but want to handle 16K, you shrink all position indices by 4x so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. Critically, the rotation frequencies theta themselves stay unchanged. Only the position index that multiplies theta gets rescaled. Chen et al. report extensions to 32K with up to 1,000 fine-tuning steps in their Llama experiments.[8]
```python
train_length = 4096
target_length = 16384
scale = train_length / target_length

for target_position in [0, 4096, 8192, 16383]:
    # Only the position index is rescaled; the frequencies stay unchanged.
    effective_position = target_position * scale
    print(target_position, round(effective_position, 2))
```

```text
0 0.0
4096 1024.0
8192 2048.0
16383 4095.75
```

NTK-aware scaling: scale the frequency base instead of positions, trading off between low- and high-frequency components.[9]

$$b' = b \cdot s^{d/(d-2)}$$

where $b$ is the original base, $s$ is the scale factor, and $d$ is the per-head RoPE dimension (the head dimension, not the full hidden size).
Instead of compressing every position, this formulation changes the base so lower-frequency bands are stretched more than high-frequency bands. Peng et al. document this as earlier NTK-aware work while developing YaRN; any quality claim remains dependent on model, target length, adaptation data, and evaluation.[9]
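The uneven stretching can be made concrete with a small calculation. The sketch below assumes the commonly cited NTK-aware form `b' = b * s^(d / (d - 2))` and uses hypothetical helper names (`rope_frequencies`, `ntk_aware_base`). The head dimension of 64 and scale of 4 are illustrative.

```python
def rope_frequencies(base: float, dim: int) -> list[float]:
    """Per-pair RoPE frequencies: base^(-2i / dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_aware_base(base: float, scale: float, dim: int) -> float:
    """Rescaled base under the assumed NTK-aware rule."""
    return base * scale ** (dim / (dim - 2))

dim = 64        # illustrative per-head RoPE dimension
base = 10000.0
scale = 4.0     # target length / training length

original = rope_frequencies(base, dim)
scaled = rope_frequencies(ntk_aware_base(base, scale, dim), dim)
ratios = [new / old for new, old in zip(scaled, original)]

# First pair (highest frequency) vs last pair (lowest frequency).
print(round(ratios[0], 4), round(ratios[-1], 4))
```

This prints `1.0 0.25`. The fastest pair is left untouched while the slowest pair is slowed by exactly the scale factor, with the bands in between interpolated smoothly. Position Interpolation, by contrast, effectively multiplies every pair's angle by the same `1 / scale`.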
YaRN combines NTK-by-parts interpolation with attention temperature adjustment.[9] Peng et al. report RoPE-based Llama-family variants evaluated at contexts up to 128K.
The paper's recommended temperature rule for its Llama experiments is $\sqrt{1/t} = 0.1 \ln(s) + 1$, where $s$ is the context scale factor and $t$ is the attention temperature. NTK-by-parts interpolates frequency bands differently; temperature adjustment changes attention scale. Both choices remain hyperparameters to validate on the deployed model.
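To get a feel for the size of the temperature adjustment, the rule can be evaluated for a few scale factors. This sketch assumes the rule in the form `sqrt(1/t) = 0.1 * ln(s) + 1` and a hypothetical helper name. It computes only the scalar, not the full YaRN procedure.

```python
import math

def yarn_inverse_sqrt_temperature(scale: float) -> float:
    """Return sqrt(1/t) under the assumed rule 0.1 * ln(s) + 1."""
    return 0.1 * math.log(scale) + 1.0

for scale in [1.0, 4.0, 16.0, 32.0]:
    factor = yarn_inverse_sqrt_temperature(scale)
    # With no extension (scale 1) the factor is exactly 1.0.
    print(scale, round(factor, 4))
```

The factor grows only logarithmically with the scale factor, so even large extensions change the attention scale modestly. The right value for a given checkpoint is still something to validate rather than assume.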
| Goal | Candidate mechanism | Validation obligation |
|---|---|---|
| Reproduce original Transformer behavior | Sinusoidal absolute encoding | Confirm quality at trained lengths and any extrapolated lengths used. |
| Use learned finite position slots | Learned absolute embeddings | Define how new positions are initialized and adapted. |
| Encode relative offset inside query-key scores | RoPE | Measure target-length retrieval and generation quality, especially after any scaling. |
| Add an explicit monotonic distance bias | ALiBi | Measure whether its recency bias retains evidence needed by long tasks. |
| Extend an existing RoPE checkpoint | PI or YaRN-style adaptation | Validate short-context regression and target-context behavior after adaptation. |
The implementation choice decides where position enters attention. It still leaves the trained checkpoint to be evaluated on the exact lengths and tasks being served.
| Tier | You should be able to defend |
|---|---|
| Foundational | Explain permutation equivariance as the core reason transformers need positional encoding. |
| Intermediate | Derive the sinusoidal positional encoding formula and explain the multi-frequency intuition. |
| Advanced | Show how RoPE rotates Q/K in 2D subspaces so dot products depend on relative distance. |
| Advanced | Explain why RoPE rotates Q and K without directly rotating V. |
| Advanced | Describe ALiBi as a fixed attention bias linear in distance with geometric head slopes. |
| Advanced | Compare extrapolation behavior for learned absolute positions, sinusoidal encodings, ALiBi, vanilla RoPE, and scaled RoPE. |
| Advanced | Explain how PI, NTK-aware scaling, YaRN, and RoPE base changes affect context extension. |
Strong: can derive why RoPE makes the attention score depend on relative offset, can explain why V stays unrotated, and can pick a context-extension method plus validation plan for a real deployment.
Good: can compare sinusoidal, learned, RoPE, and ALiBi correctly, but still needs help connecting the math to serving tradeoffs.
Needs work: still mixes up where position enters the model, or treats a longer configured context window as proof that long-context retrieval will work.
Without position encodings, swapping inputs just swaps outputs. Sinusoidal encodings add multi-frequency position fingerprints to token embeddings. RoPE rotates Q and K so the attention dot product can represent relative offset without directly rotating V. ALiBi adds distance penalties directly to attention logits. Long-context claims need validation at the target length, because a method can be mathematically defined for a longer sequence and still fail retrieval behavior.
Order still looks like magic: Symptom: you say attention already knows sequence order. Cause: you skipped the permutation-equivariance step. Fix: start from the fact that self-attention is content-based unless a position signal is injected.
All methods sound the same: Symptom: sinusoidal, learned, RoPE, and ALiBi blur into one idea. Cause: you aren't tracking where position enters. Fix: separate embeddings, Q/K geometry, and attention-logit bias explicitly.
RoPE rotates the wrong tensor: Symptom: you describe applying RoPE rotation to embeddings or V. Cause: you lost the routing-versus-aggregation distinction. Fix: rotate Q and K after projection; leave V without direct RoPE rotation.
Long context looks solved on paper: Symptom: perplexity improves but retrieval still fails. Cause: you treated mathematical validity as behavioral proof. Fix: run passkey, retrieval, summarization, and target-task evals at deployed length.
Bigger base must be enough: Symptom: config advertises a larger window but distant facts still disappear. Cause: base change alone doesn't teach the model to use long-range evidence. Fix: pair geometry changes with a rescaling method, long-context data, and task-level checks.
Concatenate by default: Symptom: width and projection cost balloon without a clear gain. Cause: you tried to preserve content and position separately at all costs. Fix: add position vectors unless you have a strong reason and budget for a wider architecture.
Concatenation would either force the transformer to run at width $2d_{\text{model}}$ or require an extra projection layer to get back to $d_{\text{model}}$. If you keep the model width at $2d_{\text{model}}$, square projection matrices like $W_Q$, $W_K$, and $W_V$ become roughly 4x larger. Addition keeps the width constant while learned projections operate on the resulting combined representation.
At position $m$, the rotated query is $q_m = R_m q$. At position $n$, the rotated key is $k_n = R_n k$.
After rotation, the dot product depends only on the relative offset $m - n$, not on absolute positions. This is why RoPE naturally encodes relative position information.
Men et al. derive and evaluate a relationship between RoPE base and effective long-context retrieval, while also showing that perplexity alone can hide retrieval failures.[7] Base size is therefore a parameter to choose and validate alongside adaptation data and task-level long-context evaluations, not a standalone certificate of usable context length.
[1] Attention Is All You Need. Vaswani, A., et al. · 2017
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin, J., et al. · 2019 · NAACL 2019
[3] GPT-2 Source Implementation. OpenAI · 2019
[4] RoFormer: Enhanced Transformer with Rotary Position Embedding. Su, J., et al. · 2021
[5] The Llama 3 Herd of Models. Grattafiori, A., Dubey, A., et al. · 2024 · arXiv preprint
[6] Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. Press, O., Smith, N. A., & Lewis, M. · 2022 · ICLR 2022
[7] Base of RoPE Bounds Context Length. Men, X., et al. · 2024 · NeurIPS 2024
[8] Extending Context Window of Large Language Models via Positional Interpolation. Chen, S., et al. · 2023
[9] YaRN: Efficient Context Window Extension of Large Language Models. Peng, B., et al. · 2023