

Positional Encoding: RoPE & ALiBi

Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.


Consider these two sentences: "The dog bit the man" and "The man bit the dog." They use the same words but have completely different meanings because word order matters. Humans get this instantly, but transformers have a surprising blind spot: the core attention mechanism is "order-blind" and treats both sentences identically. To a transformer without position information, "The dog bit the man" and "man the bit dog The" look the same.

This is called permutation equivariance, and it means we need a way to tell the model where each word appears in a sequence. That's exactly what positional encoding does, and the choice of method directly determines how well a Large Language Model (LLM) handles long documents, conversations, and code. If you haven't read about the attention mechanism yet, start there. It's the foundation this article builds on.

šŸ’” Key insight: Positional encoding injects word-order information into an architecture that would otherwise be "order-blind." In this article, you'll learn why position info is needed, how the original approach works, and why modern models use a clever rotation trick called RoPE.


The permutation equivariance problem

Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Reading the formula

Compare every query with all keys ($QK^T$), scale by $\sqrt{d_k}$ (square root of the key vector dimension) for stable softmax, convert scores to attention weights, then mix the value vectors $V$ using those weights.

This operation is set-based: if you shuffle the input token order, the output is simply shuffled in the same way. The attention weights $QK^T$ depend on content, not position.[1]

Without positional information, transformers can't distinguish between different word orders, which can completely alter the meaning of a sentence. For example, these two sequences contain identical tokens but have very different semantics. When processed by a position-blind transformer, they produce identical attention patterns:

```text
"The cat sat on the mat"
"mat the on sat cat The"
```

But meaning depends entirely on word order! We need a mechanism to inject position information.
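
To make the order-blindness concrete, here's a tiny illustrative sketch (not from the article) of a toy self-attention layer: shuffling the input rows merely shuffles the output rows the same way.

```python
import torch
import torch.nn.functional as F

def toy_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Minimal self-attention with Q = K = V = x and no positional signal."""
    scores = x @ x.T / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(6, 16)      # 6 tokens, 16-dim embeddings
perm = torch.randperm(6)    # a random reordering of the tokens

out = toy_self_attention(x)
out_perm = toy_self_attention(x[perm])

# Permutation equivariance: outputs are shuffled exactly like the inputs
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True
```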


Three families of solutions

| Approach | How it works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Sinusoidal (Vaswani), learned (GPT-2, BERT) |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Attention scores | RoPE (Llama, Gemma, Qwen) |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi (BLOOM, MPT) |

Positional encoding methods compared: sinusoidal (fixed), learned (trained), RoPE (rotary), and ALiBi (linear bias), with their extrapolation properties.

Method 1: Sinusoidal positional encoding

The original approach[1]

šŸ’” Key insight: Think of positional encoding like reading a clock. A clock uses multiple hands moving at different speeds (seconds, minutes, hours) to uniquely identify any moment in a 12-hour period. Sinusoidal positional encoding does the exact same thing for words in a sequence: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds), while high dimensions oscillate slowly (like hours).

For position $\text{pos}$ and dimension index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(\text{pos},\,2i)} = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(\text{pos},\,2i+1)} = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

Reading the formula

Each dimension pair $(2i, 2i+1)$ oscillates at its own frequency $10000^{-2i/d_{\text{model}}}$: the even dimension carries the sine, the odd dimension the cosine. Stacking $d_{\text{model}}/2$ such frequencies gives every position a unique fingerprint.

Why sinusoids?

Because $PE_{\text{pos}+k}$ can be expressed as a linear function of $PE_{\text{pos}}$:

$$PE_{\text{pos}+k} = R_k \cdot PE_{\text{pos}}$$

where $R_k$ is a rotation matrix that depends only on the offset $k$, not on the absolute position. This means the model can, in principle, learn to attend based on relative position.[1]

Here's how we can generate these multi-frequency sinusoidal positional encodings using PyTorch. The function takes the maximum sequence length and embedding dimension as inputs to compute the required frequencies. It returns a tensor of positional encodings that can be added directly to the token embeddings before they enter the attention layers.

```python
import torch
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> torch.Tensor:
    """Generate sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

    # Compute the frequency for each dimension pair
    # Low dimensions → high frequency (local position)
    # High dimensions → low frequency (coarse position)
    # Īø_i = 10000^(-2i/d_model), implemented via exp(-2i/d_model * ln(10000))
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

    return pe  # (max_len, d_model)

# Usage: simply ADD to token embeddings
# x = token_embeddings + pe[:seq_len]
```
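
The "Why sinusoids?" property is easy to verify numerically. Here's a quick sketch (the position, offset, and dimension pair are arbitrary choices) confirming that within each frequency pair, $PE_{\text{pos}+k}$ is a fixed rotation of $PE_{\text{pos}}$ determined only by $k$:

```python
import math
import torch

pe = sinusoidal_positional_encoding(max_len=256, d_model=64)

pos, k, i = 10, 7, 4                      # inspect pair (2i, 2i+1) = dims (8, 9)
theta = 10000.0 ** (-2 * i / 64)          # this pair's frequency
c, s = math.cos(k * theta), math.sin(k * theta)
R_k = torch.tensor([[c, s], [-s, c]])     # rotation fixed by the offset k alone

pair_now = pe[pos, 2 * i : 2 * i + 2]       # (sin(posĀ·Īø), cos(posĀ·Īø))
pair_later = pe[pos + k, 2 * i : 2 * i + 2]
print(torch.allclose(R_k @ pair_now, pair_later, atol=1e-5))  # True
```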

Frequency interpretation

| Dimension pair | Wavelength ($d_{\text{model}} = 512$) | What it captures |
|---|---|---|
| $(0, 1)$ | $\sim 6$ tokens | Very fine-grained local position |
| $(64, 65)$ | $\sim 20$ tokens | Medium-scale structure |
| $(510, 511)$ | $\sim 60{,}000$ tokens | Very coarse global position |

The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.
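
Since the wavelength of pair $(2i, 2i+1)$ is $2\pi \cdot 10000^{2i/d_{\text{model}}}$, the table above can be reproduced with a couple of lines (a sketch; the printed values are approximate):

```python
import math

def pair_wavelength(even_dim: int, d_model: int = 512) -> float:
    """Wavelength in tokens of the sinusoid pair starting at even dimension 2i."""
    return 2 * math.pi * 10000.0 ** (even_dim / d_model)

for dim in (0, 64, 510):
    print(dim, round(pair_wavelength(dim)))  # ā‰ˆ 6, ā‰ˆ 20, ā‰ˆ 61,000 tokens
```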

Learned vs sinusoidal absolute encodings

Instead of fixed sinusoids, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 use learned position embeddings: a trainable matrix $P \in \mathbb{R}^{L_{\max} \times d}$ where each position gets its own optimizable vector.

| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{\max} \times d$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | Beyond $L_{\max}$ (in theory) | āŒ Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |

šŸ’” Key insight: Vaswani et al. (2017) tested both and found "nearly identical results."[1] Both have been superseded by relative position methods for modern LLMs.

Why add instead of concatenate?

šŸ’” Key insight: Why do transformers add positional encodings to embeddings instead of concatenating them?

A common intuition is that concatenating the positional vector to the token embedding (creating a $2d$ vector) would perfectly preserve both pieces of information. However, concatenation would increase the embedding dimension from $d$ to $2d$, doubling the input size (and parameter count) of every downstream projection, and roughly quadrupling the cost of any layer whose input and output both grow to $2d$. Addition elegantly keeps the dimension constant.

But how does the model avoid confusing content with position when they are summed into the exact same values? Empirically, addition works well because the embedding space is extremely high-dimensional (typically $d = 1024$ or more). In high-dimensional spaces, random vectors are nearly orthogonal.

During training, the learned projections $W_Q, W_K, W_V$ naturally separate the content and position signals into different, nearly orthogonal subspaces of the same $d$-dimensional space. The model learns to route the position information primarily into the attention score calculations ($Q$ and $K$), while the value vectors ($V$) preserve the semantic content.
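
A two-line experiment (illustrative, not from the article) makes the near-orthogonality point tangible:

```python
import torch

torch.manual_seed(0)
d = 1024
content, position = torch.randn(d), torch.randn(d)

# Random high-dimensional vectors are nearly orthogonal: |cos| ā‰ˆ 1/√d ā‰ˆ 0.03
cos = torch.nn.functional.cosine_similarity(content, position, dim=0)
print(cos.item())  # close to 0
```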

Limitations of absolute methods

  1. Poor extrapolation: Performance degrades for positions beyond the training length
  2. Implicit relative position: The model must learn to derive relative positions from absolute ones, an indirect signal
  3. Content-position mixing: Adding PE to embeddings mixes content and position in the same vector

Method 2: Rotary Position Embeddings (RoPE)

The dominant modern approach

RoPE (Rotary Position Embeddings) is used by virtually every modern open LLM: Llama, Gemma, Qwen, Mistral, Phi, DeepSeek.[2]

Core insight

Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.

RoPE rotation visualization: pairs of embedding dimensions are rotated by angles proportional to their position, encoding relative positions through rotation.

šŸ’” Key insight: Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: a dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.

The math

For a 2D subspace $(q_{2i}, q_{2i+1})$ at position $m$, apply the rotation:

$$\begin{pmatrix} q_{2i}^{(m)} \\ q_{2i+1}^{(m)} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

Reading the formula

Take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position $m$ (further in the sequence = more rotation) and the frequency $\theta_i$ (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"

The frequencies follow $\theta_i = 10000^{-2i/d_{\text{model}}}$ (the same schedule as sinusoidal encoding).

The key property

When computing $q^{(m)} \cdot k^{(n)}$, the rotation matrices combine into a single rotation by the relative offset:

$$q^{(m)T} k^{(n)} = q^T R_\theta^{(n-m)}\, k$$

Why this matters

The attention score between a query at position $m$ and a key at position $n$ depends only on the gap $m - n$ between them, not on where they sit in the sequence. Token 5 attending to token 3 produces the same positional effect as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures distance, not address.


Implementation

The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.

```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) → x_2i + jĀ·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast over batch & heads
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```
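
To see the key property in action, here's a small check (a sketch reusing the helpers above; the positions are arbitrary) that the rotated dot product at positions (5, 3) matches the one at (1005, 1003):

```python
torch.manual_seed(0)
d_k = 64
q = torch.randn(1, 1, 1, d_k)  # (batch=1, seq_len=1, n_heads=1, d_k)
k = torch.randn(1, 1, 1, d_k)

freqs_cis = precompute_freqs_cis(d_k, max_seq_len=2048)

def rotate_at(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Rotate a single-token tensor as if it sat at sequence position pos."""
    out, _ = apply_rotary_emb(x, x, freqs_cis[pos : pos + 1])
    return out

near = (rotate_at(q, 5) * rotate_at(k, 3)).sum()
far = (rotate_at(q, 1005) * rotate_at(k, 1003)).sum()
print(torch.allclose(near, far, atol=1e-2))  # True: only the offset matters
```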

Why RoPE wins over absolute methods

| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Content-position mixing | Yes (additive) | Yes (additive) | No (multiplicative) |
| Value vectors | Modified by position | Modified by position | Unmodified āœ… |
| Extra parameters | 0 | $L_{\max} \times d$ | 0 |

šŸ’” Key insight: RoPE only modifies Q and K. The value vectors remain position-free. This is elegant because V carries the actual content that gets aggregated; only the routing (which tokens to attend to) should be position-aware.

Long-range decay property

An important theoretical result: RoPE introduces a natural long-range decay in attention scores. The expected dot product between "similar" tokens decays as their distance increases, formalized as:[3]

$$S_{m-n} = \sum_i \cos\big((m-n)\theta_i\big)$$

where $m$ and $n$ are token positions, $\theta_i$ is the rotation frequency for dimension pair $i$, and $S_{m-n}$ is the attention score as a function of relative distance.

As $|m - n|$ grows, this sum decays, creating an inductive bias favoring local dependencies. This is typically desirable (nearby tokens are usually more relevant), but it means vanilla RoPE struggles at very long contexts. The base frequency $\theta_{\text{base}}$ controls this decay rate. Men et al. (2024) proved that it places a theoretical lower bound on the achievable context length: to support context length $L$, you need $\theta_{\text{base}} \gtrsim 10000 \cdot (L/4096)^2$.[3] This means doubling your target context requires roughly quadrupling the base frequency.
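
The decay is easy to inspect with the formula above. A short sketch (assuming a 128-dim rotary head and base 10,000) computes $S_{m-n}$ over a range of distances:

```python
import torch

def rope_decay_curve(base: float = 10000.0, dim: int = 128, max_dist: int = 4096) -> torch.Tensor:
    """S[d] = Ī£_i cos(d Ā· Īø_i) for relative distances d = 0 .. max_dist - 1."""
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)  # Īø_i per pair
    d = torch.arange(max_dist).float()
    return torch.cos(d.unsqueeze(1) * theta).sum(dim=1)

S = rope_decay_curve()
print(S[0].item(), S[16].item(), S[256].item(), S[4095].item())
# S[0] = dim/2 is the maximum; the envelope shrinks as distance grows
```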

| $\theta_{\text{base}}$ | Estimated safe context | Used by |
|---|---|---|
| 10,000 | ~4K | Original RoPE, Gemma |
| 100,000 | ~32K | Some custom models |
| 500,000 | ~64-128K | Llama 3.1 |

āš ļø Warning: Simply increasing the base frequency gives "superficial" long-context capability: perplexity looks good, but retrieval accuracy degrades. This is why RoPE scaling methods (PI, NTK-aware scaling, YaRN) are needed for genuine long-context support.


Method 3: Attention with Linear Biases (ALiBi)

The simplest approach[4]

ALiBi (Attention with Linear Biases) doesn't modify embeddings at all. Instead, it subtracts from the attention scores a linear bias that grows with token distance:

šŸ’” Key insight: Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works exactly like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} - m \cdot |i - j|\right) V$$

Reading the formula

This is the normal attention formula with one modification: before applying softmax, we subtract a distance penalty $m \cdot |i - j|$. Since this term grows with distance, nearby tokens keep their original scores while distant tokens get pushed down. The slope $m$ controls how aggressively: a steep slope makes the model short-sighted, a gentle slope lets it see further.

where $m$ is a head-specific slope and $|i - j|$ is the distance between tokens.

Head slopes

For $n$ heads, the slopes form a geometric sequence with first term and ratio $2^{-8/n}$, giving $m_k = 2^{-8k/n}$ for $k = 1, \ldots, n$.

For 8 heads ($n = 8$): $m = \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}, \frac{1}{64}, \frac{1}{128}, \frac{1}{256}$.

| Head | Slope $m$ | Effect |
|---|---|---|
| Head 1 | $1/2$ | Strong local attention |
| Head 4 | $1/16$ | Moderate range |
| Head 8 | $1/256$ | Very long-range |

Different heads get different slopes through a fixed geometric sequence, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope).

The following snippet shows how to construct the ALiBi bias tensor. It takes the number of attention heads and maximum sequence length as inputs, computing a geometric sequence of slopes and a distance matrix. The output is a bias matrix that's added directly to the attention scores before the softmax step.

```python
import torch

def build_alibi_bias(
    n_heads: int,
    max_len: int,
) -> torch.Tensor:
    """Build ALiBi attention bias matrix."""
    # Geometric slopes: 2^(-8/n), 2^(-16/n), ... for each head
    slopes = torch.pow(2, -8 * torch.arange(1, n_heads + 1) / n_heads)

    # Distance matrix: -|i - j|
    positions = torch.arange(max_len)
    distance = -(positions.unsqueeze(0) - positions.unsqueeze(1)).abs()

    # slope Ɨ distance → (n_heads, max_len, max_len)
    alibi = slopes.unsqueeze(1).unsqueeze(1) * distance.unsqueeze(0)

    return alibi  # Add this to attention scores before softmax
```
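
And a minimal usage sketch (tensor names and shapes are illustrative) showing where the bias enters standard scaled dot-product attention:

```python
import torch
import torch.nn.functional as F

def attention_with_alibi(
    q: torch.Tensor,      # (batch, n_heads, seq_len, d_k)
    k: torch.Tensor,
    v: torch.Tensor,
    alibi: torch.Tensor,  # (n_heads, max_len, max_len) from build_alibi_bias
) -> torch.Tensor:
    seq = q.size(2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores + alibi[:, :seq, :seq]  # broadcast the bias over the batch
    return F.softmax(scores, dim=-1) @ v

# bias = build_alibi_bias(n_heads=8, max_len=2048)
# out = attention_with_alibi(q, k, v, bias)
```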

ALiBi's superpower: zero-shot extrapolation

An ALiBi model trained on 1K tokens was tested on 2K-11K tokens without quality degradation.[4] Why?

  • Linear biases are monotonic: distant tokens always get penalized
  • The penalty is unbounded: truly distant tokens get effectively $-\infty$ scores
  • No learned position parameters → nothing to extrapolate
  • The fixed slopes create a natural "soft sliding window" without explicit masking

Softmax numerical stability with ALiBi

ALiBi interacts elegantly with standard softmax optimizations. Its biases can produce very negative values for distant tokens. In practice, softmax is computed using the max-shift trick to prevent numerical overflow:

$$\text{softmax}(s_i + b_i) = \frac{e^{s_i + b_i - \max(s + b)}}{\sum_j e^{s_j + b_j - \max(s + b)}}$$

where $s_i$ is the raw attention score and $b_i$ is the ALiBi distance bias for token $i$.

When computing this in hardware, the very large negative biases applied to distant tokens mean that their shifted exponentials ($e^{\text{large negative number}}$) become effectively zero. This causes them to drop out of the attention computation entirely.

Because of this property, ALiBi provides a natural, hardware-friendly "soft sliding window" without requiring explicit boolean masking or truncation of the sequence. The model naturally learns to ignore distant context simply because the math forces those attention weights to zero.
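
A compact sketch of the max-shift softmax (the slope and content scores are illustrative) shows distant tokens underflowing to exactly zero:

```python
import torch

def stable_softmax(logits: torch.Tensor) -> torch.Tensor:
    """Softmax with the max-shift trick to prevent overflow."""
    shifted = logits - logits.max(dim=-1, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=-1, keepdim=True)

scores = torch.zeros(4096)                 # identical content scores
bias = -0.5 * torch.arange(4096).float()   # ALiBi bias for slope m = 1/2
weights = stable_softmax(scores + bias)

print(weights[:3])   # nearby tokens dominate
print(weights[-1])   # tensor(0.): the distant token drops out entirely
```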


The extrapolation challenge

šŸ’” Key insight: Imagine learning to swim in a 25-meter pool (your training context length). Sinusoidal encoding is like a swimmer who panics the instant they're in a 50-meter pool: total failure. ALiBi is like a swimmer who learned proper stroke technique and can handle a 100-meter pool with no extra training. RoPE with scaling (YaRN) is like a swimmer who does a few warm-up laps in the bigger pool (fine-tuning) and then performs beautifully.

This is a critical practical challenge in production systems:

"Your model was trained on 4K tokens. How does it perform at 32K?"

| Method | Train length | Max usable length | Extrapolation |
|---|---|---|---|
| Sinusoidal | 4K | ~4K | āŒ Catastrophic |
| Learned | 4K | 4K exactly | āŒ No extrapolation |
| RoPE (vanilla) | 4K | ~6K | āš ļø Degrades gradually |
| ALiBi | 4K | ~11K | āœ… Graceful |
| RoPE + NTK | 4K | ~32K | āœ… With scaling |
| RoPE + YaRN | 4K | ~128K | āœ… Best scaling |

RoPE context extension techniques

When vanilla RoPE fails at long contexts, three scaling methods exist:

1. Position Interpolation (PI)

Scale position indices down to fit within training range[5]

$$f'(x, m) = f\!\left(x,\ m \cdot \frac{L}{L'}\right), \qquad \text{where } L = \text{train length},\ L' = \text{target length}$$

Reading the formula

If you trained on 4K tokens but want to handle 16K, you shrink all position indices by 4x so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. Critically, the rotation frequencies $\theta_i$ themselves stay unchanged; only the position index that multiplies $\theta_i$ gets rescaled. The model needs a few hundred steps of fine-tuning to adjust to the compressed spacing.
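
A sketch of PI layered on the RoPE helper from earlier (it assumes the `precompute_freqs_cis` definition above; the 4K-to-16K numbers are just an example):

```python
import torch

def precompute_freqs_cis_pi(
    dim: int,
    train_len: int,
    target_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Position Interpolation: compress target positions into the trained range."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # Indices 0..target_len-1 are scaled by L/L' so they never exceed train_len
    t = torch.arange(target_len, dtype=torch.float) * (train_len / target_len)
    angles = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(angles), angles)

# A 4K-trained model serving 16K: every index is shrunk 4Ɨ before rotation
# freqs_cis = precompute_freqs_cis_pi(dim=128, train_len=4096, target_len=16384)
```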

2. NTK (Neural Tangent Kernel)-aware Scaling

Scale the frequency base instead of positions, trading off between low and high frequency components[6]

$$\theta'_{\text{base}} = \theta_{\text{base}} \cdot \alpha^{\,d/(d-2)}$$

where $\alpha = L_{\text{target}} / L_{\text{train}}$ is the scale factor and $d$ is the rotary embedding dimension (the per-head dimension RoPE rotates, e.g., 128).

In plain terms

Instead of squishing the positions, we slow down the rotation frequencies so that longer contexts still produce reasonable rotation angles. The scaling factor $\alpha$ is simply the extension ratio. This works without fine-tuning for moderate extensions because we're changing the ruler rather than the positions.
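
As a sketch, the base adjustment is a one-liner (the $\alpha = 4$ example assumes a 4K-to-16K extension with 128-dim rotary heads):

```python
def ntk_scaled_base(base: float, alpha: float, dim: int) -> float:
    """NTK-aware scaling: stretch the frequency base instead of the positions."""
    return base * alpha ** (dim / (dim - 2))

new_base = ntk_scaled_base(10000.0, alpha=4.0, dim=128)
print(round(new_base))  # ā‰ˆ 40890; pass as theta to precompute_freqs_cis
```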

3. YaRN (Yet another RoPE extensioN)

Combines NTK scaling with attention temperature adjustment.[6] Achieves the best quality for large extensions (4K to 128K), but requires fine-tuning.

The temperature scaling adjusts the attention softmax: $\sqrt{1/t} = 0.1 \ln(s) + 1$, where $s = L_{\text{target}} / L_{\text{train}}$. This preserves attention patterns at longer contexts. NTK-by-parts interpolates high-frequency and low-frequency components differently, preventing fine details from getting blurred during extension.
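
The temperature rule is equally small; a sketch (the 0.1 constant and the log form come straight from the formula above):

```python
import math

def yarn_attn_scale(target_len: int, train_len: int) -> float:
    """Returns 1/t, the factor YaRN multiplies into the attention logits."""
    s = target_len / train_len
    return (0.1 * math.log(s) + 1.0) ** 2

print(yarn_attn_scale(128_000, 4_000))  # ā‰ˆ 1.81 for a 32Ɨ extension
```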


What modern architectures use

| Model family / era | Method | Typical context behavior | Notes |
|---|---|---|---|
| GPT-2 / BERT era | Learned absolute positions | Limited to training window | Strong baseline historically, weak extrapolation |
| Modern decoder-only open models (Llama, Qwen, Mistral families) | RoPE | Usually extended with scaling methods | Dominant choice in current decoder-only LLM design |
| Long-context decoder variants | RoPE + scaling (NTK, YaRN, interpolation) | Better long-range extrapolation | Common when pushing models far beyond pretraining length |
| BLOOM / early MPT-style variants | ALiBi | Good extrapolation without retraining | Simple and effective, but less common now |
| Sliding-window architectures | RoPE + local/sliding attention | Efficient long-context serving | Pairs rotary positions with sparse attention patterns |

The current consensus

RoPE dominates across most modern decoder-only LLMs, especially in open-weight model families where the architecture is published and inspectable. Different variants of RoPE handle context extension: scaling the base frequency, Position Interpolation, NTK-aware methods, or YaRN. ALiBi remains important historically and still appears in some long-context variants, but it is no longer the default choice for new large decoder models. Learned absolute positions, once common in GPT-2 and BERT, are now mostly associated with earlier transformer generations.


Key takeaways

Core mental models

  1. Permutation equivariance: Transformers process sets, not sequences. Without position encodings, "AB" equals "BA".
  2. Sinusoidal derivation: Uses multi-scale frequencies to create unique fingerprints. $PE_{\text{pos}+k}$ is a linear rotation of $PE_{\text{pos}}$.
  3. RoPE: Rotates Q/K in 2D subspaces. Relative position is explicitly encoded in the dot product $q^{(m)T} k^{(n)}$. Only modifies Q/K, leaving V position-free.
  4. ALiBi: Adds a static linear bias to attention scores based on distance. Extrapolates best out of the box but is less flexible than scaled RoPE.
  5. Context extension: The base frequency $\theta_{\text{base}}$ bounds context length. Extrapolation requires scaling methods like NTK or YaRN. See our long-context management deep-dive for practical techniques.

Common misconceptions

  • •"RoPE modifies embeddings": Incorrect. It only rotates Q and K vectors before the dot product. The value vectors VVV (which carry the semantic content) are untouched.
  • •"All extrapolation is equal": Sinusoidal and Learned encodings fail catastrophically beyond their training window. ALiBi works natively. RoPE requires scaling.
  • •"Just increase theta": Increasing Īøbase\theta_{\text{base}}Īøbase​ helps length but hurts retrieval accuracy unless combined with NTK/YaRN scaling.

Going deeper

Why not concatenate positional encodings?

Concatenation increases the embedding dimension from $d$ to $2d$, doubling the input size of all projection matrices ($W_Q$, $W_K$, $W_V$) and thus doubling their parameter counts. Addition keeps the dimension constant, and the model learns to separate content and position into different subspaces within the same $d$-dimensional space.

RoPE relative derivation

At position $m$, $q^{(m)} = R(m\theta)\, q$; at position $n$, $k^{(n)} = R(n\theta)\, k$. Because rotations compose, $R(m\theta)^T R(n\theta) = R((n-m)\theta)$:

$$q^{(m)T} k^{(n)} = q^T R(m\theta)^T R(n\theta)\, k = q^T R((n-m)\theta)\, k$$

After rotation, the dot product depends only on the relative offset between $m$ and $n$, not on the absolute positions. This is why RoPE naturally encodes relative position information.

Why does Llama use $\theta_{\text{base}} = 500{,}000$?

Higher base frequencies slow down the cosine decay of $S_{m-n} = \sum_i \cos((m-n)\theta_i)$, allowing the model to maintain meaningful attention scores at larger distances. Men et al. (2024) proved that the base frequency bounds the maximum effective context length: $\theta_{\text{base}} = 10{,}000$ supports ~4K, while $\theta_{\text{base}} = 500{,}000$ supports ~128K.[3]

Evaluation Rubric

  1. Explains permutation equivariance as the core motivation for positional encoding
  2. Derives the sinusoidal PE formula and explains the multi-frequency intuition
  3. Shows RoPE rotates Q/K in 2D subspaces such that the dot product depends only on relative distance
  4. Explains that RoPE only modifies Q and K, leaving V position-free
  5. Describes ALiBi as a fixed attention bias linear in distance with geometric head slopes
  6. Compares extrapolation: sinusoidal/learned fail, ALiBi extrapolates, RoPE degrades moderately
  7. Understands RoPE context extension methods (PI, NTK, YaRN) and their tradeoffs
  8. Mentions the base frequency and its role in bounding effective context length
Common Pitfalls
  • Failing to understand that transformers are fundamentally order-blind without positional encoding
  • Confusing absolute (sinusoidal, learned) vs relative (RoPE, ALiBi) positional encoding methods
  • Believing RoPE modifies token embeddings, when it only rotates Q and K vectors
  • Treating all context extrapolation methods as equivalent despite dramatic architectural differences
  • Assuming increasing the base frequency provides genuine long-context reasoning without a retrieval quality drop
  • Misunderstanding the mathematical and dimensional implications of adding vs. concatenating absolute encodings

Key Concepts Tested
Permutation Invariance Ā· Sinusoidal Positional Encoding Ā· Learned Embeddings Ā· Rotary Position Embeddings (RoPE) Ā· ALiBi Ā· Long-Context Extrapolation Ā· Position Interpolation Ā· NTK-Aware Scaling Ā· YaRN Ā· Attention Bias
References

  1. Vaswani, A., et al. (2017). Attention Is All You Need.
  2. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
  3. Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.
  4. Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.
  5. Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.
  6. Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.
