🧠 Hard · Transformer Architecture

Positional Encoding: RoPE & ALiBi

Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.


Consider these two sentences: "The dog bit the man" and "The man bit the dog." Same words, completely different meanings, because word order matters. Humans take this for granted, but transformers have a surprising blind spot: the core attention mechanism treats both sentences identically. To a transformer without position information, "The dog bit the man" and "man the bit dog The" look the same.

This is called permutation invariance, and it means we need a way to tell the model where each word appears in a sequence. That's exactly what positional encoding does, and the choice of method directly determines how well a model handles long documents, conversations, and code. (If you haven't read about the attention mechanism yet, start there. It's the foundation this article builds on.)

💡 Key insight: Positional encoding injects word-order information into an architecture that would otherwise be "order-blind." In this article, you'll learn why position info is needed, how the original approach works, and why modern models use a clever rotation trick called RoPE.


The permutation invariance problem

Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Reading the formula: compare every query with all keys ($QK^T$), scale by $\sqrt{d_k}$ (the key vector dimension) for stable softmax, convert the scores to attention weights, then mix the value vectors $V$ using those weights.

This operation is set-based: if you shuffle the input token order, the output is simply shuffled in the same way. The attention weights $QK^T$ depend on content, not position.[1]

Without positional information, these two sentences produce identical attention patterns:

```text
"The cat sat on the mat"
"mat the on sat cat The"
```

But meaning depends entirely on word order! We need a mechanism to inject position information.


Three Families of Solutions

| Approach | How It Works | Modifies | Examples |
| --- | --- | --- | --- |
| Absolute (additive) | Add a position vector to the token embedding | Embeddings | Sinusoidal (Vaswani), Learned (GPT-2, BERT) |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Attention scores | RoPE (Llama, Gemma, Qwen) |
| Relative (additive bias) | Add a distance bias to attention logits | Attention scores | ALiBi (BLOOM, MPT) |

Positional encoding methods compared: sinusoidal (fixed), learned (trained), RoPE (rotary), and ALiBi (linear bias), with their extrapolation properties.

Method 1: sinusoidal positional encoding

The original approach[1]

For position $\text{pos}$ and dimension index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(\text{pos},\,2i)} = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(\text{pos},\,2i+1)} = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

In plain terms: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds on a clock), while high dimensions oscillate slowly (like hours). Together they create a unique pattern for every position, much like how a clock's hour, minute, and second hands uniquely identify each moment.

Why sinusoids? Because $PE_{\text{pos}+k}$ can be expressed as a linear function of $PE_{\text{pos}}$:

$$PE_{\text{pos}+k} = R_k \cdot PE_{\text{pos}}$$

where $R_k$ is a rotation matrix that depends only on the offset $k$, not on the absolute position. This means the model can, in principle, learn to attend based on relative position.[1]
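This linearity is just the angle-addition identities in disguise. A quick standalone sanity check (illustrative only; the position, offset, and frequency below are arbitrary values) verifies that shifting a position by $k$ equals applying a fixed rotation that depends only on $k$:

```python
import math

def pe_pair(pos: float, omega: float) -> tuple[float, float]:
    """One (sin, cos) dimension pair of the sinusoidal encoding."""
    return math.sin(pos * omega), math.cos(pos * omega)

def apply_Rk(pair: tuple[float, float], k: float, omega: float) -> tuple[float, float]:
    """Apply the fixed offset matrix R_k (depends only on k, not on pos)."""
    s, c = pair
    ck, sk = math.cos(k * omega), math.sin(k * omega)
    # Angle-addition identities: sin(a+b) and cos(a+b)
    return s * ck + c * sk, c * ck - s * sk

pos, k, omega = 7.0, 5.0, 0.31                    # arbitrary position, offset, frequency
direct = pe_pair(pos + k, omega)                  # PE evaluated at pos + k
via_rotation = apply_Rk(pe_pair(pos, omega), k, omega)  # R_k applied to PE at pos

assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_rotation))
```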

Here is how we can generate these multi-frequency sinusoidal positional encodings. The function takes the maximum sequence length and embedding dimension as inputs, and returns a tensor of positional encodings that can be added directly to the token embeddings.

```python
import torch
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> torch.Tensor:
    """Generate sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

    # Compute the frequency for each dimension pair
    # Low dimensions → high frequency (local position)
    # High dimensions → low frequency (coarse position)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-math.log(10000.0) / d_model)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # Even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dims

    return pe  # (max_len, d_model)

# Usage: simply ADD to token embeddings
# x = token_embeddings + pe[:seq_len]
```

Frequency interpretation

| Dimension pair | Wavelength | What it captures |
| --- | --- | --- |
| $(0, 1)$ | $2\pi \approx 6.3$ tokens | Very fine-grained position |
| $(d/4,\, d/4+1)$ | $\sim 628$ tokens | Medium-scale structure |
| $(d-2,\, d-1)$ | $10000 \cdot 2\pi \approx 62{,}832$ tokens | Very coarse position |

The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.

Learned vs sinusoidal absolute encodings

Instead of fixed sinusoids, models like BERT and GPT-2 use learned position embeddings: a trainable matrix $P \in \mathbb{R}^{L_{\max} \times d}$ in which each position gets its own optimizable vector.

| Property | Sinusoidal | Learned |
| --- | --- | --- |
| Parameters | 0 (fixed formula) | $L_{\max} \times d$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | Beyond $L_{\max}$ (in theory) | ❌ Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |

💡 Key insight: Vaswani et al. (2017) tested both and found "nearly identical results."[1] Both have been superseded by relative position methods for modern LLMs.

Why add instead of concatenate?

💡 Design Choice: "Why do transformers add positional encodings to embeddings instead of concatenating them?"

A common intuition is that concatenating the positional vector to the token embedding (creating a $2d$ vector) would perfectly preserve both pieces of information. However, concatenation would widen the model from $d$ to $2d$, roughly quadrupling the parameters of every square projection matrix (each grows from $d \times d$ to $2d \times 2d$) and inflating the cost of all downstream operations. Addition keeps the dimension constant.

But how does the model avoid confusing content with position when they are summed into the very same values? Empirically, addition works well because the embedding space is extremely high-dimensional (typically $d = 1024$ or more), and in high-dimensional spaces random vectors are nearly orthogonal.

During training, the learned projections $W_Q, W_K, W_V$ naturally separate the content and position signals into different, nearly orthogonal subspaces of the same $d$-dimensional space. The model learns to route position information primarily into the attention score calculations ($Q$ and $K$) while preserving semantic content in the value vectors ($V$).
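The near-orthogonality claim is easy to check empirically. The sketch below (illustrative only; the dimension and seed are arbitrary choices) draws two random unit vectors in 1024 dimensions and measures their cosine similarity, which concentrates around $1/\sqrt{d} \approx 0.03$:

```python
import math
import random

random.seed(0)

def rand_unit(d: int) -> list[float]:
    """A random unit vector: Gaussian samples, then normalize."""
    v = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d = 1024
content, position = rand_unit(d), rand_unit(d)
cosine = sum(a * b for a, b in zip(content, position))

# Two random directions in 1024-d are nearly orthogonal
assert abs(cosine) < 0.2
```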

Limitations of absolute methods

  1. Poor extrapolation: performance degrades for positions beyond the training length
  2. Implicit relative position: the model must learn to derive relative positions from absolute ones, an indirect signal
  3. Content-position mixing: adding PE to embeddings mixes content and position in the same vector

Method 2: rotary position embeddings (RoPE)

The dominant modern approach

RoPE[2] is used by Llama, Gemma, Qwen, Mistral, Phi, and DeepSeek: virtually every modern open LLM.

Core insight: Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.

RoPE rotation visualization: pairs of embedding dimensions are rotated by angles proportional to their position, encoding relative positions through rotation.

The math

For a 2D subspace $(q_{2i}, q_{2i+1})$ at position $m$, apply the rotation:

$$\begin{pmatrix} q_{2i}^{(m)} \\ q_{2i+1}^{(m)} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

Reading the formula: take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position $m$ (further in the sequence means more rotation) and the frequency $\theta_i$ (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to answer "how far apart are these two words?"

The frequencies are $\theta_i = 10000^{-2i/d}$ (the same schedule as sinusoidal encoding).

The key property: when computing $q^{(m)} \cdot k^{(n)}$, the rotation matrices combine:

$$q^{(m)T} k^{(n)} = q^T R_{\theta}^{(n-m)}\, k$$

Why this matters: the attention score between a query at position $m$ and a key at position $n$ depends only on the gap $(m - n)$ between them, not on where they sit in the sequence. Token 5 attending to token 3 produces the same positional effect as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures distance, not address.


Implementation

The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.

```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) → x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast over batch & heads
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)
```

💡 Key insight: Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.
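The "only the angle difference matters" claim can be verified numerically in a single 2D subspace. This is a standalone plain-Python sketch (the vectors q, k, the frequency, and the positions are arbitrary illustrative values):

```python
import math

def rot(v: tuple[float, float], angle: float) -> tuple[float, float]:
    """Rotate a 2D vector counterclockwise by the given angle."""
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a: tuple[float, float], b: tuple[float, float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q, k, theta = (0.8, -0.4), (0.3, 0.9), 0.05
m, n = 105, 103

# Score with full RoPE rotations at absolute positions m and n
score = dot(rot(q, m * theta), rot(k, n * theta))

# Same score from rotating only by the relative offset (n - m)
relative_only = dot(q, rot(k, (n - m) * theta))
assert abs(score - relative_only) < 1e-9

# Shifting both positions by the same amount leaves the score unchanged
shifted = dot(rot(q, (m + 1000) * theta), rot(k, (n + 1000) * theta))
assert abs(score - shifted) < 1e-9
```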

Why RoPE wins over absolute methods

| Property | Sinusoidal | Learned | RoPE |
| --- | --- | --- | --- |
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Content-position mixing | Yes (additive) | Yes (additive) | No (multiplicative) |
| Value vectors | Modified by position | Modified by position | Unmodified ✅ |
| Extra parameters | 0 | $L_{\max} \times d$ | 0 |

💡 Key insight: RoPE only modifies Q and K. The value vectors remain position-free. This is elegant because V carries the actual content that gets aggregated; only the routing (which tokens to attend to) should be position-aware.

Long-range decay property

An important theoretical result: RoPE introduces a natural long-range decay in attention scores. The expected dot product between "similar" tokens decays as their distance increases, formalized as:[3]

$$B_{m-n,\theta} = \sum_i \cos\big((m-n)\theta_i\big)$$

where $m$ and $n$ are token positions, and $\theta_i$ is the rotation frequency for dimension pair $i$.

As $|m - n|$ grows, this sum decays, creating an inductive bias favoring local dependencies. This is typically desirable (nearby tokens are usually more relevant), but it means vanilla RoPE struggles at very long contexts.
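The decay is easy to see by evaluating the sum directly. A minimal sketch (the head dimension d = 128 and the sampled distances are arbitrary illustrative choices; the decay holds on average, not strictly monotonically):

```python
import math

def decay_sum(distance: int, d: int = 128, base: float = 10000.0) -> float:
    """B_{m-n,theta} = sum_i cos((m-n) * theta_i), with theta_i = base^(-2i/d)."""
    return sum(math.cos(distance * base ** (-2 * i / d))
               for i in range(d // 2))

# At distance 0 every cosine equals 1, so B = d/2 = 64: the maximum
for dist in [0, 1, 16, 256, 4096]:
    print(dist, round(decay_sum(dist), 2))

assert decay_sum(0) == 64.0
assert decay_sum(0) > decay_sum(256)
```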

The base frequency $\theta_{\text{base}}$ controls this decay rate. Men et al. (2024) showed that the base frequency bounds the achievable context length:[3]

| $\theta_{\text{base}}$ | Safe context length | Used by |
| --- | --- | --- |
| 10,000 | ~4K | Original RoPE, Gemma 2 |
| 500,000 | ~128K | Llama 3.1 |
| 1,000,000 | ~200K+ | Llama 4, some custom models |

⚠️ Warning: Simply increasing the base frequency gives "superficial" long-context capability: perplexity looks good, but retrieval accuracy degrades. This is why RoPE scaling methods (PI, NTK-aware scaling, YaRN) are needed for genuine long-context support.


Method 3: ALiBi (attention with linear biases)

The simplest approach[4]

ALiBi doesn't modify embeddings at all. Instead, it adds a linear bias directly to attention scores:

💡 Key insight: Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works exactly like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + m \cdot \text{dist}(i,j)\right)V$$

Reading the formula: this is the normal attention formula with one addition: before applying softmax, we add a distance penalty $m \cdot \text{dist}(i,j)$. Since $\text{dist}$ is negative (closer tokens incur less penalty), nearby tokens keep their original scores while distant tokens get pushed down. The slope $m$ controls how aggressively: a steep slope makes the model short-sighted, a gentle slope lets it see further.

where:

  • $\text{dist}(i, j) = -|i - j|$ is the negative distance between positions
  • $m$ is a head-specific slope (not learned, but a fixed geometric sequence)

Head slopes: for $h$ heads, $m_k = \frac{1}{2^{8k/h}}$ for $k = 1, \ldots, h$.

| Head | Slope $m$ | Effect |
| --- | --- | --- |
| Head 1 | $1/2$ | Strong local attention |
| Head 4 | $1/16$ | Moderate range |
| Head 8 | $1/256$ | Very long-range |

Different heads get different slopes through a fixed geometric sequence, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope).
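To make the slope schedule concrete, here is a small standalone check (8 heads is an illustrative choice) that the slopes form the geometric sequence above and that the penalty grows linearly with distance:

```python
def alibi_slope(k: int, n_heads: int) -> float:
    """Head k's slope: m_k = 1 / 2^(8k / n_heads)."""
    return 2 ** (-8 * k / n_heads)

n_heads = 8
slopes = [alibi_slope(k, n_heads) for k in range(1, n_heads + 1)]
assert slopes[0] == 0.5 and slopes[-1] == 1 / 256  # geometric: 1/2 down to 1/256

def alibi_penalty(i: int, j: int, slope: float) -> float:
    """Bias added to the attention logit for query i attending to key j."""
    return slope * -abs(i - j)

# Penalty grows with distance; a gentler slope punishes the same distance less
assert alibi_penalty(0, 10, 0.5) < alibi_penalty(0, 1, 0.5)
assert alibi_penalty(0, 10, 1 / 256) > alibi_penalty(0, 10, 0.5)
```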

The following snippet shows how to construct the ALiBi bias tensor. It takes the number of attention heads and maximum sequence length as inputs, computing a geometric sequence of slopes and a distance matrix. The output is a bias matrix that is added directly to the attention scores before the softmax step.

```python
import torch

def build_alibi_bias(
    n_heads: int,
    max_len: int,
) -> torch.Tensor:
    """Build ALiBi attention bias matrix."""
    # Geometric slopes: 1/2^(8/n_heads), 1/2^(16/n_heads), ...
    slopes = torch.pow(2, -8 * torch.arange(1, n_heads + 1) / n_heads)

    # Distance matrix: -|i - j|
    positions = torch.arange(max_len)
    distance = -(positions.unsqueeze(0) - positions.unsqueeze(1)).abs()

    # slope × distance → (n_heads, max_len, max_len)
    alibi = slopes.unsqueeze(1).unsqueeze(1) * distance.unsqueeze(0)

    return alibi  # Add this to attention scores before softmax
```

ALiBi's superpower: zero-shot extrapolation

ALiBi was trained on 1K tokens but tested on 2K–11K tokens without quality degradation.[4] Why?

  • Linear biases are monotonic: distant tokens always get penalized
  • The penalty is unbounded: truly distant tokens receive effectively $-\infty$ scores
  • No learned position parameters → nothing to extrapolate
  • The fixed slopes create a natural "soft sliding window" without explicit masking

Softmax numerical stability with ALiBi

An implementation detail worth noting is how ALiBi interacts with standard softmax optimizations. ALiBi biases can produce very negative values for distant tokens. In practice, softmax is computed using the max-shift trick to prevent numerical overflow:

$$\text{softmax}(s_i + b_i) = \frac{e^{s_i + b_i - \max(s + b)}}{\sum_j e^{s_j + b_j - \max(s + b)}}$$

where $s_i$ is the raw attention score and $b_i$ is the ALiBi distance bias for token $i$.

When computing this in hardware, the very large negative biases applied to distant tokens mean that their shifted exponentials ($e^{\text{large negative number}}$) become effectively zero. This causes them to drop out of the attention computation entirely.

Because of this property, ALiBi provides a natural, hardware-friendly "soft sliding window" without requiring explicit boolean masking or truncation of the sequence. The model naturally learns to ignore distant context simply because the math forces those attention weights to zero.
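A tiny numeric sketch (the scores and biases are made-up values) shows the underflow in action: after the max-shift, a heavily penalized token's weight becomes exactly zero in float64:

```python
import math

scores = [2.0, 1.5, 0.5]        # raw attention logits s_i (illustrative)
biases = [0.0, -4.0, -1000.0]   # ALiBi penalties: near, mid, very distant token
logits = [s + b for s, b in zip(scores, biases)]

m = max(logits)                 # max-shift for numerical stability
exps = [math.exp(x - m) for x in logits]
weights = [e / sum(exps) for e in exps]

# The distant token's shifted exponential underflows to exactly 0.0,
# so it drops out of the mixture: a natural "soft sliding window"
assert weights[2] == 0.0
assert abs(sum(weights) - 1.0) < 1e-12
```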


The extrapolation challenge

💡 Key insight: Imagine learning to swim in a 25-meter pool (your training context length). Sinusoidal encoding is like a swimmer who panics the instant they're dropped into a 50-meter pool: total failure. ALiBi is like a swimmer who learned proper stroke technique and can handle a 100-meter pool with no extra training. RoPE with scaling (YaRN) is like a swimmer who does a few warm-up laps in the bigger pool (fine-tuning) and then performs beautifully.

This is a critical practical challenge in production systems:

"Your model was trained on 4K tokens. How does it perform at 32K?"

| Method | Train length | Max usable length | Extrapolation |
| --- | --- | --- | --- |
| Sinusoidal | 4K | ~4K | ❌ Catastrophic |
| Learned | 4K | 4K exactly | ❌ No extrapolation |
| RoPE (vanilla) | 4K | ~6K | ⚠️ Degrades gradually |
| ALiBi | 4K | ~11K | ✅ Graceful |
| RoPE + NTK | 4K | ~32K | ✅ With scaling |
| RoPE + YaRN | 4K | ~128K | ✅ Best scaling |

RoPE context extension techniques

When vanilla RoPE fails at long contexts, three scaling methods exist:

1. Position Interpolation (PI): scale positions down to fit within the training range[5]

$$\theta'_i = \theta_i \cdot \frac{L_{\text{train}}}{L_{\text{target}}}$$

Reading the formula: if you trained on 4K tokens but want to handle 16K, you shrink all positions by 4× so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. The model sees familiar territory, but it needs a few hundred steps of fine-tuning to adjust to the compressed spacing.
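As a worked example (the lengths are illustrative), extending a 4,096-token model to 16,384 tokens compresses every position by 4×:

```python
def pi_scaled_position(pos: int, L_train: int, L_target: int) -> float:
    """Position Interpolation: compress positions into the trained range."""
    return pos * L_train / L_target

# Extending 4,096 → 16,384 shrinks every position by 4x
assert pi_scaled_position(16_000, 4_096, 16_384) == 4000.0
assert pi_scaled_position(4_096, 4_096, 16_384) == 1024.0
```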

2. NTK (Neural Tangent Kernel)-aware scaling: scale the frequency base instead of the positions[6]

$$\theta'_{\text{base}} = \theta_{\text{base}} \cdot \alpha^{d/(d-2)}$$

In plain terms: instead of squishing the positions, we slow down the rotation frequencies so that longer contexts still produce "reasonable" rotation angles. The scaling factor $\alpha$ is the extension ratio ($L_{\text{target}} / L_{\text{train}}$). This works without fine-tuning for moderate extensions because we're changing the ruler rather than the positions.
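Numerically (α = 4 and d = 128 are illustrative values), NTK-aware scaling stretches the base slightly more than the plain extension ratio:

```python
def ntk_scaled_base(base: float, alpha: float, d: int) -> float:
    """NTK-aware scaling: stretch the frequency base by alpha^(d/(d-2))."""
    return base * alpha ** (d / (d - 2))

# alpha = L_target / L_train = 4 for a 4x context extension
new_base = ntk_scaled_base(10_000.0, 4.0, 128)
assert 40_000 < new_base < 42_000  # slightly more than a plain 4x stretch
```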

3. YaRN (Yet another RoPE extensioN): combines NTK scaling with an attention temperature adjustment[7]

$$\text{Attention}' = \frac{1}{\sqrt{t}} \cdot \text{Attention}$$

where $t$ is a temperature factor. YaRN achieves the best extrapolation results and is used to extend Llama to 128K context.


What modern LLMs actually use

| Model | Method | Context length | Notes |
| --- | --- | --- | --- |
| GPT-4o | Learned absolute | 128K | Massive training budget makes this viable |
| Llama 3.1/4 | RoPE (scaled) | 128K+ | $\theta_{\text{base}} = 500{,}000$ |
| Gemma 2 | RoPE | 8K | Standard $\theta_{\text{base}} = 10{,}000$ |
| Mistral | RoPE + Sliding Window | 32K | Combines RoPE with local attention |
| BLOOM | ALiBi | 2K | Designed for multilingual, shorter contexts |
| MPT | ALiBi | 65K | Extended via ALiBi extrapolation |
| Qwen 2.5 | RoPE (YaRN) | 128K | YaRN scaling for long context |
| DeepSeek V3 | RoPE (YaRN) | 128K | Extended from 4K training |
The industry consensus (2025–2026): RoPE with frequency base scaling or YaRN is dominant. ALiBi is elegant but less used in the latest models. Learned positions only work if you can afford massive compute (GPT-4 scale).


Key takeaways

Core mental models

  1. Permutation invariance: Transformers process sets, not sequences. Without position encodings, "AB" equals "BA".
  2. Sinusoidal derivation: multi-scale frequencies create unique fingerprints; $PE_{\text{pos}+k}$ is a linear rotation of $PE_{\text{pos}}$.
  3. RoPE: rotates Q/K in 2D subspaces. Relative position is explicitly encoded in the dot product $q^{(m)T} k^{(n)}$. Only modifies Q/K, leaving V position-free.
  4. ALiBi: adds a static linear bias to attention scores based on distance. Extrapolates best out of the box but is less flexible than scaled RoPE.
  5. Context extension: the base frequency $\theta_{\text{base}}$ bounds context length; extrapolation requires scaling methods like NTK or YaRN. See our long-context management deep-dive for practical techniques.

Common misconceptions

  • "RoPE modifies embeddings": incorrect. It only rotates the Q and K vectors before the dot product; the value vectors $V$ (which carry the semantic content) are untouched.
  • "All extrapolation is equal": sinusoidal and learned encodings fail catastrophically beyond their training window. ALiBi works natively. RoPE requires scaling.
  • "Just increase theta": increasing $\theta_{\text{base}}$ helps length but hurts retrieval accuracy unless combined with NTK/YaRN scaling.

Going deeper

"Why not concatenate positional encodings?" Concatenation increases the dimension from $d$ to $2d$, which roughly quadruples the parameters of each square projection matrix and inflates compute throughout the network. Addition works because high-dimensional spaces allow the model to learn orthogonal subspaces for content vs. position.

"RoPE relative derivation": At position $m$, $q^{(m)} = R(m\theta) \cdot q$. At position $n$, $k^{(n)} = R(n\theta) \cdot k$. Then:

$$q^{(m)T} k^{(n)} = q^T R(m\theta)^T R(n\theta)\, k = q^T R\big((n-m)\theta\big)\, k$$

Reading the formula: after rotation, the dot product depends only on the relative offset $(n - m)$, not on the absolute positions. This is why RoPE naturally encodes relative position information.

"Why does Llama use $\theta_{\text{base}} = 500{,}000$?" Higher base frequencies slow the cosine decay of $B_{m-n,\theta} = \sum_i \cos((m-n)\theta_i)$, allowing the model to maintain meaningful attention scores at larger distances. Men et al. (2024) showed that the base frequency bounds the maximum effective context length: $\theta_{\text{base}} = 10{,}000$ supports ~4K, while $\theta_{\text{base}} = 500{,}000$ supports ~128K.[3]

Evaluation Rubric

  1. Explains permutation invariance as the core motivation for positional encoding
  2. Derives the sinusoidal PE formula and explains the multi-frequency intuition
  3. Shows that RoPE rotates Q/K in 2D subspaces so the dot product depends only on relative distance
  4. Explains that RoPE only modifies Q and K, leaving V position-free
  5. Describes ALiBi as a fixed attention bias, linear in distance, with geometric head slopes
  6. Compares extrapolation: sinusoidal/learned fail, ALiBi extrapolates, RoPE degrades moderately
  7. Knows the RoPE context extension methods (PI, NTK, YaRN) and their tradeoffs
  8. Mentions the base frequency and its role in bounding effective context length
Common Pitfalls
  • Not explaining WHY transformers need position info (permutation invariance)
  • Confusing absolute (sinusoidal, learned) vs relative (RoPE, ALiBi) methods
  • Saying RoPE modifies embeddings: it only modifies Q and K, never V
  • Treating all extrapolation as equivalent: methods differ dramatically
  • Not knowing that increasing θ_base gives superficial long-context ability
  • Forgetting the add-vs-concatenate tradeoff for absolute encodings

Key Concepts Tested
Permutation Invariance · Sinusoidal Positional Encoding · Learned Embeddings · Rotary Position Embeddings (RoPE) · ALiBi · Long-Context Extrapolation · Position Interpolation · NTK-Aware Scaling · YaRN · Attention Bias
References

  1. Vaswani, A., et al. (2017). Attention Is All You Need.
  2. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
  3. Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.
  4. Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.
  5. Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.
  6. bloc97 (2023). NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  7. Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.
