LeetLLM

Positional Encoding: RoPE & ALiBi

Understand why transformers need position information, how sinusoidal encodings work, how RoPE and ALiBi encode relative position, and why long-context extrapolation needs careful evaluation.


Vision Transformers showed why image patches need position. Text has the same problem: "The package left the warehouse" and "The warehouse left the package" use nearly the same words, but the order changes the meaning.

Self-attention has a surprising blind spot: by itself, it has no built-in notion of "first", "second", or "third" token. If you permute the input sequence, the outputs permute in the same way. This is called permutation equivariance, and it means the model needs a separate signal for where each token sits.

Positional encoding injects word-order information into an architecture that would otherwise be order-blind. In this article, you'll learn why position information is needed, how the original sinusoidal approach works, and how relative methods like RoPE and ALiBi change long-context behavior.


The permutation equivariance problem

Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

What this equation says

Compare every query with all keys ($QK^T$), scale by $\sqrt{d_k}$ (where $d_k$ is the key vector dimension) for stable softmax, convert scores to attention weights, then mix value vectors $V$ using those weights.

This unmasked operation is permutation equivariant: if you shuffle the input token order, the output is shuffled in the same way. The scores $QK^T$ depend only on token content, not on where each token sits.[1]

Without positional information, transformers can't distinguish between different word orders, which can completely alter the meaning of a sentence. For example, these two sequences contain identical tokens but have very different semantics. When processed by a position-blind transformer, they produce the same attention pattern up to the same row/column permutation:

```text
"The package left the warehouse"
"warehouse the left package The"
```

Meaning depends on word order, so the model needs a mechanism to inject position information. A causal mask also adds directional structure in decoder-only models, but it doesn't replace positional encoding: the mask tells a token which earlier slots are visible, not how far away those slots are.

[Figure: Position-blind attention visual showing reordered tokens and why self-attention needs a separate position signal.]
Self-attention is content-based. Position encodings provide the missing signal for index, direction, and distance.

This tiny attention implementation verifies the property directly. It permutes input rows, recomputes attention, and gets exactly the permuted original output.

permutation-equivariance.py
```python
import math

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
permutation = [2, 0, 1]

def attention(rows: list[list[float]]) -> list[list[float]]:
    # Scaled dot-product scores with Q = K = V = rows (d_k = 2).
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(2) for k in rows]
        for q in rows
    ]
    outputs = []
    for row in scores:
        # Softmax over one query's scores, then a weighted sum of the value rows.
        normalizer = sum(math.exp(value) for value in row)
        weights = [math.exp(value) / normalizer for value in row]
        outputs.append([
            sum(weight * value[column] for weight, value in zip(weights, rows))
            for column in range(2)
        ])
    return outputs

baseline = attention(tokens)
shuffled = attention([tokens[index] for index in permutation])
expected = [baseline[index] for index in permutation]

same_output_up_to_permutation = all(
    abs(actual - target) < 1e-12
    for actual_row, target_row in zip(shuffled, expected)
    for actual, target in zip(actual_row, target_row)
)
print(same_output_up_to_permutation)
print([round(row[0], 3) for row in baseline])
print([round(row[0], 3) for row in shuffled])
```
Output
```text
True
[0.802, 0.599, 0.752]
[0.752, 0.802, 0.599]
```
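The earlier remark that a causal mask does not replace positional encoding can be checked in the same style. The sketch below is an illustration under assumptions, not part of the original example: the helper `causal_last_token_output` and the toy vectors are hypothetical. It computes the attention output for the final token, which may see every earlier slot, and then reverses the order of those earlier slots.

```python
import math

def causal_last_token_output(rows: list[list[float]]) -> list[float]:
    """Attention output for the final query, which can see every row up to itself."""
    query = rows[-1]
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in rows]
    normalizer = sum(math.exp(score) for score in scores)
    weights = [math.exp(score) / normalizer for score in scores]
    return [sum(w * row[col] for w, row in zip(weights, rows)) for col in range(d_k)]

prefix = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # hypothetical earlier tokens
last = [1.0, 1.0]                               # hypothetical final token

original = causal_last_token_output(prefix + [last])
reordered = causal_last_token_output(prefix[::-1] + [last])

# The mask decides which slots are visible, but the visible slots are still
# mixed as an unordered set, so reversing the prefix leaves the output unchanged.
unchanged = all(abs(a - b) < 1e-12 for a, b in zip(original, reordered))
print(unchanged)
```

Without a position signal, the last token cannot tell whether a visible token was one step back or three steps back.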

Three families of solutions

Here is the map of approaches we will explore, so you know where each method fits before we walk through it step by step.

| Approach | How it works | Modifies | Examples |
|---|---|---|---|
| Absolute (additive) | Add position vector to token embedding | Embeddings | Original Transformer, BERT, GPT-2[1][2][3] |
| Relative (multiplicative) | Rotate Q/K vectors based on position | Q/K projections | RoFormer, Llama family[4][5] |
| Relative (additive bias) | Add distance bias to attention logits | Attention scores | ALiBi paper[6] |

The methods enter the transformer at different points. The three paths below are alternatives, not steps that every model applies together.

[Figure: Diagram showing Token embedding, Absolute: add position vector, Project Q, K, V, and Projected Q, K.]
[Figure: Comparison of absolute positional encodings, RoPE, and ALiBi by where position enters the transformer.]
Absolute encodings modify embeddings, RoPE modifies Q/K geometry, and ALiBi modifies attention logits.

Method 1: Sinusoidal positional encoding

The original approach[1]

Think of positional encoding like reading a clock. A clock uses multiple hands moving at different speeds (seconds, minutes, hours) to uniquely identify any moment in a 12-hour period. Sinusoidal positional encoding does the same thing for words in a sequence: each position gets a unique "fingerprint" made of sine and cosine waves at different frequencies. Low dimensions oscillate quickly (like seconds), while high dimensions oscillate slowly (like hours).

For position $\text{pos}$ and dimension index $i$ (where $d_{\text{model}}$ is the embedding dimension):

$$PE_{(\text{pos}, 2i)} = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(\text{pos}, 2i+1)} = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

What the sine and cosine waves mean

The combination of these sine and cosine waves creates a unique pattern for every position. Each pair of dimensions acts like one clock hand spinning at a particular speed. Position 0 starts every hand at 0 or 1. As you move down the sequence, each hand turns a little more, and the exact angle of every hand together forms a fingerprint that no other position shares.

Why sinusoids?

Because $PE_{\text{pos}+k}$ can be expressed as a linear function of $PE_{\text{pos}}$:

$$PE_{\text{pos}+k} = R_k \cdot PE_{\text{pos}}$$

where $R_k$ is a rotation matrix that depends only on offset $k$, not absolute position. This means the model can, in principle, learn to attend based on relative position.[1]
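The linear-shift property can be checked numerically for one (sin, cos) pair. This is a minimal sketch under assumptions: the helper names `pe_pair`, `shift_matrix`, and `apply`, the chosen frequency, and the offset `k = 3` are illustrative and not from the original paper. For a pair stored as $(\sin a, \cos a)$, the angle-addition identities give $R_k$ the form $\begin{pmatrix} \cos k\omega & \sin k\omega \\ -\sin k\omega & \cos k\omega \end{pmatrix}$.

```python
import math

def pe_pair(pos: int, omega: float) -> list[float]:
    """One (sin, cos) pair of the sinusoidal encoding at angular frequency omega."""
    return [math.sin(pos * omega), math.cos(pos * omega)]

def shift_matrix(k: int, omega: float) -> list[list[float]]:
    """2x2 rotation that maps the pair at pos to the pair at pos + k."""
    c, s = math.cos(k * omega), math.sin(k * omega)
    return [[c, s], [-s, c]]

def apply(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

omega = 10000 ** (-2 / 8)  # second pair of a d_model = 8 encoding (assumed for illustration)
k = 3
R_k = shift_matrix(k, omega)  # built once, with no knowledge of pos

max_error = max(
    abs(a - b)
    for pos in (0, 7, 500)
    for a, b in zip(apply(R_k, pe_pair(pos, omega)), pe_pair(pos + k, omega))
)
print(max_error < 1e-12)
```

The same `R_k` works at every tested position, which is the sense in which the offset, not the absolute index, determines the transformation.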

Here's how we can generate these multi-frequency sinusoidal positional encodings with plain Python. The function takes the maximum sequence length and embedding dimension as inputs to compute the required frequencies. It returns a table of positional encodings that can be added directly to token embeddings before they enter attention layers.

why-sinusoids.py
```python
import math

def sinusoidal_positional_encoding(
    max_len: int,
    d_model: int,
) -> list[list[float]]:
    """Generate sinusoidal positional encodings."""
    assert d_model % 2 == 0
    pe: list[list[float]] = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Equivalent to 10000 ** (-i / d_model), where i is the even dimension index.
            frequency = math.exp(-(math.log(10000.0) * i) / d_model)
            angle = pos * frequency
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        pe.append(row)

    return pe  # (max_len, d_model)

pe = sinusoidal_positional_encoding(max_len=8, d_model=8)
print(len(pe), len(pe[0]))
print([round(x, 3) for x in pe[0]])
print(round(pe[2][0], 3))
print(round(pe[2][1], 3))
```
Output
```text
8 8
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
0.909
-0.416
```

For the fastest pair (dimensions 0 and 1), position 2 means angle 2 radians, so we get $\sin(2) \approx 0.909$ and $\cos(2) \approx -0.416$. For the slowest pair (dimensions 6 and 7), position 2 means angle 0.002 radians, so the values stay extremely close to 0 and 1. This is the multi-scale fingerprint in action.

Frequency interpretation

| Dimension pair | Wavelength ($d = 512$) | What it captures |
|---|---|---|
| $(0, 1)$ | $\sim 6$ tokens | Very fine-grained local position |
| $(64, 65)$ | $\sim 20$ tokens | Medium-scale structure |
| $(510, 511)$ | $\sim 60{,}000$ tokens | Very coarse global position |

The multi-scale frequencies mean the model can attend to both local and distant positions through different embedding dimensions, similar to how Fourier series decompose signals at multiple scales.

sinusoidal-wavelengths.py
```python
import math

d_model = 512
for first_dimension in [0, 64, 510]:
    # Angular frequency of the pair that starts at this even dimension.
    angular_frequency = 10000 ** (-first_dimension / d_model)
    wavelength = 2 * math.pi / angular_frequency
    print(first_dimension, round(wavelength))
```
Output
```text
0 6
64 20
510 60611
```
[Figure: Sinusoidal positional encoding chart showing fast, medium, and slow frequency bands over positions.]
Sinusoidal encodings combine fast, medium, and slow waves. Each token position gets a multi-scale fingerprint.

Learned vs sinusoidal absolute encodings

Instead of fixed sinusoids, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 (Generative Pre-trained Transformer 2) use learned position embeddings: a trainable matrix $P \in \mathbb{R}^{L_{\max} \times d}$ where each position gets its own optimizable vector.[2][3]

| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (fixed formula) | $L_{\max} \times d$ added params |
| Flexibility | Fixed inductive bias | Adapts to task |
| Extrapolation | Beyond $L_{\max}$ (in theory) | Hard limit at $L_{\max}$ |
| Performance | Strong baseline | Comparable; no significant win |

Vaswani et al. (2017) tested both and found "nearly identical results."[1] Later architectures also explore relative-position methods, which insert position into attention in different ways.
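The parameter cost and the hard length limit of a learned table can be seen in a small stand-in. This is a sketch under assumptions: `make_learned_position_table` is a hypothetical helper, the random values only imitate an untrained matrix, and the sizes are toy values chosen for illustration.

```python
import random

def make_learned_position_table(
    max_len: int,
    d_model: int,
    seed: int = 0,
) -> list[list[float]]:
    """Stand-in for a trainable (max_len, d_model) matrix; real values would be learned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

table = make_learned_position_table(max_len=16, d_model=4)

# Every position owns one d_model-sized row, so the cost is L_max * d parameters.
param_count = len(table) * len(table[0])
print(param_count)

# Position 16 has no row: the table cannot serve positions past L_max.
try:
    table[16]
    lookup_failed = False
except IndexError:
    lookup_failed = True
print(lookup_failed)
```

A sinusoidal formula can always be evaluated at position 16, although the model may never have seen that pattern during training.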

Why add instead of concatenate?

A common intuition is that concatenating the positional vector to the token embedding (creating a $2d$ vector) would perfectly preserve both pieces of information. However, concatenation would either force the whole transformer to run at width $2d$ or require an extra projection layer to get back to $d$. If you keep the model width at $2d$, square matrices like $W_Q$, $W_K$, and $W_V$ become roughly 4x larger.

Addition keeps the hidden width unchanged and gives later learned projections one combined representation to use. It doesn't guarantee that content and position stay neatly separated. Whether that works well is empirical: Vaswani et al. report nearly identical results for their sinusoidal and learned-position variants.[1]
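The 4x figure follows from counting entries in the square projection matrices. A minimal sketch, assuming an illustrative width of 512 and ignoring biases; `qkv_projection_params` is a hypothetical helper.

```python
def qkv_projection_params(width: int) -> int:
    """Parameters in three square projections W_Q, W_K, W_V (biases ignored)."""
    return 3 * width * width

d = 512  # illustrative model width, not taken from a specific model
added = qkv_projection_params(d)             # position added: width stays d
concatenated = qkv_projection_params(2 * d)  # position concatenated: width 2d

print(added, concatenated, concatenated // added)
```

Doubling the width quadruples each square matrix, which is why addition is the cheaper default.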

Limitations of absolute methods

| Limitation | Why it matters |
|---|---|
| Poor extrapolation | Quality often degrades for positions beyond the training length. |
| Implicit relative position | The model must learn relative offsets indirectly from absolute vectors. |
| Content-position mixing | Adding position vectors to embeddings mixes content and position in the same stream. |

Method 2: Rotary Position Embeddings (RoPE)

A relative-position approach

RoPE (Rotary Position Embeddings) was introduced in RoFormer and is used in Llama-family models described in Meta's Llama 3 report.[4][5]

Core insight

Instead of adding position to embeddings, rotate the query and key vectors in 2D subspaces. The dot product between rotated vectors naturally encodes relative position.

[Figure: RoPE rotation comparison showing two query-key pairs with same relative offset and identical positional contribution.]
RoPE turns position into phase. Shift both tokens together and absolute angles change, but relative offset stays the same.

Imagine two dancers (a query and a key) on a circular stage. RoPE assigns each dancer a rotation angle based on their position in the sequence. When they face each other to "interact" (dot product), only the difference in their angles matters, not their absolute positions. This is why RoPE naturally encodes relative position: a dancer at position 5 interacting with position 3 looks the same as position 105 interacting with position 103.

The math

For a 2D subspace $(q_{2i}, q_{2i+1})$ at position $m$, apply rotation:

$$\begin{pmatrix} q_{2i}^{(m)} \\ q_{2i+1}^{(m)} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

How to read the rotation matrix

Take each consecutive pair of numbers in the query vector and spin them like a clock hand. How far they spin depends on two things: the token's position $m$ (further in the sequence = more rotation) and the frequency $\theta_i$ (each pair rotates at a different speed). The result: every position creates a unique rotational pattern that the model can use to figure out "how far apart are these two words?"

The frequency base follows the same pattern as sinusoidal encoding, but production RoPE usually applies it over the per-head rotary dimension, not the full model width:

$$\theta_i = 10000^{-2i/d_{\text{rotary}}}$$

If every head dimension is rotated, $d_{\text{rotary}} = d_k$. Some architectures rotate only part of each head, so the config's rotary dimension is the value that matters.
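The effect of the rotary dimension on the frequency schedule can be sketched directly from the formula. This is an illustration under assumptions: `rope_frequencies` is a hypothetical helper, and the head size of 8 is a toy value.

```python
def rope_frequencies(d_rotary: int, base: float = 10000.0) -> list[float]:
    """theta_i = base^(-2i / d_rotary) for i = 0 .. d_rotary/2 - 1."""
    assert d_rotary % 2 == 0, "RoPE rotates dimensions in pairs"
    return [base ** (-2 * i / d_rotary) for i in range(d_rotary // 2)]

d_k = 8  # toy per-head dimension
full = rope_frequencies(d_rotary=d_k)          # rotate all 8 dims: 4 pairs
partial = rope_frequencies(d_rotary=d_k // 2)  # rotate only 4 dims: 2 pairs

print([round(theta, 4) for theta in full])
print([round(theta, 4) for theta in partial])
```

With partial rotation, the remaining head dimensions pass through unrotated, and the schedule is spread over fewer pairs, so the exponent must use the rotary dimension rather than $d_k$.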

The key property

When computing $q^{(m)} \cdot k^{(n)}$, the rotation matrices combine:

$$q^{(m)T} k^{(n)} = q^T R_{\theta}^{(n-m)} k$$
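The intermediate steps are short enough to write out. In the sketch below, $R_{\theta}^{(m)}$ denotes the 2D rotation by angle $m\theta$ from the rotation formula above; the only facts used are that a rotation's transpose is the rotation by the opposite angle, and that rotations in the same plane compose by adding angles.

```latex
\begin{aligned}
q^{(m)} &= R_{\theta}^{(m)} q, \qquad k^{(n)} = R_{\theta}^{(n)} k \\
q^{(m)T} k^{(n)}
  &= \left(R_{\theta}^{(m)} q\right)^{T} \left(R_{\theta}^{(n)} k\right) \\
  &= q^{T} \left(R_{\theta}^{(m)}\right)^{T} R_{\theta}^{(n)} k \\
  &= q^{T} R_{\theta}^{(-m)} R_{\theta}^{(n)} k \\
  &= q^{T} R_{\theta}^{(n-m)} k
\end{aligned}
```

The absolute positions $m$ and $n$ appear only through their difference in the final line.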

Toy numerical example (2D subspace). Take a query pair at position 5 and a key pair at position 2, with frequency $\theta = 0.5$ for simplicity, then shift both positions by 100 and compare the two scores:

the-key-property.py
```python
import math

def rotate_2d(vec: list[float], pos: int, theta: float) -> list[float]:
    """Apply RoPE rotation to a 2D pair."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return [x * c - y * s, x * s + y * c]

q = [0.8, 0.6]  # original query pair
k = [0.7, 0.5]  # original key pair

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

theta = 0.5
score_a = dot(rotate_2d(q, 5, theta), rotate_2d(k, 2, theta))
score_b = dot(rotate_2d(q, 105, theta), rotate_2d(k, 102, theta))

print(round(score_a, 6), round(score_b, 6))
```
Output
```text
0.040884 0.040884
```

Running this shows that the interaction depends only on the offset, exactly as the math predicts. This relative-position property is useful, but alone provides no proof that a model works beyond its trained context window.

Why this matters

The attention score between a query at position $m$ and a key at position $n$ depends only on the signed relative offset $(n - m)$ between them, not on their absolute positions. Token 5 attending to token 3 produces the same positional structure as token 1005 attending to token 1003. This is what makes RoPE "relative": it captures offset, not absolute address.

Worked example: the same distance always gives the same score

Picture a single 2D pair where both the query and the key start as the simple unit vector $[1, 0]$. Let the rotation frequency for this pair be $\theta = 0.5$ radians.

  • Query at position $m = 2$, key at position $n = 1$: relative distance $d = n - m = -1$. After rotation, the dot product equals $\cos(d \cdot \theta) = \cos(-0.5) \approx 0.878$.

  • Query at position $m = 102$, key at position $n = 101$: relative distance is still $d = -1$. After rotation, the dot product is again $\cos(-0.5) \approx 0.878$.

The absolute positions changed by 100, but the positional part of the attention score stayed identical because only the distance between the tokens matters. If the distance were $d = -2$, this particular pair would score $\cos(-1.0) \approx 0.540$, which is smaller. Avoid generalizing that single comparison into a monotonic distance penalty: cosine phases can rise again at larger offsets, and a real head sums many frequency pairs.

[Figure: RoPE pair mechanics flow showing vectors split into 2D pairs, rotated by per-pair frequencies, then compared through relative offset.]
RoPE applies one reusable pairwise recipe across the head. Each pair gets its own frequency, but every pair collapses absolute rotation into relative offset.

Implementation

The code below demonstrates how RoPE applies these rotations in practice. It takes the query and key tensors as inputs, along with precomputed rotation frequencies based on the sequence length. The function reshapes the vectors into 2D pairs, applies the complex rotations, and returns the rotated query and key tensors.

implementation.py
```python
import torch

def precompute_freqs_cis(
    dim: int,
    max_seq_len: int,
    theta: float = 10000.0,
) -> torch.Tensor:
    """Precompute rotation frequencies (complex exponentials)."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float)
    freqs = torch.outer(t, freqs)  # (seq_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(
    xq: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    xk: torch.Tensor,         # (batch, seq_len, n_heads, d_k)
    freqs_cis: torch.Tensor,  # (seq_len, d_k/2)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply RoPE to queries and keys."""
    # View as complex numbers: pairs (x_2i, x_{2i+1}) -> x_2i + j·x_{2i+1}
    xq_complex = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_complex = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Multiply by rotation (complex multiplication = 2D rotation)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)

# --- Verify the relative-distance property on tiny tensors ---
d_k = 4
freqs = precompute_freqs_cis(dim=d_k, max_seq_len=4)
# q and k hold the same vector [1, 0, 1, 0], placed at different positions
q = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])      # (1, 1, 1, 4) at pos 0
k_far = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])  # same vector at pos 2
q_rot, _ = apply_rotary_emb(q, k_far, freqs[0:1])
_, k_rot = apply_rotary_emb(q, k_far, freqs[2:3])
print("Dot product at distance 2:", (q_rot * k_rot).sum().item())

# Now shift both by +1: q at pos 1, k at pos 3
q2 = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])
k2 = torch.tensor([[[[1.0, 0.0, 1.0, 0.0]]]])
q_rot2, _ = apply_rotary_emb(q2, k2, freqs[1:2])
_, k_rot2 = apply_rotary_emb(q2, k2, freqs[3:4])
print("Dot product at distance 2 (shifted):", (q_rot2 * k_rot2).sum().item())
# Both prints agree up to float32 rounding, because only relative distance matters.
```
Output
```text
Dot product at distance 2: 0.5836532115936279
Dot product at distance 2 (shifted): 0.5836531519889832
```

What RoPE changes

| Property | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Position info | Added to embeddings | Added to embeddings | Applied to Q, K only |
| Relative position | Implicit (model must learn) | Implicit | Explicit in dot product |
| Where position enters | Input embedding sum | Input embedding sum | Q/K rotation |
| Direct RoPE transform on V | Not applicable | Not applicable | None |
| Extra parameters | 0 | $L_{\max} \times d$ | 0 |

RoPE directly rotates Q and K, not V. This places its relative-position operation in attention routing rather than applying the same rotation to aggregated values. In deeper layers, V still comes from hidden states that already include context mixed by earlier position-aware attention.

Relative offset isn't a monotonic penalty

RoPE gives attention a structured function of relative offset. Farther tokens aren't guaranteed to receive smaller positional contributions. For a single pair of identical unit vectors, the positional score is:

$$S_{\Delta} = \cos(\Delta\theta)$$

This quantity oscillates as distance $\Delta$ changes. A full head combines multiple frequencies and content-dependent coefficients, so its score can also rise or fall across offsets.

rope-offset-is-not-monotonic.py
```python
import math

theta = 0.5
distances = [1, 2, 4, 6, 8]
scores = [math.cos(distance * theta) for distance in distances]
print(distances)
print([round(score, 3) for score in scores])
print(scores[-1] > scores[-2])  # distance 8 scores higher than distance 6
```
Output
```text
[1, 2, 4, 6, 8]
[0.878, 0.54, -0.416, -0.99, -0.654]
True
```

The RoFormer paper analyzes a long-term decay bound aggregated over RoPE frequencies, and Men et al. (2024) study how the RoPE base affects long-context retrieval behavior.[4][7] Those analyses motivate careful base selection and evaluation. They remain distinct from ALiBi's explicit linear distance penalty.

[Figure: Toy chart contrasting an oscillatory RoPE frequency contribution with an explicit monotonic ALiBi penalty.]
Individual RoPE frequency contributions oscillate. Papers analyze aggregate bounds and measured long-context behavior, which still require task evaluation.

Important caution: Men et al. report that adjusting the RoPE base alone can give "superficial" long-context capability: perplexity may look good while retrieval accuracy degrades.[7] They also derive a lower bound on the base needed for a given context length. A changed RoPE configuration therefore needs retrieval and task evaluation at its deployed length.
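To see what the base controls mechanically, the sketch below computes the wavelength of the slowest rotary pair for two base values. It is an illustration under assumptions: `longest_wavelength` is a hypothetical helper, and the rotary dimension of 128 and both base values are chosen only for the example. A longer wavelength says nothing by itself about retrieval quality, which is the point of the caution above.

```python
import math

def longest_wavelength(d_rotary: int, base: float) -> float:
    """Wavelength in tokens of the slowest RoPE pair, i = d_rotary/2 - 1."""
    slowest_theta = base ** (-(d_rotary - 2) / d_rotary)
    return 2 * math.pi / slowest_theta

d_rotary = 128  # illustrative per-head rotary dimension
for base in (10_000.0, 500_000.0):
    # A larger base slows every pair except the first, stretching the longest wavelength.
    print(int(base), round(longest_wavelength(d_rotary, base)))
```

The larger base stretches the slowest pair over many more tokens, so fewer positions wrap around within a long context. Whether the model actually retrieves information at those distances still has to be measured.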


Method 3: Attention with Linear Biases (ALiBi)

The simplest approach[6]

ALiBi (Attention with Linear Biases) doesn't modify embeddings at all. Instead, it adds a head-specific linear bias directly to attention scores.

Imagine you're at a crowded party. People standing right next to you speak at full volume, but the farther away someone stands, the quieter they sound, linearly. ALiBi works like this: it applies a simple distance penalty to attention scores, so nearby tokens "speak loudly" and distant tokens are naturally muffled. Different attention heads act like listeners with different hearing ranges: some focus on nearby whispers, others can hear across the room.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + B^{(h)}\right) V,\quad B^{(h)}_{ij} = -m_h(i-j)\ \text{for}\ j \le i$$

What the ALiBi formula does

This is the normal attention formula with one modification: before applying softmax, we add a negative bias that gets more negative the farther back a key token sits from the current query. In a causal decoder, the current token gets zero penalty, one token back gets $-m_h$, two tokens back gets $-2m_h$, and so on. A steep slope makes the head short-sighted. A gentle slope lets it see further.

Here, $m_h$ is the slope for head $h$. Future positions still get removed by the causal mask.
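The effect of the slope is easiest to see when every key has the same content score, so that the bias is the only thing separating them. This is a minimal sketch under assumptions: `alibi_weights` is a hypothetical helper, and the raw scores and slopes are toy values.

```python
import math

def alibi_weights(raw_scores: list[float], slope: float) -> list[float]:
    """Softmax over causal keys for the last query; key j sits at distance (i - j)."""
    i = len(raw_scores) - 1
    biased = [score - slope * (i - j) for j, score in enumerate(raw_scores)]
    peak = max(biased)  # subtract the max for a numerically stable softmax
    exps = [math.exp(value - peak) for value in biased]
    total = sum(exps)
    return [value / total for value in exps]

raw = [1.0, 1.0, 1.0, 1.0]  # identical content scores isolate the positional effect
steep = alibi_weights(raw, slope=1 / 2)
gentle = alibi_weights(raw, slope=1 / 256)

print([round(w, 3) for w in steep])
print([round(w, 3) for w in gentle])
```

With the steep slope, weight concentrates on the most recent keys. With the gentle slope, the four weights stay close to uniform.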

Head slopes

For $n$ heads, the paper uses a fixed geometric schedule. For power-of-two head counts, this reduces to a clean sequence like $m_k = 2^{-8k/n}$ for $k = 1, \dots, n$.

For 8 heads ($n = 8$): $m = \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}, \frac{1}{64}, \frac{1}{128}, \frac{1}{256}$.
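The closed form is easy to evaluate for small head counts. A minimal sketch, assuming power-of-two head counts only; `alibi_slopes_power_of_two` is a hypothetical helper and does not cover the other head counts handled by the reference implementation.

```python
def alibi_slopes_power_of_two(n_heads: int) -> list[float]:
    """m_k = 2^(-8k/n) for k = 1..n, for power-of-two head counts."""
    is_power_of_two = n_heads > 0 and n_heads & (n_heads - 1) == 0
    assert is_power_of_two, "this sketch handles power-of-two head counts only"
    return [2.0 ** (-8.0 * k / n_heads) for k in range(1, n_heads + 1)]

print(alibi_slopes_power_of_two(8))
print(alibi_slopes_power_of_two(4))
```

For 8 heads this reproduces the sequence from 1/2 down to 1/256. For 4 heads the ratio between neighbors becomes 1/4, so the schedule still ends at 1/256.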

[Figure: ALiBi chart showing linear distance penalties for steep, medium, and gentle attention-head slopes.]
ALiBi adds distance-dependent penalties before softmax. Different heads get different slopes, so some stay local while others remain long-range.

| Head | Slope $m$ | Effect |
|---|---|---|
| Head 1 | $1/2$ | Strong local attention |
| Head 4 | $1/16$ | Moderate range |
| Head 8 | $1/256$ | Very long-range |

Different heads get different slopes through a fixed schedule, giving the model a spectrum of attention ranges, from very local (steep slope) to nearly global (gentle slope). For non-power-of-two head counts, the reference implementation uses a recursive workaround so slopes remain well spread.[6]

Worked example: a concrete distance penalty

Picture the token "left" at position 2 (counting from 0) attending back to earlier words in "The package left the warehouse." Suppose the raw dot-product score for "package" at position 1 is 2.5. Head 4 has slope $m = 1/16 = 0.0625$.

  • Distance from "left" to "package" = $2 - 1 = 1$
  • Penalty = $-0.0625 \times 1 = -0.0625$
  • Score entering softmax = $2.5 - 0.0625 = 2.4375$

For "The" at position 0, the distance is 2, so the penalty is $-0.0625 \times 2 = -0.125$. With the same raw score of 2.5, the score entering softmax is $2.375$. The farther token is slightly quieter, but not silenced. With Head 1 ($m = 0.5$), the same distances would produce penalties of $-0.5$ and $-1.0$, making the head far more myopic.

worked-example-a-concrete-distance-penalty.py
```python
def alibi_score(raw_score: float, query_pos: int, key_pos: int, slope: float) -> float:
    # Causal distance: how many steps back the key sits from the query.
    distance = max(query_pos - key_pos, 0)
    return raw_score - slope * distance

raw = 4.0
query_pos = 10
key_pos = 6

head_1 = alibi_score(raw, query_pos, key_pos, slope=1 / 2)
head_8 = alibi_score(raw, query_pos, key_pos, slope=1 / 256)

print(head_1, round(head_8, 3))
```
Output
```text
2.0 3.984
```

The following snippet shows how to construct the ALiBi bias tensor for a causal decoder. It follows the reference slope helper, including non-power-of-two head counts, and bakes the causal mask into the returned bias matrix.

worked-example-a-concrete-distance-penalty-2.py
```python
import math
import torch

def get_alibi_slopes_power_of_two(n_heads: int) -> torch.Tensor:
    start = 2 ** (-2 ** -(math.log2(n_heads) - 3))
    ratio = start
    return torch.tensor(
        [start * (ratio ** head) for head in range(n_heads)],
        dtype=torch.float32,
    )

def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    assert n_heads > 0
    if math.log2(n_heads).is_integer():
        return get_alibi_slopes_power_of_two(n_heads)

    closest_power_of_two = 2 ** math.floor(math.log2(n_heads))
    return torch.cat(
        [
            get_alibi_slopes_power_of_two(closest_power_of_two),
            get_alibi_slopes(2 * closest_power_of_two)[0::2][
                : n_heads - closest_power_of_two
            ],
        ]
    )

def build_alibi_bias(
    n_heads: int,
    seq_len: int,
) -> torch.Tensor:
    """Build ALiBi logits bias for a causal decoder."""
    slopes = get_alibi_slopes(n_heads)

    q_pos = torch.arange(seq_len)[:, None]
    k_pos = torch.arange(seq_len)[None, :]

    # Causal distance: current token = 0, one step back = 1, etc.
    relative = (q_pos - k_pos).clamp(min=0).to(torch.float32)
    bias = -slopes[:, None, None] * relative

    causal_mask = k_pos > q_pos
    bias = bias.masked_fill(causal_mask.unsqueeze(0), float("-inf"))

    return bias  # Add directly to attention logits before softmax

# --- Inspect a tiny bias matrix for 4 heads and 4 tokens ---
bias = build_alibi_bias(n_heads=4, seq_len=4)
print(bias.shape)
print(bias[0, 3, :])
print(bias[-1, 3, :])
# Notice how distant tokens get more negative values, and the last head is the most forgiving.
```
Output
```text
torch.Size([4, 4, 4])
tensor([-0.7500, -0.5000, -0.2500, -0.0000])
tensor([-0.0117, -0.0078, -0.0039, -0.0000])
```

What ALiBi demonstrated

In one original-paper experiment, a 1.3 billion parameter ALiBi model is trained at length 1024 and evaluated at length 2048, matching the perplexity of a sinusoidal model trained at length 2048 while using less training time and memory in that setup.[6] The mechanism has several useful properties:

  • Linear biases are monotonic: distant tokens always get penalized more
  • The bias rule is defined for any distance, so there's no learned position table to run out
  • There are no learned position parameters, so nothing needs to be interpolated
  • The fixed slopes create a natural spectrum from local heads to longer-range heads
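The first two properties can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the helper name `alibi_bias_at`, the slope of 1/16, and the training length of 1024 are all assumptions chosen for the example. The point is only that the bias comes from a formula, so a distance far beyond the training window still has a defined, more negative value.

```python
def alibi_bias_at(distance: int, slope: float) -> float:
    """Return the ALiBi logit bias for a non-negative causal distance."""
    if distance < 0:
        raise ValueError("causal distance must be non-negative")
    return -slope * distance

slope = 1 / 16        # illustrative slope (hypothetical choice)
train_length = 1024   # assumed training length for this illustration

# Distances inside and far outside the assumed training window.
distances = [0, 1, 1023, 2047, 100_000]
biases = [alibi_bias_at(d, slope) for d in distances]

for d, b in zip(distances, biases):
    label = "within training length" if d < train_length else "beyond training length"
    print(d, b, label)

# Monotonic: each larger distance gets a bias that is at least as negative.
print(all(a >= b for a, b in zip(biases, biases[1:])))
```

No lookup table appears anywhere in the sketch, which is why there is nothing to extend or interpolate when the sequence grows. Whether the model uses those longer distances well is a separate, empirical question.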

Softmax numerical stability with ALiBi

ALiBi needs no special handling beyond the standard stable softmax. Its biases can be very negative for distant tokens, and in practice softmax is computed with the max-shift trick, which subtracts the largest logit before exponentiating so that no exponent can overflow:

$$\text{softmax}(s_i + b_i) = \frac{e^{s_i + b_i - \max(s + b)}}{\sum_j e^{s_j + b_j - \max(s + b)}}$$

where $s_i$ is the raw attention score and $b_i$ is the ALiBi distance bias for token $i$.

In finite precision, a very negative ALiBi bias can make a distant token's shifted exponential extremely small, and it may even round to zero. Conceptually, ALiBi is still a soft, distance-dependent recency bias, not a hard window. The causal mask remains separate, and every past token's weight stays strictly positive in exact arithmetic.

alibi-bias-is-not-a-mask.py
```python
import math

def softmax(logits: list[float]) -> list[float]:
    maximum = max(logits)
    shifted = [math.exp(value - maximum) for value in logits]
    total = sum(shifted)
    return [value / total for value in shifted]

raw_scores = [2.0, 2.0, 2.0]
distances = [2, 1, 0]
slope = 0.5
biased_scores = [
    score - slope * distance
    for score, distance in zip(raw_scores, distances)
]
weights = softmax(biased_scores)

print([round(weight, 3) for weight in weights])
print(weights[0] > 0)
print("future token uses causal mask separately")
```
Output
```text
[0.186, 0.307, 0.506]
True
future token uses causal mask separately
```
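The finite-precision caveat can be made concrete with the same max-shifted softmax. This is a hedged sketch: the distance of 2000 and the slope of 0.5 are hypothetical values picked so that the shifted exponent falls below what a 64-bit float can represent. Lower-precision formats used in training and serving would reach this point at much smaller distances.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Max-shifted softmax, the standard numerically stable form."""
    maximum = max(logits)
    shifted = [math.exp(value - maximum) for value in logits]
    total = sum(shifted)
    return [value / total for value in shifted]

slope = 0.5          # steep illustrative slope
raw_score = 2.0      # same raw score for both keys, so only distance differs
far_distance = 2000  # hypothetical distance chosen to force underflow

# Two keys: one far in the past, one at the current position.
logits = [raw_score - slope * far_distance, raw_score]
weights = softmax(logits)

# exp(-1000) is positive in exact arithmetic but rounds to 0.0 in float64.
print(weights[0] == 0.0)
print(weights[1] == 1.0)
```

Both lines print `True`. The distant weight is zero only because of rounding, which is different from the causal mask's deliberate `-inf`.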

The extrapolation challenge

Running a model beyond its training length is a recurring practical challenge in production systems:

"Your model was trained on 4K tokens. How does it perform at 32K?"

Context extrapolation comparison chart for learned absolute positions, sinusoidal encodings, RoPE, ALiBi, and scaled RoPE.
Being defined beyond the training window isn't enough. Long-context quality must be checked at the target context length.
| Method | Mechanism beyond training length | Reported evidence and required check |
| --- | --- | --- |
| Sinusoidal | Formula can compute unseen positions | Press et al. report sharp perplexity degradation for their WikiText-103 setup; evaluate your task. |
| Learned absolute | No row exists past configured $L_{\max}$ without changing the table | Extend or replace the table, then train and evaluate. |
| RoPE (unchanged) | Relative rotations continue at unseen offsets | Chen et al. report poor direct extension for their Llama experiments; evaluate retrieval and task quality. |
| ALiBi | Same linear logit-bias rule applies at new distances | Press et al. report 1024 to 2048 extrapolation in their setup. |
| RoPE with PI or YaRN | Modify the rotary geometry, typically with adaptation | Chen et al. report PI up to 32K; Peng et al. report YaRN up to 128K in their evaluated models. |

These are results from different experiments, not a universal ranking. Context length is a model-and-evaluation claim, not a property proved by selecting a positional formula.[6][8][9]

RoPE context extension techniques

When adapting a RoPE-based model for longer contexts, these methods illustrate different interventions:

1. Position Interpolation (PI)

Scale position indices down to fit within the training range.[8]

$$\text{Let } L = \text{train length},\ L' = \text{target length}$$

$$f'(x, m) = f(x, m \cdot L / L')$$

What rescaling the position means

If you trained on 4K tokens but want to handle 16K, you shrink all position indices by 4x so they fit within the range the model already knows. Position 16,000 gets mapped to effective position 4,000. Critically, the rotation frequencies theta themselves stay unchanged. Only the position index that multiplies theta gets rescaled. Chen et al. report extensions to 32K with up to 1,000 fine-tuning steps in their Llama experiments.[8]

position-interpolation-map.py
```python
train_length = 4096
target_length = 16384
scale = train_length / target_length

for target_position in [0, 4096, 8192, 16383]:
    effective_position = target_position * scale
    print(target_position, round(effective_position, 2))
```
Output
```text
0 0.0
4096 1024.0
8192 2048.0
16383 4095.75
```
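To connect the index mapping to the rotations themselves, the sketch below computes RoPE angles with and without Position Interpolation. It is illustrative only: the helper `rope_angles`, the base of 10000, and the tiny head dimension of 8 are assumptions for this example, and the angle formula used is the common `position * base^(-2i / head_dim)` form.

```python
def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angle for each 2D pair: position * base^(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

train_length = 4096
target_length = 16384
scale = train_length / target_length
head_dim = 8  # tiny illustrative head dimension

# Angles at the edge of the trained range serve as a reference bound.
trained_bound = rope_angles(train_length, head_dim)

last_position = target_length - 1
raw_angles = rope_angles(last_position, head_dim)
pi_angles = rope_angles(last_position * scale, head_dim)  # frequencies unchanged

# Without PI, the last target position rotates past the trained range.
print(all(raw > bound for raw, bound in zip(raw_angles, trained_bound)))
# With PI, every angle stays inside the trained range.
print(all(pi < bound for pi, bound in zip(pi_angles, trained_bound)))
```

Only the position argument changes between the two calls, which mirrors the statement that the frequencies stay fixed. The cost is that neighboring tokens are now separated by a quarter of a trained step, which is one reason fine-tuning is part of the reported recipe.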

2. NTK (Neural Tangent Kernel)-aware Scaling

Scale the frequency base instead of positions, trading off between low- and high-frequency components.[9]

$$\theta'_{\text{base}} = \theta_{\text{base}} \cdot s^{d/(d-2)}$$

where $s = L_{\text{target}} / L_{\text{train}}$ is the scale factor and $d$ is the per-head RoPE dimension (the head dimension, not the full hidden size).

What scaling the base changes

Instead of compressing every position, this formulation changes the base so lower-frequency bands are stretched more than high-frequency bands. Peng et al. document this as earlier NTK-aware work while developing YaRN; any quality claim remains dependent on model, target length, adaptation data, and evaluation.[9]
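A short numeric check shows what "stretched more" means per band. This is a sketch under stated assumptions: the original base of 10000, a head dimension of 64, and a scale factor of 4 are illustrative values, and `rope_frequencies` is a hypothetical helper using the common `base^(-2i / d)` frequency definition.

```python
def rope_frequencies(head_dim: int, base: float) -> list[float]:
    """Per-pair RoPE frequencies: base^(-2i / head_dim) for i in [0, head_dim / 2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

head_dim = 64    # illustrative per-head dimension d
base = 10000.0   # assumed original RoPE base
scale = 4.0      # s = L_target / L_train, for example 4K -> 16K

# NTK-aware base change from the formula above.
scaled_base = base * scale ** (head_dim / (head_dim - 2))

original = rope_frequencies(head_dim, base)
adjusted = rope_frequencies(head_dim, scaled_base)
ratios = [new / old for new, old in zip(adjusted, original)]

print(round(ratios[0], 4))    # highest-frequency band
print(round(ratios[-1], 4))   # lowest-frequency band
```

Under these assumptions the highest-frequency band keeps a ratio of 1, while the lowest-frequency band is slowed by approximately a factor of $1/s$, with the bands in between falling smoothly from one to the other. The exponent $d/(d-2)$ is what makes the last band land on $1/s$.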

3. YaRN (Yet another RoPE extensioN)

Combines NTK-by-parts interpolation with attention temperature adjustment.[9] Peng et al. report RoPE-based Llama-family variants evaluated at contexts up to 128K.

The paper's recommended temperature rule for its Llama experiments is $\sqrt{1/t} = 0.1 \ln(s) + 1$, where $s = L_{\text{target}} / L_{\text{train}}$. NTK-by-parts interpolates frequency bands differently; temperature adjustment changes attention scale. Both choices remain hyperparameters to validate on the deployed model.
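The temperature rule is easy to evaluate for a few scale factors. The function name `yarn_attention_scale` and the guard against scale factors below 1 are assumptions of this sketch, not part of the paper's text; the formula itself is the one quoted above.

```python
import math

def yarn_attention_scale(scale: float) -> float:
    """Return sqrt(1/t) = 0.1 * ln(s) + 1 for a scale factor s."""
    if scale < 1:
        # Assumption of this sketch: only context extension (s >= 1) is modeled.
        raise ValueError("scale factor is expected to be at least 1")
    return 0.1 * math.log(scale) + 1.0

for s in [1, 4, 16, 32]:
    sqrt_inv_t = yarn_attention_scale(s)
    temperature = 1.0 / sqrt_inv_t ** 2
    print(s, round(sqrt_inv_t, 4), round(temperature, 4))
```

With no extension ($s = 1$) the factor is exactly 1, and it grows only logarithmically with the scale factor, so even large extensions imply a modest change in attention sharpness.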

RoPE context-extension methods comparison for Position Interpolation, NTK-aware scaling, and YaRN.
RoPE extension methods change different parts of the positional geometry: position index, frequency base, and attention temperature.

Choosing a positional method

| Goal | Candidate mechanism | Validation obligation |
| --- | --- | --- |
| Reproduce original Transformer behavior | Sinusoidal absolute encoding | Confirm quality at trained lengths and any extrapolated lengths used. |
| Use learned finite position slots | Learned absolute embeddings | Define how new positions are initialized and adapted. |
| Encode relative offset inside query-key scores | RoPE | Measure target-length retrieval and generation quality, especially after any scaling. |
| Add an explicit monotonic distance bias | ALiBi | Measure whether its recency bias retains evidence needed by long tasks. |
| Extend an existing RoPE checkpoint | PI or YaRN-style adaptation | Validate short-context regression and target-context behavior after adaptation. |

The implementation choice decides where position enters attention. It still leaves the trained checkpoint to be evaluated on the exact lengths and tasks being served.


Mastery check

Key concepts

| Tier | You should be able to defend |
| --- | --- |
| Foundational | Explain permutation equivariance as the core reason transformers need positional encoding. |
| Intermediate | Derive the sinusoidal positional encoding formula and explain the multi-frequency intuition. |
| Advanced | Show how RoPE rotates Q/K in 2D subspaces so dot products depend on relative distance. |
| Advanced | Explain why RoPE rotates Q and K without directly rotating V. |
| Advanced | Describe ALiBi as a fixed attention bias linear in distance with geometric head slopes. |
| Advanced | Compare extrapolation behavior for learned absolute positions, sinusoidal encodings, ALiBi, vanilla RoPE, and scaled RoPE. |
| Advanced | Explain how PI, NTK-aware scaling, YaRN, and RoPE base changes affect context extension. |

Evaluation rubric

  • Strong: can derive why RoPE makes the attention score depend on relative offset, can explain why V stays unrotated, and can pick a context-extension method plus validation plan for a real deployment.
  • Good: can compare sinusoidal, learned, RoPE, and ALiBi correctly, but still needs help connecting the math to serving tradeoffs.
  • Needs work: still mixes up where position enters the model, or treats a longer configured context window as proof that long-context retrieval will work.


What to remember

Without position encodings, swapping inputs just swaps outputs. Sinusoidal encodings add multi-frequency position fingerprints to token embeddings. RoPE rotates Q and K so the attention dot product can represent relative offset without directly rotating V. ALiBi adds distance penalties directly to attention logits. Long-context claims need validation at the target length, because a method can be mathematically defined for a longer sequence and still fail at retrieval.

Common pitfalls

  • Order still looks like magic: Symptom: you say attention already knows sequence order. Cause: you skipped the permutation-equivariance step. Fix: start from the fact that self-attention is content-based unless a position signal is injected.
  • All methods sound the same: Symptom: sinusoidal, learned, RoPE, and ALiBi blur into one idea. Cause: you aren't tracking where position enters. Fix: separate embeddings, Q/K geometry, and attention-logit bias explicitly.
  • RoPE rotates the wrong tensor: Symptom: you describe applying RoPE rotation to embeddings or V. Cause: you lost the routing-versus-aggregation distinction. Fix: rotate Q and K after projection; leave V without direct RoPE rotation.
  • Long context looks solved on paper: Symptom: perplexity improves but retrieval still fails. Cause: you treated mathematical validity as behavioral proof. Fix: run passkey, retrieval, summarization, and target-task evals at deployed length.
  • Bigger base must be enough: Symptom: config advertises a larger window but distant facts still disappear. Cause: a base change alone doesn't teach the model to use long-range evidence. Fix: pair geometry changes with a rescaling method, long-context data, and task-level checks.
  • Concatenate by default: Symptom: width and projection cost balloon without a clear gain. Cause: you tried to preserve content and position separately at all costs. Fix: add position vectors unless you have a strong reason and budget for a wider architecture.

Going deeper

Why not concatenate positional encodings?

Concatenation would either force the transformer to run at width $2d$ or require an extra projection layer to get back to $d$. If you keep the model width at $2d$, square projection matrices like $W_Q$, $W_K$, and $W_V$ become roughly 4x larger. Addition keeps the width constant while learned projections operate on the resulting combined representation.
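The 4x figure follows from counting weights in a square matrix. The width of 512 below is an arbitrary illustrative value, `square_projection_params` is a hypothetical helper, and bias terms are ignored.

```python
def square_projection_params(width: int) -> int:
    """Weights in one square projection matrix (bias terms ignored)."""
    return width * width

d = 512  # illustrative model width

# W_Q, W_K, and W_V when position is added and the width stays d.
added = 3 * square_projection_params(d)
# The same three matrices when concatenation doubles the width to 2d.
concatenated = 3 * square_projection_params(2 * d)

print(added, concatenated, concatenated // added)
```

This prints `786432 3145728 4`. The ratio is independent of the chosen width, because doubling both dimensions of a square matrix always quadruples its entries.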

RoPE relative derivation

At position $m$, $q^{(m)} = R(m\theta) \cdot q$. At position $n$, $k^{(n)} = R(n\theta) \cdot k$.

$$q^{(m)T} k^{(n)} = q^T R(m\theta)^T R(n\theta) k = q^T R((n-m)\theta) k$$

After rotation, the dot product depends only on the relative offset $(n - m)$, not on absolute positions. This is why RoPE naturally encodes relative position information.
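The identity can be checked numerically for a single 2D pair. This is a minimal sketch in plain Python: the vectors `q` and `k`, the frequency `theta = 0.3`, and the helper names are arbitrary illustrative choices, and a real RoPE layer applies the same rotation independently to every 2D pair of the head dimension.

```python
import math

def rotate(vec: tuple[float, float], angle: float) -> tuple[float, float]:
    """Apply the 2D rotation R(angle) to a vector."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a: tuple[float, float], b: tuple[float, float]) -> float:
    return a[0] * b[0] + a[1] * b[1]

theta = 0.3       # illustrative frequency for one 2D pair
q = (1.0, 2.0)    # arbitrary query pair
k = (0.5, -1.5)   # arbitrary key pair

def rope_score(m: int, n: int) -> float:
    """Dot product of q rotated to position m and k rotated to position n."""
    return dot(rotate(q, m * theta), rotate(k, n * theta))

score_a = rope_score(2, 5)              # n - m = 3
score_b = rope_score(40, 43)            # n - m = 3 at different absolute positions
direct = dot(q, rotate(k, 3 * theta))   # q^T R((n - m) theta) k

print(math.isclose(score_a, score_b))
print(math.isclose(score_a, direct))
print(math.isclose(score_a, rope_score(2, 6)))  # different offset
```

The first two checks print `True` and the last prints `False`: shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it.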

Why is RoPE base size part of a long-context design?

Men et al. derive and evaluate a relationship between RoPE base and effective long-context retrieval, while also showing that perplexity alone can hide retrieval failures.[7] Base size is therefore a parameter to choose and validate alongside adaptation data and task-level long-context evaluations, not a standalone certificate of usable context length.

Next Step
Continue to Layer Normalization: Pre-LN vs Post-LN

Positional methods decide where attention can look; normalization decides whether deep attention blocks train stably as depth grows.

References

[1] Vaswani, A., et al. (2017). Attention Is All You Need.

[2] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

[3] OpenAI (2019). GPT-2 Source Implementation.

[4] Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.

[5] Touvron, H., et al. (2024). The Llama 3 Herd of Models. arXiv preprint.

[6] Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022.

[7] Men, X., et al. (2024). Base of RoPE Bounds Context Length. NeurIPS 2024.

[8] Chen, S., et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation.

[9] Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.