LeetLLM
© 2026 LeetLLM. All rights reserved.


Layer Normalization: Pre-LN vs Post-LN

Understand LayerNorm mechanics, Pre-LN versus Post-LN placement, RMSNorm simplification, gradient stability, and hybrid normalization layouts for deep transformers.


Positional encoding showed how attention knows where tokens sit. Layer normalization controls something different: the scale of the residual stream that carries token features through dozens of blocks.

Imagine a warehouse routing model that decides which shelf a package should visit next. At each processing station, the model adds a small update to the package's feature vector: current location, priority, weight, and destination zip. After fifty stations, poorly scaled updates can compound. Some features may become far larger than others, a downstream nonlinearity such as softmax may saturate, and gradients may weaken. Deep networks face this scale-control risk. Layer Normalization helps manage it, and its placement inside a transformer block changes the optimization behavior.

This article assumes you've seen residual connections before. If you need a quick refresher: a residual block computes output = input + sublayer(input). The input travels along a skip path, and the sublayer (attention or a feed-forward network) adds a small change. Layer Normalization decides whether you normalize the input before the sublayer, or normalize the sum after it.


Why scale drift becomes a training problem

Every residual block adds a change to a hidden state. Residual updates aren't literal multipliers, but a toy multiplicative drift makes the risk easy to see: a scale factor of 1.1 repeated across fifty blocks gives 1.1^50, roughly 117. Backward passes face a related issue because they multiply block Jacobians; the product can become too large or too small.
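A few lines of arithmetic make the toy drift concrete. This is a sketch under the same simplification as the paragraph above: the per-block factors 1.1 and 0.9 are illustrative choices, and `compounded_scale` is a hypothetical helper, not a measurement of any real model.

```python
# Toy multiplicative drift: one scale factor compounded over depth.
# The factors 1.1 and 0.9 are illustrative, not measured from a model.
def compounded_scale(per_block_factor: float, depth: int) -> float:
    return per_block_factor ** depth

growth = compounded_scale(1.1, 50)  # slight amplification in every block
decay = compounded_scale(0.9, 50)   # slight attenuation in every block

print("1.1 ** 50 =", round(growth, 1))
print("0.9 ** 50 =", round(decay, 4))
```

A 10% per-block deviation moves the scale by roughly two orders of magnitude in either direction, which is why small systematic drift matters at depth.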

Layer Normalization helps by re-centering and rescaling each token's hidden state. Think of it as a calibration step. Before learned scale and shift are applied, the normalized vector sits near zero with a spread close to one. The model then learns scale parameters to stretch those values back into the range it needs.


LayerNorm by hand

Before the formula, here is a concrete example. Take a single token's hidden state, a four-dimensional vector:

layernorm-by-hand.py
x = [3.0, 1.0, -1.0, 5.0]

Step 1: compute the mean. (3 + 1 + (-1) + 5) / 4 = 2.0

Step 2: subtract the mean from every entry. [1.0, -1.0, -3.0, 3.0]

Step 3: compute the variance. (1^2 + (-1)^2 + (-3)^2 + 3^2) / 4 = (1 + 1 + 9 + 9) / 4 = 5.0

Step 4: divide by the standard deviation. sqrt(5.0) ≈ 2.236, giving [0.45, -0.45, -1.34, 1.34]

Step 5: scale and shift with learned parameters. If γ = [1.0, 1.0, 1.0, 1.0] and β = [0.0, 0.0, 0.0, 0.0], the output is the same normalized vector. In practice, the model learns the γ and β that help it predict the next token.

That's the entire operation. It happens independently for every token in the sequence. Other tokens in the batch have no role in its statistics, so the same rule applies during training and during autoregressive generation with batch size one.

layernorm-by-hand-2.py
from math import isclose, sqrt

def layer_norm(values, gamma=None, beta=None, eps=1e-5):
    gamma = gamma or [1.0] * len(values)
    beta = beta or [0.0] * len(values)
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [
        gamma_i * ((value - mean) / scale) + beta_i
        for value, gamma_i, beta_i in zip(values, gamma, beta)
    ]

normalized = layer_norm([3.0, 1.0, -1.0, 5.0])
rounded = [round(value, 2) for value in normalized]
assert rounded == [0.45, -0.45, -1.34, 1.34]
assert isclose(sum(normalized), 0.0, abs_tol=1e-12)

The general formula

For a vector x ∈ ℝ^d (a single token's hidden state), LayerNorm computes:[1]

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:

  • μ = (1/d) Σ x_i: mean over the hidden dimension
  • σ² = (1/d) Σ (x_i - μ)²: variance over the hidden dimension
  • γ, β ∈ ℝ^d: learnable scale and shift parameters
  • ε ≈ 10^-5: tiny constant to prevent division by zero

The subtraction centers the values around zero. The division squeezes them to a similar scale. The learned γ (scale) and β (shift) let the model decide whether the normalized values should be stretched, compressed, or offset for the next layer.
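The roles of centering, scaling, γ, and β can be checked numerically. The sketch below is illustrative: the shift of 7 and gain of 10 applied to the input, and the γ = 2, β = 1 values, are arbitrary choices, and `eps` is set to 0.0 only so the invariance holds up to rounding (safe here because the vector is not constant).

```python
from math import sqrt

def layer_norm(values, gamma=None, beta=None, eps=1e-5):
    n = len(values)
    gamma = [1.0] * n if gamma is None else gamma
    beta = [0.0] * n if beta is None else beta
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    scale = sqrt(variance + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(values, gamma, beta)]

x = [3.0, 1.0, -1.0, 5.0]
# Same vector after a uniform offset and a positive gain (arbitrary values).
moved = [10.0 * v + 7.0 for v in x]

base = layer_norm(x, eps=0.0)
same_direction = layer_norm(moved, eps=0.0)
# Learned parameters reintroduce scale and offset after normalization.
stretched = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4, eps=0.0)

print("LayerNorm(x):        ", [round(v, 3) for v in base])
print("LayerNorm(10x + 7):  ", [round(v, 3) for v in same_direction])
print("gamma=2, beta=1:     ", [round(v, 3) for v in stretched])
```

Centering removes the offset and the division removes the gain, so only γ and β decide the scale and shift that the next layer sees.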

Key distinction from BatchNorm

LayerNorm normalizes across the feature dimension for each token independently, while Batch Normalization (BatchNorm) normalizes across the batch dimension. This makes LayerNorm independent of batch size, which is critical for variable-length sequences and autoregressive generation.

| Property | BatchNorm | LayerNorm |
| --- | --- | --- |
| Normalizes across | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Running statistics | Yes (train/eval mismatch) | No |
| Works for variable-length sequences | Awkward | Natural |
| Works for autoregressive decoding | Awkward | Natural |
| Used in modern LLMs | Rare | Routine |

This small program holds one token fixed while changing a neighboring token. The fixed token's LayerNorm output doesn't change; a batch-normalized feature coordinate does.

layernorm-is-per-token.py
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def batch_norm_one_feature(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

token = [1.0, 3.0, 5.0, 7.0]
ln_before = [layer_norm(row) for row in [token, [2.0, 4.0, 6.0, 8.0]]][0]
ln_after = [layer_norm(row) for row in [token, [100.0, 100.0, 100.0, 100.0]]][0]
batch_a = batch_norm_one_feature([token[0], 2.0, 3.0])
batch_b = batch_norm_one_feature([token[0], 2.0, 100.0])

print("LayerNorm token:", [round(value, 3) for value in ln_before])
print("Batch feature with peers=2,3:", round(batch_a[0], 3))
print("Batch feature with peers=2,100:", round(batch_b[0], 3))
assert ln_before == ln_after
assert round(batch_a[0], 3) != round(batch_b[0], 3)
Output
LayerNorm token: [-1.342, -0.447, 0.447, 1.342]
Batch feature with peers=2,3: -1.225
Batch feature with peers=2,100: -0.718

PyTorch implementation

The following code demonstrates a basic PyTorch implementation. It takes an input tensor of shape (batch, seq_len, d_model) and calculates the mean and variance across the final feature dimension. It then applies the learnable scale and shift, returning the normalized tensor.

pytorch-implementation.py
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer Normalization from scratch."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

# Quick check with the hand-worked example
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])  # shape (1, 1, 4)
ln = LayerNorm(d_model=4)
# Fix gamma=1, beta=0 to match the hand calculation
nn.init.constant_(ln.gamma, 1.0)
nn.init.constant_(ln.beta, 0.0)
out = ln(x)
print(out.round(decimals=2).detach())
expected = torch.tensor([[[0.45, -0.45, -1.34, 1.34]]])
assert torch.allclose(out.round(decimals=2), expected)
Output
tensor([[[ 0.4500, -0.4500, -1.3400, 1.3400]]])
[Figure: Layer normalization visualization showing raw input, centered values, normalized values, and final scaled output.]
LayerNorm standardizes one token at a time across its feature values. The same four numbers move through centering, scaling, and optional learned rescaling.

The illustration above traces the same four numbers through each step: raw input, centered, normalized, and final output. Notice how the bars shrink to a controlled range after normalization.


Where you put LayerNorm matters

Now that you know what LayerNorm does, the next question is where to place it inside a transformer block. There are two choices, and the difference is simply whether the normalization happens before or after the residual addition.

The unbounded update problem

  • Post-LN normalizes after the residual update has been added. Its residual path therefore crosses a normalization operation in every block.
  • Pre-LN normalizes before the sublayer update. Its residual path retains an identity contribution to the backward Jacobian.

Post-LN: the original transformer

In this arrangement (used by the original 2017 Transformer and BERT-style encoders), normalization happens after the residual addition:[2][3]

$$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l))$$

In plain terms: compute the sublayer output (attention or FFN), add it to the original input along the skip connection, then normalize the sum. The problem is that normalization sits directly on the main highway, so gradients through the residual path pass through a LayerNorm Jacobian in each Post-LN block.

Pre-LN: the GPT-2 layout

In this arrangement, used by GPT-2, normalization happens before the sublayer:[4]

$$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l))$$

Normalize the input first, feed it into the sublayer, then add the result to a direct copy of the input. The residual path from x_l to x_{l+1} keeps an identity contribution to the backward pass.

At the full-model level, GPT-2 and the Pre-LN architecture analyzed by Xiong et al. apply a final normalization before prediction.[4][5] That final operation normalizes the accumulated residual stream before the output projection. Removing it produces a different architecture whose output scale needs evaluation.
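To see what that final normalization acts on, the sketch below runs a toy Pre-LN stack and measures the residual stream's RMS before and after one last LayerNorm. Everything here is an assumption for illustration: `toy_sublayer` is a hypothetical elementwise stand-in for attention or an FFN, the depth of 24 is arbitrary, and the final norm uses γ = 1 and β = 0.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def rms(values):
    return sqrt(sum(v * v for v in values) / len(values))

def toy_sublayer(values):
    # Hypothetical stand-in for attention or an FFN: a fixed elementwise map.
    return [0.5 * v + 0.1 for v in values]

def pre_ln_block(stream):
    # x + Sublayer(LayerNorm(x)): the stream itself is never normalized here.
    update = toy_sublayer(layer_norm(stream))
    return [s + u for s, u in zip(stream, update)]

stream = [1.0, 2.0, 4.0, 8.0]
start_scale = rms(stream)
for _ in range(24):
    stream = pre_ln_block(stream)

raw_scale = rms(stream)                # accumulated residual stream
final_scale = rms(layer_norm(stream))  # after a final LayerNorm (gamma=1, beta=0)

print("start RMS:", round(start_scale, 2))
print("RMS after 24 Pre-LN blocks:", round(raw_scale, 2))
print("RMS after final LayerNorm:", round(final_scale, 2))
```

In this toy the block-level norms never touch the stream, so its scale keeps accumulating until the final normalization brings it back to unit RMS before the output projection would read it.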

[Figure: Pre-LN and Post-LN transformer block layouts showing where LayerNorm sits relative to the residual addition.]
Post-LN normalizes after the residual add. Pre-LN normalizes the sublayer input while preserving an identity contribution on the residual path.

The diagram on the left shows Pre-LN: LayerNorm cleans the sublayer input first, and the raw input still jumps straight over to the addition. The diagram on the right shows Post-LN: the sublayer and the skip path meet first, then LayerNorm cleans up the result.

Holding a sublayer update fixed isolates the placement difference. Post-LN centers and rescales the updated stream immediately; Pre-LN lets the updated stream continue along the residual path.

pre-post-forward-placement.py
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

x = [1.0, 2.0, 4.0, 8.0]
fixed_update = [0.5, -0.5, 1.0, -1.0]
pre_ln_output = [value + update for value, update in zip(x, fixed_update)]
post_ln_output = layer_norm(pre_ln_output)

print("Pre-LN stream mean:", round(sum(pre_ln_output) / len(pre_ln_output), 3))
print("Post-LN stream mean:", round(sum(post_ln_output) / len(post_ln_output), 3))
print("Post-LN output:", [round(value, 3) for value in post_ln_output])
assert abs(sum(post_ln_output) / len(post_ln_output)) < 1e-12
assert sum(pre_ln_output) / len(pre_ln_output) != 0.0
Output
Pre-LN stream mean: 3.75
Post-LN stream mean: 0.0
Post-LN output: [-0.954, -0.954, 0.53, 1.378]

A two-layer walkthrough

To see how the placement handles a uniform offset, trace a deliberately simplified stream through two blocks. Assume each block's sublayer produces the same update, 0.5, for every feature. Also set LayerNorm's learned scale to γ = 1 and shift to β = 0. A trained attention or feed-forward sublayer would usually produce nonuniform updates.

Post-LN stack

  • Start: x_0 = [1.0, 1.0, 1.0, 1.0]
  • After block 1: x_1 = LayerNorm(x_0 + 0.5) = LayerNorm([1.5, 1.5, 1.5, 1.5]) = [0.0, 0.0, 0.0, 0.0] (mean-centered, then scaled)
  • After block 2: x_2 = LayerNorm(x_1 + 0.5) = LayerNorm([0.5, 0.5, 0.5, 0.5]) = [0.0, 0.0, 0.0, 0.0]

The constant offset is reset at every layer. This toy is intentionally stark: LayerNorm removes a uniform shift, so the residual stream doesn't preserve that offset.

Pre-LN stack

  • Start: x_0 = [1.0, 1.0, 1.0, 1.0]
  • After block 1: x_1 = x_0 + Sublayer(LayerNorm(x_0)) = [1.0, 1.0, 1.0, 1.0] + 0.5 = [1.5, 1.5, 1.5, 1.5]
  • After block 2: x_2 = x_1 + Sublayer(LayerNorm(x_1)) = [1.5, 1.5, 1.5, 1.5] + 0.5 = [2.0, 2.0, 2.0, 2.0]

The running sum flows through the network because Pre-LN leaves the residual path outside the block normalization. The backward equation below shows why this placement includes an identity gradient route.
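The two walkthroughs above can be replayed in code. This is a minimal sketch under the walkthrough's own assumptions: `constant_sublayer` is a hypothetical sublayer that emits 0.5 for every feature regardless of its input, and LayerNorm uses γ = 1 and β = 0.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def constant_sublayer(values):
    # Walkthrough assumption: the update is 0.5 for every feature,
    # whatever the sublayer receives as input.
    return [0.5 for _ in values]

def post_ln_block(x):
    # LayerNorm(x + Sublayer(x))
    summed = [xi + ui for xi, ui in zip(x, constant_sublayer(x))]
    return layer_norm(summed)

def pre_ln_block(x):
    # x + Sublayer(LayerNorm(x))
    update = constant_sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

x0 = [1.0, 1.0, 1.0, 1.0]
post_x2 = post_ln_block(post_ln_block(x0))
pre_x2 = pre_ln_block(pre_ln_block(x0))

print("Post-LN after two blocks:", post_x2)
print("Pre-LN after two blocks:", pre_x2)
```

The Post-LN result is all zeros because a uniform vector has zero variance and its centered values are exactly zero; the Pre-LN result carries both 0.5 updates forward.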

pre-ln-stack.py
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [(value - mean) / scale for value in values]

def sublayer(values):
    # Tiny deterministic stand-in for attention or an FFN update.
    return [0.2 * value + 0.1 for value in values]

def pre_ln_step(values):
    return [value + update for value, update in zip(values, sublayer(layer_norm(values)))]

def post_ln_step(values):
    raw = [value + update for value, update in zip(values, sublayer(values))]
    return layer_norm(raw)

start = [1.0, 2.0, 4.0, 8.0]
pre_after_two = pre_ln_step(pre_ln_step(start))
post_after_two = post_ln_step(post_ln_step(start))

assert sum(abs(value) for value in pre_after_two) > sum(abs(value) for value in start)
assert abs(sum(post_after_two) / len(post_after_two)) < 1e-12

Why Pre-LN can train more stably

Post-LN gradient flow is like sending a package through a long sorting line where every station re-labels it before passing it on. Each relabeling step slightly changes the signal. After eighty stations, the original priority mark is hard to track. Pre-LN is like keeping a clean bypass label that travels beside the package while each station adds useful updates without disrupting the main path.

Post-LN gradient problem

The key result from Xiong et al. is that Post-LN produces especially large expected gradients near the output layers at initialization under their analysis.[5] For one block, the backward Jacobian is:

$$\frac{\partial x_{l+1}}{\partial x_l} = J_{\text{LN},l} \cdot (I + J_{F,l})$$

where J_LN,l is the LayerNorm Jacobian and J_F,l is the sublayer Jacobian. Across many layers, the backward pass multiplies many such terms together. In practice that means:

  • Gradient scale becomes uneven across depth. Early layers may receive tiny updates while late layers receive huge ones.
  • Top layers can receive very large updates at initialization. A single step with a moderately high learning rate can push weights far from a good basin.
  • Learning-rate warmup addresses this risk in the evaluated Post-LN setups. Start the learning rate near zero and ramp it up so early updates are smaller.
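The warmup idea in the last bullet can be sketched as a small schedule function. This is one simple linear form, not the schedule from any cited paper: `warmup_lr`, the peak rate of 1e-3, and the four warmup steps are hypothetical, and real schedules usually add a decay phase after the ramp.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear ramp toward peak_lr over warmup_steps, then hold constant."""
    if warmup_steps <= 0:
        return peak_lr
    # step is zero-based, so the first update uses peak_lr / warmup_steps.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

# Illustrative values only: peak rate 1e-3 reached after 4 steps.
schedule = [warmup_lr(step, peak_lr=1e-3, warmup_steps=4) for step in range(6)]
print([round(lr, 6) for lr in schedule])
```

The earliest updates, where Post-LN gradients near the output are reported to be largest, are the ones multiplied by the smallest learning rate.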

Pre-LN gradient advantage

For Pre-LN, the Jacobian is:

$$\frac{\partial x_{l+1}}{\partial x_l} = I + J_{F,l} \cdot J_{\text{LN},l}$$

That leading identity term I is the key difference. Every block includes a direct residual contribution to the gradient, so the backward signal avoids relying entirely on repeated normalization Jacobians. Xiong et al. report that, in their evaluated tasks:

  • Pre-LN trains without learning-rate warmup in experiments where Post-LN needed it for stable optimization.[5]
  • Its initialization-time gradients are better behaved in their analysis.

This scalar toy isn't a training experiment. It only shows how putting a normalization factor on the identity path changes a product across blocks.

jacobian-route-toy.py
depth = 12
j_layer_norm = 0.75
j_sublayer = 0.10

post_block = j_layer_norm * (1.0 + j_sublayer)
pre_block = 1.0 + j_sublayer * j_layer_norm
post_path = post_block ** depth
pre_path = pre_block ** depth

print("one block: post=", round(post_block, 3), "pre=", round(pre_block, 3))
print("twelve-block product: post=", round(post_path, 3), "pre=", round(pre_path, 3))
assert post_path < 1.0
assert pre_path > 1.0
Output
one block: post= 0.825 pre= 1.075
twelve-block product: post= 0.099 pre= 2.382
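The scalar toy collapses each Jacobian to one number. To see why the LayerNorm Jacobian is not an identity map, the sketch below estimates it by central differences for the hand-worked vector. `numerical_jacobian` is a hypothetical helper written for this illustration, and the step size of 1e-6 is an arbitrary choice.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def numerical_jacobian(fn, point, step=1e-6):
    """Central-difference estimate: entry [i][j] is d fn(x)[i] / d x[j]."""
    size = len(point)
    jac = [[0.0] * size for _ in range(size)]
    for j in range(size):
        plus = list(point)
        minus = list(point)
        plus[j] += step
        minus[j] -= step
        f_plus, f_minus = fn(plus), fn(minus)
        for i in range(size):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return jac

x = [3.0, 1.0, -1.0, 5.0]
jac = numerical_jacobian(layer_norm, x)

# Response to a uniform shift of every feature (the all-ones direction).
uniform_response = [sum(row) for row in jac]
# Response to rescaling the input (moving along x itself).
rescale_response = [sum(r * xj for r, xj in zip(row, x)) for row in jac]
diagonal = [jac[i][i] for i in range(len(x))]

print("diagonal:", [round(v, 3) for v in diagonal])
print("uniform-shift response:", [round(v, 6) for v in uniform_response])
print("rescale response:", [round(v, 6) for v in rescale_response])
```

Both responses are close to zero: LayerNorm discards gradient components along the uniform-shift and rescaling directions, and its diagonal entries are well below one. In Post-LN this map multiplies the whole residual route, while in Pre-LN it multiplies only the sublayer branch.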

A reported Pre-LN risk: representation collapse

Pre-LN isn't free of trade-offs. Analyses including ResiDual report that its residual stream can dominate newer sublayer updates as depth increases. Under that behavior, late layers make smaller relative changes to the hidden representation.[6]

This reported behavior is called representation collapse. Treat it as a design risk to measure, not a guarantee for every Pre-LN model or training run.
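One way to measure that risk is to track the size of each block's update relative to the stream it is added to. The sketch below is a toy, not a reproduction of the ResiDual analysis: `toy_sublayer` is a hypothetical elementwise map whose updates always point along the stream, which makes the stream grow about as fast as it can.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def l2_norm(values):
    return sqrt(sum(v * v for v in values))

def toy_sublayer(values):
    # Hypothetical stand-in: its output norm is bounded because its
    # input is always a normalized vector.
    return [0.5 * v + 0.1 for v in values]

stream = [1.0, 2.0, 4.0, 8.0]
relative_updates = []
for _ in range(32):
    update = toy_sublayer(layer_norm(stream))
    relative_updates.append(l2_norm(update) / l2_norm(stream))
    stream = [s + u for s, u in zip(stream, update)]

for block in (0, 7, 15, 31):
    print(f"block {block + 1:2d}: update/stream = {relative_updates[block]:.3f}")
```

The update norm stays roughly constant while the stream norm keeps growing, so the ratio shrinks with depth. Logging the same ratio per block in a real model is a direct way to check whether late layers still change the representation.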

Here is the practical summary:

| Property | Post-LN | Pre-LN |
| --- | --- | --- |
| Warmup result in Xiong et al. | Needed in evaluated stable runs | Removed in evaluated stable runs |
| Initialization gradients in Xiong et al. | Large near output layers | Better behaved |
| Reported deep-layer risk | Gradient flow can become difficult | Residual stream can dominate later updates |
| Representative architecture | Original Transformer, BERT | GPT-2 |

There is no universal winner. Xiong et al. demonstrate the Pre-LN optimization advantage in their evaluated tasks, while DeepNorm later scales a Post-LN-derived design to a 1,000-layer machine-translation experiment using residual scaling and matching initialization.[5][7]
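The residual-scaling idea can be sketched as a Post-LN block that multiplies the skip path by a constant before the addition, LayerNorm(α·x + Sublayer(x)). The code below shows only that form: `alpha = 4.0` is an arbitrary illustrative value, `toy_sublayer` is hypothetical, and the depth-dependent constants and matching initialization from the DeepNet study are not reproduced here.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def toy_sublayer(values):
    # Hypothetical update that is not aligned with its input.
    return [0.5 * v for v in reversed(values)]

def scaled_post_ln_block(x, alpha):
    # LayerNorm(alpha * x + Sublayer(x)); alpha = 1.0 is plain Post-LN.
    summed = [alpha * xi + ui for xi, ui in zip(x, toy_sublayer(x))]
    return layer_norm(summed)

def distance(a, b):
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x = [1.0, 2.0, 4.0, 8.0]
reference = layer_norm(x)  # what the block would emit with no update at all
dist_plain = distance(scaled_post_ln_block(x, alpha=1.0), reference)
dist_scaled = distance(scaled_post_ln_block(x, alpha=4.0), reference)

print("change from the update, alpha=1:", round(dist_plain, 3))
print("change from the update, alpha=4:", round(dist_scaled, 3))
```

Up-weighting the skip path makes each block's normalized output depend less on the sublayer update, which is the lever this family of designs uses to bound per-block change in very deep Post-LN stacks.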


Common mistakes and how to debug them

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Model diverges at step 1 with loss = NaN | For a Post-LN run, an overly large early update is one hypothesis; Xiong et al. connect Post-LN initialization gradients to warmup sensitivity. | Inspect layerwise gradient norms and learning-rate schedule. Compare a warmer start, smaller initial rate, or a Pre-LN variant under the same setup. |
| One normalization placement underperforms another | Placement interacts with depth, initialization, optimizer, data, and objective. | Treat placement as an ablation, then compare training loss, gradient profile, and downstream quality under matched conditions. |
| A Pre-LN implementation differs sharply from a reference model | A final normalization used by the target Pre-LN architecture may be absent. | Check the architecture definition and restore its final LayerNorm or RMSNorm before the output projection when applicable.[5] |
| Attention becomes nearly one-hot during unstable training | Query-key dot products may have grown enough to saturate softmax. | Inspect attention-logit statistics; QK-Norm or softmax capping are targeted candidates to evaluate in such runs. |
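For the last row, a quick diagnostic is to watch the largest attention probability and the entropy of the softmax as logit scale grows. The sketch below uses made-up logits and scale factors purely to illustrate the saturation pattern; `softmax` and `entropy` are small helpers written for this example.

```python
from math import exp, log

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0.0)

base_logits = [1.0, 0.5, 0.0, -0.5]  # illustrative attention logits
max_probs = []
entropies = []
for scale in (1.0, 5.0, 25.0):
    probs = softmax([scale * v for v in base_logits])
    max_probs.append(max(probs))
    entropies.append(entropy(probs))
    print(f"scale={scale:5.1f}  max prob={max(probs):.4f}  entropy={entropy(probs):.4f}")
```

As the logits scale up, the distribution approaches one-hot and its entropy approaches zero. Logging these two statistics per head is one way to confirm the "nearly one-hot" symptom before trying QK-Norm or capping.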

RMSNorm: scale-only normalization

RMSNorm (Root Mean Square Layer Normalization) replaces LayerNorm's centered standard deviation with a root-mean-square scale. Zhang and Sennrich propose it as a cheaper normalization that retains rescaling invariance while removing mean-centering; their evaluated models achieve comparable quality to LayerNorm with run-time reductions that vary by setup.[8] Qwen2.5 and Gemma 2 reports provide later architecture examples using RMSNorm.[9][10]

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$

Instead of centering (subtracting the mean) and then scaling (dividing by standard deviation), RMSNorm just divides by the root-mean-square. It doesn't subtract the mean, and many implementations also omit a learned bias term.

| Step | LayerNorm | RMSNorm |
| --- | --- | --- |
| Compute mean | Yes | No |
| Subtract mean | Yes | No |
| Scale statistic | Centered standard deviation | Root mean square |
| Divide by scale statistic | Yes | Yes |
| Learned scale γ | Yes | Yes |
| Learned shift β | Yes | Often omitted |
[Figure: LayerNorm and RMSNorm comparison showing that LayerNorm centers and rescales each token, while RMSNorm keeps only the magnitude-based rescaling step.]
RMSNorm rescales each token by its root mean square while omitting LayerNorm's mean-subtraction step.

The original RMSNorm paper reports comparable quality to LayerNorm on its evaluated tasks and observed run-time reductions from 7% to 64% across different models and implementations.[8] Fused kernels, hardware, and the fraction of time spent in normalization determine the result in another stack, so those measurements are evidence from the paper rather than a promised speedup.

rmsnorm-a-common-modern-shortcut.py
from math import sqrt

def rms_norm(values, weight=None, eps=1e-6):
    weight = weight or [1.0] * len(values)
    rms = sqrt(sum(value * value for value in values) / len(values) + eps)
    return [value * scale / rms for value, scale in zip(values, weight)]

values = [3.0, 1.0, -1.0, 5.0]
normalized = rms_norm(values)

assert round(sqrt(sum(value * value for value in normalized) / len(normalized)), 6) == 1.0
assert round(sum(normalized) / len(normalized), 6) != 0.0
rmsnorm-a-common-modern-shortcut-2.py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Quick sanity check
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])
rms = RMSNorm(d_model=4)
out = rms(x)
print(out.round(decimals=4).detach())
feature_rms = torch.sqrt(out.pow(2).mean(dim=-1))
assert torch.allclose(feature_rms, torch.ones_like(feature_rms), atol=1e-5)
assert not torch.allclose(out.mean(dim=-1), torch.zeros_like(out.mean(dim=-1)))
Output
tensor([[[ 1.0000, 0.3333, -0.3333, 1.6667]]])
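The relationship between the two norms can be checked directly: if the input is centered by hand first, RMSNorm gives the same result as LayerNorm. The sketch below assumes unit gain, no bias, and the same `eps` in both functions so the comparison is like for like.

```python
from math import isclose, sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def rms_norm(values, eps=1e-5):
    mean_square = sum(v * v for v in values) / len(values)
    return [v / sqrt(mean_square + eps) for v in values]

x = [3.0, 1.0, -1.0, 5.0]
mean = sum(x) / len(x)
centered = [v - mean for v in x]

ln_out = layer_norm(x)
rms_on_centered = rms_norm(centered)  # centering done manually beforehand
rms_on_raw = rms_norm(x)              # mean left in place

print("LayerNorm(x):       ", [round(v, 3) for v in ln_out])
print("RMSNorm(x - mean):  ", [round(v, 3) for v in rms_on_centered])
print("RMSNorm(x):         ", [round(v, 3) for v in rms_on_raw])
```

The first two rows agree, and the third differs because the nonzero mean contributes to the root mean square. Mean subtraction is the only step RMSNorm drops; whether that matters in a given model is an empirical question.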

Reported normalization layouts

Once you understand Pre-LN and Post-LN, advanced variants become easier to read. These entries summarize specific reports; their experiments differ in model size, objective, and optimization setup.

| Recipe | Reported example | What changes |
| --- | --- | --- |
| Post-LN | BERT | Normalize after the residual addition |
| Pre-LN | GPT-2 | Normalize before attention and FFN, retaining an identity residual contribution |
| Pre + Post norm | Gemma 2 | Normalize both input and output of each sublayer with RMSNorm[10] |
| DeepNorm | DeepNet study | Modify Post-LN with a scaled residual and matched initialization; evaluated up to 1,000 layers[7] |
| QK-Norm (Query-Key Normalization) | Rybakov et al. study | Normalize query and key vectors before their dot product; evaluated alone and with softmax capping[11] |

No row establishes a globally superior recipe. Peri-LN names an input-and-output normalization layout and reports experiments up to 3.2B parameters; Rybakov et al. evaluate attention-specific normalization and capping on an 830M-parameter model driven into instability with high learning rates.[12][11]

This illustration uses RMS rescaling to expose the mechanism: multiplying a query by twenty changes the raw dot-product logit, but rescaling both query and key by their RMS removes that scale-only change.

qk-norm-controls-dot-scale.py
from math import isclose, sqrt

def rms_rescale(values):
    rms = sqrt(sum(value * value for value in values) / len(values))
    return [value / rms for value in values]

def attention_logit(query, key):
    return sum(q * k for q, k in zip(query, key)) / sqrt(len(query))

query = [6.0, -3.0, 2.0, 1.0]
key = [4.0, -2.0, 1.0, 3.0]
large_query = [20.0 * value for value in query]

raw = attention_logit(query, key)
raw_large = attention_logit(large_query, key)
controlled = attention_logit(rms_rescale(query), rms_rescale(key))
controlled_large = attention_logit(rms_rescale(large_query), rms_rescale(key))

print("raw logits:", round(raw, 3), round(raw_large, 3))
print("RMS-rescaled logits:", round(controlled, 3), round(controlled_large, 3))
assert raw_large == 20.0 * raw
assert isclose(controlled, controlled_large, rel_tol=1e-12)
Output
raw logits: 17.5 350.0
RMS-rescaled logits: 1.807 1.807
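Capping takes a different route from QK-Norm: it leaves the vectors alone and bounds the logit itself. The sketch below uses a tanh-based soft cap, which is one common form. Treating it as the variant evaluated in the cited study is an assumption, and the cap value of 50 is illustrative.

```python
from math import tanh

def soft_cap(logit: float, cap: float) -> float:
    # Smoothly squashes the logit into (-cap, cap); near-linear for small inputs.
    return cap * tanh(logit / cap)

raw_logits = [17.5, 350.0]  # the two raw logits from the example above
capped = [soft_cap(value, cap=50.0) for value in raw_logits]

print("raw logits:   ", raw_logits)
print("capped logits:", [round(value, 2) for value in capped])
```

The moderate logit passes through almost unchanged while the extreme one is pinned just under the cap, so the softmax input range stays bounded regardless of how large the query and key norms grow.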

Peri-LN and hybrids

Kim et al. use Peri-LN for normalization placed both before and after each sublayer, and report more balanced variance growth and steadier gradients than their compared layouts in experiments up to 3.2B parameters.[12] Gemma 2's technical report states that it applies RMSNorm to both the input and output of each transformer sublayer; this matches an input-and-output normalization layout, although that report uses no Peri-LN label.[10]
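An input-and-output layout can be sketched by wrapping the sublayer in a norm on both sides before the residual addition. This is a toy under stated assumptions, not the implementation from either report: gains are fixed at 1, RMS rescaling stands in for the learned norm, and `loud_sublayer` is a hypothetical sublayer with a deliberately oversized output.

```python
from math import sqrt

def rms(values):
    return sqrt(sum(v * v for v in values) / len(values))

def rms_norm(values, eps=1e-6):
    scale = sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / scale for v in values]

def loud_sublayer(values):
    # Hypothetical sublayer whose output scale is far too large.
    return [100.0 * v for v in reversed(values)]

def pre_norm_update(x):
    # Input normalization only: Sublayer(Norm(x)).
    return loud_sublayer(rms_norm(x))

def peri_style_update(x):
    # Input and output normalization: Norm(Sublayer(Norm(x))).
    return rms_norm(loud_sublayer(rms_norm(x)))

x = [1.0, 2.0, 4.0, 8.0]
pre_update = pre_norm_update(x)
peri_update = peri_style_update(x)
peri_output = [xi + ui for xi, ui in zip(x, peri_update)]

print("update RMS, input norm only:  ", round(rms(pre_update), 3))
print("update RMS, input+output norm:", round(rms(peri_update), 3))
print("block output:", [round(v, 3) for v in peri_output])
```

With only an input norm, the size of the update is whatever the sublayer produces. The output norm bounds what gets added to the residual stream, and in a real model a learned gain would then set that size per feature.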


Practice: build a toggle-switch transformer block

A practical way to internalize the difference is to write one block that can switch between Post-LN and Pre-LN. Below is a minimal PyTorch module. Run it, initialize stacks of each type, and compare the gradient norms across blocks at initialization.

practice-build-a-toggle-switch-transformer.py
1import torch 2import torch.nn as nn 3 4class ToggleBlock(nn.Module): 5 """Transformer block with style='pre' or style='post'.""" 6 def __init__(self, d_model: int, style: str = "pre"): 7 super().__init__() 8 assert style in ("pre", "post") 9 self.style = style 10 self.norm = nn.LayerNorm(d_model) 11 # Simplified sublayer: single linear + ReLU 12 self.sublayer = nn.Sequential( 13 nn.Linear(d_model, d_model * 4), 14 nn.ReLU(), 15 nn.Linear(d_model * 4, d_model), 16 ) 17 18 def forward(self, x: torch.Tensor) -> torch.Tensor: 19 if self.style == "pre": 20 # x_l + Sublayer(LayerNorm(x_l)) 21 return x + self.sublayer(self.norm(x)) 22 else: 23 # LayerNorm(x_l + Sublayer(x_l)) 24 return self.norm(x + self.sublayer(x)) 25 26class TinyStack(nn.Module): 27 def __init__(self, d_model: int = 64, depth: int = 6, style: str = "pre"): 28 super().__init__() 29 self.blocks = nn.ModuleList([ToggleBlock(d_model, style) for _ in range(depth)]) 30 self.final_norm = nn.LayerNorm(d_model) if style == "pre" else nn.Identity() 31 self.head = nn.Linear(d_model, 16) 32 33 def forward(self, x: torch.Tensor) -> torch.Tensor: 34 for block in self.blocks: 35 x = block(x) 36 return self.head(self.final_norm(x)) 37 38def inspect_grad(style: str, d_model: int = 64, depth: int = 6, seed: int = 0): 39 torch.manual_seed(seed) 40 model = TinyStack(d_model=d_model, depth=depth, style=style) 41 x = torch.randn(2, 8, d_model) 42 target = torch.randn(2, 8, 16) 43 out = model(x) 44 loss = nn.functional.mse_loss(out, target) 45 loss.backward() 46 grad_norms = [block.sublayer[0].weight.grad.norm().item() for block in model.blocks] 47 rounded_norms = [round(value, 4) for value in grad_norms] 48 print(f"{style:4} | loss={loss.item():.4f} | block grad norms={rounded_norms}") 49 return grad_norms 50 51pre_grad_norms = inspect_grad("pre") 52post_grad_norms = inspect_grad("post") 53 54assert len(pre_grad_norms) == len(post_grad_norms) == 6 55assert all(value > 0 for value in pre_grad_norms + post_grad_norms)
Output
pre  | loss=1.3247 | block grad norms=[0.2664, 0.2565, 0.2494, 0.2392, 0.2472, 0.2415]
post | loss=1.3183 | block grad norms=[0.2663, 0.2614, 0.2619, 0.2555, 0.2684, 0.2698]

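In this run the two profiles are close: every per-block gradient norm falls between roughly 0.24 and 0.27. The profiles shift with depth and seed, so this toy is a diagnostic, not a proof of the Xiong et al. result. The real lesson is how to inspect layerwise gradients instead of treating a diverging model as a mystery.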
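
The same layerwise inspection can be applied to any model, not only this toy. The following is a minimal sketch, assuming a hypothetical helper named grad_norms_by_parameter and a small stand-in nn.Sequential model; neither comes from the article's code.

```python
import torch
import torch.nn as nn


def grad_norms_by_parameter(model: nn.Module) -> dict[str, float]:
    """Return the L2 norm of each parameter's gradient, keyed by parameter name.

    Hypothetical helper for illustration. Call it after loss.backward().
    """
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }


torch.manual_seed(0)
# Stand-in model (an assumption); any nn.Module works once backward() has run.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()

norms = grad_norms_by_parameter(model)
for name, value in norms.items():
    print(f"{name:10} grad_norm={value:.4f}")
```

Logging this dictionary every few hundred steps is one way to see whether gradient scale is concentrated in particular layers before a run diverges.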

Mini exercise

  1. Change depth from 6 to 12. How do the per-block gradient profiles change?
  2. Replace final_norm with nn.Identity() in Pre-LN mode and inspect output scale and loss under several seeds.
  3. Try replacing nn.LayerNorm with the RMSNorm class from earlier. Does the gradient norm change meaningfully?

Mastery check

Key concepts

  • LayerNorm mean, variance, scale, and shift over one token's feature dimension
  • BatchNorm versus LayerNorm behavior during variable-length and batch-size-1 decoding
  • Pre-LN versus Post-LN block equations
  • Gradient stability and warmup pressure in Post-LN
  • Reported representation-collapse risk in deep Pre-LN stacks
  • RMSNorm as scale-only normalization
  • QK-Norm and hybrid norm layouts

Evaluation rubric

  • Foundational: Derives the LayerNorm formula and computes it by hand on a small vector
  • Intermediate: Identifies Pre-LN or Post-LN placement from a diagram or model.py block
  • Advanced: Explains why Post-LN gradients near the output layer can be large at initialization and why warmup helps
  • Advanced: Explains why Pre-LN includes an identity gradient path and why later work measures representation-collapse risk at depth
  • Advanced: Implements RMSNorm and states exactly what it removes from LayerNorm
  • Advanced: Explains why QK-Norm controls attention-logit scale before softmax
  • Advanced: Compares plain Pre-LN, Post-LN, DeepNorm, and Peri-LN-style hybrid layouts

Common pitfalls

"Pre-LN means no final norm is needed"

Symptom: An implementation of a reference Pre-LN model shows unexpected output-scale behavior. Cause: The final normalization specified by that architecture may have been removed. Fix: Match the reference architecture first; GPT-2 and the Pre-LN setup analyzed by Xiong et al. include a final normalization before prediction.
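
To see what the final normalization does, it can help to measure per-token scale before and after it. This is a rough sketch under stated assumptions: the stack below is a stand-in built from LayerNorm plus a small MLP (no attention), and d_model and depth are illustrative values, not settings from GPT-2 or Xiong et al.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, depth = 64, 12  # illustrative sizes (assumptions)

# One LayerNorm and one MLP sublayer per Pre-LN block.
norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
mlps = nn.ModuleList([
    nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.ReLU(),
        nn.Linear(4 * d_model, d_model),
    )
    for _ in range(depth)
])
final_norm = nn.LayerNorm(d_model)

x = torch.randn(2, 8, d_model)
with torch.no_grad():
    for norm, mlp in zip(norms, mlps):
        # Pre-LN update: only the sublayer input is normalized,
        # so the residual stream itself keeps accumulating unnormalized updates.
        x = x + mlp(norm(x))
    # Per-token standard deviation over the feature dimension.
    raw_std = x.std(dim=-1, unbiased=False)
    normed_std = final_norm(x).std(dim=-1, unbiased=False)

print("before final norm:", raw_std.mean().item())
print("after final norm: ", normed_std.mean().item())
```

With the default affine parameters, the final LayerNorm brings every token back to roughly unit standard deviation, whatever scale the residual stream reached. A prediction head placed directly on the raw stream sees that accumulated scale instead.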

"RMSNorm is just a faster name for LayerNorm"

Symptom: You describe RMSNorm as if it still subtracts the mean and uses the same statistics. Cause: RMSNorm keeps scale control but removes centering. Fix: State the exact difference: LayerNorm centers and rescales; RMSNorm only rescales by root-mean-square magnitude.
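
The difference can be checked numerically. The sketch below assumes a hypothetical SimpleRMSNorm module written for this comparison (it may differ in detail from the RMSNorm class earlier in the article) and an illustrative eps of 1e-6.

```python
import torch
import torch.nn as nn


class SimpleRMSNorm(nn.Module):
    """Illustrative RMSNorm: rescale by root-mean-square, no centering, no bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned scale only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight


torch.manual_seed(0)
d_model = 8
# Shift the features so every token has a clearly non-zero mean.
x = torch.randn(2, 4, d_model) + 3.0

ln_out = nn.LayerNorm(d_model)(x)
rms_out = SimpleRMSNorm(d_model)(x)

ln_mean = ln_out.mean(dim=-1)                 # centered: close to zero
rms_mean = rms_out.mean(dim=-1)               # not centered: the offset survives
rms_scale = rms_out.pow(2).mean(dim=-1).sqrt()  # rescaled: close to one

print("LayerNorm per-token mean:", ln_mean.abs().max().item())
print("RMSNorm per-token mean:  ", rms_mean.mean().item())
print("RMSNorm per-token RMS:   ", rms_scale.mean().item())
```

LayerNorm removes the per-token offset and RMSNorm does not, while both control magnitude. That is the whole difference to state: no mean subtraction and no shift term.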

"Warmup matters only because models are deep"

Symptom: A Post-LN run diverges and you only blame model size or optimizer choice. Cause: Post-LN can create especially large top-layer gradients at initialization, even before later training dynamics matter. Fix: Tie warmup reasoning back to normalization placement and layerwise gradient scale, not depth alone.
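
Warmup addresses this by keeping early updates small while initial gradients are large. Below is a minimal linear-warmup sketch using torch.optim.lr_scheduler.LambdaLR; the helper name linear_warmup_factor, the base learning rate, and the 100-step warmup length are all assumptions chosen for illustration, not values from any cited paper.

```python
import torch
import torch.nn as nn


def linear_warmup_factor(step: int, warmup_steps: int) -> float:
    """Multiplier on the base learning rate: ramps linearly up to 1, then stays at 1."""
    return min(1.0, (step + 1) / warmup_steps)


model = nn.Linear(16, 16)  # stand-in model (assumption)
base_lr = 1e-3
warmup_steps = 100  # illustrative value
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# LambdaLR sets lr = base_lr * lr_lambda(current_step).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: linear_warmup_factor(step, warmup_steps)
)

lrs = []
for step in range(150):
    lrs.append(optimizer.param_groups[0]["lr"])
    # A real training loop would compute a loss and call loss.backward() here.
    optimizer.step()
    scheduler.step()

print(f"first lr={lrs[0]:.2e}, lr at end of warmup={lrs[warmup_steps - 1]:.2e}")
```

The first updates use a small fraction of the base learning rate, so large top-layer gradients at initialization translate into proportionally smaller parameter changes.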

"QK-Norm replaces residual normalization"

Symptom: Attention logits are controlled, but another instability remains. Cause: QK-Norm targets query-key scale before softmax rather than every residual-path dynamic. Fix: Treat QK-Norm as an attention-specific intervention and separately evaluate the broader normalization layout.
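
The scope of QK-Norm is easier to see in code: it touches only the query-key product. This is a sketch of one common form, assuming L2 normalization over the head dimension and a fixed hypothetical logit_scale; real implementations vary and often use a learned scale or a LayerNorm/RMSNorm on queries and keys instead.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 2, 5, 16

# Deliberately large-magnitude queries and keys to mimic logit growth.
q = torch.randn(batch, heads, seq_len, head_dim) * 20.0
k = torch.randn(batch, heads, seq_len, head_dim) * 20.0

# Standard scaled dot-product logits grow with the norms of q and k.
plain_logits = q @ k.transpose(-2, -1) / head_dim ** 0.5

# QK-Norm sketch: unit-normalize q and k per head, then apply a scale.
logit_scale = 10.0  # hypothetical fixed value (assumption)
q_hat = F.normalize(q, dim=-1)
k_hat = F.normalize(k, dim=-1)
qk_norm_logits = logit_scale * (q_hat @ k_hat.transpose(-2, -1))

plain_max = plain_logits.abs().max().item()
qk_norm_max = qk_norm_logits.abs().max().item()
attn = torch.softmax(qk_norm_logits, dim=-1)

print(f"max |logit| without QK-Norm: {plain_max:.1f}")
print(f"max |logit| with QK-Norm:    {qk_norm_max:.1f}")
```

Because the normalized dot product is a cosine similarity, the logits are bounded by logit_scale in magnitude regardless of how large q and k become. Nothing here changes the residual stream, which is why the broader normalization layout still needs its own evaluation.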

Next Step
Continue to Mechanistic Interpretability and Sparse Autoencoders

Layer normalization keeps the residual stream numerically stable. The next chapter opens that residual stream and shows how sparse autoencoders turn opaque activation vectors into readable features you can inspect, test, and sometimes steer.

References

Layer Normalization.

Ba, J. L., Kiros, J. R., & Hinton, G. E. · 2016

Attention Is All You Need.

Vaswani, A., et al. · 2017

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Devlin, J., et al. · 2019 · NAACL 2019

Language Models are Unsupervised Multitask Learners.

Radford, A., et al. · 2019

On Layer Normalization in the Transformer Architecture.

Xiong, R., et al. · 2020 · ICML 2020

ResiDual: Transformer with Dual Residual Connections.

Xie, Z., et al. · 2023 · ACL 2023

DeepNet: Scaling Transformers to 1,000 Layers.

Wang, H., Ma, S., et al. · 2022

Root Mean Square Layer Normalization.

Zhang, B. & Sennrich, R. · 2019 · NeurIPS 2019

Qwen2.5 Technical Report.

Qwen Team · 2024

Gemma 2: Improving Open Language Models at a Practical Size.

Gemma Team, Google DeepMind · 2024

QK-Norm: Improving Transformer Training Stability.

Rybakov, O., et al. · 2024

Peri-LN: Revisiting Layer Normalization in the Transformer Architecture.

Kim, J., et al. · 2025 · ICML 2025