

Layer Normalization: Pre-LN vs Post-LN

Understand LayerNorm mechanics, Pre-LN versus Post-LN placement, RMSNorm simplification, gradient stability, and hybrid normalization layouts for deep transformers.


Positional encoding showed how attention knows where tokens sit. Layer normalization controls a different part of the transformer: the scale of the residual stream that carries token features through dozens of blocks.

A transformer block repeatedly updates one token's hidden-state vector. Each layer adds attention and MLP outputs to the same representation. After fifty layers, poorly scaled updates can compound. Some features may become far larger than others; a downstream softmax may become sharply saturated and gradients may weaken. Deep networks face this scale-control risk. Layer Normalization helps manage it, and its placement inside a transformer block changes the optimization behavior.

This lesson assumes you've seen residual connections before. As a quick refresher: a residual block computes output = input + sublayer(input). The input travels along a skip path, and the sublayer (attention or a feed-forward network) adds a small change to it. The placement of Layer Normalization determines whether you normalize the input before the sublayer or normalize the sum after it.

Why scale drift becomes a training problem

Every residual block adds a change to a hidden state. Residual updates aren't literal multipliers, but a toy multiplicative drift makes the risk easy to see: a scale factor of 1.1 repeated across fifty blocks gives 1.1^50, roughly 117. During backpropagation, gradients face a related issue because they multiply block Jacobians; their product can become too large or too small.
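The toy arithmetic above is easy to reproduce. This is a minimal sketch, assuming a hypothetical `compound` helper that isn't part of any library; it repeats one scalar factor and doesn't model a real residual block or a real Jacobian product.

```python
def compound(factor: float, depth: int) -> float:
    """Repeat one multiplicative factor across `depth` blocks (toy model only)."""
    result = 1.0
    for _ in range(depth):
        result *= factor
    return result

growth = compound(1.1, 50)  # slightly too large at every block
decay = compound(0.9, 50)   # slightly too small at every block

print("1.1 repeated over 50 blocks:", round(growth, 1))
print("0.9 repeated over 50 blocks:", round(decay, 4))
```

A factor only ten percent away from one in either direction ends up two orders of magnitude away from one after fifty repetitions, which is the intuition behind both exploding and vanishing products.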

Layer Normalization helps by re-centering and rescaling each token's hidden state. Before the learned scale and shift are applied, the normalized vector has mean zero and a standard deviation close to one. The model then learns scale and shift parameters that move those values into the range the next layer needs.

LayerNorm by hand

Before the formula, start with one concrete token hidden state:

layernorm-by-hand.py

x = [3.0, 1.0, -1.0, 5.0]

Step 1: compute the mean: (3 + 1 + (-1) + 5) / 4 = 2.0

Step 2: subtract the mean from every entry: [1.0, -1.0, -3.0, 3.0]

Step 3: compute the variance: (1^2 + (-1)^2 + (-3)^2 + 3^2) / 4 = (1 + 1 + 9 + 9) / 4 = 5.0

Step 4: divide by the standard deviation, sqrt(5.0) ≈ 2.236: [0.45, -0.45, -1.34, 1.34]

Step 5: scale and shift with learned parameters: If γ = [1.0, 1.0, 1.0, 1.0] and β = [0.0, 0.0, 0.0, 0.0], the output is the same normalized vector. In practice, the model learns the γ and β that help it predict the next token.

That's the entire operation. It happens independently for every token in the sequence. Other tokens, whether in the same sequence or elsewhere in the batch, play no role in its statistics, so the same rule applies during training and during autoregressive generation with batch size one.

layernorm-by-hand-2.py

from math import isclose, sqrt

def layer_norm(values, gamma=None, beta=None, eps=1e-5):
    if gamma is None:
        gamma = [1.0] * len(values)
    if beta is None:
        beta = [0.0] * len(values)
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [
        gamma_i * ((value - mean) / scale) + beta_i
        for value, gamma_i, beta_i in zip(values, gamma, beta)
    ]

normalized = layer_norm([3.0, 1.0, -1.0, 5.0])
rounded = [round(value, 2) for value in normalized]
assert rounded == [0.45, -0.45, -1.34, 1.34]
assert isclose(sum(normalized), 0.0, abs_tol=1e-12)
assert layer_norm([3.0, 1.0], gamma=[0.0, 0.0]) == [0.0, 0.0]

The general formula

For a vector x ∈ ℝ^d (a single token's hidden state), LayerNorm computes:^{[1]Reference 1Layer Normalization.https://arxiv.org/abs/1607.06450}

$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

where:

μ = (1/d) Σ x_i: mean over the hidden dimension
σ² = (1/d) Σ (x_i - μ)²: variance over the hidden dimension
γ, β ∈ ℝ^d: learnable scale and shift parameters
ε: tiny positive constant to prevent division by zero (10^-5 in this example; defaults vary by implementation)

Subtracting the mean centers the values around zero. Dividing by the standard deviation squeezes them to a similar scale. The learned γ (scale) and β (shift) let the model decide whether the normalized values should be stretched, compressed, or offset for the next layer.
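To make the role of γ and β concrete, here is a small sketch under stated assumptions: the `mean_and_std` helper and the specific γ and β values are illustrative choices, not values any trained model uses. With a uniform γ and β, the output mean equals β and the output standard deviation is approximately γ; a per-feature γ of zero silences that feature.

```python
from math import sqrt

def layer_norm(values, gamma, beta, eps=1e-5):
    """LayerNorm over one token's features, using population variance."""
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [g * (value - mean) / scale + b for value, g, b in zip(values, gamma, beta)]

def mean_and_std(values):
    """Hypothetical helper: population mean and standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return mean, sqrt(variance)

x = [3.0, 1.0, -1.0, 5.0]

# Uniform gamma stretches the spread; uniform beta offsets the mean.
stretched = layer_norm(x, gamma=[2.0] * 4, beta=[0.5] * 4)
out_mean, out_std = mean_and_std(stretched)

# A per-feature gamma of zero removes that feature from the output.
masked = layer_norm(x, gamma=[1.0, 0.0, 1.0, 0.0], beta=[0.0] * 4)

print("mean, std with gamma=2, beta=0.5:", round(out_mean, 3), round(out_std, 3))
print("masked output:", [round(value, 3) for value in masked])
```

The standard deviation lands slightly below γ because ε is added to the variance before the square root.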

Key distinction from BatchNorm

LayerNorm normalizes across the feature dimension for each token independently. Batch Normalization (BatchNorm) instead computes each feature coordinate's statistics across examples in a mini-batch (and, for some data layouts, additional positions). This makes LayerNorm independent of batch size, which is useful for variable-length sequences and autoregressive generation.

Property	BatchNorm	LayerNorm
Normalizes across	Mini-batch examples per feature coordinate	Feature values within one token
Depends on batch size	Yes	No
Running statistics	Yes (train/eval mismatch)	No
Works for variable-length sequences	Awkward	Natural
Works for autoregressive decoding	Awkward	Natural
Used in modern LLMs	Rare	Routine

This small program holds one token fixed while changing its neighbors. The fixed token's LayerNorm output stays the same, while its batch-normalized feature coordinate changes with the peers in the batch.

layernorm-is-per-token.py

from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def batch_norm_one_feature(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

token = [1.0, 3.0, 5.0, 7.0]
ln_before = [layer_norm(row) for row in [token, [2.0, 4.0, 6.0, 8.0]]][0]
ln_after = [layer_norm(row) for row in [token, [100.0, 100.0, 100.0, 100.0]]][0]
batch_a = batch_norm_one_feature([token[0], 2.0, 3.0])
batch_b = batch_norm_one_feature([token[0], 2.0, 100.0])

print("LayerNorm token:", [round(value, 3) for value in ln_before])
print("Batch feature with peers=2,3:", round(batch_a[0], 3))
print("Batch feature with peers=2,100:", round(batch_b[0], 3))
assert ln_before == ln_after
assert round(batch_a[0], 3) != round(batch_b[0], 3)

Output

LayerNorm token: [-1.342, -0.447, 0.447, 1.342]
Batch feature with peers=2,3: -1.225
Batch feature with peers=2,100: -0.718

PyTorch implementation

This PyTorch implementation takes an input tensor of shape (batch, seq_len, d_model), calculates the mean and variance across the final feature dimension, then applies the learnable scale and shift.

pytorch-implementation.py

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer Normalization from scratch."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

# Quick check with the hand-worked example
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])  # shape (1, 1, 4)
ln = LayerNorm(d_model=4)
# Fix gamma=1, beta=0 to match the hand calculation
nn.init.constant_(ln.gamma, 1.0)
nn.init.constant_(ln.beta, 0.0)
out = ln(x)
print(out.round(decimals=2).detach())
expected = torch.tensor([[[0.45, -0.45, -1.34, 1.34]]])
assert torch.allclose(out.round(decimals=2), expected)

Output

tensor([[[ 0.4500, -0.4500, -1.3400,  1.3400]]])

Figure: Parallel-coordinate plot tracing four token features from input values through mean centering and standard-deviation scaling, with mean lines and one-standard-deviation bands at each stage. Subtracting μ = 2 shifts every feature by the same amount, moving the mean to zero without changing σ ≈ 2.24. Dividing by σ contracts the spread to one while preserving feature order. With γ = 1 and β = 0, the normalized vector is also the output.

The colored paths preserve each feature's identity across both operations. Centering translates the whole vector; scaling changes its spread. Learned γ and β can then reshape that standardized vector for the next layer.

Where you put LayerNorm matters

Now that you know what LayerNorm does, the next question is where to place it inside a transformer block. There are two choices, and the difference is whether the normalization happens before or after the residual addition.

The placement difference

Post-LN normalizes after the residual update has been added. Its residual path therefore crosses a normalization operation in every block.
Pre-LN normalizes before the sublayer update. Its residual path retains an identity contribution to the backward Jacobian.

Post-LN: the original transformer

In this arrangement (used by the original 2017 Transformer and BERT-style encoders), normalization happens after the residual addition:^{[2]Reference 2Attention Is All You Need.https://arxiv.org/abs/1706.03762}^{[3]Reference 3BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.https://arxiv.org/abs/1810.04805}

$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l))$

In plain terms: compute the sublayer output (attention or FFN), add it to the original input along the skip connection, then normalize the sum. Normalization now sits directly on the residual path, so gradients flowing along that path pass through a LayerNorm Jacobian in every Post-LN block.

Pre-LN: the GPT-2 layout

In this arrangement, used by GPT-2, normalization happens before the sublayer:^{[4]Reference 4Language Models are Unsupervised Multitask Learners.https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf}

$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l))$

Normalize the input first, feed it into the sublayer, then add the result to a direct copy of the input. The residual path from x_l to x_{l+1} keeps an identity contribution to the backward pass.

At the full-model level, GPT-2 and the Pre-LN architecture analyzed by Xiong et al. apply a final normalization before prediction.^{[4]Reference 4Language Models are Unsupervised Multitask Learners.https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf}^{[5]Reference 5On Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2002.04745} That final operation normalizes the accumulated residual stream before the output projection. Removing it produces a different architecture whose output scale needs evaluation.
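A rough way to see why the final normalization matters is to track the scale of a Pre-LN stream across several blocks. The sketch below assumes a hypothetical `toy_sublayer` (a fixed elementwise map, not attention or a trained FFN) and sets γ = 1 and β = 0; the depth and constants are arbitrary illustration values.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def rms(values):
    return sqrt(sum(value * value for value in values) / len(values))

def toy_sublayer(values):
    # Hypothetical stand-in for attention or an FFN update.
    return [1.5 * value for value in values]

stream = [1.0, 2.0, 4.0, 8.0]
for _ in range(8):
    # Pre-LN block: normalize the sublayer input, add the update to the raw stream.
    update = toy_sublayer(layer_norm(stream))
    stream = [value + delta for value, delta in zip(stream, update)]

before_final = rms(stream)
after_final = rms(layer_norm(stream))

print("stream RMS after 8 Pre-LN blocks:", round(before_final, 2))
print("stream RMS after final LayerNorm:", round(after_final, 2))
```

Each block normalizes only its own sublayer input, so nothing inside the stack resets the accumulated stream. In this toy, the final LayerNorm is the only operation that hands the output projection a unit-scale vector.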

Figure: Side-by-side Pre-LN and Post-LN block graphs with their exact backward Jacobians, highlighting the additive identity route in Pre-LN and the repeated LayerNorm Jacobian on the Post-LN residual path. Pre-LN gives each block Jacobian an additive identity term, $I + J_F J_{LN}$. Post-LN left-multiplies the residual contribution by $J_{LN}$, so the depth-wide route repeatedly crosses normalization Jacobians.

The diagram on the left shows Pre-LN: LayerNorm cleans the sublayer input first, and the raw input still jumps straight over to the addition. The diagram on the right shows Post-LN: the sublayer and the skip path meet first, then LayerNorm cleans up the result.

Holding a sublayer update fixed isolates the placement difference. Post-LN centers and rescales the updated stream immediately; Pre-LN lets the updated stream continue along the residual path.

pre-post-forward-placement.py

from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

x = [1.0, 2.0, 4.0, 8.0]
fixed_update = [0.5, -0.5, 1.0, -1.0]
pre_ln_output = [value + update for value, update in zip(x, fixed_update)]
post_ln_output = layer_norm(pre_ln_output)

print("Pre-LN stream mean:", round(sum(pre_ln_output) / len(pre_ln_output), 3))
print("Post-LN stream mean:", round(sum(post_ln_output) / len(post_ln_output), 3))
print("Post-LN output:", [round(value, 3) for value in post_ln_output])
assert abs(sum(post_ln_output) / len(post_ln_output)) < 1e-12
assert sum(pre_ln_output) / len(pre_ln_output) != 0.0

Output

Pre-LN stream mean: 3.75
Post-LN stream mean: 0.0
Post-LN output: [-0.954, -0.954, 0.53, 1.378]

A two-layer walkthrough

To see how the placement handles a uniform offset, trace a deliberately simplified stream through two blocks. Assume each block's sublayer produces the same update, 0.5, for every feature. Also set LayerNorm's learned scale to γ = 1 and shift to β = 0. A trained attention or feed-forward sublayer would usually produce nonuniform updates.

Post-LN stack

Start: x_0 = [1.0, 1.0, 1.0, 1.0]
After block 1: x_1 = LayerNorm(x_0 + 0.5) = LayerNorm([1.5, 1.5, 1.5, 1.5]) = [0.0, 0.0, 0.0, 0.0] (centering removes the shared value; the variance is zero, so ε keeps the division defined)
After block 2: x_2 = LayerNorm(x_1 + 0.5) = LayerNorm([0.5, 0.5, 0.5, 0.5]) = [0.0, 0.0, 0.0, 0.0]

The constant offset is reset at every layer. This toy is intentionally stark: LayerNorm removes a uniform shift, so the residual stream doesn't preserve that offset.

Pre-LN stack

Start: x_0 = [1.0, 1.0, 1.0, 1.0]
After block 1: x_1 = x_0 + Sublayer(LayerNorm(x_0)) = [1.0, 1.0, 1.0, 1.0] + 0.5 = [1.5, 1.5, 1.5, 1.5]
After block 2: x_2 = x_1 + Sublayer(LayerNorm(x_1)) = [1.5, 1.5, 1.5, 1.5] + 0.5 = [2.0, 2.0, 2.0, 2.0]

The running sum flows through the network because Pre-LN leaves the residual path outside the block normalization. The backward equation below shows why this placement includes an identity gradient route.

pre-ln-stack.py

from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [(value - mean) / scale for value in values]

def sublayer(values):
    # Tiny deterministic stand-in for attention or an FFN update.
    return [0.2 * value + 0.1 for value in values]

def pre_ln_step(values):
    return [value + update for value, update in zip(values, sublayer(layer_norm(values)))]

def post_ln_step(values):
    raw = [value + update for value, update in zip(values, sublayer(values))]
    return layer_norm(raw)

start = [1.0, 2.0, 4.0, 8.0]
pre_after_two = pre_ln_step(pre_ln_step(start))
post_after_two = post_ln_step(post_ln_step(start))

assert sum(abs(value) for value in pre_after_two) > sum(abs(value) for value in start)
assert abs(sum(post_after_two) / len(post_after_two)) < 1e-12
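The two-layer walkthrough with a uniform 0.5 update can also be checked directly. This sketch assumes a hypothetical `uniform_update` function that ignores its input and returns 0.5 for every feature, exactly as the walkthrough stipulates, with γ = 1 and β = 0.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def uniform_update(values):
    # Walkthrough assumption: every feature receives the same update, 0.5.
    return [0.5 for _ in values]

def post_ln_block(values):
    summed = [value + delta for value, delta in zip(values, uniform_update(values))]
    return layer_norm(summed)

def pre_ln_block(values):
    update = uniform_update(layer_norm(values))
    return [value + delta for value, delta in zip(values, update)]

x0 = [1.0, 1.0, 1.0, 1.0]
post_x2 = post_ln_block(post_ln_block(x0))
pre_x2 = pre_ln_block(pre_ln_block(x0))

print("Post-LN after two blocks:", post_x2)
print("Pre-LN after two blocks:", pre_x2)
```

Post-LN ends at all zeros because each block's normalization removes the shared offset, while Pre-LN ends at 2.0 in every feature because the offset accumulates on the residual path.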

Why Pre-LN can train more stably

Post-LN gradient flow is like passing an error signal through a long chain of adapters where every block rescales it before sending it backward. Each rescaling step slightly changes the signal. After eighty blocks, the original direction is hard to track. Pre-LN keeps a cleaner residual path beside the block updates, so useful gradients can travel backward without being repeatedly disrupted.

Post-LN gradient problem

The key result from Xiong et al. is that Post-LN produces especially large expected gradients near the output layers at initialization under their analysis.^{[5]Reference 5On Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2002.04745} For one block, the backward Jacobian is:

$\frac{\partial x_{l+1}}{\partial x_l} = J_{\text{LN},l} \cdot (I + J_{F,l})$

where J_LN,l is the LayerNorm Jacobian and J_F,l is the sublayer Jacobian. Across many layers, the backward pass multiplies many such terms together. In practice that means:

Gradient scale becomes uneven across depth. Early layers may receive tiny updates while late layers receive huge ones.
Top layers can receive very large updates at initialization. A single step with a moderately high learning rate can push weights far from a good basin.
Learning-rate warmup addresses this risk in the evaluated Post-LN setups. Start the learning rate near zero and ramp it up so early updates are smaller.
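For readers who haven't written a warmup schedule, here is a minimal sketch of a linear ramp. The `warmup_lr` function, its parameter names, and the numbers are hypothetical illustrations rather than the schedule from any cited paper; many real schedules also follow the ramp with a decay phase, which is omitted here.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear ramp toward base_lr, then constant (hypothetical helper)."""
    if warmup_steps <= 0:
        return base_lr
    # step is zero-based, so the first update uses base_lr / warmup_steps.
    return base_lr * min(1.0, (step + 1) / warmup_steps)

schedule = [warmup_lr(step, base_lr=1e-3, warmup_steps=4) for step in range(6)]

for step, rate in enumerate(schedule):
    print(f"step {step}: lr={rate:.6f}")
```

The first few updates use a fraction of the target rate, which limits how far a large initialization-time gradient can move the weights.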

Pre-LN gradient advantage

For Pre-LN, the Jacobian is:

$\frac{\partial x_{l+1}}{\partial x_l} = I + J_{F,l} \cdot J_{\text{LN},l}$

That leading identity term I is the key difference. Every block includes a direct residual contribution to the gradient, so the backward signal avoids relying entirely on repeated normalization Jacobians. Xiong et al. report that, in their evaluated tasks:

Pre-LN trains without learning-rate warmup in experiments where Post-LN needed it for stable optimization.^{[5]Reference 5On Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2002.04745}
Its initialization-time gradients are better behaved in their analysis.

This scalar toy isn't a training experiment. It only shows how putting a normalization factor on the identity path changes a product across blocks.

jacobian-route-toy.py

depth = 12
j_layer_norm = 0.75
j_sublayer = 0.10

post_block = j_layer_norm * (1.0 + j_sublayer)
pre_block = 1.0 + j_sublayer * j_layer_norm
post_path = post_block ** depth
pre_path = pre_block ** depth

print("one block: post=", round(post_block, 3), "pre=", round(pre_block, 3))
print("twelve-block product: post=", round(post_path, 3), "pre=", round(pre_path, 3))
assert post_path < 1.0
assert pre_path > 1.0

Output

one block: post= 0.825 pre= 1.075
twelve-block product: post= 0.099 pre= 2.382
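The two Jacobian formulas can be sanity-checked numerically on a tiny vector. This sketch assumes a hypothetical linear sublayer F(v) = W v (so J_F is exactly W), sets γ = 1 and β = 0, and estimates derivatives with central finite differences. Note that J_LN is evaluated at x in the Pre-LN formula and at x + F(x) in the Post-LN formula.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

# Hypothetical linear sublayer F(v) = W v, so its Jacobian J_F is exactly W.
W = [[0.20, -0.10, 0.00],
     [0.05, 0.30, -0.20],
     [-0.15, 0.10, 0.25]]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    columns = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in columns] for row in a]

def add_identity(matrix):
    return [[value + (1.0 if i == j else 0.0) for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]

def pre_ln_block(values):
    return [v + u for v, u in zip(values, matvec(W, layer_norm(values)))]

def post_ln_block(values):
    return layer_norm([v + u for v, u in zip(values, matvec(W, values))])

def jacobian(fn, point, h=1e-6):
    """Central finite differences; entry [i][j] approximates d fn_i / d x_j."""
    columns = []
    for j in range(len(point)):
        plus, minus = list(point), list(point)
        plus[j] += h
        minus[j] -= h
        columns.append([(p - m) / (2 * h) for p, m in zip(fn(plus), fn(minus))])
    return [list(row) for row in zip(*columns)]

def max_gap(a, b):
    return max(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))

x = [1.0, -2.0, 0.5]

# Pre-LN: I + J_F J_LN, with J_LN evaluated at x.
pre_formula = add_identity(matmul(W, jacobian(layer_norm, x)))

# Post-LN: J_LN (I + J_F), with J_LN evaluated at x + F(x).
summed = [v + u for v, u in zip(x, matvec(W, x))]
post_formula = matmul(jacobian(layer_norm, summed), add_identity(W))

pre_gap = max_gap(jacobian(pre_ln_block, x), pre_formula)
post_gap = max_gap(jacobian(post_ln_block, x), post_formula)
print("Pre-LN max gap between block Jacobian and formula:", pre_gap)
print("Post-LN max gap between block Jacobian and formula:", post_gap)
```

Both gaps should sit at finite-difference noise level. The check confirms the algebra for this toy only; it says nothing about how the products behave in a trained network.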

A reported Pre-LN risk: representation collapse

Pre-LN isn't free of trade-offs. Analyses including ResiDual report that its residual stream can dominate newer sublayer updates as depth increases. Under that behavior, late layers make smaller relative changes to the hidden representation.^{[6]Reference 6ResiDual: Transformer with Dual Residual Connections.https://arxiv.org/abs/2304.14802}

The ResiDual authors call this reported behavior representation collapse.^{[6]Reference 6ResiDual: Transformer with Dual Residual Connections.https://arxiv.org/abs/2304.14802} Treat it as a design risk to measure, not a guarantee for every Pre-LN model or training run.
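One measurement that can reveal this behavior is the ratio of each block's update norm to the norm of the stream it is added to. The sketch below is a toy under stated assumptions: `toy_sublayer` is a hypothetical fixed map, not a trained layer, and the depth of 24 is arbitrary. It shows only that a bounded update added to a growing stream shrinks in relative terms; it doesn't reproduce the ResiDual experiments.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def norm(values):
    return sqrt(sum(value * value for value in values))

def toy_sublayer(values):
    # Hypothetical stand-in for a sublayer; its input is already normalized.
    return [0.5 * value + 0.1 for value in values]

stream = [1.0, 2.0, 4.0, 8.0]
ratios = []
for _ in range(24):
    update = toy_sublayer(layer_norm(stream))
    ratios.append(norm(update) / norm(stream))  # relative size of this block's change
    stream = [value + delta for value, delta in zip(stream, update)]

print("relative update, first block:", round(ratios[0], 4))
print("relative update, last block:", round(ratios[-1], 4))
```

In a real model, the same ratio can be logged per block from hidden states; a steadily falling curve across depth is a signal worth investigating, not a verdict.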

Practical summary:

Property	Post-LN	Pre-LN
Warmup result in Xiong et al.	Needed in evaluated stable runs	Removed in evaluated stable runs
Initialization gradients in Xiong et al.	Large near output layers	Better behaved
Reported deep-layer risk	Gradient flow can become difficult	Residual stream can dominate later updates
Representative architecture	Original Transformer, BERT	GPT-2

There is no universal winner. Xiong et al. demonstrate the Pre-LN optimization advantage in their evaluated tasks, while DeepNorm later scales a Post-LN-derived design to a 1,000-layer machine-translation experiment using residual scaling and matching initialization.^{[5]Reference 5On Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2002.04745}^{[7]Reference 7DeepNet: Scaling Transformers to 1,000 Layers.https://arxiv.org/abs/2203.00555}

Common mistakes and how to debug them

Symptom	Likely cause	Fix
Model diverges at step 1 with `loss = NaN`	For a Post-LN run, an overly large early update is one hypothesis; Xiong et al. connect Post-LN initialization gradients to warmup sensitivity.	Inspect layerwise gradient norms and the learning-rate schedule. Compare a longer warmup, a smaller initial learning rate, or a Pre-LN variant under the same setup.
One normalization placement underperforms another	Placement interacts with depth, initialization, optimizer, data, and objective.	Treat placement as an ablation, then compare training loss, gradient profile, and downstream quality under matched conditions.
A Pre-LN implementation differs sharply from a reference model	A final normalization used by the target Pre-LN architecture may be absent.	Check the architecture definition and restore its final `LayerNorm` or `RMSNorm` before the output projection when applicable.^{[5]Reference 5On Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2002.04745}
Attention becomes nearly one-hot during unstable training	Query-key dot products may have grown enough to saturate softmax.	Inspect attention-logit statistics; QK-Norm or softmax capping are targeted candidates to evaluate in such runs.
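The last row's mechanism, softmax saturation, is easy to see with a few numbers. This is a sketch with made-up logits: the `base_logits` values and the scale factors are illustrative assumptions, not measurements from any model.

```python
from math import exp

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

base_logits = [2.0, 1.0, 0.5, 0.0]
max_probs = []
for scale in (1.0, 5.0, 25.0):
    probs = softmax([scale * value for value in base_logits])
    max_probs.append(max(probs))
    print(f"logit scale {scale:>4}: largest attention weight = {max(probs):.4f}")
```

As the logit scale grows, almost all attention weight moves to one position. Logging the maximum attention weight or the logit range per head is one way to notice this during a run.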

RMSNorm: scale-only normalization

RMSNorm (Root Mean Square Layer Normalization) replaces LayerNorm's centered standard deviation with a root-mean-square scale. Zhang and Sennrich propose it as a cheaper normalization that retains rescaling invariance while removing mean-centering; their evaluated models achieve comparable quality to LayerNorm with run-time reductions that vary by setup.^{[8]Reference 8Root Mean Square Layer Normalization.https://arxiv.org/abs/1910.07467} Qwen2.5 and Gemma 2 reports provide later architecture examples using RMSNorm.^{[9]Reference 9Qwen2.5 Technical Reporthttps://arxiv.org/abs/2412.15115}^{[10]Reference 10Gemma 2: Improving Open Language Models at a Practical Sizehttps://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf}

$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2 + \epsilon}}$

Instead of centering (subtracting the mean) and then scaling (dividing by standard deviation), RMSNorm just divides by the root-mean-square. It doesn't subtract the mean, and many implementations also omit a learned bias term.

Step	LayerNorm	RMSNorm
Compute mean	Yes	No
Subtract mean	Yes	No
Scale statistic	Centered standard deviation	Root mean square
Divide by scale statistic	Yes	Yes
Learned scale γ	Yes	Yes
Learned shift β	Yes	Often omitted

Figure: Exact comparison of LayerNorm and RMSNorm on x = [3, 1, -1, 5], showing LayerNorm output centered at zero, RMSNorm output with nonzero mean, and their different response to adding a constant offset. With $\gamma = 1$, $\beta = 0$, and $\epsilon$ ignored, LayerNorm maps the token to mean zero and standard deviation one. RMSNorm maps it to RMS one but keeps mean $0.67$; adding a shared offset leaves LayerNorm unchanged but changes the RMSNorm output.

The original RMSNorm paper reports comparable quality to LayerNorm on its evaluated tasks and observed run-time reductions from 7% to 64% across different models and implementations.^{[8]Reference 8Root Mean Square Layer Normalization.https://arxiv.org/abs/1910.07467} Fused kernels, hardware, and the fraction of time spent in normalization determine the result in another stack, so those measurements are evidence from the paper rather than a promised speedup.
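The different response to a shared offset can be checked in a few lines. This sketch uses γ = 1 and β = 0, and the offset of 10.0 is an arbitrary illustration value.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def rms_norm(values, eps=1e-6):
    rms = sqrt(sum(value * value for value in values) / len(values) + eps)
    return [value / rms for value in values]

x = [3.0, 1.0, -1.0, 5.0]
offset_x = [value + 10.0 for value in x]  # same offset added to every feature

# Largest per-feature change in the output caused by the offset.
ln_change = max(abs(a - b) for a, b in zip(layer_norm(x), layer_norm(offset_x)))
rms_change = max(abs(a - b) for a, b in zip(rms_norm(x), rms_norm(offset_x)))

print("LayerNorm max change:", round(ln_change, 6))
print("RMSNorm max change:", round(rms_change, 3))
```

LayerNorm's centering step cancels the offset before rescaling. RMSNorm has no centering step, so the offset inflates the root-mean-square and changes every output value.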

rmsnorm-a-common-modern-shortcut.py

from math import sqrt

def rms_norm(values, weight=None, eps=1e-6):
    if weight is None:
        weight = [1.0] * len(values)
    rms = sqrt(sum(value * value for value in values) / len(values) + eps)
    return [value * scale / rms for value, scale in zip(values, weight)]

values = [3.0, 1.0, -1.0, 5.0]
normalized = rms_norm(values)

assert round(sqrt(sum(value * value for value in normalized) / len(normalized)), 6) == 1.0
assert round(sum(normalized) / len(normalized), 6) != 0.0
assert rms_norm([3.0, 1.0], weight=[0.0, 0.0]) == [0.0, 0.0]

rmsnorm-a-common-modern-shortcut-2.py

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Quick sanity check
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])
rms = RMSNorm(d_model=4)
out = rms(x)
print(out.round(decimals=4).detach())
feature_rms = torch.sqrt(out.pow(2).mean(dim=-1))
assert torch.allclose(feature_rms, torch.ones_like(feature_rms), atol=1e-5)
assert not torch.allclose(out.mean(dim=-1), torch.zeros_like(out.mean(dim=-1)))

Output

tensor([[[ 1.0000,  0.3333, -0.3333,  1.6667]]])

Reported normalization layouts

Once you understand Pre-LN and Post-LN, advanced variants become easier to read. These entries summarize specific reports; their experiments differ in model size, objective, and optimization setup.

Recipe	Reported example	What changes
Post-LN	BERT	Normalize after the residual addition
Pre-LN	GPT-2	Normalize before attention and FFN, retaining an identity residual contribution
Pre + Post norm	Gemma 2	Normalize both input and output of each sublayer with RMSNorm^{[10]Reference 10Gemma 2: Improving Open Language Models at a Practical Sizehttps://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf}
DeepNorm	DeepNet study	Modify Post-LN with a scaled residual and matched initialization; evaluated up to 1,000 layers^{[7]Reference 7DeepNet: Scaling Transformers to 1,000 Layers.https://arxiv.org/abs/2203.00555}
QK-Norm (Query-Key Normalization)	Rybakov et al. study	Normalize query and key vectors before their dot product; evaluated alone and with softmax capping^{[11]Reference 11Methods of improving LLM training stability.https://arxiv.org/abs/2410.16682}

No row establishes a globally superior recipe. Peri-LN names an input-and-output normalization layout and reports experiments up to 3.2B parameters; Rybakov et al. evaluate attention-specific normalization and capping on an 830M-parameter model driven into instability with high learning rates.^{[12]Reference 12Peri-LN: Revisiting Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2502.02732}^{[11]Reference 11Methods of improving LLM training stability.https://arxiv.org/abs/2410.16682}
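The DeepNorm row's "scaled residual" can be sketched as Post-LN with a multiplier on the skip path. Everything specific here is an assumption for illustration: `toy_sublayer` is a hypothetical stand-in, and the `alpha` values are arbitrary, whereas the DeepNet paper derives its constant from depth and pairs it with a matching initialization scale, neither of which is reproduced here.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def toy_sublayer(values):
    # Hypothetical stand-in for attention or an FFN: reverse and halve.
    return [0.5 * value for value in reversed(values)]

def scaled_post_ln_step(values, alpha):
    # Post-LN with the residual input multiplied by alpha; alpha = 1 is plain Post-LN.
    summed = [alpha * value + delta for value, delta in zip(values, toy_sublayer(values))]
    return layer_norm(summed)

def max_gap(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

x = [1.0, 2.0, 4.0, 8.0]
reference = layer_norm(x)  # what the block would output with no sublayer update

gap_plain = max_gap(scaled_post_ln_step(x, alpha=1.0), reference)
gap_scaled = max_gap(scaled_post_ln_step(x, alpha=3.0), reference)

print("distance from LayerNorm(x), alpha=1:", round(gap_plain, 3))
print("distance from LayerNorm(x), alpha=3:", round(gap_scaled, 3))
```

A larger `alpha` makes each block's output depend more on its input and less on the sublayer update, which is the sense in which the residual is up-weighted.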

QK-Norm targets attention logits, not every residual-stream dynamic.

This small calculation uses RMS rescaling to expose the mechanism: multiplying a query by twenty changes the raw dot-product logit, but rescaling both query and key by their RMS removes that scale-only change.

qk-norm-controls-dot-scale.py

from math import isclose, sqrt

def rms_rescale(values):
    rms = sqrt(sum(value * value for value in values) / len(values))
    return [value / rms for value in values]

def attention_logit(query, key):
    return sum(q * k for q, k in zip(query, key)) / sqrt(len(query))

query = [6.0, -3.0, 2.0, 1.0]
key = [4.0, -2.0, 1.0, 3.0]
large_query = [20.0 * value for value in query]

raw = attention_logit(query, key)
raw_large = attention_logit(large_query, key)
controlled = attention_logit(rms_rescale(query), rms_rescale(key))
controlled_large = attention_logit(rms_rescale(large_query), rms_rescale(key))

print("raw logits:", round(raw, 3), round(raw_large, 3))
print("RMS-rescaled logits:", round(controlled, 3), round(controlled_large, 3))
assert raw_large == 20.0 * raw
assert isclose(controlled, controlled_large, rel_tol=1e-12)

Output

raw logits: 17.5 350.0
RMS-rescaled logits: 1.807 1.807

Peri-LN and hybrids

Kim et al. use Peri-LN for normalization placed both before and after each sublayer, and report more balanced variance growth and steadier gradients than their compared layouts in experiments up to 3.2B parameters.^{[12]Reference 12Peri-LN: Revisiting Layer Normalization in the Transformer Architecture.https://arxiv.org/abs/2502.02732} Gemma 2's technical report states that it applies RMSNorm to both the input and output of each transformer sublayer; this matches an input-and-output normalization layout, although that report uses no Peri-LN label.^{[10]Reference 10Gemma 2: Improving Open Language Models at a Practical Sizehttps://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf}
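An input-and-output normalization layout can be sketched next to plain Pre-LN. This is a minimal illustration, assuming the output norm is applied to the sublayer result before the residual addition, with learned weights omitted. `toy_sublayer` is a hypothetical stand-in with a deliberately large output scale; the function names are illustrative and not taken from any codebase.

```python
from math import sqrt

def rms_norm(values, eps=1e-6):
    rms = sqrt(sum(value * value for value in values) / len(values) + eps)
    return [value / rms for value in values]

def rms(values):
    return sqrt(sum(value * value for value in values) / len(values))

def toy_sublayer(values):
    # Hypothetical stand-in whose output scale is deliberately large.
    return [50.0 * value for value in values]

def pre_norm_update(values):
    # Pre-norm only: Sublayer(Norm(x))
    return toy_sublayer(rms_norm(values))

def input_output_norm_update(values):
    # Input and output norm: Norm(Sublayer(Norm(x)))
    return rms_norm(toy_sublayer(rms_norm(values)))

x = [3.0, 1.0, -1.0, 5.0]
pre_update_rms = rms(pre_norm_update(x))
both_update_rms = rms(input_output_norm_update(x))

# Either update is then added to the raw stream: x + update.
both_step = [value + delta for value, delta in zip(x, input_output_norm_update(x))]

print("update RMS, input norm only:", round(pre_update_rms, 3))
print("update RMS, input and output norm:", round(both_update_rms, 3))
```

With only an input norm, the update inherits whatever scale the sublayer produces. The output norm bounds the update before it joins the residual stream, and a learned weight (omitted here) would then set its scale.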

Practice: build a toggle-switch transformer block

A practical way to internalize the difference is to write one block that can switch between Post-LN and Pre-LN. The minimal PyTorch module below lets you initialize stacks of each type and compare the gradient norms across blocks at initialization.

practice-build-a-toggle-switch-transformer.py

import torch
import torch.nn as nn

class ToggleBlock(nn.Module):
    """Transformer block with style='pre' or style='post'."""
    def __init__(self, d_model: int, style: str = "pre"):
        super().__init__()
        assert style in ("pre", "post")
        self.style = style
        self.norm = nn.LayerNorm(d_model)
        # Simplified sublayer: a two-layer MLP (Linear, ReLU, Linear) standing in for attention/FFN
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.style == "pre":
            # x_l + Sublayer(LayerNorm(x_l))
            return x + self.sublayer(self.norm(x))
        else:
            # LayerNorm(x_l + Sublayer(x_l))
            return self.norm(x + self.sublayer(x))

class TinyStack(nn.Module):
    def __init__(self, d_model: int = 64, depth: int = 6, style: str = "pre"):
        super().__init__()
        self.blocks = nn.ModuleList([ToggleBlock(d_model, style) for _ in range(depth)])
        self.final_norm = nn.LayerNorm(d_model) if style == "pre" else nn.Identity()
        self.head = nn.Linear(d_model, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return self.head(self.final_norm(x))

def inspect_grad(style: str, d_model: int = 64, depth: int = 6, seed: int = 0):
    torch.manual_seed(seed)
    model = TinyStack(d_model=d_model, depth=depth, style=style)
    x = torch.randn(2, 8, d_model)
    target = torch.randn(2, 8, 16)
    out = model(x)
    loss = nn.functional.mse_loss(out, target)
    loss.backward()
    grad_norms = [block.sublayer[0].weight.grad.norm().item() for block in model.blocks]
    rounded_norms = [round(value, 4) for value in grad_norms]
    print(f"{style:4} | loss={loss.item():.4f} | block grad norms={rounded_norms}")
    return grad_norms

pre_grad_norms = inspect_grad("pre")
post_grad_norms = inspect_grad("post")

assert len(pre_grad_norms) == len(post_grad_norms) == 6
assert all(value > 0 for value in pre_grad_norms + post_grad_norms)

Output

pre  | loss=1.3247 | block grad norms=[0.2664, 0.2565, 0.2494, 0.2392, 0.2472, 0.2415]
post | loss=1.3183 | block grad norms=[0.2663, 0.2614, 0.2619, 0.2555, 0.2684, 0.2698]

The profiles differ across depth and seed. This toy is a diagnostic, not a proof of the Xiong et al. result. The real lesson is how to inspect layerwise gradients instead of treating a diverging model as a mystery.

Mini exercise

Change depth from 6 to 12. How do the per-block gradient profiles change?
Replace final_norm with nn.Identity() in Pre-LN mode and inspect output scale and loss under several seeds.
Try replacing nn.LayerNorm with the RMSNorm class from earlier. Does the gradient norm change meaningfully?

What to check before moving on

Compute LayerNorm by hand, including mean, population variance, epsilon, learned scale, and learned shift.
Explain why LayerNorm uses one token's feature values while BatchNorm uses statistics across examples or positions.
Identify Pre-LN and Post-LN from code, then derive $I + J_F J_{LN}$ versus $J_{LN}(I + J_F)$.
Connect Post-LN's initialization-time gradient profile to learning-rate warmup without claiming the result applies to every training setup.
Explain the reported deep Pre-LN representation-collapse risk and name measurements that would reveal it.
Implement RMSNorm and state exactly which LayerNorm operations it removes.
Separate residual-stream normalization, QK-Norm, DeepNorm, and input-output hybrid layouts by the mechanism each changes.

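Practice checkpoints

Your 48-layer decoder diverges during the first few hundred steps, but the same optimizer works on a 12-layer version. What should you inspect first?
Why should you avoid treating Pre-LN as a universal replacement for Post-LN?
When does RMSNorm change the model's behavior less than replacing Pre-LN with Post-LN?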

What to remember

LayerNorm standardizes one token across its feature dimension, then restores learned scale and shift. Pre-LN normalizes only the sublayer input, leaving an additive identity contribution in each block Jacobian. Post-LN normalizes after the residual addition, so the residual route repeatedly crosses LayerNorm Jacobians. RMSNorm keeps magnitude control but removes mean-centering. QK-Norm acts on attention queries and keys, which makes it a different intervention from residual-stream normalization.
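
The Jacobian contrast in this summary follows from one application of the chain rule per block. A short derivation, assuming the notation $x_l$ for the residual stream entering block $l$ and $F$ for the sublayer (both symbols are introduced here for illustration):

$$
\text{Pre-LN:}\quad x_{l+1} = x_l + F(\mathrm{LN}(x_l)) \;\Rightarrow\; \frac{\partial x_{l+1}}{\partial x_l} = I + J_F J_{LN}
$$

$$
\text{Post-LN:}\quad x_{l+1} = \mathrm{LN}(x_l + F(x_l)) \;\Rightarrow\; \frac{\partial x_{l+1}}{\partial x_l} = J_{LN}(I + J_F)
$$

Here $J_F$ and $J_{LN}$ are the Jacobians of the sublayer and of LayerNorm, each evaluated at its own input. In Pre-LN the identity term sits outside any normalization Jacobian, so a stack of blocks always keeps a direct gradient path. In Post-LN every block's factor is left-multiplied by $J_{LN}$, so even the skip path is rescaled at each layer.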

Common failures and fixes

Symptom: A reference Pre-LN implementation has unexpected output scale. Cause: Its final normalization was removed because each block already has an input-side norm. Fix: Match the architecture first; GPT-2 and the Pre-LN setup analyzed by Xiong et al. normalize the accumulated stream before prediction.
Symptom: RMSNorm is described as a faster spelling of LayerNorm. Cause: Mean-centering and RMS rescaling were collapsed into one operation. Fix: State the formulas: LayerNorm centers and rescales; RMSNorm rescales without subtracting the mean.
Symptom: A Post-LN run diverges and depth alone gets blamed. Cause: Initialization-time gradient scale and early learning rate weren't inspected. Fix: Log layerwise gradient norms, then compare warmup, initial rate, and norm placement under matched conditions.
Symptom: Attention logits are controlled but another instability remains. Cause: QK-Norm was treated as a replacement for residual normalization. Fix: Treat query-key scale and residual-stream dynamics as separate measurements and interventions.
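
The "log layerwise gradient norms" fix can be sketched as a small helper that works on any `nn.Module`. This is one possible shape for such a diagnostic, not a prescribed tool: the function name `layerwise_grad_norms` and the stand-in `toy_model` are hypothetical and exist only so the example runs on its own.

```python
import torch
import torch.nn as nn


def layerwise_grad_norms(model: nn.Module) -> dict[str, float]:
    """Return the L2 gradient norm of every parameter that currently has a gradient."""
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }


torch.manual_seed(0)

# Hypothetical stand-in model (an assumption); substitute your own transformer.
toy_model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8))
inputs = torch.randn(4, 8)
targets = torch.randn(4, 8)

# One forward/backward pass populates .grad on every trainable parameter.
loss = nn.functional.mse_loss(toy_model(inputs), targets)
loss.backward()

norms = layerwise_grad_norms(toy_model)
for name, value in norms.items():
    print(f"{name:10} grad_norm={value:.4f}")
```

Calling the helper right after `loss.backward()` and before the optimizer step, at initialization and again during the first few hundred steps, gives the per-layer profile to compare across warmup schedules and norm placements.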

Next Step

Continue to Mechanistic Interpretability and Sparse Autoencoders

Layer normalization keeps the residual stream numerically stable. The next chapter opens that residual stream and shows how sparse autoencoders turn opaque activation vectors into readable features you can inspect, test, and sometimes steer.


References

Layer Normalization.

Ba, J. L., Kiros, J. R., & Hinton, G. E. · 2016

Attention Is All You Need.

Vaswani, A., et al. · 2017

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Devlin, J., et al. · 2019 · NAACL 2019

Language Models are Unsupervised Multitask Learners.

Radford, A., et al. · 2019

On Layer Normalization in the Transformer Architecture.

Xiong, R., et al. · 2020 · ICML 2020

ResiDual: Transformer with Dual Residual Connections.

Xie, Z., et al. · 2023 · ACL 2023

DeepNet: Scaling Transformers to 1,000 Layers.

Wang, H., Ma, S., et al. · 2022

Root Mean Square Layer Normalization.

Zhang, B. & Sennrich, R. · 2019 · NeurIPS 2019

Qwen2.5 Technical Report.

Qwen Team · 2024

Gemma 2: Improving Open Language Models at a Practical Size.

Gemma Team, Google DeepMind · 2024

QK-Norm: Improving Transformer Training Stability.

Rybakov, O., et al. · 2024

Peri-LN: Revisiting Layer Normalization in the Transformer Architecture.

Kim, J., et al. · 2025 · ICML 2025

