Understand LayerNorm mechanics, Pre-LN versus Post-LN placement, RMSNorm simplification, gradient stability, and hybrid normalization layouts for deep transformers.
Positional encoding showed how attention knows where tokens sit. Layer normalization controls a different part of the model: the scale of the residual stream that carries token features through dozens of blocks.
Imagine a warehouse routing model that decides which shelf a package should visit next. At each processing station, the model adds a small update to the package's feature vector: current location, priority, weight, and destination zip. After fifty stations, poorly scaled updates can compound. Some features may become far larger than others, a downstream nonlinearity may become sharply saturated, and gradients may weaken. Deep networks face this scale-control risk. Layer Normalization helps manage it, and its placement inside a transformer block changes the optimization behavior.
This article assumes you've seen residual connections before. If you need a quick refresher: a residual block computes output = input + sublayer(input). The input travels along a skip path, and the sublayer (attention or a feed-forward network) adds a small change. Layer Normalization decides whether you normalize the input before the sublayer, or normalize the sum after it.
Every residual block adds a change to a hidden state. Residual updates aren't literal multipliers, but a toy multiplicative drift makes the risk easy to see: a scale factor of 1.1 repeated across fifty blocks gives 1.1^50, roughly 117. Backward passes face a related issue because they multiply block Jacobians; their product can become too large or too small.
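The toy drift above is easy to check numerically. The sketch below assumes a hypothetical per-block scale factor (the name `per_block_scale` is illustrative and not a quantity measured from any real model) and compounds it across fifty blocks in both the growing and the shrinking direction.

```python
def compounded_scale(per_block_scale: float, depth: int) -> float:
    """Scale after `depth` blocks if every block multiplied the scale by the same factor."""
    return per_block_scale ** depth

# Hypothetical factors slightly above and slightly below one.
growth = compounded_scale(1.1, 50)
decay = compounded_scale(0.9, 50)

print(f"1.1^50 = {growth:.1f}")
print(f"0.9^50 = {decay:.4f}")
```

A factor only ten percent away from one is enough to move the scale by two orders of magnitude in either direction at this depth.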
Layer Normalization helps by re-centering and rescaling each token's hidden state. Think of it as a calibration step. Before learned scale and shift are applied, the normalized vector sits near zero with a spread close to one. The model then learns scale parameters to stretch those values back into the range it needs.
Before the formula, here is a concrete example. Take a single token's hidden state, a four-dimensional vector:
x = [3.0, 1.0, -1.0, 5.0]

Step 1: compute the mean.
(3 + 1 + (-1) + 5) / 4 = 2.0
Step 2: subtract the mean from every entry.
[1.0, -1.0, -3.0, 3.0]
Step 3: compute the variance.
(1^2 + (-1)^2 + (-3)^2 + 3^2) / 4 = (1 + 1 + 9 + 9) / 4 = 5.0
Step 4: divide by the standard deviation.
sqrt(5.0) ≈ 2.236
[0.45, -0.45, -1.34, 1.34]
Step 5: scale and shift with learned parameters.
If γ = [1.0, 1.0, 1.0, 1.0] and β = [0.0, 0.0, 0.0, 0.0], the output is the same normalized vector. In practice, the model learns the γ and β that help it predict the next token.
That's the entire operation. It happens independently for every token in the sequence. Other tokens in the batch have no role in its statistics, so the same rule applies during training and during autoregressive generation with batch size one.
```python
from math import isclose, sqrt

def layer_norm(values, gamma=None, beta=None, eps=1e-5):
    # Default to the identity scale and zero shift used in the hand calculation.
    gamma = gamma or [1.0] * len(values)
    beta = beta or [0.0] * len(values)
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [
        gamma_i * ((value - mean) / scale) + beta_i
        for value, gamma_i, beta_i in zip(values, gamma, beta)
    ]

normalized = layer_norm([3.0, 1.0, -1.0, 5.0])
rounded = [round(value, 2) for value in normalized]
assert rounded == [0.45, -0.45, -1.34, 1.34]
assert isclose(sum(normalized), 0.0, abs_tol=1e-12)
```

For a vector x ∈ ℝ^d (a single token's hidden state), LayerNorm computes:[1]

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta$$
where:

- μ = (1/d) Σ x_i: mean over the hidden dimension
- σ² = (1/d) Σ (x_i - μ)²: variance over the hidden dimension
- γ, β ∈ ℝ^d: learnable scale and shift parameters
- ε ≈ 10^-5: tiny constant to prevent division by zero

The subtraction centers the values around zero. The division squeezes them to a similar scale. The learned γ (scale) and β (shift) let the model decide whether the normalized values should be stretched, compressed, or offset for the next layer.
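To make the roles of γ, β, and ε concrete, here is a small sketch. The parameter values (γ = 2, β = 0.5) are illustrative assumptions, not learned values. It checks that γ sets the spread of the output, β sets its mean, and ε keeps a constant input vector from causing a division by zero.

```python
from math import sqrt

def layer_norm(values, gamma, beta, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(values, gamma, beta)]

x = [3.0, 1.0, -1.0, 5.0]

# Illustrative (not learned) parameters: stretch by 2, offset by 0.5.
stretched = layer_norm(x, gamma=[2.0] * 4, beta=[0.5] * 4)
out_mean = sum(stretched) / len(stretched)
out_std = sqrt(sum((v - out_mean) ** 2 for v in stretched) / len(stretched))
print("output mean:", round(out_mean, 3), "output std:", round(out_std, 3))

# A constant vector has zero variance; eps keeps the denominator positive.
flat = layer_norm([7.0] * 4, gamma=[1.0] * 4, beta=[0.0] * 4)
print("constant input maps to:", flat)
```

With a shared γ and β across features, the output mean equals β and the output standard deviation is very close to γ.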
LayerNorm normalizes across the feature dimension for each token independently, while Batch Normalization (BatchNorm) normalizes across the batch dimension. This makes LayerNorm independent of batch size, which is critical for variable-length sequences and autoregressive generation.
| Property | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes across | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Running statistics | Yes (train/eval mismatch) | No |
| Works for variable-length sequences | Awkward | Natural |
| Works for autoregressive decoding | Awkward | Natural |
| Used in modern LLMs | Rare | Routine |
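The "running statistics" row can be illustrated with a simplified sketch. It assumes a single feature coordinate, three hypothetical training batches, and an exponential running average with an assumed momentum of 0.1; real BatchNorm implementations differ in details such as the variance estimator. The point is only that training-mode and evaluation-mode outputs for the same value need not agree.

```python
from math import sqrt

def batch_stats(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(value, mean, variance, eps=1e-5):
    return (value - mean) / sqrt(variance + eps)

# One feature coordinate observed across three hypothetical training batches.
batches = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 2.0]]
momentum = 0.1  # assumed update rate for the running averages
running_mean, running_var = 0.0, 1.0

for batch in batches:
    mean, variance = batch_stats(batch)
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * variance

# Training mode uses the current batch; evaluation mode uses the running averages.
last_mean, last_var = batch_stats(batches[-1])
train_output = normalize(batches[-1][0], last_mean, last_var)
eval_output = normalize(batches[-1][0], running_mean, running_var)

print("train-mode output:", round(train_output, 3))
print("eval-mode output:", round(eval_output, 3))
```

LayerNorm has no counterpart to `running_mean` or `running_var`, because each token supplies its own statistics at both training and inference time.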
This small program changes a neighboring token while holding one token fixed. Its LayerNorm result stays fixed; a batch-normalized feature coordinate changes.
```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    # Statistics come from one token's own features.
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

def batch_norm_one_feature(values, eps=1e-5):
    # Statistics come from one feature coordinate across the batch.
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

token = [1.0, 3.0, 5.0, 7.0]
ln_before = [layer_norm(row) for row in [token, [2.0, 4.0, 6.0, 8.0]]][0]
ln_after = [layer_norm(row) for row in [token, [100.0, 100.0, 100.0, 100.0]]][0]
batch_a = batch_norm_one_feature([token[0], 2.0, 3.0])
batch_b = batch_norm_one_feature([token[0], 2.0, 100.0])

print("LayerNorm token:", [round(value, 3) for value in ln_before])
print("Batch feature with peers=2,3:", round(batch_a[0], 3))
print("Batch feature with peers=2,100:", round(batch_b[0], 3))
assert ln_before == ln_after
assert round(batch_a[0], 3) != round(batch_b[0], 3)
```

```
LayerNorm token: [-1.342, -0.447, 0.447, 1.342]
Batch feature with peers=2,3: -1.225
Batch feature with peers=2,100: -0.718
```

The following code demonstrates a basic PyTorch implementation. It takes an input tensor of shape (batch, seq_len, d_model) and calculates the mean and variance across the final feature dimension. It then applies the learnable scale and shift, returning the normalized tensor.
```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer Normalization from scratch."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

# Quick check with the hand-worked example
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])  # shape (1, 1, 4)
ln = LayerNorm(d_model=4)
# Fix gamma=1, beta=0 to match the hand calculation
nn.init.constant_(ln.gamma, 1.0)
nn.init.constant_(ln.beta, 0.0)
out = ln(x)
print(out.round(decimals=2).detach())
expected = torch.tensor([[[0.45, -0.45, -1.34, 1.34]]])
assert torch.allclose(out.round(decimals=2), expected)
```

```
tensor([[[ 0.4500, -0.4500, -1.3400,  1.3400]]])
```
The illustration above traces the same four numbers through each step: raw input, centered, normalized, and final output. Notice how the bars shrink to a controlled range after normalization.
Now that you know what LayerNorm does, the next question is where to place it inside a transformer block. There are two choices, and the difference is simply whether the normalization happens before or after the residual addition.
In the Post-LN arrangement (used by the original 2017 Transformer and BERT-style encoders), normalization happens after the residual addition:[2][3]

$$x_{l+1} = \mathrm{LayerNorm}\bigl(x_l + \mathrm{Sublayer}(x_l)\bigr)$$
In plain terms: compute the sublayer output (attention or FFN), add it to the original input along the skip connection, then normalize the sum. The problem is that normalization sits directly on the main highway, so gradients through the residual path pass through a LayerNorm Jacobian in each Post-LN block.
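One property of the LayerNorm Jacobian is worth seeing in numbers: LayerNorm's output barely changes when its input is rescaled, so its Jacobian shrinks as the input grows. The sketch below estimates the Jacobian with central finite differences (the `jacobian` helper and its step size are illustrative choices, with γ = 1 and β = 0 assumed) and compares its Frobenius norm at x and at 10x.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [(v - mean) / scale for v in values]

def jacobian(fn, x, step=1e-6):
    """Central finite-difference Jacobian: entry [i][j] approximates d fn(x)[i] / d x[j]."""
    n = len(x)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(x), list(x)
        plus[j] += step
        minus[j] -= step
        f_plus, f_minus = fn(plus), fn(minus)
        for i in range(n):
            matrix[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return matrix

def frobenius_norm(matrix):
    return sqrt(sum(entry ** 2 for row in matrix for entry in row))

x = [1.0, 2.0, 4.0, 8.0]
norm_at_x = frobenius_norm(jacobian(layer_norm, x))
norm_at_10x = frobenius_norm(jacobian(layer_norm, [10.0 * v for v in x]))

print("Jacobian norm at x:", round(norm_at_x, 4))
print("Jacobian norm at 10x:", round(norm_at_10x, 4))
```

Multiplying the input by ten divides the Jacobian norm by roughly ten, so a gradient passing through a Post-LN normalization is rescaled by an amount that depends on the size of the pre-normalization sum.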
In the Pre-LN arrangement, used by GPT-2, normalization happens before the sublayer:[4]

$$x_{l+1} = x_l + \mathrm{Sublayer}\bigl(\mathrm{LayerNorm}(x_l)\bigr)$$
Normalize the input first, feed it into the sublayer, then add the result to a direct copy of the input. The residual path from x_l to x_{l+1} keeps an identity contribution to the backward pass.
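The identity contribution can be checked numerically. This sketch uses a hypothetical elementwise stand-in for the sublayer (`0.2 * v + 0.1`, not a real attention or feed-forward layer) and a finite-difference Jacobian helper. It confirms that the Jacobian of a Pre-LN block equals the identity matrix plus the Jacobian of the normalized branch.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [(v - mean) / scale for v in values]

def branch(values):
    # Hypothetical stand-in for attention or an FFN applied to the normalized input.
    return [0.2 * v + 0.1 for v in layer_norm(values)]

def pre_ln_block(values):
    return [v + u for v, u in zip(values, branch(values))]

def jacobian(fn, x, step=1e-6):
    """Central finite-difference Jacobian: entry [i][j] approximates d fn(x)[i] / d x[j]."""
    n = len(x)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(x), list(x)
        plus[j] += step
        minus[j] -= step
        f_plus, f_minus = fn(plus), fn(minus)
        for i in range(n):
            matrix[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return matrix

x = [1.0, 2.0, 4.0, 8.0]
n = len(x)
block_jac = jacobian(pre_ln_block, x)
branch_jac = jacobian(branch, x)

# Subtracting the branch Jacobian should leave the identity matrix.
max_error = max(
    abs(block_jac[i][j] - branch_jac[i][j] - (1.0 if i == j else 0.0))
    for i in range(n)
    for j in range(n)
)
print("max deviation from identity:", max_error)
```

Whatever the branch does, the skip path contributes an unmodified identity term to the block Jacobian.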
At the full-model level, GPT-2 and the Pre-LN architecture analyzed by Xiong et al. apply a final normalization before prediction.[4][5] That final operation normalizes the accumulated residual stream before the output projection. Removing it produces a different architecture whose output scale needs evaluation.
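A toy stream shows why the final normalization matters for output scale. The sketch assumes a hypothetical sublayer (a fixed elementwise map of the normalized input) and 24 Pre-LN blocks; both choices are illustrative. The residual stream keeps accumulating updates, and a final LayerNorm (γ = 1, β = 0) brings it back to unit scale before any output projection would see it.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [(v - mean) / scale for v in values]

def rms(values):
    return sqrt(sum(v * v for v in values) / len(values))

def pre_ln_block(values):
    # Hypothetical sublayer: a fixed elementwise map of the normalized input.
    update = [0.8 * v + 0.1 for v in layer_norm(values)]
    return [v + u for v, u in zip(values, update)]

stream = [1.0, 2.0, 4.0, 8.0]
for _ in range(24):
    stream = pre_ln_block(stream)

rms_before = rms(stream)
rms_after = rms(layer_norm(stream))
print("stream RMS before final norm:", round(rms_before, 2))
print("stream RMS after final norm:", round(rms_after, 2))
```

Without the final normalization, the output projection would receive a stream whose scale depends on depth and on the history of updates.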
The diagram on the left shows Pre-LN: LayerNorm cleans the sublayer input first, and the raw input still jumps straight over to the addition. The diagram on the right shows Post-LN: the sublayer and the skip path meet first, then LayerNorm cleans up the result.
Holding a sublayer update fixed isolates the placement difference. Post-LN centers and rescales the updated stream immediately; Pre-LN lets the updated stream continue along the residual path.
```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / sqrt(variance + eps) for value in values]

x = [1.0, 2.0, 4.0, 8.0]
fixed_update = [0.5, -0.5, 1.0, -1.0]
# Pre-LN passes the raw sum along; Post-LN normalizes it immediately.
pre_ln_output = [value + update for value, update in zip(x, fixed_update)]
post_ln_output = layer_norm(pre_ln_output)

print("Pre-LN stream mean:", round(sum(pre_ln_output) / len(pre_ln_output), 3))
print("Post-LN stream mean:", round(sum(post_ln_output) / len(post_ln_output), 3))
print("Post-LN output:", [round(value, 3) for value in post_ln_output])
assert abs(sum(post_ln_output) / len(post_ln_output)) < 1e-12
assert sum(pre_ln_output) / len(pre_ln_output) != 0.0
```

```
Pre-LN stream mean: 3.75
Post-LN stream mean: 0.0
Post-LN output: [-0.954, -0.954, 0.53, 1.378]
```

To see how the placement handles a uniform offset, trace a deliberately simplified stream through two blocks. Assume each block's sublayer produces the same update, 0.5, for every feature. Also set LayerNorm's learned scale to γ = 1 and shift to β = 0. A trained attention or feed-forward sublayer would usually produce nonuniform updates.
With Post-LN:

x_0 = [1.0, 1.0, 1.0, 1.0]

x_1 = LayerNorm(x_0 + 0.5) = LayerNorm([1.5, 1.5, 1.5, 1.5]) = [0.0, 0.0, 0.0, 0.0] (mean-centered, then scaled)

x_2 = LayerNorm(x_1 + 0.5) = LayerNorm([0.5, 0.5, 0.5, 0.5]) = [0.0, 0.0, 0.0, 0.0]

The constant offset is reset at every layer. This toy is intentionally stark: LayerNorm removes a uniform shift, so the residual stream doesn't preserve that offset.

With Pre-LN:

x_0 = [1.0, 1.0, 1.0, 1.0]

x_1 = x_0 + Sublayer(LayerNorm(x_0)) = [1.0, 1.0, 1.0, 1.0] + 0.5 = [1.5, 1.5, 1.5, 1.5]

x_2 = x_1 + Sublayer(LayerNorm(x_1)) = [1.5, 1.5, 1.5, 1.5] + 0.5 = [2.0, 2.0, 2.0, 2.0]

The running sum flows through the network because Pre-LN leaves the residual path outside the block normalization. The backward equation below shows why this placement includes an identity gradient route.
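The two traces above can be reproduced in a few lines. The sketch keeps the same toy assumptions: a sublayer that ignores its input and returns 0.5 for every feature (the name `constant_sublayer` is illustrative), with γ = 1 and β = 0.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [(v - mean) / scale for v in values]

def constant_sublayer(values):
    # Toy assumption from the trace: the update is 0.5 for every feature.
    return [0.5] * len(values)

def post_ln_step(values):
    summed = [v + u for v, u in zip(values, constant_sublayer(values))]
    return layer_norm(summed)

def pre_ln_step(values):
    update = constant_sublayer(layer_norm(values))
    return [v + u for v, u in zip(values, update)]

x0 = [1.0, 1.0, 1.0, 1.0]
post_x2 = post_ln_step(post_ln_step(x0))
pre_x2 = pre_ln_step(pre_ln_step(x0))

print("Post-LN after two blocks:", post_x2)
print("Pre-LN after two blocks:", pre_x2)
```

The ε term is what keeps the Post-LN path finite here: each normalized vector has zero variance, so the division is by sqrt(ε) applied to a zero numerator.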
```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    scale = sqrt(variance + eps)
    return [(value - mean) / scale for value in values]

def sublayer(values):
    # Tiny deterministic stand-in for attention or an FFN update.
    return [0.2 * value + 0.1 for value in values]

def pre_ln_step(values):
    return [value + update for value, update in zip(values, sublayer(layer_norm(values)))]

def post_ln_step(values):
    raw = [value + update for value, update in zip(values, sublayer(values))]
    return layer_norm(raw)

start = [1.0, 2.0, 4.0, 8.0]
pre_after_two = pre_ln_step(pre_ln_step(start))
post_after_two = post_ln_step(post_ln_step(start))

# Pre-LN lets the stream grow; Post-LN re-centers it after every block.
assert sum(abs(value) for value in pre_after_two) > sum(abs(value) for value in start)
assert abs(sum(post_after_two) / len(post_after_two)) < 1e-12
```

Post-LN gradient flow is like sending a package through a long sorting line where every station re-labels it before passing it on. Each relabeling step slightly changes the signal. After eighty stations, the original priority mark is hard to track. Pre-LN is like keeping a clean bypass label that travels beside the package while each station adds useful updates without disrupting the main path.
The key result from Xiong et al. is that Post-LN produces especially large expected gradients near the output layers at initialization under their analysis.[5] For one block, the backward Jacobian is:

$$\frac{\partial x_{l+1}}{\partial x_l} = J_{\mathrm{LN},l}\,\bigl(I + J_{F,l}\bigr)$$

where J_LN,l is the LayerNorm Jacobian and J_F,l is the sublayer Jacobian. Across many layers, the backward pass multiplies many such terms together, so every factor in that product, including the part contributed by the skip connection, carries a LayerNorm Jacobian.
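For Pre-LN, the Jacobian is:

$$\frac{\partial x_{l+1}}{\partial x_l} = I + J_{F,l}\,J_{\mathrm{LN},l}$$

That leading identity term I is the key difference. Every block includes a direct residual contribution to the gradient, so the backward signal avoids relying entirely on repeated normalization Jacobians. Xiong et al. report that, in their evaluated tasks, Pre-LN trained stably without the learning-rate warmup that their stable Post-LN runs needed, and that its gradients at initialization were better behaved.[5]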
This scalar toy isn't a training experiment. It only shows how putting a normalization factor on the identity path changes a product across blocks.
```python
depth = 12
j_layer_norm = 0.75
j_sublayer = 0.10

# Scalar stand-ins for J_LN (I + J_F) and I + J_F J_LN.
post_block = j_layer_norm * (1.0 + j_sublayer)
pre_block = 1.0 + j_sublayer * j_layer_norm
post_path = post_block ** depth
pre_path = pre_block ** depth

print("one block: post=", round(post_block, 3), "pre=", round(pre_block, 3))
print("twelve-block product: post=", round(post_path, 3), "pre=", round(pre_path, 3))
assert post_path < 1.0
assert pre_path > 1.0
```

```
one block: post= 0.825 pre= 1.075
twelve-block product: post= 0.099 pre= 2.382
```

Pre-LN isn't free of trade-offs. Analyses including ResiDual report that its residual stream can dominate newer sublayer updates as depth increases. Under that behavior, late layers make smaller relative changes to the hidden representation.[6]
This reported behavior is called representation collapse. Treat it as a design risk to measure, not a guarantee for every Pre-LN model or training run.
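A toy stream makes the "residual stream dominates later updates" idea measurable. The sketch assumes a hypothetical sublayer that returns half of its normalized input, so every update has about the same size while the stream keeps accumulating. Because all updates here point the same way, the effect is exaggerated compared with a trained model.

```python
from math import sqrt

def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = sqrt(variance + eps)
    return [(v - mean) / scale for v in values]

def l2_norm(values):
    return sqrt(sum(v * v for v in values))

def sublayer(values):
    # Hypothetical bounded update: the input is normalized, so the update size stays fixed.
    return [0.5 * v for v in values]

stream = [1.0, 2.0, 4.0, 8.0]
ratios = []
for _ in range(32):
    update = sublayer(layer_norm(stream))
    # Relative size of this block's update compared with the stream it is added to.
    ratios.append(l2_norm(update) / l2_norm(stream))
    stream = [v + u for v, u in zip(stream, update)]

print("update/stream ratio at block 1:", round(ratios[0], 3))
print("update/stream ratio at block 32:", round(ratios[-1], 3))
```

Tracking a ratio like this per block is one way to measure the risk on a real model instead of assuming it.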
Here is the practical summary:
| Property | Post-LN | Pre-LN |
|---|---|---|
| Warmup result in Xiong et al. | Needed in evaluated stable runs | Removed in evaluated stable runs |
| Initialization gradients in Xiong et al. | Large near output layers | Better behaved |
| Reported deep-layer risk | Gradient flow can become difficult | Residual stream can dominate later updates |
| Representative architecture | Original Transformer, BERT | GPT-2 |
There is no universal winner. Xiong et al. demonstrate the Pre-LN optimization advantage in their evaluated tasks, while DeepNorm later scales a Post-LN-derived design to a 1,000-layer machine-translation experiment using residual scaling and matching initialization.[5][7]
| Symptom | Likely cause | Fix |
|---|---|---|
| Model diverges at step 1 with loss = NaN | For a Post-LN run, an overly large early update is one hypothesis; Xiong et al. connect Post-LN initialization gradients to warmup sensitivity. | Inspect layerwise gradient norms and learning-rate schedule. Compare a warmer start, smaller initial rate, or a Pre-LN variant under the same setup. |
| One normalization placement underperforms another | Placement interacts with depth, initialization, optimizer, data, and objective. | Treat placement as an ablation, then compare training loss, gradient profile, and downstream quality under matched conditions. |
| A Pre-LN implementation differs sharply from a reference model | A final normalization used by the target Pre-LN architecture may be absent. | Check the architecture definition and restore its final LayerNorm or RMSNorm before the output projection when applicable.[5] |
| Attention becomes nearly one-hot during unstable training | Query-key dot products may have grown enough to saturate softmax. | Inspect attention-logit statistics; QK-Norm or softmax capping are targeted candidates to evaluate in such runs. |
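The last row can be illustrated with plain numbers. The sketch below shows softmax becoming nearly one-hot when logits grow, and how a capping function bounds them. It assumes the tanh form of soft-capping, cap * tanh(logit / cap); the cap value and the logit scale are illustrative choices, not recommendations.

```python
from math import exp, tanh

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    weights = [exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cap(logits, cap):
    # tanh soft-capping keeps every logit strictly inside (-cap, cap).
    return [cap * tanh(v / cap) for v in logits]

logits = [1.0, 2.0, 3.0]
large_logits = [10.0 * v for v in logits]  # hypothetical grown query-key products
cap = 15.0  # illustrative value

moderate = softmax(logits)
saturated = softmax(large_logits)
capped_logits = soft_cap(large_logits, cap)
capped = softmax(capped_logits)

print("max weight, moderate logits:", round(max(moderate), 3))
print("max weight, large logits:", round(max(saturated), 3))
print("max weight, capped logits:", round(max(capped), 3))
```

Capping preserves the ordering of the logits while limiting how far apart they can be, which limits how one-hot the attention weights can become.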
RMSNorm (Root Mean Square Layer Normalization) replaces LayerNorm's centered standard deviation with a root-mean-square scale. Zhang and Sennrich propose it as a cheaper normalization that retains rescaling invariance while removing mean-centering; their evaluated models achieve comparable quality to LayerNorm with run-time reductions that vary by setup.[8] Qwen2.5 and Gemma 2 reports provide later architecture examples using RMSNorm.[9][10]
Instead of centering (subtracting the mean) and then scaling (dividing by standard deviation), RMSNorm just divides by the root-mean-square. It doesn't subtract the mean, and many implementations also omit a learned bias term.
| Step | LayerNorm | RMSNorm |
|---|---|---|
| Compute mean | Yes | No |
| Subtract mean | Yes | No |
| Scale statistic | Centered standard deviation | Root mean square |
| Divide by scale statistic | Yes | Yes |
| Learned scale γ | Yes | Yes |
| Learned shift β | Yes | Often omitted |
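The table can be checked directly. In this sketch (γ = 1, no β, and the same ε in both functions, all assumptions made for the comparison), LayerNorm and RMSNorm agree on a zero-mean input, disagree once the input has a nonzero mean, and RMSNorm's output stays essentially unchanged when the input is rescaled.

```python
from math import isclose, sqrt

def layer_norm(values, eps=1e-6):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(variance + eps) for v in values]

def rms_norm(values, eps=1e-6):
    rms = sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / rms for v in values]

zero_mean = [1.0, -1.0, -3.0, 3.0]
shifted = [v + 2.0 for v in zero_mean]  # same spread, mean of 2.0

# With zero mean, variance equals the mean square, so the two outputs match.
same_on_zero_mean = all(
    isclose(a, b, abs_tol=1e-12)
    for a, b in zip(layer_norm(zero_mean), rms_norm(zero_mean))
)
# With a nonzero mean, only LayerNorm removes the offset.
differ_on_shifted = any(
    not isclose(a, b, abs_tol=1e-3)
    for a, b in zip(layer_norm(shifted), rms_norm(shifted))
)
# Rescaling the input leaves RMSNorm's output unchanged up to the effect of eps.
rescale_invariant = all(
    isclose(a, b, abs_tol=1e-6)
    for a, b in zip(rms_norm(shifted), rms_norm([10.0 * v for v in shifted]))
)
print(same_on_zero_mean, differ_on_shifted, rescale_invariant)
```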
The original RMSNorm paper reports comparable quality to LayerNorm on its evaluated tasks and observed run-time reductions from 7% to 64% across different models and implementations.[8] Fused kernels, hardware, and the fraction of time spent in normalization determine the result in another stack, so those measurements are evidence from the paper rather than a promised speedup.
```python
from math import sqrt

def rms_norm(values, weight=None, eps=1e-6):
    weight = weight or [1.0] * len(values)
    rms = sqrt(sum(value * value for value in values) / len(values) + eps)
    return [value * scale / rms for value, scale in zip(values, weight)]

values = [3.0, 1.0, -1.0, 5.0]
normalized = rms_norm(values)

# The output has unit RMS, but its mean is not forced to zero.
assert round(sqrt(sum(value * value for value in normalized) / len(normalized)), 6) == 1.0
assert round(sum(normalized) / len(normalized), 6) != 0.0
```

The same computation as a PyTorch module:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Quick sanity check
x = torch.tensor([[[3.0, 1.0, -1.0, 5.0]]])
rms = RMSNorm(d_model=4)
out = rms(x)
print(out.round(decimals=4).detach())
feature_rms = torch.sqrt(out.pow(2).mean(dim=-1))
assert torch.allclose(feature_rms, torch.ones_like(feature_rms), atol=1e-5)
assert not torch.allclose(out.mean(dim=-1), torch.zeros_like(out.mean(dim=-1)))
```

```
tensor([[[ 1.0000,  0.3333, -0.3333,  1.6667]]])
```

Once you understand Pre-LN and Post-LN, advanced variants become easier to read. These entries summarize specific reports; their experiments differ in model size, objective, and optimization setup.
| Recipe | Reported example | What changes |
|---|---|---|
| Post-LN | BERT | Normalize after the residual addition |
| Pre-LN | GPT-2 | Normalize before attention and FFN, retaining an identity residual contribution |
| Pre + Post norm | Gemma 2 | Normalize both input and output of each sublayer with RMSNorm[10] |
| DeepNorm | DeepNet study | Modify Post-LN with a scaled residual and matched initialization; evaluated up to 1,000 layers[7] |
| QK-Norm (Query-Key Normalization) | Rybakov et al. study | Normalize query and key vectors before their dot product; evaluated alone and with softmax capping[11] |
No row establishes a globally superior recipe. Peri-LN names an input-and-output normalization layout and reports experiments up to 3.2B parameters; Rybakov et al. evaluate attention-specific normalization and capping on an 830M-parameter model driven into instability with high learning rates.[12][11]
This illustration uses RMS rescaling to expose the mechanism: multiplying a query by twenty changes the raw dot-product logit, but rescaling both query and key by their RMS removes that scale-only change.
```python
from math import isclose, sqrt

def rms_rescale(values):
    rms = sqrt(sum(value * value for value in values) / len(values))
    return [value / rms for value in values]

def attention_logit(query, key):
    # Scaled dot product for a single query-key pair.
    return sum(q * k for q, k in zip(query, key)) / sqrt(len(query))

query = [6.0, -3.0, 2.0, 1.0]
key = [4.0, -2.0, 1.0, 3.0]
large_query = [20.0 * value for value in query]

raw = attention_logit(query, key)
raw_large = attention_logit(large_query, key)
controlled = attention_logit(rms_rescale(query), rms_rescale(key))
controlled_large = attention_logit(rms_rescale(large_query), rms_rescale(key))

print("raw logits:", round(raw, 3), round(raw_large, 3))
print("RMS-rescaled logits:", round(controlled, 3), round(controlled_large, 3))
assert raw_large == 20.0 * raw
assert isclose(controlled, controlled_large, rel_tol=1e-12)
```

```
raw logits: 17.5 350.0
RMS-rescaled logits: 1.807 1.807
```

Kim et al. use Peri-LN for normalization placed both before and after each sublayer, and report more balanced variance growth and steadier gradients than their compared layouts in experiments up to 3.2B parameters.[12] Gemma 2's technical report states that it applies RMSNorm to both the input and output of each transformer sublayer; this matches an input-and-output normalization layout, although that report uses no Peri-LN label.[10]
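An input-and-output normalization layout can be sketched as x + Norm(Sublayer(Norm(x))). That form is an assumption based on the description above; learned norm weights are omitted, and the "loud" sublayer is a hypothetical stand-in chosen to produce large outputs. The comparison shows how the output norm bounds the size of the update added to the residual stream.

```python
from math import sqrt

def rms_norm(values, eps=1e-6):
    rms = sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v / rms for v in values]

def rms(values):
    return sqrt(sum(v * v for v in values) / len(values))

def loud_sublayer(values):
    # Hypothetical sublayer whose output is far larger than its input.
    return [50.0 * v + 3.0 for v in values]

def input_norm_update(values):
    # Pre-norm style: only the sublayer input is normalized.
    return loud_sublayer(rms_norm(values))

def input_output_norm_update(values):
    # Normalize the sublayer input and its output before the residual addition.
    return rms_norm(loud_sublayer(rms_norm(values)))

x = [3.0, 1.0, -1.0, 5.0]
pre_update = input_norm_update(x)
peri_update = input_output_norm_update(x)
next_stream = [v + u for v, u in zip(x, peri_update)]

print("update RMS, input norm only:", round(rms(pre_update), 2))
print("update RMS, input and output norm:", round(rms(peri_update), 2))
```

In both layouts the skip path still carries x unchanged; the difference is only in how large the added update is allowed to be.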
A practical way to internalize the difference is to write one block that can switch between Post-LN and Pre-LN. Below is a minimal PyTorch module. Run it, initialize stacks of each type, and compare the gradient norms across blocks at initialization.
```python
import torch
import torch.nn as nn

class ToggleBlock(nn.Module):
    """Transformer block with style='pre' or style='post'."""
    def __init__(self, d_model: int, style: str = "pre"):
        super().__init__()
        assert style in ("pre", "post")
        self.style = style
        self.norm = nn.LayerNorm(d_model)
        # Simplified sublayer: two linear layers with a ReLU in between
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.style == "pre":
            # x_l + Sublayer(LayerNorm(x_l))
            return x + self.sublayer(self.norm(x))
        else:
            # LayerNorm(x_l + Sublayer(x_l))
            return self.norm(x + self.sublayer(x))

class TinyStack(nn.Module):
    def __init__(self, d_model: int = 64, depth: int = 6, style: str = "pre"):
        super().__init__()
        self.blocks = nn.ModuleList([ToggleBlock(d_model, style) for _ in range(depth)])
        # Pre-LN stacks get a final normalization before the output head.
        self.final_norm = nn.LayerNorm(d_model) if style == "pre" else nn.Identity()
        self.head = nn.Linear(d_model, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return self.head(self.final_norm(x))

def inspect_grad(style: str, d_model: int = 64, depth: int = 6, seed: int = 0):
    torch.manual_seed(seed)
    model = TinyStack(d_model=d_model, depth=depth, style=style)
    x = torch.randn(2, 8, d_model)
    target = torch.randn(2, 8, 16)
    out = model(x)
    loss = nn.functional.mse_loss(out, target)
    loss.backward()
    # Gradient norm of the first linear layer in each block, from input side to output side.
    grad_norms = [block.sublayer[0].weight.grad.norm().item() for block in model.blocks]
    rounded_norms = [round(value, 4) for value in grad_norms]
    print(f"{style:4} | loss={loss.item():.4f} | block grad norms={rounded_norms}")
    return grad_norms

pre_grad_norms = inspect_grad("pre")
post_grad_norms = inspect_grad("post")

assert len(pre_grad_norms) == len(post_grad_norms) == 6
assert all(value > 0 for value in pre_grad_norms + post_grad_norms)
```

```
pre  | loss=1.3247 | block grad norms=[0.2664, 0.2565, 0.2494, 0.2392, 0.2472, 0.2415]
post | loss=1.3183 | block grad norms=[0.2663, 0.2614, 0.2619, 0.2555, 0.2684, 0.2698]
```

The profiles differ across depth and seed. This toy is a diagnostic, not a proof of the Xiong et al. result. The real lesson is how to inspect layerwise gradients instead of treating a diverging model as a mystery.
- Change depth from 6 to 12. How do the per-block gradient profiles change?
- Replace final_norm with nn.Identity() in Pre-LN mode and inspect output scale and loss under several seeds.
- Replace nn.LayerNorm with the RMSNorm class from earlier. Does the gradient norm change meaningfully?

Symptom: An implementation of a reference Pre-LN model shows unexpected output-scale behavior. Cause: The final normalization specified by that architecture may have been removed. Fix: Match the reference architecture first; GPT-2 and the Pre-LN setup analyzed by Xiong et al. include a final normalization before prediction.
Symptom: You describe RMSNorm as if it still subtracts the mean and uses the same statistics. Cause: RMSNorm keeps scale control but removes centering. Fix: State the exact difference: LayerNorm centers and rescales; RMSNorm only rescales by root-mean-square magnitude.
Symptom: A Post-LN run diverges and you only blame model size or optimizer choice. Cause: Post-LN can create especially large top-layer gradients at initialization, even before later training dynamics matter. Fix: Tie warmup reasoning back to normalization placement and layerwise gradient scale, not depth alone.
Symptom: Attention logits are controlled, but another instability remains. Cause: QK-Norm targets query-key scale before softmax rather than every residual-path dynamic. Fix: Treat QK-Norm as an attention-specific intervention and separately evaluate the broader normalization layout.
[1] Layer Normalization. Ba, J. L., Kiros, J. R., & Hinton, G. E. · 2016
[2] Attention Is All You Need. Vaswani, A., et al. · 2017
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin, J., et al. · 2019 · NAACL 2019
[4] Language Models are Unsupervised Multitask Learners. Radford, A., et al. · 2019
[5] On Layer Normalization in the Transformer Architecture. Xiong, R., et al. · 2020 · ICML 2020
[6] ResiDual: Transformer with Dual Residual Connections. Xie, Z., et al. · 2023 · ACL 2023
[7] DeepNet: Scaling Transformers to 1,000 Layers. Wang, H., Ma, S., et al. · 2022
[8] Root Mean Square Layer Normalization. Zhang, B. & Sennrich, R. · 2019 · NeurIPS 2019
[9] Qwen2.5 Technical Report. Qwen Team · 2024
[10] Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team, Google DeepMind · 2024
[11] Methods of Improving LLM Training Stability. Rybakov, O., et al. · 2024
[12] Peri-LN: Revisiting Layer Normalization in the Transformer Architecture. Kim, J., et al. · 2025 · ICML 2025