Trace a support reply through masked attention, a decoder block, and next-token logits with readable NumPy and PyTorch code.
An autoencoder turned a return-photo patch into a compact representation. A Transformer must build a useful representation at every token position, because a support reply such as `Order shipped from` depends on what came earlier and must still predict what comes next.
This chapter follows a small decoder-only Transformer forward pass. We'll keep the dimensions tiny enough to print, then assemble the same operations in PyTorch. The next chapter will add the learning objective that trains these predictions.
The Transformer introduced in 2017 used an encoder and a decoder for translation. Its decoder combines masked self-attention, encoder-decoder attention, and a feed-forward network. A text-generation decoder-only model keeps the causal self-attention path and predicts the next token from preceding context.[1][2]
Use a simplified support-chat vocabulary. We won't claim these are IDs from a production tokenizer; we're choosing IDs deliberately so the forward pass is readable:
| Text so far | Token IDs | Desired continuation |
|---|---|---|
| `Order shipped from` | `[0, 1, 2]` | likely `hub` |

A model doesn't operate on token labels. It looks up one learned vector per ID, then adds or applies position information so `shipped from` differs from `from shipped`. In this first calculation, position is a small vector added to each token embedding.
```python
import numpy as np

tokens = ["Order", "shipped", "from"]
ids = np.array([0, 1, 2])
# One row per vocabulary ID; values are hand-picked for readability.
embedding_table = np.array([
    [1.0, 0.0, 0.4, 0.0],  # Order
    [0.2, 1.0, 0.5, 0.0],  # shipped
    [0.0, 0.3, 0.8, 1.0],  # from
    [0.1, 0.5, 1.0, 0.9],  # hub
])
# One row per token position.
position = np.array([
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.05],
    [0.10, 0.00, 0.00, 0.10],
])

# Look up each token's vector, then add its position vector.
x = embedding_table[ids] + position
print("tokens:", tokens)
print("ids shape:", ids.shape)
print("hidden shape:", x.shape)
print("last positioned vector:", np.round(x[-1], 2))
```

```text
tokens: ['Order', 'shipped', 'from']
ids shape: (3,)
hidden shape: (3, 4)
last positioned vector: [0.1 0.3 0.8 1.1]
```

Our tiny model uses three tokens and four features, so its hidden-state matrix has shape `[3, 4]`. Real models use far wider vectors and batches, but the axes mean the same thing:
| Axis | Tiny example | General name |
|---|---|---|
| Number of sequences | omitted | B (batch size) |
| Token positions | 3 | T (sequence length) |
| Features per token | 4 | d_model (model width) |
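To see the omitted batch axis appear, the sketch below reuses the embedding and position tables from the example above and looks up two sequences at once. The second sequence, `[2, 1, 0]`, is a made-up input chosen only to give the batch a second row.

```python
import numpy as np

embedding_table = np.array([
    [1.0, 0.0, 0.4, 0.0],  # Order
    [0.2, 1.0, 0.5, 0.0],  # shipped
    [0.0, 0.3, 0.8, 1.0],  # from
    [0.1, 0.5, 1.0, 0.9],  # hub
])
position = np.array([
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.05],
    [0.10, 0.00, 0.00, 0.10],
])

# Two sequences of equal length; the second one is a hypothetical input.
batch_ids = np.array([
    [0, 1, 2],
    [2, 1, 0],
])

# Indexing with a [B, T] ID array returns [B, T, d_model];
# the [T, d_model] position table broadcasts over the batch axis.
x = embedding_table[batch_ids] + position
print("batched hidden shape:", x.shape)
```

The first sequence's rows are identical to the unbatched result, because each sequence is looked up independently.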
An RNN transports information through one recurrent state. Self-attention takes a different route: at each position it computes relevance scores against other visible positions, then blends their value vectors. For a decoder, "visible" means the current and earlier tokens only.
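Each token produces three learned projections: a query (Q) that describes what the position is looking for, a key (K) that queries are matched against, and a value (V) that carries the content to be blended. All three come from the same hidden-state matrix `x`, using different learned weights `W_q`, `W_k`, and `W_v`.[1]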
To see the multiplication, take one two-feature attention head. These arrays stand in for outputs of its learned projections:
```python
import numpy as np

# Hand-written stand-ins for the outputs of the learned projections.
q = np.array([
    [1.0, 0.0],  # Order asks mainly for order context
    [0.5, 1.0],  # shipped asks for order and movement context
    [0.2, 1.2],  # from asks strongly for movement context
])
k = np.array([
    [1.0, 0.0],
    [0.4, 1.0],
    [0.0, 0.8],
])
v = np.array([
    [1.0, 0.0],
    [0.3, 1.0],
    [0.0, 0.6],
])

# Every query is compared with every key: [T, d_k] @ [d_k, T] -> [T, T].
raw_scores = q @ k.T
print("Q shape:", q.shape, "K shape:", k.shape, "V shape:", v.shape)
print("Q @ K.T shape:", raw_scores.shape)
print("raw score row for 'from':", np.round(raw_scores[2], 2))
```

```text
Q shape: (3, 2) K shape: (3, 2) V shape: (3, 2)
Q @ K.T shape: (3, 3)
raw score row for 'from': [0.2  1.28 0.96]
```

Three queries compared with three keys produce a `[3, 3]` matrix. The square `T x T` attention-score matrix is why full attention becomes expensive as context length grows: doubling `T` gives four times as many pair scores.
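To make the quadratic growth concrete, this small sketch counts score entries for a few context lengths. The lengths and the 4-byte float size are illustrative assumptions, and optimized attention kernels may never store the full matrix at once.

```python
# Count attention-score entries per head for a few hypothetical context lengths.
BYTES_PER_FLOAT32 = 4

def score_entries(tokens):
    """Number of query-key pairs in one full T x T score matrix."""
    return tokens * tokens

for tokens in (1_024, 2_048, 4_096):
    entries = score_entries(tokens)
    mib = entries * BYTES_PER_FLOAT32 / 2**20
    print(f"T={tokens:>5}: {entries:>12,} scores, about {mib:.0f} MiB per head in float32")
```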
Scaled dot-product attention divides query-key scores by the square root of the head width:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$
Here, d_k is the number of features in one key or query vector. Vaswani et al. explain that without scaling, larger dot products push softmax toward very small gradients when the key width is large.[1]
This calculation uses a 64-feature head and compares unscaled and scaled scores for one token:
```python
import numpy as np

def softmax(values):
    # Subtract the maximum first for numerical stability.
    shifted = values - values.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([12.4, 9.1, 3.8])
scaled = scores / np.sqrt(64)  # d_k = 64, so divide by 8

print("unscaled weights:", np.round(softmax(scores), 3))
print("scaled scores:", np.round(scaled, 3))
print("scaled weights:", np.round(softmax(scaled), 3))
```

```text
unscaled weights: [0.964 0.036 0.   ]
scaled scores: [1.55  1.138 0.475]
scaled weights: [0.499 0.33  0.17 ]
```

Without scaling, nearly all attention mass lands on one key. Scaling doesn't prevent a confident choice after training; it prevents the head width alone from making early scores overly sharp.
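The reason head width matters can be checked empirically. Vaswani et al. note that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance `d_k`.[1] The sketch below draws such random vectors (an idealized assumption; trained projections need not be unit-variance) and measures the spread of the dot products before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 10_000
spreads = {}

for d_k in (4, 64, 256):
    # Idealized queries and keys: independent, zero mean, unit variance.
    q = rng.standard_normal((samples, d_k))
    k = rng.standard_normal((samples, d_k))
    dots = (q * k).sum(axis=-1)  # one dot product per sampled pair
    raw_std = float(dots.std())
    scaled_std = float((dots / np.sqrt(d_k)).std())
    spreads[d_k] = (raw_std, scaled_std)
    print(f"d_k={d_k:>3}  raw std={raw_std:6.2f}  scaled std={scaled_std:.2f}")
```

The raw standard deviation should track `sqrt(d_k)`, while the scaled one stays near 1 for every head width.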
During training, we can process `Order shipped from hub` in one matrix operation. Position `from` is supposed to predict `hub`, so it must not read the `hub` representation first. A causal mask sets future scores to negative infinity before softmax; those entries receive probability zero.
| Query position | Can read |
|---|---|
| Order | Order |
| shipped | Order, shipped |
| from | Order, shipped, from |
| hub | all four positions |
```python
import numpy as np

scores = np.array([
    [2.0, 1.5, 0.5, 1.0],
    [1.0, 2.5, 1.5, 0.5],
    [0.5, 1.0, 3.0, 2.0],
    [1.5, 0.5, 1.0, 2.5],
])
# True above the diagonal: key positions that lie in a query's future.
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

# Row-wise softmax; exp(-inf) is exactly 0, so future keys get no weight.
shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print("masked weights:")
print(np.round(weights, 3))
print("future attention mass:", float(weights[future].sum()))
```

```text
masked weights:
[[1.    0.    0.    0.   ]
 [0.182 0.818 0.    0.   ]
 [0.067 0.111 0.821 0.   ]
 [0.213 0.078 0.129 0.579]]
future attention mass: 0.0
```

The important output is `future attention mass: 0.0`. If a causal language model reports nonzero future attention during training, it has a data-leak bug, not an impressive loss curve.
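The pieces so far (query-key scores, scaling, the causal mask, softmax, and the value blend) fit into one short function. This is a minimal single-head sketch that reuses the hand-written `q`, `k`, and `v` arrays from the earlier example; the function name `causal_attention` is our own label, not a library API.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # [T, T]
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)            # hide future keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v, weights                           # blend value vectors

q = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 1.2]])
k = np.array([[1.0, 0.0], [0.4, 1.0], [0.0, 0.8]])
v = np.array([[1.0, 0.0], [0.3, 1.0], [0.0, 0.6]])

mixed, weights = causal_attention(q, k, v)
print("mixed shape:", mixed.shape)
print("first row equals v[0]:", bool(np.allclose(mixed[0], v[0])))
print("weights for 'from':", np.round(weights[2], 3))
```

The first position can read only itself, so its output is exactly its own value vector. Later positions return a weighted average of the value vectors they are allowed to see.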
One attention head can learn one set of relevance patterns. Multi-head attention projects the model features into several heads, applies attention in parallel, concatenates their outputs, and projects back to d_model.[1]
Suppose our tiny model width is four and it has two heads. Each head owns two features. Splitting and merging must preserve every token position:
```python
import numpy as np

batch, tokens, d_model, heads = 1, 3, 4, 2
d_head = d_model // heads
projected = np.arange(batch * tokens * d_model).reshape(batch, tokens, d_model)

# Split: [B, T, d_model] -> [B, T, H, d_head] -> [B, H, T, d_head].
split = projected.reshape(batch, tokens, heads, d_head).transpose(0, 2, 1, 3)
# Merge: undo the transpose, then flatten the head axes back into d_model.
merged = split.transpose(0, 2, 1, 3).reshape(batch, tokens, d_model)

print("per-head shape:", split.shape)
print("merged shape:", merged.shape)
print("merge restores values:", bool(np.array_equal(projected, merged)))
```

```text
per-head shape: (1, 2, 3, 2)
merged shape: (1, 3, 4)
merge restores values: True
```

The head axis is parallel work inside one block. The block axis is sequential depth: output from block 1 enters block 2. Mixing these concepts leads to configuration and shape mistakes.
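A related shape mistake is worth reproducing once: reshaping straight to `[B, H, T, d_head]` without the transpose. The sketch below uses the same toy sizes as above and shows that the shape check passes while the contents are scrambled.

```python
import numpy as np

batch, tokens, d_model, heads = 1, 3, 4, 2
d_head = d_model // heads
projected = np.arange(batch * tokens * d_model).reshape(batch, tokens, d_model)

# Correct: split the feature axis first, then move the head axis forward.
correct = projected.reshape(batch, tokens, heads, d_head).transpose(0, 2, 1, 3)
# Bug: reshaping directly to [B, H, T, d_head] skips the transpose.
wrong = projected.reshape(batch, heads, tokens, d_head)

print("same shape:", correct.shape == wrong.shape)
print("same contents:", bool(np.array_equal(correct, wrong)))
print("head 0 (correct):", correct[0, 0].tolist())
print("head 0 (wrong):  ", wrong[0, 0].tolist())
```

In the correct split, head 0 holds the first two features of every token. In the buggy version, head 0 holds all four features of the first token plus half of the second, so a shape assertion alone would not catch the error.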
Attention lets token positions exchange information. A feed-forward network (FFN) then transforms each position independently, using the same weights at every position. The original Transformer used a two-layer FFN with a ReLU nonlinearity between the layers.[1]
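In the notation of the original paper, with weight matrices $W_1$ and $W_2$ and biases $b_1$ and $b_2$, the position-wise FFN applied to one token vector $x$ can be written as follows. The NumPy example in this section leaves out the biases to keep the arrays small.

$$
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
$$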
The first projection commonly widens the feature dimension; the second returns it to d_model, which keeps blocks stackable.
```python
import numpy as np

x = np.array([
    [1.0, 0.0, 0.4, 0.0],
    [0.2, 1.0, 0.5, 0.0],
    [0.0, 0.3, 0.8, 1.0],
])
# First projection widens each token vector from 4 to 6 features.
w1 = np.array([
    [1.0, 0.0, 0.5, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.5, 0.2, 0.0],
    [0.4, 0.2, 1.0, 0.0, 0.0, 0.5],
    [0.0, 0.2, 0.0, 1.0, 0.4, 0.0],
])
# Second projection returns from 6 features to d_model = 4.
w2 = np.array([
    [0.3, 0.0, 0.0, 0.1],
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.3],
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.1, 0.2, 0.0],
])

expanded = np.maximum(0.0, x @ w1)  # ReLU
ffn_output = expanded @ w2
print("input shape:", x.shape)
print("expanded shape:", expanded.shape)
print("returned shape:", ffn_output.shape)
print("last-token update:", np.round(ffn_output[-1], 3))
```

```text
input shape: (3, 4)
expanded shape: (3, 6)
returned shape: (3, 4)
last-token update: [0.302 0.468 0.386 0.469]
```

A residual connection adds a sublayer result back to its input. It creates an identity path through deep stacks, following the residual-network idea introduced by He et al.[3] Layer normalization computes statistics within each token vector; its learned scale and offset can then adjust the normalized output.[4]
There are multiple valid placements. The 2017 Transformer applied normalization after each residual addition. The diagram and PyTorch block below use pre-normalization: normalize first, compute the sublayer update, then add it to the residual stream. The two layouts share attention, FFN, and residual concepts; don't accidentally describe one while coding the other.[1] Xiong et al. connect Pre-LN with better-behaved gradients at initialization and show that it can train without learning-rate warmup in their evaluated tasks.[5]
```python
import numpy as np

def normalize(x, eps=1e-5):
    # Statistics are computed per token vector (last axis), not across tokens.
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

x = np.array([
    [1.0, 2.0, 0.0, 1.0],
    [0.2, 0.4, 0.8, 0.6],
])
normalized = normalize(x)
sublayer_update = 0.1 * normalized      # toy stand-in for attention or FFN
residual_output = x + sublayer_update   # pre-norm residual addition

print("normalized means:", np.round(normalized.mean(axis=-1), 6))
print("normalized variances:", np.round(normalized.var(axis=-1), 4))
print("residual output shape:", residual_output.shape)
```

```text
normalized means: [0. 0.]
normalized variances: [1.     0.9998]
residual output shape: (2, 4)
```
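The two placements can be put side by side in a few lines. This is a sketch under loud assumptions: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the normalization omits the learned scale and offset. It only illustrates where normalization sits, not training behavior.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return 0.1 * x

def post_ln_step(x):
    # 2017 layout: add the sublayer update, then normalize the sum.
    return layer_norm(x + toy_sublayer(x))

def pre_ln_step(x):
    # Pre-LN layout: normalize only the sublayer input;
    # the residual stream itself is never normalized.
    return x + toy_sublayer(layer_norm(x))

x = np.array([
    [10.0, 20.0, 0.0, 10.0],
    [2.0, 4.0, 8.0, 6.0],
])
post = post_ln_step(x)
pre = pre_ln_step(x)
print("post-LN row means:", np.round(post.mean(axis=-1), 3))
print("pre-LN row means: ", np.round(pre.mean(axis=-1), 3))
```

Post-LN re-centers every token vector after each sublayer, so the row means come out near zero. Pre-LN leaves the residual stream's own mean and scale intact and adds a normalized update on top, which is the identity path the residual connection is meant to preserve.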
Now implement a single causal multi-head self-attention layer. The projections are learned linear maps; the mask is created from token positions. masked_fill is the key line: it turns every future score into negative infinity before softmax.
```python
import math
import torch
from torch import nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=8, heads=2):
        super().__init__()
        assert d_model % heads == 0
        self.heads = heads
        self.d_head = d_model // heads
        # One linear layer produces Q, K, and V side by side.
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch, tokens, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(tensor):
            # [B, T, d_model] -> [B, H, T, d_head]
            return tensor.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # True above the diagonal marks future positions; build it on x's device.
        future = torch.triu(
            torch.ones(tokens, tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
        mixed = weights @ v
        # [B, H, T, d_head] -> [B, T, d_model]
        merged = mixed.transpose(1, 2).contiguous().view(batch, tokens, d_model)
        return self.out(merged), weights

torch.manual_seed(4)
x = torch.randn(1, 4, 8)
attention = CausalSelfAttention()
output, weights = attention(x)
future = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

print("attention output shape:", tuple(output.shape))
print("weights shape:", tuple(weights.shape))
print("largest future weight:", f"{weights[0, :, future].max().item():.6f}")
```

```text
attention output shape: (1, 4, 8)
weights shape: (1, 2, 4, 4)
largest future weight: 0.000000
```

The output shape matches the input shape, while the `[batch, heads, tokens, tokens]` weight tensor exposes each head's routing decisions.
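As a cross-check, the hand-written scores, mask, and softmax can be compared with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. This sketch assumes PyTorch 2.0 or newer, where that function is available, and uses random already-split `q`, `k`, and `v` tensors rather than the class above.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, tokens, d_head = 1, 2, 4, 4
q = torch.randn(batch, heads, tokens, d_head)
k = torch.randn(batch, heads, tokens, d_head)
v = torch.randn(batch, heads, tokens, d_head)

# Manual path: scale, mask future positions, softmax, blend values.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
future = torch.triu(torch.ones(tokens, tokens, dtype=torch.bool), diagonal=1)
manual = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1) @ v

# Built-in path: is_causal=True applies the same lower-triangular visibility.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("outputs match:", torch.allclose(manual, fused, atol=1e-5))
```

The built-in call returns only the mixed values, not the attention weights, so the explicit version remains the better choice when you want to inspect each head's routing.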
With attention implemented, a small pre-normalized block is short: normalize, attend, add; normalize, run FFN, add. PyTorch's LayerNorm includes the learned scale and offset that the earlier NumPy calculation omitted. The block below uses GELU instead of the original Transformer's ReLU, so you can see that the activation choice can change while the FFN's structural role stays the same: widen, transform, and return to d_model.
```python
import math
import torch
from torch import nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=8, heads=2):
        super().__init__()
        assert d_model % heads == 0
        self.heads = heads
        self.d_head = d_model // heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch, tokens, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # [B, T, d_model] -> [B, H, T, d_head]
        q = q.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        k = k.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        v = v.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.triu(
            torch.ones(tokens, tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        joined = (weights @ v).transpose(1, 2).contiguous().view(batch, tokens, d_model)
        return self.out(joined)

class DecoderBlock(nn.Module):
    def __init__(self, d_model=8, heads=2, d_ff=24):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attention = CausalSelfAttention(d_model, heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Pre-norm: normalize, compute the update, add it to the residual stream.
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

torch.manual_seed(5)
x = torch.randn(2, 4, 8, requires_grad=True)
block = DecoderBlock()
output = block(x)
# Backpropagate a dummy loss to confirm gradients reach the input.
output.square().mean().backward()

print("input -> output:", tuple(x.shape), "->", tuple(output.shape))
print("input gradient is finite:", bool(torch.isfinite(x.grad).all()))
```

```text
input -> output: (2, 4, 8) -> (2, 4, 8)
input gradient is finite: True
```

This block doesn't generate a word yet. It produces contextual vectors with the same `[B, T, d_model]` shape as its input, so another block can consume them.
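Because the shape is preserved, blocks can be chained, and the causal guarantee should survive the whole stack. The sketch below checks both. To stay short it swaps in PyTorch's built-in `nn.MultiheadAttention` for the hand-written attention class, so `CompactPreLNBlock` is a hypothetical variant of the block above, not the same module; the depth of three is arbitrary.

```python
import torch
from torch import nn

class CompactPreLNBlock(nn.Module):
    """Hypothetical compact block: same pre-norm layout, built-in attention."""

    def __init__(self, d_model=8, heads=2, d_ff=24):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attention = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        tokens = x.shape[1]
        # For a boolean attn_mask, True marks positions a query may NOT read.
        future = torch.triu(
            torch.ones(tokens, tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attended, _ = self.attention(h, h, h, attn_mask=future, need_weights=False)
        x = x + attended
        return x + self.ffn(self.norm2(x))

torch.manual_seed(6)
stack = nn.Sequential(*[CompactPreLNBlock() for _ in range(3)])

x = torch.randn(2, 4, 8)
changed = x.clone()
changed[:, -1] += 1.0  # perturb only the last token

with torch.no_grad():
    out = stack(x)
    out_changed = stack(changed)

print("stack output shape:", tuple(out.shape))
print(
    "earlier positions unchanged:",
    torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5),
)
```

Perturbing the final token may change the final position's output, but every earlier position should be unaffected, because no layer in the stack lets information flow from right to left.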
After the final block, many pre-normalized decoder stacks apply one final normalization before the output head. For example, GPT-2 applies its final ln_f immediately before computing logits.[6] A learned output matrix then converts each hidden vector into one raw score per vocabulary item. These scores are logits. During generation, use only the final position's scores because that position represents all text currently visible to the decoder. The small calculation below starts from an already normalized final-position vector.
```python
import numpy as np

vocab = np.array(["hub", "today", "late", "refund", "carrier", "warehouse"])
last_hidden_normalized = np.array([0.7, -0.2, 0.5, 0.1])
# Output matrix: one column of weights per vocabulary item.
output_weight = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.8, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0],
])

logits = last_hidden_normalized @ output_weight  # one raw score per token
probabilities = np.exp(logits - logits.max())    # stable softmax
probabilities /= probabilities.sum()
best = probabilities.argmax()

print("logits shape:", logits.shape)
print("top token:", vocab[best])
print("top probability:", f"{probabilities[best]:.3f}")
```

```text
logits shape: (6,)
top token: hub
top probability: 0.360
```

The result `hub` completes `Order shipped from hub`. Conceptually, generate a longer support answer by appending that token, running the stack with the longer context, and choosing the next one. Optimized serving reuses earlier attention keys and values through a KV cache instead of recomputing them for every new token. The next chapter turns this forward pass into a trained language model.
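The append-and-rerun loop can be sketched in a few lines. Everything model-like here is an assumption: `toy_forward` is a hypothetical stand-in for the decoder stack (a causal running mean of random, untrained embeddings), so the generated IDs are meaningless. Only the control flow is the point, and there is no KV cache: the full context is recomputed at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 4
# Hypothetical untrained weights; a real model would learn these.
embedding = rng.normal(size=(vocab_size, d_model))
output_weight = rng.normal(size=(d_model, vocab_size))

def toy_forward(ids):
    """Stand-in for the decoder stack: returns logits of shape [T, vocab]."""
    hidden = embedding[ids]                        # [T, d_model]
    counts = np.arange(1, len(ids) + 1)[:, None]
    context = np.cumsum(hidden, axis=0) / counts   # position t averages tokens 0..t
    return context @ output_weight

def greedy_generate(prompt_ids, new_tokens):
    ids = list(prompt_ids)
    for _ in range(new_tokens):
        logits = toy_forward(np.array(ids))
        # Only the final position predicts the next token.
        ids.append(int(logits[-1].argmax()))
    return ids

generated = greedy_generate([0, 1, 2], new_tokens=3)
print("prompt + generated ids:", generated)
```

Greedy `argmax` is the simplest selection rule; sampling from the softmax distribution is a common alternative that this sketch leaves out.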
Decoder-only generation is one member of the Transformer family. Architecture choice is largely a question of which positions may read which information:
| Variant | Visibility rule | Suitable lesson example | Representative source |
|---|---|---|---|
| Encoder-only | Each input position can read both left and right context | Classify a return request after reading the full ticket | BERT[7] |
| Decoder-only | Each position can read only itself and prior tokens | Generate an order-status response left to right | GPT[2] |
| Encoder-decoder | Encoder reads source; decoder reads source plus prior output tokens | Translate a merchant return policy | T5[8] |
In self-attention, queries, keys, and values come from one sequence. In encoder-decoder cross-attention, decoder queries read keys and values produced by the encoder. A plain decoder-only text generator has no separate encoder sequence, so its core path is causal self-attention.
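The difference shows up directly in the shapes. In this sketch the decoder queries and encoder keys and values are random placeholders with made-up lengths (three target positions, five source positions); the point is that the score matrix is `[T_target, T_source]` rather than square, and that no causal mask is applied over the source.

```python
import numpy as np

rng = np.random.default_rng(1)
source_len, target_len, d_k = 5, 3, 2  # illustrative sizes

# Placeholders for projected vectors: queries from the decoder,
# keys and values from the encoder output.
decoder_q = rng.standard_normal((target_len, d_k))
encoder_k = rng.standard_normal((source_len, d_k))
encoder_v = rng.standard_normal((source_len, d_k))

scores = decoder_q @ encoder_k.T / np.sqrt(d_k)   # [target_len, source_len]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over source positions
mixed = weights @ encoder_v                       # [target_len, d_k]

print("cross-attention weights shape:", weights.shape)
print("mixed shape:", mixed.shape)
```

Every decoder position may read the entire source because the source sequence is fully known before generation starts; the causal mask is still needed in the decoder's own self-attention.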
The fastest way to debug a Transformer is to ask what invariant was violated:
| Symptom | Likely cause | Check |
|---|---|---|
| Training loss looks suspiciously tiny, but generation fails | Future tokens leaked through attention | Assert masked future attention mass is zero |
| Head split crashes or silently produces the wrong shape | d_model isn't divisible by n_heads, or axes were transposed incorrectly | Check [B, H, T, d_head] then merge back to [B, T, d_model] |
| Attention works on CPU but fails after moving the block to a GPU | Causal mask stayed on CPU while scores moved to the accelerator | Create the mask on x.device |
| A block can't be stacked with another block | FFN or output projection changed model width too early | Keep block output at d_model; project to vocabulary only at the end |
| Explanation says decoder-only model reads an encoder | Self-attention and cross-attention were confused | Identify whether a separate source sequence exists |
Use a deliberately bad mask to make leakage visible:
```python
import numpy as np

scores = np.array([
    [2.0, 5.0, 4.0],
    [1.0, 2.0, 6.0],
    [0.5, 1.0, 2.0],
])
future = np.triu(np.ones_like(scores, dtype=bool), k=1)

def softmax_rows(matrix):
    exp = np.exp(matrix - matrix.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

unmasked = softmax_rows(scores)                               # the bug: no mask
masked = softmax_rows(np.where(future, -np.inf, scores))      # the fix
print("unmasked future mass:", f"{unmasked[future].sum():.3f}")
print("masked future mass:", f"{masked[future].sum():.3f}")
```

```text
unmasked future mass: 1.940
masked future mass: 0.000
```

If that second value isn't zero in a causal decoder, fix the mask before trusting any metric.
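The same check can be wrapped in a small helper and dropped into a test suite. This is a sketch: `assert_causal` is a hypothetical name, and it assumes attention weights whose last two axes are `[T, T]`, with any number of leading batch or head axes.

```python
import numpy as np

def assert_causal(weights, atol=1e-8):
    """Raise if any attention mass sits above the diagonal of the last two axes."""
    tokens = weights.shape[-1]
    future = np.triu(np.ones((tokens, tokens), dtype=bool), k=1)
    leaked = float(np.abs(weights[..., future]).sum())
    if leaked > atol:
        raise AssertionError(f"future attention mass {leaked:.3f} exceeds {atol}")

# A valid causal pattern: uniform attention over visible positions only.
causal = np.tril(np.ones((2, 2, 3, 3)))
causal /= causal.sum(axis=-1, keepdims=True)
assert_causal(causal)

# A leaking pattern: uniform attention over all positions.
uniform = np.full((2, 2, 3, 3), 1 / 3)
leak_detected = False
try:
    assert_causal(uniform)
except AssertionError as error:
    leak_detected = True
    print("caught:", error)
```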
After this chapter, you should be able to name the axes of `[B, T, d_model]`, trace `Order shipped from` from token IDs through hidden states to a next-token distribution, and explain why a causal mask hides `hub` from the `from` position during training.
[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[3] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016.
[4] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization.
[5] Xiong, R., et al. (2020). On Layer Normalization in the Transformer Architecture. ICML 2020.
[6] OpenAI (2019). GPT-2 Source Implementation.
[7] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[8] Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.