The Transformer Architecture End-to-End

Trace a support reply through masked attention, a decoder block, and next-token logits with readable NumPy and PyTorch code.


An autoencoder turned a return-photo patch into a compact representation. A Transformer must build a useful representation at every token position, because a support reply such as "Order shipped from" depends on what came earlier and must still predict what comes next.

This chapter follows a small decoder-only Transformer forward pass. We'll keep the dimensions tiny enough to print, then assemble the same operations in PyTorch. The next chapter will add the learning objective that trains these predictions.

Figure: Decoder-only Transformer step showing a three-token feature matrix, a causal attention heatmap inside repeated decoder blocks, a same-shaped contextual matrix, vocabulary logits with hub highest, and hub appended to form a four-token context.
Masked blocks preserve one feature row per token while refining each row with visible context. The final row scores the vocabulary; appending the selected token grows the matrix for the next step.

The Transformer introduced in 2017 used an encoder and a decoder for translation. Its decoder combines masked self-attention, encoder-decoder attention, and a feed-forward network. A text-generation decoder-only model keeps the causal self-attention path and predicts the next token from preceding context.[1][2]

Tokens become positioned vectors

Use a simplified support-chat vocabulary. We won't claim these are IDs from a production tokenizer; we're choosing IDs deliberately so the forward pass is readable:

| Text so far | Token IDs | Desired continuation |
| --- | --- | --- |
| Order shipped from | [0, 1, 2] | likely hub |

A model doesn't operate on token labels. It looks up one learned vector per ID, then adds or applies position information so that "shipped from" differs from "from shipped". In this first calculation, position is a small vector added to each token embedding.

positioned-token-vectors.py
import numpy as np

tokens = ["Order", "shipped", "from"]
ids = np.array([0, 1, 2])
embedding_table = np.array([
    [1.0, 0.0, 0.4, 0.0],  # Order
    [0.2, 1.0, 0.5, 0.0],  # shipped
    [0.0, 0.3, 0.8, 1.0],  # from
    [0.1, 0.5, 1.0, 0.9],  # hub
])
position = np.array([
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.05],
    [0.10, 0.00, 0.00, 0.10],
])

x = embedding_table[ids] + position
print("tokens:", tokens)
print("ids shape:", ids.shape)
print("hidden shape:", x.shape)
print("last positioned vector:", np.round(x[-1], 2))

Output
tokens: ['Order', 'shipped', 'from']
ids shape: (3,)
hidden shape: (3, 4)
last positioned vector: [0.1 0.3 0.8 1.1]
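
To see why position information matters, the sketch below compares "shipped from" with "from shipped", with and without the added position rows. It reuses the embedding and position values from the listing above; the `embed` helper and its `use_position` flag are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Embedding rows and position rows copied from the toy tables above.
embedding = {
    "shipped": np.array([0.2, 1.0, 0.5, 0.0]),
    "from": np.array([0.0, 0.3, 0.8, 1.0]),
}
position = np.array([
    [0.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00, 0.05],
])

def embed(words, use_position):
    """Hypothetical helper: look up one row per word, optionally add position."""
    rows = np.stack([embedding[word] for word in words])
    if use_position:
        rows = rows + position[: len(words)]
    return rows

a = embed(["shipped", "from"], use_position=False)
b = embed(["from", "shipped"], use_position=False)
a_pos = embed(["shipped", "from"], use_position=True)
b_pos = embed(["from", "shipped"], use_position=True)

# Without positions the two orders hold the same rows, only reordered.
print("same rows without position:", bool(np.allclose(a, b[::-1])))
# With positions each word's vector depends on where it sits.
print("same rows with position:", bool(np.allclose(a_pos, b_pos[::-1])))
```

Without the position rows, the two word orders are the same set of vectors in a different order, so the model has nothing in the vectors themselves that records which word came first.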

Our tiny model uses three tokens and four features, so its hidden-state matrix has shape [3, 4]. Real models use far wider vectors and batches, but the axes mean the same thing:

| Axis | Tiny example | General name |
| --- | --- | --- |
| Number of sequences | omitted | B (batch size) |
| Token positions | 3 | T (sequence length) |
| Features per token | 4 | d_model (model width) |
Diagram: positioned vectors for "Order shipped from" [T, d_model], N masked blocks [T, d_model], final norm + head, last-position logits [vocab], and next token "hub".

Attention lets a token read earlier tokens

An RNN transports information through one recurrent state. Self-attention takes a different route: at each position it computes relevance scores against other visible positions, then blends their value vectors. For a decoder, "visible" means the current and earlier tokens only.

Each token produces three learned projections:

  • A query asks what information this position needs.
  • A key advertises what a position contains.
  • A value supplies information if the query matches its key.

The projections come from the same hidden-state matrix x, using different learned weights W_q, W_k, and W_v.[1]
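
As a minimal sketch of those three projections, the snippet below multiplies one hidden-state matrix by three separate weight matrices. The random weights and the names `W_q`, `W_k`, and `W_v` are stand-ins for learned parameters, and the sizes are assumptions chosen to stay printable.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_model, d_head = 3, 4, 2           # three tokens, model width 4, one 2-feature head
x = rng.normal(size=(T, d_model))      # stand-in for positioned hidden states

# Hypothetical projection weights; in a trained model these are learned.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

# The same input produces three different views of every token.
q, k, v = x @ W_q, x @ W_k, x @ W_v

print("x shape:", x.shape)
print("q, k, v shapes:", q.shape, k.shape, v.shape)
```

Each token keeps its own row in `q`, `k`, and `v`; only the feature axis changes from `d_model` to the head width.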

To see the multiplication, take one two-feature attention head. These arrays stand in for outputs of its learned projections:

query-key-scores.py
import numpy as np

q = np.array([
    [1.0, 0.0],  # Order asks mainly for order context
    [0.5, 1.0],  # shipped asks for order and movement context
    [0.2, 1.2],  # from asks strongly for movement context
])
k = np.array([
    [1.0, 0.0],
    [0.4, 1.0],
    [0.0, 0.8],
])
v = np.array([
    [1.0, 0.0],
    [0.3, 1.0],
    [0.0, 0.6],
])

raw_scores = q @ k.T
print("Q shape:", q.shape, "K shape:", k.shape, "V shape:", v.shape)
print("Q @ K.T shape:", raw_scores.shape)
print("raw score row for 'from':", np.round(raw_scores[2], 2))

Output
Q shape: (3, 2) K shape: (3, 2) V shape: (3, 2)
Q @ K.T shape: (3, 3)
raw score row for 'from': [0.2  1.28 0.96]

Three queries compared with three keys produce a [3, 3] matrix. The square T x T attention-score matrix is why full attention becomes expensive as context length grows: doubling T gives four times as many pair scores.
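
The growth is easy to check with plain arithmetic. The helper names below are hypothetical; the sketch counts the full T x T score matrix for one head, plus the entries a causal mask leaves visible, which is roughly half but still grows quadratically.

```python
def pair_scores(tokens: int) -> int:
    """Query-key scores one head computes for full attention over `tokens` positions."""
    return tokens * tokens

def causal_visible(tokens: int) -> int:
    """Entries on or below the diagonal, i.e. scores a causal mask leaves visible."""
    return tokens * (tokens + 1) // 2

for tokens in (1_000, 2_000, 4_000):
    print(
        f"{tokens} tokens -> {pair_scores(tokens):,} scores, "
        f"{causal_visible(tokens):,} visible under a causal mask"
    )
```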

Scale before softmax

Scaled dot-product attention divides query-key scores by the square root of the head width:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, d_k is the number of features in one key or query vector. Vaswani et al. explain that without scaling, larger dot products push softmax toward very small gradients when the key width is large.[1]

This calculation uses a 64-feature head and compares unscaled and scaled scores for one token:

why-scale-attention.py
import numpy as np

def softmax(values):
    shifted = values - values.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([12.4, 9.1, 3.8])
scaled = scores / np.sqrt(64)

print("unscaled weights:", np.round(softmax(scores), 3))
print("scaled scores:", np.round(scaled, 3))
print("scaled weights:", np.round(softmax(scaled), 3))

Output
unscaled weights: [0.964 0.036 0.   ]
scaled scores: [1.55  1.138 0.475]
scaled weights: [0.499 0.33  0.17 ]

Without scaling, nearly all attention mass lands on one key. Scaling doesn't prevent a confident choice after training; it prevents the head width alone from making early scores overly sharp.
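
The effect of head width can be checked empirically. The sketch below assumes query and key features are independent draws with mean 0 and variance 1, the same assumption Vaswani et al. use to motivate the scaling; under it, the raw dot product has a standard deviation near the square root of d_k, while the scaled score stays near 1. The `score_std` helper and the sample count are illustrative choices.

```python
import numpy as np

def score_std(d_k, samples=5_000, seed=0):
    """Return (raw std, scaled std) of random query-key dot products of width d_k."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(samples, d_k))
    k = rng.normal(size=(samples, d_k))
    raw = (q * k).sum(axis=-1)             # one dot product per sampled pair
    return raw.std(), (raw / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:>3}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

Trained projections are not random draws, so treat this as intuition for initialization rather than a statement about a trained model.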

The causal mask prevents answer leakage

During training, we can process "Order shipped from hub" in one matrix operation. The "from" position is supposed to predict "hub", so it must not read the "hub" representation first. A causal mask sets future scores to negative infinity before softmax; those entries receive probability zero.

| Query position | Can read |
| --- | --- |
| Order | Order |
| shipped | Order, shipped |
| from | Order, shipped, from |
| hub | all four positions |

causal-mask.py
import numpy as np

scores = np.array([
    [2.0, 1.5, 0.5, 1.0],
    [1.0, 2.5, 1.5, 0.5],
    [0.5, 1.0, 3.0, 2.0],
    [1.5, 0.5, 1.0, 2.5],
])
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print("masked weights:")
print(np.round(weights, 3))
print("future attention mass:", float(weights[future].sum()))

Output
masked weights:
[[1.    0.    0.    0.   ]
 [0.182 0.818 0.    0.   ]
 [0.067 0.111 0.821 0.   ]
 [0.213 0.078 0.129 0.579]]
future attention mass: 0.0

The important output is future attention mass: 0.0. If a causal language model reports nonzero future attention during training, it has a data-leak bug, not an impressive loss curve.

Heads run in parallel; blocks run in sequence

One attention head can learn one set of relevance patterns. Multi-head attention projects the model features into several heads, applies attention in parallel, concatenates their outputs, and projects back to d_model.[1]

Suppose our tiny model width is four and it has two heads. Each head owns two features. Splitting and merging must preserve every token position:

split-and-merge-heads.py
import numpy as np

batch, tokens, d_model, heads = 1, 3, 4, 2
d_head = d_model // heads
projected = np.arange(batch * tokens * d_model).reshape(batch, tokens, d_model)

split = projected.reshape(batch, tokens, heads, d_head).transpose(0, 2, 1, 3)
merged = split.transpose(0, 2, 1, 3).reshape(batch, tokens, d_model)

print("per-head shape:", split.shape)
print("merged shape:", merged.shape)
print("merge restores values:", bool(np.array_equal(projected, merged)))

Output
per-head shape: (1, 2, 3, 2)
merged shape: (1, 3, 4)
merge restores values: True

The head axis is parallel work inside one block. The block axis is sequential depth: output from block 1 enters block 2. Mixing these concepts leads to configuration and shape mistakes.
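
Putting the pieces together, here is a minimal NumPy sketch of one multi-head causal attention layer for a single sequence: project, split into heads, score, mask, mix, merge, and project back. The function name, the random weights, and the absence of a batch axis are simplifying assumptions for illustration, not a reference implementation.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_causal_attention(x, w_q, w_k, w_v, w_o, heads):
    """x: [T, d_model]. Returns a [T, d_model] update from `heads` parallel heads."""
    tokens, d_model = x.shape
    d_head = d_model // heads

    def split(t):  # [T, d_model] -> [H, T, d_head]
        return t.reshape(tokens, heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)            # [H, T, T]
    future = np.triu(np.ones((tokens, tokens), dtype=bool), k=1)   # True above the diagonal
    weights = softmax(np.where(future, -np.inf, scores))           # mask broadcasts over heads
    mixed = weights @ v                                            # [H, T, d_head]
    merged = mixed.transpose(1, 0, 2).reshape(tokens, d_model)     # concatenate heads
    return merged @ w_o                                            # project back to d_model

rng = np.random.default_rng(1)
tokens, d_model, heads = 3, 4, 2
x = rng.normal(size=(tokens, d_model))
# Hypothetical random weights standing in for learned parameters.
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_causal_attention(x, w_q, w_k, w_v, w_o, heads)
print("input shape:", x.shape, "output shape:", out.shape)
```

Because the output keeps the `[T, d_model]` shape, a second block could consume it directly, which is the sequential depth described above.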

A block alternates communication and local transformation

Attention lets token positions exchange information. A feed-forward network (FFN) then transforms each position independently, using the same weights at every position. The original Transformer used a two-layer FFN with a ReLU nonlinearity between the layers.[1]

$$\operatorname{FFN}(x) = \operatorname{ReLU}(xW_1 + b_1)W_2 + b_2$$

The first projection commonly widens the feature dimension; the second returns it to d_model, which keeps blocks stackable.

position-wise-ffn.py
import numpy as np

x = np.array([
    [1.0, 0.0, 0.4, 0.0],
    [0.2, 1.0, 0.5, 0.0],
    [0.0, 0.3, 0.8, 1.0],
])
w1 = np.array([
    [1.0, 0.0, 0.5, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.5, 0.2, 0.0],
    [0.4, 0.2, 1.0, 0.0, 0.0, 0.5],
    [0.0, 0.2, 0.0, 1.0, 0.4, 0.0],
])
w2 = np.array([
    [0.3, 0.0, 0.0, 0.1],
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.3],
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.1, 0.2, 0.0],
])

# The biases b1 and b2 from the formula are omitted (treated as zero) to keep the arithmetic short.
expanded = np.maximum(0.0, x @ w1)  # widen to 6 features, then apply ReLU
ffn_output = expanded @ w2          # return to d_model = 4
print("input shape:", x.shape)
print("expanded shape:", expanded.shape)
print("returned shape:", ffn_output.shape)
print("last-token update:", np.round(ffn_output[-1], 3))

Output
input shape: (3, 4)
expanded shape: (3, 6)
returned shape: (3, 4)
last-token update: [0.302 0.468 0.386 0.469]
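
"Each position independently" can be verified directly: running the FFN on the whole matrix should match running it on one row at a time. The sketch below uses hypothetical random weights and includes the bias terms from the formula.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Two-layer position-wise feed-forward network with ReLU."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 4, 6
# Hypothetical random parameters standing in for learned weights.
w1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
x = rng.normal(size=(3, d_model))

together = ffn(x, w1, b1, w2, b2)                               # all positions at once
one_by_one = np.stack([ffn(row, w1, b1, w2, b2) for row in x])  # one position at a time

print("same result:", bool(np.allclose(together, one_by_one)))
```

No row's output depends on any other row, so all cross-token communication in a block has to come from attention.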

Residual paths and normalization

A residual connection adds a sublayer result back to its input. It creates an identity path through deep stacks, following the residual-network idea introduced by He et al.[3] Layer normalization computes statistics within each token vector; its learned scale and offset can then adjust the normalized output.[4]

There are multiple valid placements. The 2017 Transformer applied normalization after each residual addition. The diagram and PyTorch block below use pre-normalization: normalize first, compute the sublayer update, then add it to the residual stream. The two layouts share attention, FFN, and residual concepts; don't accidentally describe one while coding the other.[1] Xiong et al. connect Pre-LN with better-behaved gradients at initialization and show that it can train without learning-rate warmup in their evaluated tasks.[5]

pre-normalized-residual.py
import numpy as np

def normalize(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

x = np.array([
    [1.0, 2.0, 0.0, 1.0],
    [0.2, 0.4, 0.8, 0.6],
])
normalized = normalize(x)
sublayer_update = 0.1 * normalized
residual_output = x + sublayer_update

print("normalized means:", np.round(normalized.mean(axis=-1), 6))
print("normalized variances:", np.round(normalized.var(axis=-1), 4))
print("residual output shape:", residual_output.shape)

Output
normalized means: [0. 0.]
normalized variances: [1.     0.9998]
residual output shape: (2, 4)
Figure: Pre-normalized Transformer block drawn as a same-shaped residual stream with two explicit bypass paths: one branch applies LayerNorm and masked attention before an add, and the next applies LayerNorm and a per-token feed-forward network before a second add.
Each pre-LN branch reads a normalized copy of the residual stream, computes an update, and merges it back with an addition. The untouched bypass keeps the tensor shape and identity path available through both sublayers.
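
The two normalization placements differ only in operation order, which a short sketch makes concrete. The `sublayer` function is a hypothetical stand-in for attention or the FFN, and the normalization omits the learned scale and offset.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return 0.5 * x + 1.0

def post_ln_step(x):
    return layer_norm(x + sublayer(x))   # 2017 layout: add, then normalize

def pre_ln_step(x):
    return x + sublayer(layer_norm(x))   # pre-LN layout: normalize, update, add

x = np.array([[4.0, 8.0, 0.0, 4.0]])
post = post_ln_step(x)
pre = pre_ln_step(x)

print("post-LN row mean/std:", np.round(post.mean(), 4), np.round(post.std(), 4))
print("pre-LN row mean/std: ", np.round(pre.mean(), 4), np.round(pre.std(), 4))
```

In the post-LN step the whole residual stream is renormalized after every addition. In the pre-LN step only the branch input is normalized, so the stream itself keeps the scale of `x` plus the update.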

Implement masked attention in PyTorch

Now implement a single causal multi-head self-attention layer. The projections are learned linear maps; the mask is created from token positions. masked_fill is the key line: it turns every future score into negative infinity before softmax.

causal-self-attention.py
import math
import torch
from torch import nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=8, heads=2):
        super().__init__()
        assert d_model % heads == 0
        self.heads = heads
        self.d_head = d_model // heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch, tokens, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split_heads(tensor):
            return tensor.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        q, k, v = map(split_heads, (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        future = torch.triu(
            torch.ones(tokens, tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
        mixed = weights @ v
        merged = mixed.transpose(1, 2).contiguous().view(batch, tokens, d_model)
        return self.out(merged), weights

torch.manual_seed(4)
x = torch.randn(1, 4, 8)
attention = CausalSelfAttention()
output, weights = attention(x)
future = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

print("attention output shape:", tuple(output.shape))
print("weights shape:", tuple(weights.shape))
print("largest future weight:", f"{weights[0, :, future].max().item():.6f}")

Output
attention output shape: (1, 4, 8)
weights shape: (1, 2, 4, 4)
largest future weight: 0.000000

The output shape matches the input shape, while the [batch, heads, tokens, tokens] weight tensor exposes each head's routing decisions.

Assemble one decoder block

With attention implemented, a small pre-normalized block is short: normalize, attend, add; normalize, run FFN, add. PyTorch's LayerNorm includes the learned scale and offset that the earlier NumPy calculation omitted. The block below uses GELU instead of the original Transformer's ReLU, so you can see that the activation choice can change while the FFN's structural role stays the same: widen, transform, and return to d_model.

decoder-block.py
import math
import torch
from torch import nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=8, heads=2):
        super().__init__()
        assert d_model % heads == 0
        self.heads = heads
        self.d_head = d_model // heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch, tokens, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        k = k.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        v = v.view(batch, tokens, self.heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.triu(
            torch.ones(tokens, tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        joined = (weights @ v).transpose(1, 2).contiguous().view(batch, tokens, d_model)
        return self.out(joined)

class DecoderBlock(nn.Module):
    def __init__(self, d_model=8, heads=2, d_ff=24):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attention = CausalSelfAttention(d_model, heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

torch.manual_seed(5)
x = torch.randn(2, 4, 8, requires_grad=True)
block = DecoderBlock()
output = block(x)
output.square().mean().backward()

print("input -> output:", tuple(x.shape), "->", tuple(output.shape))
print("input gradient is finite:", bool(torch.isfinite(x.grad).all()))

Output
input -> output: (2, 4, 8) -> (2, 4, 8)
input gradient is finite: True

This block doesn't generate a word yet. It produces contextual vectors with the same [B, T, d_model] shape as its input, so another block can consume them.

Project the final position into a vocabulary

After the final block, many pre-normalized decoder stacks apply one final normalization before the output head. For example, GPT-2 applies its final ln_f immediately before computing logits.[6] A learned output matrix then converts each hidden vector into one raw score per vocabulary item. These scores are logits. During generation, use only the final position's scores because that position represents all text currently visible to the decoder. The small calculation below starts from an already normalized final-position vector.

next-token-head.py
import numpy as np

vocab = np.array(["hub", "today", "late", "refund", "carrier", "warehouse"])
last_hidden_normalized = np.array([0.7, -0.2, 0.5, 0.1])
output_weight = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.8, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0],
])

logits = last_hidden_normalized @ output_weight
probabilities = np.exp(logits - logits.max())
probabilities /= probabilities.sum()
best = probabilities.argmax()

print("logits shape:", logits.shape)
print("top token:", vocab[best])
print("top probability:", f"{probabilities[best]:.3f}")

Output
logits shape: (6,)
top token: hub
top probability: 0.360
Figure: Transformer shape trace showing a final-normalized three-by-four hidden-state matrix, selection of its final four-feature row, multiplication by a four-by-six output-weight matrix, and six vocabulary logits with hub highest.
The decoder stack and final normalization preserve the `[3, 4]` hidden-state shape. Selecting the final row gives four features, and multiplying by a `[4, 6]` output matrix produces one logit for each of six vocabulary candidates.

The result "hub" completes "Order shipped from hub". Conceptually, the model generates a longer support answer by appending that token, running the stack on the longer context, and choosing the next token. Optimized serving reuses earlier attention keys and values through a KV cache instead of recomputing them for every new token. The next chapter turns this forward pass into a trained language model.
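
The KV cache idea can be checked on a single attention head. The sketch below assumes the hidden rows of earlier tokens stay fixed as the context grows, which holds under causal masking, and uses hypothetical random weights. At each step it appends one key and one value to a cache and compares the result with recomputing attention for the newest token from scratch.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d_model, d_head = 4, 2
# Hypothetical random projections standing in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
x = rng.normal(size=(4, d_model))   # one hidden row per generated token

def recompute_last_row(context):
    """Attention output for the newest token, recomputing every key and value."""
    q = context[-1] @ w_q
    k, v = context @ w_k, context @ w_v
    return softmax(q @ k.T / np.sqrt(d_head)) @ v

k_cache = np.empty((0, d_head))
v_cache = np.empty((0, d_head))
for step in range(len(x)):
    new = x[step]
    # Only the newest token's key and value are computed; older ones are reused.
    k_cache = np.vstack([k_cache, new @ w_k])
    v_cache = np.vstack([v_cache, new @ w_v])
    cached = softmax((new @ w_q) @ k_cache.T / np.sqrt(d_head)) @ v_cache
    assert np.allclose(cached, recompute_last_row(x[: step + 1]))

print("cache rows after 4 steps:", k_cache.shape[0])
```

The newest query needs no mask here because every cached key belongs to the current or an earlier position.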

Encoder, decoder, and encoder-decoder visibility

Decoder-only generation is one member of the Transformer family. Architecture choice is largely a question of which positions may read which information:

| Variant | Visibility rule | Suitable lesson example | Representative source |
| --- | --- | --- | --- |
| Encoder-only | Each input position can read both left and right context | Classify a return request after reading the full ticket | BERT[7] |
| Decoder-only | Each position can read only itself and prior tokens | Generate an order-status response left to right | GPT[2] |
| Encoder-decoder | Encoder reads source; decoder reads source plus prior output tokens | Translate a merchant return policy | T5[8] |
Figure: Three visibility matrices compare encoder-only full bidirectional attention, decoder-only lower-triangular causal attention, and encoder-decoder attention with a full source block, causal target block, and target-to-source cross-attention.
Encoder-only attention fills the input-to-input matrix. Decoder-only attention keeps only the causal lower triangle. Encoder-decoder attention combines a full source block, a causal target block, and target queries that can read every source position.
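
The three visibility rules can be written as boolean matrices in which `True` means "this query may read this key". The function names below are hypothetical and only illustrate the patterns described above.

```python
import numpy as np

def encoder_visibility(tokens):
    """Encoder-only: every position reads every position."""
    return np.ones((tokens, tokens), dtype=bool)

def decoder_visibility(tokens):
    """Decoder-only: each position reads itself and earlier positions."""
    return np.tril(np.ones((tokens, tokens), dtype=bool))

def cross_visibility(target_tokens, source_tokens):
    """Encoder-decoder cross-attention: each target query reads every source position."""
    return np.ones((target_tokens, source_tokens), dtype=bool)

print("encoder-only:")
print(encoder_visibility(3).astype(int))
print("decoder-only:")
print(decoder_visibility(3).astype(int))
print("cross-attention (2 target queries, 3 source keys):")
print(cross_visibility(2, 3).astype(int))
```

An encoder-decoder model would pair `cross_visibility` with a causal `decoder_visibility` block on the target side and a full `encoder_visibility` block on the source side.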

In self-attention, queries, keys, and values come from one sequence. In encoder-decoder cross-attention, decoder queries read keys and values produced by the encoder. A plain decoder-only text generator has no separate encoder sequence, so its core path is causal self-attention.
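
A small shape check shows the difference between the two. In this sketch, queries come from two decoder positions while keys and values come from five encoder positions, so the weight matrix is rectangular rather than square. The sequence lengths and random weights are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
d_model, d_head = 4, 2
source = rng.normal(size=(5, d_model))   # encoder output: 5 source tokens
target = rng.normal(size=(2, d_model))   # decoder states: 2 target tokens so far

# Hypothetical random projections standing in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

q = target @ w_q                         # queries come from the decoder
k, v = source @ w_k, source @ w_v        # keys and values come from the encoder

cross_weights = softmax(q @ k.T / np.sqrt(d_head))   # [target, source]
context = cross_weights @ v                          # [target, d_head]

print("weights shape:", cross_weights.shape)
print("context shape:", context.shape)
```

No causal mask is applied to these weights because the source sequence is fully known before decoding starts.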

Diagnose a broken forward pass

The fastest way to debug a Transformer is to ask what invariant was violated:

| Symptom | Likely cause | Check |
| --- | --- | --- |
| Training loss looks suspiciously tiny, but generation fails | Future tokens leaked through attention | Assert masked future attention mass is zero |
| Head split crashes or silently produces the wrong shape | d_model isn't divisible by n_heads, or axes were transposed incorrectly | Check [B, H, T, d_head] then merge back to [B, T, d_model] |
| Attention works on CPU but fails after moving the block to a GPU | Causal mask stayed on CPU while scores moved to the accelerator | Create the mask on x.device |
| A block can't be stacked with another block | FFN or output projection changed model width too early | Keep block output at d_model; project to vocabulary only at the end |
| Explanation says decoder-only model reads an encoder | Self-attention and cross-attention were confused | Identify whether a separate source sequence exists |

Use a deliberately bad mask to make leakage visible:

detect-mask-leakage.py
import numpy as np

scores = np.array([
    [2.0, 5.0, 4.0],
    [1.0, 2.0, 6.0],
    [0.5, 1.0, 2.0],
])
future = np.triu(np.ones_like(scores, dtype=bool), k=1)

def softmax_rows(matrix):
    exp = np.exp(matrix - matrix.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

unmasked = softmax_rows(scores)
masked = softmax_rows(np.where(future, -np.inf, scores))
print("unmasked future mass:", f"{unmasked[future].sum():.3f}")
print("masked future mass:", f"{masked[future].sum():.3f}")

Output
unmasked future mass: 1.940
masked future mass: 0.000

If that second value isn't zero in a causal decoder, fix the mask before trusting any metric.
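
One way to make that check routine is a small assertion helper that can run inside a test. The name `assert_causal` and its tolerance are hypothetical choices; the sketch accepts either a `[T, T]` matrix or a stack such as `[H, T, T]`.

```python
import numpy as np

def assert_causal(weights, atol=1e-8):
    """Raise AssertionError if any attention weight lands above the diagonal."""
    tokens = weights.shape[-1]
    future = np.triu(np.ones((tokens, tokens), dtype=bool), k=1)
    leaked = float(np.abs(weights[..., future]).sum())
    if leaked > atol:
        raise AssertionError(f"future attention mass is {leaked:.6f}, expected 0")

# A correctly masked matrix: each row spreads weight over visible positions only.
good = np.tril(np.ones((3, 3))) / np.arange(1, 4)[:, None]
assert_causal(good)
print("masked weights pass the check")

# A deliberately broken matrix: uniform weights that include future positions.
bad = np.full((3, 3), 1 / 3)
try:
    assert_causal(bad)
except AssertionError as error:
    print("leak detected:", error)
```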

Mastery check

Key concepts

  • Token IDs become positioned vectors with shape [B, T, d_model].
  • Scaled dot-product attention mixes visible token values using query-key relevance.
  • A causal mask gives future positions zero attention weight.
  • Heads run in parallel inside a block; blocks run sequentially.
  • Attention communicates across positions; the FFN transforms each position.
  • Residual paths and normalization let same-shaped blocks stack.
  • In the pre-LN pipeline shown here, a final normalization prepares the residual stream before the output projection turns the last hidden state into next-token logits.

Evaluation rubric

  • Foundational: Traces "Order shipped from" from token IDs through hidden states to a next-token distribution.
  • Foundational: Explains why the causal mask must hide "hub" from the "from" position during training.
  • Intermediate: Derives the score and output shapes for one attention head and for a multi-head merge.
  • Intermediate: Implements causal attention with a future-mask assertion.
  • Intermediate: Explains attention, FFN, residual, and normalization roles without treating them as interchangeable.
  • Advanced: Distinguishes encoder-only, decoder-only, and encoder-decoder visibility rules and selects a variant for a stated task.

Common pitfalls

  • Treating query, key, or value as a fixed property of a word rather than a learned per-layer projection.
  • Applying softmax before causal masking, which allows future answers to influence the output.
  • Confusing parallel attention heads with sequential decoder blocks.
  • Projecting to vocabulary size inside every block instead of once after the stack.
  • Assuming every Transformer has the same normalization placement or cross-attention path.

Mastery Check


1. A toy prompt has token IDs [0, 1, 2] for "Order shipped from". The embedding for "from" is [0.00, 0.30, 0.80, 1.00], and its position vector is [0.10, 0.00, 0.00, 0.10]. What hidden-state shape and last positioned vector result before attention?
2. One attention head has q = [0.2, 1.2] for the "from" position and keys [[1.0, 0.0], [0.4, 1.0], [0.0, 0.8]]. What does q @ K.T produce, and what do the resulting attention weights mix?
3. In one attention head with d_head = 64, a query has raw scores [12.4, 9.1, 3.8] against three visible keys. What should scaled dot-product attention feed to softmax, and why?
4. During training on the sequence "Order shipped from hub", the representation at "from" is used to predict "hub". In causal self-attention, what must be true of the attention weights from the "from" query after softmax?
5. A decoder Transformer represents hidden states as [B, T, d_model]. Multi-head attention splits the query projection's feature dimension into H heads with d_head = d_model / H and arranges queries as [B, H, T, d_head]. For a batch of 2 prompts with 10 tokens each, d_model = 768, and 12 heads, which shapes are correct after embeddings and query-head splitting?
6. A pre-normalized decoder block receives x with shape [2, 4, 8] and uses d_ff = 24. Which layout keeps blocks stackable and delays next-token scoring?
7. After a decoder stack processes one prompt, its hidden stream has shape [1, 3, 4], and the vocabulary has 6 tokens. During left-to-right generation, which operation produces the next-token scores?
8. You need a Transformer to translate a merchant return policy from one language to another. In an encoder-decoder Transformer, which visibility pattern should be used?
9. A model has two attention heads in each of two decoder blocks. For one forward pass, which description correctly traces the computation and information flow?



You can now run the decoder forward path and obtain next-token logits. Next you will learn how shifted targets and cross-entropy train those logits into useful predictions.

References

[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[3] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep Residual Learning for Image Recognition. CVPR 2016.
[4] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization.
[5] Xiong, R., et al. (2020). On Layer Normalization in the Transformer Architecture. ICML 2020.
[6] OpenAI (2019). GPT-2 Source Implementation.
[7] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[8] Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.