Master the scaled dot-product attention formula from first principles. Deep dive into the variance proof, multi-head parallelization, O(n²) memory complexity, and the three core attention variants.
When you read the sentence "The cat sat on the mat because it was tired," you instantly know that "it" refers to "the cat," not the mat. Your brain focuses on the most relevant word to make sense of the sentence. This ability to selectively focus on the right information is exactly what modern AI models need to understand language. This is called attention.
Attention is the single most important mechanism in modern AI. Every transformer-based model, from encoder-only models like BERT to decoder-only language models and multimodal transformers, is built on it. In this article, you'll understand exactly how attention works, from the intuition all the way to the math and code.
💡 Key insight: Attention is an information routing mechanism. Given a sequence of words, it answers the question: "For each word, which other words should it pay attention to, and how much?" Once you grasp this idea, the math and code become natural extensions of the intuition.
Attention answers a simple question: "For each token in a sequence, which other tokens should it focus on?"
Think of reading a sentence: when you see "it" in "The cat sat on the mat because it was tired", your brain attends back to "cat" to resolve what "it" means. Self-attention does this computationally: every token computes a weighted combination of all other tokens based on relevance.[1]
To perform attention, the model transforms each word's representation into three different roles. Think of it like a library system:
The model creates these three vectors by multiplying each word's numerical representation (called an embedding, a list of numbers that captures a word's meaning) through learned weight matrices:
Each word's embedding is multiplied by three different weight matrices ($W_Q$, $W_K$, $W_V$) to produce three different "views" of the same word. One view highlights what this word is searching for (Query), another highlights what this word offers to others (Key), and the third carries the actual information (Value).
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Here $W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the learned projections (where $d_{\text{model}}$ is the model's overall hidden dimension size, and $d_k$ and $d_v$ are the dimensions of the queries/keys and values respectively).[1]
| Vector | Role | Analogy |
|---|---|---|
| Query (Q) | "What am I looking for?" | A search query |
| Key (K) | "What do I contain?" | A database index key |
| Value (V) | "What information do I carry?" | The actual data returned by the search |
💡 Key insight: Q and K determine where to look. V determines what information to extract. The separation of "routing" (Q, K) from "content" (V) is fundamental to why attention is so powerful.
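To make the three projections concrete, here's a minimal sketch with toy dimensions, where random matrices stand in for the learned weights. Each token's embedding row is multiplied by the same three matrices to produce its Query, Key, and Value:

```python
import torch

torch.manual_seed(0)
d_model, d_k, d_v, seq_len = 8, 4, 4, 3    # toy sizes, chosen for illustration

X = torch.randn(seq_len, d_model)           # one embedding row per token
W_Q = torch.randn(d_model, d_k)             # learned in practice; random here
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_v)

Q = X @ W_Q   # "what am I looking for?"
K = X @ W_K   # "what do I contain?"
V = X @ W_V   # "what information do I carry?"
print(Q.shape, K.shape, V.shape)  # each is (seq_len, d_k or d_v)
```

The key point: all three come from the same embedding matrix `X`; only the projection differs.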
The scaled dot-product attention formula computes a weighted combination of value vectors based on the similarity between queries and keys:[1]

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here's how the scaled dot-product attention formula translates into practical PyTorch code. The function takes the Query, Key, and Value tensors along with an optional mask (used for causal or padding purposes). It computes the attention scores, scales them by the square root of the dimension, normalizes them into probabilities, and returns both the final weighted output and the attention weights themselves.
```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    Q: torch.Tensor,  # (batch, n_heads, seq_len, d_k)
    K: torch.Tensor,  # (batch, n_heads, seq_len, d_k)
    V: torch.Tensor,  # (batch, n_heads, seq_len, d_v)
    mask: torch.Tensor | None = None,
    dropout_p: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Scaled dot-product attention (Vaswani et al., 2017)."""
    d_k = Q.size(-1)

    # Step 1: Compute raw attention scores
    attn_scores = torch.matmul(Q, K.transpose(-2, -1))  # (B, h, n, n)

    # Step 2: Scale to unit variance
    attn_scores = attn_scores / math.sqrt(d_k)

    # Step 3: Apply mask (causal or padding)
    if mask is not None:
        attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))

    # Step 4: Softmax normalization (each row sums to 1)
    attn_weights = F.softmax(attn_scores, dim=-1)

    # Optional: dropout on attention weights (regularization)
    if dropout_p > 0.0:
        attn_weights = F.dropout(attn_weights, p=dropout_p)

    # Step 5: Weighted aggregation of values
    output = torch.matmul(attn_weights, V)  # (B, h, n, d_v)

    return output, attn_weights
```
💡 Analogy: Imagine you're mixing music tracks. Without scaling, adding more instruments (higher $d_k$) makes the mix louder and louder until everything distorts (softmax saturates). Dividing by $\sqrt{d_k}$ is like an automatic volume leveler: no matter how many instruments are playing, the overall volume stays in the sweet spot where you can distinguish each one clearly.
Here's the formal derivation for why the scaling factor $\frac{1}{\sqrt{d_k}}$ is strictly necessary. For the sake of the proof, we assume the entries of $q$ and $k$ are independent with mean 0 and variance 1. This holds approximately true at initialization before the weight matrices shift the distributions.[1]
Let $q = (q_1, \ldots, q_{d_k})$ and $k = (k_1, \ldots, k_{d_k})$ be a query and key vector whose entries are independent components. The dot product is:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$
Each term $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$ (since $\mathbb{E}[q_i] = \mathbb{E}[k_i] = 0$ and they're independent). The variance is:

$$\text{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1 \cdot 1 = 1$$
By the sum of independent variances:

$$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k$$
So the standard deviation of the raw dot product is $\sqrt{d_k}$. As $d_k$ grows, dot products get larger and push softmax into saturation, where the output becomes nearly one-hot and gradients vanish:
| $d_k$ | $\sqrt{d_k}$ (score std.) | Softmax behavior |
|---|---|---|
| 16 | 4.0 | Moderate peakiness |
| 64 | 8.0 | Getting peaked |
| 512 | 22.6 | Extremely peaked ⚠️ |
| 4096 | 64.0 | Nearly one-hot 💀 |
Dividing by $\sqrt{d_k}$ restores unit variance, $\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$, regardless of $d_k$, maintaining well-behaved softmax gradients throughout training.
💡 Key insight: Many engineers know to scale, but few can derive why $\sqrt{d_k}$ specifically. The variance argument is the mathematical justification for this architectural choice.
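The variance argument is easy to check empirically. This short sketch (toy sample count, standard-normal entries as the proof assumes) estimates the variance of raw and scaled dot products:

```python
import torch

torch.manual_seed(0)
d_k, n_samples = 512, 20_000

q = torch.randn(n_samples, d_k)   # entries with mean 0, variance 1
k = torch.randn(n_samples, d_k)

raw = (q * k).sum(dim=-1)         # raw dot products, one per sample
scaled = raw / d_k ** 0.5         # after dividing by sqrt(d_k)

print(raw.var().item())     # close to d_k = 512
print(scaled.var().item())  # close to 1
```

The raw dot products have variance near $d_k$, exactly as the derivation predicts, while the scaled ones sit near unit variance.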
The transformer architecture uses attention in three distinct patterns. Understanding when and why each is used is critical:
💡 Analogy: Bidirectional attention is like a conference call: everyone can hear everyone else at the same time. Causal attention is like a chain of walkie-talkies where each person can only hear messages from people who spoke before them. Cross-attention is like a translator listening to one conversation (encoder) and relaying information into another (decoder).
Every token attends to every other token with no masking. Used in encoder architectures (BERT, RoBERTa). For example, given a complete sentence, every word's representation is updated by looking at every other word, forming a full bidirectional context:
```text
"The cat sat on the mat"
  Token "sat" attends to: [The, cat, sat, on, the, mat]  ← full context
```
**Best for:** understanding/classification tasks where you have the full input.
Each token can only attend to itself and previous tokens. Future positions are masked with $-\infty$. Used in decoder architectures such as GPT-style and other autoregressive language models. For instance, when processing a sequence step-by-step, the model progressively builds context but remains strictly blind to upcoming words:
```text
"The cat sat on the mat"
  Token "sat" attends to: [The, cat, sat]                ← only past + self
  Token "mat" attends to: [The, cat, sat, on, the, mat]  ← full history
```
The causal mask is a lower-triangular matrix that ensures each position only looks at itself and the positions before it. Here's a simple implementation that takes the sequence length as input and outputs a binary matrix where 1 indicates an allowed connection and 0 indicates a masked one.
```python
def create_causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i can attend to positions [0, i]."""
    return torch.tril(torch.ones(seq_len, seq_len))

# Result for seq_len=4:
# [[1, 0, 0, 0],  ← token 0 sees only itself
#  [1, 1, 0, 0],  ← token 1 sees tokens 0, 1
#  [1, 1, 1, 0],  ← token 2 sees tokens 0, 1, 2
#  [1, 1, 1, 1]]  ← token 3 sees all tokens
```
**Best for:** autoregressive generation (generating text one step at a time), where the model must predict the next token without seeing the future.
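To see the causal mask in action, this small sketch sets all raw scores equal (an arbitrary choice for illustration) and applies the lower-triangular mask before softmax; each row's probability mass lands only on past positions and the token itself:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)            # pretend all raw scores tie
mask = torch.tril(torch.ones(seq_len, seq_len))   # lower-triangular causal mask

scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)
# Row i spreads probability uniformly over positions 0..i and puts 0 on the future:
# row 0 → [1, 0, 0, 0], row 1 → [0.5, 0.5, 0, 0], ...
```

Because masked positions hold $-\infty$, their exponentials are exactly 0 after softmax, so no probability ever leaks to future tokens.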
Queries come from one sequence, Keys and Values from another. Used in encoder-decoder architectures (T5, original Transformer, diffusion models). For example, in a translation task, the encoder processes the source sentence and provides the Keys and Values, while the decoder uses the current target sentence state as the Query. The resulting attention weights dictate how much of each source word to incorporate into the next predicted target word:
```text
Encoder output (source): "Le chat est assis"  → provides K, V
Decoder state (target):  "The cat is ___"     → provides Q

Q from decoder × K from encoder → attention weights
Weights × V from encoder → decoder incorporates source information
```
**Best for:** translation, summarization, text-to-image (DALL-E cross-attends text embeddings).
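A minimal shape-level sketch of cross-attention (toy dimensions, random tensors standing in for real encoder outputs and decoder states) makes the Q-from-target, K/V-from-source split concrete:

```python
import torch

torch.manual_seed(0)
d_model, n_src, n_tgt = 16, 4, 3          # toy sizes, chosen for illustration

enc_out = torch.randn(n_src, d_model)     # encoder output: source sentence
dec_state = torch.randn(n_tgt, d_model)   # decoder states: target so far

Q = dec_state      # queries come from the decoder (target)...
K = V = enc_out    # ...keys and values from the encoder (source)

weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)  # (n_tgt, n_src)
out = weights @ V                                          # (n_tgt, d_model)
print(weights.shape, out.shape)
```

Note the attention matrix is rectangular, $n_{tgt} \times n_{src}$: each target position distributes its attention over the source tokens.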
| Attention Type | Q Source | K, V Source | Mask | Architecture |
|---|---|---|---|---|
| Bidirectional Self | Same sequence | Same sequence | None | BERT, Vision Transformer (ViT) |
| Causal Self | Same sequence | Same sequence | Lower-triangular | GPT, Qwen3.5 |
| Cross | Target sequence | Source sequence | None (typically) | T5, DALL-E |
💡 Analogy: Imagine you're reading a legal contract and you consult a committee: a grammar expert checks sentence structure, a domain expert understands the legal terms, and a context expert tracks which clauses reference each other. Multi-head attention works the same way. Each head becomes a specialist that focuses on different types of relationships (syntax, semantics, position), and their findings are combined for a complete understanding.
Instead of one large attention operation, we run $h$ parallel heads, each with dimension $d_k = d_{\text{model}}/h$:[1]

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$
Instead of running one big attention operation, we split the input into smaller "heads" that each attend independently, like having specialists each looking at a different aspect of the same text. After each head produces its output, we stitch them back together and multiply by a final weight matrix to blend the specialists' findings into a single result.
Each head computes:

$$\text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$
Research has revealed that individual heads specialize for different functions (Voita et al., 2019[2]; Olsson et al., 2022[3]):
| Head Type | What It Does | Found In |
|---|---|---|
| Positional heads | Attend to adjacent or nearby tokens | Early layers |
| Syntactic heads | Track subject-verb agreement, dependency arcs | Middle layers |
| Induction heads | Copy patterns from earlier context (key for in-context learning) | Middle layers |
| Rare word heads | Upweight infrequent tokens | Various layers |
| Semantic heads | Capture coreference, entity relationships | Later layers |
🔬 Research insight: Studies show that a large proportion of attention heads can be pruned without noticeable performance degradation, and in some layers, all but a single head can be removed (Michel et al., 2019)[4]. Models maintain significant redundancy as a form of robustness. Not all heads are equally important, but the system is robust to losing any individual head.
Multi-head attention doesn't add computation. It restructures it. Single-head attention with $d_{\text{model}} = 512$ uses the same FLOPs (Floating Point Operations) as 8-head attention with $d_k = 64$ each, because $8 \times 64 = 512$.
Here's a production-ready PyTorch implementation. Note that we use single large projection matrices (W_q, W_k, W_v) rather than per-head matrices. The reshape-and-transpose (view + transpose) splits the projection into heads internally, which is far more memory-efficient than storing separate weight matrices.
```python
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads

        # Single large projections (more efficient than per-head matrices)
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        B, N, D = x.shape

        # Project and reshape: (B, N, D) → (B, h, N, d_k)
        Q = self.W_q(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention per head
        out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and project back: (B, h, N, d_k) → (B, N, D)
        out = out.transpose(1, 2).contiguous().view(B, N, D)
        return self.W_o(out)
```
| Metric | Complexity | Explanation |
|---|---|---|
| Time | $O(n^2 \cdot d)$ | $QK^\top$ produces $n \times n$ scores, each a $d_k$-dimensional dot product |
| Memory | $O(n^2)$ | Must store the full $n \times n$ attention weight matrix |
| Parameters | $O(d_{\text{model}}^2)$ | Weight matrices $W_Q$, $W_K$, $W_V$, $W_O$, each $d_{\text{model}} \times d_{\text{model}}$ |
Batch of 8, 32 heads, $n = 8192$, FP16 (16-bit floating point format):
8 sequences in the batch × 32 attention heads × 8192² entries per attention map × 2 bytes per number (FP16) = 32 GB just for the attention scores. A single A100 GPU has 80 GB of HBM, so the attention matrix alone would consume 40% of it for one batch.
This is why the memory is the primary bottleneck for long sequences, not the compute. As large language models (LLMs) are tasked with reading entire books or codebases, this quadratic memory scaling becomes a hard wall. If you want to double the context window of a model, you need four times the memory just for the attention mechanism. This physical hardware constraint is the primary reason why expanding context windows from 4K tokens to 128K+ tokens required completely new engineering approaches rather than just adding more GPUs.
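The arithmetic generalizes into a one-line calculator; `attn_scores_gib` is a hypothetical helper written for this article, not a library function:

```python
def attn_scores_gib(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> float:
    """GiB needed to materialize the (batch, heads, seq_len, seq_len) score tensor."""
    return batch * heads * seq_len**2 * bytes_per_el / 2**30

print(attn_scores_gib(8, 32, 8192))   # 32.0 GiB — the example above
print(attn_scores_gib(8, 32, 16384))  # 128.0 GiB — doubling context quadruples memory
```

The second call shows the quadratic wall directly: doubling `seq_len` multiplies the attention-score memory by four.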
FlashAttention (Dao et al., 2022)[5] is now the standard in virtually every production LLM. It reduces memory from $O(n^2)$ to $O(n)$ without any approximation: the output is mathematically identical.
Instead of materializing the full attention matrix in GPU HBM (High Bandwidth Memory, the main GPU memory; slow, ~2TB/s), FlashAttention computes attention in tiles that fit in SRAM (Static Random Access Memory, a small, ultra-fast on-chip cache; fast, ~19TB/s on A100), using online softmax to avoid storing the full matrix:[6]
| Property | Standard Attention | FlashAttention |
|---|---|---|
| Memory | $O(n^2)$ | $O(n)$ ✅ |
| HBM I/O passes | ~35 passes over matrix | 3–4 passes total |
| Exact | Yes | Yes ✅ |
| Wall-clock speed | 1× | 2-4× faster |
FlashAttention is IO-aware. It minimizes reads/writes between slow GPU HBM and fast SRAM by restructuring the computation order, using the tiling technique from online softmax.
Using FlashAttention in practice doesn't require writing custom CUDA kernels. PyTorch provides a unified interface that automatically routes to the most efficient implementation (like FlashAttention) based on your hardware. It takes the Query, Key, and Value tensors and returns the output directly, bypassing the need to store the full attention matrix.
```python
# In production, simply use PyTorch's built-in SDPA:
from torch.nn.functional import scaled_dot_product_attention as sdpa

# This automatically dispatches to FlashAttention when available
output = sdpa(Q, K, V, is_causal=True)  # For causal attention
# No manual implementation needed; PyTorch handles the kernel selection
```
A subtle but critical implementation detail in attention mechanisms is numerical stability. The naive mathematical formulation of softmax overflows for large inputs because the exponential function grows exceptionally fast. When dealing with dot products from large vector spaces, even scaled ones can occasionally produce large positive values, causing the exponentiation to result in Inf (infinity) floating-point values and completely breaking the model's training process.
To prevent this overflow, deep learning frameworks implement a numerically stable version of softmax. They subtract the maximum value from the input vector before exponentiating:

$$\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
Subtracting $\max(x)$ shifts all logits by the same constant. Because of the properties of exponentials, this constant factors out and cancels between the numerator and denominator. The resulting probabilities don't change, but the highest exponent is exactly $0$ (since $\max(x) - \max(x) = 0$), so the largest exponential evaluated is $e^0 = 1$. This keeps all numbers numerically safe and avoids overflow.[6]
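A quick sketch contrasts the naive and max-shifted formulations (the function names here are illustrative, not a framework API):

```python
import torch

def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    e = torch.exp(x)                 # overflows to inf for large inputs
    return e / e.sum()

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    e = torch.exp(x - x.max())       # largest exponent is exactly 0
    return e / e.sum()

x = torch.tensor([1000.0, 1001.0, 1002.0])
print(naive_softmax(x))   # tensor([nan, nan, nan]) — exp overflows, inf / inf
print(stable_softmax(x))  # tensor([0.0900, 0.2447, 0.6652])
```

The naive version produces `inf / inf = nan`; the shifted version returns the same mathematical result without ever exceeding $e^0 = 1$ in the numerator's largest term.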
Building on this, online softmax (Milakov & Gimelshein, 2018)[6] extends the max-shift technique to compute softmax in a single streaming pass. Instead of needing to read the entire vector into memory to find the maximum, online softmax tracks the running maximum and the running sum of exponentials simultaneously, dynamically correcting the sum as a new, larger maximum is found. This streaming capability is exactly the mathematical foundation that allows FlashAttention's memory-efficient block tiling.
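The running-max correction can be sketched in a few lines; `online_softmax_denominator` is an illustrative name, and the rescaling step `s * exp(m - m_new)` is the key trick that fixes up the old sum whenever a larger maximum appears:

```python
import math

def online_softmax_denominator(xs):
    """One streaming pass: track running max m and running exp-sum s."""
    m, s = float('-inf'), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return m, s

xs = [3.0, 1.0, 5.0, 2.0]
m, s = online_softmax_denominator(xs)

# Reference: the classic two-pass computation (find max first, then sum)
two_pass = sum(math.exp(x - max(xs)) for x in xs)
print(s, two_pass)  # identical denominators, one pass vs two
```

Because the stream never needs to be revisited, this is exactly the property FlashAttention exploits to process attention block by block without materializing the full score matrix.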
Different AI tasks require different ways of routing information, which is why various architectures implement attention differently. By looking at how a model structures its attention mask and where it sources its queries, keys, and values, we can immediately understand its primary use case.
Understanding which attention pattern each architecture uses helps you work effectively with modern foundation models:
| Architecture | Self-Attention | Cross-Attention | Pre-training |
|---|---|---|---|
| BERT | Bidirectional | ✗ | Masked Language Model (LM) |
| Decoder-only LM | Causal | ✗ | Next-token prediction |
| T5 | Bidirectional (enc) + Causal (dec) | ✓ | Span corruption |
| DALL-E / Stable Diffusion | Self (in U-Net) | ✓ (text → image) | Diffusion |
| Whisper | Bidirectional (enc) + Causal (dec) | ✓ | Audio → text |
| Vision Transformer (ViT) | Bidirectional | ✗ | Image patches |
Encoder-only models like BERT and ViT use bidirectional self-attention because their goal is to build a comprehensive understanding of the entire input at once. They process the full text or image simultaneously to generate rich embeddings. In contrast, decoder-only language models use causal self-attention because they must generate output step-by-step. If they could look ahead at future tokens during training, they would simply "cheat" instead of learning to predict.
Encoder-decoder models (like T5 and Whisper) mix these approaches. They use a bidirectional encoder to fully understand the input (text or audio), and a causal decoder to generate the output. Cross-attention acts as the bridge between the two, allowing the decoder to continuously refer back to the encoder's rich representation of the source material.
💡 Key insight: The industry has largely converged on decoder-only (causal self-attention) for large language models. While encoder-decoder architectures remain important for specific sequence-to-sequence tasks (like translation or speech recognition), GPT-style models dominate general-purpose AI due to the simplicity and unified generation framework of pure autoregressive modeling.
Bahdanau et al. (2015)[9] used additive attention: $\text{score}(q, k) = v^\top \tanh(W_1 q + W_2 k)$. Dot-product attention is faster because it takes advantage of optimized matrix multiplication hardware (tensor cores), but additive attention can perform better when $d_k$ is small. In practice, scaling the dot product gives equivalent quality with much better throughput.
If all scores are equal, softmax produces a uniform distribution ($1/n$ weight per token), and the output is the average of all value vectors. If one score dominates, softmax approximates an argmax, so the output is approximately one value vector. The temperature (scaling) controls where on this spectrum you land.
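This spectrum is easy to visualize by sweeping a temperature over a fixed set of scores (the scores here are arbitrary, chosen for illustration):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temp in [10.0, 1.0, 0.1]:
    probs = F.softmax(scores / temp, dim=-1)
    print(temp, probs)
# High temperature → near-uniform (averaging over values);
# low temperature  → near one-hot (argmax-like selection).
```

Dividing by $\sqrt{d_k}$ plays the same role as a fixed temperature: it keeps the model in the middle of this spectrum where gradients flow well.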
[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Voita, E., et al. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL 2019.
[3] Olsson, C., et al. (2022). In-context Learning and Induction Heads.
[4] Michel, P., Levy, O., & Neubig, G. (2019). Are Sixteen Heads Really Better than One? NeurIPS 2019.
[5] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
[6] Milakov, M. & Gimelshein, N. (2018). Online normalizer calculation for softmax.
[7] Jiang, A. Q., et al. (2023). Mistral 7B.
[8] Katharopoulos, A., et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
[9] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.