Learn NumPy shape reasoning from first principles: name axes, predict indexing and broadcasting, reduce safely, distinguish reshape from transpose, and add shape guards.
Two customers send support tickets: "refund request order" and "shipped from hub." A routing model can't work directly with those words, so it receives small lists of numbers that represent each word. If a data-loading step swaps the customer and word positions, later math can fail or, worse, run while attaching predictions to the wrong text.
This is why array shape is your first AI engineering habit. An array is an ordered collection of numbers; a multi-axis array is often called a tensor in machine learning. NumPy calls its array object an ndarray: it holds numerical data in a buffer plus metadata such as shape and strides that tell NumPy how an index reaches that data. Views can reuse the same buffer with different metadata, so fast operations aren't permission to ignore axis meaning.[1]
That habit matters early in AI work. Many beginner bugs are not mysterious math bugs. They are shape bugs.
This chapter uses those two support tickets as one tiny batch. By the end, you should be able to name the axes in an array, predict how indexing or broadcasting changes its shape, reduce the right axis on purpose, choose reshape or transpose for the job, and check an attention-score tensor without guessing.
Use one tiny batch for the whole chapter.
Imagine two short support-ticket phrases:
| Batch item | Tokens (support ticket words) |
|---|---|
| 0 | ["refund", "request", "order"] |
| 1 | ["shipped", "from", "hub"] |
A token is a piece of text a model processes; here we use one word as one token. A feature vector is the list of numbers representing one token. Later chapters call these per-position vectors hidden states. Suppose each token receives four features. The resulting tensor has shape:
```text
(B, T, D) = (2, 3, 4)
```

Read that slowly:
| Axis | Meaning | Size |
|---|---|---|
| B | batch, how many tickets | 2 |
| T | token positions per ticket | 3 |
| D | numbers per token | 4 |
That tuple is a contract, not decoration. Say it as a sentence: two tickets, three token positions per ticket, four numbers per token. If you can say the sentence, you can usually debug the code.
Don't treat a tensor as a mysterious cube. Our tensor (2, 3, 4) is two ticket trays; each tray contains three token slots; each token slot contains four numerical features. To find one value, you need three addresses. Shape tells you how many choices exist at each address.
| Term | Shape | What it holds |
|---|---|---|
| scalar | () | one number, no axes |
| vector | (4,) | one row of 4 numbers |
| matrix | (3, 4) | 3 rows, each with 4 numbers |
| tensor | (2, 3, 4) | any number of axes, generalizing the same idea |
Keep the words concrete. If D means feature count, each vector on that axis should contain D numbers.
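The four terms in the table can be checked directly. This is a minimal sketch; the variable names and the zero-filled placeholder values are illustrative assumptions, not data from the chapter.

```python
import numpy as np

# Placeholder arrays, one per row of the terminology table.
scalar = np.array(7.0)        # shape (): one number, no axes
vector = np.zeros(4)          # shape (4,): one row of 4 numbers
matrix = np.zeros((3, 4))     # shape (3, 4): 3 rows of 4 numbers
tensor = np.zeros((2, 3, 4))  # shape (2, 3, 4): 2 tickets, 3 tokens, 4 features

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of axes; shape lists the size of each axis.
    print(f"{name:6s} ndim={arr.ndim} shape={arr.shape}")
```

The number of axes (`ndim`) is always the length of the shape tuple, which is why a scalar reports an empty tuple.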
Start with real numbers so the array stops feeling abstract:
```text
x =
[
  [[ 0,  1,  2,  3],
   [ 4,  5,  6,  7],
   [ 8,  9, 10, 11]],

  [[12, 13, 14, 15],
   [16, 17, 18, 19],
   [20, 21, 22, 23]]
]
```

The shape is still (2, 3, 4), but now you can point at actual token vectors.
| Expression | Plain-English meaning | Result shape |
|---|---|---|
| x[0] | ticket 0, all tokens, all features | (3, 4) |
| x[0, 1] | token 1 in ticket 0 | (4,) |
| x[:, 1] | token 1 from each ticket | (2, 4) |
| x[:, :, 0] | feature 0 from each token in each ticket | (2, 3) |
The pattern is simple:

- An integer index such as 0 or 1 chooses one item and removes that axis.
- A bare : keeps the whole axis.
- Confirm the result with .shape.

Run this as a tiny script. Notice that it prints values as well as shapes.
```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

print("x shape", x.shape)
print("x[0] shape", x[0].shape)
print("x[0,1]", x[0, 1])
print("x[:,1]")
print(x[:, 1])
```

```text
x shape (2, 3, 4)
x[0] shape (3, 4)
x[0,1] [4 5 6 7]
x[:,1]
[[ 4  5  6  7]
 [16 17 18 19]]
```

The useful beginner habit isn't "look at .shape after the fact." It's "predict the shape, then let .shape confirm or correct you."
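One way to practice that habit is to write each prediction down as an assertion. The sketch below assumes a small helper named `expect_shape`; it is hypothetical and not part of NumPy.

```python
import numpy as np

def expect_shape(arr, predicted, label):
    """Hypothetical helper: fail loudly when a shape prediction is wrong."""
    assert arr.shape == predicted, f"{label}: predicted {predicted}, got {arr.shape}"
    return arr

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

expect_shape(x[0], (T, D), "x[0]")              # integer index drops the batch axis
expect_shape(x[:, 1], (B, D), "x[:, 1]")        # integer index drops the token axis
expect_shape(x[:, :, 0], (B, T), "x[:, :, 0]")  # integer index drops the feature axis
expect_shape(x[:, 1:2], (B, 1, D), "x[:, 1:2]") # a length-1 slice keeps the axis
print("all predictions confirmed")
```

The last line shows the one case worth memorizing: a slice such as `1:2` selects a single token but keeps the token axis with size 1, while the integer `1` removes it.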
Now add one feature bias to each token vector:
```text
bias = [100, 200, 300, 400]
```

This bias has shape (4,), so it matches the last axis of x.

```text
x[0, 1] = [  4,   5,   6,   7]
bias    = [100, 200, 300, 400]
result  = [104, 205, 306, 407]
```

NumPy applies that addition to each token in each ticket without a Python loop. You can imagine the bias vector repeating across the batch and token axes, but the repetition is conceptual: broadcasting doesn't need to copy that smaller operand in memory.[2]
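The "no copy" claim can be made visible with `np.broadcast_to`, which returns a read-only view of the smaller operand stretched to the larger shape. This is a sketch for illustration; the explicit `int64` dtype is an assumption chosen so the byte strides are predictable.

```python
import numpy as np

x = np.arange(24, dtype=np.int64).reshape(2, 3, 4)
bias = np.array([100, 200, 300, 400], dtype=np.int64)

# A read-only view of bias that looks like shape (2, 3, 4).
expanded = np.broadcast_to(bias, x.shape)

print("expanded shape", expanded.shape)
print("expanded strides", expanded.strides)  # zero stride on batch and token axes
print("shares memory with bias?", np.shares_memory(expanded, bias))
print("same result as x + bias?", np.array_equal(x + expanded, x + bias))
```

A stride of 0 on an axis means that stepping along that axis does not move through memory at all, so every ticket and every token position reads the same four bias values.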
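The matching rule is easier than it sounds:

1. Line up the two shapes from the right.
2. Two sizes match when they are equal or when one of them is 1.
3. Axes missing on the left of the shorter shape count as size 1.

For x.shape == (2, 3, 4):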
| Smaller shape | What NumPy does | Meaning |
|---|---|---|
| (4,) | broadcasts across batch and token axes | one bias per feature |
| (3, 1) | broadcasts across batch and feature axes | one scalar per token position |
| (3,) | raises an error | no feature axis matches that 3 |
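That last row is important. A shape can be wrong in two ways: it can be incompatible, so NumPy raises an error, or it can be compatible but lined up with the wrong axis, so the code runs and quietly computes the wrong thing.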
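The rows of the table can be checked without building any arrays. This sketch assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available; the extra `(1, 3, 1)` case is added here only for comparison.

```python
import numpy as np

x_shape = (2, 3, 4)

# Ask NumPy what shape a broadcast would produce, or whether it is rejected.
for small in [(4,), (3, 1), (1, 3, 1), (3,)]:
    try:
        result = np.broadcast_shapes(x_shape, small)
        print(small, "->", result)
    except ValueError:
        print(small, "-> not broadcastable with", x_shape)
```

A passing check only proves the sizes are compatible. Whether the smaller array was meant for that axis is still a question about meaning that NumPy cannot answer.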
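Here is the same rule as a shape alignment: NumPy right-aligns (2, 3, 4) with (4,), treats the two missing leading axes as size 1, and stretches them to give (2, 3, 4). The script below confirms it: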
```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)
bias = np.array([100, 200, 300, 400])
shifted = x + bias

print("shifted shape", shifted.shape)
print("shifted[0,1]", shifted[0, 1])
```

```text
shifted shape (2, 3, 4)
shifted[0,1] [104 205 306 407]
```

If you accidentally try np.array([10, 20, 30]), NumPy raises:

```text
ValueError: operands could not be broadcast together with shapes (2,3,4) (3,)
```

That message is useful. It tells you the feature axis was not where you thought it was.
You can test that failure deliberately instead of waiting for it to surprise you:
```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
token_bias = np.array([10, 20, 30])

try:
    x + token_bias
except ValueError as exc:
    print(str(exc))
else:
    raise RuntimeError("Expected NumPy to reject shape (3,) against (2, 3, 4)")
```

```text
operands could not be broadcast together with shapes (2,3,4) (3,)
```

The (N,) vs (N, 1) trap

A 1-D array (5,) is not the same as a column vector (5, 1). This is a common cause of silent broadcasting errors in neural networks.
```python
import numpy as np

v = np.array([1, 2, 3])
print("v shape", v.shape)
print("v[:, None] shape", v[:, None].shape)
print("v[None, :] shape", v[None, :].shape)
```

```text
v shape (3,)
v[:, None] shape (3, 1)
v[None, :] shape (1, 3)
```

Under broadcasting, a plain (3,) always lines up with the last axis, so it acts like the row (1, 3). Pair it with a (3, 1) column and NumPy quietly returns a (3, 3) grid instead of raising an error. That flexibility is convenient but dangerous. When you need an explicit row or column vector, make the axis visible with v[None, :] or v[:, None].
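A short sketch of how the trap shows up in practice. The `predictions` and `targets` values are made up for illustration; the only point is the shapes.

```python
import numpy as np

# Hypothetical values: a model that emits one column of scores, and flat labels.
predictions = np.array([[0.9], [0.2], [0.4]])  # shape (3, 1)
targets = np.array([1.0, 0.0, 0.0])            # shape (3,)

# Runs without error, but compares every prediction with every target.
wrong = predictions - targets
# Drop the size-1 axis first so the subtraction is element by element.
right = predictions[:, 0] - targets

print("wrong shape", wrong.shape)
print("right shape", right.shape)
```

Nothing fails in the first subtraction, which is what makes it dangerous: an error averaged over the (3, 3) grid is still a single number, so the bug can survive all the way into a loss value.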
A reduction answers the question, "which axis am I summarizing away?"
Suppose you want one vector per ticket, not one vector per token. Then average across the token axis:
```python
x.mean(axis=1)
```

For easier arithmetic, keep the same (B, T, D) contract but switch to feature values whose average you can check by hand:

```text
token 0 = [ 0,  3,  6,  9]
token 1 = [ 3,  6,  9, 12]
token 2 = [12,  9,  6,  3]
mean    = [ 5,  6,  7,  8]
```

So x.mean(axis=1) should keep batch and feature axes, giving shape (B, D) = (2, 4).

Now compare that with:

```python
x.mean(axis=-1)
```

That reduces the feature axis instead. It keeps batch and token axes, so the shape becomes (B, T) = (2, 3).
```python
import numpy as np

B, T, D = 2, 3, 4
x = np.array([
    [[0, 3, 6, 9], [3, 6, 9, 12], [12, 9, 6, 3]],
    [[1, 4, 7, 10], [4, 7, 10, 13], [13, 10, 7, 4]],
])
pooled_tokens = x.mean(axis=1)
feature_means = x.mean(axis=-1)

print("pooled_tokens shape", pooled_tokens.shape)
print("pooled_tokens[0]", pooled_tokens[0])
print("feature_means shape", feature_means.shape)
print("feature_means[0]", feature_means[0])
```

```text
pooled_tokens shape (2, 4)
pooled_tokens[0] [5. 6. 7. 8.]
feature_means shape (2, 3)
feature_means[0] [4.5 7.5 7.5]
```

The easiest way to avoid reduction mistakes is to say the axis name out loud before you write the code:
"I want one vector per ticket, so I am reducing the token axis."
keepdims=True is a safety switch

By default, a reduction removes the axis entirely. That can break later broadcasting because the shape has changed.

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

# Without keepdims: shape becomes (2, 4), which no longer broadcasts with (2, 3, 4)
mean_flat = x.mean(axis=1)
print("mean_flat shape", mean_flat.shape)

# With keepdims: shape stays (2, 1, 4), which still broadcasts with (2, 3, 4)
mean_kept = x.mean(axis=1, keepdims=True)
print("mean_kept shape", mean_kept.shape)
```

```text
mean_flat shape (2, 4)
mean_kept shape (2, 1, 4)
```

keepdims=True keeps the collapsed axis as a placeholder of size 1. That preserves the original axis positions so operations against the original tensor can still broadcast correctly. Use it when a reduction result must be added, subtracted, or divided back into its source-shaped tensor.
This is easier to trust when you compare both shape paths side by side:
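A minimal sketch of both paths, assuming the goal is to center each ticket's token vectors by subtracting that ticket's mean vector:

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)

# Path 1: keepdims=True leaves a size-1 token axis, so (2, 1, 4) lines up with (2, 3, 4).
centered = x - x.mean(axis=1, keepdims=True)
print("centered shape", centered.shape)
print("mean after centering", centered.mean(axis=1))

# Path 2: without keepdims the mean is (2, 4), which does not line up with (2, 3, 4).
try:
    x - x.mean(axis=1)
except ValueError as exc:
    print("without keepdims:", exc)
```

Path 2 fails loudly here only because B and T differ. If a batch happened to have B equal to T, the same line would broadcast without error and subtract each ticket's mean from the wrong token position, which is one more reason to keep the axis explicit.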
Beginners often mix up reshape and transpose because both can produce a new shape.
Use a smaller array so you can see the numbers move:
```python
import numpy as np

small = np.arange(12).reshape(2, 3, 2)
reshaped = small.reshape(2, 2, 3)
transposed = np.transpose(small, (0, 2, 1))

print("small[0]")
print(small[0])
print("reshaped[0]")
print(reshaped[0])
print("transposed[0]")
print(transposed[0])
```

```text
small[0]
[[0 1]
 [2 3]
 [4 5]]
reshaped[0]
[[0 1 2]
 [3 4 5]]
transposed[0]
[[0 2 4]
 [1 3 5]]
```

Both results have shape (2, 2, 3), but they answer different questions:
| Operation | What it does | When to use it |
|---|---|---|
| reshape | keeps the same data stream and groups it into a new shape | when the flat order is correct and you want new grouping |
| transpose | permutes the axes | when the same values are attached to the wrong axis order |
If your real goal is to turn (B, T, D) into (B, D, T), that is an axis-order problem, so transpose is the right tool.
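A small check of that claim on the running (2, 3, 4) tensor. The names `x_bdt` and `x_regrouped` are illustrative.

```python
import numpy as np

B, T, D = 2, 3, 4
x = np.arange(B * T * D).reshape(B, T, D)

# Permute axes: the value for (ticket b, token t, feature d) moves to [b, d, t].
x_bdt = np.transpose(x, (0, 2, 1))
# Regroup the flat stream into the same target shape instead.
x_regrouped = x.reshape(B, D, T)

print("x_bdt shape", x_bdt.shape)
print("transpose keeps meaning?", x_bdt[1, 3, 2] == x[1, 2, 3])
print("reshape gives the same array?", np.array_equal(x_bdt, x_regrouped))
```

Both arrays report shape (2, 4, 3), so only a value-level check like the last line can tell them apart.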
One more beginner trap lives here:
```python
import numpy as np

v = np.array([1, 2, 3])
print(v.shape)
print(v.T.shape)
```

```text
(3,)
(3,)
```

A 1-D vector has one axis, so transposing it does nothing. If you need a row or column vector, make that axis explicit with v[None, :] or v[:, None].
Basic slicing produces a view: an array that reaches the same data buffer through different metadata. transpose also returns a view whenever possible. reshape returns a view when the new layout can be expressed with strides, but it can make a copy for a non-contiguous input. Fancy indexing (an integer array or a boolean mask) returns a copy. Those differences matter when you mutate arrays or reason about memory use.[1]
```python
import numpy as np

x = np.arange(6).reshape(2, 3)

token = x[:, 1]  # basic slice -> view shares memory with x
token[0] = 99
print("after editing a slice, x[0, 1] =", x[0, 1])

picked = x[[0, 1]]  # fancy indexing -> independent copy
picked[0, 0] = -1
print("after editing a fancy-index copy, x[0, 0] =", x[0, 0])
```

```text
after editing a slice, x[0, 1] = 99
after editing a fancy-index copy, x[0, 0] = 0
```

If you need to mutate a slice without touching the source, take an explicit copy first with x[:, 1].copy(). When unsure, np.shares_memory(a, b) answers the question directly.
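The view-or-copy cases described above can be checked one by one with `np.shares_memory`. This is a sketch on the same small (2, 3) array; the reshape of a transposed array is included as one example of a layout that strides cannot express.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

checks = {
    "basic slice": x[:, 1],                        # view
    "transpose": x.T,                              # view
    "reshape of contiguous x": x.reshape(3, 2),    # view
    "reshape of transposed x": x.T.reshape(6),     # copy: no single stride fits
    "fancy index": x[[0, 1]],                      # copy
}

for label, arr in checks.items():
    print(f"{label:24s} shares memory with x: {np.shares_memory(x, arr)}")
```

Running a check like this before mutating an array is cheaper than tracking down a source array that changed unexpectedly.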
Attention looks harder than indexing or reduction, but the shape logic is the same. A later Transformer chapter will explain the full computation. For now, track one intermediate result: each token's query vector is compared with every token's key vector by a dot product before those scores become weights.[3]
Start with hidden states of shape (B, T, D). Build queries and keys with matrices of shape (D, D). That keeps the batch axis and token axis in place:
```text
q.shape == (B, T, D)
k.shape == (B, T, D)
```

To compare each token with each token, use matrix multiplication (many dot products calculated together). The last two axes of k must swap first:

```text
unscaled_scores = q @ swapaxes(k, -1, -2)
unscaled_scores.shape == (B, T, T)
```

The result has two T axes:

- first T: which token is asking the question
- second T: which token is being compared against

```python
import numpy as np

B, T, D = 2, 3, 4
rng = np.random.default_rng(7)
x = rng.normal(size=(B, T, D))
wq = rng.normal(size=(D, D))
wk = rng.normal(size=(D, D))

q = x @ wq
k = x @ wk
unscaled_scores = q @ np.swapaxes(k, -1, -2)

print("q", q.shape)
print("k", k.shape)
print("unscaled_scores", unscaled_scores.shape)
```

```text
q (2, 3, 4)
k (2, 3, 4)
unscaled_scores (2, 3, 3)
```

This is why shape practice belongs before Transformer internals. The score matrix is still built from small, explainable axis moves; later you will scale it, hide positions a token shouldn't read, turn scores into weights, and use those weights to mix values.
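The two token axes can also be named explicitly with `np.einsum`. This sketch is an equivalent way to write the same multiply, not the chapter's method; the subscript letters and the random placeholder `q` and `k` are assumptions for illustration.

```python
import numpy as np

B, T, D = 2, 3, 4
rng = np.random.default_rng(7)
q = rng.normal(size=(B, T, D))  # placeholder queries
k = rng.normal(size=(B, T, D))  # placeholder keys

scores_matmul = q @ np.swapaxes(k, -1, -2)
# b = batch, t = asking token, s = compared token, d = feature (summed away)
scores_einsum = np.einsum("btd,bsd->bts", q, k)

print("einsum shape", scores_einsum.shape)
print("same values?", np.allclose(scores_matmul, scores_einsum))
# One entry by hand: ticket 0, token 1 asking about token 2.
print("single dot product matches?",
      np.isclose(scores_matmul[0, 1, 2], q[0, 1] @ k[0, 2]))
```

Writing the subscripts forces you to give the two T axes different letters, which is the same distinction the score shape (B, T, T) hides behind one symbol.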
Transformers run several attention heads in parallel by carving the feature axis D into H smaller feature groups of width D // H.[3] This is exactly the reshape-versus-transpose trap from earlier. To go from (B, T, D) to (B, H, T, D // H), first reshape to expose the head axis, then transpose to move it next to the batch axis. Reshaping straight to the target shape produces a legal array whose numbers are jumbled.
```python
import numpy as np

B, T, D, H = 2, 3, 8, 2
head_dim = D // H
x = np.arange(B * T * D).reshape(B, T, D)

# Right: expose the head axis, then move it next to the batch axis.
heads = x.reshape(B, T, H, head_dim).transpose(0, 2, 1, 3)

# Wrong: same target shape, but the values land in the wrong heads.
jumbled = x.reshape(B, H, T, head_dim)

print("heads shape", heads.shape)
print("same values?", np.array_equal(heads, jumbled))
```

```text
heads shape (2, 2, 3, 4)
same values? False
```

Both arrays have shape (2, 2, 3, 4), so a shape check alone would pass. Only the reshape-then-transpose path keeps each token's features grouped inside the right head.
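Because a shape check cannot catch the jumbled version, a value-level check is more useful. One option, sketched here under the same B, T, D, H assumptions, is a round trip: split into heads, undo the split, and confirm the original tensor comes back.

```python
import numpy as np

B, T, D, H = 2, 3, 8, 2
head_dim = D // H
x = np.arange(B * T * D).reshape(B, T, D)

# Split: (B, T, D) -> (B, T, H, head_dim) -> (B, H, T, head_dim)
heads = x.reshape(B, T, H, head_dim).transpose(0, 2, 1, 3)
# Merge: reverse the two steps in the opposite order.
merged = heads.transpose(0, 2, 1, 3).reshape(B, T, D)

print("round trip ok?", np.array_equal(merged, x))
print("head 0, token 0:", heads[0, 0, 0])  # first half of token 0's features
print("head 1, token 0:", heads[0, 1, 0])  # second half of token 0's features
```

The two printed vectors together should be exactly the eight features of ticket 0, token 0, which is what "each token's features grouped inside the right head" means in numbers.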
The same rule applies outside ticket text. Suppose ShopFlow prepares 32 catalog photos for a product-image model. Each red-green-blue (RGB) photo is 224 pixels high by 224 pixels wide. The shape is:
```text
images = (32, 224, 224, 3)
```

The last axis is the color channel. Pixel values are scaled between 0 and 1, and you want to subtract a measured mean-color vector (3,) from each pixel in every photo.

```python
import numpy as np

images = np.ones((32, 224, 224, 3))  # placeholder normalized catalog photos
mean_color = np.array([0.52, 0.48, 0.44])  # measured from training photos

normalized = images - mean_color
print("normalized shape", normalized.shape)
```

```text
normalized shape (32, 224, 224, 3)
```

NumPy lines the shapes up from the right. The trailing 3 matches, so the mean color broadcasts across all 224 rows, 224 columns, and 32 photos. No Python loop is needed.
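One way a mean-color vector of shape (3,) could be measured is to reduce every axis except the channel axis. This is a sketch with random placeholder pixels and a deliberately small batch (4 photos of 8 by 8 pixels) so it runs instantly; neither is the chapter's real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder batch: (photos, height, width, channels), values in [0, 1).
images = rng.random((4, 8, 8, 3))

# Reduce photo, row, and column axes; keep only the channel axis.
mean_color = images.mean(axis=(0, 1, 2))
normalized = images - mean_color

print("mean_color shape", mean_color.shape)
print("normalized shape", normalized.shape)
print("channel means now near zero?",
      np.allclose(normalized.mean(axis=(0, 1, 2)), 0.0))
```

Naming the kept axis first ("one number per channel") tells you which axes go into the `axis` tuple, the same habit as the token-pooling example.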
Shape bugs become easier to fix when you name the symptom, the cause, and the correction.
| Symptom | Likely cause | Fix |
|---|---|---|
| ValueError: operands could not be broadcast together with shapes (2,3,4) (3,) | you tried to match a token-sized vector with the feature axis | decide whether the smaller array should be (D,), (T, 1), or something else |
| output has shape (2, 3) after mean, but you expected (2, 4) | you reduced the feature axis instead of the token axis | say the axis name before calling mean, then verify the kept axes |
| code works with x[0] but breaks on real batches | you dropped the batch axis during indexing | keep tiny batch examples in scripts and assertions |
| v.T still has shape (4,) | a 1-D vector has a single axis | make the row or column explicit with v[None, :] or v[:, None] |
| reshape gave a legal array, but token meaning changed | axis order was wrong, but you regrouped values instead of permuting axes | use transpose when the axis names must move |
| q @ k raises a matrix-multiply mismatch | the last axis of q doesn't line up with the second-to-last axis of k | swap the last two axes of k before the multiply |
| subtracting a mean gives (32, 224, 224, 3) but colors look wrong | the mean vector shape might be (224, 224, 3) instead of (3,) | check that the trailing dimension matches the channel axis |
| editing a slice silently changes the original array | basic slices are views that share memory, not copies | call .copy() when you need an independent array |
| split heads have the right shape but attention output is garbage | you reshaped straight to (B, H, T, D // H) instead of reshaping then transposing | reshape to (B, T, H, D // H) first, then transpose(0, 2, 1, 3) |
The pattern behind all of these fixes is the same: write the intended axis meaning before asking whether the operation preserves that meaning.
Key insight: The error message "operands could not be broadcast together" isn't a cryptic NumPy complaint. It's a precise statement that two axes you assumed matched don't match. Treat it as a shape-contract violation, not a random failure, and the next check becomes obvious.
Build one tiny script that proves you understand the whole chapter.
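1. Create x = np.arange(24).reshape(2, 3, 4).
2. Add a bias of shape (4,) and assert the result is still (2, 3, 4).
3. Average over the token axis and assert the result is (2, 4).
4. Compute unscaled attention scores and assert the result is (2, 3, 3).
5. Try a bias of shape (3,) and read the failure.

If you want a stronger version, wrap each step in a function:

- add_feature_bias(x, bias)
- pool_tokens(x)
- unscaled_attention_scores(x, wq, wk)

Each function should assert both the input shape it expects and the output shape it promises.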
Here is a compact reference version:
```python
import numpy as np

def add_feature_bias(x: np.ndarray, bias: np.ndarray) -> np.ndarray:
    assert x.ndim == 3, f"expected (B, T, D), got {x.shape}"
    assert bias.shape == (x.shape[-1],), f"expected bias shape ({x.shape[-1]},), got {bias.shape}"
    out = x + bias
    assert out.shape == x.shape
    return out

def pool_tokens(x: np.ndarray) -> np.ndarray:
    assert x.ndim == 3, f"expected (B, T, D), got {x.shape}"
    out = x.mean(axis=1)
    assert out.shape == (x.shape[0], x.shape[2])
    return out

def unscaled_attention_scores(x: np.ndarray, wq: np.ndarray, wk: np.ndarray) -> np.ndarray:
    batch, tokens, width = x.shape
    assert wq.shape == (width, width)
    assert wk.shape == (width, width)
    q = x @ wq
    k = x @ wk
    unscaled_scores = q @ np.swapaxes(k, -1, -2)
    assert unscaled_scores.shape == (batch, tokens, tokens)
    return unscaled_scores

x = np.arange(24).reshape(2, 3, 4)
bias = np.array([100, 200, 300, 400])
wq = np.eye(4)
wk = np.eye(4)

shifted = add_feature_bias(x, bias)
pooled = pool_tokens(shifted)
unscaled_scores = unscaled_attention_scores(x, wq, wk)

print("shifted", shifted.shape)
print("pooled", pooled.shape)
print("unscaled_scores", unscaled_scores.shape)

try:
    add_feature_bias(x, np.array([10, 20, 30]))
except AssertionError as exc:
    print("caught:", exc)
else:
    raise AssertionError("Expected the shape guard to reject a token-sized bias")
```

```text
shifted (2, 3, 4)
pooled (2, 4)
unscaled_scores (2, 3, 3)
caught: expected bias shape (4,), got (3,)
```

Cover the code blocks above and answer these without running anything. If your explanation names numbers without axis meanings, slow down and add labels: reliable shape reasoning attaches each number to its role.
Open each checkpoint only after you commit to an answer.
If you got any of these wrong, re-read the one-running-example section and trace the indices by hand before moving on.
The durable lesson is simple: a shape is a sentence about data.
(B, T, D) is not three numbers. It says how many examples you have, how many positions each example contains, and how many numbers describe each position. Once you keep that sentence attached to the array, matrix multiplies, reductions, and attention stop feeling opaque.
Symptom: Code runs and output shape looks plausible, but every token or pixel gets the wrong offset.
Cause: A legal broadcast matched the wrong axis. NumPy checked size compatibility, not semantic intent.
Fix: Name the target axis first, then verify the smaller array is shaped for that axis on purpose.
Symptom: mean() runs without complaint, but downstream code gets scalars where vectors were expected.
Cause: The wrong axis was reduced, so the result kept different meaning than intended.
Fix: Say the axis name out loud before reducing, then check kept axes against the plain-English goal.
Symptom: Final shape matches expectation, but token order, catalog-photo layout, or attention heads become mixed up.
Cause: reshape regrouped the flat stream when the real task was axis permutation.
Fix: Use transpose when axis meaning moves. Use reshape only when flat order is already correct.
Symptom: (N,) sometimes behaves like a row and sometimes like a column, creating silent broadcast surprises.
Cause: One axis is missing entirely, so NumPy infers placement from context.
Fix: Make the axis explicit with v[None, :] or v[:, None] whenever orientation matters.
Symptom: Editing a slice unexpectedly changes the original batch.
Cause: Basic slices are views, and reshape may return a view that shares memory with the source.
Fix: Call .copy() when you need an independent array before mutating it.