Turn one gradient vector into batches of model inputs while learning dot products, matrix transforms, tensor axes, and shape debugging.
In the previous chapter, ShopFlow's delivery model produced a gradient [4, 2]: one correction for its distance weight and one for its warehouse-backlog weight. That list already has a mathematical name. It's a vector.
Now you will learn how one list becomes a row in a dataset, how learned weights change its features, and how batches of token sequences need extra axes. These shapes are the notation behind training code, similarity search, and later Transformer chapters.
A scalar is one number, such as a loss of 1.0. A vector is an ordered list of numbers. Its order matters because each position has a meaning.
Reuse the delivery-time example from the gradient calculation:
| Vector | Meaning of position 0 | Meaning of position 1 | Values |
|---|---|---|---|
| features | hundreds of kilometres | warehouse backlog hours | [2, 1] |
| weights | hours per distance unit | hours per backlog unit | [2, 2] |
| gradient | correction for distance weight | correction for backlog weight | [4, 2] |
Every vector above has shape [2], but they play different roles. Shape tells you how values are arranged. Names tell you what those values mean.
```python
import numpy as np

features = np.array([2.0, 1.0])  # distance, backlog
weights = np.array([2.0, 2.0])
gradient = np.array([4.0, 2.0])

print("features", features.tolist(), "shape", features.shape)
print("weights", weights.tolist(), "shape", weights.shape)
print("gradient", gradient.tolist(), "shape", gradient.shape)
print("updated_weights", (weights - 0.1 * gradient).tolist())
```

```text
features [2.0, 1.0] shape (2,)
weights [2.0, 2.0] shape (2,)
gradient [4.0, 2.0] shape (2,)
updated_weights [1.6, 1.8]
```

A vector isn't automatically an embedding. Here it holds two features or two parameters. Much later, a token embedding will be a longer vector whose positions are learned rather than named by hand.
The delivery model uses a dot product to combine two features into one prediction:

$$\hat{y} = x \cdot w = x_0 w_0 + x_1 w_1$$

Here $x$ is the feature vector, $w$ is the weight vector, and $\hat{y}$ is the predicted delivery time. To isolate the operation, use a calibrated rule with $x = [2, 1]$ and $w = [2, 1]$:

$$\hat{y} = (2)(2) + (1)(1) = 5$$

The dot product multiplies matching positions, then adds the contributions. Two vectors of length two produce one scalar prediction.
```python
import numpy as np

features = np.array([2.0, 1.0])
weights = np.array([2.0, 1.0])

contributions = features * weights
prediction = features @ weights

print("contributions", contributions.tolist())
print("prediction_hours", float(prediction))
print("output_shape", prediction.shape)
```

```text
contributions [4.0, 1.0]
prediction_hours 5.0
output_shape ()
```

The output shape is (): NumPy's representation for a scalar. Multiplying element by element produced a vector of contributions; applying @ summed them into one number.
Training from one delivery would be fragile. Put three observed deliveries into a matrix: a rectangular array whose rows are orders and whose columns are features.
Read the conceptual shape [3, 2] as three orders by two features. NumPy prints the same shape as the tuple (3, 2). Multiplying every row by the same weight vector returns one prediction per order:
```python
import numpy as np

X = np.array([
    [1.0, 0.0],  # nearby order, no backlog
    [2.0, 1.0],  # medium distance, backlog
    [3.0, 1.0],  # farther order, backlog
])
weights = np.array([2.0, 1.0])

predictions = X @ weights

print("X_shape", X.shape)
print("weights_shape", weights.shape)
print("predictions_shape", predictions.shape)
print("predicted_hours", predictions.tolist())
```

```text
X_shape (3, 2)
weights_shape (2,)
predictions_shape (3,)
predicted_hours [2.0, 5.0, 7.0]
```

This is the same computation you did for one order, repeated efficiently for every row. In this matrix-vector case, the feature dimensions must agree and the row axis survives in the result.[1]
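To make "repeated for every row" concrete, the sketch below computes each order's dot product in an explicit loop and compares it with the single matrix-vector product. It reuses the same values as the example above; the names `looped` and `vectorized` are introduced here only for illustration.

```python
import numpy as np

X = np.array([
    [1.0, 0.0],
    [2.0, 1.0],
    [3.0, 1.0],
])
weights = np.array([2.0, 1.0])

# One dot product per order row, written out explicitly.
looped = np.array([row @ weights for row in X])

# The same work expressed as a single matrix-vector product.
vectorized = X @ weights

print("looped", looped.tolist())
print("vectorized", vectorized.tolist())
print("same", np.allclose(looped, vectorized))
```

The loop and the matrix product agree; the matrix form simply hands the repetition to NumPy.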
The weight vector above produces one output. A model often wants several outputs from the same input. Suppose ShopFlow needs a predicted number of hours and an internal handling-priority score.
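Use one column of a weight matrix for each output:

$$W = \begin{bmatrix} 2 & 0.2 \\ 1 & 1.5 \end{bmatrix}$$

The first column computes delivery hours; the second computes priority. For $x = [2, 1]$:

$$xW = [\,(2)(2) + (1)(1),\;\; (2)(0.2) + (1)(1.5)\,] = [5,\; 1.9]$$

The shape contract is:

$$[\text{orders},\ \text{input features}] \;@\; [\text{input features},\ \text{output features}] \rightarrow [\text{orders},\ \text{output features}]$$

For the whole three-row matrix: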
```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
W = np.array([
    [2.0, 0.2],  # distance contribution to [hours, priority]
    [1.0, 1.5],  # backlog contribution to [hours, priority]
])

outputs = X @ W

print("contract", X.shape, "@", W.shape, "->", outputs.shape)
print("second_order", outputs[1].tolist())
print("all_hour_predictions", outputs[:, 0].tolist())
```

```text
contract (3, 2) @ (2, 2) -> (3, 2)
second_order [5.0, 1.9]
all_hour_predictions [2.0, 5.0, 7.0]
```

This is why matrices matter in neural networks: one learned transform can mix input features into many output features for every row at once.
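One way to see that each column of the weight matrix acts as its own weight vector is to slice the columns out and apply them separately. This is a small sketch using the same numbers as above; `hours_weights`, `priority_weights`, and `stacked` are illustrative names, not part of the lesson's code.

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
W = np.array([
    [2.0, 0.2],
    [1.0, 1.5],
])

hours_weights = W[:, 0]     # first column: the earlier [2.0, 1.0] weight vector
priority_weights = W[:, 1]  # second column: weights for the priority score

# Placing the two per-output results side by side rebuilds X @ W.
stacked = np.stack([X @ hours_weights, X @ priority_weights], axis=1)

print("hours", (X @ hours_weights).tolist())
print("stacked_shape", stacked.shape)
print("matches_full_product", np.allclose(stacked, X @ W))
```

The matrix product is therefore several matrix-vector products computed together, one per output column.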
Suppose a teammate replaces W with a matrix expecting three input features:

$$[3, 2] \;@\; [3, 2]$$

The first object ends with feature count 2; the second expects feature count 3. Nothing in the rows can stand in for the missing third feature, so NumPy raises a ValueError.
```python
import numpy as np

X = np.zeros((3, 2))
wrong_W = np.zeros((3, 2))

try:
    X @ wrong_W
except ValueError:
    print("invalid", X.shape, "@", wrong_W.shape)
    print("meeting_dimensions", X.shape[-1], "and", wrong_W.shape[0], "do_not_match")

correct_W = np.zeros((2, 2))
print("valid_output_shape", (X @ correct_W).shape)
```

```text
invalid (3, 2) @ (3, 2)
meeting_dimensions 2 and 3 do_not_match
valid_output_shape (3, 2)
```

When you see a multiplication error, don't randomly transpose arrays until it disappears. Name the axes first: order rows, input features, and output features. Then fix the object whose meaning is wrong.
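The danger of a guessed transpose can be sketched in a few lines. The example below assumes the same order matrix and weight values used earlier in this chapter; it shows that a transpose can silence the error, or never trigger one at all, while changing what the numbers mean.

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])  # [orders, input_features]
wrong_W = np.zeros((3, 2))                            # built for three input features

# Guessing a transpose makes the meeting dimensions agree (2 and 2), but the
# two output columns of wrong_W are now being read as input features.
guessed = X @ wrong_W.T
print("guessed_shape", guessed.shape)

# With a square weight matrix, a stray transpose never raises at all.
W = np.array([[2.0, 0.2], [1.0, 1.5]])               # [input_features, outputs]
print("intended_second_order", (X @ W)[1].tolist())
print("transposed_second_order", (X @ W.T)[1].tolist())
```

Both products in the second half have shape (3, 2), so only the axis names, not the shapes, reveal which one is the intended computation.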
In NumPy and PyTorch code, a tensor is an array with any number of axes. This chapter uses that framework meaning: a vector is a one-axis tensor, and a matrix is a two-axis tensor. Text models commonly need three named axes:
| Axis | Symbol | ShopFlow meaning |
|---|---|---|
| batch | B | several support tickets processed together |
| sequence | T | token positions within each ticket |
| feature | D | numbers describing one token at one layer |
Use tiny handcrafted token features so every number stays readable. This isn't a trained embedding table; it's a shape exercise:
| Token feature | Position 0 | Position 1 |
|---|---|---|
| late-like signal | 1.0 | 0.0 |
| package-like signal | 0.5 | 0.5 |
| refund-like signal | 0.0 | 1.0 |
Two tickets, each containing three token-feature vectors, have shape [2, 3, 2].
```python
import numpy as np

ticket_tensor = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],  # "late package refund"
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],  # "refund package late"
])

print("tensor_shape", ticket_tensor.shape)
print("first_ticket_shape", ticket_tensor[0].shape)
print("first_token_shape", ticket_tensor[0, 0].shape)
print("first_token", ticket_tensor[0, 0].tolist())
```

```text
tensor_shape (2, 3, 2)
first_ticket_shape (3, 2)
first_token_shape (2,)
first_token [1.0, 0.0]
```

The last axis still contains features. Extra axes organize which ticket and which token each vector belongs to.
Once a ticket tensor has token features, a learned projection can change only its final feature dimension. Let P map two input features into three output features:

$$[B, T, 2] \;@\; [2, 3] \rightarrow [B, T, 3]$$

Batch and sequence remain intact. Only the token-feature size changes.
```python
import numpy as np

H = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
])
P = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
])

projected = H @ P

print("contract", H.shape, "@", P.shape, "->", projected.shape)
print("first_ticket_first_token", projected[0, 0].tolist())
print("middle_token_shared", np.allclose(projected[0, 1], projected[1, 1]))
```

```text
contract (2, 3, 2) @ (2, 3) -> (2, 3, 3)
first_ticket_first_token [1.0, 0.0, 0.5]
middle_token_shared True
```

For this [B, T, D] @ [D, P] case, NumPy applies the same two-dimensional projection at every leading [B, T] position, preserving the batch and sequence axes. More generally, higher-rank @ treats operands as stacks of matrices on their last two axes and broadcasts compatible leading axes.[1]
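The "stacks of matrices" rule can be checked with a short sketch. It reuses H and P from above, writes the same projection with `np.einsum` so the axis letters are explicit, and then builds a hypothetical `per_ticket_P` (one projection per ticket, an assumption made only for this illustration) to show a leading axis being matched rather than shared.

```python
import numpy as np

H = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
])  # [B=2, T=3, D=2]
P = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
])  # [D=2, P=3]

shared = H @ P

# einsum names the axes: d is summed away, while b and t survive.
named = np.einsum("btd,dp->btp", H, P)

# A stack of projections, one per ticket: its leading axis of size 2 pairs with B.
per_ticket_P = np.stack([P, 2.0 * P])  # [B=2, D=2, P=3]
per_ticket = H @ per_ticket_P

print("shared_shape", shared.shape)
print("einsum_matches", np.allclose(shared, named))
print("per_ticket_shape", per_ticket.shape)
print("second_ticket_doubled", np.allclose(per_ticket[1], 2.0 * shared[1]))
```

Sharing one projection across the batch is the usual case in this chapter; the per-ticket stack is shown only to illustrate how @ treats leading axes.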
Broadcasting lets a smaller array participate in element-wise operations against a larger one. Adding an output bias [hours_offset, priority_offset] to every order row is useful:

$$[3, 2] + [2] \rightarrow [3, 2]$$

NumPy compares dimensions from the right: equal dimensions match, a dimension of size 1 can expand, and missing leading dimensions act like size 1. This is powerful, but it can hide an axis mistake.[2]
```python
import numpy as np

outputs = np.array([
    [2.0, 0.2],
    [5.0, 1.9],
    [7.0, 2.1],
])  # columns: [hours, priority]

feature_bias = np.array([0.1, -0.1])  # intended: adjust each output feature
wrong_row_bias = np.array([[0.1], [-0.1], [0.0]])  # wrong: adjusts each order row

corrected = outputs + feature_bias
wrong_but_valid = outputs + wrong_row_bias

print("correct_shape", corrected.shape)
print("wrong_shape", wrong_but_valid.shape)
print("correct_second_order", corrected[1].tolist())
print("wrong_second_order", wrong_but_valid[1].tolist())
assert feature_bias.shape == (outputs.shape[-1],)
```

```text
correct_shape (3, 2)
wrong_shape (3, 2)
correct_second_order [5.1, 1.7999999999999998]
wrong_second_order [4.9, 1.7999999999999998]
```

Both additions return shape [3, 2], yet only one uses the intended axis. A failing operation is easy to debug. A legal broadcast on the wrong axis is more dangerous, so assertions should state meaning, not only final size.
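The right-to-left comparison rule can be inspected without doing any arithmetic by using `np.broadcast_shapes` (available in NumPy 1.20 and later). The sketch below also wraps the bias addition in a small helper, `add_feature_bias`, which is a hypothetical name introduced here to show an assertion that states meaning rather than only the final size.

```python
import numpy as np

# Shapes are aligned from the right; missing leading axes behave like size 1.
print("feature_bias", np.broadcast_shapes((3, 2), (2,)))
print("row_offset", np.broadcast_shapes((3, 2), (3, 1)))
try:
    np.broadcast_shapes((3, 2), (3,))
except ValueError:
    print("(3, 2) and (3,) do not broadcast")

def add_feature_bias(outputs: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Add one offset per output feature; reject anything shaped per row."""
    if bias.ndim != 1 or bias.shape[0] != outputs.shape[-1]:
        raise ValueError(f"expected bias shape ({outputs.shape[-1]},), got {bias.shape}")
    return outputs + bias

outputs = np.array([[2.0, 0.2], [5.0, 1.9], [7.0, 2.1]])
print("ok_shape", add_feature_bias(outputs, np.array([0.1, -0.1])).shape)

try:
    add_feature_bias(outputs, np.array([[0.1], [-0.1], [0.0]]))
except ValueError as error:
    print("rejected", error)
```

A shape check like this cannot tell rows from features when the two counts happen to be equal, which is one reason to also test a row against hand-computed values.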
The delivery predictor used a dot product to combine features with weights. The same operation can compare vectors to one another.
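For three token-feature vectors in one ticket, multiply the token matrix by its transpose:

$$\text{scores} = \text{tokens} \; \text{tokens}^{\top}, \qquad [3, 2] \;@\; [2, 3] \rightarrow [3, 3]$$

Every entry in the output is one dot product between two token positions.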
These are raw alignment scores, not automatically cosine similarities. A dot product changes when vector magnitude changes; cosine similarity first divides by vector lengths.
```python
import numpy as np

tokens = np.array([
    [1.0, 0.0],  # late-like
    [0.5, 0.5],  # package-like
    [0.0, 1.0],  # refund-like
])

scores = tokens @ tokens.T

print("contract", tokens.shape, "@", tokens.T.shape, "->", scores.shape)
print(scores)
print("late_to_refund", float(scores[0, 2]))
```

```text
contract (3, 2) @ (2, 3) -> (3, 3)
[[1.  0.5 0. ]
 [0.5 0.5 0.5]
 [0.  0.5 1. ]]
late_to_refund 0.0
```

Later, attention will create separate query and key vectors, scale the Q @ K.T score grid, and convert each row into weights that sum to one.[3] You don't need attention mechanics yet. For now, recognize its shape as many dot products arranged in a matrix.
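Returning to the distinction between raw dot products and cosine similarity, the sketch below stretches one token vector to four times its length and compares the two kinds of score. The helper `cosine_grid` is an illustrative name, and it assumes no row is all zeros, since a zero-length vector cannot be normalized.

```python
import numpy as np

def cosine_grid(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a [T, D] matrix."""
    lengths = np.linalg.norm(vectors, axis=1, keepdims=True)  # [T, 1]
    unit = vectors / lengths                                  # each row now has length 1
    return unit @ unit.T

tokens = np.array([
    [1.0, 0.0],  # late-like
    [0.5, 0.5],  # package-like
    [0.0, 1.0],  # refund-like
])

stretched = tokens.copy()
stretched[1] *= 4.0  # same direction as before, four times the magnitude

print("raw_before", float((tokens @ tokens.T)[0, 1]))
print("raw_after", float((stretched @ stretched.T)[0, 1]))
print("cosine_unchanged", np.allclose(cosine_grid(tokens), cosine_grid(stretched)))
```

The raw score between the first two tokens grows with the stretched vector, while the cosine grid stays the same because only direction enters it.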
The lab below packages the shape reasoning into a guardrail. It accepts a batch of token-feature sequences and a projection matrix. Before multiplying, it checks that the hidden states have three named axes, the projection is a matrix, and the feature size agrees with the matrix input size.
```python
import numpy as np

def project_tickets(hidden: np.ndarray, projection: np.ndarray) -> np.ndarray:
    if hidden.ndim != 3:
        raise ValueError("expected [batch, sequence, feature]")
    if projection.ndim != 2:
        raise ValueError("expected projection [input_feature, output_feature]")
    if hidden.shape[-1] != projection.shape[0]:
        raise ValueError("feature axis does not match projection input")
    return hidden @ projection

hidden = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
])
projection = np.array([[2.0, 0.0], [0.0, 3.0]])

projected = project_tickets(hidden, projection)
ticket_summaries = projected.mean(axis=1)

print("input", hidden.shape, "projection", projection.shape)
print("projected", projected.shape)
print("ticket_summaries", ticket_summaries.tolist())

try:
    project_tickets(hidden, np.zeros((3, 2)))
except ValueError as error:
    print("feature_guardrail", str(error))

try:
    project_tickets(hidden, np.zeros(2))
except ValueError as error:
    print("rank_guardrail", str(error))
```

```text
input (2, 3, 2) projection (2, 2)
projected (2, 3, 2)
ticket_summaries [[1.0, 1.5], [1.0, 1.5]]
feature_guardrail feature axis does not match projection input
rank_guardrail expected projection [input_feature, output_feature]
```

These handcrafted sequences have three meaningful token positions each. Real padded text batches need masked pooling: plain .mean(axis=1) would let padding positions change the summaries.
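Masked pooling can be sketched as follows. The function name `masked_mean` and the mask convention (1.0 for real tokens, 0.0 for padding) are assumptions made for this illustration; real pipelines may encode padding differently.

```python
import numpy as np

def masked_mean(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token features over real positions only.

    hidden: [batch, sequence, feature]
    mask:   [batch, sequence], 1.0 for real tokens and 0.0 for padding (assumed)
    """
    if hidden.ndim != 3 or mask.shape != hidden.shape[:2]:
        raise ValueError("expected hidden [B, T, D] and mask [B, T]")
    weights = mask[:, :, None]               # [B, T, 1], broadcasts across features
    totals = (hidden * weights).sum(axis=1)  # [B, D], padding contributes zero
    counts = weights.sum(axis=1)             # [B, 1], number of real tokens
    return totals / np.maximum(counts, 1.0)  # guard against tickets with no real tokens

hidden = np.array([
    [[2.0, 0.0], [1.0, 1.5], [0.0, 3.0]],  # three real tokens
    [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]],  # two real tokens, then one padding position
])
mask = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])

print("plain_mean", hidden.mean(axis=1).tolist())
print("masked_mean", masked_mean(hidden, mask).tolist())
```

For the second ticket, the plain mean divides by three positions even though only two hold real tokens; the masked version divides by two.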
Real token vectors won't be handwritten, and the model's learned projections will be much wider. The invariant stays small enough to say aloud: preserve batch and sequence; match and transform feature size.
NumPy is excellent for seeing the algebra. PyTorch carries the same tensor operations into trainable models and automatic differentiation.[4] Confirm that the same batch projection gives the same values in both libraries:
```python
import numpy as np
import torch

hidden_np = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
], dtype=np.float32)
projection_np = np.array([[2.0, 0.0], [0.0, 3.0]], dtype=np.float32)

numpy_output = hidden_np @ projection_np
torch_output = torch.tensor(hidden_np) @ torch.tensor(projection_np)

print("torch_shape", tuple(torch_output.shape))
print("same_values", np.allclose(numpy_output, torch_output.numpy()))
print("first_token", torch_output[0, 0].tolist())
```

```text
torch_shape (2, 3, 2)
same_values True
first_token [2.0, 0.0]
```

This is enough PyTorch for this chapter. Device placement, mixed precision, training modules, and full attention blocks each need their own careful lesson.
Use the conventions from this chapter: rows represent orders or token positions; the final dimension represents features.
1. A feature vector x has shape [3], and a weight vector has shape [3]. What does x @ w return?
2. X has shape [20, 3], and W has shape [3, 4]. What does X @ W mean and what shape does it produce?
3. A token tensor has shape [8, 12, 64], and a projection has shape [64, 16]. Which axes remain unchanged?
4. An output matrix has shape [20, 4]. Why can adding a value of shape [20, 1] be a silent bug when you meant a four-feature bias?
5. A token matrix has shape [12, 64]. What transpose is needed to create a pairwise score grid?

Answers:

1. A scalar: matching positions are multiplied and the three products are summed.
2. Each of the 20 rows is transformed from 3 input features into 4 output features, giving shape [20, 4].
3. Batch size 8 and sequence length 12 stay unchanged; feature size changes from 64 to 16, giving [8, 12, 16].
4. Shape [20, 1] legally broadcasts one different value across all four output columns in each row. The output shape looks right even though the adjustment is attached to orders instead of features.
5. Transpose the second operand to [64, 12]; then [12, 64] @ [64, 12] produces [12, 12].

You can now read a model operation as a contract:
| Object | Example shape | Question to ask |
|---|---|---|
| Vector | [2] | What does each feature position mean? |
| Matrix of rows | [3, 2] | What does one row describe? |
| Weight matrix | [2, 3] | Which input feature axis must match? |
| Tensor | [2, 3, 2] | What do batch, sequence, and feature axes mean? |
| Similarity grid | [3, 3] | Which vector pairs produced these dot products? |
You should be able to predict a multiplication output before running it, catch an invalid meeting dimension, and explain why a legal broadcast might still corrupt meaning.
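One way to practice predicting outputs before running them is to write the expected shape next to each operation and let NumPy confirm it. The sketch below uses zero-filled arrays with the shapes from this chapter's exercises; the dictionary layout and the name `results` are illustrative choices.

```python
import numpy as np

# Each entry pairs a hand-predicted shape with the operation that should produce it.
cases = {
    "vector_dot": ((), np.zeros(3) @ np.zeros(3)),
    "matrix_product": ((20, 4), np.zeros((20, 3)) @ np.zeros((3, 4))),
    "token_projection": ((8, 12, 16), np.zeros((8, 12, 64)) @ np.zeros((64, 16))),
    "row_broadcast": ((20, 4), np.zeros((20, 4)) + np.zeros((20, 1))),
    "score_grid": ((12, 12), np.zeros((12, 64)) @ np.zeros((12, 64)).T),
}

results = {}
for label, (predicted, value) in cases.items():
    actual = np.shape(value)
    results[label] = predicted == actual
    print(label, "predicted", predicted, "actual", actual)

print("all_predictions_correct", all(results.values()))
```

The row_broadcast case passes the shape check even though it is the silent bug from the broadcasting section, which is a reminder that a correct shape does not prove correct meaning.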
State the X @ W and [B, T, D] @ [D, P] contracts before executing code.

Symptom: A prediction is numerically plausible but wrong for known examples. Cause: Feature and weight vectors have the same length but disagree about position meanings. Fix: Name feature order once and assert it at data boundaries.
Symptom: A shape error disappears, but outputs no longer mean what the model intended. Cause: A transpose was used as a guess rather than from the row/column contract. Fix: Label each axis, then transpose only when the operation requires pairwise rows, such as a similarity grid.
Symptom: Results are shifted across entire orders or token positions. Cause: A singleton axis expanded in a valid but unintended direction. Fix: Assert the bias or mask shape before arithmetic and test a row with hand-computed expected values.
Symptom: A token projection fails after batching or reshaping. Cause: The last axis no longer represents the feature size expected by the projection. Fix: Print or assert [batch, sequence, feature] immediately before multiplication.
Symptom: Ticket summaries change when shorter tickets receive more padding. Cause: A plain sequence-axis mean treats padding positions like meaningful tokens. Fix: Use a padding mask and divide by the number of real tokens.
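For the first pitfall in this list, naming the feature order once and enforcing it where data enters might look like the sketch below. The constant `FEATURE_ORDER`, the feature names, and the helper `to_feature_vector` are all hypothetical; they stand in for whatever naming a real pipeline uses.

```python
import numpy as np

# Hypothetical feature names; this tuple is the single source of truth for order.
FEATURE_ORDER = ("distance_hundreds_km", "backlog_hours")

def to_feature_vector(order: dict) -> np.ndarray:
    """Build a feature vector in FEATURE_ORDER, regardless of dict key order."""
    missing = [name for name in FEATURE_ORDER if name not in order]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return np.array([float(order[name]) for name in FEATURE_ORDER])

weights = np.array([2.0, 1.0])  # hours per distance unit, hours per backlog unit
assert weights.shape == (len(FEATURE_ORDER),)

order = {"backlog_hours": 1.0, "distance_hundreds_km": 2.0}  # keys arrive shuffled
features = to_feature_vector(order)

print("features", features.tolist())
print("prediction_hours", float(features @ weights))
```

Because the vector is always assembled from the named order, a reshuffled input cannot silently pair distance with the backlog weight.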