Follow a shipment-delay model through prediction, loss, gradients, parameter updates, scalar autograd, mini-batches, validation checks, and PyTorch.
In the previous lesson, you traced a CNN score while its kernel values stayed fixed. A useful detector can't be supplied by hand for every crack, barcode smear, or torn label in a shipment photo. The model has to discover parameter values from examples.
We will make that discovery visible using a small regression problem: predict extra delivery delay in hours from a route-disruption score. A real CNN has far more parameters, but each kernel weight changes for the same reason as the one weight below: changing it changes a prediction, which changes a loss.
A model learns by repeating four steps:
| Step | What happens | Evidence you can inspect |
|---|---|---|
| Forward pass | Current parameters produce a prediction. | Print predicted delay. |
| Loss | A scalar measures the error against a known target. | Print one nonnegative number. |
| Backward pass | Derivatives measure how each parameter affects loss. | Print dL/dw and dL/db. |
| Update | An optimizer changes parameters using those derivatives. | Recompute and compare loss. |
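Suppose a shipment has route-disruption score x = 2 and later arrives y = 6 hours late. Start with this model, where $\hat{y}$ is the predicted extra delay:

$$\hat{y} = w x + b$$

Set w = 1 and b = 0. The predicted extra delay is 2 hours, far below the observed 6. For this numerical lesson, use half squared error:

$$L = \tfrac{1}{2}(\hat{y} - y)^2$$

The factor 1/2 doesn't change which parameters are good; it cancels the 2 when we differentiate the square. Here the initial loss is:

$$L = \tfrac{1}{2}(2 - 6)^2 = 8$$

The derivative of loss with respect to the prediction is -4: increasing the predicted delay would reduce this mistake. Because w is multiplied by x = 2, its gradient is twice as large:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = -4, \qquad \frac{\partial L}{\partial w} = (\hat{y} - y)\,x = -8, \qquad \frac{\partial L}{\partial b} = \hat{y} - y = -4$$

With a learning rate of 0.1, subtract those gradients:

$$w \leftarrow 1 - 0.1 \cdot (-8) = 1.8, \qquad b \leftarrow 0 - 0.1 \cdot (-4) = 0.4$$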
The revised prediction is 1.8 * 2 + 0.4 = 4.0 hours, and loss falls from 8.0 to 2.0.
```python
x, target = 2.0, 6.0
w, bias = 1.0, 0.0
learning_rate = 0.1

prediction = w * x + bias
loss = 0.5 * (prediction - target) ** 2
grad_prediction = prediction - target
grad_w = grad_prediction * x
grad_bias = grad_prediction

new_w = w - learning_rate * grad_w
new_bias = bias - learning_rate * grad_bias
new_prediction = new_w * x + new_bias
new_loss = 0.5 * (new_prediction - target) ** 2

print(f"before: prediction={prediction:.1f} loss={loss:.1f}")
print(f"gradients: dL/dw={grad_w:.1f} dL/db={grad_bias:.1f}")
print(f"after: prediction={new_prediction:.1f} loss={new_loss:.1f}")
```

```text
before: prediction=2.0 loss=8.0
gradients: dL/dw=-8.0 dL/db=-4.0
after: prediction=4.0 loss=2.0
```
One update helps, but training is a repeated correction:
```python
x, target = 2.0, 6.0
w, bias = 1.0, 0.0
learning_rate = 0.1

for step in range(5):
    prediction = w * x + bias
    loss = 0.5 * (prediction - target) ** 2
    error = prediction - target
    w -= learning_rate * error * x
    bias -= learning_rate * error
    print(f"step {step}: prediction={prediction:.3f} loss={loss:.3f} w={w:.3f} b={bias:.3f}")
```

```text
step 0: prediction=2.000 loss=8.000 w=1.800 b=0.400
step 1: prediction=4.000 loss=2.000 w=2.200 b=0.600
step 2: prediction=5.000 loss=0.500 w=2.400 b=0.700
step 3: prediction=5.500 loss=0.125 w=2.500 b=0.750
step 4: prediction=5.750 loss=0.031 w=2.550 b=0.775
```

The gradient says which direction lowers loss locally. The learning rate says how far to move. A very small rate makes progress slowly. A useful rate settles toward a low-loss setting. A large rate can jump past it and grow the error on every oscillation.
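For this single-example problem, the effect of the learning rate can be worked out by hand. Substituting the two update rules into the next prediction shows that each step multiplies the error by `1 - learning_rate * (x**2 + 1)`. The sketch below evaluates that factor for three illustrative rates. The helper name `error_multiplier` and the chosen rates are assumptions made for this illustration, and the formula applies only to a one-example linear model with half squared error.

```python
def error_multiplier(learning_rate: float, x: float) -> float:
    """Factor applied to (prediction - target) by one gradient step.

    Derived for prediction = w * x + bias with half squared error:
    new_error = error - learning_rate * error * (x**2 + 1).
    """
    return 1.0 - learning_rate * (x ** 2 + 1.0)


x = 2.0
for rate in (0.01, 0.1, 0.5):  # illustrative rates
    factor = error_multiplier(rate, x)
    behaviour = "shrinks" if abs(factor) < 1.0 else "grows"
    print(f"lr={rate:.2f}: error multiplier={factor:+.2f}, error {behaviour}")
```

A factor between -1 and 1 shrinks the error. A negative factor flips the sign of the error on every step, and a magnitude above 1 makes each overshoot larger than the last. With x = 2, that boundary sits at a learning rate of 0.4.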
The same one-example problem makes this visible because we can hold everything except learning rate constant:
```python
def train_for_five_steps(learning_rate: float) -> list[float]:
    x, target = 2.0, 6.0
    w, bias = 1.0, 0.0
    losses = []
    for _ in range(5):
        prediction = w * x + bias
        error = prediction - target
        losses.append(0.5 * error ** 2)
        w -= learning_rate * error * x
        bias -= learning_rate * error
    return losses

for rate in (0.01, 0.1, 0.5):
    losses = train_for_five_steps(rate)
    formatted = ", ".join(f"{loss:.3f}" for loss in losses)
    print(f"lr={rate:.2f}: {formatted}")
```

```text
lr=0.01: 8.000, 7.220, 6.516, 5.881, 5.307
lr=0.10: 8.000, 2.000, 0.500, 0.125, 0.031
lr=0.50: 8.000, 18.000, 40.500, 91.125, 205.031
```

Don't assume that one loss decrease proves the rate is safe. Real batches disagree with one another, and deep networks can have steep regions in some directions and shallow ones in others. Training code should log loss and gradient norms so a bad run becomes visible quickly.
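One possible shape for that logging, shown as a minimal sketch on the same one-example problem: compute the Euclidean norm of the gradient at each step and flag any step where loss went up. The helper `gradient_norm`, the warning text, and the deliberately oversized rate of 0.5 are assumptions for illustration, not a prescribed logging format.

```python
import math


def gradient_norm(gradients: list[float]) -> float:
    """Euclidean (L2) norm of a flat list of parameter gradients."""
    return math.sqrt(sum(g * g for g in gradients))


x, target = 2.0, 6.0
w, bias = 1.0, 0.0
learning_rate = 0.5  # deliberately too large for this problem

previous_loss = None
for step in range(4):
    error = w * x + bias - target
    loss = 0.5 * error ** 2
    norm = gradient_norm([error * x, error])  # [dL/dw, dL/db]
    increased = previous_loss is not None and loss > previous_loss
    warning = "  <-- loss increased" if increased else ""
    print(f"step {step}: loss={loss:.3f} grad_norm={norm:.3f}{warning}")
    previous_loss = loss
    w -= learning_rate * error * x
    bias -= learning_rate * error
```

In a run like this one, the loss and the gradient norm climb together, which is the kind of signal that makes a diverging configuration visible within a few steps.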
Gradient descent needs dL/dw for every trainable parameter. Calculating it by nudging each parameter separately is useful as a check for a tiny formula, but it would repeat forward evaluation for every parameter direction. Backpropagation uses reverse-mode automatic differentiation: it reuses the forward computation graph and moves derivatives backward from one scalar loss through each local operation.[1]
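To make the cost of the nudging approach concrete, the sketch below counts loss evaluations for a centered finite difference on a linear model with several weights. The helpers `make_counting_loss` and `finite_difference_gradient`, and the all-ones inputs, are hypothetical names and values chosen for this illustration.

```python
def make_counting_loss(inputs, target):
    """Return a half-squared-error loss and a counter of its evaluations."""
    calls = {"count": 0}

    def loss(weights):
        calls["count"] += 1
        prediction = sum(w * x for w, x in zip(weights, inputs))
        return 0.5 * (prediction - target) ** 2

    return loss, calls


def finite_difference_gradient(loss, weights, epsilon=1e-5):
    """Centered differences: two loss evaluations per parameter."""
    gradient = []
    for index in range(len(weights)):
        shifted_up = list(weights)
        shifted_up[index] += epsilon
        shifted_down = list(weights)
        shifted_down[index] -= epsilon
        gradient.append((loss(shifted_up) - loss(shifted_down)) / (2 * epsilon))
    return gradient


for parameter_count in (2, 10, 100):
    inputs = [1.0] * parameter_count
    weights = [0.0] * parameter_count
    loss, calls = make_counting_loss(inputs, target=6.0)
    finite_difference_gradient(loss, weights)
    print(f"{parameter_count:3d} parameters -> {calls['count']} loss evaluations")
```

The evaluation count grows as twice the number of parameters. Reverse-mode differentiation obtains all of those derivatives from one forward pass and one backward sweep.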
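Rumelhart, Hinton, and Williams showed that this error-driven weight adjustment could learn useful hidden representations in multilayer networks in their 1986 Nature paper.[2] For our small shipment-delay model, the graph is already enough to see the mechanism:

w, x → multiply → add b → prediction → subtract target → square → multiply by 0.5 → loss

The chain rule follows paths from loss back to parameters. Writing the prediction as $\hat{y} = wx + b$ and the target as $y$:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = (\hat{y} - y)\,x, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} = \hat{y} - y$$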
A useful engineering habit is to compare an analytical gradient with a centered finite difference. The check is slow for training, but excellent for finding a missing factor or wrong sign in a new backward formula.
```python
x, target = 2.0, 6.0
w, bias = 1.0, 0.0
epsilon = 1e-5

def loss_at(weight: float, intercept: float) -> float:
    prediction = weight * x + intercept
    return 0.5 * (prediction - target) ** 2

analytic_w = (w * x + bias - target) * x
analytic_bias = w * x + bias - target
numeric_w = (loss_at(w + epsilon, bias) - loss_at(w - epsilon, bias)) / (2 * epsilon)
numeric_bias = (loss_at(w, bias + epsilon) - loss_at(w, bias - epsilon)) / (2 * epsilon)

print(f"dL/dw analytic={analytic_w:.6f} numeric={numeric_w:.6f}")
print(f"dL/db analytic={analytic_bias:.6f} numeric={numeric_bias:.6f}")
print("checks pass:", abs(analytic_w - numeric_w) < 1e-8 and abs(analytic_bias - numeric_bias) < 1e-8)
```

```text
dL/dw analytic=-8.000000 numeric=-8.000000
dL/db analytic=-4.000000 numeric=-4.000000
checks pass: True
```

Andrej Karpathy's Micrograd makes the backward sweep inspectable: each object stores one scalar, its accumulated gradient, the values that produced it, and a local backward function.[3] Tensor libraries apply the same principle to matrix operations; a scalar version keeps the accounting readable.
For prediction = w * x + b and loss = 0.5 * (prediction - target) ** 2, every operation leaves a small backward rule:
| Operation | Local backward rule |
|---|---|
| out = left + right | Add out.grad to both parent gradients. |
| out = left * right | Add right * out.grad to the left parent, and vice versa. |
| out = value ** 2 | Add 2 * value * out.grad to the parent. |
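Before wrapping these rules in a class, it can help to apply them by hand with plain floats. The sketch below stores every intermediate value of the shipment example on the way forward, then walks back from the loss one local rule at a time. The intermediate names (`product`, `difference`, `squared`, and the `grad_*` variables) are chosen here for illustration.

```python
# Forward pass: keep every intermediate value.
x, target = 2.0, 6.0
w, b = 1.0, 0.0
product = w * x                    # multiply
prediction = product + b           # add
difference = prediction - target   # subtract (add of a negated value)
squared = difference ** 2          # power
loss = 0.5 * squared               # multiply by a constant

# Backward pass: start from d(loss)/d(loss) = 1, then apply one local rule per step.
grad_loss = 1.0
grad_squared = 0.5 * grad_loss                     # multiply rule
grad_difference = 2 * difference * grad_squared    # power rule
grad_prediction = grad_difference                  # add rule: passes the gradient through
grad_product = grad_prediction                     # add rule
grad_b = grad_prediction                           # add rule
grad_w = x * grad_product                          # multiply rule

print(f"loss={loss:.1f} dL/dw={grad_w:.1f} dL/db={grad_b:.1f}")
# prints: loss=8.0 dL/dw=-8.0 dL/db=-4.0
```

The results match the gradients computed analytically for the same starting parameters, which is what an automatic engine has to reproduce.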
The following compact engine builds the computation graph during the forward pass, orders nodes so children run before parents during the backward pass, and accumulates gradients with +=.
```python
class Value:
    def __init__(self, data, parents=(), operation=""):
        self.data = float(data)
        self.grad = 0.0
        self.parents = set(parents)
        self.operation = operation
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    __rmul__ = __mul__

    def __neg__(self):
        return self * -1.0

    def __sub__(self, other):
        return self + (-other)

    def __pow__(self, exponent):
        out = Value(self.data ** exponent, (self,), f"**{exponent}")
        def backward():
            self.grad += exponent * self.data ** (exponent - 1) * out.grad
        out._backward = backward
        return out

    def backward(self):
        order = []
        seen = set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

x, target = Value(2.0), Value(6.0)
w, bias = Value(1.0), Value(0.0)
prediction = w * x + bias
loss = 0.5 * (prediction - target) ** 2
loss.backward()

print(f"prediction={prediction.data:.1f} loss={loss.data:.1f}")
print(f"dL/dw={w.grad:.1f} dL/db={bias.grad:.1f}")
```

```text
prediction=2.0 loss=8.0
dL/dw=-8.0 dL/db=-4.0
```

One rule is easy to miss: if a value affects loss through two paths, its gradient is the sum of those paths. The simple expression a * a exposes an incorrect engine immediately.
```python
a = Value(3.0)
squared = a * a
squared.backward()
print(f"a*a={squared.data:.1f}")
print(f"d(a*a)/da={a.grad:.1f}")
```

```text
a*a=9.0
d(a*a)/da=6.0
```
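For contrast, here is a deliberately broken sketch of the same multiplication rule. The class `OverwritingValue` is hypothetical and exists only to show the failure: its backward function assigns with `=` where a correct engine accumulates with `+=`, so the second path through `a * a` replaces the first contribution.

```python
class OverwritingValue:
    """Deliberately broken scalar node: multiplication overwrites gradients."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None

    def __mul__(self, other):
        out = OverwritingValue(self.data * other.data)

        def backward():
            # BUG (on purpose): "=" discards any earlier contribution.
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad

        out._backward = backward
        return out


a = OverwritingValue(3.0)
squared = a * a
squared.grad = 1.0      # seed d(squared)/d(squared)
squared._backward()
print(f"d(a*a)/da={a.grad:.1f} (correct value is 6.0)")
# prints: d(a*a)/da=3.0 (correct value is 6.0)
```

The broken engine reports half the true derivative, because both operands are the same object and only the last assignment survives.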
In a tensor framework, parameter gradient buffers also accumulate until they are cleared. That makes intentional gradient accumulation possible, but it also creates a common bug: forgetting to clear gradients between independent optimizer steps.[4]
One damaged shipment doesn't represent a delivery network. Training normally estimates a useful update from a mini-batch, a subset of training examples processed before one parameter update. For mean loss, the batch gradient is the average of the example gradients.
This tiny dataset contains only three shipments, so the code below uses all three at once: technically, it is full-batch gradient descent. The arithmetic is the same one a mini-batch uses; a larger run uses a subset before each update. Here the true relationship is extra_delay = 2 * disruption + 1. Starting from zero parameters:
```python
shipments = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, bias = 0.0, 0.0
learning_rate = 0.08

for step in range(20):
    errors = [w * x + bias - target for x, target in shipments]
    loss = sum(0.5 * error ** 2 for error in errors) / len(shipments)
    grad_w = sum(error * x for error, (x, _) in zip(errors, shipments)) / len(shipments)
    grad_bias = sum(errors) / len(shipments)
    w -= learning_rate * grad_w
    bias -= learning_rate * grad_bias
    if step in (0, 1, 19):
        print(f"step {step:2d}: loss={loss:.4f} w={w:.3f} b={bias:.3f}")
```

```text
step  0: loss=13.8333 w=0.907 b=0.400
step  1: loss=4.2812 w=1.411 b=0.623
step 19: loss=0.0005 w=2.037 b=0.917
```

An epoch means one complete pass through the training examples. LLM pretraining runs are commonly described in optimizer steps and tokens processed because the dataset is very large; the batch contract remains the same: forward, average loss, backward, update.
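To show how epochs and mini-batches fit together, the sketch below reshuffles the three shipments each epoch and updates once per mini-batch. The batch size of 2, the epoch count, the fixed seed, and the helper `mean_loss` are assumptions for illustration; with three examples, the second batch in each epoch holds a single shipment.

```python
import random

shipments = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
batch_size = 2           # illustrative choice
learning_rate = 0.08
rng = random.Random(0)   # fixed seed so the shuffle order is repeatable


def mean_loss(records, w, bias):
    """Mean half squared error over a list of (x, target) records."""
    return sum(0.5 * (w * x + bias - target) ** 2 for x, target in records) / len(records)


w, bias = 0.0, 0.0
initial_loss = mean_loss(shipments, w, bias)

for epoch in range(20):
    order = shipments[:]
    rng.shuffle(order)
    # One epoch: every shipment lands in exactly one mini-batch.
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        errors = [w * x + bias - target for x, target in batch]
        grad_w = sum(error * x for error, (x, _) in zip(errors, batch)) / len(batch)
        grad_bias = sum(errors) / len(batch)
        w -= learning_rate * grad_w
        bias -= learning_rate * grad_bias

final_loss = mean_loss(shipments, w, bias)
print(f"initial loss={initial_loss:.4f} final loss={final_loss:.4f} w={w:.3f} b={bias:.3f}")
```

Each update now sees only part of the data, so individual steps are noisier than the full-batch version, but every epoch still touches each shipment once.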
Training loss only measures examples used for updates. A validation set contains held-out examples used for evaluation, not for backpropagation. If training loss drops while validation loss rises, the new parameters fit the training set better but transfer worse. That pattern is a warning, not a diagnosis: investigate data mismatch, leakage, noise, model capacity and regularization before deciding why it happened.
This deliberately mismatched example makes the warning visible. Training records follow a peak-season delay pattern; validation records follow a normal-week pattern. Starting from a model that already matches normal weeks, fitting peak season improves train loss and damages validation loss:
```python
train = [(1.0, 4.0), (2.0, 8.0), (3.0, 12.0)]  # peak season
validation = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # normal week
w = 2.0
learning_rate = 0.02

def mean_loss(records, weight):
    return sum(0.5 * (weight * x - target) ** 2 for x, target in records) / len(records)

for step in range(4):
    grad_w = sum((w * x - target) * x for x, target in train) / len(train)
    w -= learning_rate * grad_w
    print(f"step {step}: train={mean_loss(train, w):.3f} validation={mean_loss(validation, w):.3f} w={w:.3f}")
```

```text
step 0: train=7.672 validation=0.081 w=2.187
step 1: train=6.307 validation=0.296 w=2.356
step 2: train=5.185 validation=0.605 w=2.509
step 3: train=4.262 validation=0.981 w=2.648
```

For a real run, compare train and validation curves at consistent checkpoints and log enough context to reproduce the run: data version, split policy, batch size, learning rate, optimizer, seed and checkpoint identifier.
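One way to keep that context together is a small record written alongside each checkpoint. The sketch below is an assumption about what such a record might look like: the class `RunRecord`, its field names, and every string value are hypothetical, while the two loss values are taken from the last printed step of the mismatched example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    # All field names and string values are hypothetical examples.
    data_version: str
    split_policy: str
    batch_size: int
    learning_rate: float
    optimizer: str
    seed: int
    checkpoint_id: str
    step: int
    train_loss: float
    validation_loss: float


record = RunRecord(
    data_version="shipments-v1",
    split_policy="hold out normal-week records",
    batch_size=3,
    learning_rate=0.02,
    optimizer="sgd",
    seed=0,
    checkpoint_id="step-0003",
    step=3,
    train_loss=4.262,
    validation_loss=0.981,
)

# One JSON line per checkpoint is easy to append to a log and to diff later.
print(json.dumps(asdict(record), sort_keys=True))
```

Writing train and validation loss into the same record keeps the two curves aligned on identical checkpoints.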
Libraries remove manual derivative bookkeeping, not the conceptual separation between forward pass, loss, backward pass, and update. In the code below, multiplying PyTorch's mean squared error by 0.5 keeps the same half-squared-error convention used above. loss.backward() fills parameter gradients and optimizer.step() applies an update. The loss drops after one step on the same three-example batch.[4]
```python
import torch
from torch import nn

features = torch.tensor([[1.0], [2.0], [3.0]])
targets = torch.tensor([[3.0], [5.0], [7.0]])
model = nn.Linear(1, 1)

with torch.no_grad():
    model.weight.fill_(0.0)
    model.bias.fill_(0.0)

optimizer = torch.optim.SGD(model.parameters(), lr=0.08)
loss_fn = nn.MSELoss(reduction="mean")

optimizer.zero_grad()
before = 0.5 * loss_fn(model(features), targets)
before.backward()
print(f"before={before.item():.3f}")
print(f"weight gradient={model.weight.grad.item():.3f} bias gradient={model.bias.grad.item():.3f}")
optimizer.step()
after = 0.5 * loss_fn(model(features), targets)
print(f"after={after.item():.3f}")
```

```text
before=13.833
weight gradient=-11.333 bias gradient=-5.000
after=4.281
```

Because gradients accumulate, two backward calls without clearing produce twice the gradient for the same loss:
```python
import torch

w = torch.tensor(1.0, requires_grad=True)
loss = 0.5 * (w * 2.0 - 6.0) ** 2
loss.backward()
first_gradient = w.grad.item()

loss = 0.5 * (w * 2.0 - 6.0) ** 2
loss.backward()
accumulated_gradient = w.grad.item()

w.grad.zero_()
loss = 0.5 * (w * 2.0 - 6.0) ** 2
loss.backward()
reset_gradient = w.grad.item()

print(f"first backward: {first_gradient:.1f}")
print(f"without zeroing: {accumulated_gradient:.1f}")
print(f"after zeroing again: {reset_gradient:.1f}")
```

```text
first backward: -8.0
without zeroing: -16.0
after zeroing again: -8.0
```

We used half squared error because predicting delay hours is a regression problem and its derivative is easy to inspect. A CNN that selects damaged, intact, or needs-review, and an LLM that selects the next token, face a classification problem: the output layer produces competing logits rather than one continuous hour estimate.
The backward-pass mechanics won't disappear. The next lesson supplies the missing objective: softmax converts logits into a distribution, cross-entropy scores the correct choice, and their gradient becomes the signal sent backward through the model.
dL/dw, dL/db and one SGD update for the shipment-delay example.
backward(), zero_grad() and optimizer update.
a * a receives half its expected gradient. Cause: Graph paths overwrite rather than accumulate contributions. Fix: Update parent gradients with +=.
optimizer.zero_grad() before backward() unless accumulation is deliberate.