Learn why training works by nudging one response-latency weight, tracing and summing chain-rule paths, checking gradients, and confirming them with PyTorch.
Measuring engineering work turns vague performance into numbers. Training builds on the same instinct, but with a different problem: the system begins with poor numbers and must learn better ones from examples.
Suppose a coding assistant predicts response latency from prompt length. A request used 200 input tokens and finished in 5 seconds, but the current rule predicts 6. Which stored number should change, in which direction, and by how much? Gradients answer that question.
A parameter is a number a model is allowed to change while it learns. Start with one parameter, w, representing predicted seconds per 100 input tokens.
For one completed request:
| Quantity | Meaning | Value |
|---|---|---|
| x | prompt length in hundreds of tokens | 2.0 |
| y | observed response seconds | 5.0 |
| w | current seconds-per-token-block parameter | 3.0 |
| prediction = w * x | model estimate | 6.0 |
| loss = (prediction - y)² | penalty for being wrong | 1.0 |
The loss is one non-negative number. Zero means the prediction matches this row. Larger numbers mean the miss is worse.

```python
def squared_loss(weight: float, token_blocks: float, actual_seconds: float) -> float:
    prediction = weight * token_blocks
    return (prediction - actual_seconds) ** 2

w = 3.0  # predicted seconds per 100 input tokens
x = 2.0  # prompt length in hundreds of tokens
y = 5.0  # observed response seconds
prediction = w * x
loss = squared_loss(w, x, y)

print("prediction_seconds", prediction)
print("loss", loss)
```

```text
prediction_seconds 6.0
loss 1.0
```

That code evaluates a model. It doesn't yet learn. Learning requires knowing whether to move w upward or downward.
Change w by a tiny amount and recompute the loss:
| Weight | Prediction | Loss | What changed? |
|---|---|---|---|
| 3.000 | 6.000 | 1.000000 | starting point |
| 3.001 | 6.002 | 1.004004 | weight moved upward |
Loss increased by 0.004004 when the weight increased by 0.001. Divide those changes:

$$\text{slope} \approx \frac{0.004004}{0.001} = 4.004$$

A positive slope means increasing w makes loss worse near this point. To lower the loss, move w down.

```python
def loss(weight: float) -> float:
    # Single-row loss with x = 2.0 and y = 5.0 baked in.
    return (weight * 2.0 - 5.0) ** 2

w = 3.0
eps = 0.001
# One-sided nudge: change in loss divided by change in weight.
slope_estimate = (loss(w + eps) - loss(w)) / eps

print("loss_before", loss(w))
print("loss_after_nudge", loss(w + eps))
print("slope_estimate", round(slope_estimate, 3))
```

```text
loss_before 1.0
loss_after_nudge 1.0040039999999995
slope_estimate 4.004
```

This is a finite difference: a practical approximation to a derivative. It's slow if you repeat it for millions of parameters, but it's an excellent check for a small formula.
A derivative is the slope as the nudge becomes infinitesimally small. Here the loss is:

$$L(w) = (2w - 5)^2$$

Apply the chain rule in two pieces:

$$\frac{dL}{dw} = 2(2w - 5) \cdot 2$$

The first factor differentiates the square. The final 2 comes from the prediction 2w: changing w by one changes the prediction by two seconds for this row.

At w = 3:

$$\frac{dL}{dw} = 2(2 \cdot 3 - 5) \cdot 2 = 4$$

The finite-difference estimate was 4.004; the derivative is exactly 4 at this point.

```python
def loss(weight: float) -> float:
    return (weight * 2.0 - 5.0) ** 2

def analytical_gradient(weight: float) -> float:
    prediction = weight * 2.0
    # Chain rule: derivative of the square times derivative of the prediction.
    return 2 * (prediction - 5.0) * 2.0

w = 3.0
eps = 1e-5
# Centered difference: nudge in both directions and divide by the full width.
centered = (loss(w + eps) - loss(w - eps)) / (2 * eps)
exact = analytical_gradient(w)

print("analytical_gradient", exact)
print("finite_difference", round(centered, 6))
print("agree", abs(centered - exact) < 1e-6)
```

```text
analytical_gradient 4.0
finite_difference 4.0
agree True
```

A centered finite difference nudges left and right, so it's usually a better numerical check than a one-sided nudge.
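To see why the centered form is usually preferred, the sketch below compares both estimates against the exact slope of 4 for several nudge sizes. It reuses the single-row loss from above; the helper names `forward_difference` and `centered_difference` are illustrative and not taken from any library.

```python
def loss(weight: float) -> float:
    # Same single-row loss as above: x = 2.0 token blocks, y = 5.0 seconds.
    return (weight * 2.0 - 5.0) ** 2


def forward_difference(w: float, eps: float) -> float:
    # One-sided nudge: compare the loss at w + eps with the loss at w.
    return (loss(w + eps) - loss(w)) / eps


def centered_difference(w: float, eps: float) -> float:
    # Two-sided nudge: compare the loss at w + eps with the loss at w - eps.
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)


exact = 4.0  # analytical derivative at w = 3
for eps in (1e-1, 1e-2, 1e-3):
    forward_error = abs(forward_difference(3.0, eps) - exact)
    centered_error = abs(centered_difference(3.0, eps) - exact)
    print(eps, forward_error, centered_error)
```

For this quadratic loss the one-sided error shrinks in proportion to eps (roughly 4 * eps), while the centered error is already at floating-point rounding level. For other losses the centered estimate is not exact, but its error typically shrinks faster as eps decreases.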
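The derivative tells you which way is uphill. Gradient descent moves the opposite way. A learning rate is a small chosen step scale:

$$w_{\text{new}} = w - \text{learning rate} \times \frac{dL}{dw}$$

With w = 3, derivative 4, and learning rate 0.1:

$$w_{\text{new}} = 3 - 0.1 \times 4 = 2.6$$
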
Now the prediction is 2.6 * 2 = 5.2, and loss is (5.2 - 5)² = 0.04. One update reduced the loss from 1.0 to 0.04.

```python
w = 3.0
x = 2.0
y = 5.0
learning_rate = 0.1

# Columns printed: step, weight, prediction, loss, gradient.
for step in range(5):
    prediction = w * x
    loss = (prediction - y) ** 2
    gradient = 2 * (prediction - y) * x
    print(step, round(w, 4), round(prediction, 4), round(loss, 6), round(gradient, 4))
    w = w - learning_rate * gradient
```

```text
0 3.0 6.0 1.0 4.0
1 2.6 5.2 0.04 0.8
2 2.52 5.04 0.0016 0.16
3 2.504 5.008 6.4e-05 0.032
4 2.5008 5.0016 3e-06 0.0064
```

Calculus supplies direction. It doesn't choose a safe learning rate for you.
From the same starting point, a learning rate of 1.0 produces:

$$w_{\text{new}} = 3 - 1.0 \times 4 = -1$$

The new prediction is -2 seconds and loss becomes (-2 - 5)² = 49. The derivative was right; the step jumped straight past the good region.

```python
def one_step(learning_rate: float) -> tuple[float, float]:
    w, x, y = 3.0, 2.0, 5.0
    prediction = w * x
    gradient = 2 * (prediction - y) * x
    updated_w = w - learning_rate * gradient
    # Re-evaluate the loss at the updated weight.
    updated_loss = (updated_w * x - y) ** 2
    return updated_w, updated_loss

for lr in (0.1, 1.0):
    updated_w, updated_loss = one_step(lr)
    print("lr", lr, "new_w", updated_w, "new_loss", round(updated_loss, 4))
```

```text
lr 0.1 new_w 2.6 new_loss 0.04
lr 1.0 new_w -1.0 new_loss 49.0
```

In a real training loop, watch for non-finite loss values such as NaN or infinity. They can signal an unstable step or invalid input. Later optimization chapters will teach adaptive optimizers and schedules; first you need a backward calculation you can trust.
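A minimal sketch of such a check, assuming the same single-row problem as above. The function name `train_with_guard` and the `max_steps` limit are hypothetical choices for illustration; the guard itself uses `math.isfinite` from the standard library.

```python
import math
from typing import Optional, Tuple


def train_with_guard(learning_rate: float, max_steps: int = 500) -> Tuple[Optional[int], float]:
    """Return (step where the guard fired, or None) and the last finite weight."""
    w, x, y = 3.0, 2.0, 5.0
    for step in range(max_steps):
        error = w * x - y
        # Multiply instead of using ** 2: float multiplication overflows to inf,
        # whereas float ** int raises OverflowError in Python.
        loss = error * error
        gradient = 2 * error * x
        if not (math.isfinite(loss) and math.isfinite(gradient)):
            return step, w  # stop before applying a corrupted update
        w -= learning_rate * gradient
    return None, w


for lr in (0.1, 1.0):
    stopped_at, last_w = train_with_guard(lr)
    print("lr", lr, "guard_fired_at", stopped_at, "last_w", last_w)
```

With the stable learning rate the guard never fires. With learning rate 1.0 the error grows sevenfold in magnitude on every step until the loss overflows, and the loop returns the last finite weight instead of continuing.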
Response latency doesn't depend on prompt length alone. Add a second parameter for queue delay:

$$\text{prediction} = w_p x_p + w_q x_q$$

Here x_p is prompt length in hundreds of tokens, and x_q is queue delay in seconds. For one request:
| Value | Number |
|---|---|
| prompt blocks x_p | 2 |
| queue delay x_q | 1 |
| current weights [w_p, w_q] | [2, 2] |
| prediction | 2*2 + 2*1 = 6 |
| actual seconds | 5 |
| error | 1 |
| squared loss | 1 |
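A partial derivative changes one parameter while holding the other fixed:

$$\frac{\partial L}{\partial w_p} = 2(\text{prediction} - y)\,x_p = 2 \cdot 1 \cdot 2 = 4$$

$$\frac{\partial L}{\partial w_q} = 2(\text{prediction} - y)\,x_q = 2 \cdot 1 \cdot 1 = 2$$
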
Collect those slopes in order and you have the gradient: [4, 2]. The next article gives that list its full vector-and-shape language. For now, read it as one correction signal per stored parameter.

```python
import numpy as np

features = np.array([2.0, 1.0])  # prompt token blocks, queue delay
weights = np.array([2.0, 2.0])
actual_seconds = 5.0

prediction = float(features @ weights)
error = prediction - actual_seconds
gradient = 2 * error * features  # one partial derivative per weight
updated = weights - 0.1 * gradient
new_loss = float((features @ updated - actual_seconds) ** 2)

print("prediction", prediction)
print("gradient", gradient.tolist())
print("updated_weights", updated.tolist())
print("new_loss", new_loss)
```

```text
prediction 6.0
gradient [4.0, 2.0]
updated_weights [1.6, 1.8]
new_loss 0.0
```

One row can illustrate an update, but it can't uniquely determine what prompt length and queue delay should contribute across all future requests. Learning useful weights requires several observed requests.
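The sketch below illustrates that ambiguity. The candidate weight pairs and the second request are made-up values chosen for illustration: every pair fits the single observed row exactly, yet the pairs disagree on a different request.

```python
observed_features = (2.0, 1.0)  # prompt token blocks, queue delay
observed_seconds = 5.0

# Hypothetical weight pairs that all satisfy 2 * w_p + 1 * w_q = 5.
candidates = [(2.5, 0.0), (2.0, 1.0), (1.0, 3.0), (0.0, 5.0)]

# A second, made-up request used only to show that the candidates disagree.
other_features = (4.0, 0.5)


def predict(weights: tuple, features: tuple) -> float:
    return weights[0] * features[0] + weights[1] * features[1]


for weights in candidates:
    loss = (predict(weights, observed_features) - observed_seconds) ** 2
    print(weights, "loss_on_observed_row", loss,
          "prediction_on_other_row", predict(weights, other_features))
```

All four candidates reach zero loss on the observed row, so that row alone gives no reason to prefer one over another. Only additional rows can separate them.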
The two-parameter gradient came from a chain of small operations:

```text
prediction = w_p * x_p + w_q * x_q
error = prediction - y
loss = error ** 2
```

During the forward pass, these operations produce prediction = 6, error = 1, and loss = 1. During the backward pass, begin at the scalar loss and move backward:
| Local step | Derivative | Value |
|---|---|---|
| loss = error² | d loss / d error = 2 * error | 2 |
| error = prediction - y | d error / d prediction | 1 |
| prediction = w_p*x_p + w_q*x_q | d prediction / d w_p = x_p | 2 |
| same prediction | d prediction / d w_q = x_q | 1 |
Multiply along each path. Prompt length gets 2 * 1 * 2 = 4; queue delay gets 2 * 1 * 1 = 2.
Backpropagation efficiently repeats this reverse chain-rule calculation through a layered model. Rumelhart, Hinton, and Williams used back-propagated error derivatives to learn internal representations in multilayer networks in their 1986 Nature paper.[1]

```python
x_prompt = 2.0
x_queue = 1.0
w_prompt = 2.0
w_queue = 2.0
actual = 5.0

# Forward pass.
prediction = w_prompt * x_prompt + w_queue * x_queue
error = prediction - actual
loss = error ** 2

# Local derivatives, one per operation.
d_loss_d_error = 2 * error
d_error_d_prediction = 1.0
d_prediction_d_prompt_weight = x_prompt
d_prediction_d_queue_weight = x_queue

# Backward pass: multiply the local derivatives along each path.
d_loss_d_prompt_weight = d_loss_d_error * d_error_d_prediction * d_prediction_d_prompt_weight
d_loss_d_queue_weight = d_loss_d_error * d_error_d_prediction * d_prediction_d_queue_weight

print("forward", prediction, error, loss)
print("prompt_path", d_loss_d_error, "*", d_prediction_d_prompt_weight, "=", d_loss_d_prompt_weight)
print("queue_path", d_loss_d_error, "*", d_prediction_d_queue_weight, "=", d_loss_d_queue_weight)
```

```text
forward 6.0 1.0 1.0
prompt_path 2.0 * 2.0 = 4.0
queue_path 2.0 * 1.0 = 2.0
```

The example above sends the loss signal to two different parameters. A computation graph can also use the same parameter more than once. When backward paths meet at one parameter, add their contributions.
Use an intentionally simple latency rule to isolate that idea. The same weight w scales both the prompt-length estimate and the queue-delay estimate:

$$\text{prediction} = w\,x_p + w\,x_q$$

With w = 2, x_p = 2, x_q = 1, and y = 5, prediction is 6 and error is 1. The prompt path contributes 2 * 1 * 2 = 4. The queue path contributes 2 * 1 * 1 = 2. Both paths reach w, so its gradient is 4 + 2 = 6.

```python
w = 2.0
x_prompt = 2.0
x_queue = 1.0
actual = 5.0

prediction = w * x_prompt + w * x_queue
error = prediction - actual
prompt_path = 2 * error * x_prompt
queue_path = 2 * error * x_queue
# Both paths end at the same weight, so their contributions add.
gradient = prompt_path + queue_path

print("prediction", prediction)
print("path_contributions", [prompt_path, queue_path])
print("shared_weight_gradient", gradient)
assert gradient == 6.0
```

```text
prediction 6.0
path_contributions [4.0, 2.0]
shared_weight_gradient 6.0
```

This scalar rule is deliberately small. Larger models reuse parameters across many rows or token positions, and branches can merge inside a graph. The next batch example combines row contributions too, then divides by the number of rows because its loss is a mean.
Now use four completed requests rather than one. Each row has prompt length and recorded queue delay; the target is observed response seconds. You already used NumPy arrays earlier in the curriculum, so X @ weights computes all four predictions at once.

```python
import numpy as np

# Columns: prompt length in hundreds of tokens, queue delay in seconds.
X = np.array([
    [1.0, 0.5],
    [2.0, 1.0],
    [3.0, 0.5],
    [4.0, 2.0],
])
y = np.array([2.5, 5.0, 6.5, 10.0])
weights = np.array([0.0, 0.0])
learning_rate = 0.03

for step in range(101):
    predictions = X @ weights
    errors = predictions - y
    loss = float(np.mean(errors ** 2))
    gradient = (2 / len(X)) * (X.T @ errors)
    # Stop before a NaN or infinity reaches the weights.
    if not np.isfinite(loss) or not np.all(np.isfinite(gradient)):
        raise FloatingPointError("non-finite loss or gradient")
    if step in (0, 1, 10, 100):
        print(step, round(loss, 6), np.round(weights, 4).tolist())
    weights = weights - learning_rate * gradient

print("learned_weights", np.round(weights, 4).tolist())
print("final_predictions", np.round(X @ weights, 3).tolist())
```

```text
0 43.375 [0.0, 0.0]
1 9.852759 [1.08, 0.4425]
10 0.003643 [2.0574, 0.8557]
100 0.000709 [2.0259, 0.9364]
learned_weights [2.0257, 0.937]
final_predictions [2.494, 4.988, 6.546, 9.977]
```

The finite-value guard doesn't make training stable by itself. It stops the loop before a NaN or infinity update silently corrupts the weights. If it fires, inspect the learning rate and input scale first.
This is still a tiny model, not a production latency service. Its value is transparency: every prediction, loss, derivative, and update fits into code you can explain.
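The batch gradient in the loop above can also be read row by row. The sketch below, written in plain Python rather than NumPy, computes each row's contribution `2 * error * feature` at the starting weights `[0, 0]` and averages the contributions, which is what `(2 / len(X)) * (X.T @ errors)` does in one expression. The variable names are illustrative.

```python
rows = [(1.0, 0.5), (2.0, 1.0), (3.0, 0.5), (4.0, 2.0)]
targets = [2.5, 5.0, 6.5, 10.0]
weights = [0.0, 0.0]
learning_rate = 0.03

# One gradient contribution per row: 2 * error * feature, for each weight.
row_gradients = []
for (x_prompt, x_queue), y in zip(rows, targets):
    error = weights[0] * x_prompt + weights[1] * x_queue - y
    row_gradients.append((2 * error * x_prompt, 2 * error * x_queue))

# The loss is a mean over rows, so the gradient is the mean of the row contributions.
gradient = [sum(g[i] for g in row_gradients) / len(rows) for i in range(2)]
updated = [w - learning_rate * g for w, g in zip(weights, gradient)]

print("row_gradients", row_gradients)
print("mean_gradient", gradient)
print("weights_after_one_step", [round(w, 4) for w in updated])
```

The averaged gradient is [-36.0, -14.75], and one step with learning rate 0.03 gives [1.08, 0.4425], matching the weights printed at step 1 of the loop above.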
A manual backward formula can silently omit a factor, transpose the wrong array, or reuse the wrong value. Compare it to centered finite differences before trusting it.
For each parameter, nudge only that one value by eps, compute the two losses, and compare the numerical slope to the formula from your backward pass.

```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]])
y = np.array([2.5, 5.0, 6.5, 10.0])
weights = np.array([1.3, 1.1])

def mean_squared_loss(current: np.ndarray) -> float:
    errors = X @ current - y
    return float(np.mean(errors ** 2))

# Handwritten backward formula under test.
errors = X @ weights - y
manual_gradient = (2 / len(X)) * (X.T @ errors)

# Centered finite difference, nudging one weight at a time.
eps = 1e-5
numerical_gradient = np.zeros_like(weights)
for i in range(len(weights)):
    direction = np.zeros_like(weights)
    direction[i] = eps
    numerical_gradient[i] = (
        mean_squared_loss(weights + direction) - mean_squared_loss(weights - direction)
    ) / (2 * eps)

print("manual", np.round(manual_gradient, 6).tolist())
print("numerical", np.round(numerical_gradient, 6).tolist())
print("check_passed", np.allclose(manual_gradient, numerical_gradient, atol=1e-6))
```

```text
manual [-9.9, -3.925]
numerical [-9.9, -3.925]
check_passed True
```

Finite differences are far too expensive for normal training because they need repeated forward computations per parameter. They are ideal for testing a small backward formula.
Reverse-mode automatic differentiation propagates chain-rule products backward from outputs to inputs. It's especially useful when one scalar loss depends on many parameters, which is the training shape we have here.[2]
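To make the reverse pass concrete, here is a toy scalar version written from scratch. It is a teaching sketch under simplifying assumptions, not how PyTorch or any production library is implemented; the `Value` class and its methods are hypothetical names. Each operation records its inputs and local derivatives, and `backward()` walks the graph in reverse, adding contributions into each node's `grad`.

```python
class Value:
    """A scalar that remembers how it was computed (toy reverse-mode sketch)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        # Each entry is (parent_node, local derivative of this node w.r.t. that parent).
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Value(self.data - other.data, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data, ((self, other.data), (other, self.data)))

    def square(self):
        return Value(self.data ** 2, ((self, 2.0 * self.data),))

    def backward(self):
        # Post-order traversal: inputs are listed before the nodes that use them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d loss / d loss
        # Reversed order: each node is processed after every node that uses it.
        for node in reversed(order):
            for parent, local in node.parents:
                # Chain rule: multiply along the edge, add where paths meet.
                parent.grad += node.grad * local


# Shared-weight rule from earlier: prediction = w * x_p + w * x_q.
w = Value(2.0)
x_prompt, x_queue, actual = Value(2.0), Value(1.0), Value(5.0)
loss = (w * x_prompt + w * x_queue - actual).square()
loss.backward()
print("loss", loss.data)
print("shared_weight_gradient", w.grad)
```

The `+=` inside `backward()` is the same rule as summing the prompt and queue paths by hand, so the shared weight ends up with a gradient of 6.0.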
PyTorch autograd records operations involving tensors that request gradients, then exposes the reverse pass through a call to backward().[3] This cell repeats the small latency table and checks PyTorch against the manual NumPy derivative. It also runs backward twice to expose one important default: PyTorch adds each new result into the leaf tensor's .grad buffer.

```python
import numpy as np
import torch

X_np = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]], dtype=np.float32)
y_np = np.array([2.5, 5.0, 6.5, 10.0], dtype=np.float32)
w_np = np.array([1.3, 1.1], dtype=np.float32)

# Handwritten gradient of the mean squared loss.
manual = (2 / len(X_np)) * (X_np.T @ (X_np @ w_np - y_np))

X = torch.tensor(X_np)
y = torch.tensor(y_np)
w = torch.tensor(w_np, requires_grad=True)
loss = torch.mean((X @ w - y) ** 2)
loss.backward()
first_backward = w.grad.detach().clone()

# A second backward pass adds into the existing .grad buffer.
same_loss_again = torch.mean((X @ w - y) ** 2)
same_loss_again.backward()
after_second_backward = w.grad.detach().clone()

# Clearing the buffer restores a fresh gradient.
w.grad = None
fresh_loss = torch.mean((X @ w - y) ** 2)
fresh_loss.backward()
after_reset_backward = w.grad.detach().clone()

def rounded(values: torch.Tensor | np.ndarray) -> list[float]:
    return [round(float(value), 3) for value in values]

print("loss", round(float(loss.item()), 6))
print("manual", rounded(manual))
print("first_backward", rounded(first_backward))
print("agree", np.allclose(manual, first_backward.numpy(), atol=1e-6))
print("after_second_backward", rounded(after_second_backward))
print("after_reset_backward", rounded(after_reset_backward))
```

```text
loss 3.268751
manual [-9.9, -3.925]
first_backward [-9.9, -3.925]
agree True
after_second_backward [-19.8, -7.85]
after_reset_backward [-9.9, -3.925]
```

The second backward pass doubles the stored numbers because it adds the same gradient again. A training loop normally calls optimizer.zero_grad() before the next batch, or sets buffers to None, unless accumulating across several batches is intentional.[4] Frameworks save you from writing every derivative. They don't remove your responsibility to understand why a gradient has its sign, shape, and size.
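A minimal sketch of such a loop, assuming the same four-request table and the plain `torch.optim.SGD` optimizer with the learning rate 0.03 used in the NumPy version. It is one reasonable arrangement, not the only valid ordering; clearing gradients at the top of each step is a choice made here for clarity.

```python
import torch

X = torch.tensor([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]])
y = torch.tensor([2.5, 5.0, 6.5, 10.0])
w = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.03)

for step in range(101):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = torch.mean((X @ w - y) ** 2)
    if not torch.isfinite(loss):
        raise FloatingPointError("non-finite loss")
    loss.backward()        # fill w.grad with the fresh gradient
    optimizer.step()       # plain SGD update: w <- w - lr * w.grad

print("learned_weights", [round(float(v), 3) for v in w.detach()])
```

Because plain SGD applies the same update rule as the handwritten NumPy loop, the learned weights should land close to the ones that loop reported, up to float32 rounding.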
Use one latency row with x_prompt = 3, x_queue = 1, actual seconds y = 7, and weights [w_prompt, w_queue] = [2, 2].
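1. Compute the prediction, error, and squared loss.
2. Compute both partial derivatives by multiplying 2 * error by the relevant feature.
3. Take one gradient-descent step with learning rate 0.05. What is the new prediction and loss?
4. If a centered finite difference returns 5.99999 while your manual derivative says 6, is that evidence of a bug?
5. If the loss becomes NaN after a very large update, what two things should you inspect first?

Answers:

1. The prediction is 2*3 + 2*1 = 8, error is 8 - 7 = 1, and loss is 1.
2. The prompt partial is 2*1*3 = 6; the queue partial is 2*1*1 = 2. The gradient is [6, 2].
3. The updated weights are [2 - 0.05*6, 2 - 0.05*2] = [1.7, 1.9]. Prediction is 1.7*3 + 1.9 = 7.0, so loss is 0.

Training should now read as a concrete process: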
| Step | Question you can answer |
|---|---|
| Forward pass | What prediction did current parameters make? |
| Loss | How wrong was that prediction? |
| Derivative | How would loss respond to a small change in one parameter? |
| Gradient | What is the correction signal for every parameter? |
| Backpropagation | How do local derivatives carry that signal backward? |
| Gradient check | How do I test a handwritten backward formula? |
We deliberately stopped before optimizer algorithms such as momentum or Adam. They control how to use gradients over many steps; a later chapter teaches them once vectors, matrices, and deeper linear algebra are available.
- Adding learning_rate * gradient instead of subtracting it.
- Dropping the 2 * error term from squared loss.
- backward() runs, but you can't explain a zero, huge, or wrong-shaped gradient.
- Values keep growing in .grad even though the batch didn't change.
- Call optimizer.zero_grad() before each fresh batch, unless summing across batches is intentional.
[1] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323.
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR.
[3] Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
[4] PyTorch Contributors (2026). Optimizing Model Parameters. Official tutorial.