Master the calculus behind every training step: partial derivatives, the chain rule with real numbers, gradient vectors, and why Adam and learning-rate schedules exist, using a concrete two-feature delivery-time predictor.
Calculus is the hidden engine under every training run. When you call loss.backward() in PyTorch or watch an Adam optimizer update millions of weights, you are watching the chain rule and partial derivatives in action.[1] This chapter teaches you the exact arithmetic so the later backpropagation lessons stop feeling like magic. We use one running e-commerce example the whole way: predicting how many hours it will take for a package to reach a customer from warehouse distance and package weight.
You only need basic algebra and the ability to read a small table of numbers. We will do every derivative by hand with real digits before we write a single line of NumPy. By the end you will be able to compute a gradient vector for a two-weight model, run gradient descent on paper, spot why a training curve is misbehaving, and understand exactly why Adam exists.
Suppose your model is dead simple. You only have one knob: a weight w that multiplies distance.
For one package:
- x = 2 (200 km)
- y = 5.0 hours
- w = 3.0
- ŷ = w · x = 6.0
- L = (6.0 - 5.0)² = 1.0

You want to know: if I change w a tiny amount, does loss go up or down, and by how much?
The derivative dL/dw answers exactly that question. It is the slope of the loss curve at the current w.
We can compute it two ways. First the long way that shows what "tiny change" really means.
| Quantity | Value |
|---|---|
| w_new | 3.0 + 0.001 = 3.001 |
| ŷ_new | 3.001 × 2 = 6.002 |
| L_new | (6.002 - 5.0)² = 1.004004 |
| change in loss | about 0.004 |
| slope estimate | 0.004 / 0.001 = 4 |
That is the definition of the derivative: how much loss changes per unit change in the weight.
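Written as a formula, the table is a forward difference. Here h is just our name for the tiny change in w (0.001 in the table):

$$\frac{dL}{dw} \approx \frac{L(w + h) - L(w)}{h} = \frac{1.004004 - 1.0}{0.001} = 4.004$$

The estimate sits slightly above 4 because h is small but not zero. For this loss the forward difference works out to 4 + 4h, so it approaches exactly 4 as h shrinks toward 0.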
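Now the fast algebraic way. Expand the loss:

$$L(w) = (\hat{y} - y)^2 = (w x - y)^2$$

The power rule and chain rule give:

$$\frac{dL}{dw} = 2\,(w x - y)\cdot x$$

Plug in the current value:

$$\frac{dL}{dw} = 2 \times (6.0 - 5.0) \times 2 = 4$$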
The number matches. Positive 4 tells us: increasing w increases loss. To reduce loss we must decrease w.
This single number is the entire reason gradient descent works.
Real models have many weights. Let's give our delivery predictor two knobs:
- w1 for distance (hundreds of km)
- w2 for weight (kg)

Current values: w1 = 1.0, w2 = 2.0
One training example:
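- x1 = 2, x2 = 3
- y = 5.0

Forward pass:

$$\hat{y} = w_1 x_1 + w_2 x_2 = 1.0 \times 2 + 2.0 \times 3 = 8.0$$

Loss:

$$L = (\hat{y} - y)^2 = (8.0 - 5.0)^2 = 9.0$$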
A partial derivative asks: "If I change only this one weight while freezing the other, how does loss change?"
We compute two of them.
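First hold w2 fixed and treat everything as a function of w1 only. It is exactly the scalar case we just did, but the "x" that multiplies w1 is now x1 = 2.

$$\frac{\partial L}{\partial w_1} = 2\,(\hat{y} - y)\cdot x_1 = 2 \times 3 \times 2 = 12$$

Similarly for w2:

$$\frac{\partial L}{\partial w_2} = 2\,(\hat{y} - y)\cdot x_2 = 2 \times 3 \times 3 = 18$$

Both partials are positive, so both weights are currently too high for this example. The gradient vector is therefore:

$$\nabla L = \left[\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2}\right] = [12,\ 18]$$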
This vector points uphill on the loss surface in the (w1, w2) plane. Gradient descent will move in the opposite direction.
In the example above the prediction was a direct linear combination. In a real neural net the path from a weight to the loss goes through many layers: linear transform, activation, another layer, softmax, cross-entropy. The chain rule is what lets us push the error signal all the way back through every step.
Let's keep the same delivery numbers but make the path explicit so the multiplications become visible.
We have:
- ŷ = w1·x1 + w2·x2
- e = ŷ - y
- L = e²

At our point: ŷ = 8, e = 3, L = 9
The local derivatives are easy:
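- dL/de = 2e = 6
- de/dŷ = 1
- dŷ/dw1 = x1 = 2
- dŷ/dw2 = x2 = 3

The chain rule says the total derivative for w1 is the product of every local piece along the path:

$$\frac{\partial L}{\partial w_1} = \frac{dL}{de}\cdot\frac{de}{d\hat{y}}\cdot\frac{\partial \hat{y}}{\partial w_1} = 6 \times 1 \times 2 = 12$$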
Exactly what we got earlier. The same multiplication happens for w2.
Here is the same story as a small computation graph (forward left to right, gradients flow right to left):
Every arrow on the backward pass carries a number. Multiply the numbers that touch a weight and you have its gradient. That is literally all backpropagation does, just on a graph with thousands of nodes instead of five.
The update rule is brutally simple: move each weight a small step against its gradient.

$$w \leftarrow w - \text{lr}\cdot\nabla L$$
Pick a learning rate lr = 0.05 (a typical starting value for small problems).
Current weights [1.0, 2.0], gradient [12, 18]:

- w1 = 1.0 - 0.05 × 12 = 1.0 - 0.6 = 0.4
- w2 = 2.0 - 0.05 × 18 = 2.0 - 0.9 = 1.1

Now run the forward pass again with the new weights on the same example:

- ŷ = 0.4×2 + 1.1×3 = 0.8 + 3.3 = 4.1
- L = (4.1 - 5.0)² = 0.81

Do it again (second step):

- ∂L/∂w1 = 2×(4.1-5)×2 = -3.6, ∂L/∂w2 = 2×(4.1-5)×3 = -5.4
- w1 = 0.4 - 0.05×(-3.6) = 0.58
- w2 = 1.1 - 0.05×(-5.4) = 1.37
The loss is now about 0.073 (ŷ = 0.58×2 + 1.37×3 = 5.27, so L = 0.27² ≈ 0.073). You can keep going. On this tiny surface the third step already brings the loss under 0.01.
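Why does the loss shrink so quickly here? A short derivation, offered as a sketch for this single-example case only: write the error as e = ŷ - y and substitute the update rule into the forward pass.

$$e_{\text{new}} = e - \text{lr}\cdot 2e\,(x_1^2 + x_2^2) = e\,\bigl(1 - 2\,\text{lr}\,(x_1^2 + x_2^2)\bigr)$$

With x1 = 2, x2 = 3 and lr = 0.05 the factor is 1 - 2 × 0.05 × 13 = -0.3. Every step multiplies the error by -0.3 (which is why the prediction jumped from above the target to below it) and multiplies the loss by 0.09, taking it from 9 to 0.81 on the first step. The same formula shows where things break: once lr exceeds 1/13 ≈ 0.077 the factor is larger than 1 in magnitude and the error grows on every step instead of shrinking.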
The illustration shows the same story: four descent steps on the loss surface, with the loss bars dropping from 4.0 to 0.08.
If every feature had the same scale and every direction on the loss surface were equally curved, vanilla gradient descent would be perfect. In real LLM training neither is true.
These are exactly the problems the Adam optimizer (and its cousins) were invented to solve.
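Adam keeps two running averages for each weight: a first moment m (an exponential moving average of the gradient) and a second moment v (an exponential moving average of the squared gradient).

Each step divides the averaged gradient by the square root of the second moment, so weights whose gradients have been consistently large take smaller steps and weights whose gradients have been consistently small take larger ones. That single trick, plus a couple of bias-correction terms, is a large part of why AdamW is the default optimizer in so many production training scripts today.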
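For reference, here is the standard Adam update written out, with g_t standing for the gradient at step t. The symbols β1, β2 and ε are the usual hyperparameter names (commonly 0.9, 0.999 and 1e-8); AdamW additionally applies decoupled weight decay, which is not shown here.

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}} \\
w &\leftarrow w - \text{lr}\cdot\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

On the very first step the bias correction gives m̂ = g and v̂ = g², so each weight moves by almost exactly lr regardless of how large its gradient is. With our gradient [12, 18] and lr = 0.05, both weights would move by about 0.05, where plain gradient descent moved them by 0.6 and 0.9.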
Even with Adam you usually still decay the learning rate.
A common simple schedule: start at lr = 3e-4, then multiply by 0.1 every time the validation loss stops improving for a while (or use cosine decay that smoothly goes to a tiny final value).
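The cosine option can be written in one line. In this sketch t is the current step and T the total number of steps; the final value lr_min is an assumption you choose yourself (for example 1% of the starting rate), not something fixed by the method.

$$\text{lr}_t = \text{lr}_{\min} + \tfrac{1}{2}\,(\text{lr}_{\max} - \text{lr}_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

At t = 0 the cosine is 1 and you get lr_max (3e-4 here). At t = T the cosine is -1 and you land on lr_min. Halfway through you are exactly at the midpoint between the two.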
Why it matters:
In the illustration you can see the failure box: lr=3.0 on the same surface sends the weights flying to huge values and loss explodes to 180. The success box uses a modest lr plus a schedule and reaches 0.01 cleanly.
Everything above was hand arithmetic. Now we turn it into code you can run and modify.
First, a minimal scalar example so you can see the loop.
```python
import numpy as np

def scalar_gd():
    w = 3.0
    x = 2.0
    y = 5.0
    lr = 0.1
    print("step | w | pred | loss | grad")
    for step in range(6):
        pred = w * x
        loss = (pred - y) ** 2
        grad = 2 * (pred - y) * x
        print(f"{step:4d} | {w:6.3f} | {pred:6.3f} | {loss:6.3f} | {grad:6.3f}")
        w = w - lr * grad

scalar_gd()
```
Typical output:
```text
step | w | pred | loss | grad
   0 |  3.000 |  6.000 |  1.000 |  4.000
   1 |  2.600 |  5.200 |  0.040 |  0.800
   2 |  2.520 |  5.040 |  0.002 |  0.160
   3 |  2.504 |  5.008 |  0.000 |  0.032
 ...
```
Loss collapses toward zero exactly as the math predicted.
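Before trusting a hand-derived gradient in a loop like this, it is worth checking it against a numerical estimate. The sketch below compares the analytic formula with a central difference on the same one-weight example; the function names are ours, chosen for illustration.

```python
def loss(w, x=2.0, y=5.0):
    """Squared error of the one-weight delivery model."""
    return (w * x - y) ** 2

def analytic_grad(w, x=2.0, y=5.0):
    """dL/dw from the power rule and chain rule."""
    return 2 * (w * x - y) * x

def numeric_grad(w, h=1e-3):
    """Central difference: slope between w - h and w + h."""
    return (loss(w + h) - loss(w - h)) / (2 * h)

w = 3.0
print("analytic:", analytic_grad(w))
print("numeric :", numeric_grad(w))
```

The two numbers should agree to several decimal places. If they do not, the bug is almost always in the hand-written derivative, not in the finite difference.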
Now the full two-weight delivery predictor with a tiny synthetic dataset.
```python
import numpy as np

# 5 realistic delivery examples: [distance_hundreds_km, weight_kg] -> hours
X = np.array([
    [2.0, 3.0],  # 200km, 3kg
    [1.5, 1.0],
    [4.0, 5.0],
    [0.8, 0.5],
    [3.2, 2.5],
], dtype=np.float32)
y = np.array([5.0, 3.2, 9.5, 2.1, 6.8], dtype=np.float32)

w = np.array([0.8, 1.5], dtype=np.float32)  # starting point
lr = 0.05
print("step | w1 | w2 | avg_loss")
for step in range(12):
    preds = X @ w  # vectorized forward
    errors = preds - y
    loss = np.mean(errors ** 2)
    # manual gradient: dL/dw = mean( 2*error * x for each example )
    grad = (2 * errors[:, None] * X).mean(axis=0)
    print(f"{step:4d} | {w[0]:6.3f} | {w[1]:6.3f} | {loss:8.4f}")
    w = w - lr * grad

print("Final weights:", w)
```
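You will see the loss start near 0.73 and fall to roughly 0.34 by the last printed step. Progress slows after the first few steps because this loss surface is far flatter in one direction than in the other. Change lr to 1.5 and watch it explode. Change it to 0.001 and watch it crawl.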
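To run that experiment without editing the loop by hand, you can wrap it in a function and sweep a few learning rates. This is a minimal sketch on the same five examples; `run_gd` is a hypothetical helper name, and it uses float64 so the diverging run produces a very large number instead of overflowing.

```python
import numpy as np

X = np.array([[2.0, 3.0], [1.5, 1.0], [4.0, 5.0], [0.8, 0.5], [3.2, 2.5]])
y = np.array([5.0, 3.2, 9.5, 2.1, 6.8])

def run_gd(lr, steps=12, w_start=(0.8, 1.5)):
    """Return the mean squared error after `steps` plain GD updates."""
    w = np.array(w_start, dtype=np.float64)
    for _ in range(steps):
        errors = X @ w - y
        grad = (2 * errors[:, None] * X).mean(axis=0)
        w = w - lr * grad
    return float(np.mean((X @ w - y) ** 2))

for lr in (0.001, 0.05, 1.5):
    print(f"lr={lr:<6} final loss = {run_gd(lr):.4g}")
```

The middle value should give the lowest final loss of the three: the small rate barely moves from the starting loss, and the large one ends far above it.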
```python
# Same X and y as above; reset w so Adam starts from the same point as plain GD
w = np.array([0.8, 1.5], dtype=np.float32)
m = np.zeros_like(w)  # first moment: running average of gradients
v = np.zeros_like(w)  # second moment: running average of squared gradients
beta1, beta2 = 0.9, 0.999
eps = 1e-8
lr = 0.05
t = 0

for step in range(20):
    t += 1
    preds = X @ w
    errors = preds - y
    loss = np.mean(errors ** 2)
    grad = (2 * errors[:, None] * X).mean(axis=0)

    # Print before the update so the loss matches the weights shown
    if step % 5 == 0:
        print(f"step {step:2d} loss {loss:8.4f} w {w}")

    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad ** 2)

    # Bias correction: m and v start at zero, so early averages are too small
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```
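Adam scales each weight's step by its own recent gradient magnitude, which makes it far less sensitive to feature scale than plain GD. On a problem this small the gap depends on lr and the number of steps, so compare the printed losses against the plain GD run rather than taking it on faith. On real embedding tables with sparse updates the difference is usually much larger.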
A neural network is just a long chain (actually a DAG) of the same operations: matrix multiply, add bias, activation, etc. The loss is still a scalar at the end.
Backpropagation walks the graph backward, multiplying the incoming error signal by the local derivative of each operation (exactly the chain rule arrows in the computation graph earlier). The only engineering trick is that we never materialize full Jacobian matrices: each operation just passes the incoming gradient through its local derivative, and we keep only the gradients for the parameters we actually store.
Once you can do the two-weight example by hand, reading the backward pass of a transformer block is just "the same idea, 50,000 times, with careful shape bookkeeping."
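To make the backward walk concrete, here is the two-weight delivery example written node by node, using the same numbers as the hand calculation. It is a sketch of the idea, not how a framework implements it: every intermediate value is stored on the way forward, then each gradient is the incoming gradient times one local derivative.

```python
# Forward pass: store every intermediate node of the graph.
w1, w2 = 1.0, 2.0
x1, x2, y = 2.0, 3.0, 5.0

y_hat = w1 * x1 + w2 * x2    # prediction
e = y_hat - y                # error
L = e ** 2                   # loss

# Backward pass: start at the loss and multiply by one local derivative per node.
dL_de = 2 * e                # local derivative of e**2
dL_dy_hat = dL_de * 1.0      # de/dy_hat = 1
dL_dw1 = dL_dy_hat * x1      # dy_hat/dw1 = x1
dL_dw2 = dL_dy_hat * x2      # dy_hat/dw2 = x2

print("loss:", L)
print("gradient:", [dL_dw1, dL_dw2])
```

Notice that `dL_dy_hat` is computed once and reused for both weights. That reuse of shared upstream gradients is what keeps backpropagation cheap on large graphs.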
- … lr * grad.
- … dL/dŷ and applied it directly to w. Your gradient will be off by a factor of x (or the activation derivative). Always trace the full path.
- Print the gradient vector itself during early debugging. If any entry is 0 while the loss is still high, that weight is not receiving any signal (dead ReLU, masked attention, or a feature that never varies).
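The zero-gradient symptom is easy to reproduce. The sketch below puts a ReLU on top of the two-weight predictor and starts from deliberately bad weights; `relu_unit` and the starting values are hypothetical, chosen only to trigger the problem.

```python
import numpy as np

def relu_unit(w, x, y):
    """Forward and backward pass for pred = relu(w . x) with squared error."""
    z = float(w @ x)                 # pre-activation
    a = max(z, 0.0)                  # ReLU output
    loss = (a - y) ** 2
    dL_da = 2 * (a - y)              # local derivative of the loss
    da_dz = 1.0 if z > 0 else 0.0    # local derivative of ReLU
    grad = dL_da * da_dz * x         # chain rule: dz/dw = x
    return loss, grad

w = np.array([-1.0, -2.0])           # hypothetical bad initialization
x = np.array([2.0, 3.0])
loss, grad = relu_unit(w, x, y=5.0)

print("loss:", loss, "grad:", grad)
if loss > 1.0 and np.count_nonzero(grad) == 0:
    print("warning: high loss but zero gradient, this unit is not learning")
```

The loss is large, yet every gradient entry is zero because the ReLU's local derivative is 0 for a negative input. No learning rate can fix that, which is why printing the gradient is more informative than watching the loss alone.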
- Given w1=0.5, w2=1.0, x1=3, x2=4, y=6. Compute ŷ, L, both partials, and the gradient vector.
- Take one gradient descent step with lr=0.1. What are the new weights? What is the new loss on the same example?
- Add a third weight w3 that multiplies a bias term (always 1). How does the chain rule change for w3?

Solution sketches appear at the end of the article. The important part is that you can do the arithmetic without running code.
You own the core calculus that every training loop relies on.
When you later read "the gradient of the loss with respect to the embedding table" or "we clipped the gradient norm at 1.0", you will know exactly which numbers are being computed and why the choice of optimizer and schedule changes the shape of the loss curve you see in Weights & Biases.
The same chain-rule arithmetic that let you move two weights on a delivery predictor is what lets a 70-billion-parameter model improve next-token prediction after seeing one more batch.
ŷ = 0.5×3 + 1.0×4 = 5.5, L = (5.5 - 6)² = 0.25
∂L/∂w1 = 2×(-0.5)×3 = -3, ∂L/∂w2 = 2×(-0.5)×4 = -4
∇L = [-3, -4]
New w1 = 0.5 - 0.1×(-3) = 0.8, w2 = 1.0 - 0.1×(-4) = 1.4
New ŷ = 0.8×3 + 1.4×4 = 8.0, new L = (8.0 - 6)² = 4.0. The loss actually went up from 0.25. The gradient pointed in the right direction, but with inputs this large lr=0.1 is too big a step and the update overshot the minimum on this example.
With lr=2.0 the step becomes huge: w1 jumps by +6, w2 by +8, giving ŷ = 55.5 and L = 49.5² ≈ 2450. Classic divergence.
The bias term has x3 = 1 for every row, so ∂L/∂w3 = 2 × error × 1. The chain rule is identical; the "feature" is just constantly 1.
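Once you have done the arithmetic by hand, a few lines of NumPy can confirm it. This sketch reproduces the exercise numbers; `sgd_step` is a hypothetical helper written for this check.

```python
import numpy as np

def sgd_step(w, x, y, lr):
    """One gradient descent step on a single example.

    Returns the loss before the step, the gradient, and the updated weights.
    """
    err = w @ x - y
    grad = 2 * err * x
    return err ** 2, grad, w - lr * grad

w = np.array([0.5, 1.0])
x = np.array([3.0, 4.0])
y = 6.0

loss, grad, w_small = sgd_step(w, x, y, lr=0.1)
loss_small = (w_small @ x - y) ** 2

_, _, w_big = sgd_step(w, x, y, lr=2.0)
loss_big = (w_big @ x - y) ** 2

print("loss:", loss, "gradient:", grad)
print("lr=0.1 ->", w_small, "new loss:", loss_small)
print("lr=2.0 ->", w_big, "new loss:", loss_big)
```

Both learning rates overshoot on this example; the larger one just does it far more violently.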
You now understand exactly how a model measures error and nudges its parameters downhill. The next chapter takes the matrix and vector language you have been using implicitly and makes it precise: you will compute SVD by hand, see why principal components are the natural axes of your data, and learn exactly how rank and condition numbers affect every embedding lookup and training step. Those factorizations are the hidden engine behind the optimizers and schedulers that follow.
Deep Learning.
Goodfellow, I., Bengio, Y., Courville, A. · 2016
Pattern Recognition and Machine Learning.
Bishop, C. M. · 2006
Learning Representations by Back-Propagating Errors.
Rumelhart, D. E., Hinton, G. E., Williams, R. J. · 1986 · Nature, 323
Machine Learning: A Probabilistic Perspective.
Murphy, K. P. · 2012