Trace SGD, momentum, Adam, AdamW, schedules, and gradient clipping on one uneven loss surface. Learn what each optimizer buffer measures and how to validate a training choice.
A triage model can learn the wrong lesson even while its loss goes down. Suppose it uses one weight for the common signal "escalated incident" and another for the rarer contrast "security issue rather than routine failure." A training step can correct the common weight too aggressively, jump past the best setting, and barely improve the contrast weight.
A matrix can have strong and weak directions. Training has a related geometry: a loss can bend sharply in one parameter direction and gently in another. A gradient says which way is uphill; an optimizer decides how much to move downhill, what history to remember, and how its pace changes over time.[1]
Measure everything on the smallest surface that exposes the problem. Let a be error in the common-signal weight and b be error in the contrast weight. The best possible setting is [a, b] = [0, 0]. Near that setting, use this loss:

$$L(a, b) = \tfrac{1}{2}\left(100\,a^2 + b^2\right)$$
The multiplier 100 says that the loss curves much more sharply along a than along b. It doesn't say that common tickets matter 100 times more to customers. It describes local training geometry.
Differentiate the loss to get its gradient:

$$\nabla L(a, b) = [\,100\,a,\; b\,]$$
At the initial parameter error [1, 1], the loss is 50.5 and the gradient is [100, 1]. Both parameters are equally far from their target, but the first coordinate asks for a step 100 times larger.
This first program makes the imbalance visible before introducing a new optimizer.
```python
import numpy as np

def loss(w: np.ndarray) -> float:
    return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

def gradient(w: np.ndarray) -> np.ndarray:
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
g = gradient(w)
next_w = w - 0.018 * g  # one SGD step with learning rate 0.018

print("start loss", loss(w))
print("gradient", g.tolist())
print("one SGD step", next_w.round(3).tolist())
print("next loss", round(loss(next_w), 3))
```

```text
start loss 50.5
gradient [100.0, 1.0]
one SGD step [-0.8, 0.982]
next loss 32.482
```

The loss falls after one step, but look at the coordinates: a crosses from 1.0 to -0.8, while b only reaches 0.982. That crossing is the zigzag. The optimizer is reducing the steep error by bouncing across zero while the shallow error moves slowly.
Stochastic gradient descent (SGD) updates parameters from a gradient computed on a minibatch of examples:

$$\theta_t = \theta_{t-1} - \eta\, g_t$$

Here, $\theta_t$ is the parameter vector after step $t$, $g_t$ is the minibatch gradient, and $\eta$ (eta) is the learning rate, the step-size multiplier.
Our tiny loss is deterministic rather than minibatched, so it removes batch noise and leaves the geometry exposed. Try three learning rates for eight steps:
```python
import numpy as np

def loss(w: np.ndarray) -> float:
    return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

def gradient(w: np.ndarray) -> np.ndarray:
    return np.array([100.0 * w[0], w[1]])

def run_sgd(lr: float, steps: int = 8) -> tuple[np.ndarray, float]:
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w -= lr * gradient(w)
    return w, loss(w)

for lr in (0.005, 0.018, 0.021):
    w, final_loss = run_sgd(lr)
    print(f"lr={lr:.3f} w={w.round(3).tolist()} loss={final_loss:.3f}")
```

```text
lr=0.005 w=[0.004, 0.961] loss=0.462
lr=0.018 w=[0.168, 0.865] loss=1.781
lr=0.021 w=[2.144, 0.844] loss=230.105
```

The small rate is calm but slow on b. Over only eight steps it also reports the lowest loss of these three runs, which is a reminder to measure rather than equate a larger rate with faster learning. The middle rate drives down a while it still crawls on b. The rate just above 0.02 makes the steep coordinate grow in magnitude, because its multiplicative update factor is 1 - 100 * 0.021 = -1.1.
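The multiplicative factor mentioned above can be tabulated directly. This is a minimal sketch, assuming the same quadratic loss, where one SGD step multiplies a by 1 - 100 * lr and b by 1 - lr. The names `CURVATURES` and `step_factors` are illustrative and don't come from the lesson.

```python
# Curvatures of 0.5 * (100 * a**2 + b**2) along a and b (assumed from the loss above).
CURVATURES = (100.0, 1.0)

def step_factors(lr: float) -> tuple[float, ...]:
    """Return the factor each coordinate is multiplied by in one SGD step."""
    return tuple(1.0 - lr * h for h in CURVATURES)

for lr in (0.005, 0.018, 0.021):
    factors = step_factors(lr)
    # A coordinate shrinks only while its factor has magnitude below 1.
    stable = all(abs(f) < 1.0 for f in factors)
    print(f"lr={lr:.3f} factors={[round(f, 3) for f in factors]} stable={stable}")
```

A negative factor with magnitude below 1 is the zigzag: the coordinate flips sign but shrinks. A magnitude above 1 is divergence. On this surface the boundary for the steep coordinate sits at lr = 2 / 100 = 0.02.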
SGD reacts to the current gradient only. Momentum stores a moving average of gradients. Use the exponential-moving-average convention because it matches Adam's first buffer:

$$
\begin{aligned}
m_t &= \beta\, m_{t-1} + (1 - \beta)\, g_t \\
\theta_t &= \theta_{t-1} - \eta\, m_t
\end{aligned}
$$

The coefficient $\beta$ controls memory. With $\beta = 0.9$, the new gradient contributes 0.1 and the previous direction contributes 0.9. Some libraries use an unnormalized velocity convention instead; that changes the numerical learning-rate scale, not the idea.
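The claim that the two conventions differ only in learning-rate scale can be checked numerically. The sketch below is an assumption-laden illustration: it uses the lesson's quadratic loss, plain Python lists, and two hypothetical helpers, `ema_momentum` and `velocity_momentum`. With both buffers starting at zero, the velocity buffer equals the EMA buffer divided by (1 - beta), so multiplying the learning rate by (1 - beta) should reproduce the same path.

```python
def gradient(w):
    # Gradient of 0.5 * (100 * a**2 + b**2).
    return [100.0 * w[0], w[1]]

def ema_momentum(steps, lr, beta=0.9):
    """EMA convention: m = beta * m + (1 - beta) * g."""
    w, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w

def velocity_momentum(steps, lr, beta=0.9):
    """Unnormalized convention: u = beta * u + g."""
    w, u = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        u = [beta * ui + gi for ui, gi in zip(u, g)]
        w = [wi - lr * ui for wi, ui in zip(w, u)]
    return w

beta, lr = 0.9, 0.018
w_ema = ema_momentum(20, lr, beta)
# Rescale the learning rate by (1 - beta) to match the EMA run.
w_vel = velocity_momentum(20, lr * (1.0 - beta), beta)
print("EMA     ", [round(x, 6) for x in w_ema])
print("velocity", [round(x, 6) for x in w_vel])
```

The practical consequence is that a learning rate tuned under one convention can't be copied unchanged into a library that uses the other.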
Across the steep wall, gradients alternate sign. Along the shallow valley floor, they tend to agree. Watch the moving average cancel one alternating component while accumulating the consistent component:
```python
import numpy as np

beta = 0.9
gradients = [
    np.array([100.0, 1.00]),
    np.array([-80.0, 0.98]),
    np.array([64.0, 0.96]),
    np.array([-51.2, 0.94]),
]
m = np.zeros(2)

for step, g in enumerate(gradients, start=1):
    m = beta * m + (1.0 - beta) * g
    print(step, "gradient", g.round(2).tolist(), "memory", m.round(3).tolist())
```

```text
1 gradient [100.0, 1.0] memory [10.0, 0.1]
2 gradient [-80.0, 0.98] memory [1.0, 0.188]
3 gradient [64.0, 0.96] memory [7.3, 0.265]
4 gradient [-51.2, 0.94] memory [1.45, 0.333]
```

The first memory coordinate rises and falls with each sign flip but stays much smaller than the raw wall gradients: the alternating component is partly cancelled rather than passed through. The second grows steadily and remains positive because each step agrees that b should shrink.
Momentum doesn't grant one perfect learning rate for every surface. It changes the path: useful repeated direction builds up, while alternating motion is damped. The next experiment uses the same loss and compares measured results rather than promising a winner in every task.
```python
import numpy as np

def loss(w: np.ndarray) -> float:
    return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

def gradient(w: np.ndarray) -> np.ndarray:
    return np.array([100.0 * w[0], w[1]])

def sgd(steps: int, lr: float) -> np.ndarray:
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

def momentum(steps: int, lr: float, beta: float = 0.9) -> np.ndarray:
    w = np.array([1.0, 1.0])
    m = np.zeros_like(w)
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * gradient(w)  # EMA-style buffer
        w -= lr * m
    return w

for name, w in [
    ("sgd", sgd(40, lr=0.018)),
    ("momentum", momentum(40, lr=0.018)),
]:
    print(name, "w", w.round(4).tolist(), "loss", round(loss(w), 4))
```

```text
sgd w [0.0001, 0.4836] loss 0.1169
momentum w [0.0307, 0.5324] loss 0.1889
```

At this unchanged numeric learning rate, SGD ends lower than our EMA-style momentum run. That's a valid result, not a failed lesson: momentum changes the update scale and usually needs its own learning-rate and memory-coefficient sweep. The benefit to look for is a better tuned path, not a guaranteed win from adding a buffer.
Nesterov momentum changes when the gradient is measured: it asks for the gradient near the point the current velocity is about to reach, rather than only at the current point. It can anticipate a turn. Focus on understanding stored direction; later training labs can compare library momentum variants on real minibatches.
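To make the look-ahead concrete, here is a minimal sketch of one common way to write Nesterov momentum on the same loss. It uses the classical velocity form rather than the EMA convention used above, so its learning rate sits on a different numerical scale. The function name `nesterov` and the values lr=0.005 and mu=0.9 are illustrative choices, not tuned settings from this lesson.

```python
def loss(w):
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def gradient(w):
    return [100.0 * w[0], w[1]]

def nesterov(steps, lr, mu=0.9):
    w, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        # Look ahead to where the current velocity is about to carry w.
        lookahead = [wi + mu * vi for wi, vi in zip(w, v)]
        # Measure the gradient at the look-ahead point, not at w itself.
        g = gradient(lookahead)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

w = nesterov(steps=40, lr=0.005)
print("nesterov w", [round(x, 4) for x in w], "loss", round(loss(w), 4))
```

The only difference from plain velocity momentum is the point at which `gradient` is evaluated. As with the momentum comparison above, a single run on one surface is a measurement, not a ranking.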
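Momentum still uses one global learning rate after smoothing. Adam (Adaptive Moment Estimation) adds a second moving average: the elementwise square of gradients. Its original formulation is:[2]

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

The symbols have precise jobs: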
| Symbol | Meaning | What it remembers |
|---|---|---|
| $m_t$ | first-moment estimate | recent signed direction |
| $v_t$ | uncentered second-moment estimate | recent squared gradient magnitude |
| $\hat{m}_t$, $\hat{v}_t$ | bias-corrected estimates | startup-adjusted state |
| $\epsilon$ | small denominator guard | avoids division by a near-zero scale |
v is often casually called variance. That wording is misleading here: Adam averages g ** 2; it doesn't subtract the mean gradient to calculate statistical variance.
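The gap between the two quantities is easy to see on a short gradient history. This sketch is a simplification: it uses a plain average instead of Adam's exponential moving average, and the two gradient lists are made-up illustrative values. The helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def second_moment(values):
    """Uncentered: average of squares, the kind of quantity Adam's v tracks."""
    return mean([x * x for x in values])

def variance(values):
    """Centered: average squared deviation from the mean gradient."""
    mu = mean(values)
    return mean([(x - mu) ** 2 for x in values])

steady = [1.00, 0.98, 0.96, 0.94]          # small gradients that agree in sign
alternating = [100.0, -80.0, 70.0, -60.0]  # large gradients that flip sign

for name, grads in [("steady", steady), ("alternating", alternating)]:
    print(name, "second moment", round(second_moment(grads), 4),
          "variance", round(variance(grads), 4))
```

For the steady coordinate the variance is close to zero while the second moment stays close to the squared gradient size. Dividing by the square root of a true variance would therefore produce a very different step from the one Adam takes.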
There is another subtle correction to make. On Adam's very first step, a nonzero gradient of 100 and a nonzero gradient of 1 both produce an update close to one learning rate after bias correction. One observation isn't enough to infer which coordinate repeatedly needs caution.
```python
import numpy as np

g = np.array([100.0, 1.0])
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
m = (1.0 - beta1) * g
v = (1.0 - beta2) * g**2
m_hat = m / (1.0 - beta1)  # bias correction at t = 1
v_hat = v / (1.0 - beta2)
update = lr * m_hat / (np.sqrt(v_hat) + eps)

print("m_hat", m_hat.tolist())
print("v_hat", v_hat.tolist())
print("update", update.round(6).tolist())
```

```text
m_hat [100.0, 1.0]
v_hat [10000.0, 1.0]
update [0.01, 0.01]
```
Now provide a short gradient history. The first coordinate alternates with large values, while the second keeps agreeing on a small positive direction. Adam's step reflects both pieces of state.
```python
import numpy as np

gradients = [
    np.array([100.0, 1.00]),
    np.array([-80.0, 0.98]),
    np.array([70.0, 0.96]),
    np.array([-60.0, 0.94]),
]
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
m = np.zeros(2)
v = np.zeros(2)

for t, g in enumerate(gradients, start=1):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(t, "m_hat", m_hat.round(3).tolist(), "step", update.round(5).tolist())
```

```text
1 m_hat [100.0, 1.0] step [0.01, 0.01]
2 m_hat [5.263, 0.989] step [0.00058, 0.00999]
3 m_hat [29.151, 0.979] step [0.00346, 0.00998]
4 m_hat [3.228, 0.967] step [0.00041, 0.00997]
```

The alternating large coordinate receives a small signed move once its directional evidence conflicts. The consistently positive coordinate keeps a positive update. Adam isn't finding curvature or proving which feature matters; it transforms recent gradient history into coordinate-wise steps.
Both moving averages begin at zero. Early in training, they are pulled toward that initialization. With Adam's usual coefficients, the first raw first-moment buffer is 0.1 * g, while the first raw second-moment buffer is 0.001 * g**2. Dividing those raw values would make the startup update too large.
This program compares the first update with and without correction for one scalar parameter:
```python
import math

gradient = 4.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
m = (1.0 - beta1) * gradient
v = (1.0 - beta2) * gradient**2

uncorrected_step = lr * m / (math.sqrt(v) + eps)
m_hat = m / (1.0 - beta1)
v_hat = v / (1.0 - beta2)
corrected_step = lr * m_hat / (math.sqrt(v_hat) + eps)

print("uncorrected step", round(uncorrected_step, 5))
print("corrected step", round(corrected_step, 5))
```

```text
uncorrected step 0.03162
corrected step 0.01
```

Bias correction isn't a cosmetic detail. It makes the startup behavior correspond to estimates that aren't artificially pulled toward the all-zero initial buffers.
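How long the correction matters depends on the coefficient. The sketch below tabulates the factor 1 / (1 - beta ** t) that bias correction multiplies each raw buffer by, assuming Adam's usual coefficients from the examples above. The helper name `correction` is illustrative.

```python
beta1, beta2 = 0.9, 0.999

def correction(beta: float, t: int) -> float:
    """Factor that bias correction multiplies a raw buffer by at step t (t starts at 1)."""
    return 1.0 / (1.0 - beta ** t)

for t in (1, 10, 100, 1000, 10000):
    print(t, "m factor", round(correction(beta1, t), 4),
          "v factor", round(correction(beta2, t), 4))
```

Both factors approach 1, so the correction fades on its own. The first-moment factor is essentially gone within about a hundred steps, while the second-moment factor is still about 1.58 at step 1000 because beta2 = 0.999 forgets the zero initialization much more slowly.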
Regularization often nudges parameter magnitudes downward. With ordinary SGD, adding an L2 penalty to the loss can produce the same shrinkage behavior as weight decay after adjusting coefficients. With adaptive optimizers, those operations aren't equivalent: putting the penalty inside Adam's gradient also sends it through coordinate-wise scaling.[3]
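The SGD half of that statement can be checked in a few lines. This is a sketch under stated assumptions: a scalar parameter, a penalty written as (l2 / 2) * theta ** 2, and decay written as lr * weight_decay * theta, in which case the two coefficients coincide. Under other conventions the coefficients need rescaling. All names and numbers are illustrative.

```python
def sgd_with_l2(theta, data_grad, lr, l2):
    """SGD on loss + (l2 / 2) * theta**2: the penalty adds l2 * theta to the gradient."""
    return theta - lr * (data_grad + l2 * theta)

def sgd_with_decay(theta, data_grad, lr, weight_decay):
    """SGD with the decay term applied directly to the prior parameter value."""
    return theta - lr * data_grad - lr * weight_decay * theta

theta, data_grad, lr, coeff = 10.0, 0.5, 0.1, 0.01
print("L2 penalty  ", sgd_with_l2(theta, data_grad, lr, coeff))
print("weight decay", sgd_with_decay(theta, data_grad, lr, coeff))
```

The two updates agree because plain SGD applies the same multiplier to every part of the gradient. An adaptive optimizer breaks that algebra by rescaling the penalty term along with the data gradient.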
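AdamW applies a decoupled decay directly to the prior parameter value:

$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta\, \lambda\, \theta_{t-1}$$

$\lambda$ is the weight-decay coefficient. The data-gradient update and shrinkage term are visible separately.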
The difference appears even when the data gradient is zero. A parameter of 10.0 should decay a small amount. If its L2 penalty is instead fed through Adam on the first step, normalization can produce a very different move:
```python
import math

theta = 10.0
lr = 0.1
weight_decay = 0.01
eps = 1e-8

# Decoupled decay: shrink the parameter directly.
adamw_theta = theta - lr * weight_decay * theta

# Coupled L2: the penalty gradient passes through Adam's first-step normalization.
coupled_gradient = weight_decay * theta
coupled_adam_step = lr * coupled_gradient / (abs(coupled_gradient) + eps)
coupled_theta = theta - coupled_adam_step

print("AdamW with zero data gradient", round(adamw_theta, 4))
print("Adam plus coupled L2 first step", round(coupled_theta, 4))
```

```text
AdamW with zero data gradient 9.99
Adam plus coupled L2 first step 9.9
```

This isn't a recommendation to choose a decay coefficient from a single number. It proves a narrower point: AdamW decay and Adam with an L2 term implement different update rules.
An optimizer decides how to interpret gradients at one step. A learning-rate schedule decides how the global multiplier changes across many steps.
| Phase | What you're controlling | A measurable reason to adjust pace |
|---|---|---|
| startup | avoid a large initial global step | loss or gradient norm spikes immediately |
| middle | make progress while the model learns | validation loss is still improving |
| late | reduce movement around a good region | progress becomes noisy or stalls |
The original Transformer training recipe used Adam with a schedule that increased its learning rate linearly for 4000 warmup steps and then decreased it in proportion to the inverse square root of the step number.[4] That's evidence for one successful recipe, not a universal setting.
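The shape of that schedule can be sketched in one function. The warmup length of 4000 comes from the description above; the scale factor d_model ** -0.5 with d_model = 512 is an assumption about that recipe's base configuration, and the function name `inverse_sqrt_lr` is illustrative.

```python
def inverse_sqrt_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup, then decay proportional to 1 / sqrt(step). step starts at 1."""
    # During warmup the second argument is smaller and grows linearly with step.
    # After warmup the first argument is smaller and decays as step ** -0.5.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(step, f"{inverse_sqrt_lr(step):.6f}")
```

The two branches meet at step = warmup_steps, which is where the rate peaks. Quadrupling the step count after that point halves the rate.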
Warmup followed by cosine decay is another useful shape to understand. It reaches a peak slowly, then follows a smooth curve toward a configured final rate. This self-contained example makes the schedule inspectable:
```python
import math

def warmup_cosine(step: int, total_steps: int, warmup_steps: int, peak_lr: float, final_lr: float) -> float:
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (peak_lr - final_lr) * cosine

values = [
    warmup_cosine(step, total_steps=10, warmup_steps=2, peak_lr=3e-4, final_lr=3e-5)
    for step in range(10)
]
print([f"{lr:.6f}" for lr in values])
```

```text
['0.000150', '0.000300', '0.000300', '0.000287', '0.000249', '0.000195', '0.000135', '0.000081', '0.000043', '0.000030']
```

A schedule choice needs an experiment: log the learning rate beside training loss, validation loss, and gradient norm. A bad curve doesn't identify its own cause.
A malformed or unusually difficult batch can create an unusually large gradient. Global-norm gradient clipping treats every parameter gradient as one long vector, measures its L2 norm, and rescales all gradients together when the norm exceeds a limit:

$$g \leftarrow g \cdot \min\!\left(1,\; \frac{c}{\lVert g \rVert_2}\right)$$

Here, $c$ is the clipping threshold. Because every coordinate gets the same scaling factor, clipping preserves the gradient's direction before it enters the optimizer.
```python
import numpy as np

def clip_global_norm(gradient: np.ndarray, max_norm: float) -> tuple[np.ndarray, float]:
    norm = float(np.linalg.norm(gradient))
    # The small constant guards against division by a zero norm.
    scale = min(1.0, max_norm / (norm + 1e-12))
    return gradient * scale, norm

ordinary = np.array([3.0, 4.0])
outlier = np.array([300.0, 400.0])

for name, g in [("ordinary", ordinary), ("outlier", outlier)]:
    clipped, original_norm = clip_global_norm(g, max_norm=5.0)
    print(name, "before", original_norm, "after", round(float(np.linalg.norm(clipped)), 3), "value", clipped.round(2).tolist())
```

```text
ordinary before 5.0 after 5.0 value [3.0, 4.0]
outlier before 500.0 after 5.0 value [3.0, 4.0]
```

Clipping limits the gradient passed to the optimizer. It doesn't guarantee a bound on an AdamW parameter update, because Adam state, learning rate, and weight decay also affect that update. If clipping fires on nearly every batch, investigate data, loss scaling, model stability, or the threshold rather than hiding the signal.
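The single-array example hides the "one long vector" part of the definition. The sketch below applies the same rule to several gradient tensors at once, using plain Python lists. The two parameter groups and the helper name `clip_by_global_norm` are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient lists by one shared factor."""
    # Treat every entry of every tensor as part of one long vector.
    total_norm = math.sqrt(sum(x * x for g in grads for x in g))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [[x * scale for x in g] for g in grads], total_norm

# Two hypothetical parameter groups with different gradient sizes.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
norm_after = math.sqrt(sum(x * x for g in clipped for x in g))
print("before", norm_before, "after", round(norm_after, 3), "clipped", clipped)
```

Because the factor is shared, the ratio between groups is unchanged. Clipping each tensor to its own norm would instead alter the overall direction.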
Now implement AdamW from the equations and test it against SGD on the same loss surface. The test logs final loss and both coordinate errors. This test doesn't declare an optimizer universally superior; it proves that your implementation handles this measured failure case.
```python
import math
import numpy as np

def loss(w: np.ndarray) -> float:
    return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

def gradient(w: np.ndarray) -> np.ndarray:
    return np.array([100.0 * w[0], w[1]])

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * step / (total_steps - 1)))

def run_sgd(steps: int = 80) -> np.ndarray:
    w = np.array([1.0, 1.0])
    for step in range(steps):
        w -= cosine_lr(step, steps, peak_lr=0.018) * gradient(w)
    return w

def run_adamw(steps: int = 80) -> np.ndarray:
    w = np.array([1.0, 1.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    beta1, beta2, eps, decay = 0.9, 0.999, 1e-8, 0.01
    for step in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**step)
        v_hat = v / (1.0 - beta2**step)
        lr = cosine_lr(step - 1, steps, peak_lr=0.08)
        old_w = w.copy()  # decay uses the prior parameter value
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
        w -= lr * decay * old_w
    return w

for name, result in [("SGD", run_sgd()), ("AdamW", run_adamw())]:
    print(name, "loss", round(loss(result), 6), "w", result.round(5).tolist())
```

```text
SGD loss 0.117302 w [-0.0, 0.48436]
AdamW loss 0.023509 w [0.02158, 0.02158]
```

Your acceptance test is explicit: both coordinates should shrink and the reported loss should be finite. A comparison on this constructed surface supports reasoning about this surface only. Real model selection still needs validation metrics.
In a neural-network training loop, PyTorch computes gradients by backpropagation and provides tested optimizer implementations. This short example optimizes the same two-coordinate loss, clips the gradient before optimizer.step(), and applies a decaying learning rate. PyTorch's clip_grad_norm_() returns the total norm before clipping. Capture it when you need evidence about how often the guardrail fires, and pass error_if_nonfinite=True so a NaN or Inf norm raises before it enters optimizer state.
```python
import math
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.tensor([1.0, 1.0]))
optimizer = torch.optim.AdamW([w], lr=0.08, weight_decay=0.01)
clipped_updates = 0
max_gradient_norm = 0.0

for step in range(60):
    optimizer.zero_grad()
    loss = 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)
    loss.backward()
    # clip_grad_norm_ returns the total norm measured before clipping.
    gradient_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=20.0, error_if_nonfinite=True).item()
    max_gradient_norm = max(max_gradient_norm, gradient_norm)
    clipped_updates += int(gradient_norm > 20.0)
    lr = 0.08 * 0.5 * (1.0 + math.cos(math.pi * step / 59))
    optimizer.param_groups[0]["lr"] = lr
    optimizer.step()

final_loss = 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)
print("loss", round(final_loss.item(), 6))
print("w", [round(value, 5) for value in w.detach().tolist()])
print("max pre-clip gradient norm", round(max_gradient_norm, 3))
print("clipped updates", clipped_updates)
```

```text
loss 0.047348
w [0.03062, 0.03062]
max pre-clip gradient norm 100.005
clipped updates 11
```

The sequence matters. In the loop above it is: clear old gradients, compute the loss, backpropagate, clip the gradient, set the learning rate for this update, and only then call optimizer.step().
This example assigns the current update's learning rate directly before optimizer.step(). If you replace that assignment with a built-in PyTorch scheduler, call scheduler.step() after optimizer.step(); calling it first skips the schedule's first value. PyTorch documents that ordering explicitly.
Later training chapters will add minibatches, validation data, mixed precision, checkpoints, and distributed state. The optimizer logic you traced here remains inside that larger loop.
For each trained parameter, plain SGD has no moving-average buffer, momentum stores one buffer, and AdamW stores two (m and v). If both AdamW buffers are stored as 32-bit floats, then for N trained parameters they require:

$$2 \times 4\ \text{bytes} \times N = 8N\ \text{bytes}$$

A model with 7 billion parameters therefore needs about 56 GB for those two buffers alone:
```python
parameters = 7_000_000_000
bytes_per_float32 = 4
moment_buffers = 2
bytes_used = parameters * bytes_per_float32 * moment_buffers

print("AdamW moment buffers in GB:", bytes_used / 1_000_000_000)
```

```text
AdamW moment buffers in GB: 56.0
```

That count omits parameters, gradients, activations, and any master-weight copies used by a training setup. At large scale, sharding optimizer state across devices is one reason systems such as ZeRO exist.[5]
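The same arithmetic extends to the other optimizers in this lesson. This sketch assumes 32-bit buffers and the buffer counts stated above (none for plain SGD, one for momentum, two for AdamW). The names `BUFFERS` and `state_gb` are illustrative.

```python
PARAMETERS = 7_000_000_000
BYTES_PER_FLOAT32 = 4

# Moving-average buffers stored per trained parameter.
BUFFERS = {"SGD": 0, "momentum": 1, "AdamW": 2}

def state_gb(optimizer: str) -> float:
    """Optimizer-state size in GB (10**9 bytes), excluding weights and gradients."""
    return PARAMETERS * BYTES_PER_FLOAT32 * BUFFERS[optimizer] / 1_000_000_000

for name in BUFFERS:
    print(f"{name:9s} {state_gb(name):5.1f} GB")
```

Each extra buffer costs another full copy of the parameter count, which is why the choice of optimizer shows up in a memory budget before any activation is stored.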
No training curve proves a cause by itself. Collect a small evidence set before changing an optimizer:
| Symptom | Measure next | Candidate fix to test |
|---|---|---|
| loss becomes non-finite after one batch | gradient norm, batch contents, mixed-precision scale | correct bad data or numerics; compare clipping |
| loss oscillates while one parameter group barely changes | per-group update norm, learning rate, gradient scale | scale inputs or compare adaptive update |
| loss spikes at end of warmup | logged learning-rate boundary, gradient norm | lower peak rate or smooth the transition |
| training loss improves but validation worsens | validation metric, decay sweep, data leakage check | test regularization or stop earlier |
| run won't fit device memory | optimizer-state bytes and activation bytes | shard or reduce stored state |
This is how optimizer work becomes research practice: state a failure hypothesis, log the quantity that would expose it, change one mechanism, and re-run.
- Starting at [a, b] = [1, 1], compute one SGD update with lr = 0.01. Does either coordinate cross zero?
- On Adam's first step, take the gradient [20, 0.2], lr = 0.001, and ignore epsilon. What is the bias-corrected update?
- A gradient has global norm 50 and threshold 5. What scaling factor does clipping apply?

Solution checks:

- The gradient is [100, 1]; the new vector is [0, 0.99]. Neither coordinate overshoots, but the second is still slow.
- [0.001, 0.001]: first-step normalization removes magnitude differences for nonzero coordinates.
- 5 / 50 = 0.1; the direction stays the same.
- Call v an uncentered second-moment or squared-gradient estimate.
- Save old_w and apply -lr * decay * old_w.
[1] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.
[2] Kingma, D. P., Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
[3] Loshchilov, I., Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
[4] Vaswani, A., et al. (2017). Attention Is All You Need.
[5] Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020.