LeetLLM

Adam, Momentum, Schedulers

Trace SGD, momentum, Adam, AdamW, schedules, and gradient clipping on one uneven loss surface. Learn what each optimizer buffer measures and how to validate a training choice.


A triage model can learn the wrong lesson even while its loss goes down. Suppose it uses one weight for the common signal "escalated incident" and another for the rarer contrast "security issue rather than routine failure." A training step can correct the common weight too aggressively, jump past the best setting, and barely improve the contrast weight.

A matrix can have strong and weak directions. Training has a related geometry: a loss can bend sharply in one parameter direction and gently in another. A gradient says which way is uphill; an optimizer decides how much to move downhill, what history to remember, and how its pace changes over time.[1]

[Figure: Two-coordinate support-ticket loss valley showing SGD bouncing across the steep coordinate, optimizer state retaining directional and squared-gradient history, and warmup-decay pace control.]
One loss valley creates three decisions: control bouncing across its steep wall, remember useful gradient history, and lower the step size as training settles.

One loss surface has two speeds

Measure everything on the smallest surface that exposes the problem. Let a be error in the common-signal weight and b be error in the contrast weight. The best possible setting is [a, b] = [0, 0]. Near that setting, use this loss:

$$L(a, b) = \frac{1}{2}(100a^2 + b^2)$$

The multiplier 100 says that the loss curves much more sharply along a than along b. It doesn't say that common tickets matter 100 times more to customers. It describes local training geometry.

Differentiate the loss to get its gradient:

$$\nabla L(a, b) = [100a, b]$$

At the initial parameter error [1, 1], the loss is 50.5 and the gradient is [100, 1]. Both parameters are equally far from their target, but the first coordinate asks for a step 100 times larger.

This first program makes the imbalance visible before introducing a new optimizer.

measure-the-loss-valley.py
    import numpy as np

    def loss(w: np.ndarray) -> float:
        return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

    def gradient(w: np.ndarray) -> np.ndarray:
        return np.array([100.0 * w[0], w[1]])

    w = np.array([1.0, 1.0])
    g = gradient(w)
    next_w = w - 0.018 * g

    print("start loss", loss(w))
    print("gradient", g.tolist())
    print("one SGD step", next_w.round(3).tolist())
    print("next loss", round(loss(next_w), 3))

Output

    start loss 50.5
    gradient [100.0, 1.0]
    one SGD step [-0.8, 0.982]
    next loss 32.482

The loss falls after one step, but look at the coordinates: a crosses from 1.0 to -0.8, while b only reaches 0.982. That crossing is the zigzag. The optimizer is reducing the steep error by bouncing across zero while the shallow error moves slowly.

SGD gives every coordinate one learning rate

Stochastic gradient descent (SGD) updates parameters from a gradient computed on a minibatch of examples:

$$w_t = w_{t-1} - \eta g_t$$

Here, $w_t$ is the parameter vector after step $t$, $g_t$ is the minibatch gradient, and $\eta$ (eta) is the learning rate, the step-size multiplier.

Our tiny loss is deterministic rather than minibatched, so it removes batch noise and leaves the geometry exposed. Try three learning rates for eight steps:

sgd-needs-one-rate-for-two-speeds.py
    import numpy as np

    def loss(w: np.ndarray) -> float:
        return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

    def gradient(w: np.ndarray) -> np.ndarray:
        return np.array([100.0 * w[0], w[1]])

    def run_sgd(lr: float, steps: int = 8) -> tuple[np.ndarray, float]:
        w = np.array([1.0, 1.0])
        for _ in range(steps):
            w -= lr * gradient(w)
        return w, loss(w)

    for lr in (0.005, 0.018, 0.021):
        w, final_loss = run_sgd(lr)
        print(f"lr={lr:.3f} w={w.round(3).tolist()} loss={final_loss:.3f}")

Output

    lr=0.005 w=[0.004, 0.961] loss=0.462
    lr=0.018 w=[0.168, 0.865] loss=1.781
    lr=0.021 w=[2.144, 0.844] loss=230.105

The small rate is calm but slow on b. Over only eight steps it also reports the lowest loss of these three runs, which is a reminder to measure rather than equate a larger rate with faster learning. The middle rate moves b faster, but a now flips sign on every step and shrinks by only a factor of 0.8 per step (1 - 100 * 0.018 = -0.8), so it ends further from zero than under the small rate. The rate just above 0.02 makes the steep coordinate grow in magnitude, because its multiplicative update factor is 1 - 100 * 0.021 = -1.1.
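The three runs follow from a per-coordinate closed form. As a minimal sketch, assuming the same quadratic loss, plain SGD multiplies each coordinate by a fixed factor on every step, so a coordinate shrinks only while that factor has magnitude below one. The helper names `sgd_factor` and `after_steps` are illustrative and don't come from any library.

```python
def sgd_factor(curvature: float, lr: float) -> float:
    """Per-step multiplier for one coordinate of a quadratic loss."""
    return 1.0 - curvature * lr

def after_steps(start: float, curvature: float, lr: float, steps: int) -> float:
    """Closed-form coordinate value after `steps` plain SGD updates."""
    return start * sgd_factor(curvature, lr) ** steps

# Second derivatives of the valley loss along each coordinate.
CURVATURES = {"a": 100.0, "b": 1.0}

for lr in (0.005, 0.018, 0.021):
    for name, curvature in CURVATURES.items():
        factor = sgd_factor(curvature, lr)
        status = "shrinks" if abs(factor) < 1.0 else "does not shrink"
        value = after_steps(1.0, curvature, lr, steps=8)
        print(f"lr={lr:.3f} {name}: factor={factor:+.3f} ({status}), after 8 steps={value:.3f}")
```

On this surface the steep coordinate shrinks only when the rate is below 2 / 100 = 0.02, while the shallow coordinate would tolerate rates up to 2. One shared rate has to respect the tighter limit.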

Momentum remembers directions that agree

SGD reacts to the current gradient only. Momentum stores a moving average of gradients. Use the exponential-moving-average convention because it matches Adam's first buffer:

$$m_t = \beta m_{t-1} + (1 - \beta) g_t$$
$$w_t = w_{t-1} - \eta m_t$$

The coefficient $\beta$ controls memory. With $\beta = 0.9$, the new gradient contributes 0.1 and previous direction contributes 0.9. Some libraries use an unnormalized velocity convention instead; that changes the numerical learning-rate scale, not the idea.
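The remark about conventions can be checked directly. The sketch below assumes the valley loss from earlier and compares the EMA form with an unnormalized velocity form (`u = beta * u + g`). The function names are illustrative. If the velocity form uses a learning rate scaled by `1 - beta`, both should trace the same path up to floating-point rounding.

```python
def grad(w):
    """Gradient of the valley loss 0.5 * (100 a^2 + b^2)."""
    return [100.0 * w[0], w[1]]

def ema_momentum(steps, lr, beta=0.9):
    """EMA convention: m = beta * m + (1 - beta) * g, then w -= lr * m."""
    w, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w

def velocity_momentum(steps, lr, beta=0.9):
    """Unnormalized convention: u = beta * u + g, then w -= lr * u."""
    w, u = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        u = [beta * ui + gi for ui, gi in zip(u, g)]
        w = [wi - lr * ui for wi, ui in zip(w, u)]
    return w

beta, ema_lr = 0.9, 0.018
w_ema = ema_momentum(20, ema_lr, beta)
# The velocity buffer is 1 / (1 - beta) times larger, so shrink the rate to match.
w_vel = velocity_momentum(20, ema_lr * (1.0 - beta), beta)

print("EMA convention     ", [round(x, 6) for x in w_ema])
print("velocity convention", [round(x, 6) for x in w_vel])
```

The practical consequence is that a learning rate tuned under one convention can't be copied unchanged into the other.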

Across the steep wall, gradients alternate sign. Along the shallow valley floor, they tend to agree. Watch the moving average cancel one alternating component while accumulating the consistent component:

momentum-cancels-zigzag.py
    import numpy as np

    beta = 0.9
    gradients = [
        np.array([100.0, 1.00]),
        np.array([-80.0, 0.98]),
        np.array([64.0, 0.96]),
        np.array([-51.2, 0.94]),
    ]
    m = np.zeros(2)

    for step, g in enumerate(gradients, start=1):
        m = beta * m + (1.0 - beta) * g
        print(step, "gradient", g.round(2).tolist(), "memory", m.round(3).tolist())

Output

    1 gradient [100.0, 1.0] memory [10.0, 0.1]
    2 gradient [-80.0, 0.98] memory [1.0, 0.188]
    3 gradient [64.0, 0.96] memory [7.3, 0.265]
    4 gradient [-51.2, 0.94] memory [1.45, 0.333]

The first memory coordinate swings up and down as each opposing gradient arrives, yet it stays much smaller than the raw wall gradients (at most 10 against gradients of 51 to 100). The second grows steadily and remains positive because each step agrees that b should shrink.

Momentum doesn't grant one perfect learning rate for every surface. It changes the path: useful repeated direction builds up, while alternating motion is damped. The next experiment uses the same loss and compares measured results rather than promising a winner in every task.

compare-sgd-and-momentum-paths.py
    import numpy as np

    def loss(w: np.ndarray) -> float:
        return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

    def gradient(w: np.ndarray) -> np.ndarray:
        return np.array([100.0 * w[0], w[1]])

    def sgd(steps: int, lr: float) -> np.ndarray:
        w = np.array([1.0, 1.0])
        for _ in range(steps):
            w -= lr * gradient(w)
        return w

    def momentum(steps: int, lr: float, beta: float = 0.9) -> np.ndarray:
        w = np.array([1.0, 1.0])
        m = np.zeros_like(w)
        for _ in range(steps):
            m = beta * m + (1.0 - beta) * gradient(w)
            w -= lr * m
        return w

    for name, w in [
        ("sgd", sgd(40, lr=0.018)),
        ("momentum", momentum(40, lr=0.018)),
    ]:
        print(name, "w", w.round(4).tolist(), "loss", round(loss(w), 4))

Output

    sgd w [0.0001, 0.4836] loss 0.1169
    momentum w [0.0307, 0.5324] loss 0.1889

At this unchanged numeric learning rate, SGD ends lower than our EMA-style momentum run. That's a valid result, not a failed lesson: momentum changes the update scale and usually needs its own learning-rate and memory-coefficient sweep. The benefit to look for is a better tuned path, not a guaranteed win from adding a buffer.
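One way to act on that advice is a small sweep. This sketch assumes the same valley loss and EMA-style momentum. The grid of rates and coefficients is made up for illustration, and it makes no claim about which setting wins on a real model.

```python
def loss(w):
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def gradient(w):
    return [100.0 * w[0], w[1]]

def ema_momentum(steps, lr, beta):
    """EMA-style momentum on the valley, starting from [1, 1]."""
    w, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w

# Hypothetical sweep grid: two memory coefficients, three learning rates.
results = []
for beta in (0.5, 0.9):
    for lr in (0.018, 0.03, 0.05):
        final_loss = loss(ema_momentum(40, lr, beta))
        results.append((final_loss, lr, beta))

# Sort by final loss so the best measured setting prints first.
for final_loss, lr, beta in sorted(results):
    print(f"beta={beta:.1f} lr={lr:.3f} loss={final_loss:.6f}")
```

Read the table as a measurement of this surface and this step budget only. A different curvature ratio or a noisy minibatch gradient can reorder the rows.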

Nesterov momentum changes when the gradient is measured: it asks for the gradient near the point the current velocity is about to reach, rather than only at the current point. It can anticipate a turn. Focus on understanding stored direction; later training labs can compare library momentum variants on real minibatches.

Adam stores direction and squared-gradient scale

Momentum still uses one global learning rate after smoothing. Adam (Adaptive Moment Estimation) adds a second moving average: the elementwise square of gradients. Its original formulation is:[2]

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The symbols have precise jobs:

| Symbol | Meaning | What it remembers |
| --- | --- | --- |
| $m_t$ | first-moment estimate | recent signed direction |
| $v_t$ | uncentered second-moment estimate | recent squared gradient magnitude |
| $\hat{m}_t, \hat{v}_t$ | bias-corrected estimates | startup-adjusted state |
| $\epsilon$ | small denominator guard | avoids division by a near-zero scale |

v is often casually called variance. That wording is misleading here: Adam averages g ** 2; it doesn't subtract the mean gradient to calculate statistical variance.
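A small numeric check makes the distinction concrete. The sketch below uses a made-up list of scalar gradients and a plain average rather than Adam's exponential moving average, so it illustrates only the centered-versus-uncentered difference.

```python
from statistics import fmean, pvariance

# Made-up scalar gradients with a clearly nonzero mean.
gradients = [4.0, 5.0, 6.0, 5.0]

mean_g = fmean(gradients)
second_moment = fmean(g * g for g in gradients)  # averages g ** 2, as Adam's v does
variance = pvariance(gradients)                  # subtracts the mean before squaring

print("mean gradient           ", mean_g)
print("uncentered second moment", second_moment)
print("variance                ", variance)
print("variance + mean squared ", variance + mean_g**2)
```

The uncentered second moment equals the variance plus the squared mean. When gradients consistently point one way, most of `v` comes from that mean, not from spread.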

There is another subtle correction to make. On Adam's very first step, a nonzero gradient of 100 and a nonzero gradient of 1 both produce an update whose magnitude is close to the learning rate after bias correction. One observation isn't enough to infer which coordinate repeatedly needs caution.

adam-first-step-normalizes-magnitude.py
    import numpy as np

    g = np.array([100.0, 1.0])
    beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
    m = (1.0 - beta1) * g
    v = (1.0 - beta2) * g**2
    m_hat = m / (1.0 - beta1)
    v_hat = v / (1.0 - beta2)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)

    print("m_hat", m_hat.tolist())
    print("v_hat", v_hat.tolist())
    print("update", update.round(6).tolist())

Output

    m_hat [100.0, 1.0]
    v_hat [10000.0, 1.0]
    update [0.01, 0.01]

[Figure: Adam update diagram showing that first-step bias correction maps gradients 100 and 1 to equal-magnitude signed updates, while later moment history determines coordinate-specific steps.]
Adam's first nonzero update normalizes raw magnitude. Later coordinate-specific caution comes from signed and squared-gradient moving averages.

Now provide a short gradient history. The first coordinate alternates with large values, while the second keeps agreeing on a small positive direction. Adam's step reflects both pieces of state.

adam-uses-history-not-one-large-gradient.py
    import numpy as np

    gradients = [
        np.array([100.0, 1.00]),
        np.array([-80.0, 0.98]),
        np.array([70.0, 0.96]),
        np.array([-60.0, 0.94]),
    ]
    beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
    m = np.zeros(2)
    v = np.zeros(2)

    for t, g in enumerate(gradients, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        update = lr * m_hat / (np.sqrt(v_hat) + eps)
        print(t, "m_hat", m_hat.round(3).tolist(), "step", update.round(5).tolist())

Output

    1 m_hat [100.0, 1.0] step [0.01, 0.01]
    2 m_hat [5.263, 0.989] step [0.00058, 0.00999]
    3 m_hat [29.151, 0.979] step [0.00346, 0.00998]
    4 m_hat [3.228, 0.967] step [0.00041, 0.00997]

The alternating large coordinate receives a small signed move once its directional evidence conflicts. The consistently positive coordinate keeps a positive update. Adam isn't finding curvature or proving which feature matters; it transforms recent gradient history into coordinate-wise steps.

Bias correction repairs zero-initialized state

Both moving averages begin at zero. Early in training, they are pulled toward that initialization. With Adam's usual coefficients, the first raw first-moment buffer is 0.1 * g, while the first raw second-moment buffer is 0.001 * g**2. Dividing the raw first moment by the square root of the raw second moment gives 0.1 / sqrt(0.001), about 3.16 times the intended step, so the uncorrected startup update is too large.

This program compares the first update with and without correction for one scalar parameter:

bias-correction-repairs-startup-scale.py
    import math

    gradient = 4.0
    beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
    m = (1.0 - beta1) * gradient
    v = (1.0 - beta2) * gradient**2

    uncorrected_step = lr * m / (math.sqrt(v) + eps)
    m_hat = m / (1.0 - beta1)
    v_hat = v / (1.0 - beta2)
    corrected_step = lr * m_hat / (math.sqrt(v_hat) + eps)

    print("uncorrected step", round(uncorrected_step, 5))
    print("corrected step", round(corrected_step, 5))

Output

    uncorrected step 0.03162
    corrected step 0.01

Bias correction isn't a cosmetic detail. It makes the startup behavior correspond to estimates that aren't artificially pulled toward the all-zero initial buffers.
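The size of the pull toward zero has a simple form when the gradient is constant. This sketch assumes a single scalar gradient that never changes, which is an idealization. After t steps the raw buffer holds the gradient times `1 - beta**t`, and dividing by that same quantity recovers the gradient. The helper name `raw_ema` is illustrative.

```python
def raw_ema(gradient: float, beta: float, steps: int) -> float:
    """Zero-initialized moving average after `steps` identical gradients."""
    m = 0.0
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * gradient
    return m

gradient, beta = 4.0, 0.9
for t in (1, 2, 5, 20):
    m = raw_ema(gradient, beta, t)
    closed_form = gradient * (1.0 - beta**t)  # what the zero start leaves in the buffer
    corrected = m / (1.0 - beta**t)           # Adam-style bias correction
    print(f"t={t:2d} raw={m:.4f} closed_form={closed_form:.4f} corrected={corrected:.4f}")
```

The correction matters most in the first few steps. As t grows, `1 - beta**t` approaches one and the raw and corrected values converge.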

AdamW separates data gradients from weight shrinkage

Regularization often nudges parameter magnitudes downward. With ordinary SGD, adding an L2 penalty to the loss can produce the same shrinkage behavior as weight decay after adjusting coefficients. With adaptive optimizers, those operations aren't equivalent: putting the penalty inside Adam's gradient also sends it through coordinate-wise scaling.[3]
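The plain-SGD half of that claim can be verified in a few lines. This sketch assumes the penalty is written as `0.5 * l2 * w**2`, so its gradient is `l2 * w`. The function names and numeric values are illustrative.

```python
def sgd_with_l2(w: float, data_grad: float, lr: float, l2: float) -> float:
    """Plain SGD on loss + 0.5 * l2 * w**2; the penalty adds l2 * w to the gradient."""
    return w - lr * (data_grad + l2 * w)

def sgd_with_decay(w: float, data_grad: float, lr: float, decay: float) -> float:
    """Plain SGD on the data gradient, plus a separate shrinkage of the prior weight."""
    return w - lr * data_grad - lr * decay * w

w, data_grad, lr, coefficient = 10.0, 0.5, 0.1, 0.01
print("SGD with L2 penalty  ", sgd_with_l2(w, data_grad, lr, coefficient))
print("SGD with weight decay", sgd_with_decay(w, data_grad, lr, coefficient))
```

If the penalty is written as `l2 * w**2` without the one-half factor, its gradient is `2 * l2 * w`, so the matching decay coefficient doubles. That is the kind of coefficient adjustment the equivalence depends on. It holds here because plain SGD applies the same multiplier to every gradient term.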

AdamW applies a decoupled decay directly to the prior parameter value:

$$w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda w_{t-1}$$

$\lambda$ is the weight-decay coefficient. The data-gradient update and shrinkage term are visible separately.

The difference appears even when the data gradient is zero. A parameter of 10.0 should decay a small amount. If its L2 penalty is instead fed through Adam on the first step, normalization can produce a very different move:

adamw-keeps-decay-outside-adaptive-scaling.py
    import math

    theta = 10.0
    lr = 0.1
    weight_decay = 0.01
    eps = 1e-8

    adamw_theta = theta - lr * weight_decay * theta

    coupled_gradient = weight_decay * theta
    coupled_adam_step = lr * coupled_gradient / (abs(coupled_gradient) + eps)
    coupled_theta = theta - coupled_adam_step

    print("AdamW with zero data gradient", round(adamw_theta, 4))
    print("Adam plus coupled L2 first step", round(coupled_theta, 4))

Output

    AdamW with zero data gradient 9.99
    Adam plus coupled L2 first step 9.9

This isn't a recommendation to choose a decay coefficient from a single number. It proves a narrower point: AdamW decay and Adam with an L2 term implement different update rules.

Schedules change pace across the run

An optimizer decides how to interpret gradients at one step. A learning-rate schedule decides how the global multiplier $\eta$ changes across many steps.

| Phase | What you're controlling | A measurable reason to adjust pace |
| --- | --- | --- |
| startup | avoid a large initial global step | loss or gradient norm spikes immediately |
| middle | make progress while the model learns | validation loss is still improving |
| late | reduce movement around a good region | progress becomes noisy or stalls |

The original Transformer training recipe used Adam with a schedule that increased its learning rate linearly for 4000 warmup steps and then decreased it in proportion to the inverse square root of the step number.[4] That's evidence for one successful recipe, not a universal setting.
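That schedule is short enough to write out. The sketch below follows the formula published in the cited paper, with the model width `d_model = 512` taken from its base configuration. The function name `transformer_lr` is illustrative, and steps are counted from 1.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 2000, 4000, 16000, 64000):
    print(f"step={step:6d} lr={transformer_lr(step):.6f}")
```

The two branches of `min` meet at `step == warmup_steps`, which is where the rate peaks. Before that point the rate grows linearly. After it, a fourfold increase in the step count halves the rate.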

Warmup followed by cosine decay is another useful shape to understand. It reaches a peak slowly, then follows a smooth curve toward a configured final rate. This self-contained example makes the schedule inspectable:

warmup-then-cosine-schedule.py
    import math

    def warmup_cosine(step: int, total_steps: int, warmup_steps: int, peak_lr: float, final_lr: float) -> float:
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
        return final_lr + (peak_lr - final_lr) * cosine

    values = [
        warmup_cosine(step, total_steps=10, warmup_steps=2, peak_lr=3e-4, final_lr=3e-5)
        for step in range(10)
    ]
    print([f"{lr:.6f}" for lr in values])

Output

    ['0.000150', '0.000300', '0.000300', '0.000287', '0.000249', '0.000195', '0.000135', '0.000081', '0.000043', '0.000030']

A schedule choice needs an experiment: log the learning rate beside training loss, validation loss, and gradient norm. A bad curve doesn't identify its own cause.
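A minimal version of that logging habit, sketched on the valley loss with plain SGD, looks like this. The peak and final rates are assumptions chosen to stay below the valley's stability limit. The toy surface has no validation set, so only the learning rate, training loss, and gradient norm are recorded.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, peak_lr, final_lr):
    """Linear warmup to peak_lr, then cosine decay toward final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (peak_lr - final_lr) * cosine

def loss(w):
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def gradient(w):
    return [100.0 * w[0], w[1]]

w = [1.0, 1.0]
trace = []
for step in range(10):
    lr = warmup_cosine(step, total_steps=10, warmup_steps=2, peak_lr=0.018, final_lr=0.0018)
    g = gradient(w)
    # Record the state the update is about to consume, then apply the update.
    trace.append({"step": step, "lr": lr, "loss": loss(w), "grad_norm": math.hypot(g[0], g[1])})
    w = [wi - lr * gi for wi, gi in zip(w, g)]

for row in trace:
    print(f"step={row['step']} lr={row['lr']:.5f} loss={row['loss']:.4f} grad_norm={row['grad_norm']:.3f}")
```

With the three columns side by side, a spike in loss or gradient norm can be lined up against the exact rate in force at that step.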

Gradient clipping bounds the input to the optimizer

A malformed or unusually difficult batch can create an unusually large gradient. Global-norm gradient clipping treats every parameter gradient as one long vector, measures its L2 norm, and rescales all gradients together when the norm exceeds a limit:

$$g_{\text{used}} = \begin{cases} g, & \lVert g \rVert_2 \leq c \\ g \frac{c}{\lVert g \rVert_2}, & \lVert g \rVert_2 > c \end{cases}$$

Here, $c$ is the clipping threshold. Because every coordinate gets the same scaling factor, clipping preserves the gradient's direction before it enters the optimizer.

clip-one-outlier-gradient.py
    import numpy as np

    def clip_global_norm(gradient: np.ndarray, max_norm: float) -> tuple[np.ndarray, float]:
        norm = float(np.linalg.norm(gradient))
        scale = min(1.0, max_norm / (norm + 1e-12))
        return gradient * scale, norm

    ordinary = np.array([3.0, 4.0])
    outlier = np.array([300.0, 400.0])

    for name, g in [("ordinary", ordinary), ("outlier", outlier)]:
        clipped, original_norm = clip_global_norm(g, max_norm=5.0)
        print(name, "before", original_norm, "after", round(float(np.linalg.norm(clipped)), 3), "value", clipped.round(2).tolist())

Output

    ordinary before 5.0 after 5.0 value [3.0, 4.0]
    outlier before 500.0 after 5.0 value [3.0, 4.0]

Clipping limits the gradient passed to the optimizer. It doesn't guarantee a bound on an AdamW parameter update, because Adam state, learning rate, and weight decay also affect that update. If clipping fires on nearly every batch, investigate data, loss scaling, model stability, or the threshold rather than hiding the signal.
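The gap between a bounded gradient and a bounded update shows up on Adam's first step. This sketch assumes zero-initialized Adam state and reuses the outlier gradient. The helper `adam_first_step` is illustrative and covers only step one, where bias correction normalizes each coordinate.

```python
import math

def clip_global_norm(g, max_norm):
    """Rescale the whole gradient when its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [x * scale for x in g]

def adam_first_step(g, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam update for step one, starting from zero buffers."""
    updates = []
    for x in g:
        m_hat = (1.0 - beta1) * x / (1.0 - beta1)
        v_hat = (1.0 - beta2) * x * x / (1.0 - beta2)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

outlier = [300.0, 400.0]
clipped = clip_global_norm(outlier, max_norm=5.0)

print("clipped gradient               ", [round(x, 3) for x in clipped])
print("Adam step from raw gradient    ", [round(u, 6) for u in adam_first_step(outlier)])
print("Adam step from clipped gradient", [round(u, 6) for u in adam_first_step(clipped)])
```

Both updates come out near the learning rate in every coordinate, so here the clip threshold did not set the step size. On later steps, clipping does change what accumulates in `m` and `v`, which is why its effect has to be measured over a run instead of being inferred from the threshold.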

Build it: optimize the valley with AdamW

Now implement AdamW from the equations and test it against SGD on the same loss surface. The test logs final loss and both coordinate errors. This test doesn't declare an optimizer universally superior; it proves that your implementation handles this measured failure case.

build-adamw-on-the-valley.py
    import math
    import numpy as np

    def loss(w: np.ndarray) -> float:
        return float(0.5 * (100.0 * w[0] ** 2 + w[1] ** 2))

    def gradient(w: np.ndarray) -> np.ndarray:
        return np.array([100.0 * w[0], w[1]])

    def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * step / (total_steps - 1)))

    def run_sgd(steps: int = 80) -> np.ndarray:
        w = np.array([1.0, 1.0])
        for step in range(steps):
            w -= cosine_lr(step, steps, peak_lr=0.018) * gradient(w)
        return w

    def run_adamw(steps: int = 80) -> np.ndarray:
        w = np.array([1.0, 1.0])
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        beta1, beta2, eps, decay = 0.9, 0.999, 1e-8, 0.01
        for step in range(1, steps + 1):
            g = gradient(w)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g**2
            m_hat = m / (1.0 - beta1**step)
            v_hat = v / (1.0 - beta2**step)
            lr = cosine_lr(step - 1, steps, peak_lr=0.08)
            old_w = w.copy()
            w -= lr * m_hat / (np.sqrt(v_hat) + eps)
            w -= lr * decay * old_w
        return w

    for name, result in [("SGD", run_sgd()), ("AdamW", run_adamw())]:
        print(name, "loss", round(loss(result), 6), "w", result.round(5).tolist())

Output

    SGD loss 0.117302 w [-0.0, 0.48436]
    AdamW loss 0.023509 w [0.02158, 0.02158]

Your acceptance test is explicit: both coordinates should shrink and the reported loss should be finite. A comparison on this constructed surface supports reasoning about this surface only. Real model selection still needs validation metrics.

Use PyTorch after understanding the state

In a neural-network training loop, PyTorch computes gradients by backpropagation and provides tested optimizer implementations. This short example optimizes the same two-coordinate loss, clips the gradient before optimizer.step(), and applies a decaying learning rate. PyTorch's clip_grad_norm_() returns the total norm before clipping. Capture it when you need evidence about how often the guardrail fires, and pass error_if_nonfinite=True so a NaN or Inf norm raises before it enters optimizer state.

pytorch-adamw-training-step.py
    import math
    import torch

    torch.manual_seed(0)
    w = torch.nn.Parameter(torch.tensor([1.0, 1.0]))
    optimizer = torch.optim.AdamW([w], lr=0.08, weight_decay=0.01)
    clipped_updates = 0
    max_gradient_norm = 0.0

    for step in range(60):
        optimizer.zero_grad()
        loss = 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)
        loss.backward()
        gradient_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=20.0, error_if_nonfinite=True).item()
        max_gradient_norm = max(max_gradient_norm, gradient_norm)
        clipped_updates += int(gradient_norm > 20.0)
        lr = 0.08 * 0.5 * (1.0 + math.cos(math.pi * step / 59))
        optimizer.param_groups[0]["lr"] = lr
        optimizer.step()

    final_loss = 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)
    print("loss", round(final_loss.item(), 6))
    print("w", [round(value, 5) for value in w.detach().tolist()])
    print("max pre-clip gradient norm", round(max_gradient_norm, 3))
    print("clipped updates", clipped_updates)

Output

    loss 0.047348
    w [0.03062, 0.03062]
    max pre-clip gradient norm 100.005
    clipped updates 11

The sequence matters:

  1. Clear old gradients.
  2. Compute a loss and run backpropagation.
  3. Inspect or clip gradients before the optimizer consumes them; reject non-finite norms.
  4. Set the scheduled learning rate for this update.
  5. Apply the optimizer step.

This example assigns the current update's learning rate directly before optimizer.step(). If you replace that assignment with a built-in PyTorch scheduler, call scheduler.step() after optimizer.step(); calling it first skips the schedule's first value. PyTorch documents that ordering explicitly.

Later training chapters will add minibatches, validation data, mixed precision, checkpoints, and distributed state. The optimizer logic you traced here remains inside that larger loop.

Optimizer state is also a memory budget

For each trained parameter, plain SGD has no moving-average buffer, momentum stores one buffer, and AdamW stores two (m and v). If both AdamW buffers are stored as 32-bit floats, they require:

$$2 \times 4 \text{ bytes} = 8 \text{ bytes per parameter}$$

A model with 7 billion parameters therefore needs about 56 GB for those two buffers alone:

count-adamw-moment-memory.py
    parameters = 7_000_000_000
    bytes_per_float32 = 4
    moment_buffers = 2
    bytes_used = parameters * bytes_per_float32 * moment_buffers

    print("AdamW moment buffers in GB:", bytes_used / 1_000_000_000)

Output

    AdamW moment buffers in GB: 56.0

That count omits parameters, gradients, activations, and any master-weight copies used by a training setup. At large scale, sharding optimizer state across devices is one reason systems such as ZeRO exist.[5]
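A slightly wider budget can be sketched under loud assumptions: every value is float32, activations and master-weight copies are ignored, and only the AdamW moments are split evenly across devices. The function `training_state_gb` and the choice of eight shards are illustrative; this is not how any particular framework accounts for memory.

```python
GB = 1_000_000_000  # decimal gigabytes, matching the count above

def training_state_gb(parameters: int, bytes_per_value: int = 4, shards: int = 1) -> dict:
    """Rough per-category memory in GB; assumes one dtype for every buffer."""
    weights = parameters * bytes_per_value
    gradients = parameters * bytes_per_value
    adam_moments = 2 * parameters * bytes_per_value  # m and v
    return {
        "weights": weights / GB,
        "gradients": gradients / GB,
        "adam_moments": adam_moments / GB,
        # Assumption: only the moment buffers are sharded, evenly, across devices.
        "adam_moments_per_device": adam_moments / shards / GB,
    }

budget = training_state_gb(7_000_000_000, shards=8)
for name, gigabytes in budget.items():
    print(f"{name:>24}: {gigabytes:.1f} GB")
```

Even this rough tally shows why the moment buffers deserve attention: under these assumptions they are as large as weights and gradients combined.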

Diagnose from traces, not slogans

No training curve proves a cause by itself. Collect a small evidence set before changing an optimizer:

| Symptom | Measure next | Candidate fix to test |
| --- | --- | --- |
| loss becomes non-finite after one batch | gradient norm, batch contents, mixed-precision scale | correct bad data or numerics; compare clipping |
| loss oscillates while one parameter group barely changes | per-group update norm, learning rate, gradient scale | scale inputs or compare adaptive update |
| loss spikes at end of warmup | logged learning-rate boundary, gradient norm | lower peak rate or smooth the transition |
| training loss improves but validation worsens | validation metric, decay sweep, data leakage check | test regularization or stop earlier |
| run won't fit device memory | optimizer-state bytes and activation bytes | shard or reduce stored state |

This is how optimizer work becomes research practice: state a failure hypothesis, log the quantity that would expose it, change one mechanism, and re-run.

Practice with answers

  1. Starting from [a, b] = [1, 1], compute one SGD update with lr = 0.01. Does either coordinate cross zero?
  2. At Adam step one, take gradient [20, 0.2], lr = 0.001, and ignore epsilon. What is the bias-corrected update?
  3. A global gradient has norm 50 and threshold 5. What scaling factor does clipping apply?
  4. Your run clips 95 percent of minibatches. Is "clipping works" an adequate conclusion?

Solution checks:

  1. The gradient is [100, 1]; the new vector is [0, 0.99]. Neither coordinate overshoots, but the second is still slow.
  2. The update is [0.001, 0.001]: first-step normalization removes magnitude differences for nonzero coordinates.
  3. The factor is 5 / 50 = 0.1; the direction stays the same.
  4. No. Persistent clipping is evidence that your threshold, learning rate, data, or numerical path needs inspection.
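The first three solution checks can be reproduced in a few lines. This is a plain-Python sketch of the arithmetic, using the simplification that Adam's first bias-corrected update with epsilon ignored reduces to `lr * g / |g|`.

```python
import math

# 1. One SGD step from [1, 1] with lr = 0.01 on the valley loss.
w = [1.0, 1.0]
g = [100.0 * w[0], w[1]]
sgd_next = [wi - 0.01 * gi for wi, gi in zip(w, g)]

# 2. Adam step one with epsilon ignored: m_hat = g and v_hat = g ** 2,
#    so each coordinate moves by lr * g / |g|.
adam_gradient = [20.0, 0.2]
adam_update = [0.001 * gi / math.sqrt(gi * gi) for gi in adam_gradient]

# 3. Global-norm clipping scale for norm 50 and threshold 5.
clip_scale = min(1.0, 5.0 / 50.0)

print("1. SGD result ", sgd_next)
print("2. Adam update", adam_update)
print("3. clip scale ", clip_scale)
```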

Mastery check

Key concepts

  • Uneven curvature makes one SGD learning rate awkward across parameter directions.
  • Momentum smooths signed gradient direction across steps.
  • Adam tracks signed first moments and uncentered squared-gradient second moments.
  • Bias correction fixes zero-initialized moving averages during startup.
  • AdamW keeps decay outside adaptive gradient scaling.
  • Schedules control global pace; clipping controls unusually large input gradients.
  • AdamW's moment buffers have a concrete memory cost.

Evaluation rubric

  • Foundational: Computes SGD, momentum, and Adam's first bias-corrected update by hand on the two-coordinate valley and explains the observed movement.
  • Intermediate: Implements AdamW with decay from the prior parameter value, adds a schedule and clipping experiment, and reports loss plus coordinate errors.
  • Advanced: Reads learning-rate, gradient-norm, validation-loss, and optimizer-memory traces, then proposes a falsifiable optimizer change rather than guessing.

Common pitfalls

  • Symptom: An Adam explanation says its second buffer measures gradient variance. Cause: Squared gradients were confused with variance around a mean. Fix: Call v an uncentered second-moment or squared-gradient estimate.
  • Symptom: An AdamW reimplementation decays the already-updated weight. Cause: The decay term wasn't read from the prior parameter value. Fix: preserve old_w and apply -lr * decay * old_w.
  • Symptom: A schedule change is credited for improvement without a controlled run. Cause: multiple optimizer settings changed at once. Fix: log the schedule and compare one changed mechanism against the same validation set.
  • Symptom: Clipping hides exploding gradients until a different threshold fails. Cause: clipping was treated as diagnosis. Fix: measure clipping frequency and trace the source of extreme gradients.
Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. For L(a, b) = 0.5 * (100a^2 + b^2), plain SGD gives a_t = (1 - 100 * eta) a_{t-1} and b_t = (1 - eta) b_{t-1}. If eta = 0.021, what happens to a, and what does that imply about using one larger rate to speed up b?
2. Using EMA momentum m_t = 0.9 m_{t-1} + 0.1 g_t from m_0 = [0, 0], the gradients are [100, 1.00], [-80, 0.98], [64, 0.96], and [-51.2, 0.94]. Which pattern should appear in m after four updates?
3. At Adam step 1, the gradient is [20, 0.2], beta1 = 0.9, beta2 = 0.999, lr = 0.001, and epsilon is ignored. Using Adam's bias correction, what update vector is subtracted from the parameters?
4. After several Adam steps, coordinate x has large gradients that alternate sign, while coordinate y has small gradients that stay positive. Which state interpretation explains a small signed update for x and a persistent positive update for y?
5. A scalar parameter is theta = 10.0, the data gradient is zero, lr = 0.1, and weight_decay = 0.01. What happens on the first step with AdamW decay, and why can coupled L2 inside Adam differ?
6. Training loss spikes exactly when warmup ends. You want to test whether the schedule transition causes the spike without confounding optimizer changes. Which experiment directly tests that hypothesis?
7. A global gradient is [300, 400], so its L2 norm is 500. With a clipping threshold of 5, which gradient enters AdamW, and what does clipping guarantee?
8. A PyTorch loop directly assigns the scheduled learning rate and clips AdamW gradients. Which order makes clipping affect the current update and records the pre-clip norm?
9. A model has 7 billion parameters. AdamW stores two float32 moment buffers, with 4 bytes per value. Using decimal gigabytes, what memory do these buffers require?

Next Step
Continue to Probability for Machine Learning

Optimization taught you how parameter updates respond to uncertain minibatch gradients. Probability gives you language for the uncertainty in events, model scores, and evidence that those trained parameters must represent.

References

[1] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. https://www.deeplearningbook.org/
[2] Kingma, D. P., Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. https://arxiv.org/abs/1412.6980
[3] Loshchilov, I., Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. https://arxiv.org/abs/1711.05101
[4] Vaswani, A., et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
[5] Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. https://arxiv.org/abs/1910.02054