LeetLLM
© 2026 LeetLLM. All rights reserved.


Gradients and Backprop

Learn why training works by nudging one response-latency weight, tracing and summing chain-rule paths, checking gradients, and confirming them with PyTorch.

13 min read

Measuring engineering work turns vague performance into numbers. Training builds on the same instinct, but with a different problem: the system begins with poor numbers and must learn better ones from examples.

Suppose a coding assistant predicts response latency from prompt length. A request used 200 input tokens and finished in 5 seconds, but the current rule predicts 6. Which stored number should change, in which direction, and by how much? Gradients answer that question.

Figure: One CodeFlow latency example moving through prediction, squared loss, slope, and a corrected weight update.
Training starts with a wrong latency prediction. A derivative measures which direction lowers its loss, and one update makes the same example less wrong.

A model with one adjustable number

A parameter is a number a model is allowed to change while it learns. Start with one parameter, w, representing predicted seconds per 100 input tokens.

For one completed request:

| Quantity | Meaning | Value |
|---|---|---|
| x | prompt length in hundreds of tokens | 2.0 |
| y | observed response seconds | 5.0 |
| w | current seconds-per-token-block parameter | 3.0 |
| prediction = w * x | model estimate | 6.0 |
| loss = (prediction - y)² | penalty for being wrong | 1.0 |

The loss is one non-negative number. Zero means the prediction matches this row. Larger numbers mean the miss is worse.

one-parameter-loss.py
```python
def squared_loss(weight: float, token_blocks: float, actual_seconds: float) -> float:
    prediction = weight * token_blocks
    return (prediction - actual_seconds) ** 2

w = 3.0
x = 2.0
y = 5.0
prediction = w * x
loss = squared_loss(w, x, y)

print("prediction_seconds", prediction)
print("loss", loss)
```
Output
```
prediction_seconds 6.0
loss 1.0
```

That code evaluates a model. It doesn't yet learn. Learning requires knowing whether to move w upward or downward.

Nudge the weight to estimate a slope

Change w by a tiny amount and recompute the loss:

| Weight | Prediction | Loss | What changed? |
|---|---|---|---|
| 3.000 | 6.000 | 1.000000 | starting point |
| 3.001 | 6.002 | 1.004004 | weight moved upward |

Loss increased by 0.004004 when the weight increased by 0.001. Divide those changes:

$$\text{slope estimate} = \frac{1.004004 - 1.000000}{0.001} = 4.004$$

A positive slope means increasing w makes loss worse near this point. To lower the loss, move w down.

finite-difference-nudge.py
```python
def loss(weight: float) -> float:
    return (weight * 2.0 - 5.0) ** 2

w = 3.0
eps = 0.001
slope_estimate = (loss(w + eps) - loss(w)) / eps

print("loss_before", loss(w))
print("loss_after_nudge", loss(w + eps))
print("slope_estimate", round(slope_estimate, 3))
```
Output
```
loss_before 1.0
loss_after_nudge 1.0040039999999995
slope_estimate 4.004
```

This is a finite difference: a practical approximation to a derivative. It's slow if you repeat it for millions of parameters, but it's an excellent check for a small formula.
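To see why the size of the nudge matters, the sketch below repeats the one-sided estimate with three nudge sizes on the same single-row loss. The list of `eps` values is an illustrative choice, not something prescribed by the lesson.

```python
def loss(weight: float) -> float:
    # Same single-row loss as above: x = 2.0, y = 5.0.
    return (weight * 2.0 - 5.0) ** 2

w = 3.0
estimates = []
for eps in (0.1, 0.01, 0.001):  # illustrative nudge sizes
    estimate = (loss(w + eps) - loss(w)) / eps
    estimates.append(estimate)
    print("eps", eps, "slope_estimate", round(estimate, 4))
```

The estimates come out at about 4.4, 4.04, and 4.004: each smaller nudge lands closer to one settled value, which is the number the derivative in the next section pins down.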

The derivative gives the exact local slope

A derivative is the slope as the nudge becomes infinitesimally small. Here the loss is:

$$L(w) = (2w - 5)^2$$

Apply the chain rule in two pieces:

$$\frac{dL}{dw} = 2(2w - 5) \times 2$$

The first factor differentiates the square. The final 2 comes from the prediction 2w: changing w by one changes the prediction by two seconds for this row.

At w = 3:

$$\frac{dL}{dw} = 2(6 - 5) \times 2 = 4$$

The finite-difference estimate was 4.004; the derivative is exactly 4 at this point.

analytical-versus-numerical-slope.py
```python
def loss(weight: float) -> float:
    return (weight * 2.0 - 5.0) ** 2

def analytical_gradient(weight: float) -> float:
    prediction = weight * 2.0
    return 2 * (prediction - 5.0) * 2.0

w = 3.0
eps = 1e-5
centered = (loss(w + eps) - loss(w - eps)) / (2 * eps)
exact = analytical_gradient(w)

print("analytical_gradient", exact)
print("finite_difference", round(centered, 6))
print("agree", abs(centered - exact) < 1e-6)
```
Output
```
analytical_gradient 4.0
finite_difference 4.0
agree True
```

A centered finite difference nudges left and right, so it's usually a better numerical check than a one-sided nudge.
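A short derivation shows why the centered version does better on this loss. It uses the shorthand u = 2w - 5 and writes the nudge as ε; both are notation introduced here for convenience.

```latex
\begin{aligned}
u &= 2w - 5, \qquad L(w) = u^2 \\
L(w \pm \varepsilon) &= (u \pm 2\varepsilon)^2 = u^2 \pm 4u\varepsilon + 4\varepsilon^2 \\
\frac{L(w + \varepsilon) - L(w)}{\varepsilon} &= 4u + 4\varepsilon \\
\frac{L(w + \varepsilon) - L(w - \varepsilon)}{2\varepsilon} &= \frac{8u\varepsilon}{2\varepsilon} = 4u
\end{aligned}
```

At w = 3, u = 1, so the one-sided estimate is 4 + 4ε (4.004 when ε = 0.001), while the centered estimate is 4. The leftover 4ε² terms cancel exactly only because this loss is quadratic; for other smooth losses the centered error is usually much smaller than the one-sided error, but not zero.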

Gradient descent takes a downhill step

The derivative points uphill: it tells you which direction increases the loss. Gradient descent moves the opposite way. The learning rate is a small step scale that you choose:

$$w_{\text{new}} = w - \text{learning rate} \times \frac{dL}{dw}$$

With w = 3, derivative 4, and learning rate 0.1:

$$w_{\text{new}} = 3 - 0.1 \times 4 = 2.6$$

Now the prediction is 2.6 * 2 = 5.2, and loss is (5.2 - 5)² = 0.04. One update reduced the loss from 1.0 to 0.04.

repeat-gradient-descent.py
```python
w = 3.0
x = 2.0
y = 5.0
learning_rate = 0.1

for step in range(5):
    prediction = w * x
    loss = (prediction - y) ** 2
    gradient = 2 * (prediction - y) * x
    print(step, round(w, 4), round(prediction, 4), round(loss, 6), round(gradient, 4))
    w = w - learning_rate * gradient
```
Output
```
0 3.0 6.0 1.0 4.0
1 2.6 5.2 0.04 0.8
2 2.52 5.04 0.0016 0.16
3 2.504 5.008 6.4e-05 0.032
4 2.5008 5.0016 3e-06 0.0064
```

A correct gradient can still take a bad step

Calculus supplies direction. It doesn't choose a safe learning rate for you.

From the same starting point, a learning rate of 1.0 produces:

$$w_{\text{new}} = 3 - 1.0 \times 4 = -1$$

The new prediction is -2 seconds and loss becomes (-2 - 5)² = 49. The derivative was right; the step jumped straight past the good region.

learning-rate-overshoot.py
```python
def one_step(learning_rate: float) -> tuple[float, float]:
    w, x, y = 3.0, 2.0, 5.0
    prediction = w * x
    gradient = 2 * (prediction - y) * x
    updated_w = w - learning_rate * gradient
    updated_loss = (updated_w * x - y) ** 2
    return updated_w, updated_loss

for lr in (0.1, 1.0):
    updated_w, updated_loss = one_step(lr)
    print("lr", lr, "new_w", updated_w, "new_loss", round(updated_loss, 4))
```
Output
```
lr 0.1 new_w 2.6 new_loss 0.04
lr 1.0 new_w -1.0 new_loss 49.0
```
Figure: Comparison of a useful gradient step, an overshooting step, and a numerical gradient check for the latency predictor.
A gradient provides a direction, not a guarantee. Check the update size and keep a finite-difference test ready when your handwritten backward rule looks suspicious.

In a real training loop, watch for non-finite loss values such as NaN or infinity. They can signal an unstable step or invalid input. Later optimization chapters will teach adaptive optimizers and schedules; first you need a backward calculation you can trust.
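The sketch below sweeps a few learning rates on the same one-row example to show where a step stops helping. The helper `loss_after_steps` and the specific rates are illustrative assumptions. For this row only, each update multiplies the distance from w = 2.5 by (1 - 8 × learning rate), so the steps start growing once the rate passes 0.25.

```python
def loss_after_steps(learning_rate: float, steps: int = 5) -> float:
    # Hypothetical helper: repeat plain gradient descent on the single row.
    w, x, y = 3.0, 2.0, 5.0
    for _ in range(steps):
        gradient = 2 * (w * x - y) * x
        w = w - learning_rate * gradient
    return (w * x - y) ** 2

# Starting loss is 1.0; compare where each rate lands after five updates.
results = {lr: loss_after_steps(lr) for lr in (0.05, 0.1, 0.2, 0.3)}
for lr, final_loss in results.items():
    print("lr", lr, "loss_after_5_steps", f"{final_loss:.6g}")
```

Rates of 0.05, 0.1, and 0.2 all end below the starting loss, while 0.3 ends far above it. The threshold of 0.25 belongs to this row's feature value, not to gradient descent in general, which is one reason input scale is worth checking alongside the learning rate.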

Several parameters mean several partial derivatives

Response latency doesn't depend on prompt length alone. Add a second parameter for queue delay:

$$\hat{y} = w_p x_p + w_q x_q$$

Here x_p is prompt length in hundreds of tokens, and x_q is queue delay in seconds. For one request:

| Value | Number |
|---|---|
| prompt blocks x_p | 2 |
| queue delay x_q | 1 |
| current weights [w_p, w_q] | [2, 2] |
| prediction | 2*2 + 2*1 = 6 |
| actual seconds | 5 |
| error | 1 |
| squared loss | 1 |

A partial derivative changes one parameter while holding the other fixed:

$$\frac{\partial L}{\partial w_p} = 2(\hat{y} - y)x_p = 2 \times 1 \times 2 = 4$$

$$\frac{\partial L}{\partial w_q} = 2(\hat{y} - y)x_q = 2 \times 1 \times 1 = 2$$

Collect those slopes in order and you have the gradient: [4, 2]. The next article gives that list its full vector-and-shape language. For now, read it as one correction signal per stored parameter.

two-parameter-gradient.py
```python
import numpy as np

features = np.array([2.0, 1.0])  # prompt token blocks, queue delay
weights = np.array([2.0, 2.0])
actual_seconds = 5.0

prediction = float(features @ weights)
error = prediction - actual_seconds
gradient = 2 * error * features
updated = weights - 0.1 * gradient
new_loss = float((features @ updated - actual_seconds) ** 2)

print("prediction", prediction)
print("gradient", gradient.tolist())
print("updated_weights", updated.tolist())
print("new_loss", new_loss)
```
Output
```
prediction 6.0
gradient [4.0, 2.0]
updated_weights [1.6, 1.8]
new_loss 0.0
```

One row can illustrate an update, but it can't uniquely determine what prompt length and queue delay should contribute across all future requests. Learning useful weights requires several observed requests.
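A quick way to see that ambiguity is to try several weight pairs against the same row. The candidate pairs below are hypothetical values picked for illustration; only the first comes from the update above.

```python
features = (2.0, 1.0)   # prompt token blocks, queue delay
actual_seconds = 5.0

# Hypothetical candidates: very different stories about prompt length vs queue delay.
candidates = [(1.6, 1.8), (2.0, 1.0), (2.5, 0.0), (0.0, 5.0)]

losses = []
for w_prompt, w_queue in candidates:
    prediction = w_prompt * features[0] + w_queue * features[1]
    losses.append((prediction - actual_seconds) ** 2)
    print((w_prompt, w_queue), "prediction", round(prediction, 6), "loss", round(losses[-1], 6))
```

Every pair predicts 5 seconds for this request, so a single row cannot tell them apart. More rows with different mixes of prompt length and queue delay are what narrow the choice.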

Backpropagation is chain-rule bookkeeping

The two-parameter gradient came from a chain of small operations:

  1. Predict: prediction = w_p * x_p + w_q * x_q
  2. Subtract target: error = prediction - y
  3. Square error: loss = error ** 2

During the forward pass, these operations produce prediction = 6, error = 1, and loss = 1. During the backward pass, begin at the scalar loss and move backward:

| Local step | Derivative | Value |
|---|---|---|
| loss = error² | d loss / d error = 2 * error | 2 |
| error = prediction - y | d error / d prediction | 1 |
| prediction = w_p*x_p + w_q*x_q | d prediction / d w_p = x_p | 2 |
| same prediction | d prediction / d w_q = x_q | 1 |

Multiply along each path. Prompt length gets 2 * 1 * 2 = 4; queue delay gets 2 * 1 * 1 = 2.

Figure: Forward latency prediction graph and backward chain-rule signals for prompt and queue parameters.
The single loss signal flows backward through subtraction and multiplication. Each parameter receives the upstream signal multiplied by its own feature value.

Backpropagation efficiently repeats this reverse chain-rule calculation through a layered model. Rumelhart, Hinton, and Williams used back-propagated error derivatives to learn internal representations in multilayer networks in their 1986 Nature paper.[1]

trace-chain-rule-factors.py
```python
x_prompt = 2.0
x_queue = 1.0
w_prompt = 2.0
w_queue = 2.0
actual = 5.0

prediction = w_prompt * x_prompt + w_queue * x_queue
error = prediction - actual
loss = error ** 2

d_loss_d_error = 2 * error
d_error_d_prediction = 1.0
d_prediction_d_prompt_weight = x_prompt
d_prediction_d_queue_weight = x_queue

d_loss_d_prompt_weight = d_loss_d_error * d_error_d_prediction * d_prediction_d_prompt_weight
d_loss_d_queue_weight = d_loss_d_error * d_error_d_prediction * d_prediction_d_queue_weight

print("forward", prediction, error, loss)
print("prompt_path", d_loss_d_error, "*", d_prediction_d_prompt_weight, "=", d_loss_d_prompt_weight)
print("queue_path", d_loss_d_error, "*", d_prediction_d_queue_weight, "=", d_loss_d_queue_weight)
```
Output
```
forward 6.0 1.0 1.0
prompt_path 2.0 * 2.0 = 4.0
queue_path 2.0 * 1.0 = 2.0
```

When paths meet, add their signals

The example above sends the loss signal to two different parameters. A computation graph can also use the same parameter more than once. When backward paths meet at one parameter, add their contributions.

Use an intentionally simple latency rule to isolate that idea. The same weight w scales both the prompt-length estimate and the queue-delay estimate:

$$\hat{y} = w x_p + w x_q$$

With w = 2, x_p = 2, x_q = 1, and y = 5, prediction is 6 and error is 1. The prompt path contributes 2 * 1 * 2 = 4. The queue path contributes 2 * 1 * 1 = 2. Both paths reach w, so its gradient is 4 + 2 = 6.

sum-shared-parameter-paths.py
```python
w = 2.0
x_prompt = 2.0
x_queue = 1.0
actual = 5.0

prediction = w * x_prompt + w * x_queue
error = prediction - actual
prompt_path = 2 * error * x_prompt
queue_path = 2 * error * x_queue
gradient = prompt_path + queue_path

print("prediction", prediction)
print("path_contributions", [prompt_path, queue_path])
print("shared_weight_gradient", gradient)
assert gradient == 6.0
```
Output
```
prediction 6.0
path_contributions [4.0, 2.0]
shared_weight_gradient 6.0
```

This scalar rule is deliberately small. Larger models reuse parameters across many rows or token positions, and branches can merge inside a graph. The next batch example combines row contributions too, then divides by the number of rows because its loss is a mean.
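If the addition rule feels arbitrary, a centered finite difference on the shared-weight rule can act as a referee. This is a minimal sketch; the function name `shared_weight_loss` is introduced here for illustration.

```python
def shared_weight_loss(w: float) -> float:
    # The same weight scales both features, as in the shared-parameter example.
    x_prompt, x_queue, actual = 2.0, 1.0, 5.0
    prediction = w * x_prompt + w * x_queue
    return (prediction - actual) ** 2

w = 2.0
eps = 1e-5
numerical = (shared_weight_loss(w + eps) - shared_weight_loss(w - eps)) / (2 * eps)

summed_paths = 4.0 + 2.0   # prompt path plus queue path
prompt_only = 4.0          # what you would get by keeping only one path

print("numerical", round(numerical, 6))
print("matches_sum", abs(numerical - summed_paths) < 1e-6)
print("matches_prompt_only", abs(numerical - prompt_only) < 1e-6)
```

The numerical slope agrees with the summed value of 6 and disagrees with either path taken alone, which is the behavior a backward pass has to reproduce whenever a parameter is reused.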

Build it: learn from a small latency table

Now use four completed requests rather than one. Each row has prompt length and recorded queue delay; the target is observed response seconds. You already used NumPy arrays earlier in the curriculum, so X @ weights computes all four predictions at once.

train-latency-predictor.py
```python
import numpy as np

X = np.array([
    [1.0, 0.5],
    [2.0, 1.0],
    [3.0, 0.5],
    [4.0, 2.0],
])
y = np.array([2.5, 5.0, 6.5, 10.0])
weights = np.array([0.0, 0.0])
learning_rate = 0.03

for step in range(101):
    predictions = X @ weights
    errors = predictions - y
    loss = float(np.mean(errors ** 2))
    gradient = (2 / len(X)) * (X.T @ errors)
    if not np.isfinite(loss) or not np.all(np.isfinite(gradient)):
        raise FloatingPointError("non-finite loss or gradient")
    if step in (0, 1, 10, 100):
        print(step, round(loss, 6), np.round(weights, 4).tolist())
    weights = weights - learning_rate * gradient

print("learned_weights", np.round(weights, 4).tolist())
print("final_predictions", np.round(X @ weights, 3).tolist())
```
Output
```
0 43.375 [0.0, 0.0]
1 9.852759 [1.08, 0.4425]
10 0.003643 [2.0574, 0.8557]
100 0.000709 [2.0259, 0.9364]
learned_weights [2.0257, 0.937]
final_predictions [2.494, 4.988, 6.546, 9.977]
```

The finite-value guard doesn't make training stable by itself. It stops the loop before a NaN or infinity update silently corrupts the weights. If it fires, inspect the learning rate and input scale first.

This is still a tiny model, not a production latency service. Its value is transparency: every prediction, loss, derivative, and update fits into code you can explain.
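The gradient line in that loop is compact, so it helps to confirm that it is nothing more than the single-row rule averaged over rows. The sketch below computes one gradient per row with 2 * error * features, averages them, and compares the result with the vectorized expression at the starting weights.

```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]])
y = np.array([2.5, 5.0, 6.5, 10.0])
weights = np.array([0.0, 0.0])

# One gradient per row, using the single-row rule from earlier sections.
row_gradients = [2 * (row @ weights - target) * row for row, target in zip(X, y)]
averaged = np.mean(row_gradients, axis=0)

# The vectorized form used in the training loop.
vectorized = (2 / len(X)) * (X.T @ (X @ weights - y))

print("averaged_rows", averaged.tolist())
print("vectorized", vectorized.tolist())
print("agree", bool(np.allclose(averaged, vectorized)))
```

Both routes give [-36.0, -14.75]. Multiplying that by the learning rate of 0.03 and subtracting explains the weights [1.08, 0.4425] printed at step 1.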

Gradient checking catches backward bugs

A manual backward formula can silently omit a factor, transpose the wrong array, or reuse the wrong value. Compare it to centered finite differences before trusting it.

For each parameter, nudge only that one value by eps, compute the two losses, and compare the numerical slope to the formula from your backward pass.

gradient-check-two-parameters.py
```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]])
y = np.array([2.5, 5.0, 6.5, 10.0])
weights = np.array([1.3, 1.1])

def mean_squared_loss(current: np.ndarray) -> float:
    errors = X @ current - y
    return float(np.mean(errors ** 2))

errors = X @ weights - y
manual_gradient = (2 / len(X)) * (X.T @ errors)

eps = 1e-5
numerical_gradient = np.zeros_like(weights)
for i in range(len(weights)):
    direction = np.zeros_like(weights)
    direction[i] = eps
    numerical_gradient[i] = (
        mean_squared_loss(weights + direction) - mean_squared_loss(weights - direction)
    ) / (2 * eps)

print("manual", np.round(manual_gradient, 6).tolist())
print("numerical", np.round(numerical_gradient, 6).tolist())
print("check_passed", np.allclose(manual_gradient, numerical_gradient, atol=1e-6))
```
Output
```
manual [-9.9, -3.925]
numerical [-9.9, -3.925]
check_passed True
```

Finite differences are far too expensive for normal training because they need repeated forward computations per parameter. They are ideal for testing a small backward formula.
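To put a number on that cost, this sketch counts how many times the loss is evaluated during one centered gradient check. The counter dictionary and helper names are illustrative, and the one-million-parameter figure is simple arithmetic, not a measurement.

```python
import numpy as np

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]])
y = np.array([2.5, 5.0, 6.5, 10.0])
calls = {"forward": 0}  # hypothetical counter for loss evaluations

def counted_loss(current: np.ndarray) -> float:
    calls["forward"] += 1
    return float(np.mean((X @ current - y) ** 2))

def centered_gradient(weights: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    gradient = np.zeros_like(weights)
    for i in range(len(weights)):
        direction = np.zeros_like(weights)
        direction[i] = eps
        # Two forward evaluations for every single parameter.
        gradient[i] = (counted_loss(weights + direction) - counted_loss(weights - direction)) / (2 * eps)
    return gradient

numerical = centered_gradient(np.array([1.3, 1.1]))
print("numerical", np.round(numerical, 6).tolist())
print("forward_calls", calls["forward"])
print("forward_calls_for_1e6_parameters", 2 * 1_000_000)
```

Two parameters need four loss evaluations. The same loop over a million parameters would need two million forward computations for one gradient, whereas a backward pass produces every partial derivative from a single forward and backward sweep.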

PyTorch computes the same backward pass

Reverse-mode automatic differentiation propagates chain-rule products backward from outputs to inputs. It's especially useful when one scalar loss depends on many parameters, which is the training shape we have here.[2]

PyTorch autograd records operations involving tensors that request gradients, then exposes the reverse pass through a call to backward().[3] This cell repeats the small latency table and checks PyTorch against the manual NumPy derivative. It also runs backward twice to expose one important default: PyTorch adds each new result into the leaf tensor's .grad buffer.

pytorch-confirms-the-gradient.py
```python
import numpy as np
import torch

X_np = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]], dtype=np.float32)
y_np = np.array([2.5, 5.0, 6.5, 10.0], dtype=np.float32)
w_np = np.array([1.3, 1.1], dtype=np.float32)

manual = (2 / len(X_np)) * (X_np.T @ (X_np @ w_np - y_np))

X = torch.tensor(X_np)
y = torch.tensor(y_np)
w = torch.tensor(w_np, requires_grad=True)
loss = torch.mean((X @ w - y) ** 2)
loss.backward()
first_backward = w.grad.detach().clone()

same_loss_again = torch.mean((X @ w - y) ** 2)
same_loss_again.backward()
after_second_backward = w.grad.detach().clone()

w.grad = None
fresh_loss = torch.mean((X @ w - y) ** 2)
fresh_loss.backward()
after_reset_backward = w.grad.detach().clone()

def rounded(values: torch.Tensor | np.ndarray) -> list[float]:
    return [round(float(value), 3) for value in values]

print("loss", round(float(loss.item()), 6))
print("manual", rounded(manual))
print("first_backward", rounded(first_backward))
print("agree", np.allclose(manual, first_backward.numpy(), atol=1e-6))
print("after_second_backward", rounded(after_second_backward))
print("after_reset_backward", rounded(after_reset_backward))
```
Output
```
loss 3.268751
manual [-9.9, -3.925]
first_backward [-9.9, -3.925]
agree True
after_second_backward [-19.8, -7.85]
after_reset_backward [-9.9, -3.925]
```

The second backward pass doubles the stored numbers because it adds the same gradient again. A training loop normally calls optimizer.zero_grad() before the next batch, or sets buffers to None, unless accumulating across several batches is intentional.[4] Frameworks save you from writing every derivative. They don't remove your responsibility to understand why a gradient has its sign, shape, and size.
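As a minimal sketch of that loop shape, the code below trains the same four-row latency table with `torch.optim.SGD`, clearing gradients at the top of every step. The choice of float64, plain SGD without momentum, and the reuse of learning rate 0.03 and 101 steps are assumptions made to mirror the earlier NumPy loop.

```python
import torch

X = torch.tensor([[1.0, 0.5], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0]], dtype=torch.float64)
y = torch.tensor([2.5, 5.0, 6.5, 10.0], dtype=torch.float64)
w = torch.zeros(2, dtype=torch.float64, requires_grad=True)

# Plain SGD: each step applies w <- w - lr * w.grad.
optimizer = torch.optim.SGD([w], lr=0.03)

for step in range(101):
    optimizer.zero_grad()                 # clear the previous step's gradient
    loss = torch.mean((X @ w - y) ** 2)
    loss.backward()                       # fill w.grad for this step only
    optimizer.step()

final_loss = float(torch.mean((X @ w - y) ** 2))
print("learned_weights", [round(float(v), 4) for v in w.detach()])
print("final_loss", round(final_loss, 6))
```

Because the gradient buffer is cleared before every backward call, each update uses only the current step's gradient, and the result should track the manual NumPy loop closely. Deleting the `zero_grad()` line would make every step apply the running sum of all earlier gradients instead.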

Practice with pencil before code

Use one latency row with x_prompt = 3, x_queue = 1, actual seconds y = 7, and weights [w_prompt, w_queue] = [2, 2].

  1. Compute prediction, error, and squared loss.
  2. Compute both partial derivatives by multiplying 2 * error by the relevant feature.
  3. Apply one gradient-descent update with learning rate 0.05. What is the new prediction and loss?
  4. If a centered finite-difference check for the prompt parameter returns 5.99999 while your manual derivative says 6, is that evidence of a bug?
  5. If loss becomes NaN after a very large update, what two things should you inspect first?

Solution sketches

  1. Prediction is 2*3 + 2*1 = 8, error is 8 - 7 = 1, and loss is 1.
  2. The prompt partial is 2*1*3 = 6; the queue partial is 2*1*1 = 2. The gradient is [6, 2].
  3. New weights are [2 - 0.05*6, 2 - 0.05*2] = [1.7, 1.9]. Prediction is 1.7*3 + 1.9 = 7.0, so loss is 0.
  4. No. A numerical derivative is approximate, and agreement within a small tolerance with the same sign is a passing check.
  5. Inspect the learning rate and the scale or validity of input values. A huge step or non-finite data can make the forward loss invalid before a later optimizer can help.
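Once you have worked the exercise on paper, a few lines of plain Python can confirm the first three answers. The variable names are chosen here for readability.

```python
x_prompt, x_queue, actual = 3.0, 1.0, 7.0
w_prompt, w_queue = 2.0, 2.0
learning_rate = 0.05

# Forward pass for the practice row.
prediction = w_prompt * x_prompt + w_queue * x_queue
error = prediction - actual
loss = error ** 2

# Backward pass: 2 * error times the matching feature.
gradient = (2 * error * x_prompt, 2 * error * x_queue)

# One gradient-descent update, then a fresh forward pass.
new_w_prompt = w_prompt - learning_rate * gradient[0]
new_w_queue = w_queue - learning_rate * gradient[1]
new_prediction = new_w_prompt * x_prompt + new_w_queue * x_queue
new_loss = (new_prediction - actual) ** 2

print("prediction", prediction, "error", error, "loss", loss)
print("gradient", gradient)
print("new_weights", (round(new_w_prompt, 4), round(new_w_queue, 4)))
print("new_prediction", round(new_prediction, 4), "new_loss", round(new_loss, 6))
```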

What you're ready for next

Training should now read as a concrete process:

| Step | Question you can answer |
|---|---|
| Forward pass | What prediction did current parameters make? |
| Loss | How wrong was that prediction? |
| Derivative | How would loss respond to a small change in one parameter? |
| Gradient | What is the correction signal for every parameter? |
| Backpropagation | How do local derivatives carry that signal backward? |
| Gradient check | How do I test a handwritten backward formula? |

We deliberately stopped before optimizer algorithms such as momentum or Adam. They control how to use gradients over many steps; a later chapter teaches them once vectors, matrices, and deeper linear algebra are available.

Mastery check

Key concepts

  • A scalar loss turns prediction error into a number training can minimize.
  • A derivative measures local loss sensitivity to one parameter.
  • A gradient collects one partial derivative per parameter.
  • Gradient descent subtracts the gradient scaled by a learning rate.
  • Backpropagation applies the chain rule backward through stored operations.
  • When several backward paths reach one parameter, their gradient contributions add.
  • Centered finite differences and PyTorch autograd can verify a manual gradient.

Evaluation rubric

  • Foundational: Calculates prediction, squared loss, and one derivative for the single-feature latency rule.
  • Intermediate: Traces local chain-rule factors for the prompt-and-queue model and implements its NumPy update.
  • Advanced: Diagnoses an overshooting update, validates a manual batch gradient numerically, and confirms it with PyTorch autograd.


Common pitfalls

Updating in the uphill direction

  • Symptom: Loss increases immediately even with a small learning rate.
  • Cause: Code adds learning_rate * gradient instead of subtracting it.
  • Fix: Print one derivative and verify its sign with a tiny finite-difference nudge.
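This sign bug is easy to reproduce on the one-weight example. The sketch applies the same gradient with both signs so the difference is visible in a single step.

```python
def loss(weight: float) -> float:
    return (weight * 2.0 - 5.0) ** 2

w = 3.0
learning_rate = 0.1
gradient = 2 * (w * 2.0 - 5.0) * 2.0        # 4.0 at w = 3

downhill_w = w - learning_rate * gradient   # correct update
uphill_w = w + learning_rate * gradient     # sign bug

print("start_loss", loss(w))
print("downhill", round(downhill_w, 4), round(loss(downhill_w), 4))
print("uphill", round(uphill_w, 4), round(loss(uphill_w), 4))
```

With the correct sign the loss falls from 1.0 to 0.04; with the bug it rises to about 3.24 even though the learning rate is small.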

Forgetting a local chain-rule factor

  • Symptom: Your derivative has the right sign but disagrees with a numerical check by a constant multiplier.
  • Cause: The backward formula omitted a feature value or the 2 * error term from squared loss.
  • Fix: Write the forward computation as three operations and multiply one local derivative per edge.
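A dropped factor leaves a recognizable fingerprint in a gradient check. In this sketch, `buggy_gradient` is a deliberately broken formula that forgets the feature value; the test points are arbitrary.

```python
def loss(weight: float) -> float:
    return (weight * 2.0 - 5.0) ** 2

def buggy_gradient(weight: float) -> float:
    # Bug: forgets the final factor x = 2.0 from prediction = weight * x.
    return 2 * (weight * 2.0 - 5.0)

eps = 1e-5
ratios = []
for w in (3.0, 3.5, 4.0):  # arbitrary test points
    numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    ratios.append(numerical / buggy_gradient(w))
    print("w", w, "buggy", buggy_gradient(w), "numerical", round(numerical, 6), "ratio", round(ratios[-1], 6))
```

The ratio stays at 2.0 at every test point, which matches the missing feature value. A constant ratio like this points to an omitted local derivative rather than a wrong sign or a wrong input.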

Trusting a step that's too large

  • Symptom: Correct gradients produce exploding or non-finite loss.
  • Cause: Learning rate or input scale turns a useful direction into an unsafe jump.
  • Fix: Reduce step size, inspect input magnitude, and fail when loss isn't finite.

Using autograd without understanding the result

  • Symptom: backward() runs, but you can't explain a zero, huge, or wrong-shaped gradient.
  • Cause: The framework is hiding derivative details before the local derivatives are understood.
  • Fix: Reproduce one tiny row by hand and compare its derivative with autograd before debugging a large model.

Carrying stale gradients into the next batch

  • Symptom: A repeated PyTorch backward pass doubles .grad even though the batch didn't change.
  • Cause: PyTorch accumulates gradients in leaf buffers until you clear them.
  • Fix: Call optimizer.zero_grad() before each fresh batch, unless summing across batches is intentional.

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. A latency model predicts 6 seconds for a request that took 5; the prompt-length feature value is 2. For squared loss, which chain-rule calculation gives the gradient for the prompt-length weight?
2. For the one-weight rule with w = 3, x = 2, y = 5, the derivative is 4. What happens if you apply gradient descent with learning rates 0.1 and 1.0?
3. A latency row has x_prompt = 3, x_queue = 1, actual time y = 7, and weights [2, 2]. With squared loss and learning rate 0.05, which result follows from one gradient-descent step?
4. The same stored latency weight is used in w * x_prompt + w * x_queue. Backpropagation sends contributions 4 and 2 to that same weight. Which gradient should update it?
5. For a batch with feature matrix X, error vector errors = X @ weights - y, and loss mean(errors ** 2), which gradient formula is correct for this NumPy update?
6. At weights [1.3, 1.1] on the four-row latency table, a manual mean-squared-error gradient is [-9.9, -3.925]. A centered finite-difference check that nudges one parameter at a time returns the same values within tolerance. What should you conclude?
7. In the PyTorch latency-table check, loss.backward() first stores w.grad = [-9.9, -3.925]. You compute the same loss again and call backward() without clearing w.grad, then set w.grad = None and call backward() once more. What should you expect?
8. A training loop checks that loss and gradient values are finite. After a very large update, the check raises because the loss is NaN. What should you inspect first?


Next Step
Continue to Vectors, Matrices & Tensors

A gradient for several parameters is already a list of numbers with structure. Next comes the vector, matrix, and tensor language that makes those structures readable in model code.

References

[1] Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323. https://doi.org/10.1038/323533a0

[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR. https://www.jmlr.org/papers/v18/17-468.html

[3] Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019. https://arxiv.org/abs/1912.01703

[4] PyTorch Contributors (2026). Optimizing Model Parameters. Official tutorial. https://docs.pytorch.org/tutorials/beginner/basics/optimization_tutorial.html