Training & Backpropagation

Follow a shipment-delay model through prediction, loss, gradients, parameter updates, scalar autograd, mini-batches, validation checks, and PyTorch.

In the previous lesson, you traced a CNN score while its kernel values stayed fixed. A useful detector can't be supplied by hand for every crack, barcode smear, or torn label in a shipment photo. The model has to discover parameter values from examples.

We will make that discovery visible using a small regression problem: predict extra delivery delay in hours from a route-disruption score. A real CNN has far more parameters, but each kernel weight changes for the same reason as the one weight below: changing it changes a prediction, which changes a loss.

One update has four distinct jobs

A model learns by repeating four steps:

| Step | What happens | Evidence you can inspect |
| --- | --- | --- |
| Forward pass | Current parameters produce a prediction. | Print predicted delay. |
| Loss | A scalar measures the error against a known target. | Print one nonnegative number. |
| Backward pass | Derivatives measure how each parameter affects loss. | Print dL/dw and dL/db. |
| Update | An optimizer changes parameters using those derivatives. | Recompute and compare loss. |

Suppose a shipment has route-disruption score x = 2 and later arrives y = 6 hours late. Start with this model:

$$\hat{y} = wx + b$$

Set w = 1 and b = 0. The predicted extra delay is 2 hours, far below the observed 6. For this numerical lesson, use half squared error:

$$L = \frac{1}{2}(\hat{y} - y)^2$$

The factor 1/2 doesn't change which parameters are good; it cancels the 2 when we differentiate the square. Here the initial loss is:

$$L = \frac{1}{2}(2 - 6)^2 = 8$$
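
A short derivation, using only the loss defined above, shows the cancellation that motivates the factor 1/2. Comparing the halved and unhalved versions:

$$
\frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(\hat{y} - y)^2\right] = \hat{y} - y,
\qquad
\frac{\partial}{\partial \hat{y}}\left[(\hat{y} - y)^2\right] = 2(\hat{y} - y)
$$

Both versions are minimized by the same parameters. The unhalved loss doubles every gradient, which has the same effect on an update as doubling the learning rate.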

The derivative of loss with respect to the prediction is -4: increasing the predicted delay would reduce this mistake. Because w is multiplied by x = 2, its gradient is twice as large:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = -4, \qquad \frac{\partial L}{\partial w} = (\hat{y} - y)x = -8, \qquad \frac{\partial L}{\partial b} = \hat{y} - y = -4$$

With learning rate η = 0.1, subtract those gradients:

$$w_{\text{new}} = 1 - 0.1(-8) = 1.8, \qquad b_{\text{new}} = 0 - 0.1(-4) = 0.4$$

The revised prediction is 1.8 * 2 + 0.4 = 4.0 hours, and loss falls from 8.0 to 2.0.

one-update-by-hand.py

    x, target = 2.0, 6.0
    w, bias = 1.0, 0.0
    learning_rate = 0.1

    # Forward pass and loss
    prediction = w * x + bias
    loss = 0.5 * (prediction - target) ** 2

    # Backward pass: chain rule from loss to each parameter
    grad_prediction = prediction - target
    grad_w = grad_prediction * x
    grad_bias = grad_prediction

    # Update, then recompute to confirm the loss fell
    new_w = w - learning_rate * grad_w
    new_bias = bias - learning_rate * grad_bias
    new_prediction = new_w * x + new_bias
    new_loss = 0.5 * (new_prediction - target) ** 2

    print(f"before: prediction={prediction:.1f} loss={loss:.1f}")
    print(f"gradients: dL/dw={grad_w:.1f} dL/db={grad_bias:.1f}")
    print(f"after: prediction={new_prediction:.1f} loss={new_loss:.1f}")

Output

    before: prediction=2.0 loss=8.0
    gradients: dL/dw=-8.0 dL/db=-4.0
    after: prediction=4.0 loss=2.0
Figure: Paired gradient-descent plots showing half-squared loss falling as predicted delay moves from two to four hours, plus the matching weight and bias update toward the zero-loss line.
The loss curve slopes downward at the two-hour prediction. Subtracting the negative gradient moves (w, b) from (1, 0) to (1.8, 0.4), raising the prediction to four hours and cutting loss from 8 to 2.

One update helps, but training is a repeated correction:

repeat-gradient-descent.py

    x, target = 2.0, 6.0
    w, bias = 1.0, 0.0
    learning_rate = 0.1

    for step in range(5):
        prediction = w * x + bias
        loss = 0.5 * (prediction - target) ** 2
        error = prediction - target  # dL/d(prediction)
        w -= learning_rate * error * x
        bias -= learning_rate * error
        # prediction and loss are the values from before this step's update
        print(f"step {step}: prediction={prediction:.3f} loss={loss:.3f} w={w:.3f} b={bias:.3f}")

Output

    step 0: prediction=2.000 loss=8.000 w=1.800 b=0.400
    step 1: prediction=4.000 loss=2.000 w=2.200 b=0.600
    step 2: prediction=5.000 loss=0.500 w=2.400 b=0.700
    step 3: prediction=5.500 loss=0.125 w=2.500 b=0.750
    step 4: prediction=5.750 loss=0.031 w=2.550 b=0.775

Step size can rescue or wreck learning

The gradient says which direction lowers loss locally. The learning rate says how far to move. A very small rate makes progress slowly. A useful rate settles toward a low-loss setting. A large rate can jump past it and grow the error on every oscillation.

The same one-example problem makes this visible because we can hold everything except learning rate constant:

learning-rate-behavior.py

    def train_for_five_steps(learning_rate: float) -> list[float]:
        x, target = 2.0, 6.0
        w, bias = 1.0, 0.0
        losses = []
        for _ in range(5):
            prediction = w * x + bias
            error = prediction - target
            losses.append(0.5 * error ** 2)
            w -= learning_rate * error * x
            bias -= learning_rate * error
        return losses

    for rate in (0.01, 0.1, 0.5):
        losses = train_for_five_steps(rate)
        formatted = ", ".join(f"{loss:.3f}" for loss in losses)
        print(f"lr={rate:.2f}: {formatted}")

Output

    lr=0.01: 8.000, 7.220, 6.516, 5.881, 5.307
    lr=0.10: 8.000, 2.000, 0.500, 0.125, 0.031
    lr=0.50: 8.000, 18.000, 40.500, 91.125, 205.031
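
For this one-example linear model, the three loss sequences can be predicted in closed form. The derivation below is a sketch that holds only for this setup; the symbol e for the error ŷ − y is introduced here for convenience. Substituting the update rule into the next prediction gives:

$$
e' = (w - \eta e x)x + (b - \eta e) - y = \bigl(1 - \eta(x^2 + 1)\bigr)e
$$

With x = 2, the error is multiplied by 1 − 5η on every step, so the loss is multiplied by the square of that factor:

$$
L' = (1 - 5\eta)^2 L
$$

That factor is 0.9025 for η = 0.01, 0.25 for η = 0.1, and 2.25 for η = 0.5, which matches the printed sequences. In this problem the loss shrinks only when |1 − 5η| < 1, that is, when 0 < η < 0.4.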

Don't assume that one loss decrease proves the rate is safe. Real batches disagree with one another, and deep networks can have steep regions in some directions and shallow ones in others. Training code should log loss and gradient norms so a bad run becomes visible quickly.
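
One way to make a bad run visible is to record the gradient norm next to the loss. The sketch below reuses the one-example model; `train_with_diagnostics` and `is_diverging` are hypothetical helper names, and the "norm grew on every step" rule is an illustrative alarm, not a standard threshold.

```python
import math


def train_with_diagnostics(learning_rate: float, steps: int = 5) -> list[tuple[float, float]]:
    """Return (loss, gradient_norm) per step for the one-example delay model."""
    x, target = 2.0, 6.0
    w, bias = 1.0, 0.0
    history = []
    for _ in range(steps):
        error = w * x + bias - target
        grad_w, grad_bias = error * x, error
        grad_norm = math.hypot(grad_w, grad_bias)  # L2 norm of (dL/dw, dL/db)
        history.append((0.5 * error ** 2, grad_norm))
        w -= learning_rate * grad_w
        bias -= learning_rate * grad_bias
    return history


def is_diverging(history: list[tuple[float, float]]) -> bool:
    """Hypothetical alarm: the gradient norm grew on every logged step."""
    norms = [norm for _, norm in history]
    return all(later > earlier for earlier, later in zip(norms, norms[1:]))


for rate in (0.1, 0.5):
    history = train_with_diagnostics(rate)
    print(f"lr={rate:.2f} diverging={is_diverging(history)}")
```

A real training job would usually tolerate some noise in the norm and alert on a sustained trend, since mini-batch gradients fluctuate from step to step.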

Backpropagation assigns credit through a graph

Gradient descent needs dL/dw for every trainable parameter. Calculating it by nudging each parameter separately is useful as a check for a tiny formula, but it would repeat forward evaluation for every parameter direction. Backpropagation uses reverse-mode automatic differentiation: it reuses the forward computation graph and moves derivatives backward from one scalar loss through each local operation.[1]
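
The cost difference can be counted directly. The sketch below assumes a hypothetical four-weight linear model (the feature values are made up for illustration) and counts how many loss evaluations a centered finite-difference gradient needs, compared with one forward pass plus one backward sweep.

```python
def make_counting_loss(features, target):
    """Return a half-squared-error loss and a counter of its evaluations."""
    calls = {"count": 0}

    def loss(weights):
        calls["count"] += 1
        prediction = sum(w * x for w, x in zip(weights, features))
        return 0.5 * (prediction - target) ** 2

    return loss, calls


def finite_difference_gradient(loss, weights, epsilon=1e-5):
    """Nudge each weight up and down: two loss evaluations per parameter."""
    gradient = []
    for index in range(len(weights)):
        up, down = list(weights), list(weights)
        up[index] += epsilon
        down[index] -= epsilon
        gradient.append((loss(up) - loss(down)) / (2 * epsilon))
    return gradient


def reverse_mode_gradient(weights, features, target):
    """One forward pass, then one backward sweep that reuses the residual."""
    prediction = sum(w * x for w, x in zip(weights, features))
    residual = prediction - target
    return [residual * x for x in features]


features = [0.5, 1.0, 1.5, 2.0]  # hypothetical inputs
weights = [1.0, 0.0, 0.0, 0.0]
loss, calls = make_counting_loss(features, target=6.0)

numeric = finite_difference_gradient(loss, weights)
analytic = reverse_mode_gradient(weights, features, target=6.0)
print(f"finite differences: {calls['count']} loss evaluations for {len(weights)} parameters")
print("gradients agree:", all(abs(n - a) < 1e-6 for n, a in zip(numeric, analytic)))
```

The finite-difference count grows linearly with the number of parameters, while the reverse sweep visits each operation once regardless of how many parameters feed the loss.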

Rumelhart, Hinton, and Williams showed that this error-driven weight adjustment could learn useful hidden representations in multilayer networks in their 1986 Nature paper.[2] For our small shipment-delay model, the graph is already enough to see the mechanism:

Figure: Forward computation DAG for a delivery-delay regressor, tracing disruption score two and weight one through multiplication, bias addition, target comparison, residual minus four, and half-squared loss eight.
The graph records every intermediate value: x and w form m = 2, bias keeps the prediction at 2, target comparison gives residual −4, and half-squared error produces loss 8. Backpropagation will reuse these exact nodes and values.

The chain rule follows paths from loss back to parameters:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = (-4)(2) = -8$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial b} = (-4)(1) = -4$$

Figure: Reverse-mode gradient DAG for one delivery-delay example, tracing a seeded loss gradient through residual, prediction, addition, and multiplication to produce bias gradient minus four and weight gradient minus eight.
Starting from loss gradient 1, the half-square node sends −4 to the residual, addition copies −4 to the bias path, and multiplication scales the weight path by x = 2 to produce dL/dw = −8. The target stays fixed and isn't updated.
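
The same sweep can be written out node by node. This is a minimal sketch in plain Python; the intermediate names (`product`, `residual`, and the `grad_*` variables) are chosen here for illustration and mirror the graph described above.

```python
x, w, bias, target = 2.0, 1.0, 0.0, 6.0

# Forward pass: keep every intermediate value the backward sweep will need.
product = w * x
prediction = product + bias
residual = prediction - target
loss = 0.5 * residual ** 2

# Backward sweep: seed the loss with gradient 1, then apply local rules in reverse.
grad_loss = 1.0
grad_residual = grad_loss * residual   # d(0.5 * r**2)/dr = r
grad_prediction = grad_residual * 1.0  # residual = prediction - target
grad_product = grad_prediction * 1.0   # addition copies the gradient to each input
grad_bias = grad_prediction * 1.0
grad_w = grad_product * x              # product = w * x, so d(product)/dw = x

print(f"loss={loss:.1f} dL/dw={grad_w:.1f} dL/db={grad_bias:.1f}")
```

Each backward line uses only a value stored during the forward pass and the gradient arriving from the node after it, which is why the forward graph can be reused without recomputation.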

A useful engineering habit is to compare an analytical gradient with a centered finite difference. The check is slow for training, but excellent for finding a missing factor or wrong sign in a new backward formula.

finite-difference-gradient-check.py

    x, target = 2.0, 6.0
    w, bias = 1.0, 0.0
    epsilon = 1e-5

    def loss_at(weight: float, intercept: float) -> float:
        prediction = weight * x + intercept
        return 0.5 * (prediction - target) ** 2

    # Analytical gradients from the chain rule
    analytic_w = (w * x + bias - target) * x
    analytic_bias = w * x + bias - target

    # Centered finite differences: nudge one parameter in both directions
    numeric_w = (loss_at(w + epsilon, bias) - loss_at(w - epsilon, bias)) / (2 * epsilon)
    numeric_bias = (loss_at(w, bias + epsilon) - loss_at(w, bias - epsilon)) / (2 * epsilon)

    print(f"dL/dw analytic={analytic_w:.6f} numeric={numeric_w:.6f}")
    print(f"dL/db analytic={analytic_bias:.6f} numeric={numeric_bias:.6f}")
    print("checks pass:", abs(analytic_w - numeric_w) < 1e-8 and abs(analytic_bias - numeric_bias) < 1e-8)

Output

    dL/dw analytic=-8.000000 numeric=-8.000000
    dL/db analytic=-4.000000 numeric=-4.000000
    checks pass: True
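
The reason to prefer a centered difference over a one-sided one can be checked on the same loss. The sketch below compares both estimates of dL/dw against the analytical value; the step size `epsilon = 1e-4` is an arbitrary choice for the illustration.

```python
x, target = 2.0, 6.0
w, bias = 1.0, 0.0
epsilon = 1e-4


def loss_at(weight: float) -> float:
    return 0.5 * (weight * x + bias - target) ** 2


analytic = (w * x + bias - target) * x

# One-sided (forward) difference: truncation error shrinks linearly with epsilon.
forward = (loss_at(w + epsilon) - loss_at(w)) / epsilon

# Centered difference: the first-order error terms cancel.
centered = (loss_at(w + epsilon) - loss_at(w - epsilon)) / (2 * epsilon)

forward_error = abs(forward - analytic)
centered_error = abs(centered - analytic)
print(f"forward error={forward_error:.2e} centered error={centered_error:.2e}")
```

Because this loss is quadratic in w, the centered estimate is exact up to floating-point rounding, while the forward estimate carries an error on the order of epsilon. For non-quadratic losses the centered error is not zero, but it generally shrinks faster as epsilon decreases.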

Build a scalar autograd engine

Andrej Karpathy's Micrograd makes the backward sweep inspectable: each object stores one scalar, its accumulated gradient, the values that produced it, and a local backward function.[3] Tensor libraries apply the same principle to matrix operations; a scalar version keeps the accounting readable.

For prediction = w * x + b and loss = 0.5 * (prediction - target) ** 2, every operation leaves a small backward rule:

| Operation | Local backward rule |
| --- | --- |
| out = left + right | Add out.grad to both parent gradients. |
| out = left * right | Add right * out.grad to the left parent, and vice versa. |
| out = value ** 2 | Add 2 * value * out.grad to the parent. |

The following compact engine builds the computation graph during the forward pass, orders nodes so children run before parents during the backward pass, and accumulates gradients with +=.

scalar-autograd-engine.py

    class Value:
        """A scalar that remembers how it was computed so gradients can flow back."""

        def __init__(self, data, parents=(), operation=""):
            self.data = float(data)
            self.grad = 0.0
            self.parents = set(parents)
            self.operation = operation
            self._backward = lambda: None  # leaf nodes have nothing to propagate

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            out = Value(self.data + other.data, (self, other), "+")

            def backward():
                self.grad += out.grad
                other.grad += out.grad

            out._backward = backward
            return out

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            out = Value(self.data * other.data, (self, other), "*")

            def backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad

            out._backward = backward
            return out

        __rmul__ = __mul__

        def __neg__(self):
            return self * -1.0

        def __sub__(self, other):
            return self + (-other)

        def __pow__(self, exponent):
            out = Value(self.data ** exponent, (self,), f"**{exponent}")

            def backward():
                self.grad += exponent * self.data ** (exponent - 1) * out.grad

            out._backward = backward
            return out

        def backward(self):
            # Topological order: each node is appended after all of its inputs.
            order = []
            seen = set()

            def visit(node):
                if node not in seen:
                    seen.add(node)
                    for parent in node.parents:
                        visit(parent)
                    order.append(node)

            visit(self)
            self.grad = 1.0  # seed: d(loss)/d(loss) = 1
            for node in reversed(order):
                node._backward()

    x, target = Value(2.0), Value(6.0)
    w, bias = Value(1.0), Value(0.0)
    prediction = w * x + bias
    loss = 0.5 * (prediction - target) ** 2
    loss.backward()

    print(f"prediction={prediction.data:.1f} loss={loss.data:.1f}")
    print(f"dL/dw={w.grad:.1f} dL/db={bias.grad:.1f}")

Output

    prediction=2.0 loss=8.0
    dL/dw=-8.0 dL/db=-4.0

One rule is easy to miss: if a value affects loss through two paths, its gradient is the sum of those paths. The simple expression a * a exposes an incorrect engine immediately.

gradient-accumulation-branch.py

    # Reuses the Value class defined in scalar-autograd-engine.py.
    a = Value(3.0)
    squared = a * a
    squared.backward()
    print(f"a*a={squared.data:.1f}")
    print(f"d(a*a)/da={a.grad:.1f}")

Output

    a*a=9.0
    d(a*a)/da=6.0

Figure: Scalar autograd DAG for s equals a times a with a equal to three, showing one output gradient splitting into two plus-three contributions that accumulate to a gradient of six.
The value a enters the multiply node twice, so backpropagation returns two derivative contributions. Each path contributes 3, and the shared gradient buffer must add them to produce 6 instead of overwriting one path.
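
The difference between adding and overwriting is easy to reproduce without any engine at all. The sketch below is a simplified stand-in for the multiply node's backward rule, using plain floats; the variable names are illustrative.

```python
a_value = 3.0
out_grad = 1.0  # seed gradient on s = a * a

# The multiply node sends one contribution per input slot; both slots hold `a`.
contributions = [a_value * out_grad, a_value * out_grad]

accumulated = 0.0
for contribution in contributions:
    accumulated += contribution  # correct: sum over every path into `a`

overwritten = 0.0
for contribution in contributions:
    overwritten = contribution  # bug: the second path replaces the first

print(f"accumulated={accumulated:.1f} overwritten={overwritten:.1f}")
```

The overwriting version reports 3.0, half of the true derivative 2a = 6, which is the symptom to look for when a reused value receives too small a gradient.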

In a tensor framework, parameter gradient buffers also accumulate until they are cleared. That makes intentional gradient accumulation possible, but it also creates a common bug: forgetting to clear gradients between independent optimizer steps.[4]
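
Intentional accumulation can be sketched in plain Python. The example below assumes four hypothetical shipment records and a linear delay model; it adds the gradients of two micro-batches into one buffer and compares the result with the mean-loss gradient over all four records. The helper name `summed_gradients` is made up for this illustration.

```python
# Hypothetical (disruption, delay) records for illustration.
shipments = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
w, bias = 0.0, 0.0


def summed_gradients(records, weight, intercept):
    """Sum (not mean) of per-example half-squared-error gradients."""
    grad_w = sum((weight * x + intercept - target) * x for x, target in records)
    grad_bias = sum(weight * x + intercept - target for x, target in records)
    return grad_w, grad_bias


# Reference: one mean-loss gradient over all four shipments.
full_w, full_bias = summed_gradients(shipments, w, bias)
full_w, full_bias = full_w / len(shipments), full_bias / len(shipments)

# Accumulation: two micro-batches of two, added into one buffer before a single update.
buffer_w, buffer_bias = 0.0, 0.0
for start in range(0, len(shipments), 2):
    micro_batch = shipments[start:start + 2]
    grad_w, grad_bias = summed_gradients(micro_batch, w, bias)
    buffer_w += grad_w / len(shipments)  # scale by the full batch size
    buffer_bias += grad_bias / len(shipments)

print(f"full batch:  dL/dw={full_w:.2f} dL/db={full_bias:.2f}")
print(f"accumulated: dL/dw={buffer_w:.2f} dL/db={buffer_bias:.2f}")
```

The two results match only because each micro-batch contribution is scaled by the full batch size and the buffer starts at zero. Skipping the reset would carry gradients from an earlier, unrelated step into this update.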

Move from one shipment to a batch

One damaged shipment doesn't represent a delivery network. Training normally estimates a useful update from a mini-batch, a subset of training examples processed before one parameter update. For mean loss, the batch gradient is the average of the example gradients.

This tiny dataset contains only three shipments, so the code below uses all three at once: technically, it is full-batch gradient descent. The arithmetic is identical for a mini-batch; a larger run would draw a subset of the training data before each update. Here the true relationship is extra_delay = 2 * disruption + 1. Starting from zero parameters:

batch-gradient.py

    shipments = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    w, bias = 0.0, 0.0
    learning_rate = 0.08

    for step in range(20):
        errors = [w * x + bias - target for x, target in shipments]
        # Mean loss, so each gradient is the average of the per-example gradients
        loss = sum(0.5 * error ** 2 for error in errors) / len(shipments)
        grad_w = sum(error * x for error, (x, _) in zip(errors, shipments)) / len(shipments)
        grad_bias = sum(errors) / len(shipments)
        w -= learning_rate * grad_w
        bias -= learning_rate * grad_bias
        if step in (0, 1, 19):
            print(f"step {step:2d}: loss={loss:.4f} w={w:.3f} b={bias:.3f}")

Output

    step  0: loss=13.8333 w=0.907 b=0.400
    step  1: loss=4.2812 w=1.411 b=0.623
    step 19: loss=0.0005 w=2.037 b=0.917

An epoch means one complete pass through the training examples. LLM pretraining runs are commonly described in optimizer steps and tokens processed because the dataset is very large; the batch contract remains the same: forward, average loss, backward, update.
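
The relationship between epochs, mini-batches and optimizer steps can be sketched with a slightly larger dataset. Everything below is an illustrative assumption: eight noise-free records generated from extra_delay = 2 * disruption + 1, a batch size of 2, a learning rate of 0.05, and a fixed shuffle seed.

```python
import random

# Hypothetical noise-free shipments following extra_delay = 2 * disruption + 1.
shipments = [(0.5 * i, 2.0 * (0.5 * i) + 1.0) for i in range(8)]
w, bias = 0.0, 0.0
learning_rate = 0.05
batch_size = 2
epochs = 300
rng = random.Random(0)  # fixed seed so the shuffle order is reproducible

optimizer_steps = 0
for epoch in range(epochs):
    rng.shuffle(shipments)  # new example order each epoch
    for start in range(0, len(shipments), batch_size):
        batch = shipments[start:start + batch_size]
        errors = [w * x + bias - target for x, target in batch]
        grad_w = sum(error * x for error, (x, _) in zip(errors, batch)) / len(batch)
        grad_bias = sum(errors) / len(batch)
        w -= learning_rate * grad_w
        bias -= learning_rate * grad_bias
        optimizer_steps += 1

print(f"steps per epoch={optimizer_steps // epochs} total steps={optimizer_steps}")
print(f"w={w:.2f} b={bias:.2f}")
```

With eight records and a batch size of 2, each epoch contains four optimizer steps. Because the data here has no noise, the parameters should approach w = 2 and b = 1; with noisy real data, mini-batch updates would keep fluctuating around a good solution instead of settling exactly.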

Validation loss tests whether learning transfers

Training loss only measures examples used for updates. A validation set contains held-out examples used for evaluation, not for backpropagation. If training loss drops while validation loss rises, the new parameters fit the training set better but transfer worse. That pattern is a warning, not a diagnosis: investigate data mismatch, leakage, noise, model capacity and regularization before deciding why it happened.

This deliberately mismatched example makes the warning visible. Training records follow a peak-season delay pattern; validation records follow a normal-week pattern. Starting from a model that already matches normal weeks, fitting peak season improves train loss and damages validation loss:

training-vs-validation-warning.py

    train = [(1.0, 4.0), (2.0, 8.0), (3.0, 12.0)]      # peak season
    validation = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # normal week
    w = 2.0  # already matches the normal-week pattern
    learning_rate = 0.02

    def mean_loss(records, weight):
        return sum(0.5 * (weight * x - target) ** 2 for x, target in records) / len(records)

    for step in range(4):
        # Gradients come from training records only; validation is never used for updates
        grad_w = sum((w * x - target) * x for x, target in train) / len(train)
        w -= learning_rate * grad_w
        print(f"step {step}: train={mean_loss(train, w):.3f} validation={mean_loss(validation, w):.3f} w={w:.3f}")

Output

    step 0: train=7.672 validation=0.081 w=2.187
    step 1: train=6.307 validation=0.296 w=2.356
    step 2: train=5.185 validation=0.605 w=2.509
    step 3: train=4.262 validation=0.981 w=2.648

For a real run, compare train and validation curves at consistent checkpoints and log enough context to reproduce the run: data version, split policy, batch size, learning rate, optimizer, seed and checkpoint identifier.
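
A checkpoint comparison like this can be automated. The sketch below is one possible rule, not a standard one: the `Checkpoint` record, the `transfer_warning` helper and the `patience` parameter are hypothetical names, and the loss values are copied from the printed output of training-vs-validation-warning.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int
    train_loss: float
    validation_loss: float


def transfer_warning(history: list[Checkpoint], patience: int = 3) -> bool:
    """True when train loss fell and validation loss rose for `patience` consecutive checkpoints."""
    streak = 0
    for previous, current in zip(history, history[1:]):
        train_fell = current.train_loss < previous.train_loss
        validation_rose = current.validation_loss > previous.validation_loss
        streak = streak + 1 if train_fell and validation_rose else 0
        if streak >= patience:
            return True
    return False


history = [
    Checkpoint(0, 7.672, 0.081),
    Checkpoint(1, 6.307, 0.296),
    Checkpoint(2, 5.185, 0.605),
    Checkpoint(3, 4.262, 0.981),
]

print("warning:", transfer_warning(history))
best = min(history, key=lambda checkpoint: checkpoint.validation_loss)
print("lowest validation loss at step", best.step)
```

The flag only says that the curves diverged. Deciding whether the cause is distribution shift, leakage or overfitting still requires inspecting the data and the split policy.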

See the same responsibilities in PyTorch

Libraries remove manual derivative bookkeeping, not the conceptual separation. In the code below, multiplying PyTorch's mean squared error by 0.5 keeps the same half-squared-error convention used above. loss.backward() fills parameter gradients and optimizer.step() applies an update. The loss drops after one step on the same three-example batch.[4]

pytorch-one-step.py

    import torch
    from torch import nn

    features = torch.tensor([[1.0], [2.0], [3.0]])
    targets = torch.tensor([[3.0], [5.0], [7.0]])
    model = nn.Linear(1, 1)

    # Start from zero parameters so the numbers match batch-gradient.py
    with torch.no_grad():
        model.weight.fill_(0.0)
        model.bias.fill_(0.0)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.08)
    loss_fn = nn.MSELoss(reduction="mean")

    optimizer.zero_grad()                             # clear old gradients
    before = 0.5 * loss_fn(model(features), targets)  # forward pass and loss
    before.backward()                                 # backward pass
    print(f"before={before.item():.3f}")
    print(f"weight gradient={model.weight.grad.item():.3f} bias gradient={model.bias.grad.item():.3f}")
    optimizer.step()                                  # parameter update
    after = 0.5 * loss_fn(model(features), targets)
    print(f"after={after.item():.3f}")

Output

    before=13.833
    weight gradient=-11.333 bias gradient=-5.000
    after=4.281

Because gradients accumulate, two backward calls without clearing produce twice the gradient for the same loss:

why-zero-grad-matters.py

    import torch

    w = torch.tensor(1.0, requires_grad=True)
    loss = 0.5 * (w * 2.0 - 6.0) ** 2
    loss.backward()
    first_gradient = w.grad.item()

    # Second backward without clearing: the new gradient is added to the old one
    loss = 0.5 * (w * 2.0 - 6.0) ** 2
    loss.backward()
    accumulated_gradient = w.grad.item()

    # Clear the buffer, then recompute
    w.grad.zero_()
    loss = 0.5 * (w * 2.0 - 6.0) ** 2
    loss.backward()
    reset_gradient = w.grad.item()

    print(f"first backward: {first_gradient:.1f}")
    print(f"without zeroing: {accumulated_gradient:.1f}")
    print(f"after zeroing again: {reset_gradient:.1f}")

Output

    first backward: -8.0
    without zeroing: -16.0
    after zeroing again: -8.0

Why the next loss function matters

We used half squared error because predicting delay hours is a regression problem and its derivative is easy to inspect. A CNN that selects damaged, intact, or needs-review, and an LLM that selects the next token, face a classification problem: the output layer produces competing logits rather than one continuous hour estimate.

The backward-pass mechanics won't disappear. The next lesson supplies the missing objective: softmax converts logits into a distribution, cross-entropy scores the correct choice, and their gradient becomes the signal sent backward through the model.

Mastery check

Key concepts

  • Training separates forward computation, loss evaluation, backward differentiation and parameter update.
  • A gradient measures loss sensitivity; gradient descent moves in the opposite direction.
  • Backpropagation applies local chain-rule derivatives in reverse order through a computation graph.
  • A finite-difference check verifies a formula but isn't an efficient training method.
  • Shared graph paths require gradient accumulation, and repeated backward calls keep accumulating until gradient buffers are cleared.
  • With mean loss, mini-batches provide one averaged update from a subset of training examples.
  • Validation loss measures transfer to held-out data and can reveal a failing run.

Evaluation rubric

  • Foundational: Compute prediction, half squared loss, dL/dw, dL/db and one SGD update for the shipment-delay example.
  • Intermediate: Explain why reverse-mode differentiation is preferable to nudging every parameter, and implement a small scalar graph that accumulates gradients correctly.
  • Advanced: Use mini-batch and validation output to diagnose a failed run, then map manual gradient steps to PyTorch's backward(), zero_grad() and optimizer update.

Common pitfalls

  • Symptom: Parameters move in the wrong direction. Cause: Raw loss was subtracted instead of the gradient, or a derivative sign is wrong. Fix: Compare analytical gradients with centered finite differences on a tiny case.
  • Symptom: A reused value such as a * a receives half its expected gradient. Cause: Graph paths overwrite rather than accumulate contributions. Fix: Update parent gradients with +=.
  • Symptom: PyTorch gradients unexpectedly double across steps. Cause: An independent batch started before gradient buffers were cleared. Fix: Run optimizer.zero_grad() before backward() unless accumulation is deliberate.
  • Symptom: Training loss falls while validation loss rises. Cause: The model is fitting a training-specific pattern, or data and split policy don't match evaluation. Fix: Inspect held-out examples, leakage, distribution shift and regularization before continuing.
  • Symptom: Loss oscillates or grows rapidly. Cause: Learning rate may be too large for the local curvature. Fix: Lower step size and inspect gradient norms and numeric values.

Mastery Check

1. In the model yhat = w * x + b, use x = 2, target y = 6, w = 1, b = 0, half squared error L = 0.5 * (yhat - y)^2, and learning rate 0.1. Which result is correct after one gradient-descent update?
2. A one-example training loop is run for five steps with only the learning rate changed. The losses are lr=0.01: 8.000, 7.220, 6.516, 5.881, 5.307; lr=0.10: 8.000, 2.000, 0.500, 0.125, 0.031; lr=0.50: 8.000, 18.000, 40.500, 91.125, 205.031. Which interpretation is correct?
3. You implement a new backward rule and want to detect a wrong sign or missing factor before using it in a large network. Which procedure uses finite differences appropriately?
4. A scalar autograd engine evaluates z = a * a + a with a = 3 and starts the reverse sweep with z.grad = 1. What should happen to a.grad?
5. A training set has 10,000 records. One optimizer step uses 64 selected records and the mean half-squared loss over those records. How should the update be classified and computed?
6. During training, loss on update examples falls at every checkpoint while loss on held-out validation examples rises. Which conclusion is justified by those curves alone?
7. A PyTorch job processes independent mini-batches. On the second batch, parameter gradients unexpectedly include contributions from the first batch. Which sequence fixes the bug and applies one update?

Next Step
Continue to Softmax, Cross-Entropy & Optimization

You can now trace how a scalar loss changes weights in a regression model. Next you will build the classification loss used when a model must choose among support actions or next-token candidates.

References

[1] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.

[2] Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323.

[3] Karpathy, A. (2020). micrograd.

[4] PyTorch Contributors (2026). Optimizing Model Parameters. Official tutorial.