Understand how neural networks actually learn: the training loop, gradient descent, backpropagation, and how learning rates control the update steps.
In the previous article, we saw how a neural network uses its parameters (weights and biases) to turn inputs into predictions. But we skipped the most important question: how does the network know what values those parameters should have?
A fresh neural network starts completely ignorant. Its parameters are initialized with random numbers. If you ask a random network to classify an image, it will just guess. The process of adjusting those random numbers until the network makes accurate predictions is called training.
This article explains how neural networks learn. You'll learn about the loss function, gradient descent, backpropagation, and the delicate balance between learning patterns and memorizing data.
💡 Key insight: Training a neural network is an optimization problem. The goal is to find the specific combination of billions of parameters that minimizes the error the network makes.
Training happens in a continuous loop. The network looks at data, makes a guess, checks how wrong it was, and adjusts its parameters to be slightly less wrong next time. This loop has four main steps:
| Step | What the model does | Visible beginner proof |
|---|---|---|
| Forward pass | Turns input into a prediction. | Printed prediction changes when weights change. |
| Loss | Measures prediction error. | Wrong answers have larger loss than right answers. |
| Backward pass | Computes gradients for parameters. | Each trainable parameter has a gradient value. |
| Optimizer step | Updates parameters to reduce future loss. | Next loss usually moves downward on the toy example. |
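The four steps above can be sketched as a complete loop on a toy model. This is an illustrative sketch, not a framework implementation: the model `y = w * x`, the example data, and the name `lr` are all made-up choices, and the gradient is derived by hand.

```python
# A minimal sketch of the four-step training loop on a toy model y = w * x.

def forward(w, x):
    return w * x                      # 1. forward pass: input -> prediction

def loss_fn(pred, target):
    return (pred - target) ** 2       # 2. loss: squared error

def grad_w(w, x, target):
    return 2 * (w * x - target) * x   # 3. backward pass: dLoss/dw by hand

w = 0.0                 # a fresh model starts ignorant
lr = 0.1                # learning rate: size of each update step
x, target = 1.0, 3.0    # one training example: we want w to reach 3

losses = []
for step in range(20):
    pred = forward(w, x)
    losses.append(loss_fn(pred, target))
    w -= lr * grad_w(w, x, target)    # 4. optimizer step: nudge w downhill

print(round(w, 3))              # w has moved close to 3.0
print(losses[0] > losses[-1])   # True: loss went down, as the table promises
```

Running this shows the "visible beginner proof" column in action: the printed loss shrinks on each pass, and the parameter drifts toward the value that makes the prediction correct.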
Let's break down each concept.
Before a model can improve, it needs to know how badly it's doing. This is the job of the loss function (sometimes called the cost function or objective function).
The loss function takes the model's prediction and the actual correct answer (the "ground truth") and outputs a single number representing the error.
Different tasks use different loss functions. If you're predicting a continuous number (like house prices), you might use Mean Squared Error (MSE). If you're classifying data or predicting the next word, you use Cross-Entropy Loss, which penalizes the model heavily when it's confident but wrong. (We'll cover this in detail in the next article.)
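Both loss functions can be computed by hand in a few lines. The numbers below are invented purely for illustration:

```python
import math

def mse(preds, targets):
    # Mean Squared Error: average of the squared differences
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, true_index):
    # Cross-entropy for one example: -log(probability given to the true class)
    return -math.log(probs[true_index])

# Regression: predicted house prices vs. actual prices (in $100k)
print(mse([3.1, 2.0], [3.0, 2.5]))   # 0.13 — small errors, small loss

# Classification: confident-and-right vs. confident-and-wrong
print(cross_entropy([0.9, 0.05, 0.05], true_index=0))  # ~0.11 (low loss)
print(cross_entropy([0.05, 0.9, 0.05], true_index=0))  # ~3.0  (heavy penalty)
```

Notice the asymmetry in the last two lines: assigning 90% probability to the wrong class costs roughly thirty times more than assigning it to the right one, which is exactly the "confident but wrong" penalty described above.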
Imagine you're blindfolded on a rugged mountain, and your goal is to find the lowest valley. You can't see the whole mountain, but you can feel the slope of the ground directly beneath your feet. To get to the bottom, you take a step in the direction that slopes downward the steepest. Once you take a step, you feel the slope again, and take another step.
This is exactly how Gradient Descent works.[1]
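The blindfolded-hiker analogy translates almost directly into code. In this sketch the "mountain" is an assumed toy function `f(w)`, and we "feel the slope" with a numerical finite difference rather than calculus:

```python
def f(w):
    # the terrain: a simple bowl whose lowest valley sits at w = 2
    return (w - 2.0) ** 2 + 1.0

def slope(w, eps=1e-6):
    # feel the ground under your feet: approximate df/dw numerically
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 8.0            # a random starting position on the mountain
step_size = 0.1
for _ in range(50):
    w -= step_size * slope(w)   # step in the steepest downhill direction

print(round(w, 2))   # ends up close to the valley floor at w = 2
```

The minus sign is the whole trick: the slope points uphill, so stepping against it always moves you downward, one local measurement at a time.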
When you take a step down the mountain, how big should that step be? This is controlled by a hyperparameter called the learning rate.
Choosing the right learning rate is one of the most critical parts of training an AI model.
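To see why the learning rate matters so much, run the same descent loop with three different values. The toy function `f(w) = w**2` (gradient `2*w`) and the specific rates are illustrative choices:

```python
def run(lr, steps=20, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w**2 is 2*w
    return abs(w)                # distance from the minimum at 0

print(run(0.001))  # too small: after 20 steps we've barely moved
print(run(0.3))    # reasonable: converges essentially to 0
print(run(1.1))    # too large: each step overshoots, and w diverges
```

A rate that's too small wastes compute crawling downhill; a rate that's too large bounces past the valley and can blow the loss up entirely.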
🎯 Production tip: Modern training doesn't use standard gradient descent. It uses advanced optimizers like Adam (Adaptive Moment Estimation) or AdamW. These algorithms automatically adjust the learning rate for each individual parameter based on the history of previous gradients, making training much faster and more stable.[2]
We know we need the gradient (the slope) to update our parameters. But how do we find the gradient for a specific weight buried deep in layer 2 of a 100-layer network?
This was the problem that stalled neural network research for decades, until a technique called backpropagation was popularized in the 1980s.[3]
Backpropagation uses the chain rule from calculus. It works backwards from the output: first it computes how the loss changes with respect to the final layer's parameters, then it applies the chain rule layer by layer to pass that error signal back through the network.
Because it reuses calculations as it moves backward, backprop can find the gradients for billions of parameters in a single, highly efficient sweep. Without backpropagation, modern deep learning would be impossible.
| Backprop checkpoint | Beginner proof |
|---|---|
| Final layer gets gradients first. | Output error has a direct path to final weights. |
| Earlier layers get gradients through the chain rule. | Each layer receives a reusable blame signal. |
| Optimizer sees all parameter gradients. | One update can improve the whole network. |
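The three checkpoints above can be worked through by hand on a tiny two-layer chain. The weights, input, and target below are made-up numbers; every gradient line is one application of the chain rule:

```python
# Tiny two-layer chain:  h = w1 * x,  y = w2 * h,  loss = (y - target)**2

x, target = 1.0, 6.0
w1, w2 = 2.0, 1.5

# forward pass, saving intermediates so they can be reused going backwards
h = w1 * x                 # 2.0
y = w2 * h                 # 3.0
loss = (y - target) ** 2   # 9.0

# backward pass: start at the output and work toward the input
dloss_dy  = 2 * (y - target)   # -6.0: error gradient at the output
dloss_dw2 = dloss_dy * h       # final layer gets its gradient first
dloss_dh  = dloss_dy * w2      # the reusable "blame signal" sent backwards
dloss_dw1 = dloss_dh * x       # earlier layer's gradient via the chain rule

print(dloss_dw2, dloss_dw1)    # -12.0 -9.0
```

Note that `dloss_dh` is computed once and reused to get `dloss_dw1`. That reuse, repeated across every layer, is what makes one backward sweep cheap enough for billions of parameters.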
In reality, we don't update the weights after looking at just one example. That would be chaotic, as the model would wildly adjust itself for every single picture or sentence it sees.
Instead, we group data into batches (or mini-batches).
Using batches smooths out the gradient updates and allows GPUs to process data massively in parallel, making training dramatically faster.
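A mini-batch update just averages the per-example gradients before taking a step. This sketch uses an invented dataset following `y = 3x` and the same hand-derived squared-error gradient as before:

```python
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # true rule: y = 3x
w, lr, batch_size = 0.0, 0.05, 2

for epoch in range(30):                 # one epoch = one full pass over data
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # average the per-example gradients of (w*x - y)**2 over the batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                  # one update per batch, not per example

print(round(w, 2))   # close to the true slope of 3
```

Averaging over the batch means no single noisy example can yank the parameters around on its own, which is the smoothing effect described above.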
An epoch is one complete pass through the training dataset: the model has seen every single example exactly once. Smaller models often train for many epochs. Massive LLM pre-training runs are usually discussed in tokens processed, not repeated passes over a small dataset, because the dataset itself is enormous.
The goal of training isn't just to get the loss to zero. If that were the goal, the network could just memorize the exact answers to the training data.
Imagine a student studying for a math test by memorizing the answers to a specific practice worksheet without learning the underlying formulas. If the final exam is exactly the same worksheet, they'll score 100%. But if the teacher changes the numbers, they'll fail.
This is called overfitting. The model has memorized the noise in the training data instead of learning the underlying patterns.
To monitor this, we split our data into two sets: a training set, which the model learns from, and a validation set, which is held out and used only to check how well the model generalizes to examples it hasn't seen.
If the training loss is going down, but the validation loss starts going up, the model is overfitting. It's memorizing the training data at the expense of general intelligence. This is why techniques like regularization and vast datasets are so important in modern AI.
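One way to watch for this in practice is to compare the two loss curves after each epoch. The loss values below are invented to mimic a typical overfitting run: training loss keeps falling while validation loss bottoms out and turns upward.

```python
# Invented per-epoch losses illustrating a classic overfitting pattern
train_losses = [1.00, 0.60, 0.40, 0.25, 0.15, 0.08, 0.04, 0.02]
val_losses   = [1.05, 0.70, 0.52, 0.45, 0.43, 0.47, 0.55, 0.68]

# the epoch where the model generalized best
best_epoch = min(range(len(val_losses)), key=lambda e: val_losses[e])
overfitting = val_losses[-1] > val_losses[best_epoch]

print(best_epoch)    # validation loss bottoms out at epoch 4
print(overfitting)   # True: training kept improving while validation got worse
```

Keeping the checkpoint from `best_epoch` and discarding later ones (often called early stopping) is one of the simplest defenses against memorization.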
Training a neural network is an iterative optimization process:
Next, continue to Softmax, Cross-Entropy & Optimization, where training learns from probabilities instead of raw scores.