Turn raw action scores into stable probabilities and a useful learning signal, then apply the same loss to next-token predictions.
In the previous lesson, a model predicted one number: extra delivery delay in hours. Its loss could say "increase the prediction." A support model now faces a different job. Given a damaged parcel ticket, should it recommend refund, reship, or escalate?
Suppose the evidence says reship is correct, but the model currently favors refund. We need a score for how bad that choice is, and a gradient that says which competing scores to raise or lower. Softmax turns raw scores into probability shares. Cross-entropy measures how little probability reached the correct choice. Together they form the standard categorical output-and-loss pair taught for neural classifiers and language models.[1]
The final layer of a classifier emits one number per possible action. These numbers are logits: unconstrained scores, not probabilities.
| Action | Current logit | What the ticket says |
|---|---|---|
| refund | 3.0 | Model's current favorite, but wrong |
| reship | 1.0 | Correct supervised label |
| escalate | 0.0 | Possible, but not correct here |
A larger logit means the model prefers an action relative to its competitors. It doesn't mean 3.0 is 300%, and a logit can be negative without creating a negative probability.
This first snippet makes the mistake visible: the scores select a favorite, but don't satisfy probability rules.
```python
labels = ["refund", "reship", "escalate"]
logits = [3.0, 1.0, 0.0]

best_index = max(range(len(logits)), key=logits.__getitem__)
print("largest logit:", labels[best_index])
print("raw score sum:", sum(logits))
print("valid probability distribution:", all(0 <= z <= 1 for z in logits) and sum(logits) == 1)
```

```text
largest logit: refund
raw score sum: 4.0
valid probability distribution: False
```
Softmax takes every logit $z_i$, exponentiates it, and divides by the sum of all exponentiated logits:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Here $p_i$ is the probability assigned to action $i$, $z_i$ is its raw logit, and $K$ is the number of choices. Exponentiation makes each share positive. Dividing by their total makes the shares sum to one.
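The two properties just described can be checked on scores that include negative values. The snippet below is a minimal sketch using only the standard library; the `example_logits` values are made up for illustration, and the direct formula is used as-is because these scores are small.

```python
import math

def softmax(logits):
    """Direct softmax: exponentiate each logit, then divide by the total."""
    exponentials = [math.exp(z) for z in logits]
    total = sum(exponentials)
    return [e / total for e in exponentials]

# Hypothetical scores chosen for illustration; two of them are negative.
example_logits = [-1.2, 0.4, -3.0]
example_probabilities = softmax(example_logits)

# Negative logits still produce positive shares that sum to one.
print("all positive:", all(p > 0 for p in example_probabilities))
print("sums to one:", math.isclose(sum(example_probabilities), 1.0))
```

A negative logit only means "less preferred than the others"; the exponential maps it to a small positive share.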
For our three action scores, the arithmetic is small enough to do by hand:
| Action | Logit $z_i$ | $e^{z_i}$ | Probability $p_i$ |
|---|---|---|---|
| refund | 3.0 | 20.09 | 0.844 |
| reship | 1.0 | 2.72 | 0.114 |
| escalate | 0.0 | 1.00 | 0.042 |
| Total | | 23.81 | 1.000 |
The model places only 0.114 probability on the correct reship action. That is the quantity the loss must expose.
Use NumPy to reproduce the table. This version is already stable because it subtracts the largest logit before taking exponentials.
```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

labels = ["refund", "reship", "escalate"]
logits = np.array([3.0, 1.0, 0.0])
probabilities = stable_softmax(logits)

for label, probability in zip(labels, probabilities):
    print(f"{label:8s} {probability:.3f}")
print("sum ", round(float(probabilities.sum()), 3))
```

```text
refund   0.844
reship   0.114
escalate 0.042
sum  1.0
```

Softmax depends on differences between logits, not on their absolute offset. Adding or subtracting the same constant $c$ from every score leaves every probability unchanged:

$$\frac{e^{z_i - c}}{\sum_{j=1}^{K} e^{z_j - c}} = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_{j=1}^{K} e^{z_j}} = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

That cancellation gives us a safety rule: subtract the largest logit before exponentiating. The largest exponent becomes $0$, while the relative differences stay intact.
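The offset cancellation can also be checked numerically. This is a small sketch in plain Python, assuming a max-shifted `softmax` helper written here for illustration; the offsets are arbitrary.

```python
import math

def softmax(logits):
    """Max-shifted softmax using only the standard library."""
    largest = max(logits)
    exponentials = [math.exp(z - largest) for z in logits]
    total = sum(exponentials)
    return [e / total for e in exponentials]

base = [3.0, 1.0, 0.0]
reference = softmax(base)

# Arbitrary illustrative offsets: the same constant is added to every logit.
results = {}
for offset in (-50.0, 0.0, 7.5):
    shifted = [z + offset for z in base]
    results[offset] = softmax(shifted)
    print(offset, [round(p, 3) for p in results[offset]])
```

Each printed row should show the same rounded probabilities, because only the gaps between the scores matter.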
The following failure case uses the same score gaps shifted upward by 997. Direct exponentiation overflows; max-shifting produces the original distribution.
```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

large_logits = np.array([1000.0, 998.0, 997.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(large_logits) / np.exp(large_logits).sum()

stable = stable_softmax(large_logits)
print("naive finite:", np.isfinite(naive).all())
print("stable:", np.round(stable, 3))
print("stable sum:", round(float(stable.sum()), 3))
```

```text
naive finite: False
stable: [0.844 0.114 0.042]
stable sum: 1.0
```

Cross-entropy is most stable when computed straight from logits. For a correct class index $t$, the loss can be written without first materializing a tiny probability:

$$L = \log \sum_{j=1}^{K} e^{z_j} - z_t$$

The first term is called log-sum-exp. Its stable implementation uses the same maximum $m = \max_j z_j$:

$$\log \sum_{j=1}^{K} e^{z_j} = m + \log \sum_{j=1}^{K} e^{z_j - m}$$
This code computes the loss for the correct reship action directly from logits. Notice that it works for ordinary and extremely large offsets.
```python
import numpy as np

def cross_entropy_from_logits(logits: np.ndarray, target_index: int) -> float:
    maximum = np.max(logits)
    logsumexp = maximum + np.log(np.exp(logits - maximum).sum())
    return float(logsumexp - logits[target_index])

ordinary = np.array([3.0, 1.0, 0.0])
shifted_high = ordinary + 1000.0

print("ordinary reship loss:", round(cross_entropy_from_logits(ordinary, 1), 3))
print("large-offset loss: ", round(cross_entropy_from_logits(shifted_high, 1), 3))
```

```text
ordinary reship loss: 2.17
large-offset loss:  2.17
```

The supervised label for this ticket is reship. Written as a one-hot target vector in the same action order, it is:

$$\mathbf{y} = [0,\ 1,\ 0]$$

One-hot means that exactly one class receives target weight one. Cross-entropy compares this target with the predicted probability vector:

$$L = -\sum_{i=1}^{K} y_i \log p_i$$

Every term except reship is multiplied by zero, so the loss reduces to:

$$L = -\log p_{\text{reship}} \approx 2.17$$

Had the label been the model's preferred refund, the loss would have been only $-\log(0.844) \approx 0.170$. The loss is high because the supervised answer received little probability, not because the top choice merely happened to be wrong.
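The reduction from the full sum to a single term can be confirmed with a few lines of plain Python. This sketch assumes the rounded probabilities from the earlier table, so its result agrees with the exact loss only to about two decimal places.

```python
import math

probabilities = [0.844, 0.114, 0.042]  # rounded softmax outputs from the table
one_hot = [0.0, 1.0, 0.0]              # order: refund, reship, escalate

# Full definition: sum the target-weighted log-probabilities over every class.
full_sum = -sum(y * math.log(p) for y, p in zip(one_hot, probabilities))

# Shortcut: only the correct class survives the multiplication by zero.
shortcut = -math.log(probabilities[1])

print("full sum:", round(full_sum, 2))
print("shortcut:", round(shortcut, 2))
```

Both lines should agree, which is why implementations index the correct class directly instead of building the one-hot vector.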
For a shorter arithmetic exercise, consider a second ticket with logits [1.0, 0.0, -2.0] in the same action order. Refund still leads, but by a smaller margin. Compare the loss under two possible labels.
Now compute it:
```python
import numpy as np

def stable_softmax(logits):
    shifted = logits - np.max(logits)
    exp_logits = np.exp(shifted)
    return exp_logits / exp_logits.sum()

logits = np.array([1.0, 0.0, -2.0])  # refund, reship, escalate
probabilities = stable_softmax(logits)

print("reship loss", round(float(-np.log(probabilities[1])), 3))
print("refund loss", round(float(-np.log(probabilities[0])), 3))
```

```text
reship loss 1.349
refund loss 0.349
```

The earlier delay lesson used half squared error for a continuous target in hours. We could attach softmax to a classifier and square its probability error, but the softmax derivative then shrinks the signal when a wrong class is already saturated near probability one. Cross-entropy paired with softmax avoids that extra shrinkage: its logit gradient will be $p_i - y_i$, predicted probability minus one-hot target.[1]
This example makes the difference visible. The logits are extremely sure about the wrong refund action while the target is reship.
```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def squared_probability_loss(logits: np.ndarray, target: np.ndarray) -> float:
    error = softmax(logits) - target
    return float(0.5 * np.sum(error ** 2))

logits = np.array([8.0, 0.0, 0.0])
target = np.array([0.0, 1.0, 0.0])
probabilities = softmax(logits)
ce_gradient = probabilities - target

epsilon = 1e-5
mse_gradient = np.array([
    (squared_probability_loss(logits + np.eye(3)[i] * epsilon, target)
     - squared_probability_loss(logits - np.eye(3)[i] * epsilon, target))
    / (2 * epsilon)
    for i in range(3)
])

print("probabilities:", np.round(probabilities, 4))
print("cross-entropy gradient:", np.round(ce_gradient, 4))
print("squared-probability gradient:", np.round(mse_gradient, 4))
```

```text
probabilities: [9.993e-01 3.000e-04 3.000e-04]
cross-entropy gradient: [ 9.993e-01 -9.997e-01  3.000e-04]
squared-probability gradient: [ 0.001  -0.0007 -0.0003]
```

Cross-entropy isn't the only possible classification objective, but it gives a direct categorical likelihood objective and a useful correction even when a wrong option dominates.
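For one labeled example with correct class index $t$, combine stable softmax with cross-entropy:

$$L = -\log p_t = m + \log \sum_{j=1}^{K} e^{z_j - m} - z_t, \qquad m = \max_j z_j$$

Differentiate with respect to any logit $z_i$:

$$\frac{\partial L}{\partial z_i} = p_i - \mathbb{1}[i = t]$$

This derivative is commonly summarized component-wise as $p_i - y_i$: predicted probability share minus target share.

The indicator $\mathbb{1}[i = t]$ equals one only for the correct class. For our reship ticket: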
| Action | Probability | Target | Gradient | Gradient descent effect |
|---|---|---|---|---|
| refund | 0.844 | 0 | +0.844 | Lowers wrong favorite |
| reship | 0.114 | 1 | -0.886 | Raises correct action |
| escalate | 0.042 | 0 | +0.042 | Lowers competitor slightly |
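The gradient column in the table follows from differentiating the two terms of the loss separately. The short derivation below is a sketch that writes $t$ for the correct class index, a symbol assumed here for illustration:

$$
\begin{aligned}
L &= \log \sum_{j=1}^{K} e^{z_j} - z_t \\
\frac{\partial}{\partial z_i} \log \sum_{j=1}^{K} e^{z_j} &= \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} = p_i \\
\frac{\partial z_t}{\partial z_i} &= \mathbb{1}[i = t] \\
\frac{\partial L}{\partial z_i} &= p_i - \mathbb{1}[i = t]
\end{aligned}
$$

With probabilities $(0.844,\ 0.114,\ 0.042)$ and reship as the correct class, this gives $(+0.844,\ -0.886,\ +0.042)$. The components sum to zero because both the probabilities and the one-hot target sum to one.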
Logits themselves are outputs, not normally the parameters an optimizer stores. Updating them below is a diagnostic shortcut: it shows the direction that upstream weight updates are trying to produce on the next forward pass.
```python
import numpy as np

def probabilities_and_loss(logits: np.ndarray, target_index: int):
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return np.exp(log_probs), float(-log_probs[target_index])

labels = ["refund", "reship", "escalate"]
logits = np.array([3.0, 1.0, 0.0])
target_index = labels.index("reship")
target = np.eye(3)[target_index]

before, before_loss = probabilities_and_loss(logits, target_index)
gradient = before - target
after_logits = logits - 0.5 * gradient
after, after_loss = probabilities_and_loss(after_logits, target_index)

print("gradient:", np.round(gradient, 3))
print("reship probability:", round(float(before[1]), 3), "->", round(float(after[1]), 3))
print("loss:", round(before_loss, 3), "->", round(after_loss, 3))
```

```text
gradient: [ 0.844 -0.886  0.042]
reship probability: 0.114 -> 0.23
loss: 2.17 -> 1.469
```

As in the previous lesson, a finite-difference test catches sign mistakes in a new loss implementation. Nudging each logit by a tiny amount should agree with $p_i - y_i$.
```python
import numpy as np

def loss(logits: np.ndarray, target_index: int) -> float:
    shifted = logits - np.max(logits)
    return float(np.log(np.exp(shifted).sum()) - shifted[target_index])

logits = np.array([3.0, 1.0, 0.0])
target_index = 1
shifted = logits - np.max(logits)
probabilities = np.exp(shifted) / np.exp(shifted).sum()
analytic = probabilities - np.eye(3)[target_index]

epsilon = 1e-5
numeric = np.array([
    (loss(logits + np.eye(3)[i] * epsilon, target_index)
     - loss(logits - np.eye(3)[i] * epsilon, target_index))
    / (2 * epsilon)
    for i in range(3)
])

print("analytic:", np.round(analytic, 6))
print("numeric: ", np.round(numeric, 6))
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```

```text
analytic: [ 0.843795 -0.885805  0.04201 ]
numeric:  [ 0.843795 -0.885805  0.04201 ]
match: True
```

Training code shouldn't implement this loss from scratch unless you're testing your understanding or building a custom variant. In PyTorch, nn.CrossEntropyLoss receives raw logits for class-index targets and computes the equivalent of log-softmax followed by negative log-likelihood internally. Passing already-softmaxed probabilities changes the function being optimized.[2]
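The "log-softmax followed by negative log-likelihood" decomposition can be sketched in plain Python. This is an illustration of the two steps, not PyTorch's actual implementation; the function names `log_softmax` and `negative_log_likelihood` are chosen here for readability.

```python
import math

def log_softmax(logits):
    """Return log-probabilities via the max-shifted log-sum-exp."""
    largest = max(logits)
    log_total = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return [z - log_total for z in logits]

def negative_log_likelihood(log_probs, target_index):
    """Pick out the target's log-probability and flip its sign."""
    return -log_probs[target_index]

logits = [3.0, 1.0, 0.0]                      # refund, reship, escalate
log_probs = log_softmax(logits)
loss = negative_log_likelihood(log_probs, 1)  # target: reship

print("loss:", round(loss, 3))  # loss: 2.17
```

Working in log space throughout is the reason the loss wants raw logits: no intermediate probability is ever rounded toward zero before its logarithm is taken.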
First, confirm that PyTorch returns the same loss and gradient as our hand calculation:
```python
import torch
from torch import nn

logits = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
target = torch.tensor([1])  # reship
loss_fn = nn.CrossEntropyLoss()

loss = loss_fn(logits, target)
loss.backward()
probabilities = torch.softmax(logits.detach(), dim=1)

print("probabilities:", probabilities.numpy().round(3))
print("loss:", round(loss.item(), 3))
print("gradient:", logits.grad.numpy().round(3))
```

```text
probabilities: [[0.844 0.114 0.042]]
loss: 2.17
gradient: [[ 0.844 -0.886  0.042]]
```

Now reproduce a common bug. The incorrect call passes probabilities where the loss expects logits, effectively normalizing an already normalized output again.
```python
import torch
from torch import nn

target = torch.tensor([1])  # reship
loss_fn = nn.CrossEntropyLoss()

correct_input = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
correct_loss = loss_fn(correct_input, target)
correct_loss.backward()

wrong_input = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
wrong_loss = loss_fn(torch.softmax(wrong_input, dim=1), target)
wrong_loss.backward()

print("raw logits loss:", round(correct_loss.item(), 3))
print("probabilities passed as logits:", round(wrong_loss.item(), 3))
print("correct reship gradient:", round(correct_input.grad[0, 1].item(), 3))
print("distorted reship gradient:", round(wrong_input.grad[0, 1].item(), 3))
```

```text
raw logits loss: 2.17
probabilities passed as logits: 1.387
correct reship gradient: -0.886
distorted reship gradient: -0.127
```

Shapes won't warn you about this bug. Keep the model's final classification layer linear during training and feed those raw logits into cross-entropy.
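Because shapes give no warning, one option is a cheap value check before the loss call. The helper below is a hypothetical heuristic written for this lesson, not part of PyTorch: it flags a row whose values all lie in $[0, 1]$ and sum to roughly one. Genuine logits could pass the check by coincidence, so treat a hit as a prompt to inspect the model, not as proof.

```python
import math

def looks_like_probabilities(row, tolerance=1e-4):
    """Heuristic: every value lies in [0, 1] and the row sums to about one."""
    in_range = all(0.0 <= value <= 1.0 for value in row)
    return in_range and math.isclose(sum(row), 1.0, abs_tol=tolerance)

raw_logits = [3.0, 1.0, 0.0]       # what the loss expects
softmaxed = [0.844, 0.114, 0.042]  # what the buggy call passes

print(looks_like_probabilities(raw_logits))  # False
print(looks_like_probabilities(softmaxed))   # True
```

Running such a check on one batch during debugging is usually enough; it doesn't need to stay in the training loop.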
Softmax can be made sharper or flatter by dividing logits by a positive temperature $T$:

$$p_i(T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$

When $T$ is below one, logit differences grow and the leading option takes more probability. When $T$ is above one, differences shrink and more probability remains on alternatives. Hinton, Vinyals, and Dean use this operation to reveal softer class relationships during knowledge distillation.[3] In text generation, the same mathematical scaling can shape the distribution a decoding policy samples from. Softmax still doesn't choose a token by itself.
| Temperature | refund | reship | escalate | Interpretation |
|---|---|---|---|---|
| 0.5 | 0.980 | 0.018 | 0.002 | Wrong favorite becomes harder to escape |
| 1.0 | 0.844 | 0.114 | 0.042 | Original model distribution |
| 2.0 | 0.629 | 0.231 | 0.140 | Alternatives receive more mass |
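To see why softmax alone doesn't choose anything, the sketch below applies two simple decoding policies to the distributions in the table: taking the largest probability, and sampling in proportion to probability. Both policies, the seed, and the sample count are assumptions made for illustration; real decoders add further controls.

```python
import math
import random

def softmax_at_temperature(logits, temperature):
    """Temperature-scaled, max-shifted softmax in plain Python."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exponentials = [math.exp(z - largest) for z in scaled]
    total = sum(exponentials)
    return [e / total for e in exponentials]

labels = ["refund", "reship", "escalate"]
logits = [3.0, 1.0, 0.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

greedy_choice = {}
refund_share = {}
for temperature in (0.5, 2.0):
    probabilities = softmax_at_temperature(logits, temperature)
    # Policy 1: always take the most probable label.
    greedy_choice[temperature] = labels[probabilities.index(max(probabilities))]
    # Policy 2: draw labels in proportion to their probability.
    samples = rng.choices(labels, weights=probabilities, k=1000)
    refund_share[temperature] = samples.count("refund") / len(samples)
    print(f"T={temperature}: greedy={greedy_choice[temperature]}, "
          f"sampled refund share={refund_share[temperature]:.2f}")
```

The largest-probability policy returns the same label at every temperature, because scaling never reorders the logits. Only the sampling policy reacts to how flat the distribution is.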
Use the same stable function to check each distribution:
```python
import numpy as np

def softmax_at_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    shifted = scaled - np.max(scaled)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

logits = np.array([3.0, 1.0, 0.0])
for temperature in (0.5, 1.0, 2.0):
    probabilities = softmax_at_temperature(logits, temperature)
    print(f"T={temperature:.1f}", np.round(probabilities, 3))
```

```text
T=0.5 [0.98  0.018 0.002]
T=1.0 [0.844 0.114 0.042]
T=2.0 [0.629 0.231 0.14 ]
```

Our ticket action produced one loss. A language model trains on the same categorical idea many times at once: after each context position, it produces logits for the next token, and the observed next token supplies the class label. A sequence model still needs a mechanism for history, which is the question the next lesson will address.
To make the language-model version literal, use a tiny vocabulary: refund, reship, and today. Suppose a two-token reply should be reship today. At each supervised position, the model produces a new logit vector over that same vocabulary:
| Supervised position | Correct token | Probability on token | Loss |
|---|---|---|---|
| Reply token 1 | reship | 0.114 | 2.170 |
| Reply token 2 | today | 0.736 | 0.306 |
| Mean loss | | | 1.238 |
This final NumPy example applies the same stable loss across two supervised reply positions. Each row holds logits over the same tiny vocabulary; averaging yields the one scalar sent backward.
```python
import numpy as np

def per_position_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

# Vocabulary order: refund, reship, today.
# Row 1 target is reship. Row 2 target is today.
logits = np.array([[3.0, 1.0, 0.0], [0.5, 0.0, 2.0]])
targets = np.array([1, 2])
losses = per_position_cross_entropy(logits, targets)

print("position losses:", np.round(losses, 3))
print("mean loss:", round(float(losses.mean()), 3))
```

```text
position losses: [2.17  0.306]
mean loss: 1.238
```

| Symptom | Likely cause | Check or fix |
|---|---|---|
| Loss becomes NaN in custom NumPy code | Exponentiating large logits directly | Subtract row maximum, then compute log-sum-exp |
| Training loss is plausible but gradients are weak or wrong | Probabilities were passed into CrossEntropyLoss | Pass raw logits into the loss |
| Model is confidently wrong on a class | Correct label has tiny softmax mass | Inspect $p_i - y_i$ and confirm the correct logit receives a negative gradient |
| Generation samples unwanted alternatives | Sampling distribution is too flat | Inspect temperature and decoding policy separately from training |
| Mean sequence loss hides one severe miss | Mean reduction combines easy and hard positions | Print per-position losses before averaging |
- reship ticket by hand.
- refund is wrongly preferred.
- NaN on large scores. Cause: Direct exponentiation overflowed. Fix: Compute max-shifted softmax or stable log-sum-exp.
- nn.CrossEntropyLoss runs but learns poorly. Cause: Model probabilities were passed as if they were logits. Fix: Pass the final linear layer's raw scores.