LeetLLM

Softmax, Cross-Entropy & Optimization

Turn raw action scores into stable probabilities and a useful learning signal, then apply the same loss to next-token predictions.


In the previous lesson, a model predicted one number: extra delivery delay in hours. Its loss could say "increase the prediction." A support model now faces a different job. Given a damaged parcel ticket, should it recommend refund, reship, or escalate?

Suppose the evidence says reship is correct, but the model currently favors refund. We need a score for how bad that choice is, and a gradient that says which competing scores to raise or lower. Softmax turns raw scores into probability shares. Cross-entropy measures how little probability reached the correct choice. Together they form the standard categorical output-and-loss pair taught for neural classifiers and language models.[1]

A decision begins as three raw scores

The final layer of a classifier emits one number per possible action. These numbers are logits: unconstrained scores, not probabilities.

| Action | Current logit | What the ticket says |
| --- | --- | --- |
| refund | 3.0 | Model's current favorite, but wrong |
| reship | 1.0 | Correct supervised label |
| escalate | 0.0 | Possible, but not correct here |

A larger logit means the model prefers an action relative to its competitors. It doesn't mean 3.0 is 300%, and a logit can be negative without creating a negative probability.

This first snippet makes the mistake visible: the scores select a favorite, but don't satisfy probability rules.

inspect-raw-logits.py
```python
labels = ["refund", "reship", "escalate"]
logits = [3.0, 1.0, 0.0]

best_index = max(range(len(logits)), key=logits.__getitem__)
print("largest logit:", labels[best_index])
print("raw score sum:", sum(logits))
print("valid probability distribution:", all(0 <= z <= 1 for z in logits) and sum(logits) == 1)
```
Output
```text
largest logit: refund
raw score sum: 4.0
valid probability distribution: False
```
Figure: Two bar charts show logits 3, 1, and 0 becoming exponential weights 20.09, 2.72, and 1.00, then a stacked probability bar splits one unit into 84.4%, 11.4%, and 4.2%.
Exponentiation expands the score gaps into positive weights. Dividing every weight by the same `23.81` total preserves their proportions and packs them into one unit of probability mass.

Softmax assigns probability shares

Softmax takes every logit $z_i$, exponentiates it, and divides by the sum of all exponentiated logits:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Here $p_i$ is the probability assigned to action $i$, $z_i$ is its raw logit, and $K$ is the number of choices. Exponentiation makes each share positive. Dividing by their total makes the shares sum to one.

For our three action scores, the arithmetic is small enough to do by hand:

| Action | Logit $z_i$ | $e^{z_i}$ | Probability $p_i$ |
| --- | --- | --- | --- |
| refund | 3.0 | 20.09 | 0.844 |
| reship | 1.0 | 2.72 | 0.114 |
| escalate | 0.0 | 1.00 | 0.042 |
| Total | | 23.81 | 1.000 |

The model places only 0.114 probability on the correct reship action. That is the quantity the loss must expose.

Use NumPy to reproduce the table. This version is already stable because it subtracts the largest logit before taking exponentials.

stable-softmax.py
```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

labels = ["refund", "reship", "escalate"]
logits = np.array([3.0, 1.0, 0.0])
probabilities = stable_softmax(logits)

for label, probability in zip(labels, probabilities):
    print(f"{label:8s} {probability:.3f}")
print("sum ", round(float(probabilities.sum()), 3))
```
Output
```text
refund   0.844
reship   0.114
escalate 0.042
sum  1.0
```

Stable arithmetic matters before scale does

Softmax depends on differences between logits, not on their absolute offset. Adding or subtracting the same constant from every score leaves every probability unchanged:

$$\frac{e^{z_i-c}}{\sum_j e^{z_j-c}} = \frac{e^{z_i}e^{-c}}{e^{-c}\sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

That cancellation gives us a safety rule: subtract the largest logit before exponentiating. The largest exponent becomes $e^0 = 1$, while the relative differences stay intact.

The following failure case uses the same score gaps shifted upward by 997. Direct exponentiation overflows; max-shifting produces the original distribution.

overflow-then-stable.py
```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

large_logits = np.array([1000.0, 998.0, 997.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(large_logits) / np.exp(large_logits).sum()

stable = stable_softmax(large_logits)
print("naive finite:", np.isfinite(naive).all())
print("stable:", np.round(stable, 3))
print("stable sum:", round(float(stable.sum()), 3))
```
Output
```text
naive finite: False
stable: [0.844 0.114 0.042]
stable sum: 1.0
```
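The cancellation argument can also be checked numerically. The sketch below is a minimal check, assuming NumPy; it reuses the `stable_softmax` helper from the snippet above, and the list of offsets is an arbitrary illustrative choice.

```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the largest logit so the biggest exponent is exp(0) = 1.
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

base = np.array([3.0, 1.0, 0.0])
reference = stable_softmax(base)

# Illustrative offsets: every logit moves by the same constant c.
offsets = [-500.0, -3.0, 0.0, 42.0, 997.0]
results = [stable_softmax(base + c) for c in offsets]

for c, probabilities in zip(offsets, results):
    print(f"c={c:7.1f}", np.round(probabilities, 3))

# Every shifted copy should reproduce the unshifted distribution.
all_match = all(np.allclose(reference, r) for r in results)
print("all offsets match:", all_match)
```

Each offset changes the absolute logits but not their gaps, so every row should print the same three probabilities.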

Cross-entropy is most stable when computed straight from logits. For a correct class index $y$, the loss can be written without first materializing a tiny probability:

$$L = \log\left(\sum_j e^{z_j}\right) - z_y$$

The first term is called log-sum-exp. Its stable implementation uses the same maximum $m = \max_j z_j$:

$$\log\left(\sum_j e^{z_j}\right) = m + \log\left(\sum_j e^{z_j-m}\right)$$

This code computes the loss for the correct reship action directly from logits. Notice that it works for ordinary and extremely large offsets.

cross-entropy-from-logits.py
```python
import numpy as np

def cross_entropy_from_logits(logits: np.ndarray, target_index: int) -> float:
    maximum = np.max(logits)
    logsumexp = maximum + np.log(np.exp(logits - maximum).sum())
    return float(logsumexp - logits[target_index])

ordinary = np.array([3.0, 1.0, 0.0])
shifted_high = ordinary + 1000.0

print("ordinary reship loss:", round(cross_entropy_from_logits(ordinary, 1), 3))
print("large-offset loss: ", round(cross_entropy_from_logits(shifted_high, 1), 3))
```
Output
```text
ordinary reship loss: 2.17
large-offset loss:  2.17
```
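Max-shifting alone does not protect the loss if the target probability is materialized first. The sketch below compares the two routes on a deliberately extreme, made-up pair of logits; the function name `naive_cross_entropy` and the logit values are illustrative assumptions.

```python
import numpy as np

def naive_cross_entropy(logits: np.ndarray, target_index: int) -> float:
    # Route 1: build probabilities first, then take the log.
    shifted = logits - np.max(logits)
    probabilities = np.exp(shifted) / np.exp(shifted).sum()
    with np.errstate(divide="ignore"):
        return float(-np.log(probabilities[target_index]))

def cross_entropy_from_logits(logits: np.ndarray, target_index: int) -> float:
    # Route 2: log-sum-exp minus the target logit, never forming p.
    maximum = np.max(logits)
    logsumexp = maximum + np.log(np.exp(logits - maximum).sum())
    return float(logsumexp - logits[target_index])

# Hypothetical extreme case: the target logit sits 800 below the favorite.
extreme = np.array([0.0, -800.0])
naive_loss = naive_cross_entropy(extreme, 1)
stable_loss = cross_entropy_from_logits(extreme, 1)

print("naive loss: ", naive_loss)
print("stable loss:", stable_loss)
```

In float64, `exp(-800)` underflows to zero, so the first route takes the log of zero and returns infinity. The second route stays in log space and returns the finite loss of 800.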

Cross-entropy grades the correct action

The supervised label for this ticket is reship. Written as a one-hot target vector in the same action order, it is:

$$y = [0,\ 1,\ 0]$$

One-hot means that exactly one class receives target weight one. Cross-entropy compares this target with the predicted probability vector:

$$L = -\sum_i y_i \log(p_i)$$

Every term except reship is multiplied by zero, so the loss reduces to:

$$L = -\log(p_{\text{reship}}) = -\log(0.114) \approx 2.170$$

Had the label been the model's preferred refund, the loss would have been only $-\log(0.844) \approx 0.170$. The loss is high because the supervised answer received little probability, not because the top choice merely happened to be wrong.
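The reduction from the full sum to a single term can be checked directly. This is a minimal sketch, assuming NumPy; it recomputes the probabilities from the ticket's logits so that no rounding from the table leaks into the loss.

```python
import numpy as np

logits = np.array([3.0, 1.0, 0.0])           # refund, reship, escalate
shifted = logits - np.max(logits)
probabilities = np.exp(shifted) / np.exp(shifted).sum()
one_hot = np.array([0.0, 1.0, 0.0])          # label: reship

# Full definition: sum over every class, weighted by the one-hot target.
full_sum = float(-np.sum(one_hot * np.log(probabilities)))

# Shortcut: only the labeled class survives the multiplication by zero.
target_index = int(np.argmax(one_hot))
shortcut = float(-np.log(probabilities[target_index]))

print("full sum:", round(full_sum, 3))
print("shortcut:", round(shortcut, 3))
print("if refund were the label:", round(float(-np.log(probabilities[0])), 3))
```

Both routes give the same number, which is why implementations with class-index labels index the log-probability instead of building a one-hot vector.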

For a shorter arithmetic exercise, consider a second ticket with logits [1.0, 0.0, -2.0] in the same action order. Refund still leads, but by a smaller margin. The code below compares the loss when the label is reship with the loss when the label is refund:

compare-label-losses.py
```python
import numpy as np

def stable_softmax(logits):
    shifted = logits - np.max(logits)
    exp_logits = np.exp(shifted)
    return exp_logits / exp_logits.sum()

logits = np.array([1.0, 0.0, -2.0])  # refund, reship, escalate
probabilities = stable_softmax(logits)

print("reship loss", round(float(-np.log(probabilities[1])), 3))
print("refund loss", round(float(-np.log(probabilities[0])), 3))
```
Output
```text
reship loss 1.349
refund loss 0.349
```
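The two printed losses differ by exactly 1.0, and that is not a coincidence. As a short worked step using the logit form of the loss, the log-sum-exp term is shared by both labels and cancels, leaving only the gap between the two logits:

$$
\begin{aligned}
L_{\text{reship}} - L_{\text{refund}}
&= \Bigl(\log\sum_j e^{z_j} - z_{\text{reship}}\Bigr) - \Bigl(\log\sum_j e^{z_j} - z_{\text{refund}}\Bigr) \\
&= z_{\text{refund}} - z_{\text{reship}} = 1.0 - 0.0 = 1.0
\end{aligned}
$$

The same check works on the first ticket: the losses 2.170 and 0.170 differ by 2.0, the gap between the refund logit 3.0 and the reship logit 1.0.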

Why squared probability error is a weak fit here

The earlier delay lesson used half squared error for a continuous target in hours. We could attach softmax to a classifier and square its probability error, but the softmax derivative then shrinks the signal when a wrong class is already saturated near probability one. Cross-entropy paired with softmax avoids that extra shrinkage: its logit gradient will be $p-y$.[1]

This example makes the difference visible. The logits are extremely sure about the wrong refund action while the target is reship.

confidently-wrong-gradients.py
```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def squared_probability_loss(logits: np.ndarray, target: np.ndarray) -> float:
    error = softmax(logits) - target
    return float(0.5 * np.sum(error ** 2))

logits = np.array([8.0, 0.0, 0.0])
target = np.array([0.0, 1.0, 0.0])
probabilities = softmax(logits)
ce_gradient = probabilities - target

epsilon = 1e-5
mse_gradient = np.array([
    (squared_probability_loss(logits + np.eye(3)[i] * epsilon, target)
     - squared_probability_loss(logits - np.eye(3)[i] * epsilon, target))
    / (2 * epsilon)
    for i in range(3)
])

print("probabilities:", np.round(probabilities, 4))
print("cross-entropy gradient:", np.round(ce_gradient, 4))
print("squared-probability gradient:", np.round(mse_gradient, 4))
```
Output
```text
probabilities: [9.993e-01 3.000e-04 3.000e-04]
cross-entropy gradient: [ 9.993e-01 -9.997e-01  3.000e-04]
squared-probability gradient: [ 0.001  -0.0007 -0.0003]
```

Cross-entropy isn't the only possible classification objective, but it gives a direct categorical likelihood objective and a useful correction even when a wrong option dominates.
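The shrinkage can be traced to the softmax Jacobian. As a sketch, assuming the standard identity $\partial p_i / \partial z_k = p_i(\mathbf{1}[i=k] - p_k)$, the squared-probability gradient is that Jacobian applied to $p-y$, while cross-entropy passes $p-y$ through unchanged. The variable names are illustrative.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([8.0, 0.0, 0.0])   # confidently favors refund
target = np.array([0.0, 1.0, 0.0])   # label: reship
p = softmax(logits)
error = p - target

# Softmax Jacobian: J[i, k] = p_i * (1[i == k] - p_k).
jacobian = np.diag(p) - np.outer(p, p)

ce_gradient = error                  # cross-entropy: p - y, no extra factor
mse_gradient = jacobian.T @ error    # squared error: p - y filtered through J

print("largest Jacobian entry:", round(float(np.abs(jacobian).max()), 4))
print("cross-entropy gradient norm:", round(float(np.linalg.norm(ce_gradient)), 4))
print("squared-probability gradient norm:", round(float(np.linalg.norm(mse_gradient)), 4))
```

When one probability is near one, every Jacobian entry is tiny, so the squared-error gradient is small no matter how large the error $p-y$ is. The analytic result agrees with the finite-difference gradient printed above.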

The gradient says exactly what to correct

For one labeled example, combine stable softmax with cross-entropy:

$$L = \log\left(\sum_j e^{z_j}\right) - z_y$$

Differentiate with respect to any logit $z_k$:

$$\frac{\partial L}{\partial z_k} = \frac{e^{z_k}}{\sum_j e^{z_j}} - \mathbf{1}[k=y] = p_k - y_k$$

This derivative is commonly summarized component-wise as $p_i - y_i$: predicted probability share minus target share.

The indicator $\mathbf{1}[k=y]$ equals one only for the correct class. For our reship ticket:

| Action | Probability $p$ | Target $y$ | Gradient $p-y$ | Gradient descent effect |
| --- | --- | --- | --- | --- |
| refund | 0.844 | 0 | +0.844 | Lowers wrong favorite |
| reship | 0.114 | 1 | -0.886 | Raises correct action |
| escalate | 0.042 | 0 | +0.042 | Lowers competitor slightly |
Figure: Three probability number lines showing p minus y as an arrow from one-hot target to prediction: refund plus 0.844, reship minus 0.886, and escalate plus 0.042, with gradient descent lowering wrong logits and raising reship.
Each $p - y$ arrow runs from target share to predicted share. Refund's long positive arrow means its logit is pushed down; reship's long negative arrow means its logit is pushed up; escalate receives only a small correction.

Logits themselves are outputs, not normally the parameters an optimizer stores. Updating them below is a diagnostic shortcut: it shows the direction that upstream weight updates are trying to produce on the next forward pass.

one-logit-signal-step.py
```python
import numpy as np

def probabilities_and_loss(logits: np.ndarray, target_index: int):
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return np.exp(log_probs), float(-log_probs[target_index])

labels = ["refund", "reship", "escalate"]
logits = np.array([3.0, 1.0, 0.0])
target_index = labels.index("reship")
target = np.eye(3)[target_index]

before, before_loss = probabilities_and_loss(logits, target_index)
gradient = before - target
after_logits = logits - 0.5 * gradient
after, after_loss = probabilities_and_loss(after_logits, target_index)

print("gradient:", np.round(gradient, 3))
print("reship probability:", round(float(before[1]), 3), "->", round(float(after[1]), 3))
print("loss:", round(before_loss, 3), "->", round(after_loss, 3))
```
Output
```text
gradient: [ 0.844 -0.886  0.042]
reship probability: 0.114 -> 0.23
loss: 2.17 -> 1.469
```
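To connect that diagnostic shortcut to actual parameters, here is a minimal sketch that assumes the logits come from a single bias-free linear layer, `logits = W @ x`. The weight matrix `W`, the feature vector `x`, and the learning rate are hypothetical values chosen so the starting logits match the ticket. By the chain rule, the weight gradient is the outer product of $p-y$ with `x`.

```python
import numpy as np

def probabilities_and_loss(logits: np.ndarray, target_index: int):
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return np.exp(log_probs), float(-log_probs[target_index])

# Hypothetical features and weights chosen so that W @ x == [3.0, 1.0, 0.0].
x = np.array([1.0, 2.0])
W = np.array([[3.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0]])
target_index = 1  # reship

logits = W @ x
before, before_loss = probabilities_and_loss(logits, target_index)

# Chain rule: dL/dW = outer(dL/dlogits, x) = outer(p - y, x).
logit_gradient = before - np.eye(3)[target_index]
weight_gradient = np.outer(logit_gradient, x)

# One gradient descent step on the weights (learning rate is illustrative).
learning_rate = 0.1
W_after = W - learning_rate * weight_gradient
after, after_loss = probabilities_and_loss(W_after @ x, target_index)

print("weight gradient:\n", np.round(weight_gradient, 3))
print("logits:", np.round(logits, 3), "->", np.round(W_after @ x, 3))
print("loss:", round(before_loss, 3), "->", round(after_loss, 3))
```

With these particular numbers, `x @ x` is 5 and the learning rate is 0.1, so the weight update moves the logits by 0.5 times $p-y$. That is the same move as the direct logit step above, reached through the weights.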

As in the previous lesson, a finite-difference test catches sign mistakes in a new loss implementation. Nudging each logit by a tiny amount should agree with $p-y$.

verify-cross-entropy-gradient.py
```python
import numpy as np

def loss(logits: np.ndarray, target_index: int) -> float:
    shifted = logits - np.max(logits)
    return float(np.log(np.exp(shifted).sum()) - shifted[target_index])

logits = np.array([3.0, 1.0, 0.0])
target_index = 1
shifted = logits - np.max(logits)
probabilities = np.exp(shifted) / np.exp(shifted).sum()
analytic = probabilities - np.eye(3)[target_index]

epsilon = 1e-5
numeric = np.array([
    (loss(logits + np.eye(3)[i] * epsilon, target_index)
     - loss(logits - np.eye(3)[i] * epsilon, target_index))
    / (2 * epsilon)
    for i in range(3)
])

print("analytic:", np.round(analytic, 6))
print("numeric: ", np.round(numeric, 6))
print("match:", np.allclose(analytic, numeric, atol=1e-6))
```
Output
```text
analytic: [ 0.843795 -0.885805  0.04201 ]
numeric:  [ 0.843795 -0.885805  0.04201 ]
match: True
```

PyTorch receives logits directly

Training code shouldn't implement this loss from scratch unless you're testing your understanding or building a custom variant. In PyTorch, nn.CrossEntropyLoss receives raw logits for class-index targets and computes the equivalent of log-softmax followed by negative log-likelihood internally. Passing already-softmaxed probabilities changes the function being optimized.[2]

First, confirm that PyTorch returns the same loss and $p-y$ gradient as our hand calculation:

pytorch-cross-entropy.py
```python
import torch
from torch import nn

logits = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
target = torch.tensor([1])  # reship
loss_fn = nn.CrossEntropyLoss()

loss = loss_fn(logits, target)
loss.backward()
probabilities = torch.softmax(logits.detach(), dim=1)

print("probabilities:", probabilities.numpy().round(3))
print("loss:", round(loss.item(), 3))
print("gradient:", logits.grad.numpy().round(3))
```
Output
```text
probabilities: [[0.844 0.114 0.042]]
loss: 2.17
gradient: [[ 0.844 -0.886  0.042]]
```

Now reproduce a common bug. The incorrect call passes probabilities where the loss expects logits, effectively normalizing an already normalized output again.

do-not-softmax-before-cross-entropy.py
```python
import torch
from torch import nn

target = torch.tensor([1])  # reship
loss_fn = nn.CrossEntropyLoss()

correct_input = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
correct_loss = loss_fn(correct_input, target)
correct_loss.backward()

wrong_input = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
wrong_loss = loss_fn(torch.softmax(wrong_input, dim=1), target)
wrong_loss.backward()

print("raw logits loss:", round(correct_loss.item(), 3))
print("probabilities passed as logits:", round(wrong_loss.item(), 3))
print("correct reship gradient:", round(correct_input.grad[0, 1].item(), 3))
print("distorted reship gradient:", round(wrong_input.grad[0, 1].item(), 3))
```
Output
```text
raw logits loss: 2.17
probabilities passed as logits: 1.387
correct reship gradient: -0.886
distorted reship gradient: -0.127
```

Shapes won't warn you about this bug. Keep the model's final classification layer linear during training and feed those raw logits into cross-entropy.
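The distortion has a simple arithmetic explanation: softmax outputs all lie between 0 and 1, so treating them as logits leaves only narrow score gaps. The NumPy sketch below mirrors the buggy call without PyTorch. The function names are illustrative, and the "perfect" prediction is a hypothetical extreme used to show the best loss the buggy setup could report.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def cross_entropy_from_logits(scores: np.ndarray, target_index: int) -> float:
    maximum = np.max(scores)
    return float(maximum + np.log(np.exp(scores - maximum).sum()) - scores[target_index])

logits = np.array([3.0, 1.0, 0.0])
target_index = 1  # reship

correct_loss = cross_entropy_from_logits(logits, target_index)
# Bug reproduced on purpose: probabilities are fed in as if they were logits.
double_softmax_loss = cross_entropy_from_logits(softmax(logits), target_index)

# Hypothetical best case: even a perfect one-hot prediction is squeezed
# through a second softmax, so the loss cannot approach zero.
perfect = np.array([0.0, 1.0, 0.0])
floor_loss = cross_entropy_from_logits(perfect, target_index)

print("raw logits loss:", round(correct_loss, 3))
print("probabilities passed as logits:", round(double_softmax_loss, 3))
print("loss floor with the bug (3 classes):", round(floor_loss, 3))
```

For three classes the buggy loss bottoms out at $\log(e + 2) - 1$, so a loss curve that plateaus well above zero on easy data is one hint that probabilities are reaching the loss.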

Temperature changes a distribution, not a label

Softmax can be made sharper or flatter by dividing logits by a positive temperature $T$:

$$p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

When $T$ is below one, logit differences grow and the leading option takes more probability. When $T$ is above one, differences shrink and more probability remains on alternatives. Hinton, Vinyals, and Dean use this operation to reveal softer class relationships during knowledge distillation.[3] In text generation, the same mathematical scaling can shape the distribution a decoding policy samples from. Softmax still doesn't choose a token by itself.

| Temperature | refund | reship | escalate | Interpretation |
| --- | --- | --- | --- | --- |
| 0.5 | 0.980 | 0.018 | 0.002 | Wrong favorite becomes harder to escape |
| 1.0 | 0.844 | 0.114 | 0.042 | Original model distribution |
| 2.0 | 0.629 | 0.231 | 0.140 | Alternatives receive more mass |

Use the same stable function to check each distribution:

temperature-scales-logits.py
```python
import numpy as np

def softmax_at_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    shifted = scaled - np.max(scaled)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

logits = np.array([3.0, 1.0, 0.0])
for temperature in (0.5, 1.0, 2.0):
    probabilities = softmax_at_temperature(logits, temperature)
    print(f"T={temperature:.1f}", np.round(probabilities, 3))
```
Output
```text
T=0.5 [0.98  0.018 0.002]
T=1.0 [0.844 0.114 0.042]
T=2.0 [0.629 0.231 0.14 ]
```
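Pushing the temperature to extremes shows where the trend ends. This is a minimal sketch with two illustrative temperatures far outside the table: a very low one, where the distribution approaches a hard choice of the top logit, and a very high one, where it approaches uniform.

```python
import numpy as np

def softmax_at_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    shifted = scaled - np.max(scaled)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

logits = np.array([3.0, 1.0, 0.0])

# Illustrative extremes, far outside the table above.
cold = softmax_at_temperature(logits, 0.01)   # very low temperature
hot = softmax_at_temperature(logits, 100.0)   # very high temperature

print("T=0.01 ", np.round(cold, 3))
print("T=100.0", np.round(hot, 3))
print("favorite unchanged:", int(np.argmax(cold)) == int(np.argmax(hot)) == 0)
```

At both extremes refund keeps the largest share. Temperature changes how much mass the favorite holds, never which option is the favorite.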

A sequence supplies many supervised positions

Our ticket action produced one loss. A language model trains on the same categorical idea many times at once: after each context position, it produces logits for the next token, and the observed next token supplies the class label. A sequence model still needs a mechanism for history, which is the question the next lesson will address.

To make the language-model version literal, use a tiny vocabulary: refund, reship, and today. Suppose a two-token reply should be reship today. At each supervised position, the model produces a new logit vector over that same vocabulary:

| Supervised position | Correct token | Probability on token | Loss |
| --- | --- | --- | --- |
| Reply token 1 | reship | 0.114 | 2.170 |
| Reply token 2 | today | 0.736 | 0.306 |
| Mean loss | | | 1.238 |
Figure: Bar chart comparing per-token cross-entropy losses 2.170 for missed reship and 0.306 for correct today, with an 88 to 12 contribution split and mean loss 1.238 feeding one backward pass.
The missed reship position contributes 2.170 of the 2.476 total loss, about 88%, while today contributes 0.306. Averaging gives 1.238, but the larger miss still dominates the shared gradient.

This final NumPy example applies the same stable loss across two supervised reply positions. Each row holds logits over the same tiny vocabulary; averaging yields the one scalar sent backward.

average-position-losses.py
```python
import numpy as np

def per_position_cross_entropy(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

# Vocabulary order: refund, reship, today.
# Row 1 target is reship. Row 2 target is today.
logits = np.array([[3.0, 1.0, 0.0], [0.5, 0.0, 2.0]])
targets = np.array([1, 2])
losses = per_position_cross_entropy(logits, targets)

print("position losses:", np.round(losses, 3))
print("mean loss:", round(float(losses.mean()), 3))
```
Output
```text
position losses: [2.17  0.306]
mean loss: 1.238
```
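The averaged scalar also determines what each position receives on the backward pass. As a sketch, assuming mean reduction over the two positions, each row of logits gets its own $p-y$ scaled by one over the number of positions. The helper name `log_softmax_rows` is illustrative.

```python
import numpy as np

def log_softmax_rows(logits: np.ndarray) -> np.ndarray:
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Vocabulary order: refund, reship, today (same rows as the example above).
logits = np.array([[3.0, 1.0, 0.0], [0.5, 0.0, 2.0]])
targets = np.array([1, 2])
num_positions = len(targets)

log_probs = log_softmax_rows(logits)
losses = -log_probs[np.arange(num_positions), targets]
shares = losses / losses.sum()

# Gradient of the mean loss: each row is (p - y) scaled by 1 / num_positions.
one_hot = np.eye(logits.shape[1])[targets]
mean_loss_gradient = (np.exp(log_probs) - one_hot) / num_positions

print("loss shares:", np.round(shares, 3))
print("gradient of mean loss:\n", np.round(mean_loss_gradient, 3))
```

The first row carries much larger gradient entries than the second, matching its roughly 88% share of the total loss.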

Debugging checklist

| Symptom | Likely cause | Check or fix |
| --- | --- | --- |
| Loss becomes NaN in custom NumPy code | Exponentiating large logits directly | Subtract row maximum, then compute log-sum-exp |
| Training loss is plausible but gradients are weak or wrong | Probabilities were passed into CrossEntropyLoss | Pass raw logits into the loss |
| Model is confidently wrong on a class | Correct label has tiny softmax mass | Inspect $p-y$ and confirm correct logit receives negative gradient |
| Generation samples unwanted alternatives | Sampling distribution is too flat | Inspect temperature and decoding policy separately from training |
| Mean sequence loss hides one severe miss | Mean reduction combines easy and hard positions | Print per-position losses before averaging |

Mastery check

Key concepts

  • Logits are unconstrained scores whose differences determine probability shares.
  • Stable softmax subtracts the maximum logit before exponentiation.
  • Cross-entropy with a class label is the negative log probability of that label.
  • Softmax plus cross-entropy produces the logit gradient $p-y$.
  • PyTorch categorical cross-entropy receives raw logits, not probabilities.
  • Temperature rescales logits; decoding decides how a token is selected.
  • Sequence training averages this same categorical loss across supervised positions.

Evaluation rubric

  • Foundational: Compute the stable softmax probabilities and cross-entropy loss for the reship ticket by hand.
  • Intermediate: Explain why max shifting leaves probabilities unchanged and diagnose an overflowing implementation.
  • Intermediate: Derive $p-y$ and explain the sign of each action gradient when refund is wrongly preferred.
  • Advanced: Verify the gradient numerically, use the PyTorch interface correctly, and connect per-position loss to next-token sequence training.

Common pitfalls

  • Symptom: A logit is reported as a confidence percentage. Cause: Raw output scores were confused with normalized probabilities. Fix: Apply softmax for interpretation and treat calibration as a separate evaluation question.
  • Symptom: Custom loss returns NaN on large scores. Cause: Direct exponentiation overflowed. Fix: Compute max-shifted softmax or stable log-sum-exp.
  • Symptom: nn.CrossEntropyLoss runs but learns poorly. Cause: Model probabilities were passed as if they were logits. Fix: Pass the final linear layer's raw scores.
  • Symptom: Lower temperature reinforces an incorrect option. Cause: Temperature sharpened existing score order; it doesn't learn a correction. Fix: Use training gradients to repair logits and tune decoding separately.
  • Symptom: Mean sequence loss looks acceptable despite a serious wrong action. Cause: Easy token positions dilute one high-loss event. Fix: Inspect unreduced position losses during debugging.

Follow-up questions


1. A custom softmax over logits [1000.0, 998.0, 997.0] overflows, so you subtract the maximum before exponentiating and use [0.0, -2.0, -3.0]. Why should this produce the same probabilities as the original logits would in exact arithmetic?
2. For action order [refund, reship, escalate], softmax gives probabilities [0.844, 0.114, 0.042] and the correct label is reship. What does the cross-entropy logit gradient [0.844, -0.886, 0.042] cause under gradient descent?
3. Why should a PyTorch classifier pass raw logits, not softmax probabilities, into nn.CrossEntropyLoss during training?
4. A tiny language-model training example has two supervised next-token positions. The target token probabilities are 0.114 for reship at position 1 and 0.736 for today at position 2. If the training loss is the mean cross-entropy over positions, what is reported and which position contributes more?
5. With logits [3.0, 1.0, 0.0] in action order [refund, reship, escalate], refund is highest but the correct action is reship. If you lower the softmax temperature from 1.0 to 0.5, what happens?
6. For logits [3.0, 1.0, 0.0], the exponentials are approximately [20.09, 2.72, 1.00]. If reship is the second class, what are its softmax probability and cross-entropy loss?
7. A target probability underflows when cross-entropy is computed from very large logits. Which calculation preserves the intended loss most reliably?
8. A classifier gives probabilities [0.9993, 0.0003, 0.0003], but the second class is correct. How do cross-entropy and squared probability error differ in this confidently wrong case?


You can now turn competing logits into a stable loss and a correction signal at each supervised position. Next you will build the first model in this path that carries ordered history forward so those predictions can depend on what came before.

References

[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

[2] PyTorch Contributors (2026). Optimizing Model Parameters. Official tutorial.

[3] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network.