LeetLLM
© 2026 LeetLLM. All rights reserved.


Language Modeling & Next Tokens

Learn how next-token prediction becomes a trainable language model, from bigram counts and neural n-grams to causal Transformer generation and KV-cache serving.


The Transformer chapter ended with logits and softmax: a vector of scores becomes a probability distribution over possible next tokens. This chapter explains what that distribution is used for during language modeling.

At the core, a large language model (LLM) does one repeated job: it predicts the next token. This article builds that job from counts into a trainable objective, then shows how a causal Transformer uses it during generation.

Key idea: An LLM doesn't generate a whole paragraph at once. It predicts a single token, appends it to its input, and runs the process again. This is called autoregressive generation.

Figure: Two next-token probability charts show warehouse winning after the prompt Order shipped from the, then a period winning after warehouse is appended to the context.
The prompt produces one probability distribution, where `warehouse` leads. After that token is selected and appended, the longer context produces a different distribution, where a period now leads.

What is a language model?

An autoregressive language model is a probability engine for token sequences: given the tokens so far, it assigns a probability distribution over the vocabulary for what comes next.

Think of it like a warehouse routing system that must choose the next destination for a package. It doesn't treat every destination as equally likely; it ranks them using the package's origin, weight, and delivery promise.

A useful way to make this precise: the probability of a whole sequence factorizes into a product of next-token predictions, one per position. This is just the chain rule of probability.

$$P(x_1, x_2, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$

Read it left to right: the chance of the first token, times the chance of the second given the first, times the chance of the third given the first two, and so on. Each factor is one next-token distribution. That single building block, "distribution over the next token given everything before it," is the entire job. Everything else in this chapter (generation, training, perplexity) is built on it.
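A minimal sketch of the factorization, using made-up conditional probabilities (the `conditionals` list and its numbers are illustrative assumptions, not output from a real model). It multiplies the per-position factors and also sums their logs, which is the form that stays numerically safe for long sequences.

```python
import math

# Hypothetical next-token probabilities for the sequence
# "the package left the warehouse", one factor per position.
conditionals = [
    ("the", 0.20),        # P(the)
    ("package", 0.05),    # P(package | the)
    ("left", 0.30),       # P(left | the package)
    ("the", 0.60),        # P(the | the package left)
    ("warehouse", 0.85),  # P(warehouse | the package left the)
]

# Chain rule: multiply the factors to get the probability of the whole sequence.
sequence_prob = math.prod(p for _, p in conditionals)

# Equivalent form: add log-probabilities, then exponentiate.
sequence_log_prob = sum(math.log(p) for _, p in conditionals)

print(f"product of factors: {sequence_prob:.6f}")
print(f"exp(sum of logs):   {math.exp(sequence_log_prob):.6f}")
```

Both lines print the same value. The log form matters in practice because a product of many small factors underflows to zero long before a sum of logs runs into trouble.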

Suppose a model receives this prompt: "The package left the"

Its actual probabilities depend on its tokenizer, weights, and preceding context. In a simplified example, it might rank four possible continuations like this:

Probability table

| Candidate token | Probability | What the reader should notice |
|---|---|---|
| warehouse | 85% | Greedy decoding would choose this; under sampling it is still the most likely pick. |
| hub | 10% | Plausible alternatives remain available when sampling. |
| carrier | 3% | Lower-ranked words can appear with higher temperature. |
| coupon | 0.0001% | A commerce-related token can still be unlikely in this position. |

This table is only an excerpt. Tokens omitted from the table receive the remaining probability mass.

Read the distribution before sampling

The table above isn't a list of words the model "knows." It's the model's next-step belief after reading the prompt.

That distinction matters because the same model can behave differently after one extra token.

| Prompt so far | Likely next token | Why |
|---|---|---|
| The package left the | warehouse | Common logistics completion. |
| The package left the regional | hub | New context changes the expected location. |
| The package left the checkout | page | Product context pulls the distribution toward software words. |
| The package left the (plus a style instruction for a customer email) | warehouse or carrier | Task instructions can change likely continuations. |

The model doesn't store one fixed answer for "the package left the." It computes a distribution from the whole context. Change the context, and you change the probabilities.

Try the mental move:

  1. Read this prompt.
  2. Predict three likely next tokens.
  3. Compare your predictions with the options below.
```text
The carrier scan says the parcel was
```

Possible continuations:

```text
delivered
delayed
rerouted
```

All three are plausible. A model doesn't need certainty to generate text. It needs a probability ranking, then the decoding algorithm picks from that ranking.

Tiny vocabulary example

A production vocabulary typically contains tens of thousands to a few hundred thousand tokens; the exact size is set by the model's tokenizer. That's too many for a beginner example, so shrink the world to five tokens.

Imagine this prompt:

The support bot answered

The model produces logits, which are raw scores before softmax:

| Token | Logit | Meaning |
|---|---|---|
| correctly | 3.0 | Strong continuation. |
| slowly | 1.5 | Plausible but less likely. |
| angrily | -1.0 | Grammatically possible, semantically odd. |
| the | -2.0 | Usually wrong after this prompt. |
| `</s>` | -3.0 | Ending now is unlikely. |

Softmax turns those scores into probabilities:

One line in the code deserves explanation before you read it: logits - logits.max().

Why subtract the maximum logit? Because softmax only cares about differences between logits, not their absolute level. If you subtract the same constant from every score, the ranking and final probabilities stay the same. But the numerics get much safer: the largest shifted logit becomes 0, so its exponential is e^0 = 1 instead of some huge number. On tiny toy logits like [3.0, 1.5, -1.0, ...] this barely matters. On real model logits it prevents overflow and keeps the computation stable.

tiny-vocabulary-example.py

```python
import numpy as np

tokens = ["correctly", "slowly", "angrily", "the", "</s>"]
logits = np.array([3.0, 1.5, -1.0, -2.0, -3.0])

# Subtract max for numerical stability, then exponentiate
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

for token, prob in zip(tokens, probs):
    print(f"{token:10s} {prob:.3f}")
print(f"sum {probs.sum():.3f}")
print("top token ", tokens[int(np.argmax(probs))])
```

Output

```text
correctly  0.800
slowly     0.178
angrily    0.015
the        0.005
</s>       0.002
sum 1.000
top token  correctly
```

The exact numbers are less important than the shape:

  1. One token is clearly ahead.
  2. Another token still has a real chance.
  3. Some tokens are valid but unlikely.
  4. The probabilities add to 1.

That's next-token prediction in miniature.
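The shift-invariance claim from the softmax explanation can be checked directly. The sketch below uses plain Python floats and two illustrative helper names, `softmax_naive` and `softmax_stable`, which are assumptions for this example. With plain floats the naive version raises `OverflowError` on large logits; NumPy would instead return `inf` and then `nan` with a warning.

```python
import math

def softmax_naive(logits):
    # Direct definition: exponentiate, then normalize.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    # Subtract the maximum first so the largest exponent is exp(0) = 1.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = [3.0, 1.5, -1.0, -2.0, -3.0]
shifted = [z + 100.0 for z in small]  # same gaps, different absolute level

# Same differences between logits give the same probabilities.
print([round(p, 3) for p in softmax_stable(small)])
print([round(p, 3) for p in softmax_stable(shifted)])

# Large logits overflow the naive version but not the stable one.
large = [1000.0, 998.5, 996.0]
try:
    softmax_naive(large)
except OverflowError as error:
    print("naive softmax failed:", error)
print([round(p, 3) for p in softmax_stable(large)])
```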

From bigram tables to neural n-grams

The smallest context-aware language model is a bigram model. It looks at one previous token and predicts the next token from a table of counts:

```text
P(next_token | previous_token)
```

If the support-log corpus often contains left -> warehouse, the row for left assigns high probability to warehouse. If it rarely contains left -> coupon, that probability stays tiny.

| Previous token | Candidate next token | Count | Probability |
|---|---|---|---|
| left | warehouse | 85 | 0.85 |
| left | hub | 10 | 0.10 |
| left | carrier | 5 | 0.05 |

That table already teaches the core language-modeling contract: read context, produce a probability distribution over the next token. Its weakness is context length. A bigram model can't tell the difference between package left the and customer left the, because it only sees the final token.

Build a count-based language model

Start with six tracking-message fragments. Each adjacent token pair becomes one training event: after left, which token appeared next?

count-bigram-events.py

```python
from collections import Counter, defaultdict

messages = [
    "package left warehouse",
    "package left warehouse",
    "parcel left hub",
    "parcel left warehouse",
    "order left hub",
    "order left carrier",
]

counts = defaultdict(Counter)
for message in messages:
    tokens = ["<bos>", *message.split(), "<eos>"]
    for previous, current in zip(tokens, tokens[1:]):
        counts[previous][current] += 1

row = counts["left"]
total = sum(row.values())
for token, count in row.most_common():
    print(f"P({token:9s} | left) = {count}/{total} = {count / total:.3f}")
print("unseen destination probability =", row["depot"] / total)
```

Output

```text
P(warehouse | left) = 3/6 = 0.500
P(hub       | left) = 2/6 = 0.333
P(carrier   | left) = 1/6 = 0.167
unseen destination probability = 0.0
```

An unseen pair receives probability zero in this unsmoothed table. More data or smoothing helps, but a count table still can't share what it learned about related contexts.
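One common form of smoothing is add-k (Laplace) smoothing, sketched here on the same `left` row. The four-token `vocabulary`, the helper name `smoothed_probability`, and `k=1.0` are assumptions chosen for illustration; a real model would smooth over its full vocabulary.

```python
from collections import Counter

# Counts for the row "left" from the six fragments above.
row = Counter({"warehouse": 3, "hub": 2, "carrier": 1})

# Hypothetical closed vocabulary of destinations; "depot" never follows "left".
vocabulary = ["warehouse", "hub", "carrier", "depot"]

def smoothed_probability(token, row, vocabulary, k=1.0):
    # Add-k smoothing: pretend every vocabulary token was seen k extra times.
    total = sum(row.values()) + k * len(vocabulary)
    return (row[token] + k) / total

for token in vocabulary:
    print(f"P({token:9s} | left) = {smoothed_probability(token, row, vocabulary):.3f}")
```

The unseen token now gets a small nonzero probability and the row still sums to 1. Smoothing removes the zeros, but every context row is still estimated on its own.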

An n-gram model predicts the next token from the previous n - 1 tokens:

```text
P(next_token | previous n - 1 tokens)
```

For tiny corpora, count tables still work. For real language, the table explodes because most exact contexts are rare. A neural n-gram model addresses that sparsity by learning embeddings and a small MLP (multi-layer perceptron):

  1. Look up embeddings for the previous n - 1 tokens.
  2. Concatenate them into one vector.
  3. Apply a matrix multiply, bias, and a nonlinearity such as the Gaussian Error Linear Unit (GELU).
  4. Project to vocabulary logits.
  5. Softmax into next-token probabilities.
```text
previous n - 1 token IDs -> embedding table -> joined embeddings -> matmul + GELU -> logits -> softmax
```
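To see why the count table explodes while the neural version stays manageable, here is a rough size comparison. The vocabulary size, embedding width, and hidden width are assumed values for illustration, and the two helper functions are hypothetical names, not part of any library.

```python
def count_table_entries(vocab_size, n):
    # One probability per (context, next token) pair:
    # vocab_size ** (n - 1) contexts, each with vocab_size candidates.
    return vocab_size ** n

def neural_ngram_parameters(vocab_size, n, embed_dim, hidden_dim):
    embedding = vocab_size * embed_dim                      # shared embedding table
    hidden = (n - 1) * embed_dim * hidden_dim + hidden_dim  # matmul weights + bias
    output = hidden_dim * vocab_size + vocab_size           # projection to logits + bias
    return embedding + hidden + output

vocab_size, embed_dim, hidden_dim = 50_000, 64, 256  # assumed sizes
for n in (2, 3, 4):
    table = count_table_entries(vocab_size, n)
    neural = neural_ngram_parameters(vocab_size, n, embed_dim, hidden_dim)
    print(f"n={n}: table entries = {table:.2e}, neural parameters = {neural:.2e}")
```

The table grows by a factor of the vocabulary size for each extra context token, while the neural model only adds one more block of rows to its first weight matrix.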

This is the bridge from count-based autocomplete to modern LLMs. The objective stays next-token prediction. The model family changes: bigram table, neural n-gram MLP, recurrent network, then Transformer. Each step gives the model a better way to use context.

Figure: A sparse bigram transition heatmap is compared with a neural n-gram pipeline that embeds left and regional into a two-by-six matrix, joins twelve features, applies shared twelve-by-sixteen weights, and produces logits with hub highest.
The bigram model selects one sparse probability row from the previous token. The neural n-gram instead embeds both context tokens, joins their features, and applies shared matrices before producing next-token logits.

Train a tiny neural n-gram

The next cell trains a small two-token-context, or trigram, model. It isn't a Transformer, but it exposes the same contract: embeddings and shared weights produce logits, and cross-entropy rewards the true next token.

tiny-neural-ngram.py

```python
import torch
from torch import nn

torch.manual_seed(7)
sentences = [
    "package left regional hub",
    "parcel left regional hub",
    "order left regional hub",
    "package left local warehouse",
    "parcel left local warehouse",
    "order left local warehouse",
]
vocab = sorted({word for sentence in sentences for word in sentence.split()})
to_id = {word: index for index, word in enumerate(vocab)}

contexts, targets = [], []
for sentence in sentences:
    words = sentence.split()
    for index in range(2, len(words)):
        contexts.append([to_id[words[index - 2]], to_id[words[index - 1]]])
        targets.append(to_id[words[index]])

x = torch.tensor(contexts)
y = torch.tensor(targets)
model = nn.Sequential(
    nn.Embedding(len(vocab), 6),
    nn.Flatten(),
    nn.Linear(12, 16),
    nn.GELU(),
    nn.Linear(16, len(vocab)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

with torch.no_grad():
    initial_loss = nn.functional.cross_entropy(model(x), y).item()
for _ in range(120):
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

query = torch.tensor([[to_id["left"], to_id["regional"]]])
with torch.no_grad():
    final_loss = nn.functional.cross_entropy(model(x), y).item()
    predicted_id = int(model(query).argmax(dim=-1).item())

print(f"loss before training: {initial_loss:.3f}")
print(f"loss after training: {final_loss:.3f}")
print("after 'left regional':", vocab[predicted_id])
```

Output

```text
loss before training: 2.121
loss after training: 0.347
after 'left regional': hub
```

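The tiny corpus is deliberately simple, so this cell demonstrates optimization rather than a production-quality model. The final loss of about 0.347 isn't a failure to converge: half of the twelve training examples have a context ending in left, which is followed by regional and local equally often, so each of those examples costs at least ln 2 ≈ 0.693 and the average can't drop below 0.693 / 2 ≈ 0.347. The important bridge is that a neural model can use one set of learned parameters across contexts instead of storing an isolated probability row for every exact phrase.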

Why temperature matters

So far we've ranked likely tokens and used the highest-probability one as the default. In practice, many applications sample from the distribution instead. A temperature parameter rescales the logits before softmax, giving the model more or less freedom to deviate from the top choice.

Why do we divide by temperature instead of adding or subtracting something? Because dividing changes the gap between logits. A temperature below 1 stretches the gaps, so the winner pulls farther ahead after softmax. A temperature above 1 compresses the gaps, so lower-ranked tokens stay more competitive. Adding a constant would not do that; softmax would cancel it out. The formula requires a temperature above 0. Treat a zero-temperature mode as greedy decoding in code instead of dividing logits by zero.

Here's the same logits at three temperatures:

why-temperature-matters.py

```python
import numpy as np

tokens = ["correctly", "slowly", "angrily", "the", "</s>"]
logits = np.array([3.0, 1.5, -1.0, -2.0, -3.0])

for temperature in [0.5, 1.0, 2.0]:
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    print(f"\nTemperature = {temperature}")
    for token, prob in zip(tokens, probs):
        print(f" {token:10s} {prob:.3f}")
    print(f" sum {probs.sum():.3f}")
    print(f" top token {tokens[int(np.argmax(probs))]}")
```

Output

```text
Temperature = 0.5
 correctly  0.952
 slowly     0.047
 angrily    0.000
 the        0.000
 </s>       0.000
 sum 1.000
 top token correctly

Temperature = 1.0
 correctly  0.800
 slowly     0.178
 angrily    0.015
 the        0.005
 </s>       0.002
 sum 1.000
 top token correctly

Temperature = 2.0
 correctly  0.575
 slowly     0.272
 angrily    0.078
 the        0.047
 </s>       0.029
 sum 1.000
 top token correctly
```

At low temperature (0.5), the distribution sharpens. correctly dominates almost completely. This is close to greedy behavior: safe, repetitive, and predictable.


At high temperature (2.0), the distribution flattens. slowly rises from 0.178 to 0.272, and even odd words like angrily gain real probability (0.015 to 0.078). This introduces variety, but also raises the risk of incoherent output.

Production note: Low temperature keeps a support bot consistent and on-brand; higher temperature produces more varied drafts for creative tools. Temperature is one lever among several for shaping generation; a later chapter covers the full decoding toolkit.
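The zero-temperature caveat can be handled with an explicit branch. This is a minimal sketch, assuming a hypothetical helper named `pick_token` and Python's standard `random` module; it is one way to guard the division, not a description of how any particular serving stack does it.

```python
import math
import random

def pick_token(tokens, logits, temperature, rng):
    # Temperature 0 is treated as greedy decoding, so nothing is divided by zero.
    if temperature == 0:
        return tokens[max(range(len(logits)), key=lambda i: logits[i])]
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    # Rescale logits, then apply a max-shifted softmax numerator.
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    weights = [math.exp(z - top) for z in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]

tokens = ["correctly", "slowly", "angrily", "the", "</s>"]
logits = [3.0, 1.5, -1.0, -2.0, -3.0]
rng = random.Random(0)  # seeded so repeated runs match

print("temperature 0:", pick_token(tokens, logits, 0, rng))
print("temperature 2:", [pick_token(tokens, logits, 2.0, rng) for _ in range(5)])
```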

Causal language modeling

This specific flavor of language modeling, reading from left to right and predicting the future, is called causal language modeling (or autoregressive modeling). GPT-style models use this objective.[1][2]

It's called "causal" because the model is bound by causality: each position can use earlier context, but it can't peek at future tokens. In a Transformer block, a token position may attend to itself and earlier positions; the mask blocks later positions.
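The masking rule can be sketched in a few lines. The `scores` matrix holds made-up attention scores and `causal_attention_weights` is an illustrative helper name. Real implementations typically set masked scores to negative infinity before softmax, which gives the same result as dropping them.

```python
import math

def causal_attention_weights(scores):
    # scores[i][j] is the raw attention score from query position i to key position j.
    weights = []
    for i, row in enumerate(scores):
        # Causal mask: position i may attend to positions 0..i only.
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Masked (future) positions get weight exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical scores for a 4-token sequence.
scores = [
    [0.5, 2.0, 1.0, 3.0],
    [1.0, 0.2, 2.5, 0.1],
    [0.3, 0.8, 0.4, 1.9],
    [1.2, 0.1, 0.6, 0.7],
]

for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Every printed row has zeros to the right of the diagonal, and the first position can only attend to itself.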

GPT-style decoder models are causal language models.

| Objective | Can read | Good for | Poor fit |
|---|---|---|---|
| Causal language modeling | Earlier context, no future tokens | Generating text left to right | Filling a middle blank with future context |
| Masked language modeling | Left and right context | Classification, extraction, sentence understanding | Open-ended generation |

(Note: There's another type called masked language modeling, used by Bidirectional Encoder Representations from Transformers (BERT), where a token in the middle of a sentence is hidden and the model looks at both the left and right context to predict the blank.[3] That objective builds representations from both sides of a blank; the left-to-right loop in this chapter instead needs causal modeling.)


How the model learns the distribution

We keep saying the model "predicts" the next token. Where does that skill come from? From training on text where the next token is already known. Every sentence in the training data is a free set of labeled examples: at each position, the correct answer is simply the token that actually came next.

Teacher forcing

During training the model is fed the real previous tokens at every position, not the tokens it would have guessed. This is called teacher forcing. The model never gets to wander off on its own mistakes while learning; each position is scored against ground truth using the true history.

This distinction matters whenever you interpret a training loss or measure generation latency:

| Phase | What feeds position t | Why |
|---|---|---|
| Training (teacher forcing) | The true tokens x_1 ... x_{t-1} from the dataset | Stable targets; all positions can be scored in one parallel forward pass. |
| Generation (autoregressive) | The model's own previously sampled tokens | No ground truth exists; the model must build on what it produced. |

Because the true history is known up front, training scores every position of a sequence in a single forward pass. The causal mask is what makes this honest: the logit produced from prefix x_1 ... x_{t-1} can't read its target token x_t.

This side-by-side view helps lock in why training is parallel while generation isn't:

Figure: Teacher forcing shows shifted input and target token rows with four true-token probabilities scored together, while generation shows a four-step staircase where each selected token expands the next context.
Teacher forcing shifts one known sequence into inputs and targets, so all four true-token probabilities can be scored in one masked pass. Generation has no target row; each selected token must be appended before the next row exists.

Shift tokens into inputs and labels

For a decoder, one training sequence supplies several labeled predictions. The input is the sequence without its last token; the target is the same sequence shifted left by one position, so each input position is paired with the token that follows it.

| Context visible to model | Correct next token |
|---|---|
| `<bos>` | package |
| `<bos> package` | left |
| `<bos> package left` | hub |
| `<bos> package left hub` | `<eos>` |

shift-next-token-labels.py

```python
import numpy as np

vocab = {"<bos>": 0, "package": 1, "left": 2, "hub": 3, "<eos>": 4}
tokens = ["<bos>", "package", "left", "hub", "<eos>"]
ids = np.array([vocab[token] for token in tokens])
inputs = ids[:-1]
targets = ids[1:]

print("input token ids: ", inputs.tolist())
print("target token ids:", targets.tolist())
for position in range(len(inputs)):
    context = " ".join(tokens[: position + 1])
    print(f"{context:25s} -> {tokens[position + 1]}")
```

Output

```text
input token ids:  [0, 1, 2, 3]
target token ids: [1, 2, 3, 4]
<bos>                     -> package
<bos> package             -> left
<bos> package left        -> hub
<bos> package left hub    -> <eos>
```

Cross-entropy: the training loss

At each position the model outputs a distribution over the vocabulary. We want the probability it assigns to the true next token to be high. The loss for one position is the negative log of that probability, and the loss for a sequence is the average over positions. This is cross-entropy loss, the same objective introduced in the softmax chapter, now applied at every position.

$$\mathcal{L} = -\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_1, \dots, x_{t-1})$$

If the model gives the true token probability 1.0, that term contributes -log(1) = 0 loss: a perfect prediction. If it gives the true token a tiny probability, -log of a small number is large: a big penalty. Causal pretraining minimizes this average over a large corpus.

The logits at every shifted position can be scored together. This cell calculates stable cross-entropy for the four targets above.

cross-entropy-over-positions.py

```python
import numpy as np

targets = np.array([1, 2, 3, 4])  # package, left, hub, <eos>
logits = np.array([
    [0.0, 3.0, 0.5, -1.0, -2.0],
    [-1.0, 0.1, 2.5, 0.0, -2.0],
    [-1.0, 0.0, 0.2, 2.0, -1.5],
    [-2.0, 0.0, -1.0, -0.5, 2.4],
])
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
true_probs = probs[np.arange(len(targets)), targets]
losses = -np.log(true_probs)

print("true-token probabilities:", np.round(true_probs, 3).tolist())
print("per-token losses: ", np.round(losses, 3).tolist())
print(f"mean cross-entropy: {losses.mean():.3f}")
```

Output

```text
true-token probabilities: [0.864, 0.824, 0.724, 0.839]
per-token losses:  [0.146, 0.194, 0.323, 0.175]
mean cross-entropy: 0.209
```

Perplexity: loss in token units

Cross-entropy is measured in nats (because we used natural log), which is hard to feel. Perplexity is the friendlier form: just exponentiate the average loss.

$$\text{Perplexity} = \exp(\mathcal{L})$$

Perplexity has a clean interpretation: it's the effective number of equally likely tokens the model is choosing between at each step. A perplexity of 1 means the model is certain every time. A perplexity of 50,000 on a 50,000-token vocabulary means the model learned nothing and is guessing uniformly. Lower is better.[4][5]

Work a tiny example by hand. Suppose the model predicts four tokens, assigning the true token these probabilities: 0.5, 0.5, 0.5, 0.125.

perplexity-by-hand.py

```python
import numpy as np

# Probability the model gave to the token that was actually correct, per position
true_token_probs = np.array([0.5, 0.5, 0.5, 0.125])

per_position_loss = -np.log(true_token_probs)  # cross-entropy term per token
avg_loss = per_position_loss.mean()            # the training loss
perplexity = np.exp(avg_loss)                  # loss in "effective choices"

print(f"per-position loss: {np.round(per_position_loss, 4)}")
print(f"average loss: {avg_loss:.4f}")
print(f"perplexity: {perplexity:.4f}")

# Perplexity equals the reciprocal geometric mean of the true-token probabilities
geo_mean = np.prod(true_token_probs) ** (1 / len(true_token_probs))
print(f"1 / geometric mean:{1 / geo_mean:.4f}")
```

Output

```text
per-position loss: [0.6931 0.6931 0.6931 2.0794]
average loss: 1.0397
perplexity: 2.8284
1 / geometric mean:2.8284
```

A perplexity of about 2.83 means that, averaged across these four steps, the model behaved as if it were choosing between roughly 2.83 equally likely options. The one bad prediction (0.125) drags the average up. That is the whole point of the metric: it tells you, in token units, how surprised the model was by real text.

When you fine-tune a model and watch the loss curve, you're watching average cross-entropy fall. When an eval report uses the same tokenizer, evaluated tokens, and natural-log averaging, perplexity is exp of that same loss. Hold those choices constant before treating a mismatch as a regression.
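A small check of why the log base has to be held constant, and of the uniform-guessing case mentioned earlier. This is a sketch using the same four illustrative probabilities; the 50,000-token vocabulary is an assumed round number.

```python
import math

true_token_probs = [0.5, 0.5, 0.5, 0.125]
count = len(true_token_probs)

# Same predictions, two log bases. Each is consistent when paired with its own inverse.
loss_nats = -sum(math.log(p) for p in true_token_probs) / count
loss_bits = -sum(math.log2(p) for p in true_token_probs) / count

print(f"exp(loss in nats): {math.exp(loss_nats):.4f}")
print(f"2 ** loss in bits: {2 ** loss_bits:.4f}")

# Mixing the bases produces a different number from the same predictions.
print(f"exp(loss in bits): {math.exp(loss_bits):.4f}")

# Uniform guessing over V tokens gives perplexity V.
vocab_size = 50_000
uniform_loss = -math.log(1 / vocab_size)
print(f"uniform perplexity: {math.exp(uniform_loss):.1f}")
```

The first two lines agree; the third does not, even though nothing about the model changed. A mismatch like that is a reporting difference, not a regression.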


The autoregressive loop

When you ask a chat model to write a shipping-delay update, it doesn't output the whole message in one step. It generates the message one token at a time in a loop.

This is the autoregressive generation loop:

  1. Input: The model receives the prompt.
  2. Predict: It calculates the probabilities for the very next token.
  3. Pick: We take the most likely token or sample from the distribution, optionally adjusting temperature.
  4. Append: That chosen token is glued to the end of the prompt.
  5. Repeat: The new, longer context determines the distribution for the next token.

This list describes the dependency between generated tokens, not an inefficient serving implementation. Production decode usually reuses earlier attention state through a key-value (KV) cache. It still performs one sequential decode step for each new token, but it doesn't rebuild earlier keys and values. The serving section below makes that tradeoff concrete.

Figure: Prompt 'Delivery is', model predicts next token, sample 'late', and append to context.

This explains why generation speed is measured in tokens per second (t/s). The model has to run another decoding step for every generated token. A KV cache avoids recomputing earlier keys and values from scratch, but each new token still has to attend to previous context.

A tiny generation loop in code

The code below isn't a neural network. It's a small hand-written language model that shows the same loop: look at current text, choose one next token, append it, and repeat.

a-tiny-generation-loop-in-code.py

```python
next_token_table = {
    "Order shipped from the": [("warehouse", 0.85), ("store", 0.10), ("hub", 0.05)],
    "Order shipped from the warehouse": [(".", 0.80), ("and", 0.20)],
    "Order shipped from the warehouse.": [("</s>", 1.0)],
}

text = "Order shipped from the"

for step in range(3):
    choices = next_token_table[text]
    token, _probability = max(choices, key=lambda item: item[1])

    if token == "</s>":
        break

    text = text + ("" if token == "." else " ") + token
    print(step, repr(text))
print("final text:", repr(text))
```

Output

```text
0 'Order shipped from the warehouse'
1 'Order shipped from the warehouse.'
final text: 'Order shipped from the warehouse.'
```

Real LLMs replace the hand-written table with a Transformer, but the control flow is the same.

Figure: Text so far, decode next position, logits + temperature, and sample one token.

The loop ends when one of these things happens:

| Stop condition | Example |
|---|---|
| Model emits an end token | `</s>` or another stop marker. |
| Application or API hits an output-token limit | The server stops after a token budget. |
| Application stop string appears | The app stops at `\n\nUser:` or a JSON boundary. |
| Safety or policy system interrupts | The provider refuses or truncates output. |
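The first three stop conditions can be sketched as checks inside the loop. The `generate` function, the `scripted` stand-in model, and the parameter names are hypothetical and exist only for this example; the safety-interrupt row is not modeled.

```python
def generate(prompt, next_token, max_new_tokens=20, stop_strings=(), end_token="</s>"):
    # next_token is any callable that maps the text so far to one new token.
    text = prompt
    for _ in range(max_new_tokens):
        token = next_token(text)
        if token == end_token:
            return text, "end_token"
        text += token
        for stop in stop_strings:
            # Search only past the prompt, so a stop string in the prompt is ignored.
            index = text.find(stop, len(prompt))
            if index != -1:
                return text[:index], "stop_string"
    return text, "max_new_tokens"

# Hypothetical scripted "model": replays a fixed list of tokens, then the end token.
def scripted(tokens):
    queue = list(tokens)
    return lambda _text: queue.pop(0) if queue else "</s>"

reply = [" Your", " parcel", " is", " delayed", ".", "\n\nUser:", " thanks"]
print(generate("Agent:", scripted(reply)))
print(generate("Agent:", scripted(reply), stop_strings=("\n\nUser:",)))
print(generate("Agent:", scripted(reply), max_new_tokens=3))
```

Returning the reason alongside the text makes it easy for the calling application to tell a natural ending from a truncated one.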

Picking a token from the distribution

The loop says "sample a token," but how? The model hands you a distribution; turning it into a single chosen token is the job of a decoding strategy. The simplest is greedy decoding: always take the highest-probability token. It is deterministic, but it can fall into loops, repeating "We apologize for the inconvenience. We apologize for the inconvenience..." because "apologize" keeps scoring highest after "We."

The alternative is to sample: draw a token at random in proportion to its probability, so the second- or third-ranked token sometimes wins. Sampling can reduce repetition and add variety, but careless sampling can also hurt quality. The temperature knob from earlier controls how adventurous that draw is. Greedy versus sampling is the core fork; the full toolkit of decoding methods (and how to tune them) gets a dedicated chapter later, so here we only need the idea that one distribution can be turned into text in more than one way.

Common mistake: Setting a high temperature in a customer-facing support bot and wondering why answers drift off-brand, or setting temperature to 0 in a creative app and wondering why every user gets the same poem. At each step, the current context produces a distribution; the decoding strategy decides how much you gamble on it.

Use a seeded random generator when testing a sampler, so the result is repeatable.

greedy-versus-sampling.py
```python
import numpy as np

tokens = np.array(["warehouse", "hub", "carrier"])
logits = np.array([2.4, 1.7, 0.2])

def probabilities(temperature):
    # Softmax with temperature; subtracting the max keeps exp() numerically stable.
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Seeded generator so the sampled list is repeatable.
rng = np.random.default_rng(4)
greedy = tokens[int(logits.argmax())]
samples = [str(rng.choice(tokens, p=probabilities(1.4))) for _ in range(8)]

print("greedy:", greedy)
print("sampled at temperature 1.4:", samples)
print("probabilities:", np.round(probabilities(1.4), 3).tolist())
```
Output
```
greedy: warehouse
sampled at temperature 1.4: ['carrier', 'warehouse', 'carrier', 'warehouse', 'hub', 'warehouse', 'hub', 'warehouse']
probabilities: [0.551, 0.334, 0.115]
```

The context window limit

Because the output is constantly being appended to the input, the text the model has to read keeps growing. If you generate a 1,000-token support answer, then at the final step the model is reading the original prompt plus the 999 tokens it has already generated.

This growing sequence must fit into the model's context window, its short-term memory limit. If the context window is 8,000 tokens and the prompt plus generated text reaches 8,001 tokens, the system has to truncate earlier context, reject the request, summarize, or use a longer-context strategy.
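As a rough sketch of the simplest of those options, the function below drops the oldest tokens once the sequence exceeds the window. `fit_to_window` is a hypothetical helper, and the token IDs are made up; real applications often protect system instructions from truncation, which this sketch does not do.

```python
def fit_to_window(prompt_tokens, generated_tokens, context_window):
    """Keep only the most recent tokens when prompt + output exceed the window.

    Returns (kept_tokens, number_of_tokens_dropped).
    """
    sequence = prompt_tokens + generated_tokens
    overflow = len(sequence) - context_window
    if overflow <= 0:
        return sequence, 0
    return sequence[overflow:], overflow

prompt = [0, 1, 2, 3, 4, 5]        # six hypothetical prompt token IDs
generated = [100, 101, 102]        # three generated token IDs

kept, dropped = fit_to_window(prompt, generated, context_window=8)
print(kept, dropped)
```

With a window of 8 and a 9-token sequence, exactly one token (the oldest prompt token) falls out of view, which is why earlier instructions can silently disappear in long sessions.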

What the KV cache changes

Without a cache, the model would recompute keys and values for the full prompt and for every previously generated token at each generation step. That repeated work grows with sequence length and is wasteful.

The KV cache stores key and value tensors from earlier tokens at each attention layer. New decode queries can attend to these cached tensors instead of rebuilding earlier keys and values at every step.[6]

Serving systems usually split this into two phases:

  • Prefill: process the full prompt in parallel and build the initial KV cache.
  • Decode: generate one new token at a time, appending each token and reusing the cache.
KV-cache diagram showing four prompt tokens written to key and value rows together during prefill, then three decode steps appending one highlighted cache column at a time; memory bars compare 1 GiB at 2,048 tokens with 16 GiB at 32,768 tokens.
Prefill writes key and value columns for the full prompt in parallel. Each decode step reuses existing columns and appends one new column. With the example settings below, increasing sequence length 16× grows KV-cache memory from 1 GiB to 16 GiB.
| Step | Naive work | With KV cache |
| --- | --- | --- |
| First prompt pass | Read all prompt tokens. | Read all prompt tokens. |
| Generate token 1 | Recompute prompt and token 1. | Compute token 1, reuse prompt cache. |
| Generate token 2 | Recompute prompt, token 1, token 2. | Compute token 2, reuse earlier cache. |
| Long answer | Work keeps repeating old context. | Memory grows, but old computation is reused. |

This is why serving systems care so much about KV cache memory. The cache speeds generation, but it also consumes GPU memory proportional to context length, number of layers, key/value-head count, head dimension, and batch size.
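The prefill and decode phases can be sketched with a single attention head in NumPy. This is a simplified illustration under stated assumptions: one layer, one head, random weights, and random vectors standing in for token embeddings. The names `attend` and `last_output_naive` are hypothetical. The assertion checks that the cached path gives the same output as recomputing every key and value from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hypothetical head dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all visible positions."""
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def last_output_naive(x):
    """Recompute keys and values for every token, then attend from the last one."""
    return attend(x[-1] @ W_q, x @ W_k, x @ W_v)

prompt = rng.normal(size=(4, d))      # four prompt-token embeddings
new_tokens = rng.normal(size=(3, d))  # three decode-step embeddings

# Prefill: build keys and values for the whole prompt in one batched matmul.
k_cache, v_cache = prompt @ W_k, prompt @ W_v

sequence = prompt
for x_new in new_tokens:
    # Decode: compute K and V only for the new token and append one row each.
    k_cache = np.vstack([k_cache, x_new @ W_k])
    v_cache = np.vstack([v_cache, x_new @ W_v])
    cached = attend(x_new @ W_q, k_cache, v_cache)

    sequence = np.vstack([sequence, x_new])
    assert np.allclose(cached, last_output_naive(sequence))

print(k_cache.shape)  # (7, 4): four prefill rows plus three decode rows
```

Each decode step does one new key and one new value projection, while the cache itself grows by one row per token. That growth is the memory cost the estimate in the next section measures.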

Production note: If a model seems cheap at 4K context but expensive at 128K context, check KV cache memory before blaming the model weights. Long context changes serving economics.

Estimate cache memory

A simplified cache estimate makes the tradeoff concrete. For each request, layer, token, and key/value head, the server stores one key vector and one value vector. This example uses bfloat16 or float16 values at two bytes each.

kv-cache-memory-budget.py
```python
batch = 4
layers = 32
kv_heads = 8
head_dim = 128
bytes_per_value = 2  # bfloat16 or float16

def cache_gib(sequence_length):
    # Final factor of 2: one key vector plus one value vector.
    values = batch * layers * sequence_length * kv_heads * head_dim * 2
    return values * bytes_per_value / (1024 ** 3)

for sequence_length in [2_048, 32_768]:
    print(f"{sequence_length:>6,} tokens: {cache_gib(sequence_length):.2f} GiB")
```
Output
```
 2,048 tokens: 1.00 GiB
32,768 tokens: 16.00 GiB
```

This is only the KV cache, not model weights, activations, allocator overhead, or batching metadata. It also shows why fewer key/value heads can reduce cache cost.
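To see the key/value-head effect, the same simplified formula can be rerun with only `kv_heads` changed. The head counts 32 and 1 are hypothetical comparison points chosen for illustration (they are not taken from any specific model); every other setting matches the example above at 2,048 tokens.

```python
def cache_gib(kv_heads, sequence_length=2_048, batch=4, layers=32,
              head_dim=128, bytes_per_value=2):
    """Simplified KV-cache size in GiB; the final factor of 2 covers keys and values."""
    values = batch * layers * sequence_length * kv_heads * head_dim * 2
    return values * bytes_per_value / (1024 ** 3)

for kv_heads in [32, 8, 1]:
    print(f"{kv_heads:>2} KV heads: {cache_gib(kv_heads):.3f} GiB")
# 32 KV heads: 4.000 GiB
#  8 KV heads: 1.000 GiB
#  1 KV heads: 0.125 GiB
```

Under this estimate cache memory scales linearly with the number of key/value heads, so sharing keys and values across query heads shrinks the cache without changing sequence length or batch size.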


How can next-token prediction lead to useful capabilities?

A common critique of LLMs is that they're only autocomplete.

If it's autocomplete, how can it write code, solve logic puzzles, or translate languages?

The answer lies in scale, diversity, and optimization. When a neural network predicts the next token across huge mixtures of code, math, dialogue, documentation, and books, simple word association often isn't enough to minimize the loss function.

Worked prompt

Consider this prompt: "An order contains 5 items. After a return of 2 items, the retained item count is"

| Model family | Likely shortcut | Better internal feature |
| --- | --- | --- |
| Markov-style autocomplete | Nearby token pattern like 2, 3 | None; it mostly follows local statistics. |
| Large language model | Still predicts one token | A subtraction-like representation helps reduce loss across many examples. |

A simple Markov chain might look at 2 and guess 3 because 2, 3 is common. A larger neural language model can benefit from features that track subtraction-like structure across many related examples, because those features can improve next-token predictions. This is a pressure from the objective, not a guarantee of correct arithmetic.

To predict text from a complex world, the model has pressure to learn useful approximations of that world. Grammar, facts, reasoning patterns, and code structure can become helpful internal representations for minimizing next-token prediction loss.

What this explanation does and doesn't claim

Next-token prediction explains the training signal. It doesn't prove that every model has human-like understanding.

Good language here is careful:

| Weak phrasing | Better phrasing |
| --- | --- |
| "The model understands the world." | "The model can learn internal features that track useful structure in text and code." |
| "The model only memorizes text." | "Memorization can happen, but broad next-token loss rewards reusable patterns too." |
| "Reasoning is guaranteed to emerge." | "Reasoning-like behavior can emerge when it helps reduce loss across many tasks." |
| "Autocomplete can't do math." | "Local autocomplete can't do much math, but large neural language models can learn arithmetic-like features." |

This matters in system design and technical discussions. If you say "it's just next-token prediction," you miss learned structure. If you say "it's basically a human mind," you overclaim. Large generatively pretrained models have shown useful task behavior, but each capability still needs measurement on the task that matters.[1][7]


Practice: trace one completion

Take this prompt:

A shipment event JSON record starts with

Answer these questions before looking down:

  1. What are three plausible next tokens?
  2. Which token is likely highest probability?
  3. What would happen if the model outputs {?
  4. Why might a newline or quote still be possible?

One reasonable answer:

| Question | Answer sketch |
| --- | --- |
| Plausible tokens | {, a, the, or a newline. |
| Highest probability | Usually {, because a JSON object starts with an opening brace. |
| After { | The brace gets appended, and the next prediction conditions on it. |
| Why alternatives exist | Natural-language explanations and formatted examples can start differently. |

Now connect that to code generation. A model that has seen many shipment-event JSON examples can learn syntax regularities: opening braces, quoted keys, colons, commas, values, and closing braces. The same principle applies to Python indentation, SQL clauses, and Markdown tables.


Common failure cases

Treating probabilities as facts

The model's top token can be wrong.

Suppose a support bot sees:

The refund policy says annual plans are

A model might assign high probability to non-refundable because that phrase is common in policy text. But if your company's actual policy says annual plans are refundable within 14 days, the statistically likely continuation is still wrong.

The fix isn't to hope the base model "knows" your policy. Put the trusted policy into context, retrieve the right document, or use a tool that checks the source of truth.

| Symptom | Cause | Fix |
| --- | --- | --- |
| Plausible but false answer | Training distribution dominates missing local facts. | Add retrieved source text. |
| Old policy repeated | Model has stale or generic priors. | Query current docs or database. |
| Confident unsupported claim | Probability isn't evidence. | Require citations or source-grounded output. |
| Long answer drifts | Context fills with generated text. | Re-anchor with instructions and source snippets. |
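Putting trusted policy text into context can be as simple as assembling the prompt from source snippets. The sketch below assumes the right snippets have already been retrieved; `build_grounded_prompt` and the prompt wording are hypothetical, and the policy sentence reuses the 14-day example from above.

```python
def build_grounded_prompt(question, policy_snippets):
    """Place trusted policy text in context ahead of the question."""
    sources = "\n".join(
        f"[{i}] {snippet}" for i, snippet in enumerate(policy_snippets, start=1)
    )
    return (
        "Answer using only the policy excerpts below. "
        "If they do not cover the question, say so.\n\n"
        f"Policy excerpts:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "Are annual plans refundable?",
    ["Annual plans are refundable within 14 days."],  # example policy text
)
print(prompt)
```

Numbering the excerpts gives the model something concrete to cite, which makes unsupported claims easier to spot in review. It does not guarantee the answer is faithful to the source, so grounded output still needs evaluation.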

The "database" myth

Many beginners assume an LLM contains a hidden database of facts that it looks up during generation. It doesn't. When a model completes "Orders marked delivered are eligible for" with "refund", it isn't querying the current returns-policy table. It's producing a likely continuation from its context and learned parameters.

This distinction matters because:

  1. Facts can be outdated. Information stored in model parameters reflects training data. Without fresh context or tools, the model won't know about later events, policy changes, or catalog updates.
  2. Facts can be wrong. If the training data contains conflicting information, the model may reproduce the most common version, even if it's inaccurate.
  3. Facts can be hallucinated. When the model is uncertain, it may generate a plausible-sounding but false continuation rather than admitting ignorance.

The correct mental model is statistical generation, not retrieval. If you need accurate, up-to-date facts, you must ground the model with retrieval, tools, or explicit context.


Summary

  • Autoregressive language models output a probability distribution over a vocabulary for the next token; the probability of a whole sequence is the product of these per-token predictions (the chain rule).
  • Causal language modeling uses earlier context without peeking at future tokens.
  • Teacher forcing trains the model on the true previous tokens, which lets every position be scored in one parallel forward pass; generation instead feeds the model its own outputs.
  • The training objective is cross-entropy loss, the averaged negative log-probability of the true next token. Perplexity is exp(loss): the effective number of choices the model faces per step.
  • Autoregressive generation is a loop where the predicted token is appended to the input, and the process repeats.
  • Bigram, neural n-gram MLP, recurrent, and Transformer language models share the same next-token target; they differ in how much context they can use and how they represent it.
  • A decoding strategy turns the distribution into a token. Greedy is deterministic; sampling (tuned by temperature) adds variety. The full toolkit comes in a later chapter.
  • The context window is the maximum number of tokens the model can hold in this input loop; the KV cache trades memory for speed during generation.
  • Useful capabilities can appear because reusable structure about language, code, and tasks can help reduce next-token prediction error at scale, but they still need evaluation.
  • High probability doesn't mean factual truth. LLMs generate, they don't retrieve from a hidden database.
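As a quick numeric recap of the first and fourth bullets, the sketch below ties the chain rule, cross-entropy, and perplexity together. The four probabilities are hypothetical values that a model might assign to the true next tokens of one short sequence.

```python
import math

# Hypothetical probabilities the model assigned to each true next token.
token_probs = [0.9, 0.5, 0.8, 0.25]

# Chain rule: the sequence probability is the product of per-token conditionals.
sequence_prob = math.prod(token_probs)

# Cross-entropy: mean negative log-probability of the true tokens (in nats).
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(loss), the inverse geometric mean of the token probabilities.
perplexity = math.exp(loss)
assert math.isclose(perplexity, sequence_prob ** (-1 / len(token_probs)))

print(f"P(sequence)={sequence_prob:.3f}  loss={loss:.3f}  perplexity={perplexity:.3f}")
```

The assertion makes the link explicit: a lower sequence probability over the same number of tokens always means a higher loss and a higher perplexity.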

Mastery check

Key concepts

  • Language-model objective: next-token prediction
  • Chain-rule factorization of sequence probability
  • Bigram and neural n-gram models share the next-token target
  • Teacher forcing: training on true history in one parallel pass
  • Cross-entropy loss as averaged negative log-likelihood
  • Perplexity equals exp(loss): effective choices per token
  • Autoregressive generation loop
  • Causal vs masked language modeling (GPT vs BERT)
  • Context window limits and KV-cache serving tradeoffs
  • The database myth: LLMs generate instead of retrieving facts

Evaluation rubric

  • Foundational: Defines an autoregressive language model in terms of a probability distribution over the next token
  • Foundational: Writes sequence probability as a product of per-token conditionals (chain rule)
  • Foundational: Explains how a bigram table estimates P(next token | previous token) from counts
  • Foundational: Explains the autoregressive generation loop step-by-step
  • Foundational: Demonstrates how temperature rescales logits and changes the output distribution
  • Intermediate: Describes how a neural n-gram MLP uses embeddings, concatenation, matmul, GELU, and an output projection to predict the next token
  • Intermediate: Explains teacher forcing and why training scores all positions in one parallel pass while generation can't
  • Intermediate: Connects cross-entropy loss to perplexity via perplexity = exp(loss) and interprets perplexity as effective choices per token
  • Intermediate: Contrasts causal (left-to-right) modeling with masked (fill-in-the-blank) modeling
  • Intermediate: Distinguishes greedy decoding from sampling as ways to turn one distribution into a token
  • Intermediate: Discusses why predicting the next token can reward useful world representations
  • Intermediate: Explains why KV cache changes generation speed and serving memory trade-offs
  • Intermediate: Identifies why high-probability completions can still be factually wrong without retrieved context
  • Intermediate: Corrects the database myth by explaining that LLMs generate rather than retrieve facts

Common pitfalls

  • Symptom: A creative demo keeps giving nearly identical completions. Cause: Decoding is too deterministic, often from greedy decoding or very low temperature. Fix: Sample from the distribution and raise temperature carefully instead of always taking the top token.
  • Symptom: Training loss looks good, but long generations drift after one bad token. Cause: Teacher forcing was confused with real generation, so evaluation never tested the model feeding on its own outputs. Fix: Run autoregressive generation during eval, not loss-only checks.
  • Symptom: A model answers a policy question confidently but cites the wrong company rule. Cause: High next-token probability was treated like evidence, even though the model is generating from patterns instead of querying a live source. Fix: Retrieve the current policy or call a tool before trusting the answer.
  • Symptom: Generation gets slower or more expensive as outputs get longer. Cause: The key-value cache saves recomputation but still grows with context length, layers, and batch size. Fix: Watch context budgets and cache memory, not only model weight size.
  • Symptom: Someone claims perplexity and cross-entropy disagree. Cause: They forgot perplexity is exp(loss) under natural-log averaging, or compared runs with different token accounting. Fix: Hold tokenizer, evaluated tokens, and averaging convention constant before comparing them.
  • Symptom: Bigram tables, neural n-grams, and GPT-style models feel unrelated. Cause: Architecture changes were mistaken for objective changes. Fix: Trace all three back to the same job: predict the next token from context.

Mastery Check

1. A tiny causal language model assigns these next-token probabilities for the sequence <bos> package left hub <eos>: P(package | <bos>) = 0.6, P(left | <bos> package) = 0.8, P(hub | <bos> package left) = 0.25, and P(<eos> | <bos> package left hub) = 0.5. What probability does it assign to the whole sequence after <bos>?
2. An unsmoothed bigram table observes left -> warehouse 3 times, left -> hub 2 times, and left -> carrier once, but never observes left -> depot. Which result and model change are correct?
3. One system must generate a reply from left to right. Another must predict a hidden word using text on both sides of the blank. Which objectives fit these tasks?
4. For the training sequence <bos> package left hub <eos>, a decoder input is shifted right. Which input-target pairing and speed implication are correct?
5. At four positions, a model assigns the true next token probabilities 0.5, 0.5, 0.5, and 0.125. Using natural logs, what are its mean cross-entropy and perplexity?
6. A support-log model has mean cross-entropy loss 0.69 on a held-out support dataset, so its perplexity is about exp(0.69) = 2.0. It still answers a question about today's refund policy with an old rule. What conclusion and fix follow?
7. A creative-writing demo uses greedy decoding, which always takes the highest-probability next token, so every user gets nearly the same safe paragraph. The product goal is more variety while still respecting the model's next-token probability distribution. Which decoding change best fits this goal?
8. An application has an 8,000-token context window, starts with a 7,600-token prompt, and retains every generated token without truncation or summarization. What happens after it generates 400 tokens?
9. In the simplified KV-cache estimate, a request batch uses 1.00 GiB of cache at 2,048 tokens. All other settings stay fixed and sequence length grows to 32,768 tokens. What should the serving team expect?
10. A language model improves on unfamiliar subtraction prompts after training on diverse math, code, and text examples. Which conclusion is justified by the next-token objective?

Next Step
Continue to From GPT to Modern LLMs

You can now explain the objective, loss, and generation loop. Next you'll see how GPT and later model families turn that same loop into modern assistants and tools.

References

[1] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[2] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
[3] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[4] Shannon, C. E. (1948). A Mathematical Theory of Communication.
[5] Jelinek, F., et al. (1977). Perplexity — a Measure of the Difficulty of Speech Recognition Tasks.
[6] Pope, R., et al. (2023). Efficiently Scaling Transformer Inference. arXiv preprint.
[7] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.