Learn how next-token prediction becomes a trainable language model, from bigram counts and neural n-grams to causal Transformer generation and KV-cache serving.
The Transformer chapter ended with logits and softmax: a vector of scores becomes a probability distribution over possible next tokens. This chapter explains what that distribution is used for during language modeling.
At the core, a large language model (LLM) does one repeated job: it predicts the next token. This article builds that job from counts into a trainable objective, then shows how a causal Transformer uses it during generation.
Key idea: An LLM doesn't generate a whole paragraph at once. It predicts a single token, appends it to its input, and runs the process again. This is called autoregressive generation.
An autoregressive language model is a probability engine for token sequences: given the tokens so far, it assigns a probability distribution over the vocabulary for what comes next.
Think of it like a warehouse routing system that must choose the next destination for a package. It doesn't treat every destination as equally likely; it ranks them using the package's origin, weight, and delivery promise.
A useful way to make this precise: the probability of a whole sequence factorizes into a product of next-token predictions, one per position. This is just the chain rule of probability.
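Written out for a sequence of tokens $x_1, \dots, x_T$ (the symbols are notation assumed here, chosen to match the $x_t$ used later in the chapter), the factorization looks like this, with an empty context for the first token:

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```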
Read it left to right: the chance of the first token, times the chance of the second given the first, times the chance of the third given the first two, and so on. Each factor is one next-token distribution. That single building block, "distribution over the next token given everything before it," is the entire job. Everything else in this chapter (generation, training, perplexity) is built on it.
Suppose a model receives this prompt:
"The package left the"
Its actual probabilities depend on its tokenizer, weights, and preceding context. In a simplified example, it might rank four possible continuations like this:
| Candidate token | Probability | What the reader should notice |
|---|---|---|
| warehouse | 85% | Greedy decoding would choose this; sampling makes it the likeliest choice. |
| hub | 10% | Plausible alternatives remain available when sampling. |
| carrier | 3% | Lower-ranked words can appear with higher temperature. |
| coupon | 0.0001% | A commerce-related token can still be unlikely in this position. |
This table is only an excerpt. Tokens omitted from the table receive the remaining probability mass.
The table above isn't a list of words the model "knows." It's the model's next-step belief after reading the prompt.
That distinction matters because the same model can behave differently after one extra token.
| Prompt so far | Likely next token | Why |
|---|---|---|
| The package left the | warehouse | Common logistics completion. |
| The package left the regional | hub | New context changes the expected location. |
| The package left the checkout | page | Product context pulls the distribution toward software words. |
| The package left the (plus a style instruction for a customer email) | warehouse or carrier | Task instructions can change likely continuations. |
The model doesn't store one fixed answer for "the package left the." It computes a distribution from the whole context. Change the context, and you change the probabilities.
Try the mental move:

```text
The carrier scan says the parcel was
```

Possible continuations:

```text
delivered
delayed
rerouted
```

All three are plausible. A model doesn't need certainty to generate text. It needs a probability ranking, then the decoding algorithm picks from that ranking.
A production vocabulary typically contains tens of thousands to a few hundred thousand tokens, with the exact size set by the model's tokenizer. That's too many for a beginner example, so shrink the world to five tokens.
Imagine this prompt:
The support bot answered
The model produces logits, which are raw scores before softmax:
| Token | Logit | Meaning |
|---|---|---|
| correctly | 3.0 | Strong continuation. |
| slowly | 1.5 | Plausible but less likely. |
| angrily | -1.0 | Grammatically possible, semantically odd. |
| the | -2.0 | Usually wrong after this prompt. |
| </s> | -3.0 | Ending now is unlikely. |
Softmax turns those scores into probabilities:
One line in the code deserves explanation before you read it: logits - logits.max().
Why subtract the maximum logit? Because softmax only cares about differences between logits, not their absolute level. If you subtract the same constant from every score, the ranking and final probabilities stay the same. But the numerics get much safer: the largest shifted logit becomes 0, so its exponential is e^0 = 1 instead of some huge number. On tiny toy logits like [3.0, 1.5, -1.0, ...] this barely matters. On real model logits it prevents overflow and keeps the computation stable.
```python
import numpy as np

tokens = ["correctly", "slowly", "angrily", "the", "</s>"]
logits = np.array([3.0, 1.5, -1.0, -2.0, -3.0])

# Subtract max for numerical stability, then exponentiate
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

for token, prob in zip(tokens, probs):
    print(f"{token:10s} {prob:.3f}")
print(f"sum        {probs.sum():.3f}")
print("top token ", tokens[int(np.argmax(probs))])
```

```text
correctly  0.800
slowly     0.178
angrily    0.015
the        0.005
</s>       0.002
sum        1.000
top token  correctly
```

The exact numbers are less important than the shape: one token takes most of the probability mass, a plausible alternative keeps a meaningful share, and the unlikely tokens stay small but above zero.
That's next-token prediction in miniature.
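The stability argument for subtracting the maximum can be checked directly. The sketch below is a minimal illustration, not part of the original cell: the helper names `softmax_naive` and `softmax_stable` and the `+ 1000.0` offset are assumptions chosen to force float64 overflow.

```python
import numpy as np

def softmax_naive(logits):
    # Direct definition: exponentiate, then normalize.
    exp = np.exp(logits)
    return exp / exp.sum()

def softmax_stable(logits):
    # Same distribution, but the largest exponent is exp(0) = 1.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

small = np.array([3.0, 1.5, -1.0, -2.0, -3.0])
large = small + 1000.0  # hypothetical: same gaps, much larger absolute level

print("same result on small logits:",
      np.allclose(softmax_naive(small), softmax_stable(small)))
print("shift leaves probabilities unchanged:",
      np.allclose(softmax_stable(small), softmax_stable(large)))

# exp(1003) overflows float64 to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print("naive softmax on large logits:", naive_large)
```

Both checks print `True`, while the naive version returns only `nan` values on the shifted logits. That is the failure the subtraction is there to prevent.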
The smallest context-aware language model is a bigram model. It looks at one previous token and predicts the next token from a table of counts:
```text
P(next_token | previous_token)
```

If the support-log corpus often contains "left -> warehouse", the row for "left" assigns high probability to "warehouse". If it rarely contains "left -> coupon", that probability stays tiny.
| Previous token | Candidate next token | Count | Probability |
|---|---|---|---|
| left | warehouse | 85 | 0.85 |
| left | hub | 10 | 0.10 |
| left | carrier | 5 | 0.05 |
That table already teaches the core language-modeling contract: read context, produce a probability distribution over the next token. Its weakness is context length. A bigram model can't tell the difference between "package left the" and "customer left the", because it only sees the final token.
Start with six tracking-message fragments. Each adjacent token pair becomes one training event: after left, which token appeared next?
```python
from collections import Counter, defaultdict

messages = [
    "package left warehouse",
    "package left warehouse",
    "parcel left hub",
    "parcel left warehouse",
    "order left hub",
    "order left carrier",
]

# counts[previous][current] = how often `current` followed `previous`
counts = defaultdict(Counter)
for message in messages:
    tokens = ["<bos>", *message.split(), "<eos>"]
    for previous, current in zip(tokens, tokens[1:]):
        counts[previous][current] += 1

row = counts["left"]
total = sum(row.values())
for token, count in row.most_common():
    print(f"P({token:9s} | left) = {count}/{total} = {count / total:.3f}")
print("unseen destination probability =", row["depot"] / total)
```

```text
P(warehouse | left) = 3/6 = 0.500
P(hub       | left) = 2/6 = 0.333
P(carrier   | left) = 1/6 = 0.167
unseen destination probability = 0.0
```

An unseen pair receives probability zero in this unsmoothed table. More data or smoothing helps, but a count table still can't share what it learned about related contexts.
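One common form of smoothing is add-one (Laplace) smoothing, which adds a small constant to every count before normalizing. The sketch below is an illustration under assumptions: the four-token `vocabulary` and the `add_k` parameter are hypothetical, and the counts are copied from the "left" row above.

```python
from collections import Counter

# Counts for the row "left", copied from the output above.
row = Counter({"warehouse": 3, "hub": 2, "carrier": 1})
# Hypothetical vocabulary of destinations, including one unseen token.
vocabulary = ["warehouse", "hub", "carrier", "depot"]

def smoothed_probability(token, add_k=1.0):
    # Add add_k to every count so unseen tokens keep a nonzero share.
    total = sum(row.values()) + add_k * len(vocabulary)
    return (row[token] + add_k) / total

for token in vocabulary:
    print(f"P({token:9s} | left) = {smoothed_probability(token):.3f}")
```

With these counts, "depot" moves from 0.0 to 0.100 and the other three rows shrink to make room. Smoothing removes the hard zero, but it still treats every context as an isolated row.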
An n-gram model predicts the next token from the previous n - 1 tokens:
```text
P(next_token | previous n - 1 tokens)
```

For tiny corpora, count tables still work. For real language, the table explodes because most exact contexts are rare. A neural n-gram model addresses that sparsity by learning embeddings and a small MLP (multi-layer perceptron) over the previous n - 1 tokens:

```text
previous n - 1 token IDs -> embedding table -> joined embeddings -> matmul + GELU -> logits -> softmax
```

This is the bridge from count-based autocomplete to modern LLMs. The objective stays next-token prediction. The model family changes: bigram table, neural n-gram MLP, recurrent network, then Transformer. Each step gives the model a better way to use context.
The next cell trains a small two-token-context, or trigram, model. It isn't a Transformer, but it exposes the same contract: embeddings and shared weights produce logits, and cross-entropy rewards the true next token.
```python
import torch
from torch import nn

torch.manual_seed(7)
sentences = [
    "package left regional hub",
    "parcel left regional hub",
    "order left regional hub",
    "package left local warehouse",
    "parcel left local warehouse",
    "order left local warehouse",
]
vocab = sorted({word for sentence in sentences for word in sentence.split()})
to_id = {word: index for index, word in enumerate(vocab)}

# Each training example: two context token IDs -> the ID of the next token
contexts, targets = [], []
for sentence in sentences:
    words = sentence.split()
    for index in range(2, len(words)):
        contexts.append([to_id[words[index - 2]], to_id[words[index - 1]]])
        targets.append(to_id[words[index]])

x = torch.tensor(contexts)
y = torch.tensor(targets)
model = nn.Sequential(
    nn.Embedding(len(vocab), 6),   # (batch, 2) -> (batch, 2, 6)
    nn.Flatten(),                  # join the two embeddings -> (batch, 12)
    nn.Linear(12, 16),
    nn.GELU(),
    nn.Linear(16, len(vocab)),     # logits over the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

with torch.no_grad():
    initial_loss = nn.functional.cross_entropy(model(x), y).item()
for _ in range(120):
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

query = torch.tensor([[to_id["left"], to_id["regional"]]])
with torch.no_grad():
    final_loss = nn.functional.cross_entropy(model(x), y).item()
    predicted_id = int(model(query).argmax(dim=-1).item())

print(f"loss before training: {initial_loss:.3f}")
print(f"loss after training:  {final_loss:.3f}")
print("after 'left regional':", vocab[predicted_id])
```

```text
loss before training: 2.121
loss after training:  0.347
after 'left regional': hub
```

The tiny corpus is deliberately simple, so this cell demonstrates optimization rather than a production-quality model. The loss levels off near 0.347 instead of zero because half of the twelve training examples have a context such as "package left", which is followed by "regional" and "local" equally often; the best possible loss on those is ln(2) ≈ 0.693, and averaging with the six predictable examples gives about 0.347. The important bridge is that a neural model can use one set of learned parameters across contexts instead of storing an isolated probability row for every exact phrase.
So far we've ranked likely tokens and used the highest-probability one as the default. In practice, many applications sample from the distribution instead. A temperature parameter rescales the logits before softmax, giving the model more or less freedom to deviate from the top choice.
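In symbols (notation assumed here: $z_i$ is the logit of token $i$ and $\tau$ is the temperature), the rescaled softmax is:

```latex
p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}, \qquad \tau > 0
```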
Why do we divide by temperature instead of adding or subtracting something? Because dividing changes the gap between logits. A temperature below 1 stretches the gaps, so the winner pulls farther ahead after softmax. A temperature above 1 compresses the gaps, so lower-ranked tokens stay more competitive. Adding a constant would not do that; softmax would cancel it out. The formula requires a temperature above 0. Treat a zero-temperature mode as greedy decoding in code instead of dividing logits by zero.
Here's the same logits at three temperatures:
```python
import numpy as np

tokens = ["correctly", "slowly", "angrily", "the", "</s>"]
logits = np.array([3.0, 1.5, -1.0, -2.0, -3.0])

for temperature in [0.5, 1.0, 2.0]:
    scaled = logits / temperature          # rescale gaps between logits
    exp = np.exp(scaled - scaled.max())    # stable softmax
    probs = exp / exp.sum()
    print(f"\nTemperature = {temperature}")
    for token, prob in zip(tokens, probs):
        print(f"  {token:10s} {prob:.3f}")
    print(f"  sum        {probs.sum():.3f}")
    print(f"  top token  {tokens[int(np.argmax(probs))]}")
```

```text
Temperature = 0.5
  correctly  0.952
  slowly     0.047
  angrily    0.000
  the        0.000
  </s>       0.000
  sum        1.000
  top token  correctly

Temperature = 1.0
  correctly  0.800
  slowly     0.178
  angrily    0.015
  the        0.005
  </s>       0.002
  sum        1.000
  top token  correctly

Temperature = 2.0
  correctly  0.575
  slowly     0.272
  angrily    0.078
  the        0.047
  </s>       0.029
  sum        1.000
  top token  correctly
```

At low temperature (0.5), the distribution sharpens. "correctly" dominates almost completely. This is close to greedy behavior: safe, repetitive, and predictable.
At high temperature (2.0), the distribution flattens. "slowly" rises from 0.178 to 0.272, and even odd words like "angrily" gain real probability (0.015 to 0.078). This introduces variety, but also raises the risk of incoherent output.
Production note: Low temperature keeps a support bot consistent and on-brand; higher temperature produces more varied drafts for creative tools. Temperature is one lever among several for shaping generation; a later chapter covers the full decoding toolkit.
This specific flavor of language modeling, reading from left to right and predicting the future, is called causal language modeling (or autoregressive modeling). GPT-style models use this objective.[1][2]
It's called "causal" because the model is bound by causality: each position can use earlier context, but it can't peek at future tokens. In a Transformer block, a token position may attend to itself and earlier positions; the mask blocks later positions.
GPT-style decoder models are causal language models.
| Objective | Can read | Good for | Poor fit |
|---|---|---|---|
| Causal language modeling | Earlier context, no future tokens | Generating text left to right | Filling a middle blank with future context |
| Masked language modeling | Left and right context | Classification, extraction, sentence understanding | Open-ended generation |
(Note: There's another type called masked language modeling, used by Bidirectional Encoder Representations from Transformers (BERT), where a token in the middle of a sentence is hidden and the model looks at both the left and right context to predict the blank.[3] That objective builds representations from both sides of a blank; the left-to-right loop in this chapter instead needs causal modeling.)
We keep saying the model "predicts" the next token. Where does that skill come from? From training on text where the next token is already known. Every sentence in the training data is a free set of labeled examples: at each position, the correct answer is simply the token that actually came next.
During training the model is fed the real previous tokens at every position, not the tokens it would have guessed. This is called teacher forcing. The model never gets to wander off on its own mistakes while learning; each position is scored against ground truth using the true history.
This distinction matters whenever you interpret a training loss or measure generation latency:
| Phase | What feeds position t | Why |
|---|---|---|
| Training (teacher forcing) | The true tokens x_1 ... x_{t-1} from the dataset | Stable targets; all positions can be scored in one parallel forward pass. |
| Generation (autoregressive) | The model's own previously sampled tokens | No ground truth exists; the model must build on what it produced. |
Because the true history is known up front, training scores every position of a sequence in a single forward pass. The causal mask is what makes this honest: the logit produced from prefix x_1 ... x_{t-1} can't read its target token x_t.
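The effect of the causal mask can be seen in a toy single-head attention step. This is a minimal sketch under assumptions, not a real Transformer layer: the random `x` stands in for token representations, and the function `causal_attention` uses `x` directly as queries, keys, and values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
x = rng.normal(size=(seq_len, dim))  # hypothetical token representations

def causal_attention(x):
    # Toy attention: x serves as queries, keys, and values.
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))  # True = may attend
    scores = np.where(mask, scores, -np.inf)                # block later positions
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, mask

out, mask = causal_attention(x)
print(mask.astype(int))

# Replace the last token: outputs at earlier positions must not change.
x_changed = x.copy()
x_changed[-1] = rng.normal(size=dim)
out_changed, _ = causal_attention(x_changed)
print("earlier positions unchanged:", np.allclose(out[:-1], out_changed[:-1]))
```

The mask is lower-triangular, so row t only sees columns up to t. Swapping the final token leaves every earlier output untouched, which is why all positions can be scored in one pass without leaking targets.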
The side-by-side table above helps lock in why training is parallel while generation isn't.
For a decoder, one training sequence supplies several labeled predictions. The input is the sequence shifted right; the target is the same sequence shifted left.
| Context visible to model | Correct next token |
|---|---|
| <bos> | package |
| <bos> package | left |
| <bos> package left | hub |
| <bos> package left hub | <eos> |
```python
import numpy as np

vocab = {"<bos>": 0, "package": 1, "left": 2, "hub": 3, "<eos>": 4}
tokens = ["<bos>", "package", "left", "hub", "<eos>"]
ids = np.array([vocab[token] for token in tokens])
inputs = ids[:-1]   # everything except the last token
targets = ids[1:]   # everything except the first token

print("input token ids: ", inputs.tolist())
print("target token ids:", targets.tolist())
for position in range(len(inputs)):
    context = " ".join(tokens[: position + 1])
    print(f"{context:25s} -> {tokens[position + 1]}")
```

```text
input token ids:  [0, 1, 2, 3]
target token ids: [1, 2, 3, 4]
<bos>                     -> package
<bos> package             -> left
<bos> package left        -> hub
<bos> package left hub    -> <eos>
```

At each position the model outputs a distribution over the vocabulary. We want the probability it assigns to the true next token to be high. The loss for one position is the negative log of that probability, and the loss for a sequence is the average over positions. This is cross-entropy loss, the same objective introduced in the softmax chapter, now applied at every position.
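In symbols (notation assumed here: $T$ scored positions and $p_\theta$ for the model's predicted distribution), the average loss just described is:

```latex
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
```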
If the model gives the true token probability 1.0, that term contributes -log(1) = 0 loss: a perfect prediction. If it gives the true token a tiny probability, -log of a small number is large: a big penalty. Causal pretraining minimizes this average over a large corpus.
The logits at every shifted position can be scored together. This cell calculates stable cross-entropy for the four targets above.
```python
import numpy as np

targets = np.array([1, 2, 3, 4])  # package, left, hub, <eos>
logits = np.array([
    [0.0, 3.0, 0.5, -1.0, -2.0],
    [-1.0, 0.1, 2.5, 0.0, -2.0],
    [-1.0, 0.0, 0.2, 2.0, -1.5],
    [-2.0, 0.0, -1.0, -0.5, 2.4],
])
# Row-wise stable softmax: subtract each row's max before exponentiating
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
# Pick out the probability assigned to the true token at each position
true_probs = probs[np.arange(len(targets)), targets]
losses = -np.log(true_probs)

print("true-token probabilities:", np.round(true_probs, 3).tolist())
print("per-token losses: ", np.round(losses, 3).tolist())
print(f"mean cross-entropy: {losses.mean():.3f}")
```

```text
true-token probabilities: [0.864, 0.824, 0.724, 0.839]
per-token losses:  [0.146, 0.194, 0.323, 0.175]
mean cross-entropy: 0.209
```

Cross-entropy is measured in nats (because we used natural log), which is hard to feel. Perplexity is the friendlier form: just exponentiate the average loss.
Perplexity has a clean interpretation: it's the effective number of equally likely tokens the model is choosing between at each step. A perplexity of 1 means the model is certain, and correct, every time. A perplexity of 50,000 on a 50,000-token vocabulary is what uniform guessing produces, so the model has learned nothing useful. Lower is better.[4][5]
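As a formula (notation assumed here: $\mathcal{L}$ is the average natural-log cross-entropy and $V$ is the vocabulary size), perplexity and its value under uniform guessing are:

```latex
\mathrm{PPL} = \exp(\mathcal{L}), \qquad
\mathcal{L}_{\text{uniform}} = -\log\frac{1}{V} = \log V
\;\Rightarrow\; \mathrm{PPL}_{\text{uniform}} = V
```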
Work a tiny example by hand. Suppose the model predicts four tokens, assigning the true token these probabilities: 0.5, 0.5, 0.5, 0.125.
```python
import numpy as np

# Probability the model gave to the token that was actually correct, per position
true_token_probs = np.array([0.5, 0.5, 0.5, 0.125])

per_position_loss = -np.log(true_token_probs)  # cross-entropy term per token
avg_loss = per_position_loss.mean()            # the training loss
perplexity = np.exp(avg_loss)                  # loss in "effective choices"

print(f"per-position loss: {np.round(per_position_loss, 4)}")
print(f"average loss:      {avg_loss:.4f}")
print(f"perplexity:        {perplexity:.4f}")

# Perplexity equals the reciprocal geometric mean of the true-token probabilities
geo_mean = np.prod(true_token_probs) ** (1 / len(true_token_probs))
print(f"1 / geometric mean:{1 / geo_mean:.4f}")
```

```text
per-position loss: [0.6931 0.6931 0.6931 2.0794]
average loss:      1.0397
perplexity:        2.8284
1 / geometric mean:2.8284
```

A perplexity of about 2.83 means that, averaged across these four steps, the model behaved as if it were choosing between roughly 2.83 equally likely options. The one bad prediction (0.125) drags the average up. That is the whole point of the metric: it tells you, in token units, how surprised the model was by real text.
When you fine-tune a model and watch the loss curve, you're watching average cross-entropy fall. When an eval report uses the same tokenizer, evaluated tokens, and natural-log averaging, perplexity is exp of that same loss. Hold those choices constant before treating a mismatch as a regression.
When you ask a chat model to write a shipping-delay update, it doesn't output the whole message in one step. It generates the message one token at a time in a loop.
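This is the autoregressive generation loop:
1. Run the model on the current token sequence.
2. Turn the logits at the final position into a next-token distribution.
3. Sample a token from that distribution.
4. Append the token to the sequence and repeat until a stop condition is met.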
This list describes the dependency between generated tokens, not an inefficient serving implementation. Production decode usually reuses earlier attention state through a key-value (KV) cache. It still performs one sequential decode step for each new token, but it doesn't rebuild earlier keys and values. The serving section below makes that tradeoff concrete.
This explains why generation speed is measured in tokens per second (t/s). The model has to run another decoding step for every generated token. A KV cache avoids recomputing earlier keys and values from scratch, but each new token still has to attend to previous context.
The code below isn't a neural network. It's a small hand-written language model that shows the same loop: look at current text, choose one next token, append it, and repeat.
```python
next_token_table = {
    "Order shipped from the": [("warehouse", 0.85), ("store", 0.10), ("hub", 0.05)],
    "Order shipped from the warehouse": [(".", 0.80), ("and", 0.20)],
    "Order shipped from the warehouse.": [("</s>", 1.0)],
}

text = "Order shipped from the"

for step in range(3):
    choices = next_token_table[text]
    # Greedy choice: take the highest-probability candidate
    token, _probability = max(choices, key=lambda item: item[1])

    if token == "</s>":
        break

    # Attach punctuation directly; separate other tokens with a space
    text = text + ("" if token == "." else " ") + token
    print(step, repr(text))
print("final text:", repr(text))
```

```text
0 'Order shipped from the warehouse'
1 'Order shipped from the warehouse.'
final text: 'Order shipped from the warehouse.'
```

Real LLMs replace the hand-written table with a Transformer, but the control flow is the same.
The loop ends when one of these things happens:
| Stop condition | Example |
|---|---|
| Model emits an end token | </s> or another stop marker. |
| Application or API hits an output-token limit | The server stops after a token budget. |
| Application stop string appears | The app stops at \n\nUser: or a JSON boundary. |
| Safety or policy system interrupts | The provider refuses or truncates output. |
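The first three stop conditions in the table can be sketched in a few lines. Everything here is illustrative: the lookup table stands in for a model, and the names `generate`, `max_new_tokens`, and `stop_strings` are assumptions rather than any particular API. The safety or policy interrupt is left out because it lives outside the decoding loop.

```python
# Hypothetical next-token table standing in for a real model.
next_token_table = {
    "Order shipped from the": "warehouse",
    "Order shipped from the warehouse": ".",
    "Order shipped from the warehouse.": "</s>",
}

def generate(prompt, max_new_tokens=8, stop_strings=("\n\nUser:",), end_token="</s>"):
    text = prompt
    for _ in range(max_new_tokens):
        # Unknown contexts fall back to the end token in this toy table.
        token = next_token_table.get(text, end_token)
        if token == end_token:
            return text, "end_token"
        text = text + ("" if token == "." else " ") + token
        for stop in stop_strings:
            if stop in text:
                # Cut the text at the stop string and report why we stopped.
                return text[: text.index(stop)], "stop_string"
    return text, "max_new_tokens"

print(generate("Order shipped from the"))
print(generate("Order shipped from the", max_new_tokens=1))
print(generate("Order shipped from the", stop_strings=(".",)))
```

The three calls stop for three different reasons: the end token, the token budget, and an application stop string. Returning the reason alongside the text makes truncated outputs easier to detect downstream.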
The loop says "sample a token," but how? The model hands you a distribution; turning it into a single chosen token is the job of a decoding strategy. The simplest is greedy decoding: always take the highest-probability token. It is deterministic, but it can fall into loops, repeating "We apologize for the inconvenience. We apologize for the inconvenience..." because "apologize" keeps scoring highest after "We."
The alternative is to sample: draw a token at random in proportion to its probability, so the second- or third-ranked token sometimes wins. Sampling can reduce repetition and add variety, but careless sampling can also hurt quality. The temperature knob from earlier controls how adventurous that draw is. Greedy versus sampling is the core fork; the full toolkit of decoding methods (and how to tune them) gets a dedicated chapter later, so here we only need the idea that one distribution can be turned into text in more than one way.
Common mistake: Setting a high temperature in a customer-facing support bot and wondering why answers drift off-brand, or setting temperature to 0 in a creative app and wondering why every user gets the same poem. At each step, the current context produces a distribution; the decoding strategy decides how much you gamble on it.
Use a seeded random generator when testing a sampler, so the result is repeatable.
```python
import numpy as np

tokens = np.array(["warehouse", "hub", "carrier"])
logits = np.array([2.4, 1.7, 0.2])

def probabilities(temperature):
    # Temperature-scaled, numerically stable softmax
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

rng = np.random.default_rng(4)  # fixed seed so the samples are repeatable
greedy = tokens[int(logits.argmax())]
samples = [str(rng.choice(tokens, p=probabilities(1.4))) for _ in range(8)]

print("greedy:", greedy)
print("sampled at temperature 1.4:", samples)
print("probabilities:", np.round(probabilities(1.4), 3).tolist())
```

```text
greedy: warehouse
sampled at temperature 1.4: ['carrier', 'warehouse', 'carrier', 'warehouse', 'hub', 'warehouse', 'hub', 'warehouse']
probabilities: [0.551, 0.334, 0.115]
```

Because the output is constantly being appended to the input, the text the model has to read keeps growing. If you generate a 1000-token support answer, by the end, the model is reading the original prompt plus the 999 tokens it just generated.
This growing sequence must fit into the model's context window, its short-term memory limit. If the context window is 8,000 tokens and the prompt plus generated text reaches 8,001 tokens, the system has to truncate earlier context, reject the request, summarize, or use a longer-context strategy.
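Of those options, truncation is the simplest to sketch. The helper below is a hypothetical illustration: the name `fit_to_window` and its parameters are assumptions, and real applications often protect system instructions instead of dropping the oldest tokens blindly.

```python
def fit_to_window(token_ids, context_window, reserve_for_output):
    # Leave room for the tokens the model still has to generate.
    budget = context_window - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than context_window")
    # Keep the most recent tokens; older context is dropped.
    return token_ids[-budget:]

history = list(range(10_000))  # hypothetical token IDs for a long conversation
kept = fit_to_window(history, context_window=8_000, reserve_for_output=500)
print(len(kept), kept[0], kept[-1])
```

Here 7,500 of the 10,000 tokens survive, starting at ID 2500. Anything the model needed from the first 2,500 tokens is gone, which is the cost that summarization or retrieval tries to avoid.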
Without a cache, the model would recompute attention information for the full prompt at every generation step. That would be wasteful.
The KV cache stores key and value tensors from earlier tokens at each attention layer. New decode queries can attend to these cached tensors instead of rebuilding earlier keys and values at every step.[6]
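Serving systems usually split this into two phases: prefill, which processes the whole prompt in one pass and fills the cache, and decode, which generates one token per step while reusing it.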
| Step | Naive work | With KV cache |
|---|---|---|
| First prompt pass | Read all prompt tokens. | Read all prompt tokens. |
| Generate token 1 | Recompute prompt and token 1. | Compute token 1, reuse prompt cache. |
| Generate token 2 | Recompute prompt, token 1, token 2. | Compute token 2, reuse earlier cache. |
| Long answer | Work keeps repeating old context. | Memory grows, but old computation is reused. |
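The table's claim, that cached decoding gives the same result with less repeated work, can be checked on a toy single-head attention layer. This is a sketch under assumptions: the projection matrices `w_q`, `w_k`, `w_v` and the `tokens` array are random stand-ins, and a real model repeats this per layer and per head.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical projection matrices for one attention head.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
tokens = rng.normal(size=(6, dim))  # hypothetical token representations

def attend(query, keys, values):
    # Attention for a single query over all visible keys and values.
    scores = keys @ query / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ values

# Naive decode: rebuild every key and value at each step.
naive = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    naive.append(attend(prefix[-1] @ w_q, prefix @ w_k, prefix @ w_v))

# Cached decode: project only the newest token and append to the cache.
cached, key_cache, value_cache = [], [], []
for token in tokens:
    key_cache.append(token @ w_k)
    value_cache.append(token @ w_v)
    cached.append(attend(token @ w_q, np.array(key_cache), np.array(value_cache)))

print("outputs match:", np.allclose(naive, cached))
print("cached keys after decoding:", len(key_cache))
```

The outputs match because, under causal masking, an earlier token's key and value never depend on later tokens. The price is the growing `key_cache` and `value_cache`, one entry per token.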
This is why serving systems care so much about KV cache memory. The cache speeds generation, but it also consumes GPU memory proportional to context length, number of layers, key/value-head count, head dimension, and batch size.
Production note: If a model seems cheap at 4K context but expensive at 128K context, check KV cache memory before blaming the model weights. Long context changes serving economics.
A simplified cache estimate makes the tradeoff concrete. For each request, layer, token, and key/value head, the server stores one key vector and one value vector. This example uses bfloat16 or float16 values at two bytes each.
```python
batch = 4
layers = 32
kv_heads = 8
head_dim = 128
bytes_per_value = 2

def cache_gib(sequence_length):
    # Final factor of 2: one key vector and one value vector per entry
    values = batch * layers * sequence_length * kv_heads * head_dim * 2
    return values * bytes_per_value / (1024 ** 3)

for sequence_length in [2_048, 32_768]:
    print(f"{sequence_length:>6,} tokens: {cache_gib(sequence_length):.2f} GiB")
```

```text
 2,048 tokens: 1.00 GiB
32,768 tokens: 16.00 GiB
```

This is only the KV cache, not model weights, activations, allocator overhead, or batching metadata. It also shows why fewer key/value heads can reduce cache cost.
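To make the head-count effect concrete, the same estimate can be repeated with the number of key/value heads as a parameter. The values 32 and 1 are hypothetical comparison points (one key/value head per query head in a 32-head model, and a single shared key/value head); the other settings match the example above.

```python
def cache_gib(sequence_length, kv_heads, batch=4, layers=32, head_dim=128, bytes_per_value=2):
    # Final factor of 2: one key vector and one value vector per entry.
    values = batch * layers * sequence_length * kv_heads * head_dim * 2
    return values * bytes_per_value / (1024 ** 3)

for kv_heads in [32, 8, 1]:
    print(f"{kv_heads:>2} KV heads: {cache_gib(32_768, kv_heads):.2f} GiB")
```

At 32,768 tokens this prints 64.00, 16.00, and 2.00 GiB: cache size scales linearly with the number of key/value heads, independent of how many query heads the model uses.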
A common critique of LLMs is that they're only autocomplete.
If it's autocomplete, how can it write code, solve logic puzzles, or translate languages?
The answer lies in scale, diversity, and optimization. When a neural network predicts the next token across huge mixtures of code, math, dialogue, documentation, and books, simple word association often isn't enough to minimize the loss function.
Consider this prompt:
"An order contains 5 items. After a return of 2 items, the retained item count is"
| Model family | Likely shortcut | Better internal feature |
|---|---|---|
| Markov-style autocomplete | Nearby token pattern like 2, 3 | None; it mostly follows local statistics. |
| Large language model | Still predicts one token | A subtraction-like representation helps reduce loss across many examples. |
A simple Markov chain might look at 2 and guess 3 because 2, 3 is common. A larger neural language model can benefit from features that track subtraction-like structure across many related examples, because those features can improve next-token predictions. This is a pressure from the objective, not a guarantee of correct arithmetic.
To predict text from a complex world, the model has pressure to learn useful approximations of that world. Grammar, facts, reasoning patterns, and code structure can become helpful internal representations for minimizing next-token prediction loss.
Next-token prediction explains the training signal. It doesn't prove that every model has human-like understanding.
Good language here is careful:
| Weak phrasing | Better phrasing |
|---|---|
| "The model understands the world." | "The model can learn internal features that track useful structure in text and code." |
| "The model only memorizes text." | "Memorization can happen, but broad next-token loss rewards reusable patterns too." |
| "Reasoning is guaranteed to emerge." | "Reasoning-like behavior can emerge when it helps reduce loss across many tasks." |
| "Autocomplete can't do math." | "Local autocomplete can't do much math, but large neural language models can learn arithmetic-like features." |
This matters in system design and technical discussions. If you say "it's just next-token prediction," you miss learned structure. If you say "it's basically a human mind," you overclaim. Large generatively pretrained models have shown useful task behavior, but each capability still needs measurement on the task that matters.[1][7]
Take this prompt:
A shipment event JSON record starts with
Answer these questions before looking down:
1. Which next tokens are plausible?
2. Which token likely has the highest probability?
3. What happens after the model emits {?
4. Why do alternatives exist?

One reasonable answer:
| Question | Answer sketch |
|---|---|
| Plausible tokens | {, a, the, or a newline. |
| Highest probability | Usually {, because JSON syntax starts with an opening brace. |
| After { | The brace gets appended, and the next prediction conditions on it. |
| Why alternatives exist | Natural-language explanations and formatted examples can start differently. |
Now connect that to code generation. A model that has seen many shipment-event JSON examples can learn syntax regularities: opening braces, quoted keys, colons, commas, values, and closing braces. The same principle applies to Python indentation, SQL clauses, and Markdown tables.
The model's top token can be wrong.
Suppose a support bot sees:
The refund policy says annual plans are
A model might assign high probability to non-refundable because that phrase is common in policy text. But if your company's actual policy says annual plans are refundable within 14 days, the statistically likely continuation is still wrong.
The fix isn't to hope the base model "knows" your policy. Put the trusted policy into context, retrieve the right document, or use a tool that checks the source of truth.
| Symptom | Cause | Fix |
|---|---|---|
| Plausible but false answer | Training distribution dominates missing local facts. | Add retrieved source text. |
| Old policy repeated | Model has stale or generic priors. | Query current docs or database. |
| Confident unsupported claim | Probability isn't evidence. | Require citations or source-grounded output. |
| Long answer drifts | Context fills with generated text. | Re-anchor with instructions and source snippets. |
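The "add retrieved source text" fix usually comes down to assembling the prompt from trusted snippets before the model sees the question. The sketch below is hypothetical throughout: `build_grounded_prompt`, the instruction wording, and the policy snippet are illustrations, and the retrieval step itself is assumed to happen elsewhere.

```python
def build_grounded_prompt(question, policy_snippets):
    # Number each trusted snippet so the answer can cite its source.
    sources = "\n".join(
        f"[{index}] {text}" for index, text in enumerate(policy_snippets, start=1)
    )
    return (
        "Answer using only the sources below. "
        "If they do not cover the question, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical policy text; a real system would retrieve it from current documents.
snippets = ["Annual plans are refundable within 14 days of purchase."]
prompt = build_grounded_prompt("Are annual plans refundable?", snippets)
print(prompt)
```

With the policy in context, the likely continuation after "Answer:" is conditioned on the company's actual rule instead of on generic policy text from training.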
Many beginners assume an LLM contains a hidden database of facts that it looks up during generation. It doesn't. When a model completes "Orders marked delivered are eligible for" with "refund", it isn't querying the current returns-policy table. It's producing a likely continuation from its context and learned parameters.
This distinction matters because the correct mental model is statistical generation, not retrieval. If you need accurate, up-to-date facts, you must ground the model with retrieval, tools, or explicit context.
Perplexity is exp(loss): the effective number of choices the model faces per step. A common mistake is comparing perplexity figures that were not computed as exp(loss) under natural-log averaging, or comparing runs with different token accounting. Fix: Hold tokenizer, evaluated tokens, and averaging convention constant before comparing them.
[1] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.
[2] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
[3] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[4] Shannon, C. E. (1948). A Mathematical Theory of Communication.
[5] Jelinek, F., et al. (1977). Perplexity — a Measure of the Difficulty of Speech Recognition Tasks.
[6] Pope, R., et al. (2023). Efficiently Scaling Transformer Inference. arXiv preprint.
[7] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.