Trace how decoder-only models grew into modern LLMs, then inspect scaling, instruction tuning, open weights, MoE, and serving tradeoffs with runnable examples.
The last chapter showed the next-token loop: predict one token, append it, and repeat. This chapter asks how that loop grew from a research architecture into GPT (Generative Pre-trained Transformer) systems and the modern large language models (LLMs) behind assistants, coding tools, and retrieval systems.
The Transformer was introduced in 2017 for machine translation.[1] Later models specialized its pieces into reading-focused encoders and generation-focused decoders. We'll follow one concrete thread throughout: a customer-support message moving through a decoder-only model, so each historical step answers a practical question.
Key idea: Modern generative LLMs still use the next-token loop from the last chapter. What changed is how models are pretrained, adapted to instructions, routed through parameters, and served within a latency and memory budget.
Before we talk about history, let's see why the decoder-only path became so useful for general assistants. A decoder-only model reads text left to right and predicts the next token. That sounds simple, but it turns out to be a remarkably flexible interface.
Imagine you run customer support for an online store. You face three different problems:
An encoder-only model like BERT can be fine-tuned for the first task: read a ticket, output a label. A decoder-only model can express all three tasks as "continue the text." Here is what a few-shot classification prompt looks like:
1Classify the customer message into Refund, Shipping, or Account.
2
3Message: I never received my package.
4Label: Shipping
5
6Message: I want my money back.
7Label: Refund
8
9Message: I forgot my password.
10Label: Account
11
12---
13
14Message: My order says delivered but it's not here.
15Label:The model reads the prompt from left to right, including the examples. Then it predicts the continuation at the open label slot. A sufficiently capable instruction-tuned model may produce Shipping, but that outcome is something you evaluate, not assume.
This flexibility is why generation became a common product interface. One prompt format can express classification, extraction, summarization, translation, coding, and planning. That universality shaped the field.
The prompt framing itself is simple enough to inspect directly:
1def build_support_label_prompt(message: str) -> str:
2 examples = [
3 ("I never received my package.", "Shipping"),
4 ("I want my money back.", "Refund"),
5 ("I forgot my password.", "Account"),
6 ]
7 header = "Classify the customer message into Refund, Shipping, or Account."
8 shots = "\n\n".join(
9 f"Message: {text}\nLabel: {label}" for text, label in examples
10 )
11 return f"{header}\n\n{shots}\n\n---\n\nMessage: {message}\nLabel:"
12
13prompt = build_support_label_prompt("My order says delivered but it's not here.")
14print(prompt)
15print("expected_next_token=Shipping")
16print("why=the prompt pattern ends with a label slot")1Classify the customer message into Refund, Shipping, or Account.
2
3Message: I never received my package.
4Label: Shipping
5
6Message: I want my money back.
7Label: Refund
8
9Message: I forgot my password.
10Label: Account
11
12---
13
14Message: My order says delivered but it's not here.
15Label:
16expected_next_token=Shipping
17why=the prompt pattern ends with a label slotThe original 2017 Transformer had two parts: an Encoder (to read the input) and a Decoder (to generate the output). Later models specialized these roles.
Refund. But BERT wasn't built to write long answers token by token.Encoder-only models became common for language-understanding tasks. Decoder-only models became the common architecture for general assistants. One practical reason is that generation is a flexible interface. If a model can generate text, you can ask for translation, summarization, coding, planning, or classification in the same format: prompt in, text out.
The fork wasn't just academic. It shaped how products are built today.
| Product need | Encoder-style fit | Decoder-style fit |
|---|---|---|
| Classify a support ticket | Strong fit. Read full ticket, output label. | Works if framed as generation, but not the original strength. |
| Generate a reply | Poor fit without extra decoder machinery. | Strong fit. Predict one token at a time. |
| Search over documents | Strong fit when using a retrieval-specific encoder or reranker. | Useful when paired with generated answers. |
| Write code | Not the natural interface. | Strong fit because code is a token sequence. |
The main lesson isn't "BERT bad, GPT good." Encoder-style models are still useful for classification, extraction, reranking, and retrieval-specific embeddings.[4] GPT-style decoders became the natural assistant interface because a single text-generation loop can express many tasks.
Think of a decoder-only model as a writer with a bookmark:
That simple constraint is why the same architecture can write a French translation:
1Translate English to French:
2refund -> remboursement
3shipment -> expédition
4invoice ->and produce:
1factureor write a support reply:
Customer: Where is my order?
Agent: I'm sorry for the delay. Your tracking number is 1Z999AA10123...
No separate model was trained for each of these exact prompts. The model treats the examples and the conversation as context, then continues the pattern.
The attention visibility rule is small enough to inspect. In a causal decoder, a position can attend to itself and earlier positions, never future positions. In a bidirectional encoder, every input position is visible while building a representation.
1import numpy as np
2
3tokens = ["<bos>", "refund", "approved", "<eos>"]
4causal = np.tril(np.ones((len(tokens), len(tokens)), dtype=int))
5bidirectional = np.ones_like(causal)
6
7print("tokens:", tokens)
8print("causal row for 'approved': ", causal[2].tolist())
9print("bidirectional row for 'approved':", bidirectional[2].tolist())
10print("future visible in causal row:", int(causal[2, 3]))1tokens: ['<bos>', 'refund', 'approved', '<eos>']
2causal row for 'approved': [1, 1, 1, 0]
3bidirectional row for 'approved': [1, 1, 1, 1]
4future visible in causal row: 0The 0 for the future end token is the constraint that makes autoregressive generation valid. Real attention learns nonuniform weights over visible positions; this matrix only shows which connections are allowed.
To understand why later innovations matter, let's follow one token through the full decoder-only pipeline using a customer support question: "Where is my order?"
The sentence first goes through tokenization, which breaks it into token IDs (for example, a tokenizer might assign one token representing order an ID such as 1847, while punctuation and spaces may be separate or bundled with nearby text). Those IDs become vectors through an embedding lookup table. Each token ID maps to a learned high-dimensional vector used by the model's layers.
Next, causal self-attention lets each position mix information from allowed positions at or before it. For the representation at "order," that includes "Where," "is," and "my." The model computes Query, Key, and Value projections and uses attention weights to build a context-enriched vector for the current position. After attention, the feed-forward network transforms that vector independently, applying nonlinear transformations that turn raw context into useful features.
Finally, after any stack-level final normalization, the refined vector for the last position is projected through a large output matrix to produce scores over the vocabulary. GPT-2, for example, applies a final ln_f before computing logits.[5] Softmax turns those scores into probabilities, and a decoding policy chooses a next token. During generation, the KV cache stores Key and Value tensors from prior tokens so the model doesn't recompute those projections at each new step.
The same loop runs for every single token the model ever produces. At production scale, even small gains in how attention is computed, how many parameters are active per token, how positions are encoded, or how the KV cache is managed translate into large wins on latency, cost, and the maximum context length you can afford to serve.
A small shape trace makes that interface concrete. This isn't a Transformer implementation: random arrays stand in for learned final normalized hidden states and output weights. It exposes the two production objects to remember: last-position logits and cached prior Key/Value tensors.
1import numpy as np
2
3rng = np.random.default_rng(3)
4batch, prompt_tokens, d_model, vocab_size = 1, 5, 8, 6
5final_normalized_hidden = rng.normal(size=(batch, prompt_tokens, d_model))
6output_projection = rng.normal(size=(d_model, vocab_size))
7last_logits = final_normalized_hidden[:, -1, :] @ output_projection
8shifted = last_logits - last_logits.max(axis=-1, keepdims=True)
9probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
10
11layers, kv_heads, head_dim = 4, 2, 4
12kv_cache_shape = (layers, 2, batch, kv_heads, prompt_tokens, head_dim)
13print("final normalized states:", final_normalized_hidden.shape)
14print("last-position logits:", last_logits.shape)
15print("next-token probabilities:", probabilities.shape, "sum=", round(float(probabilities.sum()), 3))
16print("cached K/V shape:", kv_cache_shape)1final normalized states: (1, 5, 8)
2last-position logits: (1, 6)
3next-token probabilities: (1, 6) sum= 1.0
4cached K/V shape: (4, 2, 1, 2, 5, 4)In 2020, OpenAI published work on scaling laws for neural language models.[6] The core finding: loss improves predictably with three factors.
Between GPT-1 and GPT-3, OpenAI released GPT-2 in 2019, a 1.5-billion parameter decoder-only model that demonstrated surprisingly coherent open-ended generation and drew public attention to both the power and risks of scaling up autoregressive language models.[7]
OpenAI then trained GPT-3, a 175-billion parameter decoder-only model.[8] It was more than 100 times larger than GPT-2.
GPT-3 demonstrated in-context learning at a surprising scale.[8] You didn't need to retrain the model for every new task. You could show a few examples in the prompt, and the model often inferred the pattern on the fly. This made prompt engineering a practical developer skill.
Scaling laws are often summarized as "bigger is better." That's too sloppy. The useful statement is narrower and more balanced.
Consider two hypothetical training runs:
| Model | Parameters | Training tokens | Compute (approx) |
|---|---|---|---|
| Small but balanced | 7 billion | 140 billion | ~1x |
| Large but starved | 175 billion | 300 billion | ~54x |
In this hypothetical table, the 175B model has 25x more parameters, but it was trained on only about 2x more tokens. The Chinchilla paper showed that many large models were undertrained for their size: a smaller model trained on proportionally more data can be far more efficient per parameter.[9]
This toy calculation makes the imbalance visible. parameters x tokens is only a training-compute proxy, not a quality predictor or a substitute for measured evaluation.
1runs = {
2 "small balanced": {"params_b": 7, "tokens_b": 140},
3 "large token-starved": {"params_b": 175, "tokens_b": 300},
4}
5baseline_proxy = runs["small balanced"]["params_b"] * runs["small balanced"]["tokens_b"]
6
7for name, run in runs.items():
8 tokens_per_parameter = run["tokens_b"] / run["params_b"]
9 compute_proxy = run["params_b"] * run["tokens_b"] / baseline_proxy
10 print(
11 f"{name:20s} tokens/parameter={tokens_per_parameter:5.1f} "
12 f"compute proxy={compute_proxy:5.1f}x"
13 )1small balanced tokens/parameter= 20.0 compute proxy= 1.0x
2large token-starved tokens/parameter= 1.7 compute proxy= 53.6xFor engineers, this changes how you read model releases. Check four questions:
A smaller model trained well can beat a larger model trained poorly. That's why modern model announcements talk about tokens, data quality, context windows, tool use, and inference efficiency, not only parameter count.
GPT-3 was powerful, but it was still a document completer. If you prompted it with "Write a return-policy email", it might continue with another prompt instead of answering directly, because the base objective was "continue the text."
To make models respond to instructions more reliably, InstructGPT used supervised demonstrations followed by human-ranked model outputs, a learned reward model, and reinforcement-learning fine-tuning.[10] The important distinction is behavioral: a base model learns continuation, while an instruction-tuned model is adapted to answer requests.
These words get mixed together, so separate them early.
| Model type | What it's trained to do | Example behavior |
|---|---|---|
| Base model | Continue text from pre-training distribution. | May complete a list, article, code file, or prompt transcript. |
| Instruct model | Follow written instructions and output useful answers. | Responds directly when asked to summarize or explain. |
| Chat model | Handle multi-turn role-based conversations. | Tracks user and assistant turns, follows system instructions, and applies safety behavior. |
A base model can be powerful but awkward. An instruct model is easier to use. A chat model adds conversation structure and behavior tuning.
Here is the same user request through three lenses:
| Request | Base-model tendency | Chat-model tendency |
|---|---|---|
Write a refund email. | Might continue with more prompts or examples. | Writes the email. |
Explain attention. | Might produce documentation-like continuation. | Gives an answer adapted to the user. |
Return JSON only. | Might or might not follow format. | More likely to follow format, especially with structured output controls. |
This difference is why product teams usually start with an instruct or chat checkpoint instead of raw pre-training weights.
A common pairwise preference record has the same prompt paired with a chosen answer and a rejected answer. InstructGPT collected rankings of model outputs; pairwise records are a useful simplified view of the comparison signal.[10] The labels capture desired behavior; this code does not train a reward model.
1comparisons = [
2 {
3 "prompt": "Order A10234 arrived damaged on day 12. Apply the 30-day return policy.",
4 "chosen": "Approve a damaged-item return under the 30-day policy.",
5 "rejected": "All sales are final.",
6 "required_fact": "30-day",
7 },
8 {
9 "prompt": "Shipment S52 is delayed. Give the carrier scan time.",
10 "chosen": "The latest carrier scan was 09:32 UTC at the regional hub.",
11 "rejected": "It will arrive today.",
12 "required_fact": "09:32",
13 },
14]
15
16for row in comparisons:
17 keeps_required_fact = row["required_fact"] in row["chosen"]
18 print(f"prompt={row['prompt'][:22]}... chosen_keeps_fact={keeps_required_fact}")
19 print(" preferred:", row["chosen"])1prompt=Order A10234 arrived d... chosen_keeps_fact=True
2 preferred: Approve a damaged-item return under the 30-day policy.
3prompt=Shipment S52 is delaye... chosen_keeps_fact=True
4 preferred: The latest carrier scan was 09:32 UTC at the regional hub.Meta introduced a different distribution model.
In early 2023, Meta introduced LLaMA (Large Language Model Meta AI), a family of foundation models, and released its models to the research community.[11] That research-focused access gave approved users checkpoints they could evaluate and adapt instead of accessing a model only through a hosted interface. Downloadable weights and unrestricted commercial use are separate questions, which is why the license check below matters.
Researchers also didn't need 175 billion parameters for every useful workload. Chinchilla-style scaling showed that many models were undertrained for their size: smaller models trained on more tokens could be far more compute-efficient.[9] Independently, downloadable weights enable techniques such as fine-tuning (adapting a pre-trained model on task-specific examples) and quantization (using lower-precision weight representations to reduce memory cost), subject to model quality and license checks.
This distinction matters in real engineering work.
| Term | Meaning | What to check |
|---|---|---|
| Open weights | The model checkpoint is downloadable. | License, commercial use, redistribution, acceptable use policy. |
| Open-source code | The training or inference code is available under an open-source license. | License, dependencies, reproducibility. |
| Open data | The training dataset is available or documented enough to inspect. | Data license, filtering, privacy, safety. |
Many model families are open-weight but not fully open-source in the strict software-license sense. That doesn't make them useless. It means you need to read the license before building a business around them.
For a startup, open weights can unlock:
They also create work:
Beyond raw scaling, model builders can change how much computation and memory each generated token requires. This matters to engineers because a serving budget depends on the active computation path and stored state, not only a headline parameter total.
Mixture-of-Experts (MoE) models activate only part of the network for each token. Mixtral, for example, routes each token to a small subset of expert feed-forward networks.[12] This gives the model more total capacity while keeping active compute lower than a dense model of the same total size.
Think of it like a specialized fulfillment center. You don't send every parcel through every station. A fragile return goes to inspection and repacking, while a normal outbound order goes straight to pick-pack-ship. Most stations stay idle for that parcel. In an MoE model, most expert parameters are inactive for any given token. That lowers active computation relative to a dense model with the same total size, though the full expert pool still consumes memory and routing adds systems tradeoffs.
A toy router shows the mechanism. Its labels are for explanation only: trained experts aren't guaranteed to align neatly with human topics.
1import numpy as np
2
3experts = np.array(["returns", "shipping", "catalog", "billing"])
4tokens = ["refund", "delayed"]
5router_logits = np.array([
6 [2.4, 0.3, 0.5, -0.2],
7 [0.1, 2.2, 0.7, 0.0],
8])
9
10for token, logits in zip(tokens, router_logits):
11 selected = np.argsort(logits)[-2:][::-1]
12 weights = np.exp(logits[selected] - logits[selected].max())
13 weights /= weights.sum()
14 routes = list(zip(experts[selected].tolist(), np.round(weights, 3).tolist()))
15 print(f"{token:7s} -> {routes}")1refund -> [('returns', 0.87), ('catalog', 0.13)]
2delayed -> [('shipping', 0.818), ('catalog', 0.182)]Standard decoder LLMs generate one token at a time. Reasoning-focused models spend additional inference-time computation before producing a final answer. OpenAI's o1-preview described this approach in product form in 2024.[13] DeepSeek-R1 reported reasoning behavior developed through reinforcement-learning stages in 2025.[14] For applicable OpenAI reasoning API models, official documentation exposes a model-dependent reasoning-effort control that trades speed and token use against additional reasoning work.[15] Other model families require their own documentation and evaluation.
For our support example, a standard model might immediately answer a complex refund policy question and get a detail wrong. A reasoning model can spend extra inference-time compute on intermediate reasoning before producing the final answer, which can help it check the order date, the policy window, and the payment method more carefully. Training methods for reasoning and preference tuning get a dedicated chapter later; here you only need the shape of the idea.
Modern inference relies on several optimizations that sit below the architecture headlines:
For GQA, the serving consequence can be calculated directly: the KV cache scales with the number of Key/Value heads.
1batch, layers, tokens, head_dim, bytes_per_value = 8, 32, 8192, 128, 2
2query_heads = 32
3
4for label, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
5 cached_values = batch * layers * tokens * kv_heads * head_dim * 2
6 cache_gib = cached_values * bytes_per_value / (1024 ** 3)
7 relative = kv_heads / query_heads
8 print(f"{label}: kv_heads={kv_heads:2d} cache={cache_gib:5.2f} GiB relative={relative:.3f}")1MHA: kv_heads=32 cache=32.00 GiB relative=1.000
2GQA: kv_heads= 8 cache= 8.00 GiB relative=0.250
3MQA: kv_heads= 1 cache= 1.00 GiB relative=0.031RoPE's rotation also exposes a useful invariant: shifting both token positions by the same amount preserves their relative-position interaction in this two-dimensional example.
1import numpy as np
2
3def rotate(vector: np.ndarray, position: int, theta: float = 0.4) -> np.ndarray:
4 angle = position * theta
5 matrix = np.array([
6 [np.cos(angle), -np.sin(angle)],
7 [np.sin(angle), np.cos(angle)],
8 ])
9 return matrix @ vector
10
11query = np.array([1.0, 0.2])
12key = np.array([0.3, 0.9])
13same_gap_early = rotate(query, 4) @ rotate(key, 2)
14same_gap_late = rotate(query, 14) @ rotate(key, 12)
15different_gap = rotate(query, 14) @ rotate(key, 2)
16print("same gap, early:", round(float(same_gap_early), 6))
17print("same gap, late: ", round(float(same_gap_late), 6))
18print("different gap: ", round(float(different_gap), 6))
19print("same gap equal:", bool(np.isclose(same_gap_early, same_gap_late)))1same gap, early: 0.936998
2same gap, late: 0.936998
3different gap: -0.794779
4same gap equal: TrueGQA, RoPE, and FlashAttention don't change the next-token objective. They change cache cost, position handling, or attention memory traffic, which must be evaluated in the actual serving setup.
When a new model appears, avoid one-metric thinking. Read it like an engineer.
| Claim | Question to ask |
|---|---|
| "1T parameters" | Total parameters or active parameters per token? Dense or MoE? |
| "1M context" | What does latency and KV cache memory look like at that length? |
| "Better coding" | Which benchmark, which harness, and how many attempts? |
| "Reasoning model" | Does it use extra test-time compute, visible scratchpad, hidden reasoning, or tool calls? |
| "Open model" | Open weights, open license, open data, or open training recipe? |
| "Cheap API" | Input, output, cache-hit, cache-miss, batch, and long-context prices? |
This habit keeps you from overreacting to headlines.
History is useful only if it improves a decision. A published benchmark is evidence about a harness and a date, not proof that a model meets your support workload's quality, latency, privacy, or cost constraints.
Build an evaluation set of representative cases, such as damaged-item returns that require policy citations and delayed shipments that require exact scan times. Measure each candidate under the same prompt and tool setup. A scorecard can make tradeoffs explicit; its weights are product decisions, not universal truths.
1candidates = [
2 {"name": "hosted-fast", "quality": 0.86, "p95_ms": 250, "cost": 0.50, "license_ok": True},
3 {"name": "hosted-deep", "quality": 0.94, "p95_ms": 1200, "cost": 1.50, "license_ok": True},
4 {"name": "local-restricted", "quality": 0.90, "p95_ms": 500, "cost": 0.20, "license_ok": False},
5]
6
7def score(model: dict, latency_penalty: float, cost_penalty: float) -> float:
8 if not model["license_ok"]:
9 return float("-inf")
10 return 100 * model["quality"] - latency_penalty * model["p95_ms"] - cost_penalty * model["cost"]
11
12scenarios = {
13 "live support": (0.03, 8.0),
14 "nightly audit": (0.003, 4.0),
15}
16for scenario, penalties in scenarios.items():
17 ranked = sorted(candidates, key=lambda model: score(model, *penalties), reverse=True)
18 winner = ranked[0]
19 print(f"{scenario:13s} winner={winner['name']} score={score(winner, *penalties):.2f}")1live support winner=hosted-fast score=74.50
2nightly audit winner=hosted-deep score=84.40Three architectural questions recur across model documentation:
| Trend | What it is | Why an engineer cares |
|---|---|---|
| Routed computation | A model may activate a subset of expert parameters per token. | For MoE models, total parameter counts alone don't predict serving cost; ask for active parameters. |
| Optional reasoning effort | Some APIs expose more inference-time work per request. | Measure whether quality gains justify additional tokens and latency on hard cases. |
| Long input windows | Documentation may advertise large context limits. | Capacity doesn't prove reliable use of evidence at every position. |
The throughline is durable: the decoder-only next-token loop remained; scale increased capability, instruction tuning adapted behavior, downloadable weights changed deployment choices, and inference techniques changed cost and evaluation questions.
A common beginner mistake is assuming that a large advertised context window guarantees equally reliable use of evidence at every position. It doesn't.
Research on how language models use long contexts revealed a recurring pattern called "Lost in the Middle." When a model is given a very long prompt, it often performs best on relevant information at the beginning and the end. Relevant evidence placed in the middle of a long input is more likely to be missed or misused.[19]
For a support system, this has a direct practical implication. If you paste a 50-page return policy into the prompt and bury the customer's specific case details in the middle, the model might miss the case details. The fix is structural: keep important instructions and the most relevant context near strong positions such as the beginning or the end of the prompt, or break long documents into chunks and retrieve only the relevant ones.
You can enforce a prompt assembly rule in code. This rule places the governing policy first and the retrieved case evidence next to the question. It doesn't prove the model will answer correctly; evaluate that on held-out cases.
1policy = "Policy: damaged items may be returned within 30 days."
2older_notes = [
3 "History: customer updated email address.",
4 "History: promotional coupon applied.",
5 "History: delivery notification sent.",
6]
7case = "Evidence: order A10234 arrived damaged on day 12."
8question = "Question: approve return? Cite the policy and evidence."
9
10prompt_lines = [policy, *older_notes, case, question]
11for position, line in enumerate(prompt_lines, start=1):
12 print(f"{position}: {line}")
13print("evidence_adjacent_to_question:", prompt_lines[-2] == case)11: Policy: damaged items may be returned within 30 days.
22: History: customer updated email address.
33: History: promotional coupon applied.
44: History: delivery notification sent.
55: Evidence: order A10234 arrived damaged on day 12.
66: Question: approve return? Cite the policy and evidence.
7evidence_adjacent_to_question: TrueThis behavior is one reason to use retrieval-augmented generation (RAG). Instead of dumping every document into the context window, you retrieve only the most relevant passages and place them strategically.
Test your understanding by deciding which architecture is a good default for each task.
The table below uses the support and fulfillment examples we have been following, so you can see how the choice changes based on the interface your product needs.
| Task | Good default | Why |
|---|---|---|
| Rank support tickets by urgency | Encoder-only (BERT-style) | Reads the full text at once and outputs a score or label. |
| Write a personalized shipping-delay apology | Decoder-only (GPT-style) | Generates text token by token in a left-to-right loop. |
| Find the most similar past ticket from a database | Bi-encoder, often encoder-style | Encodes tickets separately so their embedding vectors can be indexed and compared with cosine similarity.[4] |
| Generate a warehouse pick-list summary | Decoder-only (GPT-style) | Treats a structured report as a generated token sequence. |
If you chose decoder-only for the ranking task, remember that generation is flexible, but an encoder can still be the better tool for embeddings and classification.
These mistakes are easy to make because headlines and marketing tend to oversimplify. If you catch yourself thinking any of the following, pause and check the correction.
These misconceptions are particularly dangerous in team settings, where one person's shorthand can shape architecture decisions for months.
| Misconception | Symptom | Cause | Fix |
|---|---|---|---|
| "GPT means any LLM." | You call every assistant "GPT" in architecture discussions. | GPT is the most famous brand, but it's one decoder-only family. | Use "LLM" or "language model" for the general category. |
| "BERT is obsolete." | You ignore encoder models for retrieval and classification. | Decoder-only assistants get the headlines. | Encoder-style models are still useful for classification, extraction, and retrieval-specific embeddings. |
| "Bigger model wins by default." | You judge models only by parameter count. | Marketing emphasizes large numbers. | Check data quality, compute budget, tuning, and inference cost. A smaller, well-trained model often beats a larger, starved one. |
| "Open-weight means free to use however I want." | You deploy an open-weight model commercially without reading the license. | The word "open" sounds permissive. | The license controls what you can do. Read it before building a business around the model. |
| "Reasoning models are a different species." | You assume a model label proves an unrelated architecture. | Product names obscure the model and inference details. | Read documentation: many reasoning-focused generative models still decode autoregressively while spending extra inference-time computation. |
Fixing these misconceptions early will save you from bad architecture decisions and from overpromising to stakeholders when you talk about model capabilities.
After this chapter, you should be able to:
Answer every question, then check your score. Score above 75% to mark this lesson complete.
10 questions remaining.
Attention Is All You Need.
Vaswani, A., et al. · 2017
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Devlin, J., et al. · 2019 · NAACL 2019
Improving Language Understanding by Generative Pre-Training.
Radford, A., et al. · 2018
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019
GPT-2 Source Implementation.
OpenAI · 2019
Scaling Laws for Neural Language Models
Kaplan et al. · 2020
Language Models are Unsupervised Multitask Learners.
Radford, A., et al. · 2019
Language Models are Few-Shot Learners.
Brown, T., et al. · 2020 · NeurIPS 2020
Training Compute-Optimal Large Language Models.
Hoffmann, J., et al. · 2022 · NeurIPS 2022
Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022
LLaMA: Open and Efficient Foundation Language Models.
Touvron, H., et al. · 2023
Mixtral of Experts.
Jiang, A. Q., et al. · 2024
Learning to reason with LLMs
OpenAI · 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
DeepSeek Team · 2025
Reasoning models
OpenAI · 2026
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
Ainslie, J., et al. · 2023 · EMNLP 2023
RoFormer: Enhanced Transformer with Rotary Position Embedding.
Su, J., et al. · 2021
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
Lost in the Middle: How Language Models Use Long Contexts
Liu, N.F., et al. · 2023 · TACL 2023