LeetLLM
LearnPracticeFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Practice
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnPreparation & PrerequisitesFrom GPT to Modern LLMs
📝EasyNLP Fundamentals

From GPT to Modern LLMs

Trace how decoder-only models grew into modern LLMs, then inspect scaling, instruction tuning, open weights, MoE, and serving tradeoffs with runnable examples.

22 min read
Learning path
Step 23 of 155 in the full curriculum
Language Modeling & Next TokensPrompt Engineering Fundamentals

The last chapter showed the next-token loop: predict one token, append it, and repeat. This chapter asks how that loop grew from a research architecture into GPT (Generative Pre-trained Transformer) systems and the modern large language models (LLMs) behind assistants, coding tools, and retrieval systems.

The Transformer was introduced in 2017 for machine translation.[1] Later models specialized its pieces into reading-focused encoders and generation-focused decoders. We'll follow one concrete thread throughout: a customer-support message moving through a decoder-only model, so each historical step answers a practical question.

Key idea: Modern generative LLMs still use the next-token loop from the last chapter. What changed is how models are pretrained, adapted to instructions, routed through parameters, and served within a latency and memory budget.

Six visual milestones around a stable decoder loop: stacked Transformer blocks in 2017, bidirectional and causal attention masks in 2018, GPT-2-to-GPT-3 scale bars in 2020, preference tuning in 2022, downloadable-weight workflows in 2023, and sparse expert routing plus reasoning effort from 2024 onward. Six visual milestones around a stable decoder loop: stacked Transformer blocks in 2017, bidirectional and causal attention masks in 2018, GPT-2-to-GPT-3 scale bars in 2020, preference tuning in 2022, downloadable-weight workflows in 2023, and sparse expert routing plus reasoning effort from 2024 onward.
The architecture split into bidirectional readers and causal writers, then accumulated scale, instruction tuning, downloadable-weight workflows, routed experts, and optional inference effort. The decoder path at the bottom still selects one token, appends it, and repeats.
Diagram showing 2017 Transformer attention blocks, Encoder path BERT: read text, Decoder path GPT: generate text, and Scale plus data GPT-3. Diagram showing 2017 Transformer attention blocks, Encoder path BERT: read text, Decoder path GPT: generate text, and Scale plus data GPT-3.
2017 Transformer attention blocks, Encoder path BERT: read text, Decoder path GPT: generate text, and Scale plus data GPT-3.

One generation loop, many tasks

Before we talk about history, let's see why the decoder-only path became so useful for general assistants. A decoder-only model reads text left to right and predicts the next token. That sounds simple, but it turns out to be a remarkably flexible interface.

Imagine you run customer support for an online store. You face three different problems:

  1. Classification. A customer writes, "I want my money back." You need the label "Refund."
  2. Extraction. A customer pastes a messy paragraph that contains a tracking number buried inside it.
  3. Generation. A customer asks, "Where is my order?" You need to write a polite reply that looks up the tracking status and explains the delay.

An encoder-only model like BERT can be fine-tuned for the first task: read a ticket, output a label. A decoder-only model can express all three tasks as "continue the text." Here is what a few-shot classification prompt looks like:

text
1Classify the customer message into Refund, Shipping, or Account. 2 3Message: I never received my package. 4Label: Shipping 5 6Message: I want my money back. 7Label: Refund 8 9Message: I forgot my password. 10Label: Account 11 12--- 13 14Message: My order says delivered but it's not here. 15Label:

The model reads the prompt from left to right, including the examples. Then it predicts the continuation at the open label slot. A sufficiently capable instruction-tuned model may produce Shipping, but that outcome is something you evaluate, not assume.

This flexibility is why generation became a common product interface. One prompt format can express classification, extraction, summarization, translation, coding, and planning. That universality shaped the field.

One decoder interface branching into three visual output forms: an illustrative next-token label distribution selecting Shipping, a highlighted tracking-number span copied from a ticket, and a four-step generated reply whose context grows by one token per row. One decoder interface branching into three visual output forms: an illustrative next-token label distribution selecting Shipping, a highlighted tracking-number span copied from a ticket, and a four-step generated reply whose context grows by one token per row.
The same open continuation slot can hold one label, a copied span, or a reply built over several decode steps. The prompt defines the desired output shape; the model still predicts the next token from its left context.

The prompt framing itself is simple enough to inspect directly:

prompt-framing-as-generation.py
1def build_support_label_prompt(message: str) -> str: 2 examples = [ 3 ("I never received my package.", "Shipping"), 4 ("I want my money back.", "Refund"), 5 ("I forgot my password.", "Account"), 6 ] 7 header = "Classify the customer message into Refund, Shipping, or Account." 8 shots = "\n\n".join( 9 f"Message: {text}\nLabel: {label}" for text, label in examples 10 ) 11 return f"{header}\n\n{shots}\n\n---\n\nMessage: {message}\nLabel:" 12 13prompt = build_support_label_prompt("My order says delivered but it's not here.") 14print(prompt) 15print("expected_next_token=Shipping") 16print("why=the prompt pattern ends with a label slot")
Prompt framing output
1Classify the customer message into Refund, Shipping, or Account. 2 3Message: I never received my package. 4Label: Shipping 5 6Message: I want my money back. 7Label: Refund 8 9Message: I forgot my password. 10Label: Account 11 12--- 13 14Message: My order says delivered but it's not here. 15Label: 16expected_next_token=Shipping 17why=the prompt pattern ends with a label slot

The fork: reading and writing split apart

The original 2017 Transformer had two parts: an Encoder (to read the input) and a Decoder (to generate the output). Later models specialized these roles.

  1. The encoder-only path (BERT). Google released BERT in 2018.[2] It used an encoder stack. It conditioned on tokens on both sides of each position and could be fine-tuned for tasks such as classification and extraction. For our support example, a BERT-style classifier can be trained to map a ticket to a label such as Refund. But BERT wasn't built to write long answers token by token.
  2. The decoder-only path (GPT). OpenAI released GPT-1 in 2018.[3] It used a decoder-style Transformer stack with masked self-attention, without the original translation decoder's encoder input. It read a prefix left to right and learned to predict the next token. It could write because generation was baked into the training objective. For our support example, GPT can draft a full reply to the customer, one token at a time.

Encoder-only models became common for language-understanding tasks. Decoder-only models became the common architecture for general assistants. One practical reason is that generation is a flexible interface. If a model can generate text, you can ask for translation, summarization, coding, planning, or classification in the same format: prompt in, text out.

Why the fork matters for products

The fork wasn't just academic. It shaped how products are built today.

Product needEncoder-style fitDecoder-style fit
Classify a support ticketStrong fit. Read full ticket, output label.Works if framed as generation, but not the original strength.
Generate a replyPoor fit without extra decoder machinery.Strong fit. Predict one token at a time.
Search over documentsStrong fit when using a retrieval-specific encoder or reranker.Useful when paired with generated answers.
Write codeNot the natural interface.Strong fit because code is a token sequence.

The main lesson isn't "BERT bad, GPT good." Encoder-style models are still useful for classification, extraction, reranking, and retrieval-specific embeddings.[4] GPT-style decoders became the natural assistant interface because a single text-generation loop can express many tasks.

What "decoder-only" means in plain English

Think of a decoder-only model as a writer with a bookmark:

  1. It reads everything to the left of the bookmark.
  2. It writes the next token.
  3. The bookmark moves one token to the right.
  4. It repeats the same loop.
  5. It can't read future tokens because there are no future tokens yet.

That simple constraint is why the same architecture can write a French translation:

text
1Translate English to French: 2refund -> remboursement 3shipment -> expédition 4invoice ->

and produce:

text
1facture

or write a support reply:

Customer: Where is my order?

Agent: I'm sorry for the delay. Your tracking number is 1Z999AA10123...


No separate model was trained for each of these exact prompts. The model treats the examples and the conversation as context, then continues the pattern.

The attention visibility rule is small enough to inspect. In a causal decoder, a position can attend to itself and earlier positions, never future positions. In a bidirectional encoder, every input position is visible while building a representation.

visibility-masks.py
1import numpy as np 2 3tokens = ["<bos>", "refund", "approved", "<eos>"] 4causal = np.tril(np.ones((len(tokens), len(tokens)), dtype=int)) 5bidirectional = np.ones_like(causal) 6 7print("tokens:", tokens) 8print("causal row for 'approved': ", causal[2].tolist()) 9print("bidirectional row for 'approved':", bidirectional[2].tolist()) 10print("future visible in causal row:", int(causal[2, 3]))
Visibility mask output
1tokens: ['<bos>', 'refund', 'approved', '<eos>'] 2causal row for 'approved': [1, 1, 1, 0] 3bidirectional row for 'approved': [1, 1, 1, 1] 4future visible in causal row: 0

The 0 for the future end token is the constraint that makes autoregressive generation valid. Real attention learns nonuniform weights over visible positions; this matrix only shows which connections are allowed.

The life of one token

To understand why later innovations matter, let's follow one token through the full decoder-only pipeline using a customer support question: "Where is my order?"

The sentence first goes through tokenization, which breaks it into token IDs (for example, a tokenizer might assign one token representing order an ID such as 1847, while punctuation and spaces may be separate or bundled with nearby text). Those IDs become vectors through an embedding lookup table. Each token ID maps to a learned high-dimensional vector used by the model's layers.

Next, causal self-attention lets each position mix information from allowed positions at or before it. For the representation at "order," that includes "Where," "is," and "my." The model computes Query, Key, and Value projections and uses attention weights to build a context-enriched vector for the current position. After attention, the feed-forward network transforms that vector independently, applying nonlinear transformations that turn raw context into useful features.


Finally, after any stack-level final normalization, the refined vector for the last position is projected through a large output matrix to produce scores over the vocabulary. GPT-2, for example, applies a final ln_f before computing logits.[5] Softmax turns those scores into probabilities, and a decoding policy chooses a next token. During generation, the KV cache stores Key and Value tensors from prior tokens so the model doesn't recompute those projections at each new step.

The same loop runs for every single token the model ever produces. At production scale, even small gains in how attention is computed, how many parameters are active per token, how positions are encoded, or how the KV cache is managed translate into large wins on latency, cost, and the maximum context length you can afford to serve.

Decoder token shape trace showing five prompt tokens mapped to a five-by-eight embedding matrix, a five-by-five causal visibility mask and transformed hidden states, an illustrative next-token probability chart selecting Your, and a KV cache growing from five to six columns after the token is appended. Decoder token shape trace showing five prompt tokens mapped to a five-by-eight embedding matrix, a five-by-five causal visibility mask and transformed hidden states, an illustrative next-token probability chart selecting Your, and a KV cache growing from five to six columns after the token is appended.
A simplified five-token prompt becomes a row of vectors, passes through causal attention and feed-forward transformations, produces one next-token distribution from the final row, and extends both context and KV cache by one column.

A small shape trace makes that interface concrete. This isn't a Transformer implementation: random arrays stand in for learned final normalized hidden states and output weights. It exposes the two production objects to remember: last-position logits and cached prior Key/Value tensors.

decoder-output-and-cache-shapes.py
1import numpy as np 2 3rng = np.random.default_rng(3) 4batch, prompt_tokens, d_model, vocab_size = 1, 5, 8, 6 5final_normalized_hidden = rng.normal(size=(batch, prompt_tokens, d_model)) 6output_projection = rng.normal(size=(d_model, vocab_size)) 7last_logits = final_normalized_hidden[:, -1, :] @ output_projection 8shifted = last_logits - last_logits.max(axis=-1, keepdims=True) 9probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True) 10 11layers, kv_heads, head_dim = 4, 2, 4 12kv_cache_shape = (layers, 2, batch, kv_heads, prompt_tokens, head_dim) 13print("final normalized states:", final_normalized_hidden.shape) 14print("last-position logits:", last_logits.shape) 15print("next-token probabilities:", probabilities.shape, "sum=", round(float(probabilities.sum()), 3)) 16print("cached K/V shape:", kv_cache_shape)
Decoder shape trace output
1final normalized states: (1, 5, 8) 2last-position logits: (1, 6) 3next-token probabilities: (1, 6) sum= 1.0 4cached K/V shape: (4, 2, 1, 2, 5, 4)

Why scaling became a design variable

In 2020, OpenAI published work on scaling laws for neural language models.[6] The core finding: loss improves predictably with three factors.

  1. Parameter count (how big the model is)
  2. Dataset size (how much text it reads)
  3. Compute (how many training FLOPs, floating-point operations, you spend)

Between GPT-1 and GPT-3, OpenAI released GPT-2 in 2019, a 1.5-billion parameter decoder-only model that demonstrated surprisingly coherent open-ended generation and drew public attention to both the power and risks of scaling up autoregressive language models.[7]

OpenAI then trained GPT-3, a 175-billion parameter decoder-only model.[8] It was more than 100 times larger than GPT-2.

GPT-3 demonstrated in-context learning at a surprising scale.[8] You didn't need to retrain the model for every new task. You could show a few examples in the prompt, and the model often inferred the pattern on the fly. This made prompt engineering a practical developer skill.

Scaling laws with concrete numbers

Scaling laws are often summarized as "bigger is better." That's too sloppy. The useful statement is narrower and more balanced.

Consider two hypothetical training runs:

ModelParametersTraining tokensCompute (approx)
Small but balanced7 billion140 billion~1x
Large but starved175 billion300 billion~54x

In this hypothetical table, the 175B model has 25x more parameters, but it was trained on only about 2x more tokens. The Chinchilla paper showed that many large models were undertrained for their size: a smaller model trained on proportionally more data can be far more efficient per parameter.[9]

This toy calculation makes the imbalance visible. parameters x tokens is only a training-compute proxy, not a quality predictor or a substitute for measured evaluation.

tokens-per-parameter.py
1runs = { 2 "small balanced": {"params_b": 7, "tokens_b": 140}, 3 "large token-starved": {"params_b": 175, "tokens_b": 300}, 4} 5baseline_proxy = runs["small balanced"]["params_b"] * runs["small balanced"]["tokens_b"] 6 7for name, run in runs.items(): 8 tokens_per_parameter = run["tokens_b"] / run["params_b"] 9 compute_proxy = run["params_b"] * run["tokens_b"] / baseline_proxy 10 print( 11 f"{name:20s} tokens/parameter={tokens_per_parameter:5.1f} " 12 f"compute proxy={compute_proxy:5.1f}x" 13 )
Scaling allocation output
1small balanced tokens/parameter= 20.0 compute proxy= 1.0x 2large token-starved tokens/parameter= 1.7 compute proxy= 53.6x

For engineers, this changes how you read model releases. Check four questions:

  • How many parameters does it have?
  • How much data was it trained on?
  • What was the compute budget?
  • How was the data filtered and which workload improved?

A smaller model trained well can beat a larger model trained poorly. That's why modern model announcements talk about tokens, data quality, context windows, tool use, and inference efficiency, not only parameter count.

From autocomplete to assistant

GPT-3 was powerful, but it was still a document completer. If you prompted it with "Write a return-policy email", it might continue with another prompt instead of answering directly, because the base objective was "continue the text."

To make models respond to instructions more reliably, InstructGPT used supervised demonstrations followed by human-ranked model outputs, a learned reward model, and reinforcement-learning fine-tuning.[10] The important distinction is behavioral: a base model learns continuation, while an instruction-tuned model is adapted to answer requests.

Base model, instruct model, and chat model

These words get mixed together, so separate them early.

Model typeWhat it's trained to doExample behavior
Base modelContinue text from pre-training distribution.May complete a list, article, code file, or prompt transcript.
Instruct modelFollow written instructions and output useful answers.Responds directly when asked to summarize or explain.
Chat modelHandle multi-turn role-based conversations.Tracks user and assistant turns, follows system instructions, and applies safety behavior.

A base model can be powerful but awkward. An instruct model is easier to use. A chat model adds conversation structure and behavior tuning.

Here is the same user request through three lenses:

RequestBase-model tendencyChat-model tendency
Write a refund email.Might continue with more prompts or examples.Writes the email.
Explain attention.Might produce documentation-like continuation.Gives an answer adapted to the user.
Return JSON only.Might or might not follow format.More likely to follow format, especially with structured output controls.

This difference is why product teams usually start with an instruct or chat checkpoint instead of raw pre-training weights.

A common pairwise preference record has the same prompt paired with a chosen answer and a rejected answer. InstructGPT collected rankings of model outputs; pairwise records are a useful simplified view of the comparison signal.[10] The labels capture desired behavior; this code does not train a reward model.

preference-records.py
1comparisons = [ 2 { 3 "prompt": "Order A10234 arrived damaged on day 12. Apply the 30-day return policy.", 4 "chosen": "Approve a damaged-item return under the 30-day policy.", 5 "rejected": "All sales are final.", 6 "required_fact": "30-day", 7 }, 8 { 9 "prompt": "Shipment S52 is delayed. Give the carrier scan time.", 10 "chosen": "The latest carrier scan was 09:32 UTC at the regional hub.", 11 "rejected": "It will arrive today.", 12 "required_fact": "09:32", 13 }, 14] 15 16for row in comparisons: 17 keeps_required_fact = row["required_fact"] in row["chosen"] 18 print(f"prompt={row['prompt'][:22]}... chosen_keeps_fact={keeps_required_fact}") 19 print(" preferred:", row["chosen"])
Preference record output
1prompt=Order A10234 arrived d... chosen_keeps_fact=True 2 preferred: Approve a damaged-item return under the 30-day policy. 3prompt=Shipment S52 is delaye... chosen_keeps_fact=True 4 preferred: The latest carrier scan was 09:32 UTC at the regional hub.

Open weights and local inference

Meta introduced a different distribution model.

In early 2023, Meta introduced LLaMA (Large Language Model Meta AI), a family of foundation models, and released its models to the research community.[11] That research-focused access gave approved users checkpoints they could evaluate and adapt instead of accessing a model only through a hosted interface. Downloadable weights and unrestricted commercial use are separate questions, which is why the license check below matters.

Researchers also didn't need 175 billion parameters for every useful workload. Chinchilla-style scaling showed that many models were undertrained for their size: smaller models trained on more tokens could be far more compute-efficient.[9] Independently, downloadable weights enable techniques such as fine-tuning (adapting a pre-trained model on task-specific examples) and quantization (using lower-precision weight representations to reduce memory cost), subject to model quality and license checks.

Open-weight doesn't mean open-source

This distinction matters in real engineering work.

TermMeaningWhat to check
Open weightsThe model checkpoint is downloadable.License, commercial use, redistribution, acceptable use policy.
Open-source codeThe training or inference code is available under an open-source license.License, dependencies, reproducibility.
Open dataThe training dataset is available or documented enough to inspect.Data license, filtering, privacy, safety.

Many model families are open-weight but not fully open-source in the strict software-license sense. That doesn't make them useless. It means you need to read the license before building a business around them.

For a startup, open weights can unlock:

  1. Local inference for privacy-sensitive data.
  2. Fine-tuning on domain examples.
  3. Quantization for cheaper serving.
  4. Debugging and evaluation without sending every token to a hosted API.

They also create work:

  1. You need GPU capacity.
  2. You own uptime.
  3. You own safety filters.
  4. You own model upgrade testing.
  5. You need a license review.

Modern efficiency and inference choices

Beyond raw scaling, model builders can change how much computation and memory each generated token requires. This matters to engineers because a serving budget depends on the active computation path and stored state, not only a headline parameter total.

Mixture of Experts (MoE)

Mixture-of-Experts (MoE) models activate only part of the network for each token. Mixtral, for example, routes each token to a small subset of expert feed-forward networks.[12] This gives the model more total capacity while keeping active compute lower than a dense model of the same total size.

Think of it like a specialized fulfillment center. You don't send every parcel through every station. A fragile return goes to inspection and repacking, while a normal outbound order goes straight to pick-pack-ship. Most stations stay idle for that parcel. In an MoE model, most expert parameters are inactive for any given token. That lowers active computation relative to a dense model with the same total size, though the full expert pool still consumes memory and routing adds systems tradeoffs.

A toy router shows the mechanism. Its labels are for explanation only: trained experts aren't guaranteed to align neatly with human topics.

top-two-expert-router.py
1import numpy as np 2 3experts = np.array(["returns", "shipping", "catalog", "billing"]) 4tokens = ["refund", "delayed"] 5router_logits = np.array([ 6 [2.4, 0.3, 0.5, -0.2], 7 [0.1, 2.2, 0.7, 0.0], 8]) 9 10for token, logits in zip(tokens, router_logits): 11 selected = np.argsort(logits)[-2:][::-1] 12 weights = np.exp(logits[selected] - logits[selected].max()) 13 weights /= weights.sum() 14 routes = list(zip(experts[selected].tolist(), np.round(weights, 3).tolist())) 15 print(f"{token:7s} -> {routes}")
MoE routing output
1refund -> [('returns', 0.87), ('catalog', 0.13)] 2delayed -> [('shipping', 0.818), ('catalog', 0.182)]

Reasoning models and test-time compute

Standard decoder LLMs generate one token at a time. Reasoning-focused models spend additional inference-time computation before producing a final answer. OpenAI's o1-preview described this approach in product form in 2024.[13] DeepSeek-R1 reported reasoning behavior developed through reinforcement-learning stages in 2025.[14] For applicable OpenAI reasoning API models, official documentation exposes a model-dependent reasoning-effort control that trades speed and token use against additional reasoning work.[15] Other model families require their own documentation and evaluation.

For our support example, a standard model might immediately answer a complex refund policy question and get a detail wrong. A reasoning model can spend extra inference-time compute on intermediate reasoning before producing the final answer, which can help it check the order date, the policy window, and the payment method more carefully. Training methods for reasoning and preference tuning get a dedicated chapter later; here you only need the shape of the idea.

The engineering stack that makes it fast

Modern inference relies on several optimizations that sit below the architecture headlines:

  • Grouped-query attention (GQA). Instead of every Query head keeping separate Key and Value heads, groups of Query heads share Key and Value heads. GQA aims to retain quality close to multi-head attention while reducing inference cost.[16]
  • Rotary positional embeddings (RoPE). Rather than adding fixed position vectors to token embeddings, RoPE rotates pairs of Query and Key features by a position-dependent angle. This gives self-attention an explicit relative-position signal, though length extrapolation still needs careful training and evaluation.[17]
  • FlashAttention. Standard attention reads and writes large intermediate matrices in GPU memory, which becomes expensive for long sequences. FlashAttention is an exact attention algorithm that tiles the computation to reduce memory traffic between high-bandwidth memory and on-chip SRAM.[18]

For GQA, the serving consequence can be calculated directly: the KV cache scales with the number of Key/Value heads.

gqa-kv-cache-accounting.py
1batch, layers, tokens, head_dim, bytes_per_value = 8, 32, 8192, 128, 2 2query_heads = 32 3 4for label, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]: 5 cached_values = batch * layers * tokens * kv_heads * head_dim * 2 6 cache_gib = cached_values * bytes_per_value / (1024 ** 3) 7 relative = kv_heads / query_heads 8 print(f"{label}: kv_heads={kv_heads:2d} cache={cache_gib:5.2f} GiB relative={relative:.3f}")
KV cache comparison output
1MHA: kv_heads=32 cache=32.00 GiB relative=1.000 2GQA: kv_heads= 8 cache= 8.00 GiB relative=0.250 3MQA: kv_heads= 1 cache= 1.00 GiB relative=0.031

RoPE's rotation also exposes a useful invariant: shifting both token positions by the same amount preserves their relative-position interaction in this two-dimensional example.

rope-relative-position.py
1import numpy as np 2 3def rotate(vector: np.ndarray, position: int, theta: float = 0.4) -> np.ndarray: 4 angle = position * theta 5 matrix = np.array([ 6 [np.cos(angle), -np.sin(angle)], 7 [np.sin(angle), np.cos(angle)], 8 ]) 9 return matrix @ vector 10 11query = np.array([1.0, 0.2]) 12key = np.array([0.3, 0.9]) 13same_gap_early = rotate(query, 4) @ rotate(key, 2) 14same_gap_late = rotate(query, 14) @ rotate(key, 12) 15different_gap = rotate(query, 14) @ rotate(key, 2) 16print("same gap, early:", round(float(same_gap_early), 6)) 17print("same gap, late: ", round(float(same_gap_late), 6)) 18print("different gap: ", round(float(different_gap), 6)) 19print("same gap equal:", bool(np.isclose(same_gap_early, same_gap_late)))
RoPE relative position output
1same gap, early: 0.936998 2same gap, late: 0.936998 3different gap: -0.794779 4same gap equal: True

GQA, RoPE, and FlashAttention don't change the next-token objective. They change cache cost, position handling, or attention memory traffic, which must be evaluated in the actual serving setup.

How to read a modern model announcement

When a new model appears, avoid one-metric thinking. Read it like an engineer.

ClaimQuestion to ask
"1T parameters"Total parameters or active parameters per token? Dense or MoE?
"1M context"What does latency and KV cache memory look like at that length?
"Better coding"Which benchmark, which harness, and how many attempts?
"Reasoning model"Does it use extra test-time compute, visible scratchpad, hidden reasoning, or tool calls?
"Open model"Open weights, open license, open data, or open training recipe?
"Cheap API"Input, output, cache-hit, cache-miss, batch, and long-context prices?

This habit keeps you from overreacting to headlines.

Three visual axes for evaluating a generative model: dense and mixture-of-experts activation matrices compare eight-of-eight versus two-of-eight active experts, normalized low-to-high effort curves compare quality with token and latency cost, and an operations-versus-control quadrant places hosted API, dedicated endpoint, and local open-weight deployment. Three visual axes for evaluating a generative model: dense and mixture-of-experts activation matrices compare eight-of-eight versus two-of-eight active experts, normalized low-to-high effort curves compare quality with token and latency cost, and an operations-versus-control quadrant places hosted API, dedicated endpoint, and local open-weight deployment.
Separate total capacity from active per-token work, measure whether additional inference effort earns enough quality, and choose a deployment point whose control and privacy justify its operational burden.

Evaluate a released model on your workload

History is useful only if it improves a decision. A published benchmark is evidence about a harness and a date, not proof that a model meets your support workload's quality, latency, privacy, or cost constraints.

Build an evaluation set of representative cases, such as damaged-item returns that require policy citations and delayed shipments that require exact scan times. Measure each candidate under the same prompt and tool setup. A scorecard can make tradeoffs explicit; its weights are product decisions, not universal truths.

model-workload-scorecard.py
1candidates = [ 2 {"name": "hosted-fast", "quality": 0.86, "p95_ms": 250, "cost": 0.50, "license_ok": True}, 3 {"name": "hosted-deep", "quality": 0.94, "p95_ms": 1200, "cost": 1.50, "license_ok": True}, 4 {"name": "local-restricted", "quality": 0.90, "p95_ms": 500, "cost": 0.20, "license_ok": False}, 5] 6 7def score(model: dict, latency_penalty: float, cost_penalty: float) -> float: 8 if not model["license_ok"]: 9 return float("-inf") 10 return 100 * model["quality"] - latency_penalty * model["p95_ms"] - cost_penalty * model["cost"] 11 12scenarios = { 13 "live support": (0.03, 8.0), 14 "nightly audit": (0.003, 4.0), 15} 16for scenario, penalties in scenarios.items(): 17 ranked = sorted(candidates, key=lambda model: score(model, *penalties), reverse=True) 18 winner = ranked[0] 19 print(f"{scenario:13s} winner={winner['name']} score={score(winner, *penalties):.2f}")
Workload scorecard output
1live support winner=hosted-fast score=74.50 2nightly audit winner=hosted-deep score=84.40

Three architectural questions recur across model documentation:

TrendWhat it isWhy an engineer cares
Routed computationA model may activate a subset of expert parameters per token.For MoE models, total parameter counts alone don't predict serving cost; ask for active parameters.
Optional reasoning effortSome APIs expose more inference-time work per request.Measure whether quality gains justify additional tokens and latency on hard cases.
Long input windowsDocumentation may advertise large context limits.Capacity doesn't prove reliable use of evidence at every position.

The throughline is durable: the decoder-only next-token loop remained; scale increased capability, instruction tuning adapted behavior, downloadable weights changed deployment choices, and inference techniques changed cost and evaluation questions.

Context attention varies

A common beginner mistake is assuming that a large advertised context window guarantees equally reliable use of evidence at every position. It doesn't.

Research on how language models use long contexts revealed a recurring pattern called "Lost in the Middle." When a model is given a very long prompt, it often performs best on relevant information at the beginning and the end. Relevant evidence placed in the middle of a long input is more likely to be missed or misused.[19]

For a support system, this has a direct practical implication. If you paste a 50-page return policy into the prompt and bury the customer's specific case details in the middle, the model might miss the case details. The fix is structural: keep important instructions and the most relevant context near strong positions such as the beginning or the end of the prompt, or break long documents into chunks and retrieve only the relevant ones.


You can enforce a prompt assembly rule in code. This rule places the governing policy first and the retrieved case evidence next to the question. It doesn't prove the model will answer correctly; evaluate that on held-out cases.

evidence-near-question.py
1policy = "Policy: damaged items may be returned within 30 days." 2older_notes = [ 3 "History: customer updated email address.", 4 "History: promotional coupon applied.", 5 "History: delivery notification sent.", 6] 7case = "Evidence: order A10234 arrived damaged on day 12." 8question = "Question: approve return? Cite the policy and evidence." 9 10prompt_lines = [policy, *older_notes, case, question] 11for position, line in enumerate(prompt_lines, start=1): 12 print(f"{position}: {line}") 13print("evidence_adjacent_to_question:", prompt_lines[-2] == case)
Context packing output
11: Policy: damaged items may be returned within 30 days. 22: History: customer updated email address. 33: History: promotional coupon applied. 44: History: delivery notification sent. 55: Evidence: order A10234 arrived damaged on day 12. 66: Question: approve return? Cite the policy and evidence. 7evidence_adjacent_to_question: True

This behavior is one reason to use retrieval-augmented generation (RAG). Instead of dumping every document into the context window, you retrieve only the most relevant passages and place them strategically.

A quick check: match the architecture to the job

Test your understanding by deciding which architecture is a good default for each task.

The table below uses the support and fulfillment examples we have been following, so you can see how the choice changes based on the interface your product needs.

TaskGood defaultWhy
Rank support tickets by urgencyEncoder-only (BERT-style)Reads the full text at once and outputs a score or label.
Write a personalized shipping-delay apologyDecoder-only (GPT-style)Generates text token by token in a left-to-right loop.
Find the most similar past ticket from a databaseBi-encoder, often encoder-styleEncodes tickets separately so their embedding vectors can be indexed and compared with cosine similarity.[4]
Generate a warehouse pick-list summaryDecoder-only (GPT-style)Treats a structured report as a generated token sequence.

If you chose decoder-only for the ranking task, remember that generation is flexible, but an encoder can still be the better tool for embeddings and classification.

Common misconceptions

These mistakes are easy to make because headlines and marketing tend to oversimplify. If you catch yourself thinking any of the following, pause and check the correction.

These misconceptions are particularly dangerous in team settings, where one person's shorthand can shape architecture decisions for months.

MisconceptionSymptomCauseFix
"GPT means any LLM."You call every assistant "GPT" in architecture discussions.GPT is the most famous brand, but it's one decoder-only family.Use "LLM" or "language model" for the general category.
"BERT is obsolete."You ignore encoder models for retrieval and classification.Decoder-only assistants get the headlines.Encoder-style models are still useful for classification, extraction, and retrieval-specific embeddings.
"Bigger model wins by default."You judge models only by parameter count.Marketing emphasizes large numbers.Check data quality, compute budget, tuning, and inference cost. A smaller, well-trained model often beats a larger, starved one.
"Open-weight means free to use however I want."You deploy an open-weight model commercially without reading the license.The word "open" sounds permissive.The license controls what you can do. Read it before building a business around the model.
"Reasoning models are a different species."You assume a model label proves an unrelated architecture.Product names obscure the model and inference details.Read documentation: many reasoning-focused generative models still decode autoregressively while spending extra inference-time computation.

Fixing these misconceptions early will save you from bad architecture decisions and from overpromising to stakeholders when you talk about model capabilities.

Carry these checks forward

After this chapter, you should be able to:

  1. Explain why the original Transformer split into encoder-only (BERT) and decoder-only (GPT) paths, and why generation became the broader interface.
  2. Walk through the life of a token inside a decoder: tokenization, embedding, attention, feed-forward, softmax, and KV cache reuse.
  3. Read a model release with engineer skepticism: check parameters, data, compute, architecture, and license rather than headline claims.
  4. Choose between base, instruct, and chat checkpoints based on the product need.
  5. Recognize that long context windows aren't uniformly attentive, and place important instructions near strong positions such as the beginning or end of the prompt.
  6. Explain why model rankings are workload-specific and what an engineer should measure before picking one.

Mastery check

Key concepts

  • The split between Encoder-only (BERT) and Decoder-only (GPT) models
  • Scaling laws: why making models bigger actually worked
  • The shift from GPT-3 to in-context learning
  • Open weights: LLaMA, license checks, local inference, and fine-tuning workflows
  • Modern architectural shifts: MoE (Mixture of Experts) and reasoning-time computation
  • The life of a token: tokenization, embedding, attention, and KV cache
  • Context window behavior: lost in the middle and prompt placement
  • Base, instruct, and chat model behavior
  • How to read a model announcement without relying on hype

Evaluation rubric

  • Foundational: Differentiates between the original Transformer (Encoder-Decoder) and the GPT architecture (Decoder-only)
  • Foundational: Explains why decoder-only generation can express classification, extraction, translation, coding, and chat as next-token continuation
  • Foundational: Explains the significance of OpenAI's Scaling Laws
  • Intermediate: Describes how GPT-3 demonstrated that large models can perform zero-shot and few-shot tasks without fine-tuning
  • Intermediate: Identifies how downloadable model weights change evaluation and deployment options, subject to license terms
  • Intermediate: Separates base models, instruct models, and chat models by behavior and tuning objective
  • Intermediate: Explains why open-weight doesn't automatically mean open-source or unrestricted commercial use
  • Intermediate: Describes the lost-in-the-middle effect and why long context windows aren't uniformly useful
  • Intermediate: Explains why "best model" is workload-specific and lists the evaluation questions that matter more than a headline score

Follow-up questions

Common pitfalls

  • Symptom: You estimate serving cost from total parameters alone. Cause: You didn't check whether a model routes tokens through experts. Fix: Record active parameters, KV-cache requirements, measured throughput, and latency.
  • Symptom: You download weights and plan a deployment before license review. Cause: You treated open-weight as unrestricted open-source. Fix: Check license, acceptable-use terms, redistribution, and commercial-use requirements first.
  • Symptom: Your support assistant misses decisive case evidence in a long prompt. Cause: You assumed context capacity guaranteed uniform evidence use. Fix: Retrieve relevant passages, place them near the question, and evaluate positional robustness.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.Your team needs one support model that can draft apology emails, summarize tickets, translate snippets, and sometimes output a label from a few-shot prompt. Which architecture is the better default and why?
2.A decoder-only model receives a support prompt and begins generating a reply. Which sequence correctly describes its processing and reuse of prior state?
3.A raw pretrained checkpoint is given Write a refund email. and continues with more prompt-like text instead of answering. Which model type is the better starting point for a support bot?
4.Two training runs use these allocations: run A has 7B parameters and 140B training tokens; run B has 175B parameters and 300B training tokens. Which conclusion follows from the tokens-per-parameter comparison?
5.A live-support score is 100 * quality - 0.03 * p95_ms - 8 * cost, and candidates with an invalid license are disqualified. Which candidate wins?
6.A startup wants to run downloadable model weights locally on private support tickets and sell the resulting service. Which plan is technically and legally sound?
7.A model advertises a very large total parameter count but routes each token through only a small subset of feed-forward experts. What follows for serving estimates?
8.A support assistant receives a 50-page policy, but the decisive order details are buried in the middle and are often missed. Which change directly addresses this failure mode?
9.An attention model uses 32 Query heads and currently has 32 Key/Value heads with a 32 GiB KV cache. If GQA reduces it to 8 Key/Value heads with all other dimensions fixed, what cache size should you expect?
10.A reasoning-capable API exposes an effort control. How should a team decide whether to raise it for complex refund-policy cases?

10 questions remaining.

Next Step
Continue to Prompt Engineering Fundamentals

You'll turn model behavior and context placement into tested instructions.

PreviousLanguage Modeling & Next Tokens
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Attention Is All You Need.

Vaswani, A., et al. · 2017

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Devlin, J., et al. · 2019 · NAACL 2019

Improving Language Understanding by Generative Pre-Training.

Radford, A., et al. · 2018

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019

GPT-2 Source Implementation.

OpenAI · 2019

Scaling Laws for Neural Language Models

Kaplan et al. · 2020

Language Models are Unsupervised Multitask Learners.

Radford, A., et al. · 2019

Language Models are Few-Shot Learners.

Brown, T., et al. · 2020 · NeurIPS 2020

Training Compute-Optimal Large Language Models.

Hoffmann, J., et al. · 2022 · NeurIPS 2022

Training Language Models to Follow Instructions with Human Feedback (InstructGPT).

Ouyang, L., et al. · 2022 · NeurIPS 2022

LLaMA: Open and Efficient Foundation Language Models.

Touvron, H., et al. · 2023

Mixtral of Experts.

Jiang, A. Q., et al. · 2024

Learning to reason with LLMs

OpenAI · 2024

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

DeepSeek Team · 2025

Reasoning models

OpenAI · 2026

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

RoFormer: Enhanced Transformer with Rotary Position Embedding.

Su, J., et al. · 2021

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., et al. · 2023 · TACL 2023