
๐Ÿ“EasyNLP Fundamentals

Language Modeling & Next Tokens

Discover how the simple objective of predicting the next word leads to complex reasoning, and understand the concept of autoregressive generation.

20 min read · Google, Meta, OpenAI +1 · 5 key concepts

By now, we understand the architecture of the Transformer, how it processes sequences, and how we measure its mistakes using cross-entropy loss. But what exactly is the model doing?

If you strip away the hype, a large language model (LLM) does one core thing: it predicts the next token.

This article explains language modeling, autoregressive generation, and how the deceptively simple task of "guessing the next token" can produce broad capabilities.

💡 Key insight: An LLM doesn't generate a whole paragraph at once. It predicts a single token, appends it to its input, and runs the process again. This is called autoregressive generation.

Figure: a visualization of autoregressive next-token prediction.
Visual anchor: trace the loop from prompt to prediction to appended token, because every generated token becomes part of the next input.

What is a language model?

At its core, a language model is a probability engine. Given a sequence of text, it assigns a probability distribution over the entire vocabulary to determine what token comes next.

If you give a language model the prompt: "The cat sat on the"

It will output a probability distribution over its 50,000+ vocabulary tokens:

  • mat (85%)
  • floor (10%)
  • couch (3%)
  • apple (0.0001%)

Probability table

| Candidate token | Probability | What reader should notice |
| --- | --- | --- |
| mat | 85% | Highest probability becomes the default continuation. |
| floor | 10% | Plausible alternatives remain available when sampling. |
| couch | 3% | Lower-ranked words can appear with higher temperature. |
| apple | 0.0001% | Grammatically possible tokens can still be semantically unlikely. |
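
To see a real distribution rather than the illustrative numbers above, the sketch below asks a small causal model for its next-token probabilities. It assumes the Hugging Face transformers library and the publicly available gpt2 checkpoint; any causal LM exposes the same interface, and the exact probabilities will differ from the table.

```python
# Sketch: next-token probabilities from a small causal LM.
# Assumes the Hugging Face "transformers" library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The cat sat on the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits      # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]         # scores for the next position only
probs = torch.softmax(next_token_logits, dim=-1)

# Print the five most likely continuations and their probabilities.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p.item():.4f}")
```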

Causal language modeling

This specific flavor of language modeling, reading from left to right and predicting the future, is called causal language modeling (or autoregressive modeling). GPT-style models use this objective.[1][2]

It's called "causal" because the model is bound by causality: it can only look at the past to predict the future. It can't peek ahead. When predicting the 5th word, an attention mask blocks it from looking at the 6th or 7th words.

Models like GPT (Generative Pre-trained Transformer) and Llama are Causal Language Models.

| Objective | Can read | Good for | Poor fit |
| --- | --- | --- | --- |
| Causal language modeling | Past tokens only | Generating text left to right | Filling a middle blank with future context |
| Masked language modeling | Left and right context | Classification, extraction, sentence understanding | Open-ended generation |

(Note: there is another type called masked language modeling, used by BERT, where a word in the middle of a sentence is hidden and the model looks at both the left and right context to guess the blank. This is great for reading comprehension but terrible for generating text, which is why modern LLMs are almost entirely causal.)
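
A minimal sketch of that causal mask in PyTorch, assuming a toy 5-token sequence: it is just a lower-triangular matrix applied to the attention scores before the softmax.

```python
# Sketch: the causal attention mask for a toy 5-token sequence.
# Position i may attend to positions 0..i; all future positions are blocked.
import torch

seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])

# In attention, blocked positions get a score of -inf before the softmax,
# so each token's attention weights sum to 1 over past tokens only.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```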


The autoregressive loop

When you ask ChatGPT to write a poem, it doesn't plan out the rhyming structure in advance and output the whole poem in one go. It generates the poem exactly one token at a time in a loop.

This is the autoregressive generation loop:

  1. Input: The model receives the prompt.
  2. Predict: It calculates the probabilities for the very next token.
  3. Sample: We pick a token from those probabilities (usually the most likely one, or we sample with temperature to add randomness).
  4. Append: That chosen token is glued to the end of the prompt.
  5. Repeat: The new, longer prompt is fed back into the model to predict the next token.
Diagram: the autoregressive generation loop.
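
Written out by hand, the loop is only a few lines. The sketch below again assumes the gpt2 checkpoint from transformers; in practice model.generate() wraps this same loop for you.

```python
# Sketch: the autoregressive loop, one token per iteration.
# Assumes the public "gpt2" checkpoint; model.generate() does this internally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids  # 1. input
temperature = 0.8  # <1.0 sharpens the distribution, >1.0 flattens it

for _ in range(20):  # generate 20 tokens
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]                 # 2. predict
    probs = torch.softmax(logits / temperature, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)           # 3. sample
    input_ids = torch.cat([input_ids, next_id[None]], dim=-1)   # 4. append
    # 5. repeat: the longer sequence is the input for the next pass

print(tokenizer.decode(input_ids[0]))
```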

This explains why generation speed is measured in tokens per second (t/s). The model has to run a forward pass through its parameters for every generated token.

The context window limit

Because the output is constantly being appended to the input, the text the model has to read keeps growing. If you generate a 1000-word essay, by the end, the model is reading the original prompt plus the 999 words it just generated.

This growing sequence must fit into the model's context window, its short-term memory limit. If the context window is 8,000 tokens and the prompt plus generated text reaches 8,001 tokens, the model can't continue without dropping earlier context.
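
One common, lossy way to keep going is to slide the window: drop the oldest tokens so the most recent ones still fit. A minimal sketch, assuming a hypothetical 8,000-token limit:

```python
# Sketch: naive left-truncation so a growing sequence fits a fixed context window.
# The 8,000-token limit is illustrative; real systems often summarize instead.
CONTEXT_WINDOW = 8_000

def fit_to_window(token_ids: list[int], max_tokens: int = CONTEXT_WINDOW) -> list[int]:
    """Keep only the most recent tokens that fit in the window."""
    if len(token_ids) <= max_tokens:
        return token_ids
    return token_ids[-max_tokens:]  # earliest context is silently dropped

sequence = list(range(7_998))        # pretend these are token ids
for new_token in (101, 102, 103):    # generation appends one token per step
    sequence.append(new_token)
    sequence = fit_to_window(sequence)
    print(len(sequence))             # 7999, 8000, 8000 (never exceeds the limit)
```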


How does "guessing words" lead to intelligence?

A common critique of LLMs is "it's just a stochastic parrot; it's just autocomplete."

If it's autocomplete, how can it write code, solve logic puzzles, or translate languages?

The answer lies in scale, diversity, and optimization. When you force a neural network to predict the next token across huge mixtures of code, math, dialogue, documentation, and books, simple word association isn't enough to minimize the loss function.

Worked prompt

Consider this prompt: "If John has 5 apples and eats 2, he has"

| Model family | Likely shortcut | Better internal feature |
| --- | --- | --- |
| Markov-style autocomplete | Nearby token pattern like "2, 3" | None; it mostly follows local statistics. |
| Large language model | Still predicts one token | A subtraction-like representation helps reduce loss across many examples. |

A simple Markov chain (basic autocomplete) might look at the word "2" and guess "3" because "2, 3" is common. But an LLM can learn that, across many similar math problems, representing subtraction helps it predict the next token more reliably than memorizing every possible combination of numbers.
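
To make "basic autocomplete" concrete, here is a bigram Markov predictor: it only counts which token literally followed which in its corpus (the toy corpus below is made up for illustration), so it has nowhere to represent subtraction.

```python
# Sketch: a bigram Markov "autocomplete". The toy corpus is made up for illustration.
from collections import Counter, defaultdict

corpus = "1 2 3 4 5 2 3 4 2 3".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def markov_next(word: str) -> str:
    """Return the continuation that most often followed `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(markov_next("2"))  # -> "3": the most frequent literal follower,
                         #    regardless of what arithmetic the prompt implies
```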

To predict text from a complex world, the model has pressure to learn useful approximations of that world. Grammar, facts, reasoning patterns, and code structure become helpful internal representations for minimizing next-token prediction loss.

Summary

  • Language models output a probability distribution over a vocabulary for the next token.
  • Causal language modeling strictly looks at the past to predict the future.
  • Autoregressive generation is a loop where the predicted token is appended to the input, and the process repeats.
  • The context window is the maximum number of tokens the model can hold in this input loop.
  • Reasoning emerges because understanding logic and world knowledge is the most efficient way to minimize next-token prediction error on a massive scale.

Next, continue to From GPT to Modern LLMs, which traces how next-token training scaled into today's model families.

Evaluation Rubric
  1. Defines what a language model is in terms of a probability distribution over a vocabulary
  2. Explains the autoregressive generation loop step by step
  3. Contrasts causal (left-to-right) modeling with masked (fill-in-the-blank) modeling
  4. Discusses why predicting the next word naturally forces the model to learn world knowledge
Common Pitfalls
  • Assuming LLMs have an internal 'thinking' loop separate from word prediction (the 'thinking' is entirely encoded in the word prediction process)
  • Confusing the context window (how much it can read at once) with the vocabulary (the set of valid words it can generate)
  • Thinking the model generates whole sentences at once
Key Concepts Tested
  • The core objective of a language model: predicting the next token given a sequence of tokens
  • Autoregressive generation: appending the predicted token to the input and looping
  • Causal language modeling vs masked language modeling (GPT vs BERT)
  • Context window limits and why generation slows down over time
  • How next-token prediction forces a model to learn grammar, facts, and reasoning
References

[1] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.

[2] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
