
Build GPT from Scratch Lab

Build and train a tiny GPT end to end on Shakespeare: tokenize with GPT-style subwords, remap active token IDs, run causal self-attention, track validation loss, save a checkpoint, and sample text.


Token shards are only useful once a model can learn from them. This lab shows that next step: a tiny decoder-only GPT (Generative Pre-trained Transformer) consumes token blocks, predicts next tokens, writes a checkpoint, and produces sample text you can judge.[1] (CS336: Language Modeling from Scratch, https://cs336.stanford.edu/)

The choice of reference implementation matters here. Karpathy's nanoGPT provides a compact end-to-end Transformer training codebase, but its fastest Shakespeare on-ramp is character-level. This lab keeps the tiny scale while switching to GPT-2 byte-level byte pair encoding (BPE), matching GPT-2's published input representation.[2] (nanoGPT, https://github.com/karpathy/nanoGPT) [3] (Language Models are Unsupervised Multitask Learners, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

The corpus is bundled here so readers can download the same raw text and rerun the experiment locally: Download shakespeare.txt. That file is Princeton's course mirror of Shakespeare text. We train on the full bundled corpus, not a single-play excerpt. The run stays CPU-friendly because the model and update budget are tiny.[4] (shakespeare.txt, https://www.cs.princeton.edu/courses/archive/spring20/cos302/files/shakespeare.txt)

Why use Shakespeare instead of a technical-log corpus? Because here the learning target is pure next-token mechanics: tokenization, causal masking, loss, checkpointing, and sampling. Once you understand the tiny GPT loop on a public corpus, the same pipeline transfers to code, runbooks, chat transcripts, and other domain text where you would usually fine-tune or continue pretraining.

Figure: Exact tensor-shape trace through the tiny GPT lab. A 12 by 48 token-ID batch becomes 12 by 48 by 96 embeddings, splits into four attention heads with 24 features each, forms a 48 by 48 lower-triangular attention matrix per head, and produces 12 by 48 by 17,485 logits trained against one-token-shifted labels. A parameter bar compares 4.82 million full-vocabulary tied weights with 1.68 million active-vocabulary weights.
One batch moves from `[12, 48]` token IDs to `[12, 48, 96]` embeddings, four 24-feature attention heads, a causal `[48, 48]` matrix per head, and `[12, 48, 17,485]` logits. The shifted row shows the next-token target, while active-vocabulary remapping cuts the tied vocabulary matrix by 65.2%.
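The shapes in that trace can be checked with plain arithmetic before any tensor is allocated. The sketch below is a bookkeeping aid, not part of the lab code: it assumes the configuration used later in this lab (two Pre-LN blocks, a 4x feed-forward expansion, biases on the linear layers, learned position embeddings, and a tied output head), and the helper name `linear_params` is hypothetical.

```python
# Assumed configuration, copied from the lab's listing further down.
batch_size, block_size = 12, 48
d_model, n_heads, n_layers = 96, 4, 2
active_vocab = 17_485

head_dim = d_model // n_heads
shapes = {
    "token_ids": (batch_size, block_size),
    "embeddings": (batch_size, block_size, d_model),
    "per_head_q": (batch_size, n_heads, block_size, head_dim),
    "attention": (batch_size, n_heads, block_size, block_size),
    "logits": (batch_size, block_size, active_vocab),
}


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of one dense layer (hypothetical helper)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * d_model  # one scale and one shift per feature
block_params = (
    layer_norm
    + linear_params(d_model, 3 * d_model)   # fused query/key/value projection
    + linear_params(d_model, d_model)       # attention output projection
    + layer_norm
    + linear_params(d_model, 4 * d_model)   # feed-forward expansion
    + linear_params(4 * d_model, d_model)   # feed-forward contraction
)
tied_vocab = active_vocab * d_model         # shared by embedding and output head
position = block_size * d_model
total = tied_vocab + position + n_layers * block_params + layer_norm

for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")
print(f"total trainable parameters: {total:,}")
print(f"share held by tied vocabulary matrix: {tied_vocab / total:.1%}")
```

Under these assumptions the tied vocabulary matrix holds most of the model's weights, which is why shrinking the vocabulary matters more here than trimming a layer.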

Why this lab feels different

Small GPT runs still have one non-negotiable contract:

  • tokenize raw text into IDs
  • let position t attend only to positions <= t
  • predict token t + 1
  • repeat across many packed windows
  • judge the checkpoint with both validation loss and sampled text

If any piece is wrong, the loop still runs and still prints numbers. Tokenizer, batching, model, and generation need to agree with each other instead of being studied as isolated concepts.

Five moving parts

| Piece | What it does | Failure mode |
| --- | --- | --- |
| Corpus | supplies raw language distribution | using too much text for CPU lab or too little text for recognizable output |
| Tokenizer | turns text into GPT-style subword IDs | forgetting that token IDs depend on exact tokenizer |
| Active-vocab remap | shrinks sparse GPT-2 token IDs into small local range | keeping a 50k-row tied vocabulary matrix in toy CPU lab |
| Block packer | slices token stream into fixed contexts and shifted labels | off-by-one windows or broken train/val split |
Decoder loopruns masked self-attention, loss, checkpoint, and generationmissing causal mask or trusting loss without sampling

That middle row matters. GPT-2 BPE can emit IDs anywhere in a 50k vocabulary. Our tiny lab still only activates a subset of those IDs, so we remap active token IDs into contiguous local IDs like 0..17484. Same subword tokenization. Smaller tied embedding/output matrix.

active-vocabulary-parameter-cost.py
```python
gpt2_vocab = 50_257
active_vocab = 17_485
d_model = 96

# GPT-2 uses token-embedding weights again for output logits, so one
# vocabulary-sized matrix determines this part of parameter cost.
full_weights = gpt2_vocab * d_model
compact_weights = active_vocab * d_model
reduction = 1 - compact_weights / full_weights

print(f"full vocab tied weights: {full_weights:,}")
print(f"compact tied weights: {compact_weights:,}")
print(f"toy-lab reduction: {reduction:.1%}")
```
Output
```text
full vocab tied weights: 4,824,672
compact tied weights: 1,678,560
toy-lab reduction: 65.2%
```

The remap is an efficiency device for this fixed corpus, not a replacement tokenizer. The model below also follows GPT-2's tied token-embedding/output-logit matrix.[5] (GPT-2 Source Implementation, https://github.com/openai/gpt-2/blob/master/src/model.py) This toy mapping is built from the full corpus before the split so validation IDs remain representable; a full GPT-2-vocabulary run wouldn't need that compromise. Compact remapping also means generated prompts must be expressible using token IDs present in the corpus.
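A minimal sketch of that remap, using a handful of made-up sparse IDs in place of real tokenizer output so it runs without downloading anything. The ID values and the helper name `missing_from_compact_vocab` are illustrative assumptions; the lab's own listing builds the same two dictionaries from the full corpus.

```python
# Hypothetical sparse token IDs standing in for GPT-2 BPE output.
corpus_token_ids = [15496, 995, 11, 262, 995, 11]

# Sort the distinct IDs and number them 0..N-1.
active_token_ids = sorted(set(corpus_token_ids))
token_to_local = {token_id: idx for idx, token_id in enumerate(active_token_ids)}
local_to_token = {idx: token_id for token_id, idx in token_to_local.items()}

# Forward map for training, inverse map for decoding samples back to text.
local_ids = [token_to_local[token_id] for token_id in corpus_token_ids]
restored = [local_to_token[idx] for idx in local_ids]


def missing_from_compact_vocab(prompt_token_ids: list[int]) -> list[int]:
    """Return prompt IDs that the compact vocabulary cannot represent."""
    return sorted(set(prompt_token_ids) - set(active_token_ids))


print("local ids:", local_ids)
print("round trip ok:", restored == corpus_token_ids)
print("unrepresentable prompt ids:", missing_from_compact_vocab([15496, 50000]))
```

The last call shows the cost of the shortcut: any prompt token that never appeared in the corpus has no local ID, so it has to be rejected rather than silently mapped.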

Audit the causal target before training

A costly silent error in a language-model lab is training each token to reproduce itself instead of predicting its successor. Check the shift on a tiny block before looking at a full training loop.

shifted-next-token-labels.py
```python
block = ["Good", " sir", ",", "\n", "Speak", " plain", "."]
x = block[:-1]
y = block[1:]

for current, target in zip(x, y):
    print(f"{current!r:>8} -> {target!r}")

assert y[0] == " sir" and y[-1] == "."
```
Output
```text
  'Good' -> ' sir'
  ' sir' -> ','
     ',' -> '\n'
    '\n' -> 'Speak'
 'Speak' -> ' plain'
' plain' -> '.'
```

A shifted target isn't enough: attention at each position must also be unable to read future inputs. During each forward pass, the model emits logits: one unnormalized score for every local vocabulary ID at every position. Cross-entropy compares those scores with the shifted targets. Inside attention, the causal mask is applied before softmax turns allowed scores into weights, so future positions receive no probability mass.

causal-mask-visibility.py
```python
import torch

mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
for row in mask.int().tolist():
    print(" ".join(map(str, row)))

assert mask[2].tolist() == [True, True, True, False]
print("position 2 can read positions:", [index for index, visible in enumerate(mask[2]) if visible])
```
Output
```text
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
position 2 can read positions: [0, 1, 2]
```
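The visibility pattern only matters because of where it is applied. As a rough illustration in plain Python (no PyTorch, made-up scores, and a hypothetical helper name `causal_softmax`), the sketch below sets hidden scores to negative infinity before the softmax and confirms that future positions end up with exactly zero weight while each row still sums to one.

```python
import math


def causal_softmax(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax where position t may only use columns <= t."""
    weights = []
    for t, row in enumerate(scores):
        # Mask first: future columns become -inf before normalization.
        masked = [s if j <= t else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)  # column t is always visible, so peak is finite
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Made-up raw attention scores for a three-token window.
scores = [
    [0.5, 2.0, -1.0],
    [1.0, 0.0, 3.0],
    [0.2, 0.1, 0.3],
]
weights = causal_softmax(scores)

for t, row in enumerate(weights):
    print(t, [round(w, 3) for w in row])
```

Masking after the softmax instead would leave the visible weights normalized against scores the position was never supposed to see, which is why the order matters.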

Runnable lab

This lab is more useful if the first result is still half-baked, so the notebook flow is staged:

  1. train a short warmup run
  2. sample the rough checkpoint
  3. keep training the same model longer
  4. sample again and compare

The first cell builds the whole pipeline in plain PyTorch, then trains only through step 80 (81 updates, counting step 0). That's enough to learn something, but not enough to look good.[6] (PyTorch: An Imperative Style, High-Performance Deep Learning Library, https://arxiv.org/abs/1912.01703)

build_gpt_from_scratch.py
```python
from pathlib import Path
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken

# Fix random seed so article output is reproducible.
torch.manual_seed(7)

# Load raw text exactly as learner will download it.
all_text = Path("assets/shakespeare.txt").read_text(encoding="utf-8")
corpus = all_text

# Use GPT-2 BPE so tokenization matches GPT-2's published input representation.
encoding_name = "gpt2"
encoder = tiktoken.get_encoding(encoding_name)
corpus_token_ids = encoder.encode(corpus)

# GPT-2 token ids are sparse across a 50k vocabulary. This run only needs
# tokens that actually appear inside bundled Shakespeare corpus, so remap them to 0..N-1.
# That keeps tied embedding/output weights much smaller without changing
# which subword pieces the tokenizer produced.
active_token_ids = sorted(set(corpus_token_ids))
token_to_local = {token_id: idx for idx, token_id in enumerate(active_token_ids)}
local_to_token = {idx: token_id for token_id, idx in token_to_local.items()}
ids = [token_to_local[token_id] for token_id in corpus_token_ids]

# Each training example needs block_size input tokens plus 1 next-token label.
block_size = 48

# Split once along original stream. Concatenating interleaved held-out chunks
# would create fake transitions where non-adjacent Shakespeare passages meet.
split_index = int(0.9 * len(ids))
train_ids = torch.tensor(ids[:split_index], dtype=torch.long)
val_ids = torch.tensor(ids[split_index:], dtype=torch.long)
batch_size = 12

# Keep training, validation, and text-sampling randomness independent. Logging
# one sample should never change which training windows the model sees next.
train_generator = torch.Generator().manual_seed(101)
val_generator = torch.Generator().manual_seed(202)

print(
    f"tokens={len(ids)} active_vocab={len(active_token_ids)} "
    f"train={len(train_ids)} val={len(val_ids)}"
)

def sample_batch(
    source: torch.Tensor,
    *,
    generator: torch.Generator,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Pick random starting positions from long token stream.
    starts = torch.randint(
        0,
        len(source) - block_size,
        (batch_size,),
        generator=generator,
    )

    # x is current context window.
    x = torch.stack([source[s:s + block_size] for s in starts])

    # y is same window shifted one token to left. This is whole learning target.
    y = torch.stack([source[s + 1:s + block_size + 1] for s in starts])
    return x, y

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 96, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # One linear layer projects each position into query, key, and value vectors.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

        # Lower-triangular mask blocks attention to future positions.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size, dtype=torch.bool)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seqlen, width = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(tensor: torch.Tensor) -> torch.Tensor:
            # Turn [batch, time, width] into [batch, heads, time, head_dim].
            return tensor.view(batch_size, seqlen, self.n_heads, self.head_dim).transpose(1, 2)

        q = split_heads(q)
        k = split_heads(k)
        v = split_heads(v)

        # Attention score = query-key similarity, scaled to keep softmax stable.
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Any position above diagonal is future token, so hide it from model.
        attn = attn.masked_fill(~self.mask[:seqlen, :seqlen], float("-inf"))
        attn = attn.softmax(dim=-1)

        # Weighted sum of value vectors produces contextualized representation.
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(batch_size, seqlen, width)
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, d_model: int = 96, n_heads: int = 4):
        super().__init__()

        # Pre-LN transformer block: normalize, attend, add residual, then MLP.
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 96, n_heads: int = 4, n_layers: int = 2):
        super().__init__()

        # Token embeddings say "which subword is this?".
        self.token_emb = nn.Embedding(vocab_size, d_model)

        # Position embeddings say "where is this token inside current window?".
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)

        # GPT-2 reuses token embedding weights for its output logits.
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.apply(self._init_weights)
        self.head.weight = self.token_emb.weight

    @staticmethod
    def _init_weights(module: nn.Module) -> None:
        # GPT-style small initialization is important once output weights are tied.
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, seqlen = x.shape
        positions = torch.arange(seqlen, device=x.device)

        # GPT adds token meaning and position meaning before any attention happens.
        h = self.token_emb(x) + self.pos_emb(positions)[None, :, :]
        for block in self.blocks:
            h = block(h)
        h = self.ln_f(h)
        return self.head(h)

# Build model and optimizer.
model_config = {"d_model": 96, "n_heads": 4, "n_layers": 2}
model = TinyGPT(vocab_size=len(active_token_ids), **model_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)

def evaluate(source: torch.Tensor) -> tuple[float, float]:
    # Average across a few validation batches so accuracy is less noisy.
    was_training = model.training
    model.eval()
    losses = []
    accuracies = []
    with torch.no_grad():
        for _ in range(8):
            x, y = sample_batch(source, generator=val_generator)
            logits = model(x)
            losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).item())
            accuracies.append((logits.argmax(dim=-1) == y).float().mean().item())
    model.train(was_training)
    return sum(losses) / len(losses), sum(accuracies) / len(accuracies)

def sample_completion(prompt: str, steps: int = 80) -> str:
    # Use a local generator so fixed-prompt monitoring cannot perturb training.
    sampling_generator = torch.Generator().manual_seed(17)

    prompt_token_ids = encoder.encode(prompt)
    missing_ids = sorted(set(prompt_token_ids) - set(active_token_ids))
    if missing_ids:
        raise ValueError(f"Prompt uses token ids outside compact corpus vocabulary: {missing_ids}")
    prompt_local_ids = [token_to_local[token_id] for token_id in prompt_token_ids]
    context = torch.tensor([prompt_local_ids], dtype=torch.long)

    was_training = model.training
    model.eval()
    with torch.no_grad():
        for _ in range(steps):
            # If sample gets longer than block size, GPT only sees most recent window.
            x = context[:, -block_size:]
            logits = model(x)

            # Only final position matters for next-token sampling.
            next_logits = logits[:, -1, :]

            # Restrict to top candidates so toy model doesn't wander too wildly.
            top_values, top_indices = torch.topk(next_logits, k=8, dim=-1)
            probs = torch.softmax(top_values / 0.9, dim=-1)
            sampled_index = torch.multinomial(
                probs,
                num_samples=1,
                generator=sampling_generator,
            )
            next_local_id = top_indices.gather(-1, sampled_index)

            # Append sampled token and continue autoregressive loop.
            context = torch.cat([context, next_local_id], dim=1)

    # Convert local ids back to original GPT-2 token ids, then decode to text.
    sample = encoder.decode([local_to_token[int(idx)] for idx in context[0]])
    model.train(was_training)
    return sample

for step in range(81):
    # 1. Draw random training batch.
    x, y = sample_batch(train_ids, generator=train_generator)

    # 2. Predict next-token logits for every position.
    logits = model(x)

    # 3. Compare logits against shifted targets.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

    # 4. Backpropagate and update weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print occasional train/val snapshots so learner can watch first useful learning happen.
    if step % 40 == 0:
        val_loss, val_acc = evaluate(val_ids)
        print(
            f"step={step:03d} train={loss.item():.3f} "
            f"val={val_loss:.3f} val_acc={val_acc:.3f}"
        )

# Save first-checkpoint metrics so later cell can compare improvement directly.
early_val_loss = val_loss
early_val_acc = val_acc

# Save enough state to keep sampling compatible with trained weights.
checkpoint = {
    "encoding_name": encoding_name,
    "active_token_ids": active_token_ids,
    "model_config": model_config,
    "block_size": block_size,
    "split_index": split_index,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "train_generator_state": train_generator.get_state(),
    "val_generator_state": val_generator.get_state(),
}
```
Warmup training log
```text
tokens=1255253 active_vocab=17485 train=1129727 val=125526
step=000 train=9.760 val=9.536 val_acc=0.102
step=040 train=6.444 val=6.439 val_acc=0.146
step=080 train=6.103 val=6.076 val_acc=0.169
```
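A useful reading aid for the first row of that log: a model that spreads probability evenly over the whole local vocabulary has a cross-entropy of ln(V) nats, which for 17,485 tokens is roughly 9.77. The sketch below only computes that baseline; the comparison with the logged step-0 training loss is an interpretation, not something the lab code checks.

```python
import math

active_vocab = 17_485      # local vocabulary size from the log above
logged_step0_train = 9.760  # copied from the warmup training log

# Uniform guessing assigns probability 1/V to every token, so the
# cross-entropy is -ln(1/V) = ln(V) regardless of the true label.
uniform_loss = math.log(active_vocab)
uniform_accuracy = 1 / active_vocab

print(f"uniform-guess loss: {uniform_loss:.3f} nats")
print(f"uniform-guess accuracy: {uniform_accuracy:.6f}")
print(f"gap to logged step-0 train loss: {abs(uniform_loss - logged_step0_train):.3f}")
```

A step-0 loss far above this baseline usually points at an initialization or labeling problem, while a step-0 loss far below it on untrained weights suggests the targets are leaking into the inputs.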

The second cell samples that early checkpoint. Because the prompt isn't present verbatim in the corpus, generation doesn't start from a memorized heading. The continuation can still contain familiar or copied spans, so this is a sanity check rather than a memorization audit.

build_gpt_from_scratch_warmup_sample.py
```python
# Prompt is intentionally not copied from training corpus verbatim.
prompt = "Good sir,\nSpeak plain.\n"

# This first sample should still look rough and undertrained.
sample = sample_completion(prompt)
print(f"prompt_seen_verbatim={prompt in corpus}")
print("sample:")
print("\n".join(line.rstrip() for line in sample.splitlines()))
```
Warmup sample
```text
prompt_seen_verbatim=False
sample:
Good sir,
Speak plain.

I and not

 and ,

To

 and a I

To ,

And

I the ,
 the
 , . the . , I , the , I . and ,
 . . ,

 ,
 .
 ,
 a ; , ,
```

That checkpoint is still half-baked. It knows Shakespeare-ish punctuation and function words, but it doesn't yet hold a stable thought.

The third cell keeps the exact same model and optimizer state, trains longer, and prints whether held-out metrics improved.

build_gpt_from_scratch_continue_training.py
```python
# Continue from exact same checkpoint instead of restarting from scratch.
for step in range(81, 321):
    x, y = sample_batch(train_ids, generator=train_generator)
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 80 == 0:
        late_val_loss, late_val_acc = evaluate(val_ids)
        print(
            f"step={step:03d} train={loss.item():.3f} "
            f"val={late_val_loss:.3f} val_acc={late_val_acc:.3f}"
        )

print(f"val_loss_improved_by={early_val_loss - late_val_loss:.3f}")
print(f"val_acc_improved_by={late_val_acc - early_val_acc:.3f}")

checkpoint = {
    "encoding_name": encoding_name,
    "active_token_ids": active_token_ids,
    "model_config": model_config,
    "block_size": block_size,
    "split_index": split_index,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "train_generator_state": train_generator.get_state(),
    "val_generator_state": val_generator.get_state(),
}
```
Longer training log
```text
step=160 train=5.785 val=5.788 val_acc=0.174
step=240 train=5.670 val=5.730 val_acc=0.163
step=320 train=5.475 val=5.612 val_acc=0.176
val_loss_improved_by=0.464
val_acc_improved_by=0.007
```

The fourth cell samples again from the same prompt so you can compare the rough checkpoint against the longer-trained checkpoint.

build_gpt_from_scratch_late_sample.py
```python
# Same prompt, same sampling settings. Only model weights changed.
sample = sample_completion(prompt)
print(f"prompt_seen_verbatim={prompt in corpus}")
print("sample:")
print("\n".join(line.rstrip() for line in sample.splitlines()))
```
Longer-training sample
```text
prompt_seen_verbatim=False
sample:
Good sir,
Speak plain.
To the very king

And you you will not you not a lord .

And the king .

And , my lord of a good hand , you , to a lord : I have you .
And have not the king ; and be not at , my good good lord .
I pray .
And have be be a lord . But have the lord
```

The state dictionary isn't enough for an exact training resume. A checkpoint also needs the vocabulary mapping, corpus split, model configuration, optimizer state, and random-generator states that give its tensors meaning and determine the next update. This cell writes a real checkpoint file, reloads it, and proves that the restored model produces identical logits and the same next training batch.

build_gpt_checkpoint_round_trip.py
```python
checkpoint_path = Path("tiny_gpt_checkpoint.pt")
torch.save(checkpoint, checkpoint_path)
restored = torch.load(checkpoint_path, weights_only=True)

assert restored["encoding_name"] == encoding_name
assert restored["block_size"] == block_size
assert restored["split_index"] == split_index
resumed_model = TinyGPT(vocab_size=len(restored["active_token_ids"]), **restored["model_config"])
resumed_model.load_state_dict(restored["model_state_dict"])
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=2e-3)
resumed_optimizer.load_state_dict(restored["optimizer_state_dict"])

resumed_train_generator = torch.Generator()
resumed_train_generator.set_state(restored["train_generator_state"])
resumed_val_generator = torch.Generator()
resumed_val_generator.set_state(restored["val_generator_state"])

resumed_token_to_local = {
    token_id: idx for idx, token_id in enumerate(restored["active_token_ids"])
}
assert resumed_token_to_local == token_to_local
assert torch.equal(resumed_val_generator.get_state(), restored["val_generator_state"])

probe_generator = torch.Generator().manual_seed(303)
probe_x, _ = sample_batch(val_ids, generator=probe_generator)
model.eval()
resumed_model.eval()
with torch.no_grad():
    original_logits = model(probe_x)
    resumed_logits = resumed_model(probe_x)

expected_train_generator = torch.Generator()
expected_train_generator.set_state(checkpoint["train_generator_state"])
expected_x, expected_y = sample_batch(train_ids, generator=expected_train_generator)
resumed_x, resumed_y = sample_batch(train_ids, generator=resumed_train_generator)

print(f"saved checkpoint={checkpoint_path}")
print(f"restored encoding={restored['encoding_name']} active_vocab={len(restored['active_token_ids'])}")
print("same logits after round trip:", torch.equal(original_logits, resumed_logits))
print(
    "same next training batch after round trip:",
    torch.equal(expected_x, resumed_x) and torch.equal(expected_y, resumed_y),
)
```
Checkpoint round trip
```text
saved checkpoint=tiny_gpt_checkpoint.pt
restored encoding=gpt2 active_vocab=17485
same logits after round trip: True
same next training batch after round trip: True
```

For a GPU or distributed run, checkpoint the device RNG, scheduler, and sampler state too. The exact state list grows with the training system, but the rule stays simple: a resume is exact only if the next update sees the same model, optimizer, corpus split, data window, and randomness.
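The "same randomness" part of that rule is easy to get wrong by re-seeding instead of restoring state. The sketch below illustrates the difference with Python's standard `random.Random` as a stand-in for whatever generators a real training system uses; it is an analogy under that assumption, not a recipe for device RNG or sampler state.

```python
import random

rng = random.Random(101)

# Pretend five batches were already drawn before the checkpoint.
already_drawn = [rng.randrange(1000) for _ in range(5)]

saved_state = rng.getstate()  # this is what belongs in the checkpoint
expected_next = [rng.randrange(1000) for _ in range(3)]

# Correct resume: a fresh generator restored from the saved state.
resumed = random.Random()
resumed.setstate(saved_state)
resumed_next = [resumed.randrange(1000) for _ in range(3)]

# Incorrect resume: re-seeding replays the stream from its beginning,
# so the run would revisit windows it has already trained on.
reseeded = random.Random(101)
reseeded_next = [reseeded.randrange(1000) for _ in range(3)]

print("state restore continues stream:", resumed_next == expected_next)
print("re-seed replays old draws:", reseeded_next == already_drawn[:3])
```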

That later checkpoint is still not good writing, but it produces better held-out metrics than the warmup checkpoint:

  • validation loss drops from 6.076 to 5.612
  • validation next-token accuracy rises from 0.169 to 0.176
  • sample shifts from punctuation soup toward dialogue-like clauses and role words
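Those loss values can feel abstract, so it sometimes helps to convert them to perplexity, the exponential of mean cross-entropy in nats. The short sketch below uses the two validation losses reported above; reading perplexity as "effective number of equally likely next tokens" is a common interpretation, and the variable names are illustrative.

```python
import math

# Validation losses copied from the two training logs.
early_val_loss = 6.076
late_val_loss = 5.612

early_perplexity = math.exp(early_val_loss)
late_perplexity = math.exp(late_val_loss)

print(f"warmup perplexity: {early_perplexity:.1f}")
print(f"later perplexity:  {late_perplexity:.1f}")
print(f"ratio: {late_perplexity / early_perplexity:.2f}")
```

A drop of 0.464 nats corresponds to the later checkpoint being uncertain over roughly 63% as many effective choices per token, which is a real improvement even though the sample is still far from fluent.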

This is also where many readers misread toy-model output. You aren't asking, "is this fluent enough to publish?" You're checking whether held-out metrics improve and whether fixed-prompt output begins to show local corpus structure without claiming fluent generation.

Figure: Warmup versus later checkpoint comparison for the same novel prompt: step 80 shows val_loss 6.076 and punctuation soup, step 320 shows val_loss 5.612 and dialogue-like role words with broken syntax.
Same novel prompt, two checkpoints. Warmup sample is mostly punctuation soup at val_loss 6.076. After 240 more updates, val_loss falls to 5.612 and the continuation picks up king, lord, and clause-shaped phrasing. Broken syntax is still normal on a tiny model.

Why this lab uses GPT-2 subword IDs

Character tokens are useful for first contact because vocabulary is tiny and code is easy. GPT-2 instead uses a byte-level BPE input representation, so this lab adopts that specific tokenizer design.[3] (Language Models are Unsupervised Multitask Learners, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

That changes three practical things:

  1. one token can span part of a word, a whole word, punctuation, or a whitespace prefix
  2. the active vocabulary is much bigger than a character alphabet
  3. prompts and sampled continuations are processed in reusable subword pieces rather than one character at a time

The lab uses GPT-2 BPE plus a local remap:

  • GPT-2 BPE matches the published GPT-2 input representation
  • the local remap keeps the tiny model small enough for a laptop CPU

If you skip the remap and keep the full 50k-way tied vocabulary matrix, the toy model becomes needlessly heavy without teaching anything extra.

Why one-token shift still matters

Nothing about subwords changes the core causal objective. The loss still compares logits at position t against the ground-truth token at t + 1.

If you forget the shift, the model stops learning continuation and starts learning reconstruction. The loss may still fall, but the checkpoint is optimizing the wrong problem.

For a packed token block:

```text
x: [15496, 995, 11, 262]
y: [  995,  11, 262, 995]
```

Those are still "predict next token" pairs, even though each integer now stands for a BPE token instead of a character.
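A cheap way to catch a missing shift is to audit one hand-checked block before training. The sketch below assumes plain Python lists, and `is_shifted` is a hypothetical helper, not part of the lab.

```python
# block_size + 1 tokens taken from the stream (same example IDs as above).
block = [15496, 995, 11, 262, 995]

x = block[:-1]  # inputs at positions 0..3
y = block[1:]   # targets: the token that follows each input


def is_shifted(x, y):
    """True when every target equals the next input: y[t] == x[t + 1]."""
    return len(x) == len(y) and all(y[t] == x[t + 1] for t in range(len(x) - 1))


print("correct pair passes:", is_shifted(x, y))

# The silent bug: targets are a copy of the inputs (reconstruction, not continuation).
y_buggy = list(x)
print("unshifted pair passes:", is_shifted(x, y_buggy))
```

The check can be fooled by a block made of one repeated token, so run it on a block with varied tokens.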

Packing and splitting data

This toy corpus is represented as one token stream, and fixed windows are cut from that stream rather than sentence rows. Larger pre-training pipelines may pack multiple documents with separate boundary handling, as covered in the preceding chapter.

This lab keeps the same idea:

  • encode text once
  • flatten token stream
  • slice fixed block_size windows
  • shift targets by one token
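The four steps above can be sketched as a small window sampler. This is an assumption-laden illustration in plain Python: `get_batch`, the toy stream, and the seed are hypothetical, and the lab's own sampler may differ.

```python
import random


def get_batch(stream, block_size, batch_size, rng):
    """Sample fixed windows from one flat token stream and shift targets by one."""
    # The last valid start must leave room for block_size inputs plus one more target.
    max_start = len(stream) - block_size - 1
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    xs = [stream[s : s + block_size] for s in starts]
    ys = [stream[s + 1 : s + block_size + 1] for s in starts]
    return starts, xs, ys


stream = list(range(100, 120))  # stand-in for an already-encoded token stream
starts, xs, ys = get_batch(stream, block_size=4, batch_size=3, rng=random.Random(0))

for s, x, y in zip(starts, xs, ys):
    print(f"start={s} x={x} y={y}")
```

Each window needs `block_size + 1` source tokens, which is why the largest start is `len(stream) - block_size - 1` rather than `len(stream) - block_size`.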

This lab splits the original stream once: the first 90% is training data and the final 10% is validation data. Avoid building a validation stream by concatenating interleaved chunks: random windows could then train or evaluate on invented transitions between passages that were never adjacent.

avoid-fake-split-boundaries.py
```python
tokens = list(range(20))
chunks = [tokens[start:start + 4] for start in range(0, len(tokens), 4)]

# Concatenating non-adjacent chunks invents a transition at the seam.
interleaved_train = chunks[0] + chunks[2]
fake_transition = (interleaved_train[3], interleaved_train[4])

# Contiguous split (80/20 in this toy example; the lab itself uses 90/10).
split_at = int(0.8 * len(tokens))
contiguous_train = tokens[:split_at]
contiguous_validation = tokens[split_at:]

print("interleaved concatenation transition:", fake_transition)
print("was adjacent in source:", fake_transition[1] == fake_transition[0] + 1)
print("contiguous split sizes:", len(contiguous_train), len(contiguous_validation))
```
Output
```text
interleaved concatenation transition: (3, 8)
was adjacent in source: False
contiguous split sizes: 16 4
```

Validation loss versus sample quality

Notice the progression:

  • the short warmup checkpoint is mostly punctuation soup
  • the longer run lowers validation loss and raises validation accuracy
  • the later sample has more dialogue-like clauses and role words

That's the point of the staged notebook. You can see the model move from "barely shaped" to "somewhat useful" instead of pretending the first checkpoint was already good.

In bigger runs you still don't choose between metrics and generations. You need both.
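One way to make the two validation losses easier to compare is to convert them to perplexity, exp(loss). This is a small sketch that assumes val_loss is the mean next-token cross-entropy in nats (the default for PyTorch's cross-entropy). Perplexity isn't part of the lab's reported output, and the step labels come from the checkpoint comparison above.

```python
import math

# Validation losses from the two checkpoints compared in this lesson.
checkpoints = {"step 80": 6.076, "step 320": 5.612}

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
perplexities = {name: math.exp(loss) for name, loss in checkpoints.items()}

for name, loss in checkpoints.items():
    print(f"{name}: val_loss={loss:.3f} perplexity={perplexities[name]:.1f}")
```

The number is only a summary of the loss. It doesn't replace reading the fixed-prompt sample.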

How modern LLMs differ from this lab

This lab follows the relevant GPT-2 educational choices: learned absolute position embeddings, layer normalization at each sub-block input plus a final normalization, full multi-head attention, and a GELU MLP.[3] Later decoder-only families such as Llama 2 keep the same causal next-token contract while changing several components for their training and serving goals.[7]

| Lab component | Modern replacement | Why it changed |
| --- | --- | --- |
| Learned pos_emb table | Rotary position embeddings (RoPE) | rotates queries and keys by position instead of learning an absolute position table; Llama 2 uses RoPE [8][7] |
| nn.LayerNorm | RMSNorm | removes mean-centering from normalization; Llama 2 uses RMSNorm before transformer sublayers [9][7] |
| GELU MLP | SwiGLU (gated GLU variant) | uses a gated feed-forward activation selected in Llama 2's architecture [10][7] |
| Full multi-head attention | Grouped-query attention (GQA) | shares key/value heads within query-head groups to reduce KV-cache load; Llama 2 uses GQA for its 34B and 70B models [11][7] |

Llama 2 reports RoPE, RMSNorm, and SwiGLU throughout its family, while GQA is specific to its larger 34B and 70B configurations.[7]
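To make the nn.LayerNorm versus RMSNorm row concrete, here is a minimal pure-Python sketch of both normalizations on one vector. It leaves out the learned gain and bias, and the eps value and function names are illustrative assumptions, not the lab's code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Center by the mean, then scale by the standard deviation (no learned params)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Scale by the root mean square only; the mean is not subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)

print("layer_norm output mean:", round(sum(ln) / len(ln), 6))
print("rms_norm output mean:", round(sum(rn) / len(rn), 6))
print("rms_norm output rms:", round(math.sqrt(sum(v * v for v in rn) / len(rn)), 6))
```

LayerNorm's output is centered at zero, while RMSNorm keeps the input's offset and only fixes the scale.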

One more change is computational, not architectural. Our forward pass materializes the full [batch, heads, time, time] score matrix, then masks and softmaxes it. PyTorch provides torch.nn.functional.scaled_dot_product_attention, which can dispatch to optimized backends when conditions allow. You can drop it into this lab without changing the attention result, up to floating-point tolerance:

scaled-dot-product-attention-equivalence.py
```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(3)
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Manual path: scaled scores, causal mask, softmax, weighted sum of values.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
manual = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v

# Same computation through PyTorch's fused-attention API.
fused_api = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("output shape:", tuple(fused_api.shape))
print("numerically close:", torch.allclose(manual, fused_api, atol=1e-6))
```
Output
```text
output shape: (1, 2, 4, 8)
numerically close: True
```

The manual version stays in the lab because seeing the score matrix get masked is the point. PyTorch's API can select optimized kernels when supported, while FlashAttention gives the IO-aware algorithmic basis for avoiding full attention-matrix materialization.[12]

Mastery check

Evaluation rubric

  • Explains GPT pretraining as one contract: corpus -> tokenizer -> packed windows -> causal decoder -> shifted loss -> checkpoint -> sampling
  • Explains why GPT-2 BPE plus active-token remap matches GPT-2 tokenization without paying for a full 50k-row tied vocabulary matrix
  • Diagnoses the silent label-shift bug and explains why falling loss can still hide wrong training targets
  • Explains why a causal mask is still required even after moving from characters to subword tokens
  • Uses validation loss, validation accuracy, and sampled text together instead of trusting one signal alone
  • Reads later checkpoint text as evidence of local structure learning, not as proof of fluent writing
  • Contrasts the lab's manual attention path with fused production attention without confusing systems optimization for objective changes


Common pitfalls

  • Symptom: Loss falls, but samples look like reconstruction or trivial copying. Cause: Logits at position t were trained against token t instead of token t + 1. Fix: Audit the shifted-label path with one tiny hand-checked batch before training longer.
  • Symptom: A resumed run loads cleanly, then generations turn nonsensical. Cause: The tokenizer changed after the checkpoint was written. Fix: Save and reload tokenizer metadata with the model state, then verify one prompt encodes the same way before resuming.
  • Symptom: CPU lab is unexpectedly huge and slow. Cause: Sparse GPT-2 token IDs were used directly, so the toy model allocates a full 50k-row tied vocabulary matrix. Fix: Remap active corpus IDs into a compact local vocabulary.
  • Symptom: One metric looks better, but generations are repetitive or collapse. Cause: You trusted train loss or a single metric instead of checking held-out samples. Fix: Track validation loss, validation accuracy, and a fixed novel-prompt sample together.
  • Symptom: Outputs look locally plausible but lose longer dialogue structure. Cause: block_size was chosen only for speed, not for pattern length. Fix: Increase context window until the model can see enough preceding tokens to learn the structure you care about.
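For the tokenizer-drift pitfall above, a resume check can be as small as a fingerprint of the local vocabulary plus one probe prompt. The sketch below is a hedged illustration in plain Python: every function name and the metadata layout are hypothetical, and it assumes the compact remap is stored as a sorted list of active GPT-2 IDs.

```python
import hashlib
import json


def vocab_fingerprint(active_ids):
    """Stable hash of the active-ID list that defines the local vocabulary."""
    return hashlib.sha256(json.dumps(active_ids).encode("utf-8")).hexdigest()


def encode_local(gpt2_ids, active_ids):
    """Map GPT-2 IDs to local IDs using position in the active-ID list."""
    lookup = {gpt2_id: local_id for local_id, gpt2_id in enumerate(active_ids)}
    return [lookup[t] for t in gpt2_ids]


def make_tokenizer_meta(active_ids, probe_gpt2_ids):
    """Metadata to store next to the model state (hypothetical layout)."""
    return {
        "active_ids": active_ids,
        "fingerprint": vocab_fingerprint(active_ids),
        "probe_local_ids": encode_local(probe_gpt2_ids, active_ids),
    }


def safe_to_resume(meta, current_active_ids, probe_gpt2_ids):
    """Refuse to resume if the vocabulary or the probe encoding changed."""
    if vocab_fingerprint(current_active_ids) != meta["fingerprint"]:
        return False
    return encode_local(probe_gpt2_ids, current_active_ids) == meta["probe_local_ids"]


saved_active = [11, 262, 995, 15496]
probe = [15496, 995]

# Round-trip through JSON to mimic writing and reading checkpoint metadata.
meta = json.loads(json.dumps(make_tokenizer_meta(saved_active, probe)))

print("same tokenizer:", safe_to_resume(meta, saved_active, probe))
print("changed tokenizer:", safe_to_resume(meta, [11, 13, 262, 995, 15496], probe))
```

A changed vocabulary shifts local IDs without raising any error at load time, so the check has to be explicit.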

What to remember

  • Real GPTs are token-ID machines, not string machines.
  • This lab uses GPT-2 byte-level BPE rather than a character alphabet.
  • Tiny labs can still use real subword tokenization if they remap active tokens into compact local IDs.
  • The GPT training loop is still corpus -> token IDs -> packed windows -> masked self-attention -> shifted loss -> checkpoint -> generation.
  • Validation loss and sampled text should both move in the right direction before you trust a checkpoint.

Mastery Check


1. A tiny GPT uses GPT-2 BPE on a Shakespeare corpus. GPT-2 has 50,257 token IDs, the corpus activates 17,485 of them, d_model=96, and the token embedding is tied to the output head. What is the effect of remapping the active IDs to a contiguous local vocabulary?
2. A token stream is [0, 1, 2, ..., 19]. A learner splits it into chunks of 4, then builds a training stream as chunks[0] + chunks[2]. Which adjacent pair can now appear across the concatenation boundary even though it was never adjacent in the original stream?
3. With source token stream [10, 20, 30, 40, 50] and block_size=4, a batch sampler chooses start=0. Which x and y should be used for next-token cross-entropy?
4. After switching from character tokens to GPT-2 subword IDs, a length-4 decoder still uses a lower-triangular causal mask. For query position index 2, which statement is correct?
5. A manual causal-attention block forms scaled query-key scores, masks future positions, applies softmax, and multiplies by values. It is replaced with F.scaled_dot_product_attention(q, k, v, is_causal=True). What changes?
6. A warmup checkpoint has val_loss=6.076, val_acc=0.169, and a punctuation-heavy sample. After continued training, the same fixed prompt gives val_loss=5.612, val_acc=0.176, and a sample with role words and dialogue-like clauses but broken syntax. What conclusion is justified?
7. Assume the same corpus and code remain available. Beyond model_state_dict, which checkpoint payload supports an exact continuation with the same next training batch and later evaluation sequence?
8. During generation, a model with block_size=48 has a 70-token local-ID context. Which procedure correctly produces and decodes the next token?


Next Step
Continue to Continued Pretraining and Domain Adaptation

This lab built a GPT-style base-model training loop from raw text to checkpoint and sample. Next chapter asks what changes when a model already knows general language and you want to adapt it to narrower domain data instead of starting from zero.

References

[1] Stanford University. CS336: Language Modeling from Scratch. 2026.
[2] Karpathy, A. nanoGPT. 2025.
[3] Radford, A., et al. Language Models are Unsupervised Multitask Learners. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[4] Princeton University COS 302 / SML 305. shakespeare.txt. 2020.
[5] OpenAI. GPT-2 Source Implementation. 2019.
[6] Paszke, A., et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
[7] Touvron, H., et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint, 2023. https://arxiv.org/abs/2307.09288
[8] Su, J., et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. https://arxiv.org/abs/2104.09864
[9] Zhang, B. & Sennrich, R. Root Mean Square Layer Normalization. NeurIPS 2019. https://arxiv.org/abs/1910.07467
[10] Shazeer, N. GLU Variants Improve Transformer. 2020. https://arxiv.org/abs/2002.05202
[11] Ainslie, J., et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023. https://arxiv.org/abs/2305.13245
[12] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135