Build GPT from Scratch Lab

Build and train a tiny GPT end to end on Shakespeare: tokenize with GPT-style subwords, remap active token IDs, run causal self-attention, track validation loss, save a checkpoint, and sample text.

Token shards are only useful once a model can learn from them. This lab shows that next step: a tiny decoder-only GPT (Generative Pre-trained Transformer) consumes token blocks, predicts next tokens, writes a checkpoint, and produces sample text you can judge [1] (CS336: Language Modeling from Scratch, https://cs336.stanford.edu/).

This lab borrows from a reference implementation. Karpathy's nanoGPT provides a compact end-to-end Transformer training codebase, but its fastest Shakespeare on-ramp is character-level [2] (nanoGPT, https://github.com/karpathy/nanoGPT). This lab keeps the tiny scale while switching to GPT-2 byte-level byte pair encoding (BPE), matching GPT-2's published input representation [3] (Language Models are Unsupervised Multitask Learners, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

The corpus is bundled here as shakespeare.txt so readers can download the same raw text and rerun the experiment locally. That file is Princeton's course mirror of Shakespeare text [4] (shakespeare.txt, https://www.cs.princeton.edu/courses/archive/spring20/cos302/files/shakespeare.txt). We train on the full bundled corpus, not a single-play excerpt. The run stays CPU-friendly because the model and update budget are tiny.

Why use Shakespeare instead of a technical-log corpus? Because here the learning target is pure next-token mechanics: tokenization, causal masking, loss, checkpointing, and sampling. Once you understand the tiny GPT loop on a public corpus, the same pipeline transfers to code, runbooks, chat transcripts, and other domain text where you would usually fine-tune or continue pretraining.

Figure: exact tensor-shape trace through the tiny GPT lab. One batch moves from `[12, 48]` token IDs to `[12, 48, 96]` embeddings, four 24-feature attention heads, a causal lower-triangular `[48, 48]` matrix per head, and `[12, 48, 17,485]` logits trained against one-token-shifted labels. A parameter bar compares 4.82 million full-vocabulary tied weights with 1.68 million active-vocabulary weights: remapping cuts the tied vocabulary matrix by 65.2%.
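The shape trace above can be checked with throwaway tensors. This is a minimal sketch that only verifies shapes: the layers are untrained stand-ins, the query/key projections are skipped, and the variable names are illustrative rather than the lab's.

```python
import torch
import torch.nn as nn

# Sizes taken from the lab; everything below is an untrained stand-in.
batch_size, block_size, d_model, n_heads, vocab = 12, 48, 96, 4, 17_485
head_dim = d_model // n_heads  # 24 features per head

ids = torch.randint(0, vocab, (batch_size, block_size))        # [12, 48]
emb = nn.Embedding(vocab, d_model)(ids)                        # [12, 48, 96]

# Split the width into heads (projection layers omitted in this sketch).
heads = emb.view(batch_size, block_size, n_heads, head_dim).transpose(1, 2)
scores = heads @ heads.transpose(-2, -1)                       # [12, 4, 48, 48]
logits = nn.Linear(d_model, vocab, bias=False)(emb)            # [12, 48, 17485]

for name, tensor in [("ids", ids), ("emb", emb), ("heads", heads),
                     ("scores", scores), ("logits", logits)]:
    print(f"{name:>6}: {tuple(tensor.shape)}")
```

If any printed shape disagrees with the trace, a `view` or `transpose` is the usual place to look first.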

Why this lab feels different

Small GPT runs still have one non-negotiable contract:

tokenize raw text into IDs
let position t attend only to positions <= t
predict token t + 1
repeat across many packed windows
judge checkpoint with both validation loss and sampled text

If any piece is wrong, the loop still runs and still prints numbers. Tokenizer, batching, model, and generation need to agree with each other instead of being studied as isolated concepts.

Five moving parts

Piece	What it does	Failure mode
Corpus	supplies raw language distribution	using too much text for CPU lab or too little text for recognizable output
Tokenizer	turns text into GPT-style subword IDs	forgetting that token IDs depend on exact tokenizer
Active-vocab remap	shrinks sparse GPT-2 token IDs into small local range	keeping a 50k-row tied vocabulary matrix in toy CPU lab
Block packer	slices token stream into fixed contexts and shifted labels	off-by-one windows or broken train/val split
Decoder loop	runs masked self-attention, loss, checkpoint, and generation	missing causal mask or trusting loss without sampling

That middle row matters. GPT-2 BPE can emit IDs anywhere in a 50k vocabulary. Our tiny lab still only activates a subset of those IDs, so we remap active token IDs into contiguous local IDs like 0..17484. Same subword tokenization. Smaller tied embedding/output matrix.

active-vocabulary-parameter-cost.py

gpt2_vocab = 50_257
active_vocab = 17_485
d_model = 96

# GPT-2 uses token-embedding weights again for output logits, so one
# vocabulary-sized matrix determines this part of parameter cost.
full_weights = gpt2_vocab * d_model
compact_weights = active_vocab * d_model
reduction = 1 - compact_weights / full_weights

print(f"full vocab tied weights: {full_weights:,}")
print(f"compact tied weights:    {compact_weights:,}")
print(f"toy-lab reduction: {reduction:.1%}")

Output

full vocab tied weights: 4,824,672
compact tied weights:    1,678,560
toy-lab reduction: 65.2%

The remap is an efficiency device for this fixed corpus, not a replacement tokenizer. The model below also follows GPT-2's tied token-embedding/output-logit matrix [5] (GPT-2 Source Implementation, https://github.com/openai/gpt-2/blob/master/src/model.py). This toy mapping is built from the full corpus before the split so validation IDs remain representable; a full GPT-2-vocabulary run wouldn't need that compromise. Compact remapping also means generation prompts must be expressible using token IDs present in the corpus.
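The prompt restriction is easiest to see on a toy remap. The sketch below uses made-up token IDs and a hypothetical `to_local` helper. It mirrors the idea of the lab's remap but is not the lab's code.

```python
# Toy stand-in for the lab's remap; the token ids below are made up.
corpus_token_ids = [262, 995, 11, 262, 15496, 11]

active_token_ids = sorted(set(corpus_token_ids))      # [11, 262, 995, 15496]
token_to_local = {tok: i for i, tok in enumerate(active_token_ids)}
local_to_token = {i: tok for tok, i in token_to_local.items()}

def to_local(token_ids):
    """Map tokenizer ids to compact ids, refusing ids the corpus never used."""
    missing = sorted(set(token_ids) - set(token_to_local))
    if missing:
        raise ValueError(f"ids outside compact vocabulary: {missing}")
    return [token_to_local[tok] for tok in token_ids]

local_ids = to_local([15496, 995, 11])
round_trip = [local_to_token[i] for i in local_ids]
print(local_ids)     # [3, 2, 0]
print(round_trip)    # [15496, 995, 11]

try:
    to_local([15496, 50256])  # 50256 never appeared in the toy corpus
except ValueError as error:
    print(error)
```

Failing loudly here is a design choice: silently dropping or substituting an unknown ID would change the prompt the model actually sees.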

Audit the causal target before training

A costly silent error in a language-model lab is training each token to reproduce itself instead of predicting its successor. Check the shift on a tiny block before looking at a full training loop.

shifted-next-token-labels.py

block = ["Good", " sir", ",", "\n", "Speak", " plain", "."]
x = block[:-1]
y = block[1:]

for current, target in zip(x, y):
    print(f"{current!r:>8} -> {target!r}")

assert y[0] == " sir" and y[-1] == "."

Output

  'Good' -> ' sir'
  ' sir' -> ','
     ',' -> '\n'
    '\n' -> 'Speak'
 'Speak' -> ' plain'
' plain' -> '.'

A shifted target isn't enough: attention at each position must also be unable to read future inputs. During each forward pass, the model emits logits: one unnormalized score for every local vocabulary ID at every position. Cross-entropy compares those scores with the shifted targets. Inside attention, the causal mask is applied before softmax turns allowed scores into weights, so future positions receive no probability mass.

causal-mask-visibility.py

import torch

mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
for row in mask.int().tolist():
    print(" ".join(map(str, row)))

assert mask[2].tolist() == [True, True, True, False]
print("position 2 can read positions:", [index for index, visible in enumerate(mask[2]) if visible])

Output

1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
position 2 can read positions: [0, 1, 2]
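Masking before softmax is what makes the future weight exactly zero rather than merely small. The pure-Python sketch below shows this for one attention row. The scores are made up, and `masked_softmax` is an illustrative helper, not part of the lab.

```python
import math

def masked_softmax(scores, visible):
    """Softmax over one attention row after setting hidden scores to -inf."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, visible)]
    peak = max(masked)                            # subtract the max for stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for position 2 of a 4-token window; position 3 is the future.
row_scores = [0.5, 1.5, 1.0, 9.0]
weights = masked_softmax(row_scores, visible=[True, True, True, False])

print([round(w, 3) for w in weights])
print("future weight:", weights[3], "row sum:", round(sum(weights), 6))
```

The future score of 9.0 is the largest in the row, yet it contributes nothing once it is masked, and the remaining weights still sum to one.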

Runnable lab

This lab is more useful if the first result is still half-baked, so the notebook flow is staged:

train short warmup run
sample rough checkpoint
keep training same model longer
sample again and compare

The first cell builds the whole pipeline in plain PyTorch, then trains only through step 80 (81 updates, counting step 0). That's enough to learn something, but not enough to look good [6] (PyTorch: An Imperative Style, High-Performance Deep Learning Library, https://arxiv.org/abs/1912.01703).

build_gpt_from_scratch.py

from pathlib import Path
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken

# Fix random seed so article output is reproducible.
torch.manual_seed(7)

# Load raw text exactly as learner will download it.
all_text = Path("assets/shakespeare.txt").read_text(encoding="utf-8")
corpus = all_text

# Use GPT-2 BPE so tokenization matches GPT-2's published input representation.
encoding_name = "gpt2"
encoder = tiktoken.get_encoding(encoding_name)
corpus_token_ids = encoder.encode(corpus)

# GPT-2 token ids are sparse across a 50k vocabulary. This run only needs
# tokens that actually appear inside bundled Shakespeare corpus, so remap them to 0..N-1.
# That keeps tied embedding/output weights much smaller without changing
# which subword pieces the tokenizer produced.
active_token_ids = sorted(set(corpus_token_ids))
token_to_local = {token_id: idx for idx, token_id in enumerate(active_token_ids)}
local_to_token = {idx: token_id for token_id, idx in token_to_local.items()}
ids = [token_to_local[token_id] for token_id in corpus_token_ids]

# Each training example needs block_size input tokens plus 1 next-token label.
block_size = 48

# Split once along original stream. Concatenating interleaved held-out chunks
# would create fake transitions where non-adjacent Shakespeare passages meet.
split_index = int(0.9 * len(ids))
train_ids = torch.tensor(ids[:split_index], dtype=torch.long)
val_ids = torch.tensor(ids[split_index:], dtype=torch.long)
batch_size = 12

# Keep training, validation, and text-sampling randomness independent. Logging
# one sample should never change which training windows the model sees next.
train_generator = torch.Generator().manual_seed(101)
val_generator = torch.Generator().manual_seed(202)

print(
    f"tokens={len(ids)} active_vocab={len(active_token_ids)} "
    f"train={len(train_ids)} val={len(val_ids)}"
)

def sample_batch(
    source: torch.Tensor,
    *,
    generator: torch.Generator,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Pick random starting positions from long token stream.
    starts = torch.randint(
        0,
        len(source) - block_size,
        (batch_size,),
        generator=generator,
    )

    # x is current context window.
    x = torch.stack([source[s:s + block_size] for s in starts])

    # y is the same window advanced by one token, so y[i] is the token after x[i].
    # This is the whole learning target.
    y = torch.stack([source[s + 1:s + block_size + 1] for s in starts])
    return x, y

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 96, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # One linear layer projects each position into query, key, and value vectors.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

        # Lower-triangular mask blocks attention to future positions.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size, dtype=torch.bool)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seqlen, width = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(tensor: torch.Tensor) -> torch.Tensor:
            # Turn [batch, time, width] into [batch, heads, time, head_dim].
            return tensor.view(batch_size, seqlen, self.n_heads, self.head_dim).transpose(1, 2)

        q = split_heads(q)
        k = split_heads(k)
        v = split_heads(v)

        # Attention score = query-key similarity, scaled to keep softmax stable.
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Any position above diagonal is future token, so hide it from model.
        attn = attn.masked_fill(~self.mask[:seqlen, :seqlen], float("-inf"))
        attn = attn.softmax(dim=-1)

        # Weighted sum of value vectors produces contextualized representation.
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(batch_size, seqlen, width)
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, d_model: int = 96, n_heads: int = 4):
        super().__init__()

        # Pre-LN transformer block: normalize, attend, add residual, then MLP.
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 96, n_heads: int = 4, n_layers: int = 2):
        super().__init__()

        # Token embeddings say "which subword is this?".
        self.token_emb = nn.Embedding(vocab_size, d_model)

        # Position embeddings say "where is this token inside current window?".
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)

        # GPT-2 reuses token embedding weights for its output logits.
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.apply(self._init_weights)
        self.head.weight = self.token_emb.weight

    @staticmethod
    def _init_weights(module: nn.Module) -> None:
        # GPT-style small initialization is important once output weights are tied.
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, seqlen = x.shape
        positions = torch.arange(seqlen, device=x.device)

        # GPT adds token meaning and position meaning before any attention happens.
        h = self.token_emb(x) + self.pos_emb(positions)[None, :, :]
        for block in self.blocks:
            h = block(h)
        h = self.ln_f(h)
        return self.head(h)

# Build model and optimizer.
model_config = {"d_model": 96, "n_heads": 4, "n_layers": 2}
model = TinyGPT(vocab_size=len(active_token_ids), **model_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)

def evaluate(source: torch.Tensor) -> tuple[float, float]:
    # Average across a few validation batches so loss and accuracy are less noisy.
    was_training = model.training
    model.eval()
    losses = []
    accuracies = []
    with torch.no_grad():
        for _ in range(8):
            x, y = sample_batch(source, generator=val_generator)
            logits = model(x)
            losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).item())
            accuracies.append((logits.argmax(dim=-1) == y).float().mean().item())
    model.train(was_training)
    return sum(losses) / len(losses), sum(accuracies) / len(accuracies)

def sample_completion(prompt: str, steps: int = 80) -> str:
    # Use a local generator so fixed-prompt monitoring cannot perturb training.
    sampling_generator = torch.Generator().manual_seed(17)

    prompt_token_ids = encoder.encode(prompt)
    missing_ids = sorted(set(prompt_token_ids) - set(active_token_ids))
    if missing_ids:
        raise ValueError(f"Prompt uses token ids outside compact corpus vocabulary: {missing_ids}")
    prompt_local_ids = [token_to_local[token_id] for token_id in prompt_token_ids]
    context = torch.tensor([prompt_local_ids], dtype=torch.long)

    was_training = model.training
    model.eval()
    with torch.no_grad():
        for _ in range(steps):
            # If sample gets longer than block size, GPT only sees most recent window.
            x = context[:, -block_size:]
            logits = model(x)

            # Only final position matters for next-token sampling.
            next_logits = logits[:, -1, :]

            # Restrict to top candidates so toy model doesn't wander too wildly.
            top_values, top_indices = torch.topk(next_logits, k=8, dim=-1)
            probs = torch.softmax(top_values / 0.9, dim=-1)
            sampled_index = torch.multinomial(
                probs,
                num_samples=1,
                generator=sampling_generator,
            )
            next_local_id = top_indices.gather(-1, sampled_index)

            # Append sampled token and continue autoregressive loop.
            context = torch.cat([context, next_local_id], dim=1)

    # Convert local ids back to original GPT-2 token ids, then decode to text.
    sample = encoder.decode([local_to_token[int(idx)] for idx in context[0]])
    model.train(was_training)
    return sample

for step in range(81):
    # 1. Draw random training batch.
    x, y = sample_batch(train_ids, generator=train_generator)

    # 2. Predict next-token logits for every position.
    logits = model(x)

    # 3. Compare logits against shifted targets.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

    # 4. Backpropagate and update weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print occasional train/val snapshots so learner can watch first useful learning happen.
    if step % 40 == 0:
        val_loss, val_acc = evaluate(val_ids)
        print(
            f"step={step:03d} train={loss.item():.3f} "
            f"val={val_loss:.3f} val_acc={val_acc:.3f}"
        )

# Save first-checkpoint metrics so later cell can compare improvement directly.
early_val_loss = val_loss
early_val_acc = val_acc

# Save enough state to keep sampling compatible with trained weights.
checkpoint = {
    "encoding_name": encoding_name,
    "active_token_ids": active_token_ids,
    "model_config": model_config,
    "block_size": block_size,
    "split_index": split_index,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "train_generator_state": train_generator.get_state(),
    "val_generator_state": val_generator.get_state(),
}

Warmup training log

tokens=1255253 active_vocab=17485 train=1129727 val=125526
step=000 train=9.760 val=9.536 val_acc=0.102
step=040 train=6.444 val=6.439 val_acc=0.146
step=080 train=6.103 val=6.076 val_acc=0.169
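A quick sanity check on the first logged row: the step-0 training loss is computed before any weight update, so it should sit near the loss of a uniform guess over the local vocabulary, which is ln(V). The sketch below does that arithmetic using only the numbers logged above.

```python
import math

active_vocab = 17_485

# Cross-entropy of a uniform guess over V classes is ln(V).
uniform_loss = math.log(active_vocab)
print(f"uniform-guess loss: {uniform_loss:.3f}")   # about 9.769

# Step-0 training loss copied from the warmup log above.
logged_step0 = 9.760
print(f"gap to logged step-0 loss: {uniform_loss - logged_step0:.3f}")
```

A step-0 loss far above ln(V) usually points at an initialization or label problem rather than at the data.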

The second cell samples that early checkpoint. Because the prompt isn't present verbatim in the corpus, generation doesn't start from a memorized heading. The continuation can still contain familiar or copied spans, so this is a sanity check rather than a memorization audit.

build_gpt_from_scratch_warmup_sample.py

# Prompt is intentionally not copied from training corpus verbatim.
prompt = "Good sir,\nSpeak plain.\n"

# This first sample should still look rough and undertrained.
sample = sample_completion(prompt)
print(f"prompt_seen_verbatim={prompt in corpus}")
print("sample:")
print("\n".join(line.rstrip() for line in sample.splitlines()))

Warmup sample

prompt_seen_verbatim=False
sample:
Good sir,
Speak plain.

I and not

 and ,

To

 and a I

To ,

And

I the ,
 the
 , . the . , I , the , I . and ,
 . . ,

 ,
 .
 ,
 a ; , ,

That checkpoint is still half-baked. It knows Shakespeare-ish punctuation and function words, but it doesn't yet hold a stable thought.

The third cell keeps the exact same model and optimizer state, trains longer, and prints whether held-out metrics improved.

build_gpt_from_scratch_continue_training.py

# Continue from exact same checkpoint instead of restarting from scratch.
for step in range(81, 321):
    x, y = sample_batch(train_ids, generator=train_generator)
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 80 == 0:
        late_val_loss, late_val_acc = evaluate(val_ids)
        print(
            f"step={step:03d} train={loss.item():.3f} "
            f"val={late_val_loss:.3f} val_acc={late_val_acc:.3f}"
        )

print(f"val_loss_improved_by={early_val_loss - late_val_loss:.3f}")
print(f"val_acc_improved_by={late_val_acc - early_val_acc:.3f}")

checkpoint = {
    "encoding_name": encoding_name,
    "active_token_ids": active_token_ids,
    "model_config": model_config,
    "block_size": block_size,
    "split_index": split_index,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "train_generator_state": train_generator.get_state(),
    "val_generator_state": val_generator.get_state(),
}

Longer training log

step=160 train=5.785 val=5.788 val_acc=0.174
step=240 train=5.670 val=5.730 val_acc=0.163
step=320 train=5.475 val=5.612 val_acc=0.176
val_loss_improved_by=0.464
val_acc_improved_by=0.007

The fourth cell samples again from the same prompt so you can compare the rough checkpoint against the longer-trained checkpoint.

build_gpt_from_scratch_late_sample.py

# Same prompt, same sampling settings. Only model weights changed.
sample = sample_completion(prompt)
print(f"prompt_seen_verbatim={prompt in corpus}")
print("sample:")
print("\n".join(line.rstrip() for line in sample.splitlines()))

Longer-training sample

prompt_seen_verbatim=False
sample:
Good sir,
Speak plain.
To the very king

And you you will not you not a lord .

And the king .

And , my lord of a good hand , you , to a lord : I have you .
And have not the king ; and be not at , my good good lord .
I pray .
And have be be a lord . But have the lord

The state dictionary isn't enough for an exact training resume. A checkpoint also needs the vocabulary mapping, corpus split, model configuration, optimizer state, and random-generator states that give its tensors meaning and determine the next update. This cell writes a real checkpoint file, reloads it, and proves that the restored model produces identical logits and the same next training batch.

build_gpt_checkpoint_round_trip.py

checkpoint_path = Path("tiny_gpt_checkpoint.pt")
torch.save(checkpoint, checkpoint_path)
restored = torch.load(checkpoint_path, weights_only=True)

assert restored["encoding_name"] == encoding_name
assert restored["block_size"] == block_size
assert restored["split_index"] == split_index
resumed_model = TinyGPT(vocab_size=len(restored["active_token_ids"]), **restored["model_config"])
resumed_model.load_state_dict(restored["model_state_dict"])
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=2e-3)
resumed_optimizer.load_state_dict(restored["optimizer_state_dict"])

resumed_train_generator = torch.Generator()
resumed_train_generator.set_state(restored["train_generator_state"])
resumed_val_generator = torch.Generator()
resumed_val_generator.set_state(restored["val_generator_state"])

resumed_token_to_local = {
    token_id: idx for idx, token_id in enumerate(restored["active_token_ids"])
}
assert resumed_token_to_local == token_to_local
assert torch.equal(resumed_val_generator.get_state(), restored["val_generator_state"])

probe_generator = torch.Generator().manual_seed(303)
probe_x, _ = sample_batch(val_ids, generator=probe_generator)
model.eval()
resumed_model.eval()
with torch.no_grad():
    original_logits = model(probe_x)
    resumed_logits = resumed_model(probe_x)

expected_train_generator = torch.Generator()
expected_train_generator.set_state(checkpoint["train_generator_state"])
expected_x, expected_y = sample_batch(train_ids, generator=expected_train_generator)
resumed_x, resumed_y = sample_batch(train_ids, generator=resumed_train_generator)

print(f"saved checkpoint={checkpoint_path}")
print(f"restored encoding={restored['encoding_name']} active_vocab={len(restored['active_token_ids'])}")
print("same logits after round trip:", torch.equal(original_logits, resumed_logits))
print(
    "same next training batch after round trip:",
    torch.equal(expected_x, resumed_x) and torch.equal(expected_y, resumed_y),
)

Checkpoint round trip

saved checkpoint=tiny_gpt_checkpoint.pt
restored encoding=gpt2 active_vocab=17485
same logits after round trip: True
same next training batch after round trip: True

For a GPU or distributed run, checkpoint the device RNG, scheduler, and sampler state too. The exact state list grows with the training system, but the rule stays simple: a resume is exact only if the next update sees the same model, optimizer, corpus split, data window, and randomness.
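A minimal sketch of gathering that extra state, assuming a single-process run. `collect_extra_state` is a hypothetical helper, not part of the lab. It relies on `torch.get_rng_state`, `torch.cuda.get_rng_state_all`, and a scheduler's `state_dict`. Distributed sampler state depends on the data-loading setup and is left out.

```python
import torch

def collect_extra_state(scheduler=None) -> dict:
    """Gather resume state beyond model and optimizer (hypothetical helper)."""
    extra = {"torch_cpu_rng": torch.get_rng_state()}
    if torch.cuda.is_available():
        # One RNG state tensor per visible CUDA device.
        extra["torch_cuda_rng"] = torch.cuda.get_rng_state_all()
    if scheduler is not None:
        extra["scheduler_state_dict"] = scheduler.state_dict()
    return extra

# Usage with a throwaway optimizer and scheduler, separate from the lab's objects.
demo_params = [torch.nn.Parameter(torch.zeros(1))]
demo_optimizer = torch.optim.AdamW(demo_params, lr=2e-3)
demo_scheduler = torch.optim.lr_scheduler.StepLR(demo_optimizer, step_size=10)

extra_state = collect_extra_state(demo_scheduler)
print(sorted(extra_state))
```

These entries would be merged into the checkpoint dictionary before saving and restored with the matching `set_rng_state` and `load_state_dict` calls on resume.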

That later checkpoint is still not good writing, but it produces better held-out metrics than the warmup checkpoint:

validation loss drops from 6.076 to 5.612
validation next-token accuracy rises from 0.169 to 0.176
sample shifts from punctuation soup toward dialogue-like clauses and role words

This is also where many readers misread toy-model output. You aren't asking, "is this fluent enough to publish?" You're checking whether held-out metrics improve and whether fixed-prompt output begins to show local corpus structure without claiming fluent generation.

Figure: warmup versus later checkpoint for the same novel prompt. The warmup sample at step 80 is mostly punctuation soup at val_loss 6.076. After 240 more updates, at step 320, val_loss falls to 5.612 and the continuation picks up king, lord, and clause-shaped phrasing. Broken syntax is still normal on a tiny model.

Why this lab uses GPT-2 subword IDs

Character tokens are useful for first contact because the vocabulary is tiny and the code is easy. GPT-2 instead uses a byte-level BPE input representation, so this lab adopts that specific tokenizer design [3].

That changes three practical things:

one token can span part of word, whole word, punctuation, or whitespace prefix
active vocabulary is much bigger than a character alphabet
prompts and sampled continuations are processed in reusable subword pieces rather than one character at a time

The lab uses GPT-2 BPE plus a local remap:

GPT-2 BPE matches the published GPT-2 input representation
local remap keeps tiny model small enough for laptop CPU

If you skip remap and keep the full 50k-way tied vocabulary matrix, the toy model becomes needlessly heavy without teaching anything extra.
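To put a number on "needlessly heavy", the sketch below counts trainable weights for both vocabulary sizes. It assumes the layer layout of the lab's `TinyGPT` (two blocks, width 96, tied output weights, biases on every linear layer except the head). `count_params` is an illustrative helper.

```python
def count_params(vocab_size: int, d_model: int = 96, n_layers: int = 2,
                 block_size: int = 48) -> int:
    """Count trainable weights for the lab's layout with tied output weights."""
    tied_vocab = vocab_size * d_model                 # token_emb, shared with head
    positions = block_size * d_model                  # pos_emb
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                   # ln1 and ln2, weight plus bias
    final_norm = 2 * d_model                          # ln_f
    per_block = attention + mlp + layer_norms
    return tied_vocab + positions + n_layers * per_block + final_norm

compact_total = count_params(17_485)
full_total = count_params(50_257)
print(f"compact vocab: {compact_total:,}")   # 1,907,040
print(f"full vocab:    {full_total:,}")      # 5,053,152
print(f"vocab share of compact model: {17_485 * 96 / compact_total:.1%}")
```

In the notebook, `sum(p.numel() for p in model.parameters())` should agree with the compact figure, since tied weights are counted once. Even after remapping, the vocabulary matrix remains the largest single component of this model.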

Why one-token shift still matters

Nothing about subwords changes the core causal objective. The loss still compares logits at position t against the ground-truth token at t + 1.

If you forget the shift, the model stops learning continuation and starts learning reconstruction. The loss may still fall, but the checkpoint is optimizing the wrong problem.

For a packed token block:

x: [15496, 995, 11, 262]
y: [  995, 11, 262, 995]

Those are still "predict next token" pairs, even though each integer now stands for a BPE token instead of a character.
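The same pair can drive a cheap guard against the missing-shift bug. `labels_are_shifted` below is a hypothetical helper, not part of the lab. It only checks the overlap between inputs and targets, so a block of identical tokens would pass either way.

```python
def labels_are_shifted(x: list[int], y: list[int]) -> bool:
    """True when y is x moved one step ahead: y[i] must equal x[i + 1]."""
    return len(x) == len(y) and x[1:] == y[:-1]

x = [15496, 995, 11, 262]
good_y = [995, 11, 262, 995]      # next-token targets from the block above
bad_y = list(x)                   # the silent bug: targets copy the inputs

shifted_ok = labels_are_shifted(x, good_y)
copied_ok = labels_are_shifted(x, bad_y)
print("shifted targets pass:", shifted_ok)
print("copied targets pass: ", copied_ok)
```

Running a check like this on one real batch before training costs almost nothing compared with discovering the bug after a long run.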

Packing and splitting data

This toy corpus is represented as one token stream, and fixed windows are cut from that stream rather than sentence rows. Larger pre-training pipelines may pack multiple documents with separate boundary handling, as covered in the preceding chapter.

This lab keeps the same idea:

encode text once
flatten token stream
slice fixed block_size windows
shift targets by one token
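The four steps above can be sketched on a toy stream. This version walks the stream with a fixed stride so the output is easy to read, whereas the lab draws random start positions. `make_windows` and the token values are illustrative.

```python
def make_windows(stream: list[int], block_size: int,
                 stride: int) -> list[tuple[list[int], list[int]]]:
    """Cut (input, target) pairs from one flat token stream."""
    windows = []
    # Stop early enough that every target window still has block_size tokens.
    for start in range(0, len(stream) - block_size, stride):
        x = stream[start:start + block_size]
        y = stream[start + 1:start + block_size + 1]
        windows.append((x, y))
    return windows

toy_stream = list(range(100, 110))          # ten made-up token ids
toy_windows = make_windows(toy_stream, block_size=4, stride=3)
for x, y in toy_windows:
    print(x, "->", y)
```

The loop bound is `len(stream) - block_size`, not `len(stream) - block_size + 1`, because each target needs one token beyond the end of its input.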

This lab splits the original stream once: the first 90% is training data and the final 10% is validation data. Avoid building a validation stream by concatenating interleaved chunks: random windows could then train or evaluate on invented transitions between passages that were never adjacent.

avoid-fake-split-boundaries.py

tokens = list(range(20))
chunks = [tokens[start:start + 4] for start in range(0, len(tokens), 4)]

interleaved_train = chunks[0] + chunks[2]
fake_transition = (interleaved_train[3], interleaved_train[4])

split_at = int(0.8 * len(tokens))
contiguous_train = tokens[:split_at]
contiguous_validation = tokens[split_at:]

print("interleaved concatenation transition:", fake_transition)
print("was adjacent in source:", fake_transition[1] == fake_transition[0] + 1)
print("contiguous split sizes:", len(contiguous_train), len(contiguous_validation))

Output

interleaved concatenation transition: (3, 8)
was adjacent in source: False
contiguous split sizes: 16 4

Validation loss versus sample quality

Notice the progression:

short warmup checkpoint is mostly punctuation soup
longer run lowers validation loss and raises validation accuracy
later sample has more dialogue-like clauses and role words

That's the point of the staged notebook. You can see the model move from "barely shaped" to "somewhat useful" instead of pretending the first checkpoint was already good.

In bigger runs you still don't choose between metrics and generations. You need both.

How modern LLMs differ from this lab

This lab follows the relevant GPT-2 educational choices: learned absolute position embeddings, layer normalization at each sub-block input plus a final normalization, full multi-head attention, and a GELU MLP [3]. Later decoder-only families such as Llama 2 keep the same causal next-token contract while changing several components for their training and serving goals [7] (Llama 2: Open Foundation and Fine-Tuned Chat Models, https://arxiv.org/abs/2307.09288).

Lab component	Modern replacement	Why it changed
Learned `pos_emb` table	Rotary position embeddings (RoPE)	rotates queries and keys by position instead of learning an absolute position table; Llama 2 uses RoPE [8] (RoFormer: Enhanced Transformer with Rotary Position Embedding, https://arxiv.org/abs/2104.09864) [7]
`nn.LayerNorm`	RMSNorm	removes mean-centering from normalization; Llama 2 uses RMSNorm before transformer sublayers [9] (Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467) [7]
`GELU` MLP	SwiGLU (gated GLU variant)	uses a gated feed-forward activation selected in Llama 2's architecture [10] (GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202) [7]
Full multi-head attention	Grouped-query attention (GQA)	shares key/value heads within query-head groups to reduce KV-cache load; Llama 2 uses GQA for its 34B and 70B models [11] (GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245) [7]

Llama 2 reports RoPE, RMSNorm, and SwiGLU throughout its family, while GQA is specific to its larger 34B and 70B configurations [7].
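Of the four swaps, the normalization change is the smallest to try in code. The sketch below is a minimal RMSNorm written from the description "removes mean-centering". The `eps` value and class layout are assumptions for illustration, not Llama 2's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by root-mean-square only: no mean subtraction and no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):  # eps is an assumed default
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

torch.manual_seed(0)
acts = torch.randn(2, 4, 96) + 3.0       # shift the mean away from zero
rms_out = RMSNorm(96)(acts)
ln_out = nn.LayerNorm(96)(acts)

print("LayerNorm largest |mean|:", ln_out.mean(dim=-1).abs().max().item())
print("RMSNorm smallest |mean|: ", rms_out.mean(dim=-1).abs().min().item())
print("RMSNorm mean square:     ", rms_out.pow(2).mean(dim=-1).mean().item())
```

LayerNorm drives each position's mean to roughly zero, while RMSNorm leaves the offset in place and only fixes the scale.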

One more change is computational, not architectural. Our forward pass materializes the full [batch, heads, time, time] score matrix, then masks and softmaxes it. PyTorch provides torch.nn.functional.scaled_dot_product_attention, which can dispatch to optimized backends when conditions allow. You can drop it into this lab without changing the attention result:

scaled-dot-product-attention-equivalence.py

import math
import torch
import torch.nn.functional as F

torch.manual_seed(3)
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
manual = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v
fused_api = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("output shape:", tuple(fused_api.shape))
print("numerically close:", torch.allclose(manual, fused_api, atol=1e-6))

Output

output shape: (1, 2, 4, 8)
numerically close: True

The manual version stays in the lab because seeing the score matrix get masked is the point. PyTorch's API can select optimized kernels when supported, while FlashAttention gives the IO-aware algorithmic basis for avoiding full attention-matrix materialization [12] (FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, https://arxiv.org/abs/2205.14135).

Mastery check

Evaluation rubric

Explains GPT pretraining as one contract: corpus -> tokenizer -> packed windows -> causal decoder -> shifted loss -> checkpoint -> sampling
Explains why GPT-2 BPE plus active-token remap matches GPT-2 tokenization without paying for a full 50k-row tied vocabulary matrix
Diagnoses the silent label-shift bug and explains why falling loss can still hide wrong training targets
Explains why a causal mask is still required even after moving from characters to subword tokens
Uses validation loss, validation accuracy, and sampled text together instead of trusting one signal alone
Reads later checkpoint text as evidence of local structure learning, not as proof of fluent writing
Contrasts the lab's manual attention path with fused production attention without confusing systems optimization for objective changes

Follow-up questions

Common pitfalls

Symptom: Loss falls, but samples look like reconstruction or trivial copying. Cause: Logits at position t were trained against token t instead of token t + 1. Fix: Audit the shifted-label path with one tiny hand-checked batch before training longer.
Symptom: A resumed run loads cleanly, then generations turn nonsensical. Cause: The tokenizer changed after the checkpoint was written. Fix: Save and reload tokenizer metadata with the model state, then verify that one prompt encodes the same way before resuming.
Symptom: The CPU lab is unexpectedly huge and slow. Cause: Sparse GPT-2 token IDs were used directly, so the toy model allocates a full 50k-row tied vocabulary matrix. Fix: Remap active corpus IDs into a compact local vocabulary.
Symptom: Loss looks better, but generations are repetitive or collapse. Cause: You trusted training loss or a single metric instead of checking held-out samples. Fix: Track validation loss, validation accuracy, and a fixed novel-prompt sample together.
Symptom: Outputs look locally plausible but lose longer dialogue structure. Cause: block_size was chosen only for speed, not for pattern length. Fix: Increase the context window until the model can see enough preceding tokens to learn the structure you care about.
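
The hand-checked batch suggested for the label-shift pitfall can be as small as the sketch below. It uses plain Python lists so the check does not depend on the lab's data loader. The names make_example and audit_shift and the token values are hypothetical.

```python
def make_example(tokens, start, block_size):
    """Slice one training window: inputs x and next-token targets y."""
    x = tokens[start : start + block_size]
    y = tokens[start + 1 : start + block_size + 1]
    return x, y


def audit_shift(x, y):
    """True when every target y[t] equals the next input x[t + 1]."""
    return len(x) == len(y) and all(y[t] == x[t + 1] for t in range(len(x) - 1))


tokens = [10, 11, 12, 13, 14, 15]  # distinct IDs so a wrong shift is visible
x, y = make_example(tokens, start=0, block_size=4)

print("inputs: ", x)  # [10, 11, 12, 13]
print("targets:", y)  # [11, 12, 13, 14]
print("correct shift passes:", audit_shift(x, y))    # True
print("unshifted bug passes:", audit_shift(x, x))    # False
```

The audit needs a window with distinct token IDs. On a run of identical tokens, shifted and unshifted targets look the same and the check passes for the wrong reason.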

What to remember

Real GPTs are token-ID machines, not string machines.
This lab uses GPT-2 byte-level BPE rather than a character alphabet.
Tiny labs can still use real subword tokenization if they remap active tokens into compact local IDs.
The GPT training loop is still corpus -> token IDs -> packed windows -> masked self-attention -> shifted loss -> checkpoint -> generation.
Validation loss and sampled text should both move in the right direction before you trust a checkpoint.
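
The active-token remap mentioned above fits in a few lines. This is a sketch of the idea, not the lab's exact code: the name build_remap and the sample IDs are hypothetical stand-ins for sparse GPT-2 token IDs.

```python
def build_remap(token_ids):
    """Map the token IDs that actually occur to dense local IDs 0..n-1."""
    active = sorted(set(token_ids))
    to_local = {orig: local for local, orig in enumerate(active)}
    to_orig = active  # list index is the local ID
    return to_local, to_orig


# Hypothetical sparse IDs; a real corpus would come from the GPT-2 tokenizer.
corpus_ids = [50000, 13, 198, 13, 40000, 198]
to_local, to_orig = build_remap(corpus_ids)

local_ids = [to_local[t] for t in corpus_ids]
restored = [to_orig[i] for i in local_ids]

print("local ids:", local_ids)             # [3, 0, 1, 0, 2, 1]
print("local vocab size:", len(to_orig))   # 4
print("round trip ok:", restored == corpus_ids)  # True
```

The embedding table then needs one row per active token instead of one row per GPT-2 vocabulary entry. The to_orig list has to be saved with the checkpoint, since decoding samples back to text requires the original IDs.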

Next Step

Continue to Continued Pretraining and Domain Adaptation

This lab built a GPT-style base-model training loop from raw text to checkpoint and sample. The next chapter asks what changes when a model already knows general language and you want to adapt it to narrower domain data instead of starting from zero.


References

CS336: Language Modeling from Scratch.

Stanford University · 2026

nanoGPT.

Karpathy, A. · 2025

Language Models are Unsupervised Multitask Learners.

Radford, A., et al. · 2019

shakespeare.txt.

Princeton University COS 302 / SML 305 · 2020

GPT-2 Source Implementation.

OpenAI · 2019

PyTorch: An Imperative Style, High-Performance Deep Learning Library.

Paszke, A., et al. · 2019 · NeurIPS 2019

Llama 2: Open Foundation and Fine-Tuned Chat Models.

Touvron, H., et al. · 2023 · arXiv preprint

RoFormer: Enhanced Transformer with Rotary Position Embedding.

Su, J., et al. · 2021

Root Mean Square Layer Normalization.

Zhang, B. & Sennrich, R. · 2019 · NeurIPS 2019

GLU Variants Improve Transformer.

Shazeer, N. · 2020

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

Follow-up questions

Why is this chapter different from earlier language-modeling and Transformer chapters?

What is the most common silent bug in causal language-model training code?

Why sample text after a checkpoint instead of trusting loss alone?

Why does this lab remap GPT-2 token IDs into a compact local vocabulary instead of training directly on the original token integers?

If this tutorial already uses subword tokens, why do we still need shifted labels and a causal mask?

The warmup sample is mostly punctuation soup, but a later sample starts producing role words and dialogue-like clauses. What changed?

You keep the same model weights but accidentally reload the checkpoint with a different tokenizer. What symptom should you expect first?
