Compare tokenization algorithms, understand vocabulary size tradeoffs, analyze the multilingual tokenization tax, and handle Unicode edge cases in production.
Imagine trying to teach a computer to read. You can't just feed it raw text, because it doesn't know what letters, words, or sentences are. You need to break the text into small, meaningful pieces that the computer can work with. This process is called tokenization, and it's the very first step in how every language model, from ChatGPT to Gemini, processes your input.
Think of it like breaking a sentence into Scrabble tiles: you need pieces that are small enough to be flexible, but large enough to carry meaning. The word "unhappiness" might become ["un", "happi", "ness"]. Each piece is meaningful on its own, and together they reconstruct the whole word. Every large language model (LLM) sees the world through these subword tokens, not raw words or characters.
This article is your foundation. Every topic that follows (attention, embeddings, inference) builds on understanding how raw text becomes the numerical sequences that models actually process.
🎯 Core concept: Tokenization is the bridge between human language and model computation. Understanding how and why text gets split into tokens will help you reason about everything from API costs to why some languages perform worse than others in AI systems.
Before diving into algorithms, understand the fundamental tension:
| Approach | Vocabulary Size | Avg Sequence Length | Problem |
|---|---|---|---|
| Character-level | ~256 | Very long (5–7× word count) | Sequences too long for attention; hard to learn word semantics |
| Word-level | 500K+ | Short | Huge embedding table; can't handle misspellings, neologisms, or morphology |
| Subword-level | 32K–200K | Balanced | ✅ Best of both worlds |
Subword tokenization solves this by learning a vocabulary of frequently co-occurring character sequences. Common words like "the" stay whole. Rare words like "unhappiness" get split into meaningful pieces: ["un", "happi", "ness"]. This is the approach used by every modern LLM.
BPE is the dominant tokenization algorithm in modern LLMs. Originally a data compression algorithm invented by Philip Gage in 1994[1], it was adapted for NLP by Sennrich, Haddow & Birch (2016)[2] and later refined into byte-level BPE by Radford et al. (2019)[3] for GPT-2.
The algorithm is elegantly simple: iteratively merge the most frequent pair.
Here is the complete training algorithm. It takes a corpus of text and a target vocabulary size as input, and outputs the final vocabulary along with the ordered list of merge rules that will be used to tokenize new text:
```python
def merge_pair(ids: list, pair: tuple, new_id: int) -> list:
    """Replace all occurrences of a pair with new_id in a list of ids."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(corpus: str, vocab_size: int):
    """Train a BPE tokenizer from a text corpus."""

    # Step 1: Encode text as UTF-8 bytes (byte-level BPE)
    # This gives us a base vocabulary of exactly 256 byte values
    token_ids = list(corpus.encode("utf-8"))  # e.g., [72, 101, 108, ...]

    merges = {}  # (pair) -> new_token_id
    vocab = {i: bytes([i]) for i in range(256)}  # Base: all byte values

    num_merges = vocab_size - 256

    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pair_counts = {}
        for j in range(len(token_ids) - 1):
            pair = (token_ids[j], token_ids[j + 1])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

        if not pair_counts:
            break

        # Step 3: Find the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 4: Create new token and replace all occurrences
        new_id = 256 + i
        token_ids = merge_pair(token_ids, best_pair, new_id)

        merges[best_pair] = new_id
        vocab[new_id] = vocab[best_pair[0]] + vocab[best_pair[1]]

    return vocab, merges  # Ordered merge rules are the "model"
```
Let's trace BPE on a tiny corpus to build intuition:
```text
Corpus: "low low low low lowest lowest newer newer newer wider wider"

Step 0: Initial tokens (characters):
  ['l','o','w',' ','l','o','w',' ','l','o','w',' ','l','o','w',' ',
   'l','o','w','e','s','t',' ','l','o','w','e','s','t',' ',
   'n','e','w','e','r',' ','n','e','w','e','r',' ','n','e','w','e','r',' ',
   'w','i','d','e','r',' ','w','i','d','e','r']

Step 1: Most frequent pair: ('l','o') appears 6 times
        Merge: 'l' + 'o' → 'lo'

Step 2: Most frequent pair: ('lo','w') appears 6 times
        Merge: 'lo' + 'w' → 'low'

Step 3: Most frequent pair: ('e','r') appears 5 times
        Merge: 'e' + 'r' → 'er'

Step 4: Most frequent pair: ('e','s') appears 2 times
        Merge: 'e' + 's' → 'es'

Step 5: Most frequent pair: ('es','t') appears 2 times
        Merge: 'es' + 't' → 'est'

Final vocabulary includes: {..., 'lo', 'low', 'er', 'es', 'est', ...}
```
💡 Key insight: The merge rules capture linguistic patterns. "est" emerges as a suffix, "er" as another. BPE discovers morphology without being told about it.
Once trained, the merge rules are applied in order to new text:
```text
Input: "lowest"

Step 0: ['l', 'o', 'w', 'e', 's', 't']  ← start with bytes
Merge 1 ('l','o' → 'lo'):   ['lo', 'w', 'e', 's', 't']
Merge 2 ('lo','w' → 'low'): ['low', 'e', 's', 't']
Merge 3 ('e','s' → 'es'):   ['low', 'es', 't']
Merge 4 ('es','t' → 'est'): ['low', 'est']

Final tokens: ['low', 'est'] → IDs: [4521, 478]
```
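The replay logic above can be sketched in a few lines of Python. This is a simplified sketch, not a real tokenizer's implementation: the helper name `bpe_encode` and the toy merge table are illustrative, and the key idea is that merges must be applied in the order they were learned.

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Apply learned BPE merges, in training order, to new text.

    `merges` maps (id_a, id_b) -> new_id and must preserve the order
    in which the merges were learned (Python dicts keep insertion order).
    """
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # earlier merges have priority
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)        # replace the pair with its merged id
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Toy merge table: 'l'+'o' -> 256 ("lo"), then 256+'w' -> 257 ("low")
merges = {(108, 111): 256, (256, 119): 257}
print(bpe_encode("low low", merges))  # [257, 32, 257]
```

Note that decoding is even simpler: look each id up in the vocabulary of byte strings and concatenate.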
GPT-2 introduced a critical innovation: byte-level BPE[3]. Instead of starting from Unicode characters, it starts from raw UTF-8 bytes (a base vocabulary of exactly 256). This guarantees that any Unicode string can be encoded and decoded losslessly, with no possibility of unknown tokens.
⚠️ The tradeoff: Non-ASCII characters (like Chinese characters or emoji) require 2–4 bytes each, so they initially get split into multiple tokens before merges can compress them. This is one root cause of the multilingual "tokenization tax" discussed later.
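A quick way to see this tax is to count UTF-8 bytes per character, since each byte becomes one initial token before any merges apply:

```python
# Each UTF-8 byte is one base token before any merges apply,
# so characters that need more bytes start out more expensive.
samples = ["a", "é", "猫", "🚀"]  # ASCII, accented Latin, CJK, emoji
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_counts)  # {'a': 1, 'é': 2, '猫': 3, '🚀': 4}
```

An emoji starts life as four tokens while an ASCII letter starts as one; only merges learned from training data can close that gap.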
BPE is used by GPT-2, GPT-3, GPT-4, and GPT-4o (via tiktoken), as well as LLaMA/Llama 2/3, Mistral, Code Llama, and StarCoder. Virtually every modern autoregressive LLM uses BPE.
WordPiece was developed by Schuster & Nakajima (2012)[4] at Google for Japanese and Korean voice search, then popularized by BERT (Devlin et al., 2019)[5].
The critical distinction: WordPiece merges are chosen to maximize the likelihood of the training corpus, not just raw frequency. The merge score is:

score(a, b) = freq(ab) / (freq(a) × freq(b))
Reading the formula: divide the frequency of the merged pair by the product of the individual frequencies. This is essentially pointwise mutual information (PMI), which measures how much more likely two tokens appear together than you'd expect by chance. A pair like ("q", "u") scores high because "qu" is far more common than the individual frequencies of "q" and "u" would predict. Unlike BPE (which just picks the most frequent pair), WordPiece picks the pair with the strongest statistical association.
Consider this scenario in a corpus:
```text
Token frequencies: "t" appears 1000x, "h" appears 800x, "th" appears 500x
                   "x" appears 10x,   "z" appears 8x,   "xz" appears 7x

BPE picks:  "th" (frequency 500 > 7) — raw count wins
WordPiece:  "xz" (score: 7/(10×8) = 0.0875  vs  "th": 500/(1000×800) = 0.000625)
```
WordPiece would merge "xz" first because the tokens almost always co-occur. It's a much stronger signal than the common-but-independent "th".
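The numbers above can be checked directly. This sketch uses a hypothetical helper name, `wordpiece_score`; the formula is the one described above:

```python
def wordpiece_score(pair_freq: int, a_freq: int, b_freq: int) -> float:
    """WordPiece merge score: freq(ab) / (freq(a) * freq(b))."""
    return pair_freq / (a_freq * b_freq)

th = wordpiece_score(500, 1000, 800)  # 0.000625 — frequent but independent
xz = wordpiece_score(7, 10, 8)        # 0.0875   — rare but strongly associated
print(xz > th)  # True: WordPiece merges "xz" before "th"
```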
The ## prefix convention: WordPiece uses ## to mark continuation tokens (tokens that don't start a new word):
```text
Input: "unhappiness"

WordPiece tokenization:
  ["un", "##happi", "##ness"]

The ## tells the model: "this token continues the previous word"
Without ##: the model couldn't distinguish "un happy" from "unhappy"
```
Unlike BPE (which replays merge rules in order), WordPiece uses greedy longest-match at inference. This function takes a single word and the learned vocabulary as input. It repeatedly finds the longest valid subword starting from the current position and outputs a list of matching subword tokens:
```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """WordPiece greedy longest-match tokenization."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        found = False
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # Continuation prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            tokens.append("[UNK]")  # Unknown token fallback
            start += 1
        else:
            start = end
    return tokens
```
⚠️ Critical difference: If WordPiece can't tokenize a word at all, it falls back to [UNK]. BPE never produces unknown tokens, since it can always fall back to individual bytes.
| Aspect | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Highest likelihood gain (PMI) |
| Subword prefix | No special marker | ## for continuations |
| Inference tokenization | Replay merge rules in order | Greedy longest-match |
| Unknown tokens | Never (byte fallback) | [UNK] if no match |
| Used by | GPT, LLaMA, Mistral | BERT, DistilBERT, ELECTRA |
| Dominant in | Generative / autoregressive | Encoder / masked LMs |
SentencePiece, developed by Kudo & Richardson (2018)[6] at Google, isn't a new tokenization algorithm. It's a framework that makes tokenization truly language-independent by removing the need for pre-tokenization.
🧩 Analogy, universal translator: Think of traditional tokenizers like English-language spell checkers. They work great for English but choke on Japanese or Thai because they assume words are separated by spaces. SentencePiece is more like a universal translator that doesn't assume anything about how languages work. It treats all text as a raw stream of characters, making it equally capable with any writing system.
Traditional tokenizers (BPE, WordPiece) assume you can split text into words first:
```text
English:  "I love cats" → ["I", "love", "cats"] → subtokenize each
Japanese: "猫が好きです" → ??? No spaces to split on!
Thai:     "ฉันรักแมว" → ??? Also no spaces!
German:   "Lebensversicherungsgesellschaft" → One giant compound word
```
SentencePiece solves this by treating the entire input as a raw character stream, whitespace included:
```text
SentencePiece: "I love cats" → "▁I▁love▁cats" → subtokenize directly
               "猫が好きです" → "▁猫が好きです" → subtokenize directly
```
The ▁ (Unicode U+2581, "lower one eighth block") explicitly represents whitespace as a visible token. This means:
▁the (word start) is a different token from the (mid-word), so no word-boundary information is lost.

SentencePiece supports two training algorithms:
Standard BPE, but operating on the raw character stream (including ▁). This is what LLaMA and Mistral use.
The Unigram model, introduced by Kudo (2018)[7], takes the opposite approach from BPE:
🪨 Analogy, sculptor's block of marble: BPE is like building a word from individual letter tiles: you start small and keep gluing pieces together. Unigram is like Michelangelo sculpting David. You start with a massive block of marble (a huge vocabulary) and chisel away the pieces that matter least, revealing the optimal vocabulary hidden inside.
The Unigram algorithm works top-down:
1. Start with a very large seed vocabulary (e.g., all frequent substrings of the corpus).
2. Fit a unigram language model over the current vocabulary using EM.
3. For each token, compute how much the corpus likelihood drops if that token is removed.
4. Prune the tokens with the smallest likelihood impact (always keeping single characters).
5. Repeat steps 2–4 until the target vocabulary size is reached.
💡 Why Unigram matters: Unlike BPE (which produces a single deterministic segmentation), Unigram can sample from multiple valid segmentations. This is called subword regularization. During training, you expose the model to different ways of splitting the same word, which acts as a powerful data augmentation technique and improves robustness[7].
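To make the likelihood-based segmentation concrete, here is a minimal Viterbi decoder over a toy unigram vocabulary. The token log-probabilities are invented for illustration, and real implementations also handle unsegmentable inputs; this sketch assumes every character is in the vocabulary:

```python
import math

def viterbi_segment(word: str, logprobs: dict) -> list:
    """Most likely segmentation of `word` under a unigram LM over subwords."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprobs and best[start] + logprobs[piece] > best[end]:
                best[end] = best[start] + logprobs[piece]
                back[end] = start
    pieces, i = [], n
    while i > 0:                     # walk backpointers from the end
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary: subword pieces score far better than single characters
lp = {"un": -2.0, "happi": -3.0, "ness": -2.5}
lp.update({c: -6.0 for c in "unhappiness"})
print(viterbi_segment("unhappiness", lp))  # ['un', 'happi', 'ness']
```

Subword regularization replaces this argmax with sampling from the distribution over segmentations, so the same word can be split differently on different training passes.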
Implementing SentencePiece in production is straightforward using the official Python library. The following example demonstrates how to train a new tokenizer directly on raw text files and then use it to encode and decode both English and Japanese inputs:
```python
import sentencepiece as spm

# Training: language-independent, works on raw text files
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',            # or 'unigram'
    character_coverage=0.9995,   # Fraction of characters to cover
    byte_fallback=True,          # Handle unknown chars via UTF-8 bytes
)

# Usage
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

print(sp.encode("Hello world!", out_type=str))
# ['▁Hello', '▁world', '!']

print(sp.encode("猫が好きです", out_type=str))
# ['▁猫', 'が', '好き', 'です']

# Perfect reconstruction
print(sp.decode(['▁Hello', '▁world', '!']))
# "Hello world!"
```
Vocabulary size is one of the most consequential design decisions in building a tokenizer. It directly controls the tradeoff between sequence compression (fewer tokens per text) and embedding table size (memory cost).
| Vocab Size | Fertility (EN) | Fertility (ZH) | Embedding Memory | Representative Model |
|---|---|---|---|---|
| 8K | ~1.8 tok/word | ~5.0 tok/char | 16 MB | Small research models |
| 32K | ~1.3 tok/word | ~2.5 tok/char | 64 MB | LLaMA 1 & 2, T5 |
| 50K | ~1.2 tok/word | ~2.0 tok/char | 100 MB | GPT-2 (50,257) |
| 100K | ~1.1 tok/word | ~1.5 tok/char | 200 MB | GPT-4 (100,256, cl100k_base) |
| 200K | ~1.05 tok/word | ~1.2 tok/char | 400 MB | GPT-4o (199,997, o200k_base) |
Fertility = average number of tokens per word (or per character for logographic scripts). Lower is better for efficiency.
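Fertility is easy to compute for any tokenizer's output. This is an illustrative helper (the name `fertility` is ours), and whitespace-splitting approximates word counts for space-delimited languages only:

```python
def fertility(tokens: list, text: str) -> float:
    """Average tokens per whitespace-delimited word; lower is cheaper."""
    return len(tokens) / len(text.split())

# 4 subword tokens covering 2 words -> fertility of 2.0
print(fertility(["low", "est", "new", "er"], "lowest newer"))  # 2.0
```

For logographic scripts like Chinese, divide by character count instead of word count.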
📈 Industry trend: Vocabulary sizes are growing. GPT-2 (2019) used 50K tokens. GPT-4 (2023) doubled to 100K. GPT-4o (2024) doubled again to 200K. This reflects the push for better multilingual support and code handling.
```text
Larger vocabulary:
  ✅ Fewer tokens per text (more fits in context window)
  ✅ Better representation of rare words and non-English scripts
  ✅ Lower API cost per character

  ❌ Larger embedding table (vocab_size × hidden_dim parameters)
  ❌ Rare tokens get less training signal → undertrained embeddings
  ❌ Softmax over larger vocabulary is slower
```
The empirical sweet spot for modern LLMs has settled around 32K–128K for primarily English models and 100K–200K for multilingual models[8].
One of the most critical aspects of modern LLM performance is tokenization bias, the systematic disadvantage that non-English languages face due to English-centric tokenizer training.
💰 Analogy, currency exchange rates: Imagine a vending machine that only accepts US quarters. If you're American, a candy bar costs 4 quarters. But if you're Japanese, you first have to exchange your yen into quarters, and the exchange rate is terrible, so the same candy bar costs you 15 quarters. That's the tokenization tax: non-English languages pay more tokens for the same meaning, which means higher API costs, shorter effective context windows, and slower generation.
When a tokenizer is trained primarily on English data, non-English text fragments into far more tokens:
| Language | Script | GPT-2 (50K, EN-centric) | GPT-4o (200K, multilingual) |
|---|---|---|---|
| English | Latin | 1.0× (baseline) | 1.0× (baseline) |
| French | Latin | 1.5× | 1.1× |
| German | Latin | 1.8× | 1.2× |
| Chinese | CJK | 3.5× | 1.4× |
| Hindi | Devanagari | 4.2× | 1.8× |
| Korean | Hangul | 3.8× | 1.5× |
| Swahili | Latin | 2.5× | 1.6× |
Multipliers show token count relative to English for equivalent semantic content[9][10].
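The practical impact is simple arithmetic. Using the illustrative multipliers from the table above (the helper name `tokens_for` is ours), the same content costs proportionally more tokens, and therefore more money and context budget:

```python
def tokens_for(english_tokens: int, multiplier: float) -> int:
    """Token count for equivalent content, given a fertility multiplier."""
    return round(english_tokens * multiplier)

# A document that is 1,000 tokens in English, under a 50K EN-centric vocab:
print(tokens_for(1000, 3.5))  # 3500 tokens for Chinese
print(tokens_for(1000, 4.2))  # 4200 tokens for Hindi
```

Since API pricing, context limits, and latency all scale with token count, every multiplier in the table translates directly into cost and capability gaps.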
Petrov et al. (2023)[9] demonstrated that this "tokenization tax" has cascading effects: higher API costs for the same content, shorter effective context windows, higher latency, and degraded downstream task performance for non-English users.
🔬 Research finding: A recent study on the "Token Tax" (2025)[10] found that fertility reliably predicts accuracy on multilingual benchmarks: higher fertility → lower accuracy, consistently across 10 LLMs tested on 16 African languages. Doubling token count quadruples training cost and time.
Mitigations are visible in newer tokenizers:
- GPT-4o's o200k_base encoding significantly reduces CJK (Chinese, Japanese, and Korean) and Indic token inflation compared to cl100k_base.
- SentencePiece's character_coverage parameter ensures a minimum percentage of characters get their own token.

In production systems, raw user input is messy. Engineers must handle Unicode pitfalls to ensure consistent tokenization:
Python's built-in unicodedata library is the standard tool for handling Unicode variations. The following snippet demonstrates the difference between composed and decomposed forms, and how NFKC normalization can convert ligatures back into standard characters:
```python
import unicodedata

text = "café"  # But which "é"?

# Composed form (NFC): é is a single code point (U+00E9)
nfc = unicodedata.normalize('NFC', text)   # "café" - 4 chars

# Decomposed form (NFD): é = e + combining acute accent (two code points)
nfd = unicodedata.normalize('NFD', text)   # "café" - 5 chars

# NFKC: Also decomposes compatibility characters
ligature = "ﬁnance"  # ﬁ is a single ligature character
nfkc = unicodedata.normalize('NFKC', ligature)  # "finance"
```
⚠️ Production critical: If you don't normalize before tokenizing, the same word can produce different token sequences depending on how the Unicode was encoded. Most production tokenizers apply NFKC normalization by default.
| Edge Case | Example | Problem | Solution |
|---|---|---|---|
| Composed vs decomposed | é (1 char) vs e + ´ (2 chars) | Same visual, different tokens | NFC normalization |
| Homoglyphs | а (Cyrillic) vs a (Latin) | Looks identical, different encodings | NFKC + homoglyph detection |
| Zero-width chars | U+200B (zero-width space) | Invisible, can bypass content filters | Strip during preprocessing |
| Emoji sequences | 👨‍👩‍👧‍👦 = 7 code points | One visual glyph, many bytes | Handle via ZWJ-aware segmentation |
| RTL/Bidi text | Arabic mixed with English | Display order ≠ logical order | Apply Unicode bidi algorithm |
| Surrogate pairs | Characters outside BMP | Can cause errors in UTF-16 systems | Use UTF-8 internally |
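A defensive preprocessing pass covers the normalization and zero-width rows of the table. This is a minimal sketch (the function name `preprocess` is ours); note that ZWNJ/ZWJ (U+200C/U+200D) are deliberately kept, since emoji sequences and scripts like Arabic and Persian use them legitimately:

```python
import unicodedata

# Zero-width characters that are safe to strip in most pipelines.
# U+200C/U+200D (ZWNJ/ZWJ) are deliberately NOT stripped.
STRIP = {"\u200b", "\ufeff"}  # zero-width space, byte-order mark

def preprocess(text: str) -> str:
    """NFKC-normalize, then remove strippable zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in STRIP)

print(preprocess("\ufb01nance"))   # "finance" (ﬁ ligature decomposed)
print(preprocess("cafe\u0301"))    # "café" (decomposed é recomposed)
print(preprocess("foo\u200bbar"))  # "foobar" (zero-width space removed)
```

Homoglyph detection and bidi handling need dedicated logic beyond normalization; this pass only guarantees that visually identical normalization variants tokenize identically.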
When implementing tokenization in production, engineers rarely write these algorithms from scratch. The open-source ecosystem provides highly optimized, production-ready libraries that handle the heavy lifting. The choice of library often depends on the target model and the specific deployment environment.
For OpenAI models, tiktoken is the industry standard due to its raw speed and precise alignment with their APIs. When working with open-source models like LLaMA or Mistral, the Hugging Face tokenizers library provides maximum flexibility. For custom model training, especially in multilingual contexts, SentencePiece remains the go-to framework. The following examples demonstrate how to initialize and use these three major libraries to tokenize a simple string:
```python
# ─── tiktoken (OpenAI) ──────────────────────────
# Rust-based, optimized for OpenAI models
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base
tokens = enc.encode("Hello, world!")         # [13225, 11, 2375, 0]
text = enc.decode(tokens)                    # "Hello, world!"
num_tokens = len(tokens)                     # 4

# ─── Hugging Face tokenizers ────────────────────
# Rust-based, supports all architectures
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
out = tok("Hello, world!", return_tensors="pt")
# out.input_ids: tensor([[1, 15043, 29892, 3186, 29991]])

# ─── SentencePiece ──────────────────────────────
# C++-based, for training custom tokenizers
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='llama.model')
pieces = sp.encode("Hello, world!", out_type=str)
# ['▁Hello', ',', '▁world', '!']
```
| Library | Language | Speed | Best For |
|---|---|---|---|
| tiktoken | Rust + Python | ⚡ Fastest | Token counting for OpenAI API; cost estimation |
| HF tokenizers | Rust + Python | ⚡ Very fast | Training & inference with any model |
| SentencePiece | C++ | 🚀 Fast | Training custom tokenizers; multilingual |
To synthesize the differences between these three major approaches, it helps to look at them side-by-side. The fundamental divergence lies in their directionality (bottom-up merging vs top-down pruning) and their decision criteria (raw frequency vs statistical likelihood).
| Feature | BPE | WordPiece | Unigram (SentencePiece) |
|---|---|---|---|
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Merge criterion | Frequency | Likelihood (PMI) | N/A (pruning by likelihood impact) |
| Pre-tokenization | Required (unless byte-level) | Required | Not required |
| Unknown tokens | None (byte fallback) | [UNK] | None (byte fallback) |
| Deterministic | Yes | Yes | Yes (Viterbi), but can sample |
| Subword regularization | Via BPE-dropout | No | Native (multiple segmentations) |
| Primary models | GPT, LLaMA, Mistral | BERT, ELECTRA | T5, XLNet, ALBERT |
| Training speed | Fast | Fast | Slower (EM iterations) |
| Compression ratio | Best (~25-29% better than Unigram) | Good | Good |
Code tokenization requires specific handling distinct from natural language because code has strict syntax rules where whitespace and punctuation matter deeply.
Standard BPE or WordPiece trained on English text often fails on code:
- Identifiers: getUserByEmailAddress might be split into ['get', 'User', 'By', 'Email', 'Address'] (good) or ['getU', 'serB', 'yEm', 'ailA', 'ddress'] (bad). CamelCase splitting is crucial.
- Operators: ===, !==, ->, and :: should be atomic tokens, not fragmented into ['=', '=', '='].
- Numbers: Numeric literals (3.14159) should not be split arbitrarily.

Recent tokenizers like GPT-4o's o200k_base explicitly add code-specific tokens to the vocabulary to reduce fragmentation.[11] The example below compares how standard and optimized tokenizers process the same line of Python code, highlighting how specialized vocabularies preserve semantic boundaries:
```python
# Example: Tokenizing code with GPT-4's cl100k_base (standard) vs o200k_base (optimized)

code = "    def calculate_pi():"

# Standard (GPT-4): 6 tokens
# ['    ', 'def', ' calculate', '_', 'pi', '():']  (fragments the function name)

# Optimized (GPT-4o): 4 tokens
# ['    ', 'def', ' calculate_pi', '():']  (preserves the identifier)
```
💡 Key insight: Reducing the number of tokens per line of code directly increases the context window's effective size for repository-level reasoning. This is why models like StarCoder and Code Llama use specialized BPE training on GitHub data.
- ▁: SentencePiece's visible whitespace marker, supporting both BPE and Unigram modes without pre-tokenization.
- ##: This is a specific convention of WordPiece (and BERT), not a universal property of subword tokenization.

[1] Gage, P. (1994). A New Algorithm for Data Compression.
[2] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
[3] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
[4] Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search.
[5] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[6] Kudo, T., & Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018.
[7] Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.
[8] Raschka, S. (2025). Implementing a Byte Pair Encoding (BPE) Tokenizer from Scratch.
[9] Petrov, A., et al. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023.
[10] Lundin, J. M., Zhang, A., et al. (2025). The Token Tax: Systematic Bias in Multilingual Tokenization. arXiv preprint.
[11] OpenAI (2024). GPT-4o System Card. arXiv preprint.