Build a small subword tokenizer, compare BPE, WordPiece, and SentencePiece, then audit token cost and Unicode behavior.
A customer types, "My refund for order 48291 still isn't here." Your support model can't read that sentence directly. It receives integer IDs, and a tokenizer decides which pieces of the message get those IDs.
In the previous lesson, you saw why learning systems need representations that can improve with data and compute. Tokenization is the first such representation for text: a fixed contract that turns new wording, languages, and symbols into model input. You'll compare Byte Pair Encoding (BPE), WordPiece, and SentencePiece as you build that contract, break it, and learn what to measure before shipping it.
refund, splits rarer wording into reusable fragments, and maps every resulting piece to an ID owned by one vocabulary.
An is a vector associated with a token ID. The next lesson teaches those vectors. For now, keep one rule in mind: ID 421 has no universal meaning. It only means something when paired with the exact tokenizer artifact that produced it.
A word-level tokenizer could keep refund as one item, but it needs a policy for every misspelling, product name, and new tracking code. A character-level tokenizer never runs out of pieces, but it turns short messages into long sequences. Subword tokenization keeps frequent fragments whole while retaining small fallback pieces for rare text.
Consider this ticket:
1refund delayed for order48291| Token unit | Example pieces | What it buys you | What it costs |
|---|---|---|---|
| Character | r e f u n d ... | Any spelling is representable | Long input sequence |
| Word | refund, delayed, for, order48291 | Short common messages | Rare words and IDs need fallbacks |
| Subword | refund, delay, ed, order, 48291 | Compact common patterns with fallback parts | Requires learned vocabulary |
The first lab makes that tradeoff concrete. The subword split is written by hand because you haven't trained a tokenizer yet. It gives you a budget to compare against character and whitespace-word baselines.
1message = "refund delayed for order48291"
2
3characters = list(message.replace(" ", ""))
4words = message.split()
5subwords = ["refund", "delay", "ed", "for", "order", "48291"]
6
7print("characters:", len(characters))
8print("words:", len(words))
9print("candidate subwords:", len(subwords), subwords)
10assert "".join(subwords) == message.replace(" ", "")1characters: 26
2words: 4
3candidate subwords: 6 ['refund', 'delay', 'ed', 'for', 'order', '48291']The word count looks smallest, but it hides the hard question: what happens when order48291 was never in the vocabulary? Subwords answer that question by learning common fragments and retaining smaller pieces for the rest.
Byte Pair Encoding (BPE) builds a vocabulary from repeated adjacent pieces. Sennrich, Haddow, and Birch applied BPE subword units to open-vocabulary neural machine translation: start from small symbols, merge frequent adjacent pairs, and reuse the resulting pieces for rare words.[1]
For teaching, start with characters inside pre-separated ticket terms. A production tokenizer has extra decisions about whitespace, bytes, and normalization, but the merge loop is the important first mechanism.
s + h becomes sh, pair counts are recomputed, so later merges can build ship as one common routing fragment.Suppose a delivery-support corpus contains these term counts:
| Term | Count |
|---|---|
ship | 3 |
shipping | 2 |
shop | 2 |
refund | 2 |
tracking | 1 |
At the start, s + h appears seven times: three in ship, two in shipping, and two in shop. Merge it into sh. On the updated corpus, sh + i appears five times and can become shi; then shi + p becomes ship.
This miniature trainer stores words as tuples of current pieces, counts adjacent pairs, merges the winner everywhere, and prints the first three learned rules.
1from collections import Counter
2
3frequencies = {
4 "ship": 3,
5 "shipping": 2,
6 "shop": 2,
7 "refund": 2,
8 "tracking": 1,
9}
10state = {tuple(word): count for word, count in frequencies.items()}
11
12def count_pairs(words: dict[tuple[str, ...], int]) -> Counter[tuple[str, str]]:
13 pairs: Counter[tuple[str, str]] = Counter()
14 for pieces, count in words.items():
15 for pair in zip(pieces, pieces[1:]):
16 pairs[pair] += count
17 return pairs
18
19def merge_pair(
20 pieces: tuple[str, ...], pair: tuple[str, str]
21) -> tuple[str, ...]:
22 merged: list[str] = []
23 i = 0
24 while i < len(pieces):
25 if i + 1 < len(pieces) and pieces[i : i + 2] == pair:
26 merged.append("".join(pair))
27 i += 2
28 else:
29 merged.append(pieces[i])
30 i += 1
31 return tuple(merged)
32
33merges: list[tuple[str, str]] = []
34for step in range(3):
35 pair, count = count_pairs(state).most_common(1)[0]
36 state = {merge_pair(pieces, pair): freq for pieces, freq in state.items()}
37 merges.append(pair)
38 print(step + 1, pair, "->", "".join(pair), "count", count)
39
40print("learned merges:", merges)11 ('s', 'h') -> sh count 7
22 ('sh', 'i') -> shi count 5
33 ('shi', 'p') -> ship count 5
4learned merges: [('s', 'h'), ('sh', 'i'), ('shi', 'p')]The result isn't a linguistic analysis. BPE doesn't know that ship relates to logistics. It knows only that a boundary occurs often enough to compress.
Training chooses an ordered merge list once. Encoding a new input doesn't recount a new corpus; it replays those learned rules. That distinction matters in production because the tokenizer must stay fixed with the model.
The next lab takes the three rules learned above and applies them to new terms. shipper benefits from the common stem, while shopper gets only the sh merge because the corpus never earned shop as one piece in the first three steps.
1def apply_rule(pieces: list[str], pair: tuple[str, str]) -> list[str]:
2 result: list[str] = []
3 i = 0
4 while i < len(pieces):
5 if i + 1 < len(pieces) and tuple(pieces[i : i + 2]) == pair:
6 result.append("".join(pair))
7 i += 2
8 else:
9 result.append(pieces[i])
10 i += 1
11 return result
12
13rules = [("s", "h"), ("sh", "i"), ("shi", "p")]
14
15for term in ["shipping", "shipper", "shopper"]:
16 pieces = list(term)
17 for rule in rules:
18 pieces = apply_rule(pieces, rule)
19 print(term, "->", pieces)1shipping -> ['ship', 'p', 'i', 'n', 'g']
2shipper -> ['ship', 'p', 'e', 'r']
3shopper -> ['sh', 'o', 'p', 'p', 'e', 'r']Character-starting BPE still needs an unknown-token policy for characters absent from its base vocabulary. GPT-2 used a byte-level BPE variant: it begins from representations for bytes, then learns merges over that base, which lets any UTF-8 input remain representable without an unknown character token.[2]
This lab doesn't train merges; it demonstrates the safety net beneath byte-level BPE. A return message containing Japanese characters and an emoji is still reversible through raw UTF-8 byte values.
1message = "返品📦"
2byte_ids = list(message.encode("utf-8"))
3reconstructed = bytes(byte_ids).decode("utf-8")
4
5print("byte count:", len(byte_ids))
6print("first byte ids:", byte_ids[:8])
7print("round trip:", reconstructed)
8assert reconstructed == message
9assert all(0 <= value <= 255 for value in byte_ids)1byte count: 10
2first byte ids: [232, 191, 148, 229, 147, 129, 240, 159]
3round trip: 返品📦Byte coverage prevents an unrepresentable character. It doesn't promise compact tokenization: if the training data rarely covers a script or emoji sequence, several bytes may remain separate pieces.
WordPiece appeared in Google's Japanese and Korean voice-search work and later became familiar through BERT's tokenizer.[3][4] Like BPE, it creates reusable pieces. Unlike simple frequency-based BPE, the original WordPiece description selects vocabulary additions to improve a language-model likelihood objective.
Exact training recipes aren't fully specified by the short original paper, and library trainers can differ. A useful classroom proxy is an association score:
Here, count(ab) is how often two neighboring pieces occur together; the denominator penalizes pieces that occur frequently in many other contexts. This formula teaches why a less frequent but tightly associated pair could be attractive. Treat it as intuition for WordPiece's likelihood motivation, not as the original implementation specification.
| Candidate pair | Pair count | Individual counts | Proxy score | Lesson |
|---|---|---|---|---|
re + fund | 42 | 50 and 44 | 0.0191 | Often occurs together |
the + order | 90 | 900 and 300 | 0.0003 | Frequent pieces aren't necessarily exclusive |
At encoding time, BERT-style WordPiece uses continuation pieces such as ##ing and a greedy longest-match lookup. A piece beginning with ## continues the current word rather than beginning a new word.
The next lab implements that lookup for support vocabulary. It always tries the longest valid piece from the current cursor position.
1def wordpiece_tokenize(word: str, vocabulary: set[str]) -> list[str]:
2 result: list[str] = []
3 start = 0
4 while start < len(word):
5 chosen = None
6 for end in range(len(word), start, -1):
7 candidate = word[start:end]
8 if start > 0:
9 candidate = "##" + candidate
10 if candidate in vocabulary:
11 chosen = candidate
12 start = end
13 break
14 if chosen is None:
15 return ["[UNK]"]
16 result.append(chosen)
17 return result
18
19vocabulary = {"refund", "ship", "##ping", "delay", "##ed"}
20for term in ["refund", "shipping", "delayed"]:
21 print(term, "->", wordpiece_tokenize(term, vocabulary))1refund -> ['refund']
2shipping -> ['ship', '##ping']
3delayed -> ['delay', '##ed']Standard WordPiece can't necessarily spell a word from arbitrary bytes. If no valid segmentation reaches the end of a word, BERT-style tokenization emits [UNK] for that word. That loses distinctions between two different unseen strings.
This failure test uses the same algorithm with a missing ##bot continuation. The fix isn't to silently map new strings to a known ID. You must use the model's tokenizer contract, or explicitly change the vocabulary and corresponding model parameters as a training decision.
1def encode_word(word: str, vocabulary: set[str]) -> list[str]:
2 pieces: list[str] = []
3 cursor = 0
4 while cursor < len(word):
5 match = None
6 for end in range(len(word), cursor, -1):
7 candidate = word[cursor:end]
8 if cursor:
9 candidate = "##" + candidate
10 if candidate in vocabulary:
11 match = candidate
12 cursor = end
13 break
14 if match is None:
15 return ["[UNK]"]
16 pieces.append(match)
17 return pieces
18
19vocabulary = {"refund", "ship", "##ping"}
20known = encode_word("shipping", vocabulary)
21missing = encode_word("refundbot", vocabulary)
22
23print("known term:", known)
24print("missing continuation:", missing)
25assert known == ["ship", "##ping"]
26assert missing == ["[UNK]"]1known term: ['ship', '##ping']
2missing continuation: ['[UNK]']Many subword pipelines first split a sentence into word-like units, then segment inside each unit. That assumption is awkward for text where spaces don't mark each word. SentencePiece is a tokenizer and detokenizer framework that trains directly from raw sentences instead of requiring pre-tokenized word sequences.[5]
SentencePiece commonly makes spaces visible as ▁, so Orders ship late becomes a stream like ▁Orders▁ship▁late before final pieces are chosen. Decoding targets the normalized input string, not necessarily the original raw byte sequence, because normalization can fold equivalent or compatibility forms.
SentencePiece supports BPE, and it also supports the Unigram language model algorithm proposed with subword regularization. Unigram starts with many candidate pieces, assigns probabilities, removes pieces that contribute least to corpus likelihood, and can sample multiple valid segmentations during model training.[6]
That sampling option matters because a word such as refundable can have several legal piece boundaries. Training on more than one segmentation can make the downstream model less dependent on a single boundary choice. For deterministic serving, the tokenizer can still select its highest-probability segmentation.
The official library makes the artifact visible. This lab trains a small Unigram SentencePiece model on raw support messages, encodes an unseen request, and proves that its decoded form matches the tokenizer's normalized text.
1from pathlib import Path
2from tempfile import TemporaryDirectory
3
4import sentencepiece as spm
5
6spm.set_min_log_level(2)
7
8with TemporaryDirectory() as tmp:
9 corpus = Path(tmp) / "tickets.txt"
10 corpus.write_text(
11 "refund request pending\n"
12 "refund approved today\n"
13 "shipping label created\n"
14 "order tracking delayed\n"
15 "return label requested\n",
16 encoding="utf-8",
17 )
18 prefix = str(Path(tmp) / "support_tokenizer")
19 spm.SentencePieceTrainer.train(
20 input=str(corpus),
21 model_prefix=prefix,
22 model_type="unigram",
23 vocab_size=40,
24 character_coverage=1.0,
25 hard_vocab_limit=False,
26 bos_id=-1,
27 eos_id=-1,
28 pad_id=-1,
29 )
30 tokenizer = spm.SentencePieceProcessor(model_file=prefix + ".model")
31 message = "refund label delayed"
32 pieces = tokenizer.encode(message, out_type=str)
33 decoded = tokenizer.decode(pieces)
34
35print("pieces:", pieces)
36print("decoded:", decoded)
37assert decoded == message
38assert any(piece.startswith("▁") for piece in pieces)1pieces: ['▁re', 'f', 'u', 'nd', '▁', 'la', 'b', 'el', '▁', 'd', 'el', 'ay', 'ed']
2decoded: refund label delayedEvery added vocabulary entry can compress a recurring string into fewer tokens. It also adds an embedding row. If the output projection isn't tied to the input embedding , it adds another row there too.
For a vocabulary of size and hidden dimension , an input embedding matrix contains parameters. With float16 weights, each parameter takes two bytes. A second untied output matrix doubles that vocabulary-dependent memory.
Use numbers before making a design argument. This lab compares hypothetical tokenizer vocabularies for a model with hidden dimension 4096; it calculates embedding memory rather than assuming a larger vocabulary is free.
1def embedding_memory_mib(
2 vocabulary_size: int, hidden_size: int, bytes_per_weight: int = 2
3) -> float:
4 return vocabulary_size * hidden_size * bytes_per_weight / (1024**2)
5
6hidden_size = 4096
7for vocabulary_size in [8_000, 32_000, 128_000]:
8 input_mib = embedding_memory_mib(vocabulary_size, hidden_size)
9 untied_mib = 2 * input_mib
10 print(
11 f"{vocabulary_size:>6,} tokens:",
12 f"input={input_mib:>7.1f} MiB",
13 f"input+untied-output={untied_mib:>7.1f} MiB",
14 )18,000 tokens: input= 62.5 MiB input+untied-output= 125.0 MiB
232,000 tokens: input= 250.0 MiB input+untied-output= 500.0 MiB
3128,000 tokens: input= 1000.0 MiB input+untied-output= 2000.0 MiBSequence compression must be measured too. A vocabulary can shorten common English refund requests and still fragment another script or your TypeScript repository badly. This is why tokenizer design is an evaluation problem, not a race to the largest V.
Fertility is a token-length measure: how many tokenizer pieces a unit of text needs. For a product with international customers, compare parallel messages across target languages instead of using English traffic alone. Petrov et al. measured translated text and found tokenizer-length disparities as large as 15 times for some language and tokenizer combinations, with implications for cost, latency, and available context.[7]
Code belongs in the same audit. Identifiers, indentation, and operators are all model input. A tokenizer trained with limited code coverage may split familiar repository patterns into many pieces, leaving less room for files, tests, and error logs.
The next lab uses one named tokenizer artifact to measure several fixtures. The sample isn't a language-quality study; its job is to give you a repeatable audit shape. Replace the fixture messages with reviewed parallel translations and repository files for a real decision.
1import tiktoken
2
3encoding = tiktoken.get_encoding("cl100k_base")
4fixtures = {
5 "english": "Where is my refund for order 48291?",
6 "portuguese": "Onde esta meu reembolso do pedido 48291?",
7 "japanese": "注文48291の返金はどこですか?",
8 "typescript": "const refundEligibilityByOrderId = await getRefund(orderId);",
9}
10
11for name, text in fixtures.items():
12 ids = encoding.encode(text)
13 print(f"{name:>10}: {len(ids):>2} tokens")
14 assert encoding.decode(ids) == text1english: 10 tokens
2portuguese: 14 tokens
3 japanese: 14 tokens
4typescript: 15 tokensDon't read a single sample as a ranking of languages. Build a locale-aware test set, record tokenizer version, compare distribution summaries, and then check downstream task quality.
Two strings can look identical while holding different Unicode code points. For example, café may contain one composed é or the sequence e plus a combining accent. Without a stable policy, cache keys, token counts, and filter behavior can disagree across clients.
Normalization Form C (NFC) composes canonically equivalent forms without folding broad compatibility distinctions. Normalization Form Compatibility Composition (NFKC) also folds compatibility characters, such as the fi ligature into fi. Python exposes both through unicodedata.normalize; choosing between them is a product policy decision, not an automatic cleanup rule.
This final lab proves canonical equivalence and highlights the compatibility choice. For user-visible support messages, you might choose NFC first and add separate security checks for invisible or confusable characters. Another product may deliberately choose NFKC after deciding the information loss is acceptable.
1import unicodedata
2
3composed = "café"
4decomposed = "cafe\u0301"
5ligature = "file"
6
7print("raw cafe equal:", composed == decomposed)
8print("NFC cafe equal:", unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))
9print("NFC ligature:", unicodedata.normalize("NFC", ligature))
10print("NFKC ligature:", unicodedata.normalize("NFKC", ligature))
11
12assert composed != decomposed
13assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
14assert unicodedata.normalize("NFC", ligature) != "file"
15assert unicodedata.normalize("NFKC", ligature) == "file"1raw cafe equal: False
2NFC cafe equal: True
3NFC ligature: file
4NFKC ligature: fileTokenizer behavior must be versioned with this policy. If one service normalizes with NFC and another silently folds with NFKC, they can send different IDs to the same model or generate different cache keys for a ticket that looks unchanged.
You can now read the common tokenizer families as design choices rather than names to memorize.
| Method | Training view | Serving view | Boundary/fallback detail |
|---|---|---|---|
| BPE | Add frequent adjacent merges | Replay ordered merges | Byte-level variants retain UTF-8 coverage |
| WordPiece | Grow vocabulary for likelihood objective | Greedy longest valid piece | BERT-style continuation uses ##; missing segmentation can yield [UNK] |
| SentencePiece BPE | BPE trained directly on raw normalized text | Replay packaged model | Visible whitespace marker can be part of pieces |
| SentencePiece Unigram | Estimate and prune candidate-piece probabilities | Best path or sampled alternatives when requested | Segmentation sampling supports regularized training |
| Symptom | Likely cause | Check or fix |
|---|---|---|
| Model output collapses after swapping tokenizer file | IDs no longer match trained embeddings | Pin tokenizer artifact and model checkpoint together |
| Locale hits token limit sooner than English | Unequal fertility on translated requests | Measure parallel message sets by locale and task |
A WordPiece model emits [UNK] for product names | No valid vocabulary segmentation | Evaluate vocabulary/model update rather than masking the failure |
| Cache misses differ across clients for same visible message | Unicode preprocessing isn't consistent | Version and test normalization plus tokenizer pipeline |
| Repository prompt holds less code than expected | Code fixtures fragment into many tokens | Measure actual files with intended deployed tokenizer |
You started with a refund sentence and finished with a tokenizer review harness. You can now:
Neural Machine Translation of Rare Words with Subword Units.
Sennrich, R., Haddow, B., & Birch, A. · 2016 · ACL 2016
Language Models are Unsupervised Multitask Learners.
Radford, A., et al. · 2019
Japanese and Korean Voice Search.
Schuster, M. & Nakajima, K. · 2012
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Devlin, J., et al. · 2019 · NAACL 2019
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.
Kudo, T. & Richardson, J. · 2018 · EMNLP 2018
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
Kudo, T. · 2018 · ACL 2018
Language Model Tokenizers Introduce Unfairness Between Languages.
Petrov, A., La Malfa, E., Torr, P. H. S., & Bibi, A. · 2023 · NeurIPS 2023