BPE, WordPiece, and SentencePiece

Build a small subword tokenizer, compare BPE, WordPiece, and SentencePiece, then audit token cost and Unicode behavior.

A customer types, "My refund for order 48291 still isn't here." Your support model can't read that sentence directly. It receives integer IDs, and a tokenizer decides which pieces of the message get those IDs.

In the previous lesson, you saw why learning systems need representations that can improve with data and compute. Tokenization is the first such representation for text: a fixed contract that turns new wording, languages, and symbols into model input. You'll compare Byte Pair Encoding (BPE), WordPiece, and SentencePiece as you build that contract, break it, and learn what to measure before shipping it.

Figure: Refund-request tokenization pipeline from raw text to reusable pieces to tokenizer-specific token IDs. A tokenizer preserves common pieces such as refund, splits rarer wording into reusable fragments, and maps every resulting piece to an ID owned by one vocabulary.

Figure: Pipeline stages: raw support message, normalize and segment, map pieces to IDs, and look up embeddings.

An embedding is a vector associated with a token ID. The next lesson teaches those vectors. For now, keep one rule in mind: ID 421 has no universal meaning. It only means something when paired with the exact tokenizer artifact that produced it.
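To make that rule concrete, here is a minimal sketch with two hypothetical vocabularies. The names `vocab_a` and `vocab_b` and their contents are invented for illustration and don't come from any real tokenizer. The same pieces receive different IDs, and decoding one vocabulary's IDs with the other produces different text.

```python
# Two hypothetical vocabularies: same pieces, different ID assignment.
pieces = ["refund", "order", "delay", "ed"]
vocab_a = {piece: index for index, piece in enumerate(pieces)}
vocab_b = {piece: index for index, piece in enumerate(reversed(pieces))}

# Inverse table for decoding IDs with vocabulary B.
decode_b = {index: piece for piece, index in vocab_b.items()}

message = ["refund", "delay", "ed"]
ids_a = [vocab_a[piece] for piece in message]
ids_b = [vocab_b[piece] for piece in message]
crossed = [decode_b[token_id] for token_id in ids_a]

print("ids under A:", ids_a)
print("ids under B:", ids_b)
print("A's ids decoded with B:", crossed)
```

The crossed decode runs without an error, which is why a mismatched tokenizer artifact tends to show up as degraded output, not as a crash.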

Choose pieces between characters and words

A word-level tokenizer could keep refund as one item, but it needs a policy for every misspelling, product name, and new tracking code. A character-level tokenizer never runs out of pieces, but it turns short messages into long sequences. Subword tokenization keeps frequent fragments whole while retaining small fallback pieces for rare text.

Consider this ticket:

ticket.txt

    refund delayed for order48291
| Token unit | Example pieces | What it buys you | What it costs |
| --- | --- | --- | --- |
| Character | r e f u n d ... | Any spelling is representable | Long input sequence |
| Word | refund, delayed, for, order48291 | Short common messages | Rare words and IDs need fallbacks |
| Subword | refund, delay, ed, for, order, 48291 | Compact common patterns with fallback parts | Requires learned vocabulary |
Figure: Token granularity tradeoff comparing character, subword, and word tokenization by vocabulary size, sequence length, unknown-token risk, and practical model fit. Subwords sit between two extremes: characters provide maximum coverage but many positions, while whole words keep sequences short only when vocabulary coverage is strong.

The first lab makes that tradeoff concrete. The subword split is written by hand because you haven't trained a tokenizer yet. It gives you a budget to compare against character and whitespace-word baselines.

01-count-token-units.py

    message = "refund delayed for order48291"

    # Three candidate token units for the same ticket.
    characters = list(message.replace(" ", ""))
    words = message.split()
    subwords = ["refund", "delay", "ed", "for", "order", "48291"]

    print("characters:", len(characters))
    print("words:", len(words))
    print("candidate subwords:", len(subwords), subwords)
    # The hand-written split must cover the message exactly.
    assert "".join(subwords) == message.replace(" ", "")

Output

    characters: 26
    words: 4
    candidate subwords: 6 ['refund', 'delay', 'ed', 'for', 'order', '48291']

The word count looks smallest, but it hides the hard question: what happens when order48291 was never in the vocabulary? Subwords answer that question by learning common fragments and retaining smaller pieces for the rest.

Train byte pair encoding from counts

Byte Pair Encoding (BPE) builds a vocabulary from repeated adjacent pieces. Sennrich, Haddow, and Birch applied BPE subword units to open-vocabulary neural machine translation: start from small symbols, merge frequent adjacent pairs, and reuse the resulting pieces for rare words.[1]

For teaching, start with characters inside pre-separated ticket terms. A production tokenizer has extra decisions about whitespace, bytes, and normalization, but the merge loop is the important first mechanism.

Figure: BPE training loop showing adjacent-pair counting, winner selection, and repeated merges that turn shipping into reusable pieces like sh, shi, and ship. Read the table from top to bottom: after s + h becomes sh, pair counts are recomputed, so later merges can build ship as one common routing fragment.

Suppose a delivery-support corpus contains these term counts:

| Term | Count |
| --- | --- |
| ship | 3 |
| shipping | 2 |
| shop | 2 |
| refund | 2 |
| tracking | 1 |

At the start, s + h appears seven times: three in ship, two in shipping, and two in shop. Merge it into sh. On the updated corpus, sh + i appears five times and can become shi. The pair i + p ties at five; the trainer below keeps the pair it counted first. Then shi + p becomes ship.

This miniature trainer stores words as tuples of current pieces, counts adjacent pairs, merges the winner everywhere, and prints the first three learned rules.

02-train-mini-bpe.py

    from collections import Counter

    frequencies = {
        "ship": 3,
        "shipping": 2,
        "shop": 2,
        "refund": 2,
        "tracking": 1,
    }
    # Each word starts as a tuple of single characters.
    state = {tuple(word): count for word, count in frequencies.items()}

    def count_pairs(words: dict[tuple[str, ...], int]) -> Counter[tuple[str, str]]:
        # Weight every adjacent pair by the frequency of the word it occurs in.
        pairs: Counter[tuple[str, str]] = Counter()
        for pieces, count in words.items():
            for pair in zip(pieces, pieces[1:]):
                pairs[pair] += count
        return pairs

    def merge_pair(
        pieces: tuple[str, ...], pair: tuple[str, str]
    ) -> tuple[str, ...]:
        # Scan left to right and fuse each occurrence of the winning pair.
        merged: list[str] = []
        i = 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i : i + 2] == pair:
                merged.append("".join(pair))
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        return tuple(merged)

    merges: list[tuple[str, str]] = []
    for step in range(3):
        pair, count = count_pairs(state).most_common(1)[0]
        state = {merge_pair(pieces, pair): freq for pieces, freq in state.items()}
        merges.append(pair)
        print(step + 1, pair, "->", "".join(pair), "count", count)

    print("learned merges:", merges)

Output

    1 ('s', 'h') -> sh count 7
    2 ('sh', 'i') -> shi count 5
    3 ('shi', 'p') -> ship count 5
    learned merges: [('s', 'h'), ('sh', 'i'), ('shi', 'p')]

The result isn't a linguistic analysis. BPE doesn't know that ship relates to logistics. It knows only that a boundary occurs often enough to compress.

Replay merges on a new term

Training chooses an ordered merge list once. Encoding a new input doesn't recount a new corpus; it replays those learned rules. That distinction matters in production because the tokenizer must stay fixed with the model.

The next lab takes the three rules learned above and applies them to new terms. shipper benefits from the common stem, while shopper gets only the sh merge because the corpus never earned shop as one piece in the first three steps.

03-replay-bpe-merges.py

    def apply_rule(pieces: list[str], pair: tuple[str, str]) -> list[str]:
        # Fuse every occurrence of one learned pair, left to right.
        result: list[str] = []
        i = 0
        while i < len(pieces):
            if i + 1 < len(pieces) and tuple(pieces[i : i + 2]) == pair:
                result.append("".join(pair))
                i += 2
            else:
                result.append(pieces[i])
                i += 1
        return result

    # Ordered merge list learned in the previous lab.
    rules = [("s", "h"), ("sh", "i"), ("shi", "p")]

    for term in ["shipping", "shipper", "shopper"]:
        pieces = list(term)
        for rule in rules:
            pieces = apply_rule(pieces, rule)
        print(term, "->", pieces)

Output

    shipping -> ['ship', 'p', 'i', 'n', 'g']
    shipper -> ['ship', 'p', 'e', 'r']
    shopper -> ['sh', 'o', 'p', 'p', 'e', 'r']
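Replay can also be written in terms of merge ranks instead of looping over every rule. The sketch below assumes that formulation: it repeatedly finds the adjacent pair with the lowest rank (the earliest learned rule) and merges it until no learned pair remains. The helper names `merge_all` and `encode_by_rank` are illustrative, not part of any library.

```python
# Ordered merge list from the mini trainer; rank 0 is the earliest rule.
rules = [("s", "h"), ("sh", "i"), ("shi", "p")]
ranks = {pair: rank for rank, pair in enumerate(rules)}

def merge_all(pieces: list[str], pair: tuple[str, str]) -> list[str]:
    # Fuse every left-to-right occurrence of `pair`.
    merged: list[str] = []
    i = 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            merged.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            merged.append(pieces[i])
            i += 1
    return merged

def encode_by_rank(term: str, ranks: dict[tuple[str, str], int]) -> list[str]:
    pieces = list(term)
    while True:
        # Adjacent pairs that some learned rule can merge.
        present = [pair for pair in zip(pieces, pieces[1:]) if pair in ranks]
        if not present:
            return pieces
        # The earliest learned rule wins.
        best = min(present, key=ranks.__getitem__)
        pieces = merge_all(pieces, best)

for term in ["shipper", "shopper"]:
    print(term, "->", encode_by_rank(term, ranks))
```

On these two terms the result matches the sequential replay above. The point of the rank table is that merge order is part of the tokenizer artifact: reorder the rules and the same text can segment differently.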

Fall back to bytes when characters are new

Character-starting BPE still needs an unknown-token policy for characters absent from its base vocabulary. GPT-2 used a byte-level BPE variant: it begins from representations for bytes, then learns merges over that base, which lets any UTF-8 input remain representable without an unknown character token.[2]

This lab doesn't train merges; it demonstrates the safety net beneath byte-level BPE. A return message containing Japanese characters and an emoji is still reversible through raw UTF-8 byte values.

04-byte-fallback-round-trip.py

    message = "返品📦"
    # Every UTF-8 string reduces to byte values in the range 0..255.
    byte_ids = list(message.encode("utf-8"))
    reconstructed = bytes(byte_ids).decode("utf-8")

    print("byte count:", len(byte_ids))
    print("first byte ids:", byte_ids[:8])
    print("round trip:", reconstructed)
    assert reconstructed == message
    assert all(0 <= value <= 255 for value in byte_ids)

Output

    byte count: 10
    first byte ids: [232, 191, 148, 229, 147, 129, 240, 159]
    round trip: 返品📦

Byte coverage prevents an unrepresentable character. It doesn't promise compact tokenization: if the training data rarely covers a script or emoji sequence, several bytes may remain separate pieces.
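A quick way to see the compactness gap is to count UTF-8 bytes per character. This sketch assumes the worst case for a byte-level tokenizer, where no learned merge covers the character and each byte stays a separate piece; the sample characters are chosen only for illustration.

```python
# One sample character per UTF-8 width; escapes keep the code points explicit.
samples = {
    "latin": "a",
    "accented": "\u00e9",   # é as a single composed code point
    "cjk": "\u8fd4",        # 返
    "emoji": "\U0001f4e6",  # 📦
}

# Worst case under the assumption above: one token per byte.
byte_lengths = {name: len(char.encode("utf-8")) for name, char in samples.items()}

for name, length in byte_lengths.items():
    print(f"{name:>8}: {length} byte(s)")
```

Learned merges can shrink these counts for text that is common in the training data, which is why coverage and cost have to be measured separately.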

WordPiece chooses vocabulary differently

WordPiece appeared in Google's Japanese and Korean voice-search work and later became familiar through BERT's tokenizer.[3][4] Like BPE, it creates reusable pieces. Unlike simple frequency-based BPE, the original WordPiece description selects vocabulary additions to improve a language-model likelihood objective.

Exact training recipes aren't fully specified by the short original paper, and library trainers can differ. A useful classroom proxy is an association score:

$$\operatorname{score}(a,b) = \frac{\operatorname{count}(ab)}{\operatorname{count}(a)\,\operatorname{count}(b)}$$

Here, count(ab) is how often two neighboring pieces occur together; the denominator penalizes pieces that occur frequently in many other contexts. This formula teaches why a less frequent but tightly associated pair could be attractive. Treat it as intuition for WordPiece's likelihood motivation, not as the original implementation specification.

| Candidate pair | Pair count | Individual counts | Proxy score | Lesson |
| --- | --- | --- | --- | --- |
| re + fund | 42 | 50 and 44 | 0.0191 | Often occurs together |
| the + order | 90 | 900 and 300 | 0.0003 | Frequent pieces aren't necessarily exclusive |
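The table values can be reproduced with a few lines. This is a sketch of the classroom proxy only, using the counts from the table; `proxy_score` is an illustrative name, and the function is not the original WordPiece training procedure.

```python
def proxy_score(pair_count: int, left_count: int, right_count: int) -> float:
    # count(ab) / (count(a) * count(b)), the association proxy from the text.
    return pair_count / (left_count * right_count)

# (pair count, left piece count, right piece count) taken from the table.
candidates = {
    ("re", "fund"): (42, 50, 44),
    ("the", "order"): (90, 900, 300),
}

scores = {pair: proxy_score(*counts) for pair, counts in candidates.items()}
for (left, right), score in scores.items():
    print(f"{left} + {right}: {score:.4f}")
```

The pair the + order occurs more than twice as often as re + fund, yet it scores far lower because both of its pieces appear in many other contexts.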

At encoding time, BERT-style WordPiece uses continuation pieces such as ##ing and a greedy longest-match lookup. A piece beginning with ## continues the current word rather than beginning a new word.

The next lab implements that lookup for support vocabulary. It always tries the longest valid piece from the current cursor position.

05-wordpiece-longest-match.py

    def wordpiece_tokenize(word: str, vocabulary: set[str]) -> list[str]:
        result: list[str] = []
        start = 0
        while start < len(word):
            chosen = None
            # Try the longest candidate first, then shrink from the right.
            for end in range(len(word), start, -1):
                candidate = word[start:end]
                if start > 0:
                    # Pieces after the first carry the continuation prefix.
                    candidate = "##" + candidate
                if candidate in vocabulary:
                    chosen = candidate
                    start = end
                    break
            if chosen is None:
                return ["[UNK]"]
            result.append(chosen)
        return result

    vocabulary = {"refund", "ship", "##ping", "delay", "##ed"}
    for term in ["refund", "shipping", "delayed"]:
        print(term, "->", wordpiece_tokenize(term, vocabulary))

Output

    refund -> ['refund']
    shipping -> ['ship', '##ping']
    delayed -> ['delay', '##ed']

Expose the unknown-token failure

Standard WordPiece can't necessarily spell a word from arbitrary bytes. If no valid segmentation reaches the end of a word, BERT-style tokenization emits [UNK] for that word. That loses distinctions between two different unseen strings.

This failure test uses the same algorithm with a missing ##bot continuation. The fix isn't to silently map new strings to a known ID. You must use the model's tokenizer contract, or explicitly change the vocabulary and corresponding model parameters as a training decision.

06-wordpiece-unknown-token.py

    def encode_word(word: str, vocabulary: set[str]) -> list[str]:
        pieces: list[str] = []
        cursor = 0
        while cursor < len(word):
            match = None
            for end in range(len(word), cursor, -1):
                candidate = word[cursor:end]
                if cursor:
                    candidate = "##" + candidate
                if candidate in vocabulary:
                    match = candidate
                    cursor = end
                    break
            if match is None:
                # No piece fits at this position, so the whole word is unknown.
                return ["[UNK]"]
            pieces.append(match)
        return pieces

    vocabulary = {"refund", "ship", "##ping"}
    known = encode_word("shipping", vocabulary)
    missing = encode_word("refundbot", vocabulary)

    print("known term:", known)
    print("missing continuation:", missing)
    assert known == ["ship", "##ping"]
    assert missing == ["[UNK]"]

Output

    known term: ['ship', '##ping']
    missing continuation: ['[UNK]']
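The lost distinction is easy to reproduce. The sketch below reuses the same longest-match logic in compact form and encodes two different unseen strings; both collapse to the same output. The `extended` vocabulary at the end is a hypothetical change, shown only to illustrate what an explicit vocabulary decision looks like. In a real model it would also need matching embedding rows.

```python
def encode_word(word: str, vocabulary: set[str]) -> list[str]:
    pieces: list[str] = []
    cursor = 0
    while cursor < len(word):
        for end in range(len(word), cursor, -1):
            candidate = ("##" if cursor else "") + word[cursor:end]
            if candidate in vocabulary:
                pieces.append(candidate)
                cursor = end
                break
        else:
            # No candidate matched at this cursor position.
            return ["[UNK]"]
    return pieces

vocabulary = {"refund", "ship", "##ping"}
first = encode_word("refundbot", vocabulary)
second = encode_word("refundapp", vocabulary)
print("same output for different strings:", first == second)

# Hypothetical vocabulary change: add single-letter continuation pieces.
extended = vocabulary | {"##" + letter for letter in "abcdefghijklmnopqrstuvwxyz"}
print(encode_word("refundbot", extended))
print(encode_word("refundapp", extended))
```

With the continuation letters present, the two strings stay distinguishable, at the cost of longer sequences and a vocabulary that no longer matches the original checkpoint.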

SentencePiece treats boundaries as part of the artifact

Many subword pipelines first split a sentence into word-like units, then segment inside each unit. That assumption is awkward for text where spaces don't mark each word. SentencePiece is a tokenizer and detokenizer framework that trains directly from raw sentences instead of requiring pre-tokenized word sequences.[5]

SentencePiece commonly makes spaces visible as ▁, so Orders ship late becomes a stream like ▁Orders▁ship▁late before final pieces are chosen. Decoding targets the normalized input string, not necessarily the original raw byte sequence, because normalization can fold equivalent or compatibility forms.
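The marker convention can be imitated in plain Python. This sketch is a simplification, not the library's implementation: it assumes a marker is prefixed to the text and that every space becomes the marker, and the piece list is written by hand. `mark_spaces` and `unmark` are illustrative names.

```python
MARKER = "\u2581"  # the visible whitespace marker "▁"

def mark_spaces(text: str) -> str:
    # Simplifying assumption: prefix one marker, then replace each space.
    return MARKER + text.replace(" ", MARKER)

def unmark(pieces: list[str]) -> str:
    # Join pieces, turn markers back into spaces, drop the leading space.
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text

marked = mark_spaces("Orders ship late")
print(marked)

# Hand-written pieces; a trained model would choose its own boundaries.
pieces = [MARKER + "Orders", MARKER + "ship", MARKER + "late"]
print(unmark(pieces))
```

Because the space travels inside the pieces, detokenization needs no language-specific rule about where spaces belong.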

Figure: SentencePiece comparison showing raw-text normalization with explicit whitespace markers, BPE bottom-up vocabulary growth, and Unigram top-down vocabulary pruning. SentencePiece places normalization and visible whitespace handling inside one tokenizer artifact. Its BPE mode adds merges; its Unigram mode begins with candidates and prunes them.

SentencePiece supports BPE, and it also supports the Unigram language model algorithm proposed with subword regularization. Unigram starts with many candidate pieces, assigns probabilities, removes pieces that contribute least to corpus likelihood, and can sample multiple valid segmentations during model training.[6]

That sampling option matters because a word such as refundable can have several legal piece boundaries. Training on more than one segmentation can make the downstream model less dependent on a single boundary choice. For deterministic serving, the tokenizer can still select its highest-probability segmentation.
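A toy version shows both behaviors. The piece probabilities below are hypothetical, and the brute-force enumeration only works for tiny inputs; real Unigram implementations use dynamic programming and learn the probabilities from a corpus. The sketch lists every legal segmentation of refundable, picks the highest-probability one, and samples one in proportion to its score.

```python
import math
import random

# Hypothetical piece probabilities, invented for this example.
piece_prob = {
    "refund": 0.30, "able": 0.20, "re": 0.15,
    "fund": 0.15, "ab": 0.05, "le": 0.05,
}

def segmentations(text: str) -> list[list[str]]:
    # Enumerate every way to split `text` into known pieces.
    if not text:
        return [[]]
    results: list[list[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in piece_prob:
            results.extend([head] + tail for tail in segmentations(text[end:]))
    return results

def score(pieces: list[str]) -> float:
    # Unigram assumption: pieces are independent, so probabilities multiply.
    return math.prod(piece_prob[piece] for piece in pieces)

candidates = segmentations("refundable")
best = max(candidates, key=score)

# Seeded sampling for training-time variety; serving would use `best`.
rng = random.Random(0)
weights = [score(candidate) for candidate in candidates]
sampled = rng.choices(candidates, weights=weights, k=1)[0]

print("candidates:", len(candidates))
print("best:", best)
print("sampled:", sampled)
```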

The official library makes the artifact visible. This lab trains a small Unigram SentencePiece model on raw support messages, encodes an unseen request, and checks that its decoded form matches the tokenizer's normalized text.

07-train-sentencepiece-unigram.py

    from pathlib import Path
    from tempfile import TemporaryDirectory

    import sentencepiece as spm

    spm.set_min_log_level(2)

    with TemporaryDirectory() as tmp:
        # Raw sentences, one per line, with no pre-tokenization.
        corpus = Path(tmp) / "tickets.txt"
        corpus.write_text(
            "refund request pending\n"
            "refund approved today\n"
            "shipping label created\n"
            "order tracking delayed\n"
            "return label requested\n",
            encoding="utf-8",
        )
        prefix = str(Path(tmp) / "support_tokenizer")
        spm.SentencePieceTrainer.train(
            input=str(corpus),
            model_prefix=prefix,
            model_type="unigram",
            vocab_size=40,
            character_coverage=1.0,
            hard_vocab_limit=False,
            bos_id=-1,
            eos_id=-1,
            pad_id=-1,
        )
        # Load the trained artifact before the temporary directory is removed.
        tokenizer = spm.SentencePieceProcessor(model_file=prefix + ".model")
        message = "refund label delayed"
        pieces = tokenizer.encode(message, out_type=str)
        decoded = tokenizer.decode(pieces)

    print("pieces:", pieces)
    print("decoded:", decoded)
    assert decoded == message
    assert any(piece.startswith("▁") for piece in pieces)

Output

    pieces: ['▁re', 'f', 'u', 'nd', '▁', 'la', 'b', 'el', '▁', 'd', 'el', 'ay', 'ed']
    decoded: refund label delayed

Vocabulary size spends parameters to save positions

Every added vocabulary entry can compress a recurring string into fewer tokens. It also adds an embedding row. If the output projection isn't tied to the input embedding, it adds another row there too.

For a vocabulary of size V and hidden dimension d, an input embedding matrix contains V × d parameters. With float16 weights, each parameter takes two bytes. A second untied output matrix doubles that vocabulary-dependent memory.

Figure: Vocabulary size tradeoff chart showing larger tokenizer vocabularies shortening sequences while increasing embedding memory and rare-token sparsity risk. Extra vocabulary rows are worthwhile only when measured compression or model quality earns their cost; shortening one workload doesn't prove a better tokenizer for every locale.

Use numbers before making a design argument. This lab compares hypothetical tokenizer vocabularies for a model with hidden dimension 4096; it calculates embedding memory rather than assuming a larger vocabulary is free.

08-vocabulary-memory-budget.py

    def embedding_memory_mib(
        vocabulary_size: int, hidden_size: int, bytes_per_weight: int = 2
    ) -> float:
        # One row per vocabulary entry, hidden_size weights per row.
        return vocabulary_size * hidden_size * bytes_per_weight / (1024**2)

    hidden_size = 4096
    for vocabulary_size in [8_000, 32_000, 128_000]:
        input_mib = embedding_memory_mib(vocabulary_size, hidden_size)
        # An untied output projection stores a second matrix of the same shape.
        untied_mib = 2 * input_mib
        print(
            f"{vocabulary_size:>6,} tokens:",
            f"input={input_mib:>7.1f} MiB",
            f"input+untied-output={untied_mib:>7.1f} MiB",
        )

Output

     8,000 tokens: input=   62.5 MiB input+untied-output=  125.0 MiB
    32,000 tokens: input=  250.0 MiB input+untied-output=  500.0 MiB
    128,000 tokens: input= 1000.0 MiB input+untied-output= 2000.0 MiB

Sequence compression must be measured too. A vocabulary can shorten common English refund requests and still fragment another script or your TypeScript repository badly. This is why tokenizer design is an evaluation problem, not a race to the largest V.

Audit language and code token budgets

Fertility is a token-length measure: how many tokenizer pieces a unit of text needs. For a product with international customers, compare parallel messages across target languages instead of using English traffic alone. Petrov et al. measured translated text and found tokenizer-length disparities as large as 15 times for some language and tokenizer combinations, with implications for cost, latency, and available context.[7]

Figure: Schematic multilingual token-length audit showing how equivalent messages can use different token budgets and affect cost, context, latency, and quality checks. This schematic is a prompt to audit, not a benchmark result: measure equivalent customer messages by locale, then examine cost, latency, context use, and quality together.

Code belongs in the same audit. Identifiers, indentation, and operators are all model input. A tokenizer trained with limited code coverage may split familiar repository patterns into many pieces, leaving less room for files, tests, and error logs.

Figure: Code tokenization audit showing an identifier, possible useful and fragmented splits, syntax-sensitive spans, and illustrative token-count comparison. A code audit should inspect the real tokenizer output for identifiers and syntax, then compare token totals on actual repository files instead of trusting illustrative splits.

The next lab uses one named tokenizer artifact to measure several fixtures. The sample isn't a language-quality study; its job is to give you a repeatable audit shape. Replace the fixture messages with reviewed parallel translations and repository files for a real decision.

09-audit-token-lengths.py

    import tiktoken

    # One named tokenizer artifact for every fixture.
    encoding = tiktoken.get_encoding("cl100k_base")
    fixtures = {
        "english": "Where is my refund for order 48291?",
        "portuguese": "Onde esta meu reembolso do pedido 48291?",
        "japanese": "注文48291の返金はどこですか?",
        "typescript": "const refundEligibilityByOrderId = await getRefund(orderId);",
    }

    for name, text in fixtures.items():
        ids = encoding.encode(text)
        print(f"{name:>10}: {len(ids):>2} tokens")
        # Encoding must be reversible for every fixture.
        assert encoding.decode(ids) == text

Output

       english: 10 tokens
    portuguese: 14 tokens
      japanese: 14 tokens
    typescript: 15 tokens

Don't read a single sample as a ranking of languages. Build a locale-aware test set, record tokenizer version, compare distribution summaries, and then check downstream task quality.
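Once you have token counts for a reviewed set of parallel messages, a distribution summary can look like the sketch below. The counts are hypothetical placeholders, not measurements; replace them with the output of your deployed tokenizer. The `summary` layout and the choice of median and maximum are assumptions about what a useful first report contains.

```python
from statistics import median

# Hypothetical token counts for four parallel messages per locale.
token_counts = {
    "english": [10, 12, 9, 14],
    "portuguese": [14, 15, 11, 18],
    "japanese": [14, 20, 13, 22],
}

# English is the baseline only because it is the reference locale here.
baseline = median(token_counts["english"])

summary = {
    locale: {
        "median": median(counts),
        "max": max(counts),
        "ratio_vs_english": median(counts) / baseline,
    }
    for locale, counts in token_counts.items()
}

for locale, stats in summary.items():
    print(
        f"{locale:>10}: median={stats['median']:.1f}",
        f"max={stats['max']}",
        f"ratio={stats['ratio_vs_english']:.2f}",
    )
```

Store the tokenizer name and version next to the summary so a later run can tell whether a change came from the text or from the artifact.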

Make Unicode policy explicit

Two strings can look identical while holding different Unicode code points. For example, café may contain one composed é or the sequence e plus a combining accent. Without a stable policy, cache keys, token counts, and filter behavior can disagree across clients.

Normalization Form C (NFC) composes canonically equivalent forms without folding broad compatibility distinctions. Normalization Form Compatibility Composition (NFKC) also folds compatibility characters, such as turning the single-character ﬁ ligature (U+FB01) into the two letters f and i. Python exposes both through unicodedata.normalize; choosing between them is a product policy decision, not an automatic cleanup rule.

Figure: Unicode normalization pipeline showing visually similar strings, code point differences, NFKC or NFC normalization, and stable tokenizer output. Canonical normalization can stabilize equivalent input forms, while compatibility folding changes some visible characters. Document that choice alongside tokenizer version and cache policy.

This final lab proves canonical equivalence and highlights the compatibility choice. For user-visible support messages, you might choose NFC first and add separate security checks for invisible or confusable characters. Another product may deliberately choose NFKC after deciding the information loss is acceptable.

10-normalize-before-tokenizing.py

    import unicodedata

    # Escapes keep the exact code points visible and safe from editor folding.
    composed = "caf\u00e9"     # é as one composed code point
    decomposed = "cafe\u0301"  # e followed by a combining acute accent
    ligature = "\ufb01le"      # starts with the single-character fi ligature

    print("raw cafe equal:", composed == decomposed)
    print("NFC cafe equal:", unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))
    print("NFC ligature:", unicodedata.normalize("NFC", ligature))
    print("NFKC ligature:", unicodedata.normalize("NFKC", ligature))

    assert composed != decomposed
    assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
    # NFC keeps the ligature; NFKC folds it into plain letters.
    assert unicodedata.normalize("NFC", ligature) != "file"
    assert unicodedata.normalize("NFKC", ligature) == "file"

Output

    raw cafe equal: False
    NFC cafe equal: True
    NFC ligature: ﬁle
    NFKC ligature: file

Tokenizer behavior must be versioned with this policy. If one service normalizes with NFC and another silently folds with NFKC, they can send different IDs to the same model or generate different cache keys for a ticket that looks unchanged.
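One way to keep services aligned is to put the policy into the cache key itself. The sketch below is an assumption about how such a key could be built; `cache_key` and the version string `support-tok-v1` are hypothetical names. It hashes the tokenizer version, the normalization form, and the normalized text together.

```python
import hashlib
import unicodedata

def cache_key(text: str, *, form: str, tokenizer_version: str) -> str:
    # Normalize first so equivalent inputs share one key under the same policy.
    normalized = unicodedata.normalize(form, text)
    payload = f"{tokenizer_version}|{form}|{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

composed = "caf\u00e9"     # single composed code point
decomposed = "cafe\u0301"  # base letter plus combining accent

key_composed = cache_key(composed, form="NFC", tokenizer_version="support-tok-v1")
key_decomposed = cache_key(decomposed, form="NFC", tokenizer_version="support-tok-v1")
key_other_policy = cache_key(composed, form="NFKC", tokenizer_version="support-tok-v1")

print("same policy, equivalent text:", key_composed == key_decomposed)
print("different policy:", key_composed == key_other_policy)
```

Including the form in the payload means two services with different policies can never silently share a cache entry, even when their normalized text happens to match.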

Compare algorithms without mixing contracts

You can now read the common tokenizer families as design choices rather than names to memorize.

Figure: Comparison of BPE, WordPiece, and Unigram tokenizers showing example segmentations, training direction, inference rule, unknown-token behavior, and production fit. BPE, WordPiece, and Unigram produce subwords using different training signals and boundary conventions. None of their token IDs can be exchanged between model checkpoints.

| Method | Training view | Serving view | Boundary/fallback detail |
| --- | --- | --- | --- |
| BPE | Add frequent adjacent merges | Replay ordered merges | Byte-level variants retain UTF-8 coverage |
| WordPiece | Grow vocabulary for likelihood objective | Greedy longest valid piece | BERT-style continuation uses ##; missing segmentation can yield [UNK] |
| SentencePiece BPE | BPE trained directly on raw normalized text | Replay packaged model | Visible whitespace marker can be part of pieces |
| SentencePiece Unigram | Estimate and prune candidate-piece probabilities | Best path or sampled alternatives when requested | Segmentation sampling supports regularized training |

Production review checklist

| Symptom | Likely cause | Check or fix |
| --- | --- | --- |
| Model output collapses after swapping tokenizer file | IDs no longer match trained embeddings | Pin tokenizer artifact and model checkpoint together |
| Locale hits token limit sooner than English | Unequal fertility on translated requests | Measure parallel message sets by locale and task |
| A WordPiece model emits [UNK] for product names | No valid vocabulary segmentation | Evaluate vocabulary/model update rather than masking the failure |
| Cache misses differ across clients for same visible message | Unicode preprocessing isn't consistent | Version and test normalization plus tokenizer pipeline |
| Repository prompt holds less code than expected | Code fixtures fragment into many tokens | Measure actual files with intended deployed tokenizer |
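The first row of the checklist can be enforced with a hash check at load time. This is a minimal sketch under stated assumptions: the manifest layout, the file names, and the `verify_pair` helper are all hypothetical, and the tokenizer file is a stand-in written to a temporary directory.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_pair(manifest_path: Path) -> bool:
    # The manifest records which tokenizer bytes the checkpoint was trained with.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    tokenizer_path = manifest_path.parent / manifest["tokenizer_file"]
    return sha256_of(tokenizer_path) == manifest["tokenizer_sha256"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    tokenizer_file = root / "tokenizer.model"
    tokenizer_file.write_bytes(b"stand-in tokenizer bytes")

    manifest_path = root / "manifest.json"
    manifest_path.write_text(
        json.dumps({
            "checkpoint": "support-model-v1",  # hypothetical checkpoint name
            "tokenizer_file": "tokenizer.model",
            "tokenizer_sha256": sha256_of(tokenizer_file),
        }),
        encoding="utf-8",
    )

    ok_before = verify_pair(manifest_path)
    # Simulate someone swapping the tokenizer file after release.
    tokenizer_file.write_bytes(b"swapped tokenizer bytes")
    ok_after = verify_pair(manifest_path)

print("before swap:", ok_before)
print("after swap:", ok_after)
```

Failing fast on a hash mismatch turns a silent quality regression into a deployment error that someone has to resolve.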

What you can now build

You started with a refund sentence and finished with a tokenizer review harness. You can now:

  1. Explain why subwords balance sequence length against vocabulary coverage.
  2. Train and replay a minimal BPE merge table.
  3. Explain why byte fallback preserves representability but doesn't guarantee compactness.
  4. Implement WordPiece longest-match encoding and reproduce its unknown-token failure.
  5. Train a raw-text SentencePiece Unigram tokenizer and distinguish the framework from its segmentation algorithms.
  6. Calculate vocabulary-dependent embedding memory.
  7. Audit token length across locales and code fixtures without overstating one sample.
  8. Choose and test a Unicode normalization policy.

Mastery check

Evaluation rubric

  • Foundational: Given a five-word corpus, you can calculate one BPE winner by hand and apply it to updated pieces.
  • Intermediate: You can implement BPE replay and WordPiece longest-match segmentation, then explain why their failure behavior differs.
  • Intermediate: You can show the difference between SentencePiece BPE and SentencePiece Unigram without calling SentencePiece a merge algorithm.
  • Advanced: You can present a tokenizer audit with locale fixtures, code fixtures, normalization policy, vocabulary memory cost, and artifact versioning.

Common pitfalls

  • Retraining during inference: Pair frequencies are counted while building BPE rules, not while serving each customer message. Serving must replay fixed rules.
  • Describing the WordPiece proxy as its spec: The association score clarifies the intuition. The original method is likelihood-driven, and implementations can vary.
  • Calling SentencePiece a fourth algorithm: SentencePiece packages raw-text handling and can host BPE or Unigram models.
  • Assuming byte fallback means fair multilingual cost: Coverage avoids unknown characters; it doesn't make segment lengths equal.
  • Normalizing without a policy: NFC and NFKC solve different problems. Compatibility folding can discard distinctions your product intended to preserve.
You can now turn text into stable token IDs and measure the cost of that choice. Next you'll turn those IDs into vectors whose geometry lets a model learn similarity and context.
References

[1] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.

[2] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.

[3] Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search.

[4] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

[5] Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018.

[6] Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.

[7] Petrov, A., La Malfa, E., Torr, P. H. S., & Bibi, A. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023.