LeetLLM

Your go-to resource for mastering AI & LLM systems.
© 2026 LeetLLM. All rights reserved.

πŸ“MediumNLP Fundamentals

Tokenization: BPE & SentencePiece

Compare tokenization algorithms, understand vocabulary size tradeoffs, analyze the multilingual tokenization tax, and handle Unicode edge cases in production.

30 min read · Google, Meta, OpenAI · 38 key concepts

Imagine trying to teach a computer to read. You can't just feed it raw text, because it doesn't know what letters, words, or sentences are. You need to break the text into small, meaningful pieces that the computer can work with. This process is called tokenization, and it's the very first step in how every language model, from ChatGPT to Gemini, processes your input.

Think of it like breaking a sentence into Scrabble tiles: you need pieces that are small enough to be flexible, but large enough to carry meaning. The word "unhappiness" might become ["un", "happi", "ness"]. Each piece is meaningful on its own, and together they reconstruct the whole word. Every large language model (LLM) sees the world through these subword tokens, not raw words or characters.

This article is your foundation. Every topic that follows (attention, embeddings, inference) builds on understanding how raw text becomes the numerical sequences that models actually process.

🎯 Core concept: Tokenization is the bridge between human language and model computation. Understanding how and why text gets split into tokens will help you reason about everything from API costs to why some languages perform worse than others in AI systems.


The core problem: why not just use words or characters?

Before diving into algorithms, understand the fundamental tension:

| Approach | Vocabulary Size | Avg Sequence Length | Problem |
|---|---|---|---|
| Character-level | ~256 | Very long (5–7× word count) | Sequences too long for attention; hard to learn word semantics |
| Word-level | 500K+ | Short | Huge embedding table; can't handle misspellings, neologisms, or morphology |
| Subword-level | 32K–200K | Balanced | ✅ Best of both worlds |

Subword tokenization solves this by learning a vocabulary of frequently co-occurring character sequences. Common words like "the" stay whole. Rare words like "unhappiness" get split into meaningful pieces: ["un", "happi", "ness"]. This is the approach used by every modern LLM.
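A toy comparison makes the tension concrete. The subword split below is hand-picked for illustration rather than produced by a trained tokenizer:

```python
sentence = "the unhappiness of the fox"

char_tokens = list(sentence)      # character-level: one token per character
word_tokens = sentence.split()    # word-level: one token per whitespace-delimited word
# Hand-picked subword split (what a trained BPE might plausibly produce):
subword_tokens = ["the", "un", "happi", "ness", "of", "the", "fox"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 26 5 7
```

Character-level gives the longest sequences, word-level the shortest but with an unbounded vocabulary; subword sits in between.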


Byte pair encoding (BPE)

BPE is the dominant tokenization algorithm in modern LLMs. Originally a data compression algorithm invented by Philip Gage in 1994[1], it was adapted for NLP by Sennrich, Haddow & Birch (2016)[2] and later refined into byte-level BPE by Radford et al. (2019)[3] for GPT-2.

How BPE training works

The algorithm is elegantly simple: iteratively merge the most frequent pair.


Here is the complete training algorithm. It takes a corpus of text and a target vocabulary size as input, and outputs the final vocabulary along with the ordered list of merge rules that will be used to tokenize new text:

```python
def merge_pair(ids: list, pair: tuple, new_id: int) -> list:
    """Replaces all occurrences of a pair with a new_id in a list of ids."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i+1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(corpus: str, vocab_size: int):
    """Train a BPE tokenizer from a text corpus."""

    # Step 1: Encode text as UTF-8 bytes (byte-level BPE)
    # This gives us a base vocabulary of exactly 256 byte values
    token_ids = list(corpus.encode("utf-8"))  # e.g., [72, 101, 108, ...]

    merges = {}  # (pair) -> new_token_id
    vocab = {i: bytes([i]) for i in range(256)}  # Base: all byte values

    num_merges = vocab_size - 256

    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pair_counts = {}
        for j in range(len(token_ids) - 1):
            pair = (token_ids[j], token_ids[j + 1])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

        if not pair_counts:
            break

        # Step 3: Find the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 4: Create new token and replace all occurrences
        new_id = 256 + i
        token_ids = merge_pair(token_ids, best_pair, new_id)

        merges[best_pair] = new_id
        vocab[new_id] = vocab[best_pair[0]] + vocab[best_pair[1]]

    return vocab, merges  # Ordered merge rules are the "model"
```

Concrete example: BPE training step by step

Let's trace BPE on a tiny corpus to build intuition:

```text
Corpus: "low low low low lowest lowest newer newer newer wider wider"

Step 0: Initial tokens (characters):
  ['l','o','w',' ','l','o','w',' ','l','o','w',' ','l','o','w',' ',
   'l','o','w','e','s','t',' ','l','o','w','e','s','t',' ',
   'n','e','w','e','r',' ','n','e','w','e','r',' ','n','e','w','e','r',' ',
   'w','i','d','e','r',' ','w','i','d','e','r']

Step 1: Most frequent pair: ('l','o') appears 6 times
        Merge: 'l' + 'o' → 'lo'

Step 2: Most frequent pair: ('lo','w') appears 6 times
        Merge: 'lo' + 'w' → 'low'

Step 3: Most frequent pair: ('e','r') appears 5 times
        Merge: 'e' + 'r' → 'er'

Step 4: Most frequent pair: ('e','s') appears 2 times
        Merge: 'e' + 's' → 'es'

Step 5: Most frequent pair: ('es','t') appears 2 times
        Merge: 'es' + 't' → 'est'

Final vocabulary includes: {..., 'lo', 'low', 'er', 'es', 'est', ...}
```

💡 Key Insight: The merge rules capture linguistic patterns. "est" emerges as a suffix, "er" as another. BPE discovers morphology without being told about it.

Tokenization at inference time

Once trained, the merge rules are applied in order to new text:

BPE tokenization pipeline: the input sentence is split into subword tokens using merge rules, then each token is mapped to an integer ID via vocabulary lookup. Repeated words like "The" and "the" both map to the same token ID 464.
```text
Input: "lowest"

Step 0: ['l', 'o', 'w', 'e', 's', 't']        ← start with bytes
Merge 1 ('l','o' → 'lo'):   ['lo', 'w', 'e', 's', 't']
Merge 2 ('lo','w' → 'low'): ['low', 'e', 's', 't']
Merge 3 ('e','r' → 'er'):   no match, skipped
Merge 4 ('e','s' → 'es'):   ['low', 'es', 't']
Merge 5 ('es','t' → 'est'): ['low', 'est']

Final tokens: ['low', 'est'] → IDs: [4521, 478]
```
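Replaying merge rules in learned order takes only a few lines. A minimal sketch, assuming `merges` maps token-id pairs to new ids in the order the rules were learned:

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Tokenize new text by replaying merge rules in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

# Hypothetical merge table: 'l'(108)+'o'(111) -> 256, then 'lo'(256)+'w'(119) -> 257
merges = {(108, 111): 256, (256, 119): 257}
print(bpe_encode("low", merges))  # [257]
```

Because untrained bytes simply pass through unmerged, this encoder can never produce an unknown token.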

Byte-level BPE: the modern standard

GPT-2 introduced a critical innovation: byte-level BPE[3]. Instead of starting from Unicode characters, it starts from raw UTF-8 bytes (a base vocabulary of exactly 256). This guarantees:

  • No unknown tokens: Every possible input can be encoded (since any text is a sequence of bytes)
  • Fixed base vocabulary: Always starts at 256, regardless of language
  • Script agnostic: Handles Chinese, Arabic, emoji, and anything else that's valid UTF-8

⚠️ The tradeoff: Non-ASCII characters (like Chinese characters or emoji) require 2–4 bytes each, so they initially get split into multiple tokens before merges can compress them. This is one root cause of the multilingual "tokenization tax" discussed later.

Who uses BPE?

GPT-2, GPT-3, GPT-4, GPT-4o (via tiktoken), LLaMA/Llama 2/3, Mistral, Code Llama, StarCoder. Virtually every modern autoregressive LLM uses BPE.


WordPiece

WordPiece was developed by Schuster & Nakajima (2012)[4] at Google for Japanese and Korean voice search, then popularized by BERT (Devlin et al., 2019)[5].

How WordPiece differs from BPE

The critical distinction: WordPiece merges are chosen to maximize likelihood of the training corpus, not just raw frequency. The merge score is:

score(a, b) = freq(ab) / (freq(a) × freq(b))

Reading the formula: divide the frequency of the merged pair by the product of the individual frequencies. This is essentially pointwise mutual information (PMI), which measures how much more likely two tokens appear together than you'd expect by chance. A pair like ("q", "u") scores high because "qu" is far more common than the individual frequencies of "q" and "u" would predict. Unlike BPE (which just picks the most frequent pair), WordPiece picks the pair with the strongest statistical association.

BPE vs WordPiece: the merge decision

Consider this scenario in a corpus:

```text
Token frequencies: "t" appears 1000x, "h" appears 800x, "th" appears 500x
                   "x" appears 10x,   "z" appears 8x,   "xz" appears 7x

BPE picks:  "th" (frequency 500 > 7)  ← raw count wins
WordPiece:  "xz" (score: 7/(10×8) = 0.0875  vs  "th": 500/(1000×800) = 0.000625)
```

WordPiece would merge "xz" first because the tokens almost always co-occur. It's a much stronger signal than the common-but-independent "th".
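The arithmetic in this example is easy to verify; a quick sketch of the scoring function:

```python
def wordpiece_score(freq_ab: int, freq_a: int, freq_b: int) -> float:
    """WordPiece merge score: freq(ab) / (freq(a) * freq(b))."""
    return freq_ab / (freq_a * freq_b)

score_th = wordpiece_score(500, 1000, 800)  # common, but t and h are also common alone
score_xz = wordpiece_score(7, 10, 8)        # rare, but almost always together
print(score_th, score_xz)                   # 0.000625 0.0875
```

Despite "th" being 70× more frequent, "xz" wins the merge by two orders of magnitude on the PMI-style score.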

The ## prefix convention

WordPiece uses ## to mark continuation tokens (tokens that don't start a new word):

```text
Input: "unhappiness"

WordPiece tokenization:
  ["un", "##happi", "##ness"]

The ## tells the model: "this token continues the previous word"
Without ##: the model couldn't distinguish "un happy" from "unhappy"
```

Tokenization strategy: greedy longest-match

Unlike BPE (which replays merge rules in order), WordPiece uses greedy longest-match at inference. This function takes a single word and the learned vocabulary as input. It repeatedly finds the longest valid subword starting from the current position and outputs a list of matching subword tokens:

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """WordPiece greedy longest-match tokenization."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        found = False
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # Continuation prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            tokens.append("[UNK]")  # Unknown token fallback
            start += 1
        else:
            start = end
    return tokens
```

⚠️ Critical difference: If WordPiece can't tokenize a word at all, it falls back to [UNK]. BPE never produces unknown tokens (since it falls back to individual bytes).

Head-to-head comparison

| Aspect | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Highest likelihood gain (PMI) |
| Subword prefix | No special marker | ## for continuations |
| Inference tokenization | Replay merge rules in order | Greedy longest-match |
| Unknown tokens | Never (byte fallback) | [UNK] if no match |
| Used by | GPT, LLaMA, Mistral | BERT, DistilBERT, ELECTRA |
| Dominant in | Generative / autoregressive | Encoder / masked LMs |

SentencePiece

SentencePiece, developed by Kudo & Richardson (2018)[6] at Google, isn't a new tokenization algorithm. It's a framework that makes tokenization truly language-independent by removing the need for pre-tokenization.

🧩 Analogy, universal translator: Think of traditional tokenizers like English-language spell checkers. They work great for English but choke on Japanese or Thai because they assume words are separated by spaces. SentencePiece is more like a universal translator that doesn't assume anything about how languages work. It treats all text as a raw stream of characters, making it equally capable with any writing system.

The language independence problem

Traditional tokenizers (BPE, WordPiece) assume you can split text into words first:

```text
English:  "I love cats" → ["I", "love", "cats"] → subtokenize each
Japanese: "猫が好きです" → ??? No spaces to split on!
Thai:     "ฉันรักแมว" → ??? Also no spaces!
German:   "Lebensversicherungsgesellschaft" → One giant compound word
```

SentencePiece solves this by treating the entire input as a raw character stream, whitespace included:

```text
SentencePiece: "I love cats"  → "▁I▁love▁cats"  → subtokenize directly
               "猫が好きです" → "▁猫が好きです" → subtokenize directly
```

The ▁ (Unicode U+2581, "lower one eighth block") explicitly represents whitespace as a visible token. This means:

  1. No pre-tokenization needed. Works identically for all languages.
  2. Perfectly reversible. You can reconstruct the exact original text, including spacing.
  3. Whitespace is semantic. The model can learn that ▁the (word start) differs from the (mid-word).
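The marker trick itself can be sketched in isolation (a simplification: real SentencePiece also learns a subword vocabulary on top of this stream):

```python
MARKER = "\u2581"  # ▁ U+2581, "lower one eighth block"

def to_stream(text: str) -> str:
    """Make whitespace visible: prepend one marker, replace every space."""
    return MARKER + text.replace(" ", MARKER)

def from_stream(stream: str) -> str:
    """Exact inverse: swap markers back to spaces, drop the leading one."""
    return stream.replace(MARKER, " ")[1:]

encoded = to_stream("I love cats")
print(encoded)               # ▁I▁love▁cats
print(from_stream(encoded))  # I love cats
```

Because every space is preserved as a visible symbol, the round trip is lossless, which is exactly the reversibility property above.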

Two algorithms under one roof

SentencePiece supports two training algorithms:

BPE mode

Standard BPE, but operating on the raw character stream (including ▁). This is what LLaMA and T5 use.

Unigram language model mode

The Unigram model, introduced by Kudo (2018)[7], takes the opposite approach from BPE:

🪆 Analogy, sculptor's block of marble: BPE is like building a word from individual letter tiles: you start small and keep gluing pieces together. Unigram is like Michelangelo sculpting David. You start with a massive block of marble (a huge vocabulary) and chisel away the pieces that matter least, revealing the optimal vocabulary hidden inside.


The Unigram algorithm:

  1. Initialize with a large seed vocabulary (e.g., all substrings up to length N, or the top-K most frequent substrings)
  2. Assign probabilities to each token using Expectation-Maximization (EM); a sentence's probability is the product of its token probabilities: P(x) = ∏_{i=1}^{M} P(x_i), where each x_i is a subword token
  3. Find the best segmentation of each training sentence using Viterbi decoding (dynamic programming)
  4. Compute the loss (negative log-likelihood) for each token's removal
  5. Prune the tokens whose removal least impacts the total corpus likelihood
  6. Repeat until the target vocabulary size is reached
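Step 3 is the core of the algorithm. A minimal Viterbi sketch, with hypothetical token log-probabilities standing in for EM-estimated ones:

```python
import math

def viterbi_segment(word: str, logp: dict) -> list:
    """Best segmentation of `word` under a unigram model (max total log-prob)."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i] = best score for word[:i]
    back = [0] * (n + 1)             # back[i] = start index of last token in best path
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Reconstruct the tokens by walking the backpointers
    tokens, i = [], n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical log-probabilities (illustrative, not from a real model)
logp = {"low": -2.0, "est": -2.5, "lo": -3.0, "w": -3.5,
        "e": -4.0, "s": -4.0, "t": -4.0}
print(viterbi_segment("lowest", logp))  # ['low', 'est']
```

The dynamic program considers every split point, so "low" + "est" (total −4.5) beats e.g. "lo" + "w" + "est" (−9.0).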

💡 Why Unigram matters: Unlike BPE (which produces a single deterministic segmentation), Unigram can sample from multiple valid segmentations. This is called subword regularization. During training, you expose the model to different ways of splitting the same word, which acts as a powerful data augmentation technique and improves robustness[7].

SentencePiece in practice

Implementing SentencePiece in production is straightforward using the official Python library. The following example demonstrates how to train a new tokenizer directly on raw text files and then use it to encode and decode both English and Japanese inputs:

```python
import sentencepiece as spm

# Training: language-independent, works on raw text files
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',            # or 'unigram'
    character_coverage=0.9995,   # % of characters to cover
    byte_fallback=True,          # Handle unknown chars via UTF-8 bytes
)

# Usage
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

print(sp.encode("Hello world!", out_type=str))
# ['▁Hello', '▁world', '!']

print(sp.encode("猫が好きです", out_type=str))
# ['▁猫', 'が', '好き', 'です']

# Perfect reconstruction
print(sp.decode(['▁Hello', '▁world', '!']))
# "Hello world!"
```

Who uses SentencePiece?

  • BPE mode: LLaMA 1/2/3, T5, mBART, ALBERT
  • Unigram mode: XLNet, ALBERT, mBART (some variants)
  • Key advantage: A single library and model file handles any language without code changes

Vocabulary size: the critical hyperparameter

Vocabulary size is one of the most consequential design decisions in building a tokenizer. It directly controls the tradeoff between sequence compression (fewer tokens per text) and embedding table size (memory cost).

The tradeoff spectrum

| Vocab Size | Fertility (EN) | Fertility (ZH) | Embedding Memory | Representative Model |
|---|---|---|---|---|
| 8K | ~1.8 tok/word | ~5.0 tok/char | 16 MB | Small research models |
| 32K | ~1.3 tok/word | ~2.5 tok/char | 64 MB | LLaMA 1 & 2, T5 |
| 50K | ~1.2 tok/word | ~2.0 tok/char | 100 MB | GPT-2 (50,257) |
| 100K | ~1.1 tok/word | ~1.5 tok/char | 200 MB | GPT-4 (100,256, cl100k_base) |
| 200K | ~1.05 tok/word | ~1.2 tok/char | 400 MB | GPT-4o (199,997, o200k_base) |

Fertility = average number of tokens per word (or per character for logographic scripts). Lower is better for efficiency.

📊 Industry trend: Vocabulary sizes are growing. GPT-2 (2019) used 50K tokens. GPT-4 (2023) doubled to 100K. GPT-4o (2024) doubled again to 200K. This reflects the push for better multilingual support and code handling.

Why bigger isn't always better

```text
Larger vocabulary:
  ✅ Fewer tokens per text (more fits in context window)
  ✅ Better representation of rare words and non-English scripts
  ✅ Lower API cost per character

  ❌ Larger embedding table (vocab_size × hidden_dim parameters)
  ❌ Rare tokens get less training signal → undertrained embeddings
  ❌ Softmax over larger vocabulary is slower
```

The empirical sweet spot for modern LLMs has settled around 32K–128K for primarily English models and 100K–200K for multilingual models[8].
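The embedding-memory column follows from simple arithmetic. A sketch assuming 2-byte (fp16/bf16) weights and a hidden dimension of 1024, which roughly matches the table's figures:

```python
def embedding_table_mb(vocab_size: int, hidden_dim: int = 1024,
                       bytes_per_param: int = 2) -> float:
    """Embedding table size in MB: one hidden_dim vector per vocabulary entry."""
    return vocab_size * hidden_dim * bytes_per_param / 1e6

print(round(embedding_table_mb(32_000)))   # ~66 MB  (the table rounds to 64)
print(round(embedding_table_mb(200_000)))  # ~410 MB (the table rounds to 400)
```

Note the same table is usually duplicated at the output layer (the softmax projection) unless weights are tied, so the real cost can be double this.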


The tokenization tax: multilingual fairness

One of the most critical aspects of modern LLM performance is tokenization bias, the systematic disadvantage that non-English languages face due to English-centric tokenizer training.

💰 Analogy, currency exchange rates: Imagine a vending machine that only accepts US quarters. If you're American, a candy bar costs 4 quarters. But if you're Japanese, you first have to exchange your yen into quarters, and the exchange rate is terrible, so the same candy bar costs you 15 quarters. That's the tokenization tax: non-English languages pay more tokens for the same meaning, which means higher API costs, shorter effective context windows, and slower generation.

The problem: unequal fertility

When a tokenizer is trained primarily on English data, non-English text fragments into far more tokens:

| Language | Script | GPT-2 (50K, EN-centric) | GPT-4o (200K, multilingual) |
|---|---|---|---|
| English | Latin | 1.0× (baseline) | 1.0× (baseline) |
| French | Latin | 1.5× | 1.1× |
| German | Latin | 1.8× | 1.2× |
| Chinese | CJK | 3.5× | 1.4× |
| Hindi | Devanagari | 4.2× | 1.8× |
| Korean | Hangul | 3.8× | 1.5× |
| Swahili | Latin | 2.5× | 1.6× |

Multipliers show token count relative to English for equivalent semantic content[9][10].

Real-world consequences

Petrov et al. (2023)[9] demonstrated that this "tokenization tax" has cascading effects:


🔬 Research finding: A recent study on the "Token Tax" (2025)[10] found that fertility reliably predicts accuracy on multilingual benchmarks: higher fertility → lower accuracy, consistently across 10 LLMs tested on 16 African languages. Doubling token count quadruples training cost and time.
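The cost impact is direct multiplication. A sketch with an illustrative price ($5 per million tokens is a made-up figure) and the fertility multipliers from the table above:

```python
def same_content_cost(tokens_en: int, fertility_multiplier: float,
                      usd_per_million: float = 5.0) -> float:
    """API cost for equivalent semantic content at a given fertility multiplier."""
    return tokens_en * fertility_multiplier * usd_per_million / 1e6

en = same_content_cost(10_000_000, 1.0)  # English baseline
hi = same_content_cost(10_000_000, 4.2)  # Hindi on a GPT-2-style EN-centric tokenizer
print(en, hi)  # Hindi users pay 4.2x for the same content
```

The same multiplier also shrinks the effective context window: a 128K-token window holds roughly 128K/4.2 tokens' worth of English-equivalent Hindi content.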

How industry is addressing it

  1. Larger, balanced vocabularies: GPT-4o's o200k_base encoding significantly reduces CJK (Chinese, Japanese, and Korean) and Indic token inflation compared to cl100k_base
  2. Multilingual training data balancing: Ensuring the tokenizer training corpus has proportional representation across target languages
  3. Character-coverage thresholds: SentencePiece's character_coverage parameter ensures a minimum percentage of characters get their own token
  4. Post-hoc vocabulary extension: Adding high-frequency subwords from underrepresented languages after initial training

Unicode normalization: production edge cases

In production systems, raw user input is messy. Engineers must handle Unicode pitfalls to ensure consistent tokenization:

Normalization forms

Python's built-in unicodedata library is the standard tool for handling Unicode variations. The following snippet demonstrates the difference between composed and decomposed forms, and how NFKC normalization can convert ligatures back into standard characters:

```python
import unicodedata

text = "café"  # But which "é"?

# Composed form (NFC): é is a single code point (U+00E9)
nfc = unicodedata.normalize('NFC', text)   # "café" - 4 chars

# Decomposed form (NFD): é = e + combining acute accent (two code points)
nfd = unicodedata.normalize('NFD', text)   # "café" - 5 chars

# NFKC: Also decomposes compatibility characters
ligature = "ﬁnance"  # ﬁ is a single ligature character (U+FB01)
nfkc = unicodedata.normalize('NFKC', ligature)  # "finance"
```

⚠️ Production critical: If you don't normalize before tokenizing, the same word can produce different token sequences depending on how the Unicode was encoded. Most production tokenizers apply NFKC normalization by default.

Common edge cases in production

| Edge Case | Example | Problem | Solution |
|---|---|---|---|
| Composed vs decomposed | é (1 char) vs e+◌́ (2 chars) | Same visual, different tokens | NFC normalization |
| Homoglyphs | а (Cyrillic) vs a (Latin) | Looks identical, different encodings | NFKC + homoglyph detection |
| Zero-width chars | U+200B (zero-width space) | Invisible, can bypass content filters | Strip during preprocessing |
| Emoji sequences | 👨‍👩‍👧‍👦 = 7 code points | One visual glyph, many bytes | Handle via ZWJ-aware segmentation |
| RTL/Bidi text | Arabic mixed with English | Display order ≠ logical order | Apply Unicode bidi algorithm |
| Surrogate pairs | Characters outside BMP | Can cause errors in UTF-16 systems | Use UTF-8 internally |
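Several of these edge cases can be handled in one preprocessing pass. A minimal standard-library sketch (homoglyph detection and bidi handling are out of scope; ZWJ is deliberately kept so emoji sequences survive):

```python
import unicodedata

# Zero-width space and BOM are stripped; ZWJ (U+200D) is kept because
# emoji sequences like the family emoji depend on it.
STRIP = {"\u200b", "\ufeff"}

def clean_for_tokenizer(text: str) -> str:
    """NFKC-normalize and drop invisible characters before tokenizing."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in STRIP)

# Composed and decomposed "café" now tokenize identically
print(clean_for_tokenizer("caf\u00e9") == clean_for_tokenizer("cafe\u0301"))  # True
print(clean_for_tokenizer("ab\u200bc"))    # abc (zero-width space removed)
print(clean_for_tokenizer("\ufb01nance"))  # finance (ﬁ ligature expanded)
```

Running this before the tokenizer guarantees that visually identical inputs map to identical token sequences, regardless of how the client encoded them.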

Modern tokenizer libraries

When implementing tokenization in production, engineers rarely write these algorithms from scratch. The open-source ecosystem provides highly optimized, production-ready libraries that handle the heavy lifting. The choice of library often depends on the target model and the specific deployment environment.

Quick comparison

For OpenAI models, tiktoken is the industry standard due to its raw speed and precise alignment with their APIs. When working with open-source models like LLaMA or Mistral, the Hugging Face tokenizers library provides maximum flexibility. For custom model training, especially in multilingual contexts, SentencePiece remains the go-to framework. The following examples demonstrate how to initialize and use these three major libraries to tokenize a simple string:

```python
# ─── tiktoken (OpenAI) ──────────────────────────
# Rust-based, optimized for OpenAI models
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base
tokens = enc.encode("Hello, world!")         # [13225, 11, 2375, 0]
text = enc.decode(tokens)                    # "Hello, world!"
num_tokens = len(tokens)                     # 4

# ─── HuggingFace tokenizers ────────────────────
# Rust-based, supports all architectures
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
out = tok("Hello, world!", return_tensors="pt")
# out.input_ids: tensor([[1, 15043, 29892, 3186, 29991]])

# ─── SentencePiece ─────────────────────────────
# C++-based, for training custom tokenizers
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='llama.model')
pieces = sp.encode("Hello, world!", out_type=str)
# ['▁Hello', ',', '▁world', '!']
```
| Library | Language | Speed | Best For |
|---|---|---|---|
| tiktoken | Rust+Python | ⚡ Fastest | Token counting for OpenAI API; cost estimation |
| HF tokenizers | Rust+Python | ⚡ Very fast | Training & inference with any model |
| SentencePiece | C++ | 🚀 Fast | Training custom tokenizers; multilingual |

The complete algorithm comparison

To synthesize the differences between these three major approaches, it helps to look at them side-by-side. The fundamental divergence lies in their directionality (bottom-up merging vs top-down pruning) and their decision criteria (raw frequency vs statistical likelihood).

Tokenization method comparison on the word "unbelievably": BPE splits by frequency-based merges, WordPiece uses likelihood-based prefixing (##), and SentencePiece uses a language-independent unigram model.
| Feature | BPE | WordPiece | Unigram (SentencePiece) |
|---|---|---|---|
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Merge criterion | Frequency | Likelihood (PMI) | N/A (pruning by likelihood impact) |
| Pre-tokenization | Required (unless byte-level) | Required | Not required |
| Unknown tokens | None (byte fallback) | [UNK] | None (byte fallback) |
| Deterministic | Yes | Yes | Yes (Viterbi), but can sample |
| Subword regularization | Via BPE-dropout | No | Native (multiple segmentations) |
| Primary models | GPT, LLaMA, Mistral | BERT, ELECTRA | T5, XLNet, ALBERT |
| Training speed | Fast | Fast | Slower (EM iterations) |
| Compression ratio | Best (~25–29% better than Unigram) | Good | Good |

Specialized case: tokenization for code

Code tokenization requires specific handling distinct from natural language because code has strict syntax rules where whitespace and punctuation matter deeply.

Why standard NLP tokenizers fail on code

Standard BPE or WordPiece trained on English text often fails on code:

  1. Whitespace sensitivity: In Python, indentation is semantic. A tokenizer that treats multiple spaces as one token or ignores them breaks the code's logic.
  2. Long identifiers: getUserByEmailAddress might be split into ['get', 'User', 'By', 'Email', 'Address'] (good) or ['getU', 'serB', 'yEm', 'ailA', 'ddress'] (bad). CamelCase splitting is crucial.
  3. Operators: Symbols like ===, !==, ->, :: should be atomic tokens, not fragmented into ['=', '=', '='].
  4. Numerical literals: Floating point numbers (3.14159) should not be split arbitrarily.
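Points 2–4 can be approximated with a regex pre-tokenizer. A hedged sketch (production tokenizers bake similar patterns into their pre-tokenization rules rather than using this exact regex):

```python
import re

# Order matters: multi-char operators first, then camelCase-aware words,
# then whole numeric literals, then whitespace runs and leftover symbols.
PATTERN = re.compile(
    r"===|!==|->|::"                   # atomic multi-char operators
    r"|[A-Z]?[a-z]+|[A-Z]+(?![a-z])"   # camelCase pieces: 'get', 'User', 'HTML'
    r"|\d+(?:\.\d+)?"                  # numeric literals kept whole: '3.14159'
    r"|\s+|\S"                         # whitespace runs, any remaining symbol
)

def pre_tokenize(code: str) -> list:
    return PATTERN.findall(code)

print(pre_tokenize("getUserByEmailAddress"))  # ['get', 'User', 'By', 'Email', 'Address']
print(pre_tokenize("a===b"))                  # ['a', '===', 'b']
```

Subword merging (BPE) then runs within these boundaries, so identifiers fragment along semantic seams instead of arbitrary byte offsets.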

Modern solutions (GPT-4o & StarCoder)

Recent tokenizers like GPT-4o's o200k_base explicitly add code-specific tokens to the vocabulary to reduce fragmentation.[11] The example below compares how standard and optimized tokenizers process the same line of Python code, highlighting how specialized vocabularies preserve semantic boundaries:

```python
# Example: Tokenizing code with GPT-4's cl100k_base (standard) vs o200k_base (optimized)

code = "    def calculate_pi():"

# Standard (GPT-4): 5 tokens
#   [' ', 'def', ' calculate', '_', 'pi', '():']  (fragments the function name)

# Optimized (GPT-4o): 4 tokens
#   [' ', 'def', ' calculate_pi', '():']          (preserves the identifier)
```

💡 Key insight: Reducing the number of tokens per line of code directly increases the context window's effective size for repository-level reasoning. This is why models like StarCoder and Code Llama use specialized BPE training on GitHub data.


Key takeaways

  • The Core Problem: Tokenization bridges raw text and model computation. The fundamental challenge is choosing a granularity between characters and words that balances sequence length and vocabulary size.
  • Algorithm Comparison:
    • BPE: Frequency-based merging, bottom-up (GPT series, LLaMA).
    • WordPiece: Likelihood-based merging (PMI), bottom-up (BERT).
    • Unigram: Likelihood-based pruning, top-down (T5, ALBERT).
  • SentencePiece's Innovation: Achieves language independence through raw-text processing and the ▁ whitespace marker, supporting both BPE and Unigram modes without pre-tokenization.
  • Vocabulary Tradeoffs: Larger vocabularies (50K → 200K) reduce sequence length and improve multilingual support but increase embedding memory costs.
  • Production Reality: Real-world tokenization requires handling Unicode normalization (NFC/NFKC) and mitigating the "tokenization tax" that increases costs and latency for non-English languages.

Common misconceptions

  • Confusing BPE with WordPiece: BPE merges based on raw frequency; WordPiece merges based on likelihood improvement (PMI).
  • Misunderstanding ##: This is a specific convention of WordPiece (and BERT), not a universal property of subword tokenization.
  • SentencePiece as an Algorithm: SentencePiece is a library/framework that implements BPE and Unigram; it is not an algorithm itself.
  • Byte-Level BPE & UNK: Byte-level BPE (used in GPT-2+) generally cannot produce unknown tokens because it falls back to raw bytes.
  • Ignoring Multilingual Cost: Vocabulary choice isn't just about English performance; it determines the cost/latency tier for all other supported languages.

Evaluation Rubric
  1. Derives BPE's iterative merge algorithm with a step-by-step walkthrough
  2. Contrasts WordPiece's PMI scoring against BPE's frequency-based approach
  3. Identifies SentencePiece as a framework distinct from the BPE/Unigram algorithms
  4. Differentiates bottom-up (BPE) vs top-down (Unigram) tokenization strategies
  5. Evaluates vocabulary size tradeoffs between sequence length and embedding memory
  6. Quantifies the "tokenization tax" for non-English languages and its cost implications
  7. Designs a Unicode normalization pipeline (NFC/NFKC) for production inputs
Common Pitfalls
  • Confusing BPE's frequency criterion with WordPiece's likelihood/PMI criterion
  • Calling SentencePiece an algorithm instead of a framework that supports BPE + Unigram
  • Not knowing that byte-level BPE eliminates unknown tokens entirely
  • Ignoring the multilingual tokenization tax and its cascading effects on cost and accuracy
  • Overlooking Unicode normalization (NFC/NFKC) as a critical production requirement
  • Assuming larger vocabulary always improves quality (rare tokens get less training signal)
Key Concepts Tested
  • BPE iterative merge algorithm (frequency-based)
  • Byte-level BPE and the elimination of UNK tokens
  • WordPiece likelihood maximization (PMI scoring)
  • SentencePiece language-independent framework
  • Unigram LM top-down pruning with subword regularization
  • Vocabulary size tradeoffs and the fertility metric
  • Multilingual tokenization tax and fairness
  • Unicode normalization (NFC, NFKC) in production
References

  [1] Gage, P. (1994). A New Algorithm for Data Compression.
  [2] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
  [3] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
  [4] Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search.
  [5] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  [6] Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018.
  [7] Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.
  [8] Raschka, S. (2025). Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch.
  [9] Petrov, A., et al. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023.
  [10] Lundin, J. M., Zhang, A., et al. (2025). The Token Tax: Systematic Bias in Multilingual Tokenization. arXiv preprint.
  [11] OpenAI (2024). GPT-4o System Card. arXiv preprint.
