Compare tokenization algorithms, understand vocabulary size tradeoffs, analyze the multilingual tokenization tax, and handle Unicode edge cases in production.
Imagine trying to teach a computer to read. You can't just feed it raw text, because it doesn't know what letters, words, or sentences are. You need to break the text into small, meaningful pieces that the computer can work with. This process is called tokenization, and it's the very first step in how every language model, from ChatGPT to Gemini, processes your input.
Think of it like breaking a sentence into Scrabble tiles: you need pieces that are small enough to be flexible, but large enough to carry meaning. The word "unhappiness" might become ["un", "happi", "ness"]. Each piece is meaningful on its own, and together they reconstruct the whole word. Every large language model (LLM) sees the world through these subword tokens, not raw words or characters.
This article is your foundation. Every topic that follows (attention, embeddings, inference) builds on understanding how raw text becomes the numerical sequences that models actually process.
🎯 Core concept: Tokenization is the bridge between human language and model computation. Understanding how and why text gets split into tokens will help you reason about everything from API costs to why some languages perform worse than others in AI systems.
Before diving into algorithms, understand the fundamental tension:
| Approach | Vocabulary Size | Avg Sequence Length | Problem |
|---|---|---|---|
| Character-level | ~256 | Very long (5–7× word count) | Sequences too long for attention; hard to learn word semantics |
| Word-level | 500K+ | Short | Huge embedding table; can't handle misspellings, neologisms, or morphology |
| Subword-level | 32K–200K | Balanced | ✅ Best of both worlds |
Subword tokenization solves this by learning a vocabulary of frequently co-occurring character sequences. Common words like "the" stay whole. Rare words like "unhappiness" get split into meaningful pieces: ["un", "happi", "ness"]. This is the approach used by every modern LLM.
BPE is the dominant tokenization algorithm in modern LLMs. Originally a data compression algorithm invented by Philip Gage in 1994[1], it was adapted for NLP by Sennrich, Haddow & Birch (2016)[2] and later refined into byte-level BPE by Radford et al. (2019)[3] for GPT-2.
The algorithm is elegantly simple: iteratively merge the most frequent pair.
Here is the complete training algorithm. It takes a corpus of text and a target vocabulary size as input, and outputs the final vocabulary along with the ordered list of merge rules that will be used to tokenize new text:
```python
def merge_pair(ids: list, pair: tuple, new_id: int) -> list:
    """Replace all occurrences of a pair with new_id in a list of ids."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(corpus: str, vocab_size: int):
    """Train a BPE tokenizer from a text corpus."""

    # Step 1: Encode text as UTF-8 bytes (byte-level BPE)
    # This gives us a base vocabulary of exactly 256 byte values
    token_ids = list(corpus.encode("utf-8"))  # e.g., [72, 101, 108, ...]

    merges = {}  # (pair) -> new_token_id
    vocab = {i: bytes([i]) for i in range(256)}  # Base: all byte values

    num_merges = vocab_size - 256

    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pair_counts = {}
        for j in range(len(token_ids) - 1):
            pair = (token_ids[j], token_ids[j + 1])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

        if not pair_counts:
            break

        # Step 3: Find the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 4: Create new token and replace all occurrences
        new_id = 256 + i
        token_ids = merge_pair(token_ids, best_pair, new_id)

        merges[best_pair] = new_id
        vocab[new_id] = vocab[best_pair[0]] + vocab[best_pair[1]]

    return vocab, merges  # Ordered merge rules are the "model"
```
Let's trace BPE on a tiny corpus to build intuition:
```text
Corpus: "low low low low lowest lowest newer newer newer wider wider"

Step 0: Initial tokens (characters):
  ['l','o','w',' ','l','o','w',' ','l','o','w',' ','l','o','w',' ',
   'l','o','w','e','s','t',' ','l','o','w','e','s','t',' ',
   'n','e','w','e','r',' ','n','e','w','e','r',' ','n','e','w','e','r',' ',
   'w','i','d','e','r',' ','w','i','d','e','r']

Step 1: Most frequent pair: ('l','o') appears 6 times
        Merge: 'l' + 'o' → 'lo'

Step 2: Most frequent pair: ('lo','w') appears 6 times
        Merge: 'lo' + 'w' → 'low'

Step 3: Most frequent pair: ('e','r') appears 5 times
        Merge: 'e' + 'r' → 'er'

Step 4: Most frequent pair: ('e','s') appears 2 times
        Merge: 'e' + 's' → 'es'

Step 5: Most frequent pair: ('es','t') appears 2 times
        Merge: 'es' + 't' → 'est'

Final vocabulary includes: {..., 'lo', 'low', 'er', 'es', 'est', ...}
```
💡 Key insight: The merge rules capture linguistic patterns. "est" emerges as a suffix, "er" as another. BPE discovers morphology without being told about it.
Once trained, the merge rules are applied in order to new text:
```text
Input: "lowest"

Step 0: ['l', 'o', 'w', 'e', 's', 't']  ← start with bytes
Merge 1 ('l','o' → 'lo'):   ['lo', 'w', 'e', 's', 't']
Merge 2 ('lo','w' → 'low'): ['low', 'e', 's', 't']
Merge 3 ('e','s' → 'es'):   ['low', 'es', 't']
Merge 4 ('es','t' → 'est'): ['low', 'est']

Final tokens: ['low', 'est'] → IDs: [4521, 478]
```
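The replay logic above can be sketched in a few lines of Python. This is a simplified sketch, not a real tokenizer's implementation: the helper name `bpe_encode` and the toy merge table are illustrative, and the key idea is that merges must be applied in the order they were learned.

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Apply learned BPE merges, in training order, to new text.

    `merges` maps (id_a, id_b) -> new_id and must preserve the order
    in which the merges were learned (Python dicts keep insertion order).
    """
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # earlier merges have priority
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)        # replace the pair with its merged id
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Toy merge table: 'l'+'o' -> 256 ("lo"), then 256+'w' -> 257 ("low")
merges = {(108, 111): 256, (256, 119): 257}
print(bpe_encode("low low", merges))  # [257, 32, 257]
```

Note that decoding is even simpler: look each id up in the vocabulary of byte strings and concatenate.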
GPT-2 introduced a critical innovation: byte-level BPE[3]. Instead of starting from Unicode characters, it starts from raw UTF-8 bytes (a base vocabulary of exactly 256). This guarantees that any Unicode string can be encoded and decoded losslessly, with no possibility of unknown tokens.
⚠️ The tradeoff: Non-ASCII characters (like Chinese characters or emoji) require 2–4 bytes each, so they initially get split into multiple tokens before merges can compress them. This is one root cause of the multilingual "tokenization tax" discussed later.
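A quick way to see this tax is to count UTF-8 bytes per character, since each byte becomes one initial token before any merges apply:

```python
# Each UTF-8 byte is one base token before any merges apply,
# so characters that need more bytes start out more expensive.
samples = ["a", "é", "猫", "🚀"]  # ASCII, accented Latin, CJK, emoji
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_counts)  # {'a': 1, 'é': 2, '猫': 3, '🚀': 4}
```

An emoji starts life as four tokens while an ASCII letter starts as one; only merges learned from training data can close that gap.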
BPE is used by GPT-2, GPT-3, GPT-4, and GPT-4o (via tiktoken), as well as LLaMA/Llama 2/3, Mistral, Code Llama, and StarCoder. Virtually every modern autoregressive LLM uses BPE.
WordPiece was developed by Schuster & Nakajima (2012)[4] at Google for Japanese and Korean voice search, then popularized by BERT (Devlin et al., 2019)[5].
The critical distinction: WordPiece merges are chosen to maximize the likelihood of the training corpus, not just raw frequency. The merge score is:

score(a, b) = freq(ab) / (freq(a) × freq(b))
Reading the formula: divide the frequency of the merged pair by the product of the individual frequencies. This is essentially pointwise mutual information (PMI), which measures how much more likely two tokens appear together than you'd expect by chance. A pair like ("q", "u") scores high because "qu" is far more common than the individual frequencies of "q" and "u" would predict. Unlike BPE (which just picks the most frequent pair), WordPiece picks the pair with the strongest statistical association.
Consider this scenario in a corpus:
```text
Token frequencies: "t" appears 1000x, "h" appears 800x, "th" appears 500x
                   "x" appears 10x,   "z" appears 8x,   "xz" appears 7x

BPE picks:  "th" (frequency 500 > 7) — raw count wins
WordPiece:  "xz" (score: 7/(10×8) = 0.0875  vs  "th": 500/(1000×800) = 0.000625)
```
WordPiece would merge "xz" first because the tokens almost always co-occur. It's a much stronger signal than the common-but-independent "th".
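The numbers above can be checked directly. This sketch uses a hypothetical helper name, `wordpiece_score`; the formula is the one described above:

```python
def wordpiece_score(pair_freq: int, a_freq: int, b_freq: int) -> float:
    """WordPiece merge score: freq(ab) / (freq(a) * freq(b))."""
    return pair_freq / (a_freq * b_freq)

th = wordpiece_score(500, 1000, 800)  # 0.000625 — frequent but independent
xz = wordpiece_score(7, 10, 8)        # 0.0875   — rare but strongly associated
print(xz > th)  # True: WordPiece merges "xz" before "th"
```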
The ## prefix convention: WordPiece uses ## to mark continuation tokens (tokens that don't start a new word):
```text
Input: "unhappiness"

WordPiece tokenization:
  ["un", "##happi", "##ness"]

The ## tells the model: "this token continues the previous word"
Without ##: the model couldn't distinguish "un happy" from "unhappy"
```
Unlike BPE (which replays merge rules in order), WordPiece uses greedy longest-match at inference. This function takes a single word and the learned vocabulary as input. It repeatedly finds the longest valid subword starting from the current position and outputs a list of matching subword tokens:
```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """WordPiece greedy longest-match tokenization."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        found = False
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # Continuation prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            tokens.append("[UNK]")  # Unknown token fallback
            start += 1
        else:
            start = end
    return tokens
```
⚠️ Critical difference: If WordPiece can't tokenize a word at all, it falls back to [UNK]. BPE never produces unknown tokens, since it can always fall back to individual bytes.
| Aspect | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Highest likelihood gain (PMI) |
| Subword prefix | No special marker | ## for continuations |
| Inference tokenization | Replay merge rules in order | Greedy longest-match |
| Unknown tokens | Never (byte fallback) | [UNK] if no match |
| Used by | GPT, LLaMA, Mistral | BERT, DistilBERT, ELECTRA |
| Dominant in | Generative / autoregressive | Encoder / masked LMs |
SentencePiece, developed by Kudo & Richardson (2018)[6] at Google, isn't a new tokenization algorithm. It's a framework that makes tokenization truly language-independent by removing the need for pre-tokenization.
🧩 Analogy, universal translator: Think of traditional tokenizers like English-language spell checkers. They work great for English but choke on Japanese or Thai because they assume words are separated by spaces. SentencePiece is more like a universal translator that doesn't assume anything about how languages work. It treats all text as a raw stream of characters, making it equally capable with any writing system.
Traditional tokenizers (BPE, WordPiece) assume you can split text into words first:
```text
English:  "I love cats" → ["I", "love", "cats"] → subtokenize each
Japanese: "猫が好きです" → ??? No spaces to split on!
Thai:     "ฉันรักแมว" → ??? Also no spaces!
German:   "Lebensversicherungsgesellschaft" → One giant compound word
```
SentencePiece solves this by treating the entire input as a raw character stream, whitespace included:
```text
SentencePiece: "I love cats" → "▁I▁love▁cats" → subtokenize directly
               "猫が好きです" → "▁猫が好きです" → subtokenize directly
```
The ▁ (Unicode U+2581, "lower one eighth block") explicitly represents whitespace as a visible token. This means:
▁the (word start) is a different token from the (mid-word), so no word-boundary information is lost.

SentencePiece supports two training algorithms:
Standard BPE, but operating on the raw character stream (including ▁). This is what LLaMA and Mistral use.
The Unigram model, introduced by Kudo (2018)[7], takes the opposite approach from BPE:
🪨 Analogy, sculptor's block of marble: BPE is like building a word from individual letter tiles: you start small and keep gluing pieces together. Unigram is like Michelangelo sculpting David. You start with a massive block of marble (a huge vocabulary) and chisel away the pieces that matter least, revealing the optimal vocabulary hidden inside.
The Unigram algorithm works top-down:
1. Start with a very large seed vocabulary (e.g., all frequent substrings of the corpus).
2. Fit a unigram language model over the current vocabulary using EM.
3. For each token, compute how much the corpus likelihood drops if that token is removed.
4. Prune the tokens with the smallest likelihood impact (always keeping single characters).
5. Repeat steps 2–4 until the target vocabulary size is reached.
💡 Why Unigram matters: Unlike BPE (which produces a single deterministic segmentation), Unigram can sample from multiple valid segmentations. This is called subword regularization. During training, you expose the model to different ways of splitting the same word, which acts as a powerful data augmentation technique and improves robustness[7].
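To make the likelihood-based segmentation concrete, here is a minimal Viterbi decoder over a toy unigram vocabulary. The token log-probabilities are invented for illustration, and real implementations also handle unsegmentable inputs; this sketch assumes every character is in the vocabulary:

```python
import math

def viterbi_segment(word: str, logprobs: dict) -> list:
    """Most likely segmentation of `word` under a unigram LM over subwords."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprobs and best[start] + logprobs[piece] > best[end]:
                best[end] = best[start] + logprobs[piece]
                back[end] = start
    pieces, i = [], n
    while i > 0:                     # walk backpointers from the end
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary: subword pieces score far better than single characters
lp = {"un": -2.0, "happi": -3.0, "ness": -2.5}
lp.update({c: -6.0 for c in "unhappiness"})
print(viterbi_segment("unhappiness", lp))  # ['un', 'happi', 'ness']
```

Subword regularization replaces this argmax with sampling from the distribution over segmentations, so the same word can be split differently on different training passes.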
Implementing SentencePiece in production is straightforward using the official Python library. The following example demonstrates how to train a new tokenizer directly on raw text files and then use it to encode and decode both English and Japanese inputs:
```python
import sentencepiece as spm

# Training: language-independent, works on raw text files
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',            # or 'unigram'
    character_coverage=0.9995,   # Fraction of characters to cover
    byte_fallback=True,          # Handle unknown chars via UTF-8 bytes
)

# Usage
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

print(sp.encode("Hello world!", out_type=str))
# ['▁Hello', '▁world', '!']

print(sp.encode("猫が好きです", out_type=str))
# ['▁猫', 'が', '好き', 'です']

# Perfect reconstruction
print(sp.decode(['▁Hello', '▁world', '!']))
# "Hello world!"
```
Vocabulary size is one of the most consequential design decisions in building a tokenizer. It directly controls the tradeoff between sequence compression (fewer tokens per text) and embedding table size (memory cost).
| Vocab Size | Fertility (EN) | Fertility (ZH) | Embedding Memory | Representative Model |
|---|---|---|---|---|
| 8K | ~1.8 tok/word | ~5.0 tok/char | 16 MB | Small research models |
| 32K | ~1.3 tok/word | ~2.5 tok/char | 64 MB | LLaMA 1 & 2, T5 |
| 50K | ~1.2 tok/word | ~2.0 tok/char | 100 MB | GPT-2 (50,257) |
| 100K | ~1.1 tok/word | ~1.5 tok/char | 200 MB | GPT-4 (100,256, cl100k_base) |
| 200K | ~1.05 tok/word | ~1.2 tok/char | 400 MB | GPT-4o (199,997, o200k_base) |
Fertility = average number of tokens per word (or per character for logographic scripts). Lower is better for efficiency.
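Fertility is easy to compute for any tokenizer's output. This is an illustrative helper (the name `fertility` is ours), and whitespace-splitting approximates word counts for space-delimited languages only:

```python
def fertility(tokens: list, text: str) -> float:
    """Average tokens per whitespace-delimited word; lower is cheaper."""
    return len(tokens) / len(text.split())

# 4 subword tokens covering 2 words -> fertility of 2.0
print(fertility(["low", "est", "new", "er"], "lowest newer"))  # 2.0
```

For logographic scripts like Chinese, divide by character count instead of word count.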
📈 Industry trend: Vocabulary sizes are growing. GPT-2 (2019) used 50K tokens. GPT-4 (2023) doubled to 100K. GPT-4o (2024) doubled again to 200K. This reflects the push for better multilingual support and code handling.
```text
Larger vocabulary:
  ✅ Fewer tokens per text (more fits in context window)
  ✅ Better representation of rare words and non-English scripts
  ✅ Lower API cost per character

  ❌ Larger embedding table (vocab_size × hidden_dim parameters)
  ❌ Rare tokens get less training signal → undertrained embeddings
  ❌ Softmax over larger vocabulary is slower
```
The empirical sweet spot for modern LLMs has settled around 32K–128K for primarily English models and 100K–200K for multilingual models[8].
One of the most critical aspects of modern LLM performance is tokenization bias, the systematic disadvantage that non-English languages face due to English-centric tokenizer training.
💰 Analogy, currency exchange rates: Imagine a vending machine that only accepts US quarters. If you're American, a candy bar costs 4 quarters. But if you're Japanese, you first have to exchange your yen into quarters, and the exchange rate is terrible, so the same candy bar costs you 15 quarters. That's the tokenization tax: non-English languages pay more tokens for the same meaning, which means higher API costs, shorter effective context windows, and slower generation.
When a tokenizer is trained primarily on English data, non-English text fragments into far more tokens:
| Language | Script | GPT-2 (50K, EN-centric) | GPT-4o (200K, multilingual) |
|---|---|---|---|
| English | Latin | 1.0× (baseline) | 1.0× (baseline) |
| French | Latin | 1.5× | 1.1× |
| German | Latin | 1.8× | 1.2× |
| Chinese | CJK | 3.5× | 1.4× |
| Hindi | Devanagari | 4.2× | 1.8× |
| Korean | Hangul | 3.8× | 1.5× |
| Swahili | Latin | 2.5× | 1.6× |
Multipliers show token count relative to English for equivalent semantic content[9][10].
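The practical impact is simple arithmetic. Using the illustrative multipliers from the table above (the helper name `tokens_for` is ours), the same content costs proportionally more tokens, and therefore more money and context budget:

```python
def tokens_for(english_tokens: int, multiplier: float) -> int:
    """Token count for equivalent content, given a fertility multiplier."""
    return round(english_tokens * multiplier)

# A document that is 1,000 tokens in English, under a 50K EN-centric vocab:
print(tokens_for(1000, 3.5))  # 3500 tokens for Chinese
print(tokens_for(1000, 4.2))  # 4200 tokens for Hindi
```

Since API pricing, context limits, and latency all scale with token count, every multiplier in the table translates directly into cost and capability gaps.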
Petrov et al. (2023)[9] demonstrated that this "tokenization tax" has cascading effects: higher API costs for the same content, shorter effective context windows, higher latency, and degraded downstream task performance for non-English users.
🔬 Research finding: A recent study on the "Token Tax" (2025)[10] found that fertility reliably predicts accuracy on multilingual benchmarks: higher fertility → lower accuracy, consistently across 10 LLMs tested on 16 African languages. Doubling token count quadruples training cost and time.
Mitigations are visible in newer tokenizers:
- GPT-4o's o200k_base encoding significantly reduces CJK (Chinese, Japanese, and Korean) and Indic token inflation compared to cl100k_base.
- SentencePiece's character_coverage parameter ensures a minimum percentage of characters get their own token.

In production systems, raw user input is messy. Engineers must handle Unicode pitfalls to ensure consistent tokenization:
Python's built-in unicodedata library is the standard tool for handling Unicode variations. The following snippet demonstrates the difference between composed and decomposed forms, and how NFKC normalization can convert ligatures back into standard characters:
```python
import unicodedata

text = "café"  # But which "é"?

# Composed form (NFC): é is a single code point (U+00E9)
nfc = unicodedata.normalize('NFC', text)   # "café" - 4 chars

# Decomposed form (NFD): é = e + combining acute accent (two code points)
nfd = unicodedata.normalize('NFD', text)   # "café" - 5 chars

# NFKC: Also decomposes compatibility characters
ligature = "ﬁnance"  # ﬁ is a single ligature character
nfkc = unicodedata.normalize('NFKC', ligature)  # "finance"
```
⚠️ Production critical: If you don't normalize before tokenizing, the same word can produce different token sequences depending on how the Unicode was encoded. Most production tokenizers apply NFKC normalization by default.
| Edge Case | Example | Problem | Solution |
|---|---|---|---|
| Composed vs decomposed | é (1 char) vs e + ´ (2 chars) | Same visual, different tokens | NFC normalization |
| Homoglyphs | а (Cyrillic) vs a (Latin) | Looks identical, different encodings | NFKC + homoglyph detection |
| Zero-width chars | U+200B (zero-width space) | Invisible, can bypass content filters | Strip during preprocessing |
| Emoji sequences | 👨‍👩‍👧‍👦 = 7 code points | One visual glyph, many bytes | Handle via ZWJ-aware segmentation |
| RTL/Bidi text | Arabic mixed with English | Display order ≠ logical order | Apply Unicode bidi algorithm |
| Surrogate pairs | Characters outside BMP | Can cause errors in UTF-16 systems | Use UTF-8 internally |
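A defensive preprocessing pass covers the normalization and zero-width rows of the table. This is a minimal sketch (the function name `preprocess` is ours); note that ZWNJ/ZWJ (U+200C/U+200D) are deliberately kept, since emoji sequences and scripts like Arabic and Persian use them legitimately:

```python
import unicodedata

# Zero-width characters that are safe to strip in most pipelines.
# U+200C/U+200D (ZWNJ/ZWJ) are deliberately NOT stripped.
STRIP = {"\u200b", "\ufeff"}  # zero-width space, byte-order mark

def preprocess(text: str) -> str:
    """NFKC-normalize, then remove strippable zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in STRIP)

print(preprocess("\ufb01nance"))   # "finance" (ﬁ ligature decomposed)
print(preprocess("cafe\u0301"))    # "café" (decomposed é recomposed)
print(preprocess("foo\u200bbar"))  # "foobar" (zero-width space removed)
```

Homoglyph detection and bidi handling need dedicated logic beyond normalization; this pass only guarantees that visually identical normalization variants tokenize identically.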
When implementing tokenization in production, engineers rarely write these algorithms from scratch. The open-source ecosystem provides highly optimized, production-ready libraries that handle the heavy lifting. The choice of library often depends on the target model and the specific deployment environment.
For OpenAI models, tiktoken is the industry standard due to its raw speed and precise alignment with their APIs. When working with open-source models like LLaMA or Mistral, the Hugging Face tokenizers library provides maximum flexibility. For custom model training, especially in multilingual contexts, SentencePiece remains the go-to framework. The following examples demonstrate how to initialize and use these three major libraries to tokenize a simple string:
```python
# ─── tiktoken (OpenAI) ──────────────────────────
# Rust-based, optimized for OpenAI models
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base
tokens = enc.encode("Hello, world!")         # [13225, 11, 2375, 0]
text = enc.decode(tokens)                    # "Hello, world!"
num_tokens = len(tokens)                     # 4

# ─── Hugging Face tokenizers ────────────────────
# Rust-based, supports all architectures
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
out = tok("Hello, world!", return_tensors="pt")
# out.input_ids: tensor([[1, 15043, 29892, 3186, 29991]])

# ─── SentencePiece ──────────────────────────────
# C++-based, for training custom tokenizers
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='llama.model')
pieces = sp.encode("Hello, world!", out_type=str)
# ['▁Hello', ',', '▁world', '!']
```
| Library | Language | Speed | Best For |
|---|---|---|---|
| tiktoken | Rust + Python | ⚡ Fastest | Token counting for OpenAI API; cost estimation |
| HF tokenizers | Rust + Python | ⚡ Very fast | Training & inference with any model |
| SentencePiece | C++ | 🚀 Fast | Training custom tokenizers; multilingual |
To synthesize the differences between these three major approaches, it helps to look at them side-by-side. The fundamental divergence lies in their directionality (bottom-up merging vs top-down pruning) and their decision criteria (raw frequency vs statistical likelihood).
| Feature | BPE | WordPiece | Unigram (SentencePiece) |
|---|---|---|---|
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Merge criterion | Frequency | Likelihood (PMI) | N/A (pruning by likelihood impact) |
| Pre-tokenization | Required (unless byte-level) | Required | Not required |
| Unknown tokens | None (byte fallback) | [UNK] | None (byte fallback) |
| Deterministic | Yes | Yes | Yes (Viterbi), but can sample |
| Subword regularization | Via BPE-dropout | No | Native (multiple segmentations) |
| Primary models | GPT, LLaMA, Mistral | BERT, ELECTRA | T5, XLNet, ALBERT |
| Training speed | Fast | Fast | Slower (EM iterations) |
| Compression ratio | Best (~25-29% better than Unigram) | Good | Good |
Code tokenization requires specific handling distinct from natural language because code has strict syntax rules where whitespace and punctuation matter deeply.
Standard BPE or WordPiece trained on English text often fails on code:
- Identifiers: getUserByEmailAddress might be split into ['get', 'User', 'By', 'Email', 'Address'] (good) or ['getU', 'serB', 'yEm', 'ailA', 'ddress'] (bad). CamelCase splitting is crucial.
- Operators: ===, !==, ->, and :: should be atomic tokens, not fragmented into ['=', '=', '='].
- Numbers: Numeric literals (3.14159) should not be split arbitrarily.

Recent tokenizers like GPT-4o's o200k_base explicitly add code-specific tokens to the vocabulary to reduce fragmentation.[11] The example below compares how standard and optimized tokenizers process the same line of Python code, highlighting how specialized vocabularies preserve semantic boundaries:
```python
# Example: Tokenizing code with GPT-4's cl100k_base (standard) vs o200k_base (optimized)

code = "    def calculate_pi():"

# Standard (GPT-4): 6 tokens
# ['    ', 'def', ' calculate', '_', 'pi', '():']  (fragments the function name)

# Optimized (GPT-4o): 4 tokens
# ['    ', 'def', ' calculate_pi', '():']  (preserves the identifier)
```
💡 Key insight: Reducing the number of tokens per line of code directly increases the context window's effective size for repository-level reasoning. This is why models like StarCoder and Code Llama use specialized BPE training on GitHub data.
- ▁: SentencePiece's visible whitespace marker, supporting both BPE and Unigram modes without pre-tokenization.
- ##: This is a specific convention of WordPiece (and BERT), not a universal property of subword tokenization.

[1] Gage, P. (1994). A New Algorithm for Data Compression.
[2] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
[3] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
[4] Schuster, M., & Nakajima, K. (2012). Japanese and Korean Voice Search.
[5] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[6] Kudo, T., & Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018.
[7] Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.
[8] Raschka, S. (2025). Implementing a Byte Pair Encoding (BPE) Tokenizer from Scratch.
[9] Petrov, A., et al. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023.
[10] Lundin, J. M., Zhang, A., et al. (2025). The Token Tax: Systematic Bias in Multilingual Tokenization. arXiv preprint.
[11] OpenAI (2024). GPT-4o System Card. arXiv preprint.