Master tokenization algorithms (BPE, WordPiece, SentencePiece), understand vocabulary size tradeoffs, and analyze the multilingual tokenization tax.
Imagine trying to teach a computer to read. You can't just feed it raw text, because it doesn't know what letters, words, or sentences are. You need to break the text into small, meaningful pieces that the computer can work with. This process is called tokenization, and it's the very first step in how every language model processes your input.
Think of it like breaking a sentence into Scrabble tiles. You need pieces that are small enough to be flexible, but large enough to carry meaning. The word "unhappiness" might become ["un", "happi", "ness"]. Each piece is meaningful on its own, and together they reconstruct the whole word. Every large language model (LLM) sees the world through these subword tokens, not raw words or characters.
This article is your foundation. Every topic that follows, like attention, embeddings, and inference, builds on understanding how raw text becomes the numerical sequences models actually process.
💡 Key insight: Tokenization is the bridge between human language and model computation. Understanding how and why text gets split into tokens will help you reason about everything from API costs to why some languages perform worse than others in AI systems.
Before looking at the algorithms, it's important to understand the core problem:
| Approach | Vocabulary Size | Avg Sequence Length | Problem |
|---|---|---|---|
| Character-level | ~256 | Very long (5–7× word count) | Sequences are too long for attention; it's hard to learn word semantics |
| Word-level | 500K+ | Short | Huge embedding table; can't handle misspellings, neologisms, or morphology |
| Subword-level | 32K–200K | Balanced | ✅ Best of both worlds |
Subword tokenization solves this by learning a vocabulary of frequently co-occurring character sequences. Common words like "the" stay whole. Rare words like "unhappiness" get split into meaningful pieces: ["un", "happi", "ness"]. This is the approach used by every modern LLM.
BPE is the dominant tokenization algorithm in modern LLMs. Originally a data compression algorithm invented by Philip Gage in 1994[1], it was adapted for Natural Language Processing (NLP) by Sennrich, Haddow & Birch (2016)[2] and later refined into byte-level BPE by Radford et al. (2019)[3] for GPT-2.
The algorithm is simple: it iteratively merges the most frequent pair of adjacent tokens into a new token, repeating until the target vocabulary size is reached.
Here's the complete training algorithm. It takes a corpus of text and a target vocabulary size as input, and outputs the final vocabulary along with the ordered list of merge rules that will be used to tokenize new text:
```python
def merge_pair(ids: list, pair: tuple, new_id: int) -> list:
    """Replaces all occurrences of a pair with a new_id in a list of ids."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i+1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(corpus: str, vocab_size: int) -> tuple[dict, dict]:
    """Train a BPE tokenizer from a text corpus."""

    # Step 1: Encode text as UTF-8 bytes (byte-level BPE)
    # This gives us a base vocabulary of exactly 256 byte values
    token_ids = list(corpus.encode("utf-8"))  # e.g., [72, 101, 108, ...]

    merges = {}  # (pair) -> new_token_id
    vocab = {i: bytes([i]) for i in range(256)}  # Base: all byte values

    num_merges = vocab_size - 256

    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pair_counts = {}
        for j in range(len(token_ids) - 1):
            pair = (token_ids[j], token_ids[j + 1])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

        if not pair_counts:
            break

        # Step 3: Find the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)

        # Step 4: Create new token and replace all occurrences
        new_id = 256 + i
        token_ids = merge_pair(token_ids, best_pair, new_id)

        merges[best_pair] = new_id
        vocab[new_id] = vocab[best_pair[0]] + vocab[best_pair[1]]

    return vocab, merges  # Ordered merge rules are the "model"
```
Let's trace BPE on a tiny corpus to see how it works. We input a simple repeating phrase, and we observe how the algorithm incrementally merges characters into longer subwords:
```text
Corpus: "low low low low lowest lowest newer newer newer wider wider"

Step 0: Initial tokens (characters):
  ['l','o','w',' ','l','o','w',' ','l','o','w',' ','l','o','w',' ',
   'l','o','w','e','s','t',' ','l','o','w','e','s','t',' ',
   'n','e','w','e','r',' ','n','e','w','e','r',' ','n','e','w','e','r',' ',
   'w','i','d','e','r',' ','w','i','d','e','r']

Step 1: Most frequent pair: ('l','o') appears 6 times
        Merge: 'l' + 'o' → 'lo'

Step 2: Most frequent pair: ('lo','w') appears 6 times
        Merge: 'lo' + 'w' → 'low'

Step 3: Most frequent pair: ('e','r') appears 5 times
        Merge: 'e' + 'r' → 'er'

Step 4: Most frequent pair: ('e','s') appears 2 times
        Merge: 'e' + 's' → 'es'

Step 5: Most frequent pair: ('es','t') appears 2 times
        Merge: 'es' + 't' → 'est'

Final vocabulary includes: {..., 'lo', 'low', 'er', 'es', 'est', ...}
```
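The first merge in the trace above is easy to check directly. This is just the pair-counting step in isolation, not the full trainer:

```python
from collections import Counter

corpus = "low low low low lowest lowest newer newer newer wider wider"
tokens = list(corpus)  # character-level, as in the trace

# Count every adjacent pair of tokens
pair_counts = Counter(zip(tokens, tokens[1:]))

print(pair_counts.most_common(3))
# ('l','o') and ('o','w') both appear 6 times, matching Step 1 of the trace
```

Note the tie between ('l','o') and ('o','w'): both occur 6 times, and BPE implementations break such ties by an arbitrary but fixed rule (here, first encountered).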
💡 Key insight: The merge rules capture linguistic patterns. "est" emerges as a suffix, "er" as another. BPE discovers morphology without being told about it.
Once trained, the merge rules are applied in order to new text. This process takes a raw string, splits it into characters (or bytes in modern implementations), and sequentially applies the learned merges to output the final subword tokens and their IDs:
The following step-by-step example demonstrates this inference process. We input the raw word "lowest" and systematically apply the sequence of learned merge rules to output the final subword tokens (low and est) and their corresponding IDs:
```text
Input: "lowest"

Step 0: ['l', 'o', 'w', 'e', 's', 't']         ← start with characters
Merge 1 ('l','o' → 'lo'):   ['lo', 'w', 'e', 's', 't']
Merge 2 ('lo','w' → 'low'): ['low', 'e', 's', 't']
Merge 3 ('e','s' → 'es'):   ['low', 'es', 't']
Merge 4 ('es','t' → 'est'): ['low', 'est']

Final tokens: ['low', 'est'] → IDs: [vocab['low'], vocab['est']]
```
Note: This example uses character-level BPE for clarity. Real implementations like GPT-2 use byte-level BPE, where the base vocabulary starts at 256 byte values instead of individual characters. The merge logic is identical; only the base token unit changes.
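The replay step can be sketched in a few lines. This is a simplification of production implementations (which repeatedly apply the lowest-ranked applicable merge rather than making one pass per rule), and the merges dict is assumed to be in learned order — Python dicts preserve insertion order:

```python
def apply_merges(tokens: list, merges: dict) -> list:
    """Replay learned merges, in order, over a token sequence."""
    for pair, merged in merges.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)   # pair found: emit the merged token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Merge rules from the toy training run, in the order they were learned
merges = {("l", "o"): "lo", ("lo", "w"): "low",
          ("e", "s"): "es", ("es", "t"): "est"}

print(apply_merges(list("lowest"), merges))  # ['low', 'est']
```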
GPT-2 introduced a critical innovation: byte-level BPE[3]. Instead of starting from Unicode characters, it starts from raw UTF-8 bytes (a base vocabulary of exactly 256). This guarantees that any input can be encoded: every string decomposes into bytes already in the vocabulary, so unknown tokens are impossible.
⚠️ Warning: Non-ASCII characters (like Chinese characters or emoji) require 2–4 bytes each, so they initially get split into multiple tokens before merges can compress them. This is one root cause of the multilingual "tokenization tax" (discussed later).
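The per-character byte cost is easy to verify in Python:

```python
for ch in ["a", "é", "猫", "😀"]:
    n = len(ch.encode("utf-8"))
    print(f"{ch!r}: {n} byte(s)")

# 'a' is 1 byte, 'é' is 2, the CJK character '猫' is 3, and the emoji is 4 —
# so before any merges apply, non-ASCII text starts out at several tokens
# per visible character
```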
Because of these advantages, byte-level BPE has become the industry standard for generative models. GPT-2, OpenAI's tiktoken-based models, Qwen, Mistral, Code Llama, and StarCoder all rely on it. Virtually every modern autoregressive LLM uses BPE.
WordPiece was developed by Schuster & Nakajima (2012)[4] at Google for Japanese and Korean voice search, then popularized by BERT (Devlin et al., 2019)[5].
The key difference is that WordPiece merges are chosen to maximize the likelihood of the training corpus, not just the raw frequency. The merge score is:

score(a, b) = count(ab) / (count(a) × count(b))
This formula divides the frequency of the merged pair by the product of the individual frequencies. This is essentially pointwise mutual information (PMI), which measures how much more likely two tokens appear together than you'd expect by chance. A pair like ("q", "u") scores high because "qu" is far more common than the individual frequencies of "q" and "u" would predict. Unlike BPE (which just picks the most frequent pair), WordPiece picks the pair with the strongest statistical association.
Let's look at a scenario that shows the difference. Given the following token frequencies, BPE chooses the most frequent pair, while WordPiece chooses the one with the highest statistical association:
```text
Token frequencies: "t" appears 1000x, "h" appears 800x, "th" appears 500x
                   "x" appears 10x,   "z" appears 8x,   "xz" appears 7x

BPE picks:  "th" (frequency 500 > 7) → raw count wins
WordPiece:  "xz" (score: 7/(10×8) = 0.0875 vs "th": 500/(1000×800) = 0.000625)
```
WordPiece would merge "xz" first because the tokens almost always co-occur. It's a much stronger signal than the common-but-independent "th".
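The scores in the example can be computed directly. This helper is illustrative, not part of any library:

```python
def merge_score(pair_count: int, left_count: int, right_count: int) -> float:
    """WordPiece-style merge score: count(ab) / (count(a) * count(b))."""
    return pair_count / (left_count * right_count)

score_th = merge_score(500, 1000, 800)  # 0.000625
score_xz = merge_score(7, 10, 8)        # 0.0875

# WordPiece prefers "xz" despite "th" being ~70x more frequent
assert score_xz > score_th
```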
WordPiece also introduces the ## prefix convention: ## marks subwords that don't start a new word, which preserves word boundaries. For example, tokenizing "unhappiness" produces a list where ## shows which pieces attach to the previous token:
text1Input: "unhappiness" 2 3WordPiece tokenization: 4 ["un", "##happi", "##ness"] 5 6The ## tells the model: "this token continues the previous word" 7Without ##: the model couldn't distinguish "un happy" from "unhappy"
Unlike BPE, which replays merge rules in order, WordPiece uses a greedy longest-match approach at inference. The function below takes a word and the vocabulary, then repeatedly finds the longest valid subword from the start of the remaining string:
```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """WordPiece greedy longest-match tokenization."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        found = False
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # Continuation prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            return ["[UNK]"]  # BERT-style: the entire word becomes [UNK]
        start = end
    return tokens
```
⚠️ Warning: If WordPiece can't tokenize a word at all, it falls back to [UNK] (the unknown token). BPE never produces unknown tokens because it falls back to individual bytes.
| Aspect | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Highest likelihood gain (PMI) |
| Subword prefix | No special marker | ## for continuations |
| Inference tokenization | Replay merge rules in order | Greedy longest-match |
| Unknown tokens | Never (byte fallback) | [UNK] if no match |
| Used by | GPT, Qwen3.5, Mistral | BERT, DistilBERT, ELECTRA |
| Dominant in | Generative / autoregressive | Encoder / masked LMs |
SentencePiece, developed by Kudo & Richardson (2018)[6] at Google, isn't a new tokenization algorithm. It's a framework that makes tokenization truly language-independent by removing the need for pre-tokenization.
💡 Analogy: Think of traditional tokenizers like English-language spell checkers. They work great for English but choke on Japanese or Thai because they assume words are separated by spaces. SentencePiece is more like a universal translator that doesn't assume anything about how languages work. It treats all text as a raw stream of characters, making it equally capable with any writing system.
Traditional tokenizers like BPE and WordPiece assume text can be split into words first. This works for English, but fails for languages without spaces or with many compound words:
```text
English:  "I love cats" → ["I", "love", "cats"] → subtokenize each
Japanese: "猫が好きです" → ??? No spaces to split on!
Thai:     "ฉันรักแมว" → ??? Also no spaces!
German:   "Lebensversicherungsgesellschaft" → One giant compound word
```
SentencePiece solves this by treating the entire input as a raw character stream, including whitespace. It encodes spaces with a special character (▁), creating a continuous sequence that can be tokenized without language-specific rules:
```text
SentencePiece: "I love cats" → "▁I▁love▁cats" → subtokenize directly
               "猫が好きです" → "▁猫が好きです" → subtokenize directly
```
The ▁ (Unicode U+2581, "lower one eighth block") explicitly represents whitespace as a visible token. This means ▁the (word start) differs from the (mid-word continuation).

SentencePiece supports two training algorithms:
The first is standard BPE, but operating on the raw character stream (including ▁). This is what Qwen3.5 and T5 use.
The second, the Unigram model introduced by Kudo (2018)[7], takes the opposite approach from BPE:
π‘ Analogy: BPE is like building a word from individual letter tiles: you start small and keep gluing pieces together. Unigram is like Michelangelo sculpting David. You start with a massive block of marble (a huge vocabulary) and chisel away the pieces that matter least, revealing the optimal vocabulary hidden inside.
The Unigram algorithm works roughly as follows:

1. Start with a large candidate vocabulary (e.g., all frequent substrings).
2. Fit a unigram language model over segmentations using expectation-maximization (EM).
3. For each token, estimate how much the corpus likelihood drops if that token is removed.
4. Prune the least useful tokens and repeat until the target vocabulary size is reached.
💡 Key insight: Unlike BPE (which produces a single deterministic segmentation), Unigram can sample from multiple valid segmentations. This is called subword regularization. During training, you expose the model to different ways of splitting the same word, which acts as a powerful data augmentation technique and improves robustness.[7]
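To see how the best segmentation is found at inference time, here is a toy Viterbi decoder over a hypothetical unigram vocabulary. The tokens and log-probabilities below are made up for illustration; real models have tens of thousands of entries:

```python
import math

def viterbi_segment(word: str, logp: dict) -> list:
    """Find the highest-probability segmentation under a unigram model."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i] = best log-prob of word[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Walk the backpointers to recover the segmentation
    pieces, i = [], n
    while i > 0:
        j = back[i]
        pieces.append(word[j:i])
        i = j
    return pieces[::-1]

# Hypothetical vocabulary with log-probabilities
logp = {"un": math.log(0.10), "happi": math.log(0.05),
        "ness": math.log(0.08), "unhappi": math.log(0.001)}

print(viterbi_segment("unhappiness", logp))  # ['un', 'happi', 'ness']
```

Here ["un", "happi", "ness"] beats ["unhappi", "ness"] because the product of its token probabilities is higher; subword regularization samples from these alternatives instead of always taking the argmax.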
Using SentencePiece in production is straightforward with the official Python library. The example below shows how to train a tokenizer on raw text files and then use it to encode and decode new inputs. The training process creates a .model file that can perfectly reverse the tokenization:
```python
import sentencepiece as spm

# Training: language-independent, works on raw text files
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',            # or 'unigram'
    character_coverage=0.9995,   # % of characters to cover
    byte_fallback=True,          # Handle unknown chars via UTF-8 bytes
)

# Usage
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

print(sp.encode("Hello world!", out_type=str))
# ['▁Hello', '▁world', '!']

print(sp.encode("猫が好きです", out_type=str))
# ['▁猫', 'が', '好き', 'です']

# Perfect reconstruction
print(sp.decode(['▁Hello', '▁world', '!']))
# "Hello world!"
```
SentencePiece's primary advantage is that a single library and model file handles any language without code changes. In BPE mode, it powers models like Qwen3.5, T5, mBART, and ALBERT. In Unigram mode, it's used by XLNet, ALBERT, and some variants of mBART. It remains the standard framework for researchers training custom tokenizers from scratch, especially for multilingual or non-English datasets.
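Because whitespace survives inside the tokens themselves, detokenization is trivial string manipulation with no language-specific rules. This is a sketch of the idea, not the library's actual implementation:

```python
pieces = ["▁I", "▁love", "▁cats"]

# Concatenate, then turn every ▁ (U+2581) back into a space
text = "".join(pieces).replace("\u2581", " ").lstrip()
print(text)  # "I love cats"
```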
Vocabulary size is one of the most consequential design decisions in building a tokenizer. It directly controls the tradeoff between sequence compression (fewer tokens per text) and embedding table size (memory cost).
| Vocab Size | Fertility (EN) | Fertility (ZH) | Embedding Memory | Representative Model |
|---|---|---|---|---|
| 8K | ~1.8 tok/word | ~5.0 tok/char | 16 MB | Small research models |
| 32K | ~1.3 tok/word | ~2.5 tok/char | 64 MB | Qwen3.5, T5 |
| 50K | ~1.2 tok/word | ~2.0 tok/char | 100 MB | GPT-2 (50,257) |
| 100K | ~1.1 tok/word | ~1.5 tok/char | 200 MB | GPT-4 (cl100k_base) |
| 200K | ~1.05 tok/word | ~1.2 tok/char | 400 MB | OpenAI o200k_base models |
Fertility = average number of tokens per word (or per character for logographic scripts). Lower is better for efficiency.
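The embedding-memory column can be reproduced with simple arithmetic, assuming a 512-dimensional fp32 embedding table (an assumption for illustration — real models vary widely in hidden size and precision):

```python
def embedding_megabytes(vocab_size: int, d_model: int = 512,
                        bytes_per_param: int = 4) -> float:
    """Memory for the token-embedding matrix alone: vocab x dim x bytes."""
    return vocab_size * d_model * bytes_per_param / 1e6

for v in (8_000, 32_000, 50_000, 100_000, 200_000):
    print(f"{v:>7} tokens -> {embedding_megabytes(v):6.1f} MB")
# 8K -> ~16 MB, 32K -> ~66 MB, 100K -> ~205 MB: the same order of
# magnitude as the table's estimates
```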
💡 Key insight: Vocabulary sizes are growing. GPT-2 used 50K tokens, and modern models have expanded to 100K+ tokens. Larger vocabularies reduce sequence length (more fits in context) and improve multilingual support, at the cost of larger embedding tables and slightly slower softmax during inference.
It might seem like a bigger vocabulary is always better, but it comes with significant trade-offs: a larger embedding table, a slower output softmax, and more rare tokens whose embeddings are seen too infrequently during training to be learned well. The empirical sweet spot for modern LLMs has settled around 32K–128K for primarily English models and 100K–200K for multilingual models.[8]
One of the most critical aspects of modern LLM performance is tokenization bias, the systematic disadvantage that non-English languages face due to English-centric tokenizer training.
💡 Analogy: Imagine a vending machine that only accepts US quarters. If you're American, a candy bar costs 4 quarters. But if you're Japanese, you first have to exchange your yen into quarters, and the exchange rate is terrible, so the same candy bar costs you 15 quarters. That's the tokenization tax: non-English languages pay more tokens for the same meaning, which means higher API costs, shorter effective context windows, and slower generation.
When a tokenizer is trained primarily on English data, non-English text fragments into far more tokens:
| Language | Script | GPT-2 (50K, EN-centric) | o200k_base-style tokenizer (200K, multilingual) |
|---|---|---|---|
| English | Latin | 1.0× (baseline) | 1.0× (baseline) |
| French | Latin | 1.5× | 1.1× |
| German | Latin | 1.8× | 1.2× |
| Chinese | CJK | 3.5× | 1.4× |
| Hindi | Devanagari | 4.2× | 1.8× |
| Korean | Hangul | 3.8× | 1.5× |
| Swahili | Latin | 2.5× | 1.6× |
Multipliers show token count relative to English for equivalent semantic content.[9][10]
Petrov et al. (2023)[9] demonstrated that this "tokenization tax" has cascading effects: non-English users pay more per API call, fit less content into the same context window, and experience slower generation for equivalent text.
🔬 Research insight: A recent study on the "Token Tax" (2025)[10] found that fertility reliably predicts accuracy on multilingual benchmarks: higher fertility → lower accuracy, consistently across 10 LLMs tested on 16 African languages. Doubling token count quadruples training cost and time.
- The o200k_base encoding significantly reduces CJK (Chinese, Japanese, and Korean) and Indic token inflation compared to cl100k_base.
- SentencePiece's character_coverage parameter ensures a minimum percentage of characters get their own token.

In production systems, raw user input is messy. Engineers must handle Unicode pitfalls to ensure consistent tokenization:
Python's built-in unicodedata library is the standard tool for handling Unicode variations. The following snippet demonstrates the difference between composed and decomposed forms. It takes a raw, potentially inconsistent string as input and outputs a normalized string (such as via NFKC), ensuring that visually identical characters always produce the same tokens:
```python
import unicodedata

text = "café"  # But which "é"?

# Composed form (NFC): é is a single code point (U+00E9)
nfc = unicodedata.normalize('NFC', text)   # "café" - 4 chars

# Decomposed form (NFD): é = e + combining acute accent (two code points)
nfd = unicodedata.normalize('NFD', text)   # "café" - 5 chars

# NFKC: Also decomposes compatibility characters
ligature = "ﬁnance"  # ﬁ is a single ligature character
nfkc = unicodedata.normalize('NFKC', ligature)  # "finance"
```
🎯 Production tip: If you don't normalize before tokenizing, the same word can produce different token sequences depending on how the Unicode was encoded. Most production tokenizers apply NFKC normalization by default.
| Edge Case | Example | Problem | Solution |
|---|---|---|---|
| Composed vs decomposed | é (1 char) vs e+´ (2 chars) | Same visual, different tokens | NFC normalization |
| Homoglyphs | а (Cyrillic) vs a (Latin) | Looks identical, different encodings | NFKC + homoglyph detection |
| Zero-width chars | U+200B (zero-width space) | Invisible, can bypass content filters | Strip during preprocessing |
| Emoji sequences | 👨‍👩‍👧‍👦 = 7 code points | One visual glyph, many bytes | Handle via Zero Width Joiner (ZWJ)-aware segmentation |
| RTL/Bidi text | Arabic mixed with English | Display order ≠ logical order | Apply Unicode bidi algorithm |
| Surrogate pairs | Characters outside the Basic Multilingual Plane (BMP) | Can cause errors in UTF-16 systems | Use UTF-8 internally |
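A minimal preprocessing sketch combining two of these defenses — NFKC normalization plus stripping the zero-width space and BOM. Note that stripping ZWJ (U+200D) as well would break legitimate emoji sequences, so it is deliberately left alone here:

```python
import unicodedata

def sanitize(text: str) -> str:
    """Normalize to NFKC and drop invisible characters that can evade filters."""
    text = unicodedata.normalize("NFKC", text)
    # U+200B zero-width space, U+FEFF byte-order mark
    return "".join(ch for ch in text if ch not in {"\u200b", "\ufeff"})

print(sanitize("pass\u200bword"))  # "password" — hidden break removed
print(sanitize("ﬁnance"))          # "finance" — ligature decomposed by NFKC
```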
When implementing tokenization in production, engineers rarely write these algorithms from scratch. The open-source ecosystem provides highly optimized, production-ready libraries that handle the heavy lifting. The choice of library often depends on the target model and the specific deployment environment.
For OpenAI models, tiktoken is the industry standard due to its raw speed and precise alignment with their APIs. When working with open-source models like Qwen3.5 or Mistral, the Hugging Face tokenizers library provides maximum flexibility. For custom model training, especially in multilingual contexts, SentencePiece remains the go-to framework. The following examples demonstrate how to initialize and use these three major libraries. In each case, the library takes raw text as input and outputs a sequence of integer token IDs that a model can process, as well as providing methods to decode those IDs back into readable text:
```python
# ─── tiktoken (OpenAI) ──────────────────────────
# Rust-based, optimized for OpenAI models
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # 200K vocab, used by newer OpenAI tokenizers
tokens = enc.encode("Hello, world!")       # [13225, 11, 2375, 0]
text = enc.decode(tokens)                  # "Hello, world!"
num_tokens = len(tokens)                   # 4

# ─── HuggingFace tokenizers ─────────────────────
# Rust-based, supports all architectures
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
out = tok("Hello, world!", return_tensors="pt")
# out.input_ids: tensor([[8774, 11, 1917, 0]])

# ─── SentencePiece ──────────────────────────────
# C++-based, for training custom tokenizers
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')
pieces = sp.encode("Hello, world!", out_type=str)
# ['▁Hello', ',', '▁world', '!']
```
| Library | Language | Speed | Best For |
|---|---|---|---|
| tiktoken | Rust+Python | ⚡ Fastest | Token counting for OpenAI API; cost estimation |
| HF tokenizers | Rust+Python | ⚡ Very fast | Training & inference with any model |
| SentencePiece | C++ | Fast | Training custom tokenizers; multilingual |
To summarize the differences, it helps to look at these three approaches side-by-side. The key distinctions are their directionality (bottom-up merging vs. top-down pruning) and their decision criteria (raw frequency vs. statistical likelihood).
| Feature | BPE | WordPiece | Unigram (SentencePiece) |
|---|---|---|---|
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Merge criterion | Frequency | Likelihood (PMI) | N/A (pruning by likelihood impact) |
| Pre-tokenization | Required (unless byte-level) | Required | Not required |
| Unknown tokens | None (byte fallback) | [UNK] | None (byte fallback) |
| Deterministic | Yes | Yes | Yes (Viterbi), but can sample |
| Subword regularization | Via BPE-dropout | No | Native (multiple segmentations) |
| Primary models | GPT, Qwen3.5, Mistral | BERT, ELECTRA | T5, XLNet, ALBERT |
| Training speed | Fast | Fast | Slower (EM iterations) |
| Compression ratio | Good | Good | Good |
Code tokenization requires specific handling distinct from natural language because code has strict syntax rules where whitespace and punctuation matter deeply.
Standard BPE or WordPiece trained on English text often fails on code:
- Identifiers: getUserByEmailAddress might be split into ['get', 'User', 'By', 'Email', 'Address'] (good) or ['getU', 'serB', 'yEm', 'ailA', 'ddress'] (bad). CamelCase splitting is crucial.
- Operators: ===, !==, ->, :: should be atomic tokens, not fragmented into ['=', '=', '='].
- Numbers: numeric literals (e.g., 3.14159) shouldn't be split arbitrarily.

Recent tokenizers like OpenAI's o200k_base explicitly add code-specific tokens to the vocabulary to reduce fragmentation. The example below compares how a smaller tokenizer and the larger, code-optimized one handle the same Python snippet. The larger tokenizer preserves the function name as a single unit, avoiding the fragmentation that would otherwise occur:
```python
# Example: Tokenizing code with a smaller tokenizer vs OpenAI's o200k_base

code = " def calculate_pi():"

# Smaller tokenizer (~50K vocab): fragments the function name
# [' ', 'def', ' calculate', '_', 'pi', '():']  (6 tokens)

# o200k_base (~200K vocab): preserves common identifiers
# [' ', 'def', ' calculate_pi', '():']  (4 tokens)
```
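CamelCase splitting, mentioned above, is typically handled by a regex pre-tokenizer. The pattern below is a simplified illustration, not the exact rule any production tokenizer uses:

```python
import re

# Lowercase runs (with optional leading capital), all-caps runs, digit runs
CAMEL = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def split_identifier(name: str) -> list:
    """Split a camelCase/PascalCase identifier into word-like pieces."""
    return CAMEL.findall(name)

print(split_identifier("getUserByEmailAddress"))
# ['get', 'User', 'By', 'Email', 'Address']
print(split_identifier("HTTPServer2"))
# ['HTTP', 'Server', '2']
```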
💡 Key insight: Reducing the number of tokens per line of code directly increases the context window's effective size for repository-level reasoning. This is why models like StarCoder and Code Llama use specialized BPE training on GitHub data.
- ▁: SentencePiece's visible whitespace marker, supporting both BPE and Unigram modes without pre-tokenization.
- ##: This is a specific convention of WordPiece (and BERT), not a universal property of subword tokenization.

A New Algorithm for Data Compression.
Gage, P. · 1994
Neural Machine Translation of Rare Words with Subword Units.
Sennrich, R., Haddow, B., & Birch, A. · 2016 · ACL 2016
Language Models are Unsupervised Multitask Learners.
Radford, A., et al. · 2019
Japanese and Korean Voice Search.
Schuster, M. & Nakajima, K. · 2012
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Devlin, J., et al. · 2019 · NAACL 2019
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.
Kudo, T. & Richardson, J. · 2018 · EMNLP 2018
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
Kudo, T. · 2018 · ACL 2018
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch.
Raschka, S. · 2025
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models.
Petrov, A., et al. · 2023
The Token Tax: Systematic Bias in Multilingual Tokenization.
Lundin, J. M., Zhang, A., et al. · 2025 · arXiv preprint