# Tokenization & Vocabulary Design
## Canonical Papers

- Neural Machine Translation of Rare Words with Subword Units
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- ByT5: Towards a token-free future with pre-trained byte-to-byte models

## Core Mathematics
Tokenization is the discrete interface between raw text/bytes and a transformer. The model never sees “characters” or “words”; it sees integer token IDs drawn from a fixed vocabulary V.
---
## 1) Tokenization is a segmentation into vocabulary items
A tokenizer maps a string or byte sequence x into a sequence of tokens drawn from the vocabulary V:

$$
\tau(x) = (t_1, t_2, \ldots, t_n), \qquad t_i \in V,
$$

and detokenization maps the token sequence back to the original text. Each token then becomes an integer ID that indexes the embedding table.
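A minimal sketch of that pipeline, using a made-up toy vocabulary and greedy longest-match segmentation purely for illustration (not any real tokenizer's algorithm; BPE and Unigram follow below):

```python
# String -> pieces -> integer IDs, with a toy vocabulary and
# greedy longest-match segmentation (illustrative only).

toy_vocab = ["un", "happi", "ness", "h", "a", "p", "i", "n", "e", "s", "u"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def tokenize(text: str) -> list[str]:
    """Greedily pick the longest vocabulary piece that matches at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):   # try longest match first
            piece = text[pos:end]
            if piece in token_to_id:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary item covers {text[pos]!r}")
    return tokens

tokens = tokenize("unhappiness")                # ['un', 'happi', 'ness']
ids = [token_to_id[t] for t in tokens]          # [0, 1, 2]
print(tokens, ids)
```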
---
## 2) Unigram tokenization as maximum-likelihood segmentation (SentencePiece-style)
Unigram models treat tokenization itself as probabilistic inference: among all segmentations of the input x into vocabulary pieces, choose the one that maximizes the product of per-token prior probabilities:

$$
t^{*} = \operatorname*{arg\,max}_{(t_1, \ldots, t_n)\;:\;t_1 t_2 \cdots t_n = x} \; \prod_{i=1}^{n} p(t_i).
$$

In practice this is a Viterbi / dynamic-programming problem: search over all segmentations of x and pick the highest-probability path.
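A sketch of that dynamic program, with an invented toy vocabulary and invented token probabilities chosen purely for illustration:

```python
import math

# Unigram-style segmentation as maximum-likelihood inference:
# Viterbi over all ways to cover the string with vocabulary pieces,
# maximizing the sum of log token priors. Probabilities are made up.

log_p = {piece: math.log(p) for piece, p in {
    "un": 0.05, "happi": 0.02, "ness": 0.04, "unhappi": 0.0005,
    "u": 0.01, "n": 0.03, "e": 0.04, "s": 0.04, "h": 0.02,
    "a": 0.05, "p": 0.02, "i": 0.03,
}.items()}

def viterbi_segment(text: str) -> list[str]:
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best log-prob of text[:i]
    back = [0] * (n + 1)           # back[i] = start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_p and best[start] + log_p[piece] > best[end]:
                best[end] = best[start] + log_p[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("no segmentation covers the input")
    # walk the backpointers to recover the highest-probability path
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(viterbi_segment("unhappiness"))  # ['un', 'happi', 'ness']
```

Note that the winner is the segmentation with the highest total log-probability, not necessarily the one with the fewest or longest pieces.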
---
## 3) BPE (byte-pair encoding) builds the vocabulary by merges
BPE starts from base symbols (characters or bytes), then repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary item:

$$
(a^{*}, b^{*}) = \operatorname*{arg\,max}_{(a, b)} \ \mathrm{count}(a, b), \qquad V \leftarrow V \cup \{\, a^{*} b^{*} \,\}.
$$

Each merge grows the vocabulary by one token and usually shortens sequences on texts where that pair is common.
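A sketch of the training loop on a toy corpus (this leaves out details of real implementations such as end-of-word markers and byte-level fallback):

```python
from collections import Counter

# Minimal BPE training loop: start from characters, count adjacent pairs
# across the corpus, merge the most frequent pair into a new symbol, repeat.
# Corpus and number of merges are toy values for illustration.

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = Counter(tuple(w) for w in corpus)       # each word as a tuple of symbols

def most_frequent_pair(words: Counter) -> tuple[str, str]:
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def apply_merge(words: Counter, pair: tuple[str, str]) -> Counter:
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # new merged token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

vocab = sorted({ch for w in corpus for ch in w})          # base symbols
for _ in range(5):                                        # 5 merges, for illustration
    pair = most_frequent_pair(words)
    vocab.append(pair[0] + pair[1])                       # vocabulary grows by one
    words = apply_merge(words, pair)
    print("merged", pair, "->", pair[0] + pair[1])
print(vocab)
```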
---
## 4) Vocabulary size is a real model parameter (embedding + softmax head)
Token IDs index the embedding matrix, and the output head produces logits over the same vocabulary:

$$
E \in \mathbb{R}^{|V| \times d}, \qquad \mathrm{logits} = W h, \quad W \in \mathbb{R}^{|V| \times d}, \; h \in \mathbb{R}^{d}.
$$

So the vocabulary-dependent parameter count scales roughly like

$$
\underbrace{|V|\,d}_{\text{embedding}} \;+\; \underbrace{|V|\,d}_{\text{output head}} \;=\; 2\,|V|\,d \qquad (\text{or } |V|\,d \text{ if the two matrices are tied}).
$$

Tradeoff: a larger vocabulary usually means fewer tokens per text (cheaper context and KV cache), but a larger embedding and output layer (more parameters and memory).
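A back-of-the-envelope sketch of that scaling; the helper, vocabulary sizes, and hidden dimension below are illustrative, not any specific model's configuration:

```python
# Parameter count of the vocabulary-dependent layers, assuming an input
# embedding of shape |V| x d and an (optionally tied) output projection.

def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    embedding = vocab_size * d_model                       # |V| x d
    output_head = 0 if tied else vocab_size * d_model      # d x |V| logits projection
    return embedding + output_head

for V in (32_000, 128_000, 256_000):
    untied = vocab_params(V, 4096) / 1e9
    tied = vocab_params(V, 4096, tied=True) / 1e9
    print(f"|V|={V:>7,}  d=4096  untied: {untied:5.2f}B params   tied: {tied:5.2f}B params")
```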
---
### Why this is a “foundation”

Tokenization silently shapes:

- what patterns become “single atoms” (code, whitespace, common substrings),
- how expensive prompts are (tokens per character),
- multilingual/Unicode behavior (bytes vs scripts vs normalization),
- and even security surfaces (invisible characters, homoglyphs, normalization).
## Why It Matters for Modern Models
- Tokens are the *unit of compute and cost*: prompt price, latency, and context usage are measured in tokens, not characters.
- Tokenizer design reshapes capability: code, math, and multilingual text can become easy or painfully fragmented depending on subword boundaries.
- Long-context engineering (#30) depends on token counts: “128k context” is 128k tokens, and tokenization determines how much text fits.
- Vocabulary size is an architectural knob: embedding + softmax head scale with |V|, trading parameters for shorter sequences.
- Unicode edge cases (normalization, invisible characters) affect reliability and safety: two visually identical strings can tokenize differently.
## Missing Intuition
What is still poorly explained in textbooks and papers:
- “Tokens are not words”: the model’s atoms are whatever the tokenizer decided (often mixing whitespace, punctuation, and subwords).
- BPE is compression-by-frequency: it merges what’s common in the training distribution; rare strings (especially identifiers / numbers) can explode into many tokens.
- Unigram tokenization is inference: it chooses the *most likely segmentation* under token priors, not necessarily the longest tokens.
- Byte-level tokenization is robust but expensive: non-ASCII scripts and emoji expand into several bytes per visible character, and therefore into more tokens.
- Normalization matters: different Unicode forms (NFC/NFKC), zero-width characters, and NBSP can change token boundaries and costs without changing what you “see” (a small sketch of both effects follows below).
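A small sketch of the last two points, using Python's built-in `unicodedata`; the byte counts approximate what a byte-level (ByT5-style) tokenizer pays per visible character:

```python
import unicodedata

# (1) Byte-level expansion: non-ASCII text costs more UTF-8 bytes, and hence
#     more byte-level tokens, per visible character.
# (2) Normalization: visually identical strings can be different code-point
#     sequences until normalized, so they may tokenize differently.

for text in ["hello", "héllo", "こんにちは", "👩‍💻"]:
    # repr() exposes invisible characters such as the zero-width joiner in the emoji
    print(f"{text!r:>16}  chars={len(text):>2}  utf8_bytes={len(text.encode('utf-8')):>2}")

decomposed = "e\u0301"                            # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed 'é'
print(decomposed == composed,                     # False: different code points
      [hex(ord(c)) for c in decomposed],          # ['0x65', '0x301']
      [hex(ord(c)) for c in composed])            # ['0xe9']
```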