#33 Representations

🔤 Tokenization & Vocabulary Design

Canonical Papers

Neural Machine Translation of Rare Words with Subword Units

Sennrich, Haddow & Birch · 2016 · ACL

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Kudo & Richardson · 2018 · EMNLP (System Demonstrations)

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Xue et al. · 2021 · arXiv

Core Mathematics

Tokenization is the discrete interface between raw text/bytes and a transformer. The model never sees “characters” or “words” — it sees token IDs from a vocabulary $\mathcal V$.

---

## 1) Tokenization is a segmentation into vocabulary items

A tokenizer maps a string/byte sequence $x$ into tokens $t_1,\dots,t_n$:

$$x = \mathrm{concat}(t_1,\dots,t_n), \qquad t_i \in \mathcal V$$

Then each token becomes an integer ID $\mathrm{id}(t_i) \in \{0,\dots,|\mathcal V|-1\}$ that indexes the embedding table.
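
A minimal sketch of that mapping, using a toy hand-written vocabulary (not any real tokenizer's): token strings become integer row indices into the embedding table.

```python
# Toy sketch: tokens -> integer IDs that index the embedding table.
# The vocabulary below is made up for illustration, not a real tokenizer's.
vocab = {"<unk>": 0, "token": 1, "ization": 2, "t": 3, "o": 4, "k": 5}

def to_ids(tokens):
    """Map token strings to integer IDs, falling back to <unk>."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

tokens = ["token", "ization"]             # one segmentation of "tokenization"
assert "".join(tokens) == "tokenization"  # concat(t_1..t_n) == x
print(to_ids(tokens))                     # [1, 2] -> rows of the embedding matrix
```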

---

## 2) Unigram tokenization as maximum-likelihood segmentation (SentencePiece-style)

Unigram models treat tokenization itself as probabilistic inference: choose the segmentation whose tokens are most likely under a unigram prior:

$$\hat{t}_{1:n} = \arg\max_{t:\;\mathrm{concat}(t)=x}\;\sum_{i=1}^{n} \log p(t_i)$$

This is a Viterbi / dynamic programming problem in practice: you search over segmentations and pick the highest-probability path.
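
A compact sketch of that dynamic program, with a made-up vocabulary and made-up token probabilities (real unigram tokenizers such as SentencePiece learn $p(t)$ from data):

```python
import math

# Illustrative token probabilities (not learned from any corpus).
logp = {t: math.log(p) for t, p in {
    "token": 0.05, "ization": 0.02, "t": 0.01, "o": 0.01, "k": 0.01,
    "e": 0.01, "n": 0.01, "i": 0.01, "z": 0.005, "a": 0.01,
}.items()}

def viterbi_segment(x, logp, max_len=10):
    # best[i] = (score of the best segmentation of x[:i], split point, last token)
    best = [(-math.inf, None, None)] * (len(x) + 1)
    best[0] = (0.0, None, None)
    for i in range(1, len(x) + 1):
        for j in range(max(0, i - max_len), i):
            piece = x[j:i]
            if piece in logp and best[j][0] + logp[piece] > best[i][0]:
                best[i] = (best[j][0] + logp[piece], j, piece)
    tokens, i = [], len(x)
    while i > 0:                       # backtrack along the best path
        _, j, piece = best[i]
        tokens.append(piece)
        i = j
    return tokens[::-1]

print(viterbi_segment("tokenization", logp))   # ['token', 'ization']
```

The whole-word pieces win here because one log-probability of a common token beats the sum of many per-character log-probabilities.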

---

## 3) BPE (byte-pair encoding) builds the vocabulary by merges

BPE starts from base symbols (characters or bytes), then repeatedly merges the most frequent adjacent pair:

$$(a,b)^{*} = \arg\max_{(a,b)}\; \mathrm{count}(a\,b) \qquad\Rightarrow\qquad a\;b \to ab$$

Each merge increases vocabulary size (one new token) and usually decreases sequence length on texts where that pair is common.
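
A toy sketch of one training round on a tiny made-up corpus: count adjacent symbol pairs, then merge the most frequent pair everywhere it occurs.

```python
from collections import Counter

# Tiny illustrative corpus, each word as a list of base symbols (characters).
corpus = [list("low"), list("lower"), list("lowest"), list("slow")]

def most_frequent_pair(corpus):
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    a, b = pair
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(a + b)   # new vocabulary item, e.g. ('l','o') -> 'lo'
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # ('l','o') and ('o','w') both occur 4 times
print(pair, merge(corpus, pair))
```

Running this repeatedly grows the vocabulary one merge at a time; common words collapse into single tokens while rare strings stay split into many pieces.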

---

## 4) Vocabulary size is a real model parameter (embedding + softmax head)

Token IDs index the embedding matrix, and logits are produced over the same vocabulary:

$$W_{\text{embed}} \in \mathbb{R}^{|\mathcal V|\times d}, \qquad W_{\text{out}} \in \mathbb{R}^{|\mathcal V|\times d}$$

So parameters scale roughly like:

$$\text{params}_{\text{token}} \approx 2\,|\mathcal V|\,d$$

Tradeoff: larger vocab can mean fewer tokens (cheaper context / KV cache), but larger embedding + output layers (more parameters / memory).
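
A back-of-the-envelope check with illustrative numbers (not any particular model's):

```python
# Parameter cost of the vocabulary, assuming |V| = 128k tokens and d = 4096.
V, d = 128_000, 4096
embed_params = V * d      # rows of W_embed
output_params = V * d     # rows of W_out (untied softmax head)
total = embed_params + output_params
print(f"{total / 1e9:.2f}B params")   # ~1.05B; tying W_out = W_embed halves this
```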

---

### Why this is a “foundation”

Tokenization silently shapes:

- what patterns become “single atoms” (code, whitespace, common substrings),
- how expensive prompts are (tokens per character),
- multilingual/Unicode behavior (bytes vs scripts vs normalization),
- and even security surfaces (invisible characters, homoglyphs, normalization).

Key Equation
$$\hat{t}_{1:n} = \arg\max_{t:\;\mathrm{concat}(t)=x}\;\sum_{i=1}^{n} \log p(t_i)$$

Why It Matters for Modern Models

  • Tokens are the *unit of compute and cost*: prompt price, latency, and context usage are measured in tokens, not characters (illustrated in the sketch after this list).
  • Tokenizer design reshapes capability: code, math, and multilingual text can become easy or painfully fragmented depending on subword boundaries.
  • Long-context engineering (#30) depends on token counts: “128k context” is 128k tokens, and tokenization determines how much text fits.
  • Vocabulary size is an architectural knob: embedding + softmax head scale with |V|, trading parameters for shorter sequences.
  • Unicode edge cases (normalization, invisible characters) affect reliability and safety: two visually identical strings can tokenize differently.
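
A quick way to see the character-vs-token gap is to count both on a few strings. This sketch assumes the `tiktoken` package and its `cl100k_base` encoding are available; any tokenizer would show the same pattern with different numbers.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "the quick brown fox jumps over the lazy dog",   # common English words
    "supercalifragilisticexpialidocious",            # one rare long word
    "df_träning_ålder_变量",                          # mixed-script identifier
]

for s in samples:
    ids = enc.encode(s)
    print(f"{len(s):3d} chars -> {len(ids):3d} tokens  {s!r}")
```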

Missing Intuition

What is still poorly explained in textbooks and papers:

  • “Tokens are not words”: the model’s atoms are whatever the tokenizer decided (often mixing whitespace, punctuation, and subwords).
  • BPE is compression-by-frequency: it merges what’s common in the training distribution; rare strings (especially identifiers / numbers) can explode into many tokens.
  • Unigram tokenization is inference: it chooses the *most likely segmentation* under token priors, not necessarily the longest tokens.
  • Byte-level tokenization is robust but expensive: non-ASCII scripts and emoji expand into multiple bytes (more tokens).
  • Normalization matters: different Unicode forms (NFC/NFKC, zero-width chars, NBSP) can change token boundaries and costs without changing what you “see” (see the sketch below).
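
A small, tokenizer-agnostic sketch of the normalization point: three strings that render (near-)identically but differ at the code-point level, so any tokenizer that does not normalize will segment them differently.

```python
import unicodedata

a = "caf\u00e9"            # "café" with precomposed é (NFC form)
b = "cafe\u0301"           # "café" as 'e' + combining acute accent (NFD form)
c = "caf\u00e9\u200b"      # NFC "café" plus a trailing zero-width space

print(a == b, a == c)                        # False False
print([len(s) for s in (a, b, c)])           # [4, 5, 5] code points
print(unicodedata.normalize("NFC", b) == a)  # True: NFC collapses b into a
```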

Connections