#33 Representations

🔤 Tokenization & Vocabulary Design

Canonical Papers

Neural Machine Translation of Rare Words with Subword Units

Sennrich, Haddow & Birch · 2016 · ACL

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Kudo & Richardson · 2018 · EMNLP (System Demonstrations)

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Xue et al. · 2021 · arXiv

Core Mathematics

Tokenization is the discrete interface between raw text/bytes and a transformer. The model never sees “characters” or “words” — it sees token IDs from a vocabulary $\mathcal V$.

---

## 1) Tokenization is a segmentation into vocabulary items

A tokenizer maps a string/byte sequence $x$ into tokens $t_1,\dots,t_n$:

$$x = \mathrm{concat}(t_1,\dots,t_n), \qquad t_i \in \mathcal V$$

Then each token becomes an integer ID $\mathrm{id}(t_i) \in \{0,\dots,|\mathcal V|-1\}$ that indexes the embedding table.
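
A minimal sketch of that mapping, using a toy hand-written vocabulary (not any real tokenizer's): token strings become integer row indices into the embedding table.

```python
# Toy sketch: tokens -> integer IDs that index the embedding table.
# The vocabulary below is made up for illustration, not a real tokenizer's.
vocab = {"<unk>": 0, "token": 1, "ization": 2, "t": 3, "o": 4, "k": 5}

def to_ids(tokens):
    """Map token strings to integer IDs, falling back to <unk>."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

tokens = ["token", "ization"]             # one segmentation of "tokenization"
assert "".join(tokens) == "tokenization"  # concat(t_1..t_n) == x
print(to_ids(tokens))                     # [1, 2] -> rows of the embedding matrix
```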

---

## 2) Unigram tokenization as maximum-likelihood segmentation (SentencePiece-style)

Unigram models treat tokenization itself as probabilistic inference: choose the segmentation whose tokens are most likely under a unigram prior:

$$\hat{t}_{1:n} = \arg\max_{t:\;\mathrm{concat}(t)=x}\;\sum_{i=1}^{n} \log p(t_i)$$

This is a Viterbi / dynamic programming problem in practice: you search over segmentations and pick the highest-probability path.
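
A compact sketch of that dynamic program, with a made-up vocabulary and made-up token probabilities (real unigram tokenizers such as SentencePiece learn $p(t)$ from data):

```python
import math

# Illustrative token probabilities (not learned from any corpus).
logp = {t: math.log(p) for t, p in {
    "token": 0.05, "ization": 0.02, "t": 0.01, "o": 0.01, "k": 0.01,
    "e": 0.01, "n": 0.01, "i": 0.01, "z": 0.005, "a": 0.01,
}.items()}

def viterbi_segment(x, logp, max_len=10):
    # best[i] = (score of the best segmentation of x[:i], split point, last token)
    best = [(-math.inf, None, None)] * (len(x) + 1)
    best[0] = (0.0, None, None)
    for i in range(1, len(x) + 1):
        for j in range(max(0, i - max_len), i):
            piece = x[j:i]
            if piece in logp and best[j][0] + logp[piece] > best[i][0]:
                best[i] = (best[j][0] + logp[piece], j, piece)
    tokens, i = [], len(x)
    while i > 0:                       # backtrack along the best path
        _, j, piece = best[i]
        tokens.append(piece)
        i = j
    return tokens[::-1]

print(viterbi_segment("tokenization", logp))   # ['token', 'ization']
```

The whole-word pieces win here because one log-probability of a common token beats the sum of many per-character log-probabilities.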

---

## 3) BPE (byte-pair encoding) builds the vocabulary by merges

BPE starts from base symbols (characters or bytes), then repeatedly merges the most frequent adjacent pair:

$$(a,b)^{*} = \arg\max_{(a,b)}\; \mathrm{count}(a\,b) \qquad\Rightarrow\qquad a\;b \to ab$$

Each merge increases vocabulary size (one new token) and usually decreases sequence length on texts where that pair is common.
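
A toy sketch of one training round on a tiny made-up corpus: count adjacent symbol pairs, then merge the most frequent pair everywhere it occurs.

```python
from collections import Counter

# Tiny illustrative corpus, each word as a list of base symbols (characters).
corpus = [list("low"), list("lower"), list("lowest"), list("slow")]

def most_frequent_pair(corpus):
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    a, b = pair
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(a + b)   # new vocabulary item, e.g. ('l','o') -> 'lo'
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # ('l','o') and ('o','w') both occur 4 times
print(pair, merge(corpus, pair))
```

Running this repeatedly grows the vocabulary one merge at a time; common words collapse into single tokens while rare strings stay split into many pieces.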

---

## 4) Vocabulary size is a real model parameter (embedding + softmax head)

Token IDs index the embedding matrix, and logits are produced over the same vocabulary:

$$W_{\text{embed}} \in \mathbb{R}^{|\mathcal V|\times d}, \qquad W_{\text{out}} \in \mathbb{R}^{|\mathcal V|\times d}$$

So parameters scale roughly like:

$$\text{params}_{\text{token}} \approx 2\,|\mathcal V|\,d$$

Tradeoff: larger vocab can mean fewer tokens (cheaper context / KV cache), but larger embedding + output layers (more parameters / memory).
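
A back-of-the-envelope check with illustrative numbers (not any particular model's):

```python
# Parameter cost of the vocabulary, assuming |V| = 128k tokens and d = 4096.
V, d = 128_000, 4096
embed_params = V * d      # rows of W_embed
output_params = V * d     # rows of W_out (untied softmax head)
total = embed_params + output_params
print(f"{total / 1e9:.2f}B params")   # ~1.05B; tying W_out = W_embed halves this
```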

---

### Why this is a “foundation”

Tokenization silently shapes:

- what patterns become “single atoms” (code, whitespace, common substrings),
- how expensive prompts are (tokens per character),
- multilingual/Unicode behavior (bytes vs scripts vs normalization),
- and even security surfaces (invisible characters, homoglyphs, normalization).

Key Equation
$$\hat{t}_{1:n} = \arg\max_{t:\;\mathrm{concat}(t)=x}\;\sum_{i=1}^{n} \log p(t_i)$$

Why It Matters for Modern Models

  • Tokens are the *unit of compute and cost*: prompt price, latency, and context usage are measured in tokens, not characters (illustrated in the sketch after this list).
  • Tokenizer design reshapes capability: code, math, and multilingual text can become easy or painfully fragmented depending on subword boundaries.
  • Long-context engineering (#30) depends on token counts: “128k context” is 128k tokens, and tokenization determines how much text fits.
  • Vocabulary size is an architectural knob: embedding + softmax head scale with |V|, trading parameters for shorter sequences.
  • Unicode edge cases (normalization, invisible characters) affect reliability and safety: two visually identical strings can tokenize differently.
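
A quick way to see the character-vs-token gap is to count both on a few strings. This sketch assumes the `tiktoken` package and its `cl100k_base` encoding are available; any tokenizer would show the same pattern with different numbers.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "the quick brown fox jumps over the lazy dog",   # common English words
    "supercalifragilisticexpialidocious",            # one rare long word
    "df_träning_ålder_变量",                          # mixed-script identifier
]

for s in samples:
    ids = enc.encode(s)
    print(f"{len(s):3d} chars -> {len(ids):3d} tokens  {s!r}")
```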

Missing Intuition

What is still poorly explained in textbooks and papers:

  • “Tokens are not words”: the model’s atoms are whatever the tokenizer decided (often mixing whitespace, punctuation, and subwords).
  • BPE is compression-by-frequency: it merges what’s common in the training distribution; rare strings (especially identifiers / numbers) can explode into many tokens.
  • Unigram tokenization is inference: it chooses the *most likely segmentation* under token priors, not necessarily the longest tokens.
  • Byte-level tokenization is robust but expensive: non-ASCII scripts and emoji expand into multiple bytes (more tokens).
  • Normalization matters: different Unicode forms (NFC/NFKC, zero-width chars, NBSP) can change token boundaries and costs without changing what you “see” (see the sketch below).
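
A small, tokenizer-agnostic sketch of the normalization point: three strings that render (near-)identically but differ at the code-point level, so any tokenizer that does not normalize will segment them differently.

```python
import unicodedata

a = "caf\u00e9"            # "café" with precomposed é (NFC form)
b = "cafe\u0301"           # "café" as 'e' + combining acute accent (NFD form)
c = "caf\u00e9\u200b"      # NFC "café" plus a trailing zero-width space

print(a == b, a == c)                        # False False
print([len(s) for s in (a, b, c)])           # [4, 5, 5] code points
print(unicodedata.normalize("NFC", b) == a)  # True: NFC collapses b into a
```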

Connections