Learn how raw text is converted into numerical representations that neural networks can process. Master the tokenization algorithms (BPE, WordPiece, SentencePiece) that determine a model's vocabulary, understand how word embeddings capture semantic meaning in vector space, and explore positional encodings that give transformers a sense of word order.
Before an LLM can process text, it must convert characters into numbers. This seemingly simple step, tokenization, has profound implications for model performance, multilingual capability, and computational efficiency.
Tokenization sits at the intersection of linguistics and engineering. Break words into too-small pieces (individual characters) and the model must learn spelling from scratch. Keep words whole and the vocabulary explodes, making the output layer impossibly large. Modern tokenizers like BPE find a middle ground: common words stay whole ("the", "and") while rare words are split into meaningful subwords ("unhappiness" -> "un", "happiness").
Once tokenized, each token is mapped to a dense vector (an embedding) that captures its semantic meaning. Words with similar meanings cluster together in this high-dimensional space. Finally, since transformers process all tokens in parallel (unlike sequential RNNs), we must inject positional information so the model knows word order.
This chapter covers:
The dominant tokenization algorithm — iteratively merging frequent character pairs to build a subword vocabulary.
Alternative tokenizers using likelihood-based merges and language-agnostic byte processing.
How vocabulary size, special tokens, and byte fallback affect efficiency and multilingual fairness.
Tokens become dense vectors that encode meaning and position
Dense vector representations that capture semantic similarity — from Word2Vec to contextual embeddings.
Injecting order information into transformers — sinusoidal, learned, RoPE, and ALiBi approaches.
Byte Pair Encoding, originally a data compression algorithm, was adapted for NLP by Sennrich et al. (2016). It builds a vocabulary by iteratively merging the most frequent pairs of characters or subwords. BPE is the tokenization method used by GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs.
Key insight: BPE creates a vocabulary that naturally balances between character-level and word-level tokenization. Frequent words are single tokens; rare words decompose into meaningful subword units.
BPE greedily maximizes compression: at each step, merging the most frequent pair $(a, b)$ reduces total sequence length by $\mathrm{count}(a, b)$ tokens. After $k$ merges from base vocabulary $V_0$, the vocabulary has size $|V_k| = |V_0| + k$. The algorithm approximates minimum description length (MDL): $V^* = \arg\min_{V} L(D; V)$ subject to $|V| = |V_0| + k$, where $L(D; V)$ is the encoded corpus length under vocabulary $V$.
Given the corpus: "low lower lowest", trace the first 3 BPE merges.
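One way to check a hand trace of this exercise is a toy trainer like the sketch below. It is a simplification, not the reference implementation: it omits the end-of-word marker used by Sennrich et al., and it breaks ties by taking the pair encountered first, which is an assumption (real tokenizers may break ties differently). The name `train_bpe` is hypothetical.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Toy BPE trainer: returns (list of merges, final segmented words)."""
    # Each whitespace-separated word becomes a tuple of symbols with a frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties go to the pair seen first (assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the pair.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges, words


merges, words = train_bpe("low lower lowest", num_merges=3)
for step, (a, b) in enumerate(merges, start=1):
    print(f"merge {step}: {a!r} + {b!r} -> {a + b!r}")
```

In this corpus the pairs ("l", "o") and ("o", "w") both occur 3 times, so the first merge depends on the tie-breaking rule. With the first-seen rule used here, the merges are "lo", then "low", then "lowe".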
GPT-2 introduced byte-level BPE: start with 256 byte values instead of Unicode characters. This guarantees any text can be tokenized (no unknown tokens) and handles any language or encoding. The base vocabulary is fixed at 256 bytes, and merges build from there to the target vocabulary size.
Starting from 256 byte values guarantees universal coverage because any Unicode character encodes as 1-4 UTF-8 bytes. The worst-case expansion is $4\times$: a 4-byte character becomes 4 tokens. The expected expansion for English is approximately $1\times$ (ASCII maps to single bytes), while CJK characters average $3\times$ (3-byte UTF-8). This asymmetry is the root cause of tokenizer bias across languages.
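The byte counts behind this asymmetry can be checked directly with Python's built-in UTF-8 encoder. The sketch below measures bytes per character before any merges are applied; the helper name `utf8_expansion` and the sample strings are illustrative choices, not part of any tokenizer API.

```python
def utf8_expansion(text):
    """Average number of UTF-8 bytes per Unicode character in `text`."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English (ASCII)": "hello world",
    "Chinese (CJK)": "你好世界",
    "Emoji (4-byte)": "😀",
}

for name, text in samples.items():
    raw = text.encode("utf-8")
    # Every element of `raw` is an integer in 0..255, so a 256-entry base
    # vocabulary can always represent it.
    assert all(0 <= b < 256 for b in raw)
    print(f"{name}: {len(text)} chars -> {len(raw)} bytes "
          f"({utf8_expansion(text):.1f} bytes/char)")
```

The ASCII sample comes out at 1 byte per character, the CJK sample at 3, and the emoji at 4. Learned merges later shrink these sequences, but they start from very different lengths.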
Why does byte-level BPE never produce an "unknown token"?
Fewer tokens per text means shorter sequences, which means less compute (self-attention is quadratic in sequence length). Efficient tokenization directly reduces training and inference costs. A vocabulary that tokenizes English text into ~1.3 tokens per word is near-optimal for English.
Self-attention cost scales as $O(n^2)$, where $n$ is sequence length. A tokenizer that reduces $n$ by a factor of $k$ saves a factor of $k^2$ in attention compute. For example, BPE compressing 5000 characters to 1300 tokens: attention cost is $(1300/5000)^2 \approx 6.8\%$ of character-level cost. This quadratic saving is why tokenizer efficiency directly impacts training and inference budgets.
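The arithmetic for the 5000-character example can be reproduced in a few lines. This sketch considers only the quadratic attention term and ignores the linear-cost parts of a transformer layer; `attention_cost_ratio` is a hypothetical helper name.

```python
def attention_cost_ratio(n_tokens, n_baseline):
    """Relative O(n^2) attention cost of n_tokens versus n_baseline positions."""
    return (n_tokens / n_baseline) ** 2


ratio = attention_cost_ratio(1300, 5000)
print(f"sequence length ratio: {1300 / 5000:.2f}")  # 0.26
print(f"attention cost ratio:  {ratio:.4f}")        # 0.0676
print(f"speed-up factor:       {1 / ratio:.1f}x")   # 14.8x
```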
"International" is 1 token in GPT-4 but 3 tokens in a character-level model. What is the compute difference?
BPE tokenization can create surprising artifacts. Numbers may be split inconsistently ("1234" -> "12", "34" or "1", "234"). Spaces are handled specially: GPT-2's byte-level BPE folds a leading space into the token that follows it and renders that space as the special character "Ġ", so "hello" and " hello" map to different tokens. These artifacts can impact model behavior, especially for arithmetic and code.
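The inconsistent number splits can be reproduced with a toy merge table. The sketch below applies merges in priority order, always merging the earliest-learned pair present, which is the general scheme used by GPT-2-style encoders. The merge table `toy_merges` is invented for illustration and is not taken from any real tokenizer.

```python
def bpe_encode(word, merges):
    """Apply learned merges to `word`, always merging the highest-priority pair."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, listed in the order the merges were learned.
toy_merges = [("1", "2"), ("3", "4"), ("12", "3")]

print(bpe_encode("123", toy_merges))   # ['123']
print(bpe_encode("1234", toy_merges))  # ['12', '34']
```

Appending a single digit changes the split of the digits before it: in "1234" the higher-priority merge "3" + "4" fires first, which removes the "3" that "12" + "3" would have needed.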
BPE tokenization is context-dependent at boundaries: "123" might tokenize as ["123"] but "1234" as ["12", "34"] because the merge history creates different split points. The number of possible tokenizations of a string of length $n$ with vocabulary $V$ is exponential: up to $2^{n-1}$ segmentations. BPE's deterministic greedy approach selects just one, which may not be optimal for downstream tasks.
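To make the size of the segmentation space concrete, the sketch below counts every way to split a string into vocabulary pieces using dynamic programming. When every substring is in the vocabulary, each of the n - 1 gaps between characters is independently a boundary or not, giving 2^(n-1) segmentations. The function name and the restricted vocabulary are illustrative assumptions.

```python
def count_segmentations(text, vocab):
    """Count the ways to split `text` into pieces that all belong to `vocab`."""
    # ways[i] = number of valid segmentations of the prefix text[:i]
    ways = [1] + [0] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            if text[start:end] in vocab:
                ways[end] += ways[start]
    return ways[-1]


text = "1234"

# Upper bound: every substring is a valid token.
all_substrings = {text[i:j] for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
print(count_segmentations(text, all_substrings))  # 8 == 2 ** (len(text) - 1)

# A smaller, hypothetical vocabulary admits fewer segmentations.
small_vocab = {"1", "2", "3", "4", "12", "34", "123", "234"}
print(count_segmentations(text, small_vocab))     # 6
```

A BPE encoder returns exactly one of these candidates, determined by its merge order.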
Why do LLMs sometimes struggle with counting letters in a word like "strawberry"?
A BPE tokenizer with vocabulary size 32,000 tokenizes English at 1.3 tokens/word but Japanese at 3.5 tokens/word. Why? What are the implications?