Week 1-2

Chapter 2: Tokenization & Embeddings

Learn how raw text is converted into numerical representations that neural networks can process. Master the tokenization algorithms (BPE, WordPiece, SentencePiece) that determine a model's vocabulary, understand how word embeddings capture semantic meaning in vector space, and explore positional encodings that give transformers a sense of word order.

Chapter Overview

Before an LLM can process text, it must convert characters into numbers. This seemingly simple step, tokenization, has profound implications for model performance, multilingual capability, and computational efficiency.

Tokenization sits at the intersection of linguistics and engineering. Break words into too-small pieces (individual characters) and the model must learn spelling from scratch. Keep words whole and the vocabulary explodes, making the output layer impossibly large. Modern tokenizers like BPE find a middle ground: common words stay whole ("the", "and") while rare words are split into meaningful subwords ("unhappiness" -> "un", "happiness").

Once tokenized, each token is mapped to a dense vector, called an embedding, that captures its semantic meaning. Words with similar meanings cluster together in this high-dimensional space. Finally, because transformers process all tokens in parallel (unlike sequential RNNs), positional information must be injected so the model knows word order.

This chapter covers:

Byte Pair Encoding (BPE): The dominant tokenization algorithm used by GPT, Llama, and most modern LLMs
WordPiece & SentencePiece: Alternative algorithms used by BERT and multilingual models
Token Vocabularies: How vocabulary size affects model performance and efficiency
Word Embeddings: Dense vector representations that capture semantic relationships
Positional Encodings: How transformers understand token order without recurrence

Chapter Roadmap


Byte Pair Encoding

The dominant tokenization algorithm — iteratively merging frequent character pairs to build a subword vocabulary.

BPE Algorithm
Byte-Level BPE
Tokenization and Compute Cost
Tokenization Artifacts

WordPiece & SentencePiece

Alternative tokenizers using likelihood-based merges and language-agnostic byte processing.

WordPiece Algorithm
SentencePiece
Unigram Language Model
BPE vs WordPiece vs Unigram

Design decisions

Token Vocabularies

How vocabulary size, special tokens, and byte fallback affect efficiency and multilingual fairness.

Vocabulary Size Tradeoffs
Special Tokens
Vocabulary and Multilingual Efficiency
Byte Fallback

From tokens to vectors

Tokens become dense vectors that encode meaning and position

Word Embeddings

Dense vector representations that capture semantic similarity — from Word2Vec to contextual embeddings.

From One-Hot to Dense Embeddings
Semantic Similarity
Word Analogies
Contextual Embeddings

Positional Encodings

Injecting order information into transformers — sinusoidal, learned, RoPE, and ALiBi approaches.

Sinusoidal Positional Encoding
Learned Positional Embeddings
Rotary Position Embedding (RoPE)
ALiBi: Attention with Linear Biases

Byte Pair Encoding, originally a data compression algorithm, was adapted for NLP by Sennrich et al. (2016). It builds a vocabulary by iteratively merging the most frequent pairs of characters or subwords. BPE is the tokenization method used by GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs.

Key insight: BPE creates a vocabulary that naturally balances between character-level and word-level tokenization. Frequent words are single tokens; rare words decompose into meaningful subword units.

In this topic

1. BPE Algorithm
2. Byte-Level BPE
3. Tokenization and Compute Cost
4. Tokenization Artifacts


BPE Algorithm

1. Start with a vocabulary of all individual characters.
2. Count all adjacent symbol pairs in the training corpus.
3. Merge the most frequent pair into a new token.
4. Repeat steps 2 and 3 until the desired vocabulary size is reached.

Each merge creates a new subword unit that captures a common pattern in the language.

Mathematical Intuition

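BPE greedily maximizes compression: at each step, merging the most frequent pair $(a, b)$ shortens the encoded corpus by $\text{freq}(a,b)$ tokens, one per non-overlapping occurrence, at the price of one new vocabulary entry. After $M$ merges from base vocabulary $V_0$, the vocabulary size is $|V| = |V_0| + M$. The algorithm can be read as a greedy approximation to a minimum description length (MDL) objective such as $\mathcal{L}(D, V) = |V| \log |V| + |D|_V \log |V|$, where $|D|_V$ is the encoded corpus length under vocabulary $V$. The first term is the cost of storing the vocabulary and the second is the cost of the encoded corpus.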

Example:

Given the corpus: "low lower lowest", trace the first 3 BPE merges.
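One way to work through this example is the short Python sketch below. It is a toy trainer, not a production tokenizer: the function names `train_bpe` and `merge_pair` are made up for illustration, words are split on whitespace so merges never cross word boundaries, no end-of-word marker is used, and ties between equally frequent pairs go to whichever pair was counted first. Real implementations differ on all of these points.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` in `symbols`."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a whitespace-split corpus."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs inside each word.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # max() returns the first maximal key, so ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word before counting again.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges, words


merges, segmented = train_bpe("low lower lowest", num_merges=3)
print(merges)           # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(segmented))  # [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
```

Under this tie-break the first three merges are l+o, lo+w, and low+e. Breaking the initial tie the other way would merge o+w first, yet the token "low" would still exist after two merges.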


Byte-Level BPE

GPT-2 introduced byte-level BPE: start with 256 byte values instead of Unicode characters. This guarantees any text can be tokenized (no unknown tokens) and handles any language or encoding. The base vocabulary is fixed at 256 bytes, and merges build from there to the target vocabulary size.
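As a concrete instance of this bookkeeping, GPT-2's vocabulary size can be reproduced from the byte base. The breakdown into 50,000 merges plus one end-of-text token is the commonly cited one for GPT-2 and is an addition here, not something stated above:

$|V| = \underbrace{256}_{\text{bytes}} + \underbrace{50{,}000}_{\text{merges}} + \underbrace{1}_{\text{special token}} = 50{,}257$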

Mathematical Intuition

Starting from 256 byte values guarantees universal coverage because every Unicode character encodes as 1 to 4 UTF-8 bytes. Before any merges are applied, the worst-case expansion is $4\times$: a 4-byte character becomes 4 tokens. English text expands by approximately $1\times$ (each ASCII character is a single byte), while CJK characters average $3\times$ (3 bytes each in UTF-8). Learned merges shrink both figures, but a merge table trained mostly on English shrinks English far more. This asymmetry is a root cause of tokenizer bias across languages.

Example:

Why does byte-level BPE never produce an "unknown token"?
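The answer can be checked directly in Python: encoding any string as UTF-8 yields integers in the range 0 to 255, each of which is already a base token. The helper name `to_byte_tokens` is illustrative, and the sketch covers only the base vocabulary, before any merges are applied.

```python
def to_byte_tokens(text):
    """Return the UTF-8 bytes of `text` as base token ids in 0..255."""
    return list(text.encode("utf-8"))


for ch in ["A", "é", "中", "😀"]:
    ids = to_byte_tokens(ch)
    # Decoding the ids recovers the character, so nothing is lost or unknown.
    assert bytes(ids).decode("utf-8") == ch
    print(repr(ch), len(ids), ids)
# 'A' 1 [65]
# 'é' 2 [195, 169]
# '中' 3 [228, 184, 173]
# '😀' 4 [240, 159, 152, 128]
```

The byte counts 1, 2, 3, and 4 are the pre-merge expansion factors for ASCII, accented Latin, CJK, and emoji characters respectively.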


Tokenization and Compute Cost

$\text{Compute} \propto \text{sequence length}^2 \times \text{model dimension}$

Fewer tokens for the same text means shorter sequences, and shorter sequences mean less compute, because self-attention is quadratic in sequence length. Efficient tokenization therefore directly reduces training and inference costs. A vocabulary that tokenizes English text into roughly 1.3 tokens per word is close to optimal for English.

Mathematical Intuition

Self-attention cost scales as $O(n^2 d)$, where $n$ is sequence length and $d$ is the model dimension. A tokenizer that reduces $n$ by a factor of $k$ cuts attention compute by a factor of $k^2$. For example, if BPE compresses 5000 characters to 1300 tokens, attention cost is $(1300/5000)^2 \approx 6.8\%$ of the character-level cost. This quadratic saving is why tokenizer efficiency directly impacts training and inference budgets.
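The arithmetic in this paragraph is easy to reproduce. The sketch below assumes the model dimension is held fixed, so only the $n^2$ term matters, and it ignores every non-attention cost. The function name `attention_cost_ratio` is hypothetical.

```python
def attention_cost_ratio(n_tokens, n_baseline):
    """Relative self-attention cost, assuming cost is proportional to n**2."""
    return (n_tokens / n_baseline) ** 2


ratio = attention_cost_ratio(1300, 5000)  # BPE tokens vs. characters
print(f"{ratio:.2%}")        # 6.76%
print(f"{1 / ratio:.1f}x")   # 14.8x cheaper than character-level attention
```

Because the ratio is squared, a tokenizer that shortens sequences by about $3.8\times$ cuts attention cost by almost $15\times$.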

Example:

"International" is 1 token in GPT-4 but 13 tokens in a character-level model, one per letter. What is the compute difference?


Tokenization Artifacts

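BPE tokenization can create surprising artifacts. Numbers may be split inconsistently ("1234" -> "12", "34" or "1", "234"). Spaces are handled specially: GPT-2-style tokenizers attach a leading space to the token that follows it and render that space as the special character "Ġ". As a result, a word with a leading space and the same word without one are different tokens. These artifacts can impact model behavior, especially for arithmetic and code.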

Mathematical Intuition

BPE tokenization is context-dependent at boundaries: "123" might tokenize as ["123"] but "1234" as ["12", "34"] because the merge history creates different split points. The number of possible tokenizations of a string of length $n$ with vocabulary $V$ is exponential: up to $2^{n-1}$ segmentations. BPE's deterministic greedy approach selects just one, which may not be optimal for downstream tasks.
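The $2^{n-1}$ count comes from the $n-1$ gaps between characters, each of which is either cut or not. The sketch below enumerates them for "1234" and then filters by a small vocabulary. Both the `segmentations` helper and the vocabulary are invented for illustration and do not correspond to any real tokenizer.

```python
from itertools import combinations


def segmentations(text):
    """Yield every way to cut `text` into contiguous, non-empty pieces."""
    n = len(text)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]


all_splits = list(segmentations("1234"))
print(len(all_splits))  # 8, i.e. 2 ** (4 - 1)

# Hypothetical vocabulary: only splits built from these pieces are usable.
vocab = {"1", "2", "3", "4", "12", "34", "123", "234"}
valid = [s for s in all_splits if all(piece in vocab for piece in s)]
print(len(valid))       # 6
print(["12", "34"] in valid, ["1", "234"] in valid)  # True True
```

Even with this tiny vocabulary, six segmentations of "1234" are valid. A trained BPE tokenizer always returns exactly one of them, determined by the order of its merge table.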

Example:

Why do LLMs sometimes struggle with counting letters in a word like "strawberry"?

Theory Exercise

Problem:

A BPE tokenizer with vocabulary size 32,000 tokenizes English at 1.3 tokens/word but Japanese at 3.5 tokens/word. Why? What are the implications?

Hints:

Think about the training data distribution for BPE
Consider what happens to languages underrepresented in training
Think about cost implications for different languages

Papers

Neural Machine Translation of Rare Words with Subword Units (BPE), Sennrich, Haddow, and Birch, 2016
Efficient Estimation of Word Representations in Vector Space (Word2Vec), Mikolov et al., 2013

Blogs

The Illustrated BPE (Hugging Face NLP Course)
Tokenizers: How Machines Read (Jay Alammar)
