Dive deep into the architecture that powers every modern LLM. Understand self-attention---the mechanism that allows tokens to attend to each other regardless of distance---multi-head attention for capturing different relationship types, feed-forward networks, layer normalization, residual connections, and how they combine into a full transformer block.
The Transformer architecture (Vaswani et al., 2017) is the foundation of all modern LLMs. Its key innovation---self-attention---allows every token to directly interact with every other token in a sequence, eliminating the information bottleneck of recurrent architectures.
A transformer block consists of two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalization. Stacking dozens (or hundreds) of these blocks creates the deep networks we call LLMs.
Understanding the transformer at a mathematical level is essential for: debugging model behavior, implementing efficient inference, designing architectural improvements, and understanding why certain prompting strategies work. Each component has a clear purpose, and their interplay creates a system far more powerful than the sum of its parts.
This chapter covers:
The core mechanism — every token computes weighted attention over all other tokens via queries, keys, and values.
Multi-head attention captures diverse patterns; FFN stores and transforms knowledge
Parallel attention heads that each learn different relationship types — syntax, semantics, position.
Per-token nonlinear transformations that store factual knowledge and apply complex feature mappings.
Normalization and skip connections make deep stacking possible
Stabilizing deep network training by normalizing activations — LayerNorm, RMSNorm, Pre-Norm vs Post-Norm.
Skip connections that enable gradient flow through 100+ layers by preserving the identity path.
Assembling all components into the Pre-Norm block that gets stacked dozens of times in modern LLMs.
Self-attention is the mechanism by which each token computes a weighted combination of the representations of all tokens in the sequence, including its own. It answers the question: "Which tokens should I pay attention to when updating my representation?" This enables capturing long-range dependencies that RNNs struggle with.
Key insight: Self-attention computes pairwise relationships between all tokens in a single step, giving every token direct access to the full sequence context.
Each token's embedding is projected into three vectors: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?). The projections are learned parameters. The query-key dot product determines attention weights; the values are what gets aggregated.
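The projection matrices $W_Q$, $W_K$, and $W_V$ transform each token embedding $x_i$ (a row vector) into three roles: $q_i = x_i W_Q$, $k_i = x_i W_K$, and $v_i = x_i W_V$. The attention score $q_i k_j^\top = x_i W_Q W_K^\top x_j^\top$ is a bilinear form that measures compatibility between tokens $i$ and $j$ through the learned metric $W_Q W_K^\top$. This is more expressive than simple dot-product similarity between raw embeddings because the model learns what aspects of tokens should determine attention.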
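As a minimal sketch of the projections described above, the following NumPy snippet builds queries, keys, and values and checks that the query-key dot product equals the bilinear form in the original embeddings. The sizes, the random matrices standing in for learned weights, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the text).
n_tokens, d_model, d_k = 6, 16, 8

# One row per token; random stand-ins for learned embeddings.
X = rng.standard_normal((n_tokens, d_model))

# Projection matrices: random here, learned in a real model.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q  # queries: what each token is looking for
K = X @ W_K  # keys:    what each token contains
V = X @ W_V  # values:  what each token provides

# Raw score between token i (as query) and token j (as key).
i, j = 2, 1
score_dot = Q[i] @ K[j]

# The same score as a bilinear form in the embeddings,
# using the learned "metric" W_Q W_K^T of shape (d_model, d_model).
M = W_Q @ W_K.T
score_bilinear = X[i] @ M @ X[j]

print(np.isclose(score_dot, score_bilinear))
```

Because the metric $W_Q W_K^\top$ is generally not symmetric, the score from token $i$ to token $j$ can differ from the score from $j$ to $i$.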
In "The cat sat on the mat", how does the word "sat" use Q, K, V?
The attention function: (1) Compute dot products $QK^\top$ to get raw attention scores. (2) Scale by $1/\sqrt{d_k}$ to prevent softmax saturation. (3) Apply softmax to each row to get attention weights (probabilities that sum to 1). (4) Multiply by $V$ to get the weighted combination of values. The scaling factor is critical: without it, large dot products push softmax into regions with vanishing gradients.
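The four steps above can be sketched in NumPy as follows. This is a single-head, unbatched illustration with random inputs; the function names and shapes are assumptions, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # (1) raw query-key dot products
    scores = scores / np.sqrt(d_k)      # (2) scale by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)  # (3) each row becomes a distribution
    output = weights @ V                # (4) weighted combination of values
    return output, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

output, weights = attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Returning the weights alongside the output is a convenience for inspection; a production kernel would typically not materialize them.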
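Self-attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$. The $1/\sqrt{d_k}$ scaling prevents dot products from growing with dimension, keeping softmax gradients useful. Without scaling, if $q$ and $k$ are random vectors with independent entries $\sim \mathcal{N}(0, 1)$, then $q \cdot k$ has variance $d_k$, so its standard deviation is $\sqrt{d_k}$. For large $d_k$ the unscaled scores are spread over many units, and $\mathrm{softmax}(QK^\top)$ is nearly one-hot, producing vanishing gradients for all but the top-scoring key.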
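The variance argument can be checked empirically. The sketch below, assuming standard-normal queries and keys and an illustrative $d_k = 256$, measures the standard deviation of unscaled dot products and compares the largest softmax weight with and without scaling on the same scores. Helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def dot_product_std(d_k, n_samples=5_000):
    # Sample many (q, k) pairs with N(0, 1) entries and measure std of q . k.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    return float((q * k).sum(axis=1).std())

def max_weights(d_k, n_keys=32):
    # One query against n_keys keys; same raw scores for both softmaxes.
    q = rng.standard_normal(d_k)
    K = rng.standard_normal((n_keys, d_k))
    raw = K @ q
    unscaled = softmax(raw).max()
    scaled = softmax(raw / np.sqrt(d_k)).max()
    return float(unscaled), float(scaled)

d_k = 256  # illustrative value, not taken from the text
measured_std = dot_product_std(d_k)
unscaled_max, scaled_max = max_weights(d_k)

print(f"std of q.k: {measured_std:.2f} (theory: {np.sqrt(d_k):.2f})")
print(f"largest weight, unscaled: {unscaled_max:.3f}, scaled: {scaled_max:.3f}")
```

Dividing the scores by a constant greater than 1 can only flatten the softmax, so the scaled maximum weight is never larger than the unscaled one for the same scores.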
Why divide by $\sqrt{d_k}$? What happens without scaling when $d_k$ is large?
In decoder-only models (GPT, Llama, Claude), each token can only attend to previous tokens and itself. This is enforced by adding $-\infty$ to attention scores for future positions before softmax, which zeros out those attention weights. This enables autoregressive generation: token $t+1$ is predicted using only tokens $1$ through $t$.
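The causal mask ensures that the attention weight $\alpha_{ij}$ from position $i$ to position $j$ satisfies $\alpha_{ij} = 0$ for $j > i$, because $e^{-\infty} = 0$. This enforces the autoregressive property: the output at position $i$ cannot depend on $x_j$ for any $j > i$. The mask is the only difference between encoder (bidirectional) and decoder (causal) self-attention: the same attention mechanism with a different mask produces fundamentally different model capabilities.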
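A minimal NumPy sketch of causal masking, assuming a single head and random queries and keys: future positions are set to $-\infty$ before the softmax, so their weights come out exactly zero. The function name and sizes are illustrative.

```python
import numpy as np

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    # True strictly above the diagonal: column j is a future position for row i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)  # exp(-inf) evaluates to 0.0
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))

A = causal_attention_weights(Q, K)
print(np.round(A, 3))  # lower-triangular: row i has nonzero weights only for j <= i
```

Dropping the `np.where` line turns this into bidirectional (encoder-style) attention with no other changes.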
In the sequence "I love cats", which tokens can "love" attend to?
Self-attention computes an $n \times n$ attention matrix, making it quadratic in sequence length $n$. For $n = 100{,}000$ tokens, this matrix has 10 billion entries. This is the primary bottleneck for long-context LLMs. Solutions include: FlashAttention (memory-efficient), sparse attention, linear attention, and sliding window attention.
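The quadratic growth is easy to verify with a few lines of arithmetic. The helper names below are hypothetical, and the byte count assumes one FP32 value (4 bytes) per entry for a single head.

```python
def attention_matrix_entries(n_tokens):
    # One score per (query, key) pair.
    return n_tokens * n_tokens

def attention_matrix_bytes(n_tokens, bytes_per_entry=4):
    # Default of 4 bytes corresponds to FP32.
    return attention_matrix_entries(n_tokens) * bytes_per_entry

entries_100k = attention_matrix_entries(100_000)
bytes_100k = attention_matrix_bytes(100_000)

# Doubling the sequence length quadruples the number of entries.
growth = attention_matrix_entries(2_000) / attention_matrix_entries(1_000)

print(entries_100k)      # 10000000000
print(bytes_100k / 1e9)  # 40.0 (GB, decimal)
print(growth)            # 4.0
```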
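Computing the attention matrix requires $O(n^2 d)$ multiply-adds. Storing this matrix takes $n^2$ floats per head. At long context lengths such as Llama's, this becomes prohibitive: for example, at $n = 131{,}072$ (128K tokens) the matrix has about $1.7 \times 10^{10}$ entries, requiring roughly 69 GB in FP32. FlashAttention avoids materializing this matrix by computing attention in tiles that fit in SRAM, reducing memory from $O(n^2)$ to $O(n)$ while maintaining exact computation.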
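The idea of computing exact attention tile by tile can be illustrated with an online softmax in NumPy. This is only a sketch of the principle, not FlashAttention itself: it tiles over keys and values only, keeps all queries in memory, and runs on the CPU, whereas the real algorithm also tiles queries and fuses everything into a GPU kernel. Function names and the block size are assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix.
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    # Processes keys/values in blocks; peak score storage is (n, block).
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running max of scores per query
    row_sum = np.zeros((n, 1))          # running softmax denominator

    for start in range(0, K.shape[0], block):
        K_blk = K[start:start + block]
        V_blk = V[start:start + block]
        s = Q @ K_blk.T / np.sqrt(d_k)

        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale previous accumulators to the new reference maximum.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        row_max = new_max

    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 8))
K = rng.standard_normal((50, 8))
V = rng.standard_normal((50, 8))

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=16)
print(np.allclose(reference, tiled))
```

The running maximum and running sum are what allow each block to be discarded after use while still producing the same result as the full softmax, up to floating-point rounding.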
A model processes 8K tokens with a fixed model dimension $d$. How does doubling context to 16K affect attention compute?
Explain why self-attention can capture long-range dependencies that RNNs cannot, using a concrete example.