Week 3-4

Chapter 3: Transformer Architecture

Dive deep into the architecture that powers nearly every modern LLM. Understand self-attention (the mechanism that allows tokens to attend to each other regardless of distance), multi-head attention for capturing different relationship types, feed-forward networks, layer normalization, and residual connections, and see how they combine into a full transformer block.

Chapter Overview

The Transformer architecture (Vaswani et al., 2017) is the foundation of nearly all modern LLMs. Its key innovation, self-attention, allows every token to interact directly with every other token in a sequence, eliminating the information bottleneck of recurrent architectures.

A transformer block consists of two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is wrapped with a residual connection and layer normalization. Stacking dozens (or hundreds) of these blocks creates the deep networks we call LLMs.
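As a rough sketch of how the two sub-layers compose, the following plain-Python function wires up one block. It assumes the Pre-Norm ordering (normalize, apply the sub-layer, add the residual), omits the learned LayerNorm scale and shift, and takes the sub-layers as callables. `mean_attention` and `relu_ffn` are hypothetical stand-ins used only to exercise the wiring; they are not real attention or a real FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def transformer_block(tokens, attention, feed_forward):
    """Pre-Norm block (an assumption): x + attention(norm(x)), then x + feed_forward(norm(x))."""
    # Sub-layer 1: attention sees the whole normalized sequence at once.
    attn_out = attention([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn_out)]
    # Sub-layer 2: the feed-forward network is applied to each token independently.
    ffn_out = [feed_forward(layer_norm(t)) for t in tokens]
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, ffn_out)]

# Hypothetical toy sub-layers, only to show the data flow.
def mean_attention(seq):
    avg = [sum(col) / len(seq) for col in zip(*seq)]
    return [avg[:] for _ in seq]

def relu_ffn(x):
    return [max(0.0, v) for v in x]

x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]
y = transformer_block(x, mean_attention, relu_ffn)
```

Because each sub-layer only adds to its input, a sub-layer that outputs zeros leaves the tokens unchanged. That identity path is what the residual connections preserve when many blocks are stacked.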

Understanding the transformer at a mathematical level is essential for debugging model behavior, implementing efficient inference, designing architectural improvements, and understanding why certain prompting strategies work. Each component has a clear purpose, and their interplay creates a system far more powerful than the sum of its parts.

This chapter covers:

Self-Attention: The core mechanism that enables tokens to exchange information
Multi-Head Attention: Running multiple attention patterns in parallel
Feed-Forward Networks: Per-token nonlinear transformations
Layer Normalization: Stabilizing training of deep networks
Residual Connections: Enabling gradient flow through many layers
Full Transformer Block: How all components fit together

Chapter Roadmap

Self-Attention

The core mechanism — every token computes weighted attention over all other tokens via queries, keys, and values.

Queries, Keys, and Values; Scaled Dot-Product Attention; Causal (Masked) Attention; Attention Complexity

Parallel heads and per-token processing

Multi-head attention captures diverse patterns; FFN stores and transforms knowledge

Multi-Head Attention

Parallel attention heads that each learn different relationship types — syntax, semantics, position.

Multi-Head Attention Formula; What Different Heads Learn; Grouped Query Attention (GQA); Multi-Query Attention (MQA)

Feed-Forward Networks

Per-token nonlinear transformations that store factual knowledge and apply complex feature mappings.

Position-wise FFN; Activation Functions; SwiGLU Activation; FFN as Knowledge Storage

Training stability mechanisms

Normalization and skip connections make deep stacking possible

Layer Normalization

Stabilizing deep network training by normalizing activations — LayerNorm, RMSNorm, Pre-Norm vs Post-Norm.

Layer Normalization; Pre-Norm vs Post-Norm; RMSNorm; Why Not BatchNorm?

Residual Connections

Skip connections that enable gradient flow through 100+ layers by preserving the identity path.

Residual Connection Formula; Gradient Flow; Residual Stream Interpretation

Complete architecture

Full Transformer Block

Assembling all components into the Pre-Norm block that gets stacked dozens of times in modern LLMs.

Pre-Norm Transformer Block; Parameter Count Breakdown; KV Cache for Inference; Scaling Depth vs Width

Self-Attention Mechanism

Self-attention is the mechanism by which each token computes a weighted combination of the representations of all tokens in the sequence, its own included. It answers the question: "Which tokens should I pay attention to when updating my representation?" This lets the model capture long-range dependencies that RNNs struggle with.

Key insight: Self-attention computes pairwise relationships between all tokens in a single step, giving every token direct access to the full sequence context.

In this topic

1. Queries, Keys, and Values
2. Scaled Dot-Product Attention
3. Causal (Masked) Attention
4. Attention Complexity

Queries, Keys, and Values

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

Each token's embedding $x_i$ is projected into three vectors: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?). The projections $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learned parameters. The query-key dot product determines attention weights; the values are what gets aggregated.

Mathematical Intuition

The projection matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ transform each token into three roles. With $Q = XW_Q$, token $i$ has query $q_i = W_Q^T x_i$ and token $j$ has key $k_j = W_K^T x_j$, so the attention score $q_i^T k_j = x_i^T W_Q W_K^T x_j$ is a bilinear form that measures compatibility between tokens $i$ and $j$ through the learned matrix $W_Q W_K^T \in \mathbb{R}^{d \times d}$. This is more expressive than the plain dot product $x_i^T x_j$ because the model learns which aspects of the tokens should determine attention.
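The two views of the score can be checked numerically. The sketch below uses random toy matrices in plain Python (the sizes `n`, `d`, `d_k` and the helpers `matmul` and `transpose` are illustrative assumptions) and follows the row-vector convention $Q = XW_Q$, under which the full score matrix $QK^T$ equals $X (W_Q W_K^T) X^T$.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, d_k = 4, 6, 3          # toy sizes: 4 tokens, model width 6, head width 3
X = random_matrix(n, d)      # one row per token embedding
W_Q, W_K, W_V = (random_matrix(d, d_k) for _ in range(3))

# Project every token into its query, key, and value (each result is n x d_k).
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)

# View 1: pairwise query-key dot products, Q K^T (n x n).
scores = matmul(Q, transpose(K))

# View 2: the same scores as a bilinear form through the d x d matrix W_Q W_K^T.
M = matmul(W_Q, transpose(W_K))
scores_bilinear = matmul(matmul(X, M), transpose(X))

max_diff = max(abs(a - b)
               for row_a, row_b in zip(scores, scores_bilinear)
               for a, b in zip(row_a, row_b))
```

`max_diff` should be at floating-point rounding level, which confirms that the two expressions describe the same scores.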

Example:

In "The cat sat on the mat", how does the word "sat" use Q, K, V?

Scaled Dot-Product Attention

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The attention function has four steps: (1) Compute the dot products $QK^T$ to get raw attention scores. (2) Divide by $\sqrt{d_k}$ to prevent softmax saturation. (3) Apply softmax row-wise to get attention weights (each row is a probability distribution that sums to 1). (4) Multiply by $V$ to get the weighted combination of values. The $1/\sqrt{d_k}$ scaling is critical: without it, large dot products push softmax into regions with vanishing gradients.
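The four steps map directly onto code. This is a minimal, unbatched, single-head sketch in plain Python, written for readability rather than speed; the function names and the toy `Q`, `K`, `V` values are illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key, and value vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1.
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
        weights.append(w)
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

For the first query, keys 0 and 2 have the same dot product, so they receive equal weight and key 1 receives less.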

Mathematical Intuition

Self-attention computes $\text{softmax}(QK^T/\sqrt{d_k})V$. The $\sqrt{d_k}$ scaling keeps the spread of the dot products from growing with dimension, so softmax gradients stay useful. Without scaling, if $q$ and $k$ are random vectors with independent entries $\sim N(0,1)$, then $q^T k$ has mean 0 and variance $d_k$. For $d_k = 64$ the standard deviation is $\sqrt{64} = 8$, and $\text{softmax}([8, -8, 2, ...])$ is nearly one-hot, producing vanishing gradients for all but the top-scoring key. Dividing by $\sqrt{d_k}$ brings the variance back to 1.
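A small simulation makes the variance argument concrete. The sketch below draws random query and key vectors with independent standard-normal entries, measures the spread of their dot products before and after scaling, and compares the two softmax outputs for the example scores. The sample count and seed are arbitrary choices.

```python
import math
import random
import statistics

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k = 64
dots = []
for _ in range(5000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(a * b for a, b in zip(q, k)))

std_unscaled = statistics.pstdev(dots)                                # close to sqrt(64) = 8
std_scaled = statistics.pstdev([s / math.sqrt(d_k) for s in dots])    # close to 1

raw_scores = [8.0, -8.0, 2.0]
unscaled = softmax(raw_scores)                                  # top weight is about 0.998
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])      # top weight is about 0.62
```

With scaling, the top key still wins but the other keys keep non-negligible weight, so gradients can still flow to them.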

Example:

Why divide by $\sqrt{d_k}$ ? What happens without scaling when $d_k = 512$ ?

Causal (Masked) Attention

$\text{mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$

In decoder-only models (GPT, Llama, Claude), each token can only attend to previous tokens and itself. This is enforced by adding $-\infty$ to attention scores for future positions before softmax, which zeros out those attention weights. This enables autoregressive generation: token $t$ is predicted using only tokens $1$ through $t-1$ .

Mathematical Intuition

The causal mask, with $M_{ij} = -\infty$ for $j > i$ and $M_{ij} = 0$ otherwise, ensures $\text{softmax}(\ldots + M)_{ij} = 0$ for $j > i$, because $e^{-\infty} = 0$. This enforces the autoregressive property: $P(x_t \mid x_{<t})$ cannot depend on $x_{\geq t}$. The mask is the only difference between encoder (bidirectional) and decoder (causal) self-attention: the same attention mechanism with a different mask produces fundamentally different model capabilities.
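The masking step can be sketched in a few lines of plain Python. The raw scores below are made-up numbers for a three-token sequence, used only to show that the weights on future positions come out exactly zero.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """0 where key position j <= query position i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

tokens = ["I", "love", "cats"]
scores = [[0.5, 1.2, -0.3],    # hypothetical raw attention scores
          [0.1, 0.4, 2.0],
          [1.0, 0.0, 0.7]]

mask = causal_mask(len(tokens))
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

# For each query token, list the tokens that receive non-zero weight.
visible = {tok: [tokens[j] for j, w in enumerate(weights[i]) if w > 0]
           for i, tok in enumerate(tokens)}
```

Here `visible["love"]` is `["I", "love"]`: the second token attends to itself and the token before it, and its large raw score for "cats" is discarded by the mask.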

Example:

In the sequence "I love cats", which tokens can "love" attend to?

Attention Complexity

$\text{Time: } O(n^2 \cdot d), \quad \text{Memory: } O(n^2 + n \cdot d)$

Self-attention computes an $n \times n$ attention matrix, making it quadratic in sequence length. For $n = 100K$ tokens, this matrix has 10 billion entries. This is the primary bottleneck for long-context LLMs. Solutions include: FlashAttention (memory-efficient), sparse attention, linear attention, and sliding window attention.

Mathematical Intuition

Computing the $n \times n$ attention matrix $QK^T$ requires $n^2 d$ multiply-adds. Storing this matrix takes $n^2$ floats. For $n = 128{,}000$ (the context length of Llama 3.1), the matrix has $1.64 \times 10^{10}$ entries, requiring about 65.5 GB in FP32 for a single attention head. FlashAttention avoids materializing this matrix by computing attention in tiles that fit in SRAM, reducing memory from $O(n^2)$ to $O(n)$ while maintaining exact computation.
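The matrix-size arithmetic is easy to verify. This sketch assumes 4 bytes per FP32 entry and counts one $n \times n$ matrix, that is, a single head in a single layer; `attention_matrix_bytes` is a hypothetical helper name.

```python
def attention_matrix_bytes(n, bytes_per_entry=4):
    """Entries and bytes needed to materialize one n x n attention matrix."""
    entries = n * n
    return entries, entries * bytes_per_entry

entries, size_bytes = attention_matrix_bytes(128_000)
size_gb = size_bytes / 1e9         # decimal gigabytes: 65.536
size_gib = size_bytes / 2**30      # binary gibibytes: about 61.04

entries_100k, _ = attention_matrix_bytes(100_000)   # 10 billion entries
```

Multiplying by the number of heads and layers gives the total if every matrix were kept in memory at once, which is why tiled kernels that never store the full matrix matter at long context lengths.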

Example:

A model processes 8K tokens with $d = 4096$ . How does doubling context to 16K affect attention compute?
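One way to work this example is to count multiply-adds. The sketch assumes "8K" means 8,192 tokens and counts only the two large matrix products, $QK^T$ and the weights times $V$, at roughly $n^2 d$ multiply-adds each. It ignores the projections, the softmax, and the feed-forward layers; `attention_multiply_adds` is a hypothetical helper name.

```python
def attention_multiply_adds(n, d):
    """Approximate multiply-adds for Q K^T plus (weights @ V): 2 * n^2 * d."""
    return 2 * n * n * d

base = attention_multiply_adds(8_192, 4096)
doubled = attention_multiply_adds(16_384, 4096)
ratio = doubled / base   # 4.0
```

Doubling the sequence length quadruples the attention compute because the cost is quadratic in $n$ and only linear in $d$.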

Theory Exercise

Problem:

Explain why self-attention can capture long-range dependencies that RNNs cannot, using a concrete example.

Hints:

Think about the path length between distant tokens
Consider the vanishing gradient problem in RNNs
Think about the attention matrix as a shortcut

Papers

Attention Is All You Need
FlashAttention: Fast and Memory-Efficient Exact Attention

Blogs

The Illustrated Transformer (Jay Alammar)
Attention Mechanism Explained (Lilian Weng)
