Chapter 8: Advanced Kernels & Optimization

Bring it all together with the kernels that matter in real models: fused layer normalization, memory-efficient seeded dropout, the ideas behind fused attention (FlashAttention), persistent kernels, and a roadmap for going further with Triton.

Chapter Overview

You now have the full toolkit — the hardware model, the Triton programming model, fusion, reductions, tiling, and tuning. This final chapter applies it to the kernels that show up in every transformer, and points you toward what's next.

Fused layer normalization combines the softmax-style row reduction (mean, variance) with element-wise scale-and-shift in a single pass — the same fusion principle, applied again. Low-memory dropout shows a beautiful trick: instead of storing a giant random mask, regenerate it on the fly from a seed, trading a little compute for a lot of memory. Fused attention (FlashAttention) is the capstone idea: it combines tiling, the online-softmax running-max/running-sum trick from Chapter 6, and fusion to compute attention without ever materializing the $O(N^2)$ score matrix — the optimization that made long-context transformers practical. Persistent kernels keep programs resident and loop over work to amortize launch overhead.

The meta-message: you won't memorize these kernels — you'll recognize the principles in them. Every one is keep-data-on-chip, reuse-it, fuse-the-chain, hide-the-latency, tune-empirically. That's the durable skill this course was building toward.

This chapter covers:

Fused Layer Normalization: reduction + scale/shift in one pass
Low-Memory Dropout: regenerate the mask from a seed
Fused Attention (FlashAttention): attention without the N×N matrix
Persistent Kernels: amortizing launch overhead
Where to Go Next: extending your Triton skills

Chapter Roadmap


Fused Layer Normalization

Per-row mean/variance reduction + affine transform fused into one DRAM round-trip, like softmax.

Reduction + Affine in One Pass

Another recompute trade-off

Low-Memory Dropout

Regenerate the random mask from a seed instead of storing it — recompute to save memory.

Regenerate the Mask from a Seed

Fusion + online softmax + tiling combined

Fused Attention (FlashAttention)

Tiling + online softmax compute attention with O(N) memory — no N×N score matrix in DRAM.

Attention Without the N×N Matrix

Amortizing overhead

Persistent Kernels

Launch SM-filling programs that loop over work to amortize launch overhead and reuse state.

Resident Programs That Loop Over Work

Carrying the principles forward

Where to Go Next

Apply the principles: diagnose the regime, fuse memory-bound chains, tune against the roofline.

Extending Your Triton Skills

Layer norm is softmax's cousin: a per-row reduction (mean and variance) followed by element-wise normalization and an affine transform. Fusing it into one pass is the same move you learned for softmax, applied to a new computation.


Reduction + Affine in One Pass

$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta, \quad \mu = \frac{1}{N}\sum_i x_i, \quad \sigma^2 = \frac{1}{N}\sum_i (x_i - \mu)^2$

Layer norm normalizes each row (across its features) to zero mean and unit variance, then applies a learned scale $\gamma$ and shift $\beta$. Like softmax, it's a per-row operation: one program per row, load the row into SRAM, compute the mean and variance with on-chip reductions (tl.sum), normalize, apply the affine transform, and store, all in a single DRAM round-trip.
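The per-row recipe above can be sketched in plain Python as a reference for what each program computes. This is a host-side illustration, not a Triton kernel: the function names `layer_norm_row` and `layer_norm`, the default `eps=1e-5`, and the sample data are all assumptions made for the example.

```python
import math

def layer_norm_row(x, gamma, beta, eps=1e-5):
    """Normalize one row, then apply the affine transform (reference only)."""
    n = len(x)
    mean = sum(x) / n                            # first reduction over the row
    var = sum((v - mean) ** 2 for v in x) / n    # second reduction, row already "on chip"
    rstd = 1.0 / math.sqrt(var + eps)            # epsilon sits inside the square root
    y = [(v - mean) * rstd * g + b for v, g, b in zip(x, gamma, beta)]
    return y, mean, rstd                         # mean and rstd are what a backward pass would reuse

def layer_norm(rows, gamma, beta, eps=1e-5):
    # Each loop iteration stands in for one program instance handling one row.
    return [layer_norm_row(r, gamma, beta, eps)[0] for r in rows]

rows = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 14.0]]
out = layer_norm(rows, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With $\gamma = 1$ and $\beta = 0$, every output row should have a mean of approximately zero and a variance just under one (the small shortfall comes from $\epsilon$). In an actual kernel the two `sum` calls would be on-chip reductions over a block of the row held in SRAM.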

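Unfused, this would be several passes (mean pass, variance pass, normalize pass), each round-tripping through DRAM. Fused, it's read-once / write-once, just like softmax. When the whole row is resident on-chip, computing the mean first and then the centered sum of squares costs no extra DRAM traffic and is numerically well behaved. When a row is too large to hold at once, the variance can be accumulated in a single pass over the data, either as $E[x^2] - E[x]^2$ (cheap, but prone to cancellation when the mean is large relative to the spread) or via Welford's algorithm (the numerically careful choice). In every variant, $\epsilon$ is added inside the square root for stability. The backward pass is also commonly fused, reusing the saved $\mu$ and $1/\sqrt{\sigma^2+\epsilon}$.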
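To put rough numbers on the unfused-versus-fused comparison, here is a back-of-the-envelope traffic estimate. It rests on a deliberately simplified accounting model that is an assumption of this sketch: each unfused pass re-reads the full input, only the final pass writes output, and the small per-row scalars and the $\gamma$/$\beta$ vectors are ignored. The function name and the example shape are hypothetical.

```python
def dram_traffic_bytes(n_rows, n_cols, bytes_per_elem=4):
    """Return (unfused, fused) DRAM bytes under a simplified accounting model."""
    elems = n_rows * n_cols
    # Unfused: the mean, variance, and normalize passes each read x; only the last writes y.
    unfused = (3 * elems + elems) * bytes_per_elem
    # Fused: x is read once and y is written once.
    fused = (elems + elems) * bytes_per_elem
    return unfused, fused

unfused, fused = dram_traffic_bytes(n_rows=4096, n_cols=1024)
ratio = unfused / fused
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB, ratio: {ratio:.1f}x")
```

Under this model the fused kernel moves half the bytes. Because the operation does only a handful of arithmetic operations per element, a memory-bound kernel's runtime should track that byte count fairly closely, though measured speedups depend on the hardware and on how the unfused passes are actually implemented.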

Mathematical Intuition

Computing variance as $\sigma^2 = \frac{1}{N}\sum x_i^2 - \mu^2$ needs only the running sums $\sum x_i$ and $\sum x_i^2$, both obtainable in a single pass over the row, so mean and variance come from one load, not two. The catch is that this form subtracts two nearly equal numbers when $|\mu|$ is much larger than $\sigma$, which loses precision in floating point. Welford's online formula avoids that cancellation and is the more stable choice, while the single-pass moment method suffices when values are well-scaled; either keeps the kernel at one DRAM round-trip.
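The trade-off between the two single-pass methods can be seen in a small experiment. The sketch below is plain Python in double precision, with illustrative function names and test data chosen for this example. In a real kernel accumulating in float32, the same effect would be expected at much smaller offsets than the $10^9$ used here.

```python
def variance_moments(x):
    """Single pass: accumulate sum(x) and sum(x^2), then E[x^2] - E[x]^2."""
    n = len(x)
    s = sum(x)
    sq = sum(v * v for v in x)
    mean = s / n
    return mean, sq / n - mean * mean

def variance_welford(x):
    """Single pass: Welford's running mean and running sum of squared deviations."""
    mean = 0.0
    m2 = 0.0
    for k, v in enumerate(x, start=1):
        delta = v - mean
        mean += delta / k
        m2 += delta * (v - mean)   # uses the updated mean
    return mean, m2 / len(x)

well_scaled = [1.0, 2.0, 3.0, 4.0]           # true variance is 1.25
shifted = [v + 1e9 for v in well_scaled]     # same spread, huge mean

for name, data in [("well-scaled", well_scaled), ("shifted", shifted)]:
    _, v_mom = variance_moments(data)
    _, v_wel = variance_welford(data)
    print(f"{name}: moments={v_mom!r}, welford={v_wel!r}")
```

On the well-scaled data both methods agree. On the shifted data the moment method loses the variance to cancellation while Welford still recovers it, which is why the choice depends on how the inputs are scaled.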

Example:

Why is layer norm a memory-bound operation that benefits from fusion, and how does its kernel structure compare to the fused softmax kernel?
