Bring it all together with the kernels that matter in real models: fused layer normalization, memory-efficient seeded dropout, the ideas behind fused attention (FlashAttention), persistent kernels, and a roadmap for going further with Triton.
You now have the full toolkit — the hardware model, the Triton programming model, fusion, reductions, tiling, and tuning. This final chapter applies it to the kernels that show up in every transformer, and points you toward what's next.
Fused layer normalization combines the softmax-style row reduction (mean, variance) with element-wise scale-and-shift in a single pass — the same fusion principle, applied again. Low-memory dropout shows a beautiful trick: instead of storing a giant random mask, regenerate it on the fly from a seed, trading a little compute for a lot of memory. Fused attention (FlashAttention) is the capstone idea: it combines tiling, the online-softmax running-max/running-sum trick from Chapter 6, and fusion to compute attention without ever materializing the score matrix — the optimization that made long-context transformers practical. Persistent kernels keep programs resident and loop over work to amortize launch overhead.
The meta-message: you won't memorize these kernels — you'll recognize the principles in them. Every one is keep-data-on-chip, reuse-it, fuse-the-chain, hide-the-latency, tune-empirically. That's the durable skill this course was building toward.
This chapter covers:
Click any topic to jump in
Per-row mean/variance reduction + affine transform fused into one DRAM round-trip, like softmax.
Regenerate the random mask from a seed instead of storing it — recompute to save memory.
Tiling + online softmax compute attention with O(N) memory — no N×N score matrix in DRAM.
Launch SM-filling programs that loop over work to amortize launch overhead and reuse state.
Apply the principles: diagnose the regime, fuse memory-bound chains, tune against the roofline.
Layer norm is softmax's cousin: a per-row reduction (mean and variance) followed by element-wise normalization and an affine transform. Fusing it into one pass is the same move you learned for softmax, applied to a new computation.
Layer norm normalizes each feature row to zero mean and unit variance, then applies a learned scale and shift . Like softmax, it's a per-row operation: one program per row, load the row into SRAM, compute the mean and variance with on-chip reductions (tl.sum), normalize, apply the affine transform, and store — all in a single DRAM round-trip.
Unfused, this would be several passes (mean pass, variance pass, normalize pass), each round-tripping through DRAM. Fused, it's read-once / write-once, just like softmax. The numerically careful version computes variance as or via Welford's algorithm to avoid a second pass over the data, and adds inside the square root for stability. The backward pass is also commonly fused, reusing the saved and .
Computing variance as needs only the running sums and , both obtainable in a single pass over the row — so mean and variance come from one load, not two. Welford's online formula is more numerically stable for large but the single-pass moment method suffices when values are well-scaled; either keeps the kernel at one DRAM round-trip.
Why is layer norm a memory-bound operation that benefits from fusion, and how does its kernel structure compare to the fused softmax kernel?