Chapter 6: Kernel Fusion: Fused Softmax

Learn the single most important optimization for memory-bound workloads: fusing many operations into one kernel. Count the DRAM round-trips that make unfused code slow, implement softmax as a one-pass row-wise reduction, make it numerically stable, and measure the fused-vs-unfused speedup.

Chapter Overview

Softmax appears everywhere — attention, classification heads, mixture weights — and it's the perfect vehicle for learning kernel fusion, the optimization that defines practical Triton work.

A naive softmax written out of elementary PyTorch ops runs as a chain of separate kernels: find the row max and subtract it, exponentiate, sum, divide. Each kernel reads its input from DRAM and writes its output back to DRAM, so the data makes the slow round-trip multiple times. For a memory-bound operation, that traffic, not the arithmetic, accounts for nearly all of the runtime.

A fused kernel loads each row from DRAM exactly once, performs the whole computation (max → exp → sum → divide) while the row sits in fast on-chip SRAM, and writes the result once. The math is identical; the memory traffic shrinks by a factor of roughly the number of passes fused. That's the whole game.

Along the way you'll implement softmax as a row-wise reduction (each program owns one row), and you'll make it numerically stable by subtracting the row max before exponentiating — without which large logits overflow to infinity. By the end you'll understand why fusion is the first thing a Triton programmer reaches for.
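To make the stability point concrete, here is a minimal pure-Python sketch of one softmax row, with and without the max subtraction. The function names are illustrative and not part of any library. Note one difference from GPU code: Python's `math.exp` raises `OverflowError` on large inputs, whereas float32 arithmetic on a GPU would typically produce `inf` and then `nan` after the division.

```python
import math

def softmax_naive(row):
    # Direct definition: exp(x_i) / sum_j exp(x_j).
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(row):
    # Subtract the row max first, so the largest exponent is exp(0) = 1.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # the same row shifted by 999

# On small logits the two versions agree to floating-point tolerance.
agree_on_small = all(
    math.isclose(a, b) for a, b in zip(softmax_naive(small), softmax_stable(small))
)

# On large logits the naive version overflows.
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# The stable version is shift-invariant: both rows give the same result.
stable_small = softmax_stable(small)
stable_large = softmax_stable(large)
```

Because the shifted differences `x - row_max` are identical for both rows, the stable version returns the same probabilities for `small` and `large`, at the cost of one extra max reduction per row.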

This chapter covers:

The Cost of DRAM Round-Trips: why unfused chains are slow
What Is Kernel Fusion?: one load, all the work, one store
The Row-Wise Reduction Pattern: a program per row
Numerically Stable Softmax: subtracting the max
Fused vs Unfused: measuring the win

Chapter Roadmap

Cost of DRAM Round-Trips

Unfused chains move the data to DRAM and back per op; traffic = runtime for memory-bound work.

Counting the Traffic

The fix for the traffic

What Is Kernel Fusion?

One load, all the work in SRAM, one store — intermediates never hit DRAM.

One Load, All the Work, One Store

How to fuse a reduction

Row-Wise Reduction Pattern

One program per row; tl.max/tl.sum reduce on-chip. Generalizes to layer norm & attention.

A Program per Row

Making the reduction correct

Numerically Stable Softmax

Subtract the row max before exp — shift-invariant, prevents overflow, essentially free.

Subtract the Max

Proving the speedup

Fused vs Unfused

Measure the win: a fused softmax nears the bandwidth roofline by moving the minimum bytes.

Measuring the Win

Every time data leaves the chip and comes back, you pay the slow DRAM latency and consume precious bandwidth. For memory-bound ops, the number of these round-trips is a direct proxy for runtime.

Counting the Traffic

Consider softmax done as separate ops over an $M \times N$ matrix: (1) read X and compute the row max, writing only a small per-row result (though PyTorch may still materialize intermediates); (2) read X again, subtract the max, exponentiate, and write a temporary; (3) read the temporary and sum per row; (4) read the temporary again, divide, and write the output. Each full-matrix read or write moves $\approx MN$ values to or from DRAM, so four reads plus two to four writes come to $6MN$ to $8MN$ values. The fused version moves the matrix in once and out once, roughly $2MN$ values total.
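The pass-by-pass count can be written down as a small ledger. This is a bookkeeping sketch, not a measurement: the pass lists and the `traffic` helper are hypothetical, per-row vectors (the row max and row sum, only $M$ values each) are ignored, and the "materialized" variant simply assumes every pass writes a full-size intermediate.

```python
# Each pass is (name, full-matrix reads, full-matrix writes), in units of M*N values.
UNFUSED_LEAN = [
    ("row max", 1, 0),
    ("subtract max + exp", 1, 1),
    ("row sum", 1, 0),
    ("divide", 1, 1),
]

# Pessimistic assumption: every pass also writes a full-size intermediate.
UNFUSED_MATERIALIZED = [(name, 1, 1) for name, _, _ in UNFUSED_LEAN]

# Fused kernel: one load of the matrix, one store of the result.
FUSED = [("load, max, exp, sum, divide, store", 1, 1)]

def traffic(passes):
    """Total DRAM traffic in units of M*N values."""
    return sum(reads + writes for _, reads, writes in passes)

lean = traffic(UNFUSED_LEAN)                   # 6
materialized = traffic(UNFUSED_MATERIALIZED)   # 8
fused = traffic(FUSED)                         # 2

print(f"unfused: {lean} to {materialized} MN, fused: {fused} MN")
print(f"traffic reduction: {lean / fused:.0f}x to {materialized / fused:.0f}x")
```

Under these assumptions the ledger reproduces the $6MN$ to $8MN$ range for the unfused chain and $2MN$ for the fused kernel.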

For a memory-bound op (and softmax, with its low arithmetic intensity, is exactly that), runtime $\approx \frac{\text{bytes moved}}{\text{bandwidth}}$. Cutting the bytes by 3–4× cuts the runtime by roughly 3–4×. The arithmetic (a few exps and adds per element) is essentially free by comparison; the data movement is the cost.
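The bytes-over-bandwidth estimate is easy to turn into a back-of-the-envelope calculator. The sketch below is illustrative only: the matrix shape, the 2-byte element size, and the 1 TB/s bandwidth are assumed values, not figures from this chapter or from any specific GPU, and `estimated_runtime_s` is a hypothetical helper.

```python
def estimated_runtime_s(values_moved, bytes_per_value, bandwidth_bytes_per_s):
    """Bandwidth-bound estimate: runtime ~= bytes moved / bandwidth."""
    return values_moved * bytes_per_value / bandwidth_bytes_per_s

# Hypothetical workload and hardware, chosen only for illustration.
M, N = 8192, 8192
BYTES_PER_VALUE = 2      # e.g. float16
BANDWIDTH = 1.0e12       # assumed 1 TB/s of DRAM bandwidth

# Use the lean unfused count (6MN) against the fused count (2MN).
unfused_s = estimated_runtime_s(6 * M * N, BYTES_PER_VALUE, BANDWIDTH)
fused_s = estimated_runtime_s(2 * M * N, BYTES_PER_VALUE, BANDWIDTH)

print(f"unfused: {unfused_s * 1e3:.3f} ms")
print(f"fused:   {fused_s * 1e3:.3f} ms")
print(f"ratio:   {unfused_s / fused_s:.1f}x")
```

The absolute times depend entirely on the assumed bandwidth; the ratio does not, which is why the traffic count alone predicts the speedup.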

Mathematical Intuition

If an operation is a chain of $k$ memory-bound passes over $n$ elements, unfused traffic is $\approx 2kn$ (one read and one write per pass) while fused traffic is $\approx 2n$. The speedup ceiling for a bandwidth-bound op is therefore $\approx k$: it grows linearly with the number of passes you collapse into one kernel.
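Written out as a ratio of estimated runtimes, the ceiling follows in one line. The symbols $s$ (bytes per element) and $B$ (DRAM bandwidth) are introduced here only for the derivation; both cancel.

$$
\text{speedup} \approx \frac{T_{\text{unfused}}}{T_{\text{fused}}} = \frac{2kn \cdot s / B}{2n \cdot s / B} = k
$$

For the four-pass softmax chain, $k = 4$ gives the $8MN$ upper count and a 4× ceiling; when the two reduction passes write no full-size output, the traffic drops to $6MN$ and the ceiling to 3×.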

Example:

An unfused softmax over a 4096×4096 float32 matrix makes 4 passes that each read+write the matrix. How many bytes hit DRAM, and how much does fusing to a single load+store save?
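One way to work the example is to compute the byte counts directly. This sketch uses only the numbers stated in the question (a 4096×4096 float32 matrix, 4 passes, each reading and writing the full matrix); the variable names are illustrative.

```python
M, N = 4096, 4096
BYTES_PER_FLOAT32 = 4
PASSES = 4
MIB = 1024 ** 2

matrix_bytes = M * N * BYTES_PER_FLOAT32      # one full copy of the matrix
unfused_bytes = PASSES * 2 * matrix_bytes     # each pass: one read + one write
fused_bytes = 2 * matrix_bytes                # one load + one store
saved_bytes = unfused_bytes - fused_bytes

print(f"matrix:  {matrix_bytes // MIB} MiB")            # 64 MiB
print(f"unfused: {unfused_bytes // MIB} MiB")           # 512 MiB
print(f"fused:   {fused_bytes // MIB} MiB")             # 128 MiB
print(f"saved:   {saved_bytes // MIB} MiB")             # 384 MiB
print(f"ratio:   {unfused_bytes / fused_bytes:.0f}x")   # 4x
```

So the unfused chain moves 512 MiB through DRAM, the fused kernel moves 128 MiB, and fusion saves 384 MiB, a 4× reduction that matches $k = 4$.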
