Write the kernel that powers deep learning. Move from a memory-drowning naive matmul to a tiled kernel that stages blocks in SRAM, accumulates across the inner-K loop, reorders programs for L2 cache reuse (super-grouping), and approaches cuBLAS-level performance.
Matrix multiplication is the beating heart of deep learning — every linear layer, attention projection, and convolution-as-matmul runs through it. It's also the canonical compute-bound kernel, the opposite of softmax: with the right tiling it reuses each loaded value many times, pushing arithmetic intensity high enough to saturate the GPU's ALUs and Tensor Cores.
The journey of this chapter is the journey of GPU optimization in miniature. A naive matmul re-reads A and B from DRAM for every output element, drowning in memory traffic. A tiled matmul loads a block of A and a block of B into SRAM once and reuses them for a whole tile of outputs, accumulating the result across an inner-K loop. That single change moves the kernel from memory-bound to compute-bound. Then super-grouping — reordering which output tiles neighboring programs compute — keeps the reused A/B tiles hot in L2 cache, squeezing out the last big factor and bringing a hand-written Triton matmul within striking distance of cuBLAS.
This is the chapter where everything connects: the memory hierarchy (Ch 2), the tile/program model (Ch 3–4), and the roofline (Ch 5) all converge in one kernel.
This chapter covers:
Click any topic to jump in
Naive re-reads A and B from DRAM per output; tiling reuses on-chip blocks and raises intensity.
Each program owns a BMxBN tile; a float32 accumulator stays on-chip across the K loop.
Stream BK-blocks, acc += tl.dot(a, b); num_stages pipelines loads behind compute.
Reorder programs so shared A/B blocks stay hot in L2 — fewer DRAM re-reads, no extra FLOPs.
Tiling + mixed precision + pipelining + grouping + autotune → within a few % of cuBLAS.
The textbook three-nested-loop matmul is correct but catastrophically slow on a GPU, because it re-reads the same data from DRAM over and over. Tiling fixes the data reuse and is the difference between memory-bound and compute-bound.
Computing for and matrices requires FLOPs. The naive approach assigns each output its own loop over , reading a full row of A and column of B from DRAM — so each element of A is re-read times and each element of B is re-read times. The kernel is buried in redundant global-memory traffic: arithmetic intensity is low and it runs nowhere near peak.
Tiling restructures the work: partition C into BM×BN output tiles, and compute each tile by looping over K in steps of BK. At each step you load a BM×BK block of A and a BK×BN block of B into SRAM once, then every output in the tile reuses those on-chip blocks. Each loaded value now serves an entire tile of outputs instead of one — collapsing DRAM reads by roughly the tile dimension and raising arithmetic intensity into the compute-bound regime.
Naive arithmetic intensity is FLOP/byte — bandwidth-bound. With BM×BN tiles and BK depth, each DRAM load of a block is reused -ish times, giving . Choosing tiles so (the ridge point) is exactly what turns matmul compute-bound.
For C = A·B with M=N=K=4096, how many times does the naive kernel re-read each element of A from DRAM, and how does a 128×128 output tile change that?