Chapter 7: Matrix Multiplication

Write the kernel that powers deep learning. Move from a memory-drowning naive matmul to a tiled kernel that stages blocks in SRAM, accumulates across the inner-K loop, reorders programs for L2 cache reuse (super-grouping), and approaches cuBLAS-level performance.

Chapter Overview

Matrix multiplication is the beating heart of deep learning — every linear layer, attention projection, and convolution-as-matmul runs through it. It's also the canonical compute-bound kernel, the opposite of softmax: with the right tiling it reuses each loaded value many times, pushing arithmetic intensity high enough to saturate the GPU's ALUs and Tensor Cores.

The journey of this chapter is the journey of GPU optimization in miniature. A naive matmul re-reads A and B from DRAM for every output element, drowning in memory traffic. A tiled matmul loads a block of A and a block of B into SRAM once and reuses them for a whole tile of outputs, accumulating the result across an inner-K loop. That single change moves the kernel from memory-bound to compute-bound. Then super-grouping — reordering which output tiles neighboring programs compute — keeps the reused A/B tiles hot in L2 cache, squeezing out the last big factor and bringing a hand-written Triton matmul within striking distance of cuBLAS.

This is the chapter where everything connects: the memory hierarchy (Ch 2), the tile/program model (Ch 3–4), and the roofline (Ch 5) all converge in one kernel.

This chapter covers:

Naive vs Tiled Matmul: why the obvious version is slow
Blocking & SRAM Accumulation: load once, reuse many times
The Inner-K Loop: accumulating partial products
L2 Cache Super-Grouping: program order for cache reuse
Approaching cuBLAS: how close can you get?

Chapter Roadmap

Click any topic to jump in

Naive vs Tiled Matmul

Naive re-reads A and B from DRAM per output; tiling reuses on-chip blocks and raises intensity.

Reuse Is Everything

How tiling is implemented

Blocking & SRAM Accumulation

Each program owns a BMxBN tile; a float32 accumulator stays on-chip across the K loop.

Keep the Accumulator On-Chip

The loop that accumulates

The Inner-K Loop

Stream BK-blocks, acc += tl.dot(a, b); num_stages pipelines loads behind compute.

Accumulating Partial Products

Squeezing the cache

L2 Cache Super-Grouping

Reorder programs so shared A/B blocks stay hot in L2 — fewer DRAM re-reads, no extra FLOPs.

Program Order for Cache Reuse

Reaching peak performance

Approaching cuBLAS

Tiling + mixed precision + pipelining + grouping + autotune → within a few % of cuBLAS.

Putting It All Together

The textbook three-nested-loop matmul is correct but catastrophically slow on a GPU, because it re-reads the same data from DRAM over and over. Tiling fixes the data reuse and is the difference between memory-bound and compute-bound.

1 of 1

Reuse Is Everything

$C_{ij} = \sum_{k=0}^{K-1} A_{ik} B_{kj}$

Computing $C = AB$ for $M\times K$ and $K\times N$ matrices requires $2MNK$ FLOPs. The naive approach assigns each output $C_{ij}$ its own loop over $k$ , reading a full row of A and column of B from DRAM — so each element of A is re-read $N$ times and each element of B is re-read $M$ times. The kernel is buried in redundant global-memory traffic: arithmetic intensity is low and it runs nowhere near peak.

Tiling restructures the work: partition C into BM×BN output tiles, and compute each tile by looping over K in steps of BK. At each step you load a BM×BK block of A and a BK×BN block of B into SRAM once, then every output in the tile reuses those on-chip blocks. Each loaded value now serves an entire tile of outputs instead of one — collapsing DRAM reads by roughly the tile dimension and raising arithmetic intensity into the compute-bound regime.

Mathematical Intuition

Naive arithmetic intensity is $I \approx \frac{2MNK}{(MK + KN)\cdot N \text{-ish reuse}} = O(1)$ FLOP/byte — bandwidth-bound. With BM×BN tiles and BK depth, each DRAM load of a block is reused $\min(BM, BN)$ -ish times, giving $I = O(\text{tile dim})$ . Choosing tiles so $I > I^*$ (the ridge point) is exactly what turns matmul compute-bound.

Example:

For C = A·B with M=N=K=4096, how many times does the naive kernel re-read each element of A from DRAM, and how does a 128×128 output tile change that?

Kernel Fusion: Fused Softmax

Advanced Kernels & Optimization