Chapter 5: Memory Access, Autotuning & Benchmarking

Make memory-bound kernels fast: write coalesced access patterns, let @triton.autotune search block sizes and num_warps for you, benchmark properly with triton.testing, read the roofline to know your ceiling, and diagnose why a kernel is slow.

Chapter Overview

Your vector-add kernel is correct — but is it fast? Most simple kernels are memory-bound, so their speed is decided almost entirely by how efficiently they move data, not by the arithmetic. This chapter is about turning a working kernel into a fast one and, just as importantly, measuring whether you actually succeeded.

Three skills come together here. First, coalesced memory access: structuring offsets so a warp's reads hit contiguous addresses and saturate DRAM bandwidth. Second, autotuning: instead of guessing BLOCK_SIZE and num_warps, you hand Triton a list of candidate configurations, and @triton.autotune benchmarks them on real hardware and caches the winner for each distinct value of the arguments you list in key (typically the problem size). Third, benchmarking and the roofline: using triton.testing.do_bench to measure honestly (with warmup and synchronization), converting the measured time to effective bandwidth or FLOP/s, and placing the result on the roofline to see how much headroom remains.

The theme is intellectual honesty: a kernel is only as fast as your benchmark proves it is, and you only know it's 'good enough' when it's near the roofline ceiling for its arithmetic intensity.
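To make the conversion step concrete, here is a minimal sketch of turning a measured time into effective bandwidth and comparing it with a roofline ceiling. The helper names are hypothetical, and the timing and peak figures are placeholder assumptions rather than measurements or real hardware specs. In practice the time would come from triton.testing.do_bench.

```python
def effective_bandwidth_gbs(bytes_moved: int, time_ms: float) -> float:
    """Effective bandwidth in GB/s from bytes moved and a time in milliseconds."""
    return bytes_moved / (time_ms * 1e-3) / 1e9


def roofline_ceiling_gflops(intensity: float, peak_gflops: float, peak_gbs: float) -> float:
    """Attainable GFLOP/s: the lower of the compute roof and intensity * bandwidth."""
    return min(peak_gflops, intensity * peak_gbs)


# Vector add on n float32 elements: read x, read y, write out -> 3 * n * 4 bytes,
# and one add per element -> n FLOPs.
n = 1 << 24
bytes_moved = 3 * n * 4
flops = n
intensity = flops / bytes_moved  # FLOP per byte; 1/12 for fp32 vector add

# Placeholder numbers (assumptions, not measurements or real hardware specs).
time_ms = 0.25
peak_gbs = 1000.0
peak_gflops = 20000.0

achieved_gbs = effective_bandwidth_gbs(bytes_moved, time_ms)
achieved_gflops = flops / (time_ms * 1e-3) / 1e9
ceiling = roofline_ceiling_gflops(intensity, peak_gflops, peak_gbs)

print(f"intensity           : {intensity:.4f} FLOP/byte")
print(f"achieved            : {achieved_gbs:.1f} GB/s, {achieved_gflops:.1f} GFLOP/s")
print(f"roofline ceiling    : {ceiling:.1f} GFLOP/s")
print(f"fraction of ceiling : {achieved_gflops / ceiling:.0%}")
```

Because the intensity is so low, the ceiling is set by bandwidth rather than by the compute roof, so for a kernel like this the fraction of the ceiling equals the fraction of peak bandwidth achieved.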

This chapter covers:

Coalesced Memory Access: hitting peak bandwidth
Autotuning Kernels: let the compiler pick the config
Benchmarking with triton.testing: measuring without fooling yourself
Reading the Roofline: how close to the ceiling are you?
Debugging Performance: finding the real bottleneck

Chapter Roadmap

Coalesced Memory Access: Build offsets along the contiguous axis so a warp's reads become one wide transaction. Topic: Contiguous Offsets by Construction.
Autotuning Kernels (searching the config space): @triton.autotune benchmarks candidate configs per shape and caches the fastest. Topic: @triton.autotune and Configs.
Benchmarking (proving it actually got faster): do_bench warms up, repeats, and syncs, so you measure bandwidth or FLOP/s, not enqueue time. Topic: Measuring Honestly.
Reading the Roofline (knowing the ceiling): Plot arithmetic intensity (I) against performance to see your ceiling and which knob (memory vs compute) to pull. Topic: Where Does Your Kernel Sit?
Debugging Performance (fixing the real bottleneck): Work top-down: confirm the bound, then attack only the binding constraint. Topic: A Systematic Checklist.

Coalesced Memory Access

For a memory-bound kernel, the access pattern is the performance. Triton makes coalescing easy (build offsets along the contiguous dimension), but it's just as easy to accidentally stride and lose most of your bandwidth.

Contiguous Offsets by Construction

When you write offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) and load x_ptr + offs, consecutive lanes read consecutive addresses — perfectly coalesced. The hardware fuses the warp's reads into the minimum number of wide transactions, hitting near-peak bandwidth (recall Chapter 2).

The danger appears in multi-dimensional kernels. For a row-major 2D tensor, indexing along the last (contiguous) axis stays coalesced; indexing along the first axis multiplies offsets by the row stride, scattering them. The fix is to build your offset expression so that the fastest-varying tl.arange runs along the contiguous dimension — e.g. row[:, None] * stride + col[None, :] with col contiguous. When the natural access is strided (like a transpose), load a tile coalesced into SRAM and rearrange it on-chip, where there's no coalescing penalty.
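The offset expression above can be checked on the host before writing any kernel. The sketch below is a plain-Python model of the index arithmetic, not Triton code: it mirrors row[:, None] * stride + col[None, :] with nested lists and inspects the step between neighbouring offsets along each axis. The tile shape and the stride of 8 are arbitrary assumptions for illustration.

```python
def tile_offsets(rows, cols, stride):
    """Mirror row[:, None] * stride + col[None, :] using nested lists."""
    return [[r * stride + c for c in cols] for r in rows]


n_cols = 8          # row stride of a small row-major matrix (assumed)
rows = range(2, 4)  # a tile covering rows 2 and 3
cols = range(0, 4)  # four columns along the contiguous axis

offs = tile_offsets(rows, cols, n_cols)

# Step between neighbours along the last axis: 1 element, so reads are contiguous.
inner_steps = {b - a for row in offs for a, b in zip(row, row[1:])}

# Step between neighbours along the first axis: the row stride, so reads scatter.
outer_steps = {b - a for top, bottom in zip(offs, offs[1:]) for a, b in zip(top, bottom)}

print("offsets    :", offs)
print("inner step :", inner_steps)
print("outer step :", outer_steps)
```

If the fastest-varying index in a kernel corresponds to the inner step, neighbouring lanes touch neighbouring elements. If it corresponds to the outer step, each lane lands a full row stride away from the previous one.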

Mathematical Intuition

Effective bandwidth $= \text{peak} \times \frac{\text{useful bytes}}{\text{fetched bytes}}$. Coalesced access makes the ratio $\approx 1$. A stride that lands one used 4-byte element per 128-byte transaction makes it $\frac{4}{128} \approx 3\%$. Since a memory-bound kernel's runtime is (bytes fetched)/(bandwidth), uncoalesced access multiplies runtime by the inverse of that ratio: about $32\times$ in this case.
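The arithmetic can be worked through for several strides. This is a simplified sketch of the model in the text: it assumes aligned accesses, a stride of at least one element, and 128-byte transactions, and it ignores caches. The function names, the 1000 GB/s peak, and the 1 GiB payload are placeholder assumptions.

```python
def access_efficiency(elem_bytes: int, stride_elems: int, txn_bytes: int = 128) -> float:
    """Useful bytes / fetched bytes when one element is used every `stride_elems`."""
    stride_bytes = elem_bytes * stride_elems
    # Neighbouring elements share a transaction until the stride spans a whole one.
    fetched_per_elem = min(stride_bytes, txn_bytes)
    return elem_bytes / fetched_per_elem


def runtime_ms(useful_bytes: int, efficiency: float, peak_gbs: float) -> float:
    """Memory-bound runtime: bytes actually fetched divided by peak bandwidth."""
    fetched_bytes = useful_bytes / efficiency
    return fetched_bytes / (peak_gbs * 1e9) * 1e3


PEAK_GBS = 1000.0        # placeholder peak bandwidth, not a real device spec
USEFUL_BYTES = 1 << 30   # 1 GiB of data the kernel actually needs

for stride in (1, 2, 8, 32):
    eff = access_efficiency(4, stride)  # 4-byte (float32) elements
    t = runtime_ms(USEFUL_BYTES, eff, PEAK_GBS)
    print(f"stride {stride:>2}: efficiency {eff:6.1%}, runtime {t:6.2f} ms")
```

Under these assumptions, the runtime grows in direct proportion to the stride until each used element occupies its own transaction, after which the efficiency stays at 4/128.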

Example:

Two kernels normalize a row-major matrix. Kernel A assigns each program a row and reads it with arange along columns; Kernel B assigns each program a column and reads down it with arange × n_cols. Which is coalesced and why does it matter for a memory-bound op?
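One way to reason about the question is to enumerate the addresses a single warp would touch in each kernel. The sketch below is a host-side model rather than either kernel itself. It assumes float32 data, a 128-byte-aligned base pointer, 32 lanes per warp, and a hypothetical matrix width of 1024.

```python
WARP = 32         # lanes per warp
ELEM_BYTES = 4    # float32
TXN_BYTES = 128   # transaction size used in this chapter's model


def transactions(offsets_elems):
    """Count the distinct 128-byte segments touched by one warp's element offsets."""
    return len({(off * ELEM_BYTES) // TXN_BYTES for off in offsets_elems})


n_cols = 1024     # assumed matrix width (row stride in elements)
pid = 3           # an arbitrary program id

# Kernel A: program `pid` owns row `pid`; arange runs along the columns.
offs_a = [pid * n_cols + i for i in range(WARP)]

# Kernel B: program `pid` owns column `pid`; arange is multiplied by n_cols.
offs_b = [i * n_cols + pid for i in range(WARP)]

txn_a = transactions(offs_a)
txn_b = transactions(offs_b)

print(f"Kernel A: {txn_a} transaction(s) for {WARP} elements")
print(f"Kernel B: {txn_b} transaction(s) for {WARP} elements")
```

In this model, Kernel A's lanes share one segment while each of Kernel B's lanes lands in its own. Since the operation is memory-bound, that difference in bytes fetched translates almost directly into runtime.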
