Make memory-bound kernels fast: write coalesced access patterns, let @triton.autotune search block sizes and num_warps for you, benchmark properly with triton.testing, read the roofline to know your ceiling, and diagnose why a kernel is slow.
Your vector-add kernel is correct — but is it fast? Most simple kernels are memory-bound, so their speed is decided almost entirely by how efficiently they move data, not by the arithmetic. This chapter is about turning a working kernel into a fast one and, just as importantly, measuring whether you actually succeeded.
Three skills come together here. First, coalesced memory access: structuring offsets so a warp's reads hit contiguous addresses and saturate DRAM bandwidth. Second, autotuning: instead of guessing BLOCK_SIZE and num_warps, you hand Triton a list of candidate configurations, and @triton.autotune benchmarks them on real hardware and caches the winner for each distinct value of its key arguments (typically the input shape). Third, benchmarking and the roofline: using triton.testing.do_bench to measure honestly (with warmup and synchronization), converting the measured time to effective bandwidth or FLOP/s, and placing the result on the roofline to see how much headroom is left.
The theme is intellectual honesty: a kernel is only as fast as your benchmark proves it is, and you only know it's 'good enough' when it's near the roofline ceiling for its arithmetic intensity.
This chapter covers:
Build offsets along the contiguous axis so a warp's reads become one wide transaction.
@triton.autotune benchmarks candidate configs per shape and caches the fastest.
do_bench warms up, repeats, and syncs — measure bandwidth/FLOP/s, not enqueue time.
Plot arithmetic intensity I against achieved performance to see your ceiling and which knob (memory vs compute) to pull.
Top-down: confirm the bound, then attack only the binding constraint.
For a memory-bound kernel, the access pattern is the performance. Triton makes coalescing easy — build offsets along the contiguous dimension — but it's just as easy to accidentally stride and lose most of your bandwidth.
When you write offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) and load x_ptr + offs, consecutive lanes read consecutive addresses — perfectly coalesced. The hardware fuses the warp's reads into the minimum number of wide transactions, hitting near-peak bandwidth (recall Chapter 2).
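The effect of that offset expression can be checked without a GPU. The sketch below is a plain-Python model, not Triton code: it assumes 32 lanes per warp, 4-byte elements, 128-byte memory transactions, and that consecutive elements of the block map to consecutive lanes (a simplification, since Triton's compiler chooses the real layout). The helper names program_offsets and transactions_per_warp are hypothetical.

```python
# Assumed hardware model: 32 lanes per warp, fp32 elements, 128-byte transactions.
WARP_SIZE = 32
ELEM_BYTES = 4
TRANSACTION_BYTES = 128


def program_offsets(pid, block_size):
    """Element offsets for one program: pid * BLOCK_SIZE + arange(0, BLOCK_SIZE)."""
    return [pid * block_size + i for i in range(block_size)]


def transactions_per_warp(offsets):
    """Count the distinct 128-byte segments touched by each group of 32 lanes."""
    counts = []
    for start in range(0, len(offsets), WARP_SIZE):
        lanes = offsets[start:start + WARP_SIZE]
        segments = {(off * ELEM_BYTES) // TRANSACTION_BYTES for off in lanes}
        counts.append(len(segments))
    return counts


offs = program_offsets(pid=3, block_size=128)

# Contiguous offsets: each warp's 32 reads fall inside a single segment.
contiguous = transactions_per_warp(offs)

# Same offsets scaled by 32 elements: every lane lands in its own segment.
strided = transactions_per_warp([o * 32 for o in offs])

print(contiguous)  # one transaction per warp
print(strided)     # 32 transactions per warp
```

Under these assumptions the contiguous pattern needs one transaction per warp, while the strided one needs a separate transaction for every lane.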
The danger appears in multi-dimensional kernels. For a row-major 2D tensor, indexing along the last (contiguous) axis stays coalesced; indexing along the first axis multiplies offsets by the row stride, scattering them. The fix is to build your offset expression so that the fastest-varying tl.arange runs along the contiguous dimension — e.g. row[:, None] * stride + col[None, :] with col contiguous. When the natural access is strided (like a transpose), load a tile coalesced into SRAM and rearrange it on-chip, where there's no coalescing penalty.
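To see why the placement of the fastest-varying index matters, here is a small plain-Python sketch that mirrors the broadcast expression row[:, None] * stride + col[None, :] with nested lists. The matrix shape and the helper names tile_offsets and lane_gap are illustrative assumptions, and the inner list stands in for the lanes that vary fastest.

```python
def tile_offsets(slow, fast, slow_stride, fast_stride):
    """Nested-list version of slow[:, None] * slow_stride + fast[None, :] * fast_stride."""
    return [[s * slow_stride + f * fast_stride for f in fast] for s in slow]


def lane_gap(tile):
    """Distance in elements between neighbours along the fastest-varying axis."""
    return tile[0][1] - tile[0][0]


n_cols = 256  # hypothetical row-major matrix with 256 columns (row stride = 256)

# Fastest-varying index runs along the contiguous last axis: neighbours are adjacent.
good = tile_offsets(range(4), range(8), slow_stride=n_cols, fast_stride=1)

# Fastest-varying index runs along the first axis: neighbours are a full row apart.
bad = tile_offsets(range(4), range(8), slow_stride=1, fast_stride=n_cols)

print(good[0], lane_gap(good))
print(bad[0], lane_gap(bad))
```

A gap of 1 element is the coalesced case. A gap equal to the row stride is the scattered case that the tile-and-rearrange approach is meant to avoid.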
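Effective bandwidth $= \text{peak bandwidth} \times \dfrac{\text{useful bytes}}{\text{bytes fetched}}$. Coalesced access makes the ratio $\approx 1$. A stride that lands one used 4-byte element per 128-byte transaction makes it $4/128 = 1/32$. Since a memory-bound kernel's runtime is (bytes fetched)/(bandwidth), uncoalesced access multiplies runtime by the inverse of that ratio, which is $32\times$ in this example.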
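The arithmetic can be worked through for the 4-bytes-per-128-byte-transaction case. This is a back-of-the-envelope sketch: the 1 TB/s peak bandwidth and the 1 GiB working set are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
def coalescing_efficiency(used_bytes, transaction_bytes):
    """Fraction of each fetched transaction that the kernel actually uses."""
    return used_bytes / transaction_bytes


def runtime_seconds(useful_bytes, peak_bandwidth, efficiency):
    """Memory-bound runtime: bytes fetched divided by bandwidth."""
    bytes_fetched = useful_bytes / efficiency
    return bytes_fetched / peak_bandwidth


PEAK_BW = 1.0e12   # assumed peak DRAM bandwidth in bytes/s (hypothetical)
USEFUL = 1 << 30   # assumed useful data moved: 1 GiB

coalesced_eff = coalescing_efficiency(128, 128)  # every fetched byte is used
strided_eff = coalescing_efficiency(4, 128)      # one 4-byte element per transaction

t_coalesced = runtime_seconds(USEFUL, PEAK_BW, coalesced_eff)
t_strided = runtime_seconds(USEFUL, PEAK_BW, strided_eff)
slowdown = t_strided / t_coalesced

print(f"efficiency: {strided_eff}, slowdown: {slowdown:.0f}x")
```

The slowdown depends only on the efficiency ratio, so the assumed bandwidth and data size cancel out of the final factor.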
Two kernels normalize a row-major matrix. Kernel A assigns each program a row and reads it with arange along columns; Kernel B assigns each program a column and reads down it with arange × n_cols. Which is coalesced and why does it matter for a memory-bound op?