Build the canonical Triton kernel end to end: the anatomy of a @triton.jit function, computing per-program offsets with program_id and arange, loading and storing tiles through pointers, masking the ragged tail, choosing BLOCK_SIZE, and launching the grid from PyTorch.
Vector addition — out = x + y — is the 'hello world' of GPU programming. It's trivial math, which is exactly the point: with the arithmetic out of the way, every line of the kernel is about the mechanics of GPU programming that you'll reuse in every kernel afterward.
By the end of this chapter you will understand each line of the canonical add_kernel: how tl.program_id tells a program which slice of the data it owns, how tl.arange builds a vector of offsets, how pointer arithmetic plus tl.load/tl.store move whole tiles between DRAM and on-chip memory, why the last program needs a mask, and how BLOCK_SIZE and triton.cdiv set up the launch grid.
This is the template. Fused softmax, layer norm, and matmul are all elaborations of the same skeleton: figure out which tile you own, load it, compute on it in fast memory, store it back. Master vector add and the rest of the course is variations on a theme.
This chapter covers:
Click any topic to jump in
The reusable five-step skeleton: id → offsets → mask → load → compute+store.
program_id × BLOCK_SIZE + arange gives the tile of indices this program owns.
tl.load/tl.store move tiles; mask = offs < n keeps the ragged tail safe.
The first perf knob — trades per-program overhead against occupancy and register use.
Host side: allocate output, grid = cdiv(n, B), launch with [grid]; sync to time it.
Almost every elementwise Triton kernel follows the same five-step skeleton. Learn it once and you can read — and write — most kernels you'll meet.
A typical Triton kernel does, in order: (1) get the program id; (2) compute the offsets of the tile this program owns; (3) build a mask for any out-of-bounds offsets; (4) tl.load the input tiles into on-chip memory; (5) compute, then tl.store the result back. The kernel body is pure tile algebra — no thread indices, no barriers.
The inputs are passed as pointers (PyTorch tensors decay to their data pointer when launched), plus runtime sizes and tl.constexpr tile dimensions. The kernel writes its results in place to an output pointer; Triton kernels return nothing. This skeleton is what makes Triton readable: you can scan any kernel and immediately find the offset computation, the loads, the math, and the store.
The skeleton separates addressing (steps 1–3, pure index arithmetic independent of the math) from data movement (4 and the store) from computation (5). This separation is why Triton can optimize addressing and movement automatically while you only ever touch the computation line — the part that's actually unique to your kernel.
Sketch the skeleton of a kernel that computes out = relu(x) for a 1D tensor. Which steps differ from vector add?