Chapter 4: Your First Kernel: Vector Addition

Build the canonical Triton kernel end to end: the anatomy of a @triton.jit function, computing per-program offsets with program_id and arange, loading and storing tiles through pointers, masking the ragged tail, choosing BLOCK_SIZE, and launching the grid from PyTorch.

Chapter Overview

Vector addition — out = x + y — is the 'hello world' of GPU programming. It's trivial math, which is exactly the point: with the arithmetic out of the way, every line of the kernel is about the mechanics of GPU programming that you'll reuse in every kernel afterward.

By the end of this chapter you will understand each line of the canonical add_kernel: how tl.program_id tells a program which slice of the data it owns, how tl.arange builds a vector of offsets, how pointer arithmetic plus tl.load/tl.store move whole tiles between DRAM and on-chip memory, why the last program needs a mask, and how BLOCK_SIZE and triton.cdiv set up the launch grid.

This is the template. Fused softmax, layer norm, and matmul are all elaborations of the same skeleton: figure out which tile you own, load it, compute on it in fast memory, store it back. Master vector add and the rest of the course is variations on a theme.

This chapter covers:

Anatomy of a Triton Kernel: the reusable skeleton
Program IDs & Offsets: which data does this program own?
Loads, Stores & Masking: moving tiles safely
Choosing BLOCK_SIZE: the first performance knob
Launching the Grid: connecting PyTorch to the kernel

Chapter Roadmap

Click any topic to jump in

Anatomy of a Kernel

The reusable five-step skeleton: id → offsets → mask → load → compute+store.

The Five-Step Skeleton

Filling in the skeleton

Program IDs & Offsets

program_id × BLOCK_SIZE + arange gives the tile of indices this program owns.

From Program ID to a Tile of Indices

Moving the tiles

Loads, Stores & Masking

tl.load/tl.store move tiles; mask = offs < n keeps the ragged tail safe.

Moving Tiles Safely with Masks

Tuning the tile

Choosing BLOCK_SIZE

The first perf knob — trades per-program overhead against occupancy and register use.

The First Performance Knob

Calling it from PyTorch

Launching the Grid

Host side: allocate output, grid = cdiv(n, B), launch with [grid]; sync to time it.

Connecting Host Code to the Kernel

Almost every elementwise Triton kernel follows the same five-step skeleton. Learn it once and you can read — and write — most kernels you'll meet.

1 of 1

The Five-Step Skeleton

A typical Triton kernel does, in order: (1) get the program id; (2) compute the offsets of the tile this program owns; (3) build a mask for any out-of-bounds offsets; (4) tl.load the input tiles into on-chip memory; (5) compute, then tl.store the result back. The kernel body is pure tile algebra — no thread indices, no barriers.

The inputs are passed as pointers (PyTorch tensors decay to their data pointer when launched), plus runtime sizes and tl.constexpr tile dimensions. The kernel writes its results in place to an output pointer; Triton kernels return nothing. This skeleton is what makes Triton readable: you can scan any kernel and immediately find the offset computation, the loads, the math, and the store.

Mathematical Intuition

The skeleton separates addressing (steps 1–3, pure index arithmetic independent of the math) from data movement (4 and the store) from computation (5). This separation is why Triton can optimize addressing and movement automatically while you only ever touch the computation line — the part that's actually unique to your kernel.

Example:

Sketch the skeleton of a kernel that computes out = relu(x) for a 1D tensor. Which steps differ from vector add?

Introduction to Triton

Memory Access, Autotuning & Benchmarking

Chapter 4: Your First Kernel: Vector Addition

Chapter Overview

Chapter Roadmap

Anatomy of a Kernel

Program IDs & Offsets

Loads, Stores & Masking

Choosing BLOCK_SIZE

Launching the Grid

Anatomy of a Triton Kernel

The Five-Step Skeleton

Program IDs & Offsets

Loads, Stores & Masking

Choosing BLOCK_SIZE

Launching the Grid