Meet Triton: what it is and why it exists, how it compares to writing raw CUDA or staying in PyTorch, its block-oriented programming model, the vocabulary of programs / grids / tiles, and the @triton.jit decorator that turns a Python function into a GPU kernel.
You now understand the hardware. The question is how to program it without drowning in low-level detail. Triton is the answer this course is built around: a Python-embedded language and compiler, originally from OpenAI, that lets you write GPU kernels that rival hand-tuned CUDA — in a fraction of the code.
The key idea is that Triton raises the unit of programming from the thread to the block. In CUDA you write code from the viewpoint of a single thread and manually manage indexing, shared memory, and synchronization. In Triton you write code from the viewpoint of a program that operates on whole tiles of data as vectors and matrices. The compiler handles the painful parts — thread mapping, memory coalescing, shared-memory allocation, and synchronization — while leaving you in control of the decisions that actually matter for performance: how big the tiles are and how data moves between DRAM and SRAM.
This chapter is the conceptual bridge between the hardware chapters and the hands-on kernels that follow. By the end you'll be able to read a Triton kernel and understand exactly what each line does and where it runs.
This chapter covers:
A Python DSL + compiler that emits fast GPU kernels from tile-level code.
Control where it matters, automation where it's painful — between raw CUDA and stock PyTorch ops.
Reason about tiles of data as vectors/matrices; the compiler maps them onto threads.
A program owns a tile; the grid = ceil(work / tile) covers all the data.
Marks a kernel, enables [grid] launch, and specializes on tl.constexpr constants.
Triton is a domain-specific language embedded in Python, plus a compiler that lowers it to fast GPU machine code. You write what looks like NumPy-on-tiles; Triton produces a kernel competitive with expert CUDA.
Triton lets you write GPU kernels as ordinary-looking Python functions decorated with @triton.jit. Inside, you use Triton's tl (triton.language) operations — tl.load, tl.store, tl.arange, tl.dot, tl.sum, etc. — that operate on whole tiles (blocks of elements) rather than scalars.
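As a minimal sketch of what such a kernel can look like, here is an element-wise vector add. The names `add_kernel`, `add`, and the tile size of 1024 are illustrative choices, not something this chapter prescribes, and the sketch assumes PyTorch and Triton are installed, a CUDA-capable GPU is present, and the inputs are contiguous tensors of the same shape on that GPU.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance owns one tile of BLOCK_SIZE contiguous elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The last tile may extend past the end of the data, so mask those lanes off.
    mask = offsets < n_elements
    # Loads, the add, and the store all act on the whole tile at once.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical host-side wrapper; assumes x and y live on the GPU.
    out = torch.empty_like(x)
    n_elements = out.numel()
    block_size = 1024  # illustrative tile size, not a tuned value
    # One program per tile: grid = ceil(n_elements / block_size).
    grid = (triton.cdiv(n_elements, block_size),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=block_size)
    return out
```

Calling `add(a, b)` on two CUDA tensors should give the same result as `a + b`. Note that the kernel body contains no thread index, only a program id and a vector of tile offsets.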
When the kernel is first called, Triton's compiler lowers this Python into an intermediate representation, applies GPU-specific optimizations (vectorization, memory coalescing, shared-memory staging, instruction scheduling), and emits GPU code (on NVIDIA hardware, PTX, which is then assembled into machine code). The result runs on the GPU at near hand-tuned speed. Crucially, you stayed in Python and never wrote a thread index, a __syncthreads() call, or a shared-memory declaration by hand.
Think of Triton as raising the abstraction level without sacrificing the performance model. A CUDA kernel exposes thread-level decisions; a Triton kernel exposes tile-level decisions and lets the compiler expand each tile op into the right thread-level SIMT code. You keep the few high-leverage knobs (tile size, memory movement) and shed the many low-leverage ones.
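The difference between the two viewpoints can be emulated in plain Python, with no GPU involved. The sketch below is not Triton or CUDA code; `add_thread_view` and `add_tile_view` are hypothetical names used only to contrast "one element per thread" with "one tile per program", and the sequential loops stand in for work that a GPU would run in parallel.

```python
def add_thread_view(x, y):
    # Thread-level view: each "thread" computes exactly one output element.
    out = [0.0] * len(x)
    for tid in range(len(x)):
        out[tid] = x[tid] + y[tid]
    return out


def add_tile_view(x, y, block_size):
    # Tile-level view: each "program" computes one tile of block_size elements.
    n = len(x)
    out = [0.0] * n
    # Grid size = ceil(n / block_size), written with integer arithmetic.
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):
        offsets = [pid * block_size + i for i in range(block_size)]
        # Mask: keep only offsets that fall inside the data.
        valid = [o for o in offsets if o < n]
        x_tile = [x[o] for o in valid]
        y_tile = [y[o] for o in valid]
        out_tile = [a + b for a, b in zip(x_tile, y_tile)]
        for o, v in zip(valid, out_tile):
            out[o] = v
    return out, num_programs


x = [float(i) for i in range(10)]
y = [2.0 * i for i in range(10)]
tiled, programs = add_tile_view(x, y, block_size=4)
print(programs)                          # 3 programs cover 10 elements in tiles of 4
print(tiled == add_thread_view(x, y))    # True
```

The only tunable decision in the tile version is `block_size`; everything about how the elements of a tile are spread across threads is left out, which is the part Triton's compiler fills in.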
Why can a researcher prototype a custom fused kernel in Triton in an afternoon, when the equivalent CUDA kernel might take days?