Chapter 3: Introduction to Triton

Meet Triton: what it is and why it exists, how it compares to writing raw CUDA or staying in PyTorch, its block-oriented programming model, the vocabulary of programs / grids / tiles, and the @triton.jit decorator that turns a Python function into a GPU kernel.

Chapter Overview

You now understand the hardware. The question is how to program it without drowning in low-level detail. Triton is the answer this course is built around: a Python-embedded language and compiler, originally from OpenAI, that lets you write GPU kernels that can rival hand-tuned CUDA in a fraction of the code.

The key idea is that Triton raises the unit of programming from the thread to the block. In CUDA you write code from the viewpoint of a single thread and manually manage indexing, shared memory, and synchronization. In Triton you write code from the viewpoint of a program that operates on whole tiles of data as vectors and matrices. The compiler handles the painful parts — thread mapping, memory coalescing, shared-memory allocation, and synchronization — while leaving you in control of the decisions that actually matter for performance: how big the tiles are and how data moves between DRAM and SRAM.

This chapter is the conceptual bridge between the hardware chapters and the hands-on kernels that follow. By the end you'll be able to read a Triton kernel and understand exactly what each line does and where it runs.

This chapter covers:

What Is Triton?: a Python DSL + compiler for GPU kernels
Triton vs CUDA vs PyTorch: the right tool for each job
Block-Oriented Programming: thinking in tiles, not threads
Programs, Grids & Tiles: the core vocabulary
The @triton.jit Decorator: how a Python function becomes a kernel

Chapter Roadmap


What Is Triton?

A Python DSL + compiler that emits fast GPU kernels from tile-level code.

A Python DSL with a Real Compiler

Where Triton fits

Triton vs CUDA vs PyTorch

Control where it matters, automation where it's painful — between raw CUDA and stock PyTorch ops.

The Flexibility–Performance Spectrum

The model it gives you

Block-Oriented Programming

Reason about tiles of data as vectors/matrices; the compiler maps them onto threads.

Tiles, Not Threads

The vocabulary you launch with

Programs, Grids & Tiles

A program owns a tile; the grid = ceil(work / tile) covers all the data.

The Core Vocabulary

How it becomes a kernel

The @triton.jit Decorator

Marks a function as a kernel, enables the kernel[grid](...) launch syntax, and specializes the compiled code on tl.constexpr constants.

From Python Function to GPU Kernel

Triton is a domain-specific language embedded in Python, plus a compiler that lowers it to fast GPU machine code. You write what looks like NumPy-on-tiles; Triton produces a kernel competitive with expert CUDA.


A Python DSL with a Real Compiler

Triton lets you write GPU kernels as ordinary-looking Python functions decorated with @triton.jit. Inside, you use Triton's tl (triton.language) operations — tl.load, tl.store, tl.arange, tl.dot, tl.sum, etc. — that operate on whole tiles (blocks of elements) rather than scalars.

When the kernel is first called, Triton's compiler lowers the Python function through a series of intermediate representations, applies GPU-specific optimizations (vectorization, memory coalescing, shared-memory staging, instruction scheduling), and emits target code: on NVIDIA GPUs it generates PTX, which is then assembled into machine code for the specific device. The compiled kernel is cached, so later calls with the same specialization skip compilation. The result often runs close to hand-tuned speed. Crucially, you stayed in Python and never wrote a thread index, a __syncthreads() call, or a shared-memory declaration by hand.
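To make the tile-level view concrete, here is a minimal sketch that emulates the model in plain Python on the CPU. It is not Triton code and does not use the Triton API; the function names add_program and launch are hypothetical, chosen only for illustration. The comments point at the Triton operations each step stands in for (tl.program_id, tl.arange, and masked tl.load / tl.store).

```python
# Plain-Python emulation of Triton's block-oriented model.
# Illustrative only: this is NOT Triton and runs sequentially on the CPU.
# add_program and launch are hypothetical names, not part of any library.

def add_program(pid, x, y, out, n, BLOCK_SIZE):
    """One 'program': processes the tile of BLOCK_SIZE elements chosen by pid."""
    # Stands in for: offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    # Stands in for: mask = offsets < n (guards the ragged last tile)
    mask = [off < n for off in offsets]
    # Stands in for a masked tl.load: out-of-range lanes get a dummy 0.0
    x_tile = [x[off] if m else 0.0 for off, m in zip(offsets, mask)]
    y_tile = [y[off] if m else 0.0 for off, m in zip(offsets, mask)]
    # Tile-level arithmetic: one expression over the whole tile
    z_tile = [a + b for a, b in zip(x_tile, y_tile)]
    # Stands in for a masked tl.store: only in-range lanes are written
    for off, m, z in zip(offsets, mask, z_tile):
        if m:
            out[off] = z


def launch(x, y, BLOCK_SIZE=4):
    """Run one program per tile; on a GPU these would execute in parallel."""
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil(n / BLOCK_SIZE)
    for pid in range(grid):
        add_program(pid, x, y, out, n, BLOCK_SIZE)
    return out, grid


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
result, grid = launch(x, y, BLOCK_SIZE=4)
print(grid)    # number of programs needed to cover 10 elements
print(result)
```

Note that the body of add_program never mentions an individual thread: it only describes what happens to a tile. In real Triton, expanding that tile-level description into per-thread code is the compiler's job.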

Mathematical Intuition

Think of Triton as raising the abstraction level without sacrificing the performance model. A CUDA kernel exposes $O(\text{threads})$ explicit indexing decisions; a Triton kernel exposes $O(\text{tiles})$ decisions and lets the compiler expand each tile op into the right thread-level SIMT code. You keep the few high-leverage knobs (tile size, memory movement) and shed the many low-leverage ones.
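A quick worked count shows the gap between the two granularities. The sketch below assumes a one-dimensional elementwise workload, one element per CUDA thread, and example sizes (one million elements, a tile of 1024, 256 threads per CUDA block) that are chosen here purely for illustration. The helper name launch_sizes is hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, e.g. ceil_div(10, 4) == 3."""
    return (a + b - 1) // b


def launch_sizes(n_elements, tile_size, threads_per_block):
    """Compare how many units of work the programmer reasons about.

    Assumption (illustrative): the CUDA kernel maps one element to one
    thread, and the tile-level kernel maps one tile to one program.
    """
    programs = ceil_div(n_elements, tile_size)
    cuda_threads = ceil_div(n_elements, threads_per_block) * threads_per_block
    return programs, cuda_threads


programs, cuda_threads = launch_sizes(1_000_000, 1024, 256)
print(programs)       # tile-level programs to reason about
print(cuda_threads)   # individually indexed CUDA threads
```

Under these assumptions the tile-level view has 977 programs to reason about, while the thread-level view launches 1,000,192 threads, each needing its own index computation and bounds check.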

Example:

Why can a researcher prototype a custom fused kernel in Triton in an afternoon, when the equivalent CUDA kernel might take days?

Papers

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (Tillet et al., 2019)

Blogs

Triton Official Tutorials
