Map the physical GPU: how threads group into warps and blocks, how Streaming Multiprocessors schedule them, the register→shared→L2→global memory hierarchy, why coalesced memory access is critical, and how arithmetic intensity and the roofline decide whether you are compute- or memory-bound.
Chapter 1 gave you the mindset; this chapter gives you the map. To write fast kernels you must know where your data lives and how the hardware moves it.
A GPU is organized as an array of Streaming Multiprocessors (SMs), each containing many ALUs, a register file, an on-chip scratchpad (shared memory / L1), and warp schedulers. Your kernel launches a grid of blocks; the GPU assigns blocks to SMs, and each SM runs the block's threads as warps. Threads in a block can cooperate through fast shared memory; threads in different blocks essentially cannot.
The single most important structure here is the memory hierarchy. Registers are nearly free but tiny and private. Shared memory (SRAM) is fast and block-local, and in Triton it's the resource you orchestrate every time you 'tile' a computation. L2 is a device-wide cache. Global memory (DRAM) is huge but slow, and its bandwidth is the bottleneck for most kernels. Almost every optimization in this course reduces to: read from DRAM as few times as possible, and do as much work as you can while the data is on-chip.
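To put numbers on "read from DRAM as few times as possible", here is a minimal roofline sketch in plain Python. The peak throughput and bandwidth figures are hypothetical round numbers chosen for illustration, not the specs of any real GPU, and the function names are made up for this example.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from DRAM."""
    return flops / bytes_moved


def attainable_flops(ai: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, ai * peak_bandwidth)


def bound(ai: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Classify a kernel by comparing its intensity to the ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOPs/byte where the two roofs meet
    return "memory-bound" if ai < ridge else "compute-bound"


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 1e12

# fp32 vector add: 1 FLOP per element, 12 bytes moved (two loads, one store).
vector_add_ai = arithmetic_intensity(1, 12)

# fp32 N x N matmul, assuming each matrix crosses DRAM exactly once
# (an idealized best case): 2*N^3 FLOPs over 3 * N^2 * 4 bytes.
N = 1024
matmul_ai = arithmetic_intensity(2 * N**3, 3 * N**2 * 4)
```

With these assumed peaks the ridge point is 10 FLOPs/byte, so the vector add (about 0.08 FLOPs/byte) sits far on the memory-bound side, while the ideally tiled matmul (about 171 FLOPs/byte) is compute-bound. The matmul only reaches that intensity if its data is reused on-chip instead of being re-read from DRAM.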
This chapter covers:
The execution hierarchy: warps of 32 run in lockstep; blocks cooperate via shared memory; blocks are independent.
Where blocks run — registers, schedulers, ALUs, and SRAM. Resource limits set occupancy.
Registers → SRAM → L2 → DRAM. Keep hot data high and reuse it before it falls back down.
Contiguous warp reads become one transaction; strided reads waste most of the bandwidth.
FLOPs/byte decides whether you're memory- or compute-bound — and therefore what to optimize.
The GPU execution model is a three-level hierarchy: threads are grouped into warps, warps into blocks, and blocks into a grid. Each level has different cooperation and scheduling rules, and getting them right is the difference between a fast kernel and a broken one.
Threads are the smallest unit; each has its own registers and program counter. Warps are fixed groups of 32 threads that execute in lockstep (the SIMT unit from Chapter 1). Blocks (a.k.a. thread blocks or CTAs) are groups of warps that run together on a single SM and can cooperate via shared memory and barriers (__syncthreads() in CUDA). The grid is all the blocks for one kernel launch.
Key rules: threads in the same block can share data and synchronize; threads in different blocks cannot, because blocks may run in any order, on different SMs, and possibly not at the same time. This independence is what lets the GPU scale: the hardware can distribute the same grid of blocks across however many SMs it has.
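The hierarchy above is just integer arithmetic on indices. The sketch below is plain Python, not GPU code; it assumes a one-dimensional launch and uses a hypothetical helper name, `locate_thread`, to show how a block index and a thread index map to a global index, a warp within the block, and a lane within that warp.

```python
WARP_SIZE = 32  # warp width stated in the text


def locate_thread(block_idx: int, thread_idx: int, block_dim: int) -> dict:
    """Locate one thread of a 1D launch within the grid/block/warp hierarchy."""
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie in [0, block_dim)")
    return {
        "global_id": block_idx * block_dim + thread_idx,  # position in the grid
        "warp_in_block": thread_idx // WARP_SIZE,          # which warp of the block
        "lane": thread_idx % WARP_SIZE,                    # position inside the warp
    }


def warps_per_block(block_dim: int) -> int:
    """Warps needed for one block; a partial warp still occupies a full warp."""
    return -(-block_dim // WARP_SIZE)  # ceiling division
```

For example, `locate_thread(block_idx=2, thread_idx=70, block_dim=256)` gives global index 582, warp 2 of the block, lane 6. A block of 100 threads still needs 4 warps, with the last warp only partly filled.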
In Triton you mostly think one level up: a program corresponds to a block-sized tile of work, and Triton manages the threads within it for you.
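As a rough picture of "one program per tile": Triton kernels commonly compute their tile as `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` with `pid = tl.program_id(0)`, and mask off elements past the end of the array. The plain-Python stand-in below mimics that idiom without a GPU; `program_tile` and `num_programs` are hypothetical names for this sketch, not Triton functions.

```python
def num_programs(n_elements: int, block_size: int) -> int:
    """How many programs (tiles) are needed to cover n_elements."""
    return -(-n_elements // block_size)  # ceiling division


def program_tile(pid: int, block_size: int, n_elements: int):
    """Offsets handled by program `pid`, plus a mask for in-bounds elements."""
    start = pid * block_size
    offsets = [start + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]  # False for out-of-range slots
    return offsets, mask
```

With 10 elements and a tile size of 4, three programs are launched, and the last one (`pid = 2`) gets offsets `[8, 9, 10, 11]` with mask `[True, True, False, False]`, so only two of its four slots touch memory.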
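An SM has hard limits: max resident threads, warps, blocks, registers, and shared memory. Occupancy = (resident warps) / (max warps). If each block has $T$ threads and uses $R$ registers/thread and $S$ bytes of shared memory, the SM can host at most $\min\left(B_{\max},\ \lfloor W_{\max} / \lceil T/32 \rceil \rfloor,\ \lfloor R_{\text{SM}} / (R \cdot T) \rfloor,\ \lfloor S_{\text{SM}} / S \rfloor\right)$ blocks, where $B_{\max}$, $W_{\max}$, $R_{\text{SM}}$, and $S_{\text{SM}}$ are the SM's block, warp, register, and shared-memory limits (allocation granularity ignored). Choosing block size and resource usage is really choosing occupancy.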
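A small calculator makes the trade-off easier to see. The sketch below assumes the simplest model: the number of resident blocks is the tightest of the block, warp, register, and shared-memory limits, with allocation granularity ignored. The default limits in `SMLimits` are illustrative placeholders, not the specification of any particular GPU; check your device's documentation for real values.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits (illustrative placeholder values)."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 96 * 1024


def blocks_per_sm(threads_per_block: int, regs_per_thread: int,
                  smem_per_block: int, sm: SMLimits = SMLimits()) -> int:
    """Resident blocks per SM: the minimum over every resource limit."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    limits = [
        sm.max_blocks,
        sm.max_warps // warps_per_block,
        sm.registers // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:  # a block using no shared memory is not limited by it
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int, sm: SMLimits = SMLimits()) -> float:
    """Occupancy = resident warps / max warps."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, sm)
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return blocks * warps_per_block / sm.max_warps
```

Under these placeholder limits, a block of 256 threads using 64 registers/thread and 32 KiB of shared memory is capped at 3 resident blocks by shared memory (registers would allow 4, warps 8), giving an occupancy of 24/64 = 0.375. Halving the registers to 32 and dropping shared memory entirely lifts the cap to 8 blocks and full occupancy.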
You launch a kernel with a grid of 1024 blocks, each with 256 threads. How many threads and warps is that in total, and what can the threads of block #0 do with one another that they cannot do with the threads of block #5?