Chapter 2: GPU Architecture & the Memory Hierarchy

Map the physical GPU: how threads group into warps and blocks, how Streaming Multiprocessors schedule them, the register→shared→L2→global memory hierarchy, why coalesced memory access is critical, and how arithmetic intensity and the roofline decide whether you are compute- or memory-bound.

Chapter Overview

Chapter 1 gave you the mindset; this chapter gives you the map. To write fast kernels you must know where your data lives and how the hardware moves it.

A GPU is organized as an array of Streaming Multiprocessors (SMs), each containing many ALUs, a register file, an on-chip scratchpad (shared memory / L1), and warp schedulers. Your kernel launches a grid of blocks; the GPU assigns each block to an SM, and the SM runs that block's threads as warps. Threads in a block can cooperate through fast shared memory; threads in different blocks essentially cannot.

The single most important structure here is the memory hierarchy. Registers are the fastest storage but are tiny and private to each thread. Shared memory (SRAM) is fast and block-local, and in Triton it's the resource you orchestrate every time you 'tile' a computation. L2 is a device-wide cache. Global memory (DRAM) is huge but slow, and its bandwidth is the bottleneck for most kernels. Almost every optimization in this course reduces to one rule: read from DRAM as few times as possible, and do as much work as you can while the data is on-chip.
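To make the "read from DRAM as few times as possible" rule concrete, here is a rough back-of-the-envelope sketch in Python. It counts the DRAM bytes moved by a hypothetical elementwise computation, y = relu(x + b), done either as two separate passes or as one fused pass. The function names, the float32 element size, and the assumption that b has the same length as x are all illustrative, and caches are ignored.

```python
BYTES_PER_ELEMENT = 4  # assumption: float32 data


def unfused_bytes(n):
    """Two kernels: t = x + b, then y = relu(t). The intermediate t goes through DRAM."""
    add_pass = 3 * n * BYTES_PER_ELEMENT   # read x, read b, write t
    relu_pass = 2 * n * BYTES_PER_ELEMENT  # read t, write y
    return add_pass + relu_pass


def fused_bytes(n):
    """One kernel: t stays on-chip (in registers) and never reaches DRAM."""
    return 3 * n * BYTES_PER_ELEMENT       # read x, read b, write y


n = 1 << 20  # about one million elements
print("unfused bytes:", unfused_bytes(n))
print("fused bytes:  ", fused_bytes(n))
print("traffic ratio:", unfused_bytes(n) / fused_bytes(n))
```

Under these assumptions the fused version moves three fifths of the bytes while doing exactly the same arithmetic, which is the pattern the rest of the chapter keeps returning to.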

This chapter covers:

Threads, Warps & Blocks: the execution hierarchy
Streaming Multiprocessors: where blocks actually run
The Memory Hierarchy: registers → SRAM → L2 → DRAM
Memory Coalescing: turning many small reads into one fast transaction
Arithmetic Intensity & the Roofline: are you compute- or memory-bound?

Chapter Roadmap

Threads, Warps & Blocks

The execution hierarchy: warps of 32 threads run in lockstep; threads in a block cooperate via shared memory; blocks are independent of one another.

The Execution Hierarchy

Where the hierarchy lives

Streaming Multiprocessors

Where blocks run — registers, schedulers, ALUs, and SRAM. Resource limits set occupancy.

Inside an SM

The levels you optimize across

The Memory Hierarchy

Registers → SRAM → L2 → DRAM. Keep hot data high and reuse it before it falls back down.

Registers, SRAM, L2, and DRAM

How you read DRAM efficiently

Memory Coalescing

Contiguous warp reads become one transaction; strided reads waste most of the bandwidth.

Coalesced vs Strided Access

Diagnosing the bottleneck

Arithmetic Intensity & Roofline

FLOPs/byte decides whether you're memory- or compute-bound — and therefore what to optimize.

FLOPs per Byte

The GPU execution model is a three-level hierarchy: threads are grouped into warps, warps into blocks, and blocks into a grid. Each level has different cooperation and scheduling rules, and getting them right is the difference between a fast kernel and a broken one.

The Execution Hierarchy

Threads are the smallest unit; each has its own registers and, logically, its own program counter. Warps are fixed groups of 32 threads (on NVIDIA GPUs) that execute in lockstep (the SIMT unit from Chapter 1). Blocks (a.k.a. thread blocks or CTAs) are groups of warps that run together on a single SM and can cooperate via shared memory and barriers (__syncthreads() in CUDA). The grid is all the blocks for one kernel launch.
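As a small illustration of how these levels nest, the sketch below splits a flat thread index into its block, its warp within that block, and its lane within that warp. It assumes a one-dimensional launch and a warp size of 32; the helper name locate is hypothetical.

```python
WARP_SIZE = 32  # assumption: warp width of 32 threads


def locate(global_thread_id, threads_per_block):
    """Split a flat 1D thread index into (block, warp-in-block, lane-in-warp)."""
    block_id, thread_in_block = divmod(global_thread_id, threads_per_block)
    warp_in_block, lane = divmod(thread_in_block, WARP_SIZE)
    return block_id, warp_in_block, lane


print(locate(0, 256))    # (0, 0, 0)
print(locate(300, 256))  # (1, 1, 12)
```

Thread 300 in a launch with 256-thread blocks is thread 44 of block 1, which places it in that block's second warp (index 1) at lane 12.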

Key rules: threads in the same block can share data and synchronize; threads in different blocks cannot, because blocks may run in any order, on different SMs, and possibly not at the same time. This independence is what lets the GPU scale: the hardware can distribute the same grid of blocks across however many SMs it has.
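One way to see why block independence matters is to simulate it on the CPU. In the sketch below, each "block" sums its own slice of the input and writes only to its own output slot, so the result is the same no matter what order the blocks run in. This is plain Python standing in for a GPU launch; run_block, launch, and the two ordering helpers are hypothetical names.

```python
import random


def run_block(block_id, data, block_size, partial_sums):
    """One 'block': sums its own slice and writes only its own output slot."""
    start = block_id * block_size
    partial_sums[block_id] = sum(data[start:start + block_size])


def launch(data, block_size, order):
    """Run every block once, in whatever order the scheduler function picks."""
    num_blocks = (len(data) + block_size - 1) // block_size  # ceiling division
    partial_sums = [0] * num_blocks
    for block_id in order(num_blocks):
        run_block(block_id, data, block_size, partial_sums)
    return partial_sums


def in_order(num_blocks):
    return list(range(num_blocks))


def shuffled(num_blocks):
    ids = list(range(num_blocks))
    random.shuffle(ids)  # stand-in for an arbitrary hardware schedule
    return ids


data = list(range(1000))
same = launch(data, 128, in_order) == launch(data, 128, shuffled)
print("same result in any block order:", same)
```

If a block instead read another block's partial sum during the launch, the answer would depend on the schedule, which is exactly the kind of cross-block dependence the rules above forbid.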

In Triton you mostly think one level up: a program corresponds to a block-sized tile of work, and Triton manages the threads within it for you.
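The "one program per tile" idea can be sketched without a GPU. The plain-Python emulation below mirrors the usual shape of a Triton vector-add kernel (a program id, a range of offsets for the tile, and a mask for the ragged last tile), but it is not Triton code: the functions program and vector_add are hypothetical, and the loop over program ids runs sequentially here, whereas on a GPU the programs would run in parallel.

```python
def program(pid, x, y, out, n, block_size):
    """One 'program': handles one block-sized tile of a vector add."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n for off in offsets]  # guard the ragged last tile
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:
            out[off] = x[off] + y[off]


def vector_add(x, y, block_size=4):
    n = len(x)
    out = [0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # sequential here; parallel on a GPU
        program(pid, x, y, out, n, block_size)
    return out


print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

Notice that nothing in program refers to individual threads or warps; it only describes what happens to a tile, which is the level of abstraction the paragraph above describes.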

Mathematical Intuition

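An SM has hard limits: max resident threads, warps, blocks, registers, and shared memory. Occupancy = (resident warps) / (max warps). If each block has $T$ threads, uses $R$ registers per thread, and uses $S$ bytes of shared memory, the SM can host at most $\min\!\left(\left\lfloor\frac{\text{regs}}{R \cdot T}\right\rfloor, \left\lfloor\frac{\text{smem}}{S}\right\rfloor, \left\lfloor\frac{\text{max threads}}{T}\right\rfloor, \text{block limit}\right)$ blocks, where regs and smem are the SM's register-file and shared-memory capacities. Choosing block size and resource usage is really choosing occupancy.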
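The block-count bound can be turned into a small calculator. The sketch below is a simplified model: it applies the register, shared-memory, resident-thread, and block-count limits listed above, and it ignores the allocation granularity that real hardware imposes. The limits in EXAMPLE_SM are hypothetical round numbers chosen for illustration, not the specification of any particular GPU.

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, limits):
    """Largest number of blocks that fits under every per-SM resource limit."""
    candidates = [
        limits["regs"] // (regs_per_thread * threads_per_block),
        limits["max_threads"] // threads_per_block,
        limits["max_blocks"],
    ]
    if smem_per_block > 0:  # a block using no shared memory is not limited by it
        candidates.append(limits["smem"] // smem_per_block)
    return min(candidates)


def occupancy(threads_per_block, regs_per_thread, smem_per_block, limits, warp_size=32):
    """Resident warps divided by the maximum number of resident warps."""
    blocks = blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block, limits)
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    max_warps = limits["max_threads"] // warp_size
    return blocks * warps_per_block / max_warps


# Hypothetical per-SM limits, for illustration only.
EXAMPLE_SM = {"regs": 65536, "smem": 65536, "max_threads": 2048, "max_blocks": 32}

print(occupancy(256, 64, 16384, EXAMPLE_SM))  # 0.5: registers and shared memory both cap at 4 blocks
print(occupancy(256, 32, 8192, EXAMPLE_SM))   # 1.0: halving both lets 8 blocks fit
```

In this toy model, halving both the register count per thread and the shared memory per block doubles the number of resident blocks, which is the trade-off the formula expresses.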

Example:

You launch a kernel with a grid of 1024 blocks, each with 256 threads. How many threads and warps is that, and what can the threads of block #0 do with one another that they cannot do with the threads of block #5?
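The counting part of this example is simple arithmetic, worked through below as a short Python sketch that assumes a warp size of 32.

```python
WARP_SIZE = 32  # assumption: warp width of 32 threads

num_blocks = 1024
threads_per_block = 256

total_threads = num_blocks * threads_per_block
warps_per_block = threads_per_block // WARP_SIZE
total_warps = num_blocks * warps_per_block

print(total_threads)    # 262144
print(warps_per_block)  # 8
print(total_warps)      # 8192
```

For the second half of the question, the rules from this section apply: the threads of block #0 can share data through shared memory and synchronize at a barrier with one another, but they cannot do either with the threads of block #5, which may run on a different SM or at a different time.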
