Understand why GPUs exist, how their design philosophy differs from CPUs, the difference between latency and throughput, the SIMT execution model, when parallel hardware actually helps, and the hard ceiling Amdahl's law puts on speedup.
Before you write a single line of Triton, you need to think like the hardware. GPUs are not 'fast CPUs' — they are a fundamentally different kind of processor built around one idea: do an enormous number of simple operations at the same time.
A modern CPU has a handful of powerful cores, each loaded with branch predictors, deep caches, and out-of-order execution machinery to make a single thread finish as fast as possible. A GPU spends that same silicon budget on thousands of simpler cores. It doesn't try to make any one operation fast — it tries to keep all of those cores busy so that the total amount of work per second is staggering.
This chapter builds the mental model that everything else depends on: the distinction between latency (how long one task takes) and throughput (how much work finishes per second), the SIMT execution model where threads move in lockstep groups called warps, the kinds of problems where this hardware shines, and Amdahl's law, which tells you that the serial part of your program — not the parallel part — ultimately limits your speedup.
This chapter covers:
Two design philosophies: few powerful cores (latency) vs thousands of simple cores (throughput).
GPUs hide latency with massive parallelism instead of reducing it — occupancy keeps the ALUs fed.
Threads run in lockstep warps of 32; data-dependent branches cause divergence and serialize work.
Data-parallel, high-arithmetic-intensity work wins; serial, branchy, or tiny work belongs on the CPU.
The serial fraction caps speedup at 1/(1-p), where p is the parallelizable fraction of the work; optimization means shrinking the serial part.
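To make the Amdahl ceiling concrete, here is a minimal Python sketch. It assumes p is the parallelizable fraction of the original runtime and n is the number of parallel workers applied to that fraction; the function names amdahl_speedup and amdahl_limit are illustrative and not taken from any library.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Speedup predicted by Amdahl's law.

    p: fraction of the original runtime that can be parallelized (0 <= p <= 1)
    n: number of parallel workers applied to that fraction (n >= 1)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial = 1.0 - p                 # the part no amount of hardware speeds up
    return 1.0 / (serial + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as n grows without bound: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.99):
        print(
            f"p={p:.2f}  n=1000 -> {amdahl_speedup(p, 1000):6.2f}x"
            f"  limit -> {amdahl_limit(p):6.2f}x"
        )
```

With p = 0.9, even 1000 workers deliver slightly under a 10x speedup, because the remaining 10% of serial work dominates the runtime.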
CPUs and GPUs are both processors, but they optimize for opposite goals. A CPU is a sprinter built to finish one race as fast as possible; a GPU is a freight system built to move as much cargo as possible. Neither is 'better' — they are tuned for different workloads.
A CPU dedicates most of its transistors to control and cache: branch predictors, out-of-order schedulers, and large multi-level caches that all exist to make a single instruction stream run as fast as possible. It has relatively few cores (typically 4–64), each extremely capable.
A GPU dedicates most of its transistors to arithmetic: thousands of small ALUs grouped into Streaming Multiprocessors. Control logic and caches are minimal and shared across many cores. Any individual GPU core is slower and dumber than a CPU core — but there are thousands of them.
The consequence: CPUs win when work is sequential, branchy, and latency-sensitive. GPUs win when the same operation must be applied to massive amounts of data.
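The difference between the two kinds of work can be sketched in plain Python. This is an illustration under stated assumptions, not GPU code: the function names and the two toy computations are hypothetical. The first function applies the same operation to every element independently, so the visiting order does not matter; the second carries a dependency from one step to the next, so as written its steps must run in sequence.

```python
import random


def scale_elementwise(x, order):
    """y[i] = 2 * x[i]; each output depends only on its own input."""
    y = [0.0] * len(x)
    for i in order:              # any visiting order gives the same result
        y[i] = 2.0 * x[i]
    return y


def running_recurrence(x):
    """s[i] = 0.5 * s[i-1] + x[i]; each step needs the previous result."""
    s, prev = [], 0.0
    for v in x:
        prev = 0.5 * prev + v    # loop-carried dependency
        s.append(prev)
    return s


x = [float(i) for i in range(8)]
in_order = list(range(len(x)))
shuffled = in_order[:]
random.Random(0).shuffle(shuffled)

# Independent elements: processing order is irrelevant, so the work could
# be handed to many workers at once.
assert scale_elementwise(x, in_order) == scale_elementwise(x, shuffled)

# Dependent steps: this loop cannot simply be reordered.
print(running_recurrence(x))
```

Order independence is the property that lets thousands of simple cores each take one element. The recurrence is the kind of step-by-step work that favors a single fast core.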
If a chip has area A, a CPU spends only a small fraction of A on arithmetic units, while a GPU spends most of it. Peak arithmetic throughput scales with the number of ALUs, so for pure-arithmetic workloads the GPU's effective FLOP/s can be 10–100× the CPU's even at a lower clock frequency, because throughput is proportional to ALU count × clock frequency and the ALU count dominates.
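The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes peak throughput is ALU count × clock frequency × operations per ALU per cycle. The helper peak_flops and every chip figure are hypothetical round numbers chosen for illustration, not specifications of any real part.

```python
def peak_flops(alus: int, clock_hz: float, flops_per_cycle: int = 2) -> float:
    """Rough peak arithmetic throughput in FLOP/s.

    flops_per_cycle=2 assumes one fused multiply-add per ALU per cycle.
    """
    return alus * clock_hz * flops_per_cycle


# Hypothetical chips (illustrative numbers only).
cpu = peak_flops(alus=16 * 16, clock_hz=4.0e9)   # 16 cores x 16 SIMD lanes at 4 GHz
gpu = peak_flops(alus=10_000, clock_hz=1.5e9)    # 10,000 ALUs at 1.5 GHz

print(f"CPU ~ {cpu / 1e12:.2f} TFLOP/s")
print(f"GPU ~ {gpu / 1e12:.2f} TFLOP/s")
print(f"ratio ~ {gpu / cpu:.1f}x")
```

Under these assumed figures the GPU runs at well under half the CPU's clock yet comes out roughly 15x ahead on peak arithmetic, because it has about 40 times as many ALUs. This is a peak figure only; real workloads are often limited by memory bandwidth instead.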
You need to add two arrays of 1,000,000 floats: c[i] = a[i] + b[i]. Why is a GPU dramatically faster here than a CPU?