Week 1

Chapter 1: Why GPUs? The Parallel Computing Mindset

Understand why GPUs exist, how their design philosophy differs from CPUs, the difference between latency and throughput, the SIMT execution model, when parallel hardware actually helps, and the hard ceiling Amdahl's law puts on speedup.

Chapter Overview

Before you write a single line of Triton, you need to think like the hardware. GPUs are not 'fast CPUs' — they are a fundamentally different kind of processor built around one idea: do an enormous number of simple operations at the same time.

A modern CPU has a handful of powerful cores, each loaded with branch predictors, deep caches, and out-of-order execution machinery to make a single thread finish as fast as possible. A GPU spends that same silicon budget on thousands of simpler cores. It doesn't try to make any one operation fast — it tries to keep all of those cores busy so that the total amount of work per second is staggering.

This chapter builds the mental model that everything else depends on: the distinction between latency (how long one task takes) and throughput (how much work finishes per second), the SIMT execution model where threads move in lockstep groups called warps, the kinds of problems where this hardware shines, and Amdahl's law, which tells you that the serial part of your program — not the parallel part — ultimately limits your speedup.

This chapter covers:

CPUs vs GPUs: two different answers to 'how do we make computers fast?'
Latency vs Throughput: the single most important trade-off in this course
The SIMT Execution Model: warps, lockstep, and divergence
When GPUs Win (and Lose): matching the problem to the hardware
Amdahl's Law: the ceiling on every speedup you will ever measure

Chapter Roadmap

CPUs vs GPUs

Two design philosophies: few powerful cores (latency) vs thousands of simple cores (throughput).

Two Design Philosophies

How throughput hardware pays off

Latency vs Throughput

GPUs hide latency with massive parallelism instead of reducing it — occupancy keeps the ALUs fed.

Hiding Latency with Parallelism

How threads actually execute

SIMT Execution Model

On NVIDIA GPUs, threads run in lockstep warps of 32; data-dependent branches within a warp cause divergence and serialize the diverging paths.

Warps and Lockstep

Which problems fit the model

When GPUs Win (and Lose)

Data-parallel, high-arithmetic-intensity work wins; serial, branchy, or tiny work belongs on the CPU.

Data Parallelism and Arithmetic Intensity

The ceiling on any speedup

Amdahl's Law

The serial fraction caps speedup at 1/(1-p), where p is the fraction of the work that can run in parallel; optimization means shrinking the serial part.

The Speedup Ceiling

CPUs and GPUs are both processors, but they optimize for opposite goals. A CPU is a sprinter built to finish one race as fast as possible; a GPU is a freight system built to move as much cargo as possible. Neither is 'better' — they are tuned for different workloads.

Two Design Philosophies

A CPU dedicates most of its transistors to control and cache: branch predictors, out-of-order schedulers, and large multi-level caches that all exist to make a single instruction stream run as fast as possible. It has few cores (typically 4–64), each extremely capable.

A GPU dedicates most of its transistors to arithmetic: thousands of small ALUs grouped into Streaming Multiprocessors. Control logic and caches are minimal and shared across many cores. Any individual GPU core is slower and dumber than a CPU core — but there are thousands of them.

The consequence: CPUs win when work is sequential, branchy, and latency-sensitive. GPUs win when the same operation must be applied to massive amounts of data.

Mathematical Intuition

If a chip has total area $A$, a CPU allocates it so that $A_{\text{ctrl}} + A_{\text{cache}} \gg A_{\text{alu}}$, while a GPU allocates it so that $A_{\text{alu}} \gg A_{\text{ctrl}} + A_{\text{cache}}$. Peak arithmetic throughput scales with $A_{\text{alu}}$: $\text{throughput} \approx N_{\text{cores}} \times f_{\text{clock}} \times \text{FLOP/cycle per core}$. Because $N_{\text{cores}}$ dominates this product, a GPU's peak FLOP/s on pure-arithmetic workloads can be 10–100× a CPU's, even at a lower clock frequency.
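To make the throughput formula concrete, here is a minimal Python sketch that plugs numbers into it. The two chips are hypothetical: the core counts, clock rates, and FLOP/cycle figures are round numbers chosen for illustration, not the specifications of any real CPU or GPU, and `peak_flops` is a helper name introduced only for this example.

```python
def peak_flops(n_cores: int, clock_hz: float, flop_per_cycle: float) -> float:
    """Peak throughput ~= N_cores * f_clock * FLOP/cycle per core."""
    return n_cores * clock_hz * flop_per_cycle


# Hypothetical chips (illustrative round numbers, not real hardware specs).
# CPU: few cores, high clock, wide vector units per core.
cpu = peak_flops(n_cores=16, clock_hz=4.0e9, flop_per_cycle=32)
# GPU: thousands of simple cores, lower clock, little work per core per cycle.
gpu = peak_flops(n_cores=8192, clock_hz=1.5e9, flop_per_cycle=2)

ratio = gpu / cpu
print(f"CPU peak: {cpu / 1e12:.2f} TFLOP/s")
print(f"GPU peak: {gpu / 1e12:.2f} TFLOP/s")
print(f"GPU / CPU: {ratio:.1f}x")
```

In this made-up comparison the GPU core runs at a lower clock and does far less per cycle, yet the chip still comes out about 12× ahead on peak arithmetic, purely because of the core count. Peak numbers like these are an upper bound; real workloads only approach them when every core stays busy.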

Example:

You need to add two arrays of 1,000,000 floats: c[i] = a[i] + b[i]. Why is a GPU dramatically faster here than a CPU?
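One way to see the answer is to notice that each `c[i]` depends only on `a[i]` and `b[i]`, so the index range can be cut into blocks that are computed in any order, or all at once. The sketch below is plain Python that simulates this decomposition on the CPU; it is not GPU or Triton code, and `add_range` and `block_size` are hypothetical names and values picked for illustration.

```python
def add_range(a, b, c, start, stop):
    """Compute c[i] = a[i] + b[i] for start <= i < stop; touches no other index."""
    for i in range(start, stop):
        c[i] = a[i] + b[i]


n = 1_000_000
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]

# Serial reference: one worker walks the whole array front to back.
c_serial = [0.0] * n
add_range(a, b, c_serial, 0, n)

# Blocked version: split the index space into fixed-size blocks.
# block_size is an illustrative value, not a hardware constant.
block_size = 1024
starts = list(range(0, n, block_size))

c_blocked = [0.0] * n
# Process the blocks in reverse order on purpose: because no block reads
# another block's output, the order cannot change the result.
for start in reversed(starts):
    add_range(a, b, c_blocked, start, min(start + block_size, n))

assert c_blocked == c_serial
print(len(starts), "independent blocks")  # prints: 977 independent blocks
```

Since the blocks never interact, a GPU can hand each one to a different group of cores and run them simultaneously, which is exactly the "same operation applied to massive amounts of data" case described above. A CPU with a handful of cores has to work through the same million additions with far less concurrency.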

Papers

NVIDIA CUDA C++ Programming Guide — Introduction

Blogs

Introduction to GPU Programming with Triton
