An 8-chapter curriculum that takes you from "what is a warp?" to writing fused, tiled GPU kernels in Triton that rival hand-tuned CUDA — all in Python.
Foundations Study Plan
Complete the Foundations study plan first.
Weeks 1-2
Warps, SMs, the memory hierarchy
Weeks 2-5
program_id, offsets, masking, BLOCK_SIZE
Weeks 5-8
Fused softmax, tiled matmul, attention
Why GPUs exist, CPU vs GPU design philosophy, latency vs throughput, the SIMT execution model, when parallel hardware wins, and Amdahl's law.
Threads, warps, blocks and streaming multiprocessors, the register→SRAM→DRAM hierarchy, memory coalescing, bandwidth vs latency, and the golden rule of keeping data high in the hierarchy.
What Triton is and why it exists, Triton vs CUDA vs PyTorch, the block-oriented programming model, programs, grids and tiles, and the @triton.jit decorator.
Building the canonical Triton kernel step by step: program ids, computing offsets, loading and storing with pointers, masking the ragged tail, choosing BLOCK_SIZE, and launching the grid.
Writing memory-bound kernels that approach peak bandwidth: coalesced access patterns, autotuning over block sizes and num_warps, benchmarking with triton.testing, and reading a roofline.
Why fusion is the central optimization on GPUs: counting DRAM round-trips, the row-wise reduction pattern, numerically stable softmax, keeping a row in SRAM, and the fused-vs-unfused speedup.
The kernel that powers deep learning: blocked/tiled matmul, accumulating in SRAM, the inner-k loop, super-grouping program ids for L2 cache reuse, and approaching cuBLAS performance.
Putting it together: fused layer normalization, low-memory dropout with seeded RNG, the ideas behind fused attention (FlashAttention), persistent kernels, and where to go next.
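As a taste of the program id, offsets and masking pattern named in the chapter list above, here is a minimal plain-Python sketch that emulates on the CPU what each Triton program instance does for an elementwise add. It is not Triton code: the names `cdiv`, `add_program` and `launch_add` are hypothetical and exist only for this illustration. In a real kernel the same roles are played by `tl.program_id`, `tl.arange`, and masked `tl.load` / `tl.store`.

```python
def cdiv(n, block):
    """Ceiling division: how many blocks are needed to cover n elements."""
    return (n + block - 1) // block


def add_program(x, y, out, n_elements, pid, BLOCK_SIZE):
    """Emulate one program instance (hypothetical helper, not a Triton API)."""
    # Each program owns one contiguous block of BLOCK_SIZE indices.
    block_start = pid * BLOCK_SIZE
    offsets = [block_start + i for i in range(BLOCK_SIZE)]
    # The mask guards the ragged tail: the last block may run past the end.
    mask = [off < n_elements for off in offsets]
    for off, keep in zip(offsets, mask):
        if keep:
            out[off] = x[off] + y[off]


def launch_add(x, y, BLOCK_SIZE=4):
    """Emulate launching a 1D grid. On a GPU the programs run in parallel."""
    n_elements = len(x)
    out = [0.0] * n_elements
    for pid in range(cdiv(n_elements, BLOCK_SIZE)):
        add_program(x, y, out, n_elements, pid, BLOCK_SIZE)
    return out


# 6 elements with BLOCK_SIZE=4 needs 2 programs; the second one is half masked.
result = launch_add([1, 2, 3, 4, 5, 6], [10, 10, 10, 10, 10, 10], BLOCK_SIZE=4)
```

Without the mask, the second program would touch indices 6 and 7, which lie outside the 6-element input.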
This curriculum is designed to take you from GPU fundamentals to writing high-performance Triton kernels that approach cuBLAS-level speed.
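The numerically stable softmax mentioned in the fusion chapter rests on one idea: subtract the row maximum before exponentiating, so that no exponent is positive. The sketch below shows that idea in plain Python for a single row. It illustrates the arithmetic only, not a fused GPU kernel, and `stable_softmax` is a hypothetical name chosen for this example.

```python
import math


def stable_softmax(row):
    """Softmax over one row, shifted by the row max to avoid overflow."""
    row_max = max(row)
    # After the shift every exponent is <= 0, so exp() stays within (0, 1].
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# A naive math.exp(1000.0) would raise OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1000.0])
```

A fused kernel would do the max, the exponentials, and the sum while the row stays in on-chip memory, which is where the savings in DRAM round-trips come from.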