A focused 3-chapter path into GPU programming. Write genuine CUDA kernels — threads, shared memory, reductions, atomics, tiled matmul — in plain Python with Numba, and run them on a real NVIDIA GPU.
Foundations Study Plan
Complete the Foundations study plan first →
Week 1
Global index, bounds checks, grid-stride loops
Week 2
to_device, shared memory, __syncthreads()
Week 3
Tree reductions, atomics, tiled matmul
Write real CUDA kernels in Python with @cuda.jit. The thread/block/grid launch hierarchy, computing a thread's global index, bounds checks for the ragged tail, and grid-stride loops.
The GPU's separate memory and how to use it: host↔device transfers with to_device/copy_to_host, device arrays, 2D grids for matrices, and fast per-block shared memory with __syncthreads().
The patterns behind real kernels: combining values without races, shared-memory tree reductions, atomic updates and contention, and tiled matrix multiplication that reuses data from shared memory.
Sharpen your skills with coding challenges and system design problems.
A short, focused path from the GPU execution model to shared-memory reductions and tiled matrix multiplication — all in Python with Numba.