Write Real CUDA Kernels in Python

CUDA with Numba Study Plan

A focused 3-chapter path into GPU programming. Write genuine CUDA kernels — threads, shared memory, reductions, atomics, tiled matmul — in plain Python with Numba, and run them on a real NVIDIA GPU.

3 Chapters · 3 Weeks · Worked Examples & Concept Maps

Recommended Study Path

Phase 1

Prerequisites

Foundations Study Plan

Python Foundations
NumPy & Arrays
Basic Linear Algebra

Complete the Foundations study plan first.

Phase 2

The CUDA Model

Week 1

Ch 1: Threads, Blocks & Grids

Global index, bounds checks, grid-stride loops

Phase 3

GPU Memory

Week 2

Ch 2: Memory & 2D Grids

to_device, shared memory, cuda.syncthreads()

Phase 4

Parallel Patterns

Week 3

Ch 3: Reductions, Atomics & Tiling

Tree reductions, atomics, tiled matmul

Tip: Each chapter includes worked examples, math intuition, and a concept map. Practice the kernels in the CUDA Basics with Numba collection.

Pro chapters (2-3) require a premium subscription.

All Chapters

Thinking in Threads: The CUDA Model with Numba

Write real CUDA kernels in Python with @cuda.jit. Covers the thread/block/grid launch hierarchy, computing a thread's global index, bounds checks for the ragged tail, and grid-stride loops.

Why Numba for CUDA · Threads, Blocks & Grids · The Global Thread Index · +1 more
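To make the global index, the bounds check, and the grid-stride loop concrete, here is a minimal CPU-only sketch that replays a 1D launch one thread at a time. It needs neither a GPU nor Numba. The helper launch_1d and both kernel functions are hypothetical names made up for this illustration; the comments name the Numba expressions they stand in for.

```python
import math

def launch_1d(kernel, blocks_per_grid, threads_per_block, *args):
    """Hypothetical helper: call `kernel` once per (block, thread) pair, serially."""
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, blocks_per_grid, *args)

def add_one_bounds_checked(block_idx, thread_idx, block_dim, grid_dim, data):
    # Global index. In a Numba kernel this is cuda.grid(1), i.e.
    # cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    i = block_idx * block_dim + thread_idx
    # Bounds check: the last block may overhang the array (the ragged tail).
    if i < len(data):
        data[i] += 1

def add_one_grid_stride(block_idx, thread_idx, block_dim, grid_dim, data):
    start = block_idx * block_dim + thread_idx
    # Total number of threads in the grid. In Numba: cuda.gridsize(1)
    stride = block_dim * grid_dim
    # Each thread handles start, start + stride, start + 2 * stride, ...
    for i in range(start, len(data), stride):
        data[i] += 1

n = 10
threads_per_block = 4
# Round up so every element gets a thread: 3 blocks = 12 threads, 2 of them idle.
blocks_per_grid = math.ceil(n / threads_per_block)

a = list(range(n))
launch_1d(add_one_bounds_checked, blocks_per_grid, threads_per_block, a)

# With a grid-stride loop, a single block of 4 threads still covers all 10 elements.
b = list(range(n))
launch_1d(add_one_grid_stride, 1, threads_per_block, b)
```

The bounds-checked version needs a grid at least as large as the data, while the grid-stride version produces the same result for any grid size.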

PRO

Moving Data: GPU Memory with Numba

The GPU's separate memory and how to use it: host↔device transfers with to_device/copy_to_host, device arrays, 2D grids for matrices, and fast per-block shared memory synchronized with cuda.syncthreads(), Numba's counterpart to CUDA C's __syncthreads().

Host and Device Memory · to_device & copy_to_host · 2D Grids for Matrices · +1 more
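The role of the block-wide barrier is easier to see in a small simulation. The sketch below is a CPU-only emulation, not Numba code: each thread is a Python generator, a bare yield stands in for cuda.syncthreads(), and a plain list stands in for an array created with cuda.shared.array. The names launch_blocks and reverse_in_block are hypothetical, and the input length is assumed to be a multiple of the block size.

```python
def launch_blocks(kernel, num_blocks, block_dim, *args):
    """Hypothetical helper: run each block's threads in lockstep between barriers."""
    for block_idx in range(num_blocks):
        # One scratch buffer per block, standing in for shared memory.
        shared = [None] * block_dim
        threads = [kernel(block_idx, tid, block_dim, shared, *args)
                   for tid in range(block_dim)]
        while threads:
            waiting = []
            for thread in threads:
                try:
                    next(thread)            # run this thread up to its next barrier
                    waiting.append(thread)
                except StopIteration:       # thread finished
                    pass
            threads = waiting

def reverse_in_block(block_idx, tid, block_dim, shared, src, dst):
    i = block_idx * block_dim + tid
    shared[tid] = src[i]                    # stage one element into shared memory
    yield                                   # barrier: cuda.syncthreads() in Numba
    # Safe only after the barrier: this slot was written by a different thread.
    dst[i] = shared[block_dim - 1 - tid]

host_src = list(range(8))
device_src = list(host_src)                 # stands in for cuda.to_device(host_src)
device_dst = [0] * len(device_src)          # stands in for a device array
launch_blocks(reverse_in_block, 2, 4, device_src, device_dst)
host_out = list(device_dst)                 # stands in for device_dst.copy_to_host()
```

Without the barrier, a thread could read a shared slot before its neighbor has written it, which is the race that the synchronization call exists to prevent.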

PRO

Parallel Patterns: Reductions, Atomics & Tiling

The patterns behind real kernels: combining values without races, shared-memory tree reductions, atomic updates and contention, and tiled matrix multiplication that reuses data from shared memory.

The Reduction Problem · Tree Reduction in Shared Memory · Atomics · +1 more
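As a rough illustration of how a shared-memory tree reduction and an atomic update fit together, here is a CPU-only sketch that replays the pattern serially. It is an emulation under stated assumptions rather than Numba code: block_tree_sum and reduce_sum are hypothetical names, the block size is assumed to be a power of two, and the comments mark where cuda.syncthreads() and cuda.atomic.add would appear in a real kernel.

```python
def block_tree_sum(values, block_dim):
    """Sum one block's values by halving the active stride each step.

    Assumes block_dim is a power of two and len(values) <= block_dim.
    """
    # Pad the ragged tail with 0, the identity for addition.
    shared = list(values) + [0] * (block_dim - len(values))
    stride = block_dim // 2
    while stride > 0:
        # Only threads with tid < stride are active in this step.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        # A real kernel needs cuda.syncthreads() here before the next step.
        stride //= 2
    return shared[0]

def reduce_sum(data, block_dim=8):
    total = [0]  # one-element accumulator, like a length-1 device array
    for start in range(0, len(data), block_dim):
        partial = block_tree_sum(data[start:start + block_dim], block_dim)
        # In Numba, thread 0 of each block would do:
        #   cuda.atomic.add(total, 0, partial)
        total[0] += partial
    return total[0]

data = list(range(1, 21))
result = reduce_sum(data)
```

Each block touches the shared accumulator only once, so contention on the atomic scales with the number of blocks rather than the number of elements.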

A short, focused path from the GPU execution model to shared-memory reductions and tiled matrix multiplication — all in Python with Numba.