Week 1

Chapter 1: Thinking in Threads: The CUDA Model with Numba

Write your first real CUDA kernels in Python with Numba. Understand the thread/block/grid launch hierarchy, how each thread finds its own global index, why every kernel needs a bounds check, and how grid-stride loops make a kernel work for any input size.

Chapter Overview

CUDA is NVIDIA's programming model for GPUs, but you do not need C++ to start using it. Numba compiles a subset of Python, marked with the @cuda.jit decorator, to PTX (NVIDIA's intermediate GPU assembly, which the driver then translates to machine code for your specific GPU), so you can write genuine CUDA kernels, with threadIdx, blockIdx, shared memory, and atomics, in plain Python.

The mental shift this chapter asks of you is the one that matters most: on a CPU you write a loop that visits each element one at a time; on a GPU you write the body of the loop once and launch thousands of threads that each run that body on a different element, with as many running at once as the hardware allows.
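As a rough illustration of that shift, the sketch below writes the same "add one" operation twice in ordinary Python: once as a CPU-style loop and once as a loop body that takes its index as an argument. The function names and the tiny array are hypothetical, and the final `for` loop is only a CPU stand-in for a kernel launch; no GPU is involved.

```python
import numpy as np

def add_one_cpu(x, out):
    # CPU style: a single loop visits every element in turn.
    for i in range(x.size):
        out[i] = x[i] + 1.0

def add_one_body(i, x, out):
    # GPU style: only the loop body, written for one index i.
    out[i] = x[i] + 1.0

x = np.arange(8, dtype=np.float32)
out_cpu = np.empty_like(x)
out_gpu_style = np.empty_like(x)

add_one_cpu(x, out_cpu)

# Stand-in for a kernel launch: call the body once per "thread".
# The order is reversed on purpose, because on a real GPU the threads
# run concurrently and in no guaranteed order.
for thread_id in reversed(range(x.size)):
    add_one_body(thread_id, x, out_gpu_style)
```

Both arrays end up identical, which is the point: the body must be written so that the result does not depend on the order in which indices are processed.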

This chapter builds the foundation everything else rests on:

Why Numba: real CUDA kernels without leaving Python
Threads, Blocks & Grids: the launch hierarchy you configure on every kernel
The Global Index: how a thread figures out which element it owns
Bounds Checks & Grid-Stride Loops: making a kernel correct for any size, with any launch configuration

Chapter Roadmap

Why Numba for CUDA

@cuda.jit compiles a Python subset to PTX — real CUDA kernels without a C++ toolchain.

@cuda.jit: Python In, PTX Out

How a launch is shaped

Threads, Blocks & Grids

kernel[blocks, threads](...) configures the launch hierarchy; threadIdx/blockIdx/blockDim/gridDim locate a thread.

The Launch Hierarchy

How a thread finds its data

The Global Index

i = blockIdx.x*blockDim.x + threadIdx.x (or cuda.grid(1)) maps each thread to one element.

From Block Coordinates to a Global Index

Making it correct for any size

Bounds Checks & Grid-Stride Loops

if i < n guards the ragged tail; grid-stride loops decouple launch size from data size.

The Bounds Check

The Grid-Stride Loop

CUDA C/C++ is the traditional way to program NVIDIA GPUs, but it requires a compiler toolchain and a context switch out of Python. Numba lets you stay in Python and still emit the same kind of GPU code — ideal for learning the model and for accelerating numerical Python.

@cuda.jit: Python In, PTX Out

Numba is a just-in-time compiler. When you decorate a function with @cuda.jit, Numba does not run it as Python — the first time you launch it, Numba compiles that function to PTX (NVIDIA's GPU assembly) and runs it on the device. Inside the kernel you get the real CUDA vocabulary: cuda.threadIdx.x, cuda.blockIdx.x, cuda.shared.array, cuda.syncthreads(), cuda.atomic.add.
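A minimal sketch of what this looks like in practice, assuming Numba and a CUDA-capable GPU are installed. The kernel name `double_kernel`, the array size, and the block size of 128 are illustrative choices, not values from this chapter. The guard at the top lets the script run (and simply skip the launch) on a machine without a GPU.

```python
import numpy as np

try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except Exception:
    # Numba is not installed, or the CUDA driver could not be loaded.
    cuda = None
    gpu_ok = False

x = np.arange(1000, dtype=np.float32)
expected = x * 2.0
result = None  # stays None when no GPU is available

if gpu_ok:
    @cuda.jit
    def double_kernel(src, dst):
        # Each thread computes its own global index.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # Guard: the last block may contain threads past the end of the array.
        if i < src.size:
            dst[i] = src[i] * 2.0

    d_src = cuda.to_device(x)              # copy the input to the GPU
    d_dst = cuda.device_array_like(d_src)  # allocate the output on the GPU

    threads = 128                                # threads per block (illustrative)
    blocks = (x.size + threads - 1) // threads   # enough blocks to cover x

    # Compilation happens here, on the first launch, not at decoration time.
    double_kernel[blocks, threads](d_src, d_dst)
    result = d_dst.copy_to_host()
```

If a GPU is present, `result` should match `expected`. Launching `double_kernel` a second time with the same argument types reuses the compiled code, so only the first call pays the compile cost.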

The trade-offs versus CUDA C++ are worth knowing. You give up some low-level control and the absolute peak of hand-tuned C++, but you gain the ability to write, launch, and debug a kernel without leaving Python or installing nvcc. The programming model is identical, so everything you learn here transfers directly to CUDA C++.

Mathematical Intuition

A kernel is a function executed $N$ times in parallel, once per thread. If launching the kernel costs a fixed overhead $t_0$ and each thread takes time $w$ to do its work, the wall-clock time is roughly $t_0 + \frac{N \cdot w}{P}$, where $P$ is the number of threads running concurrently (when $N \le P$ the second term cannot drop below $w$, the time of a single thread). Numba's one-time compile adds to $t_0$ on the first call only; amortized over many launches it becomes negligible.
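To get a feel for the magnitudes, the estimate above can be turned into a few lines of Python. All numbers here (launch overhead, per-thread time, concurrency, compile time) are made-up placeholders chosen for easy arithmetic, not measurements of any real GPU, and the helper names are hypothetical.

```python
def kernel_time(n, w, p, t0):
    """Rough model from the text: t0 + N * w / P (all times in seconds)."""
    return t0 + n * w / p

def per_launch_time(n, w, p, t0, compile_time, launches):
    """Average time per launch when a one-time compile is spread over `launches` calls."""
    return kernel_time(n, w, p, t0) + compile_time / launches

# Placeholder values, for illustration only.
N = 1_000_000         # threads launched
W = 1e-6              # time per thread
P = 10_000            # threads running concurrently
T0 = 5e-6             # fixed launch overhead
COMPILE = 0.5         # one-time JIT compile cost

steady = kernel_time(N, W, P, T0)
first_call = per_launch_time(N, W, P, T0, COMPILE, launches=1)
amortized = per_launch_time(N, W, P, T0, COMPILE, launches=10_000)
```

With these placeholders the first call is dominated by the compile, while the average over ten thousand launches is already within a factor of two of the steady-state estimate.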