Write your first real CUDA kernels in Python with Numba. Understand the thread/block/grid launch hierarchy, how each thread finds its own global index, why every kernel needs a bounds check, and how grid-stride loops make a kernel work for any input size.
CUDA is the language GPUs speak, but you do not need C++ to start speaking it. Numba compiles a subset of Python decorated with @cuda.jit to PTX, NVIDIA's intermediate GPU instruction set, which the driver then turns into machine code for your particular GPU. That means you can write genuine CUDA kernels in plain Python, with threadIdx, blockIdx, shared memory, and atomics all available.
The mental shift this chapter asks of you is the one that matters most: on a CPU you write a loop that visits each element one at a time; on a GPU you write the body of the loop once and launch thousands of threads that each run that body on a different element, all at the same time.
This chapter builds the foundation everything else rests on:
@cuda.jit compiles a Python subset to PTX — real CUDA kernels without a C++ toolchain.
kernel[blocks, threads](...) configures the launch hierarchy; threadIdx/blockIdx/blockDim/gridDim locate a thread.
i = blockIdx.x*blockDim.x + threadIdx.x (or cuda.grid(1)) maps each thread to one element.
if i < n guards the ragged tail; grid-stride loops decouple launch size from data size.
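To see how the index formula, the bounds check, and the grid-stride loop fit together, here is a CPU-only emulation in plain Python. It is not Numba code and not how a GPU schedules work: the nested `for` loops simply stand in for blocks and threads so the index arithmetic can be inspected. The function names and the tiny sizes are hypothetical, chosen for illustration.

```python
def run_guarded(data, blocks, threads):
    """Emulate a one-element-per-thread kernel with an `if i < n` guard."""
    out = list(data)
    n = len(data)
    for block_idx in range(blocks):          # stands in for blockIdx.x
        for thread_idx in range(threads):    # stands in for threadIdx.x
            i = block_idx * threads + thread_idx  # global index
            if i < n:                        # guard the ragged tail
                out[i] = 2 * data[i]
    return out


def run_grid_stride(data, blocks, threads):
    """Emulate a grid-stride kernel: each thread hops by the total grid size."""
    out = list(data)
    n = len(data)
    stride = blocks * threads                # gridDim.x * blockDim.x
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            start = block_idx * threads + thread_idx
            for i in range(start, n, stride):  # the range bound replaces the guard
                out[i] = 2 * data[i]
    return out


data = list(range(10))
threads = 4
blocks = (len(data) + threads - 1) // threads  # 3 blocks -> 12 threads, 2 idle

full = run_guarded(data, blocks, threads)      # grid covers the data
short = run_guarded(data, 1, threads)          # grid too small: tail is skipped
strided = run_grid_stride(data, 1, threads)    # same small grid, all elements done

print(full)
print(short)
print(strided)
```

With a single block of four threads, the guarded version only reaches the first four elements, while the grid-stride version still processes all ten. In a real Numba kernel the same loop would typically use `cuda.grid(1)` for the start index and `cuda.gridsize(1)` for the stride.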
CUDA C/C++ is the traditional way to program NVIDIA GPUs, but it requires a compiler toolchain and a context switch out of Python. Numba lets you stay in Python and still emit the same kind of GPU code — ideal for learning the model and for accelerating numerical Python.
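Numba is a just-in-time compiler. When you decorate a function with @cuda.jit, Numba does not run it as Python. The first time you launch it with a given set of argument types, Numba compiles that function to PTX (NVIDIA's virtual GPU assembly), the driver translates the PTX to machine code for your GPU, and the result runs on the device. The compiled kernel is cached in memory, so later launches with the same argument types skip compilation. Inside the kernel you get the real CUDA vocabulary: cuda.threadIdx.x, cuda.blockIdx.x, cuda.shared.array, cuda.syncthreads(), cuda.atomic.add.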
The trade-offs versus CUDA C++ are worth knowing. You give up some low-level control and the absolute peak of hand-tuned C++, but you gain the ability to write, launch, and debug a kernel without leaving Python or installing nvcc. The programming model is identical, so everything you learn here transfers directly to CUDA C++.
A kernel is a function executed $N$ times in parallel, once per thread. If launching the kernel costs a fixed overhead $T_{\text{launch}}$ and each thread does work $t$, the wall-clock time is roughly $T \approx T_{\text{launch}} + \frac{N}{P}\,t$, where $P$ is the number of threads running concurrently. Numba's one-time compile adds $T_{\text{compile}}$ to $T$ on the first call only; amortized over many launches it disappears.
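The rough cost model in the paragraph above can be written out as a small function to get a feel for the amortization. This is a sketch under stated assumptions: the parameter names are illustrative, the model is taken to be launch overhead plus (threads / concurrent threads) times per-thread work, and every number below is made up for the example rather than measured.

```python
def estimated_time(n_threads, concurrent, t_launch, t_thread, t_compile=0.0):
    """Rough wall-clock estimate in seconds.

    Fixed launch overhead, plus per-thread work scaled by how many rounds of
    `concurrent` threads are needed, plus an optional one-time compile cost.
    """
    return t_compile + t_launch + (n_threads / concurrent) * t_thread


# Made-up illustrative values, not measurements.
N = 1_000_000      # total threads
P = 10_000         # threads running concurrently
T_LAUNCH = 10e-6   # launch overhead, seconds
T_THREAD = 1e-6    # work per thread, seconds
T_COMPILE = 0.5    # one-time JIT compile, seconds

steady = estimated_time(N, P, T_LAUNCH, T_THREAD)
first = estimated_time(N, P, T_LAUNCH, T_THREAD, T_COMPILE)


def average_over(launches):
    """Mean time per launch when only the first launch pays for compilation."""
    return (first + (launches - 1) * steady) / launches


for k in (1, 10, 10_000):
    print(f"{k:>6} launches: {average_over(k) * 1e3:.4f} ms per launch")
```

With these hypothetical numbers the compile cost dominates a single launch but contributes little to the average once the kernel has been launched thousands of times.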
What actually happens the first time you call a @cuda.jit kernel, and why is the second call faster?