The GPU has its own memory, separate from the host memory your program normally works in. Learn how to move arrays to the device and back, allocate and mutate device arrays, index 2D grids for matrices, and use fast per-block shared memory with a barrier to let threads cooperate.
A kernel can only touch memory that lives on the GPU. Your NumPy arrays live in host (CPU) memory, in a completely separate address space connected to the GPU by the PCIe bus. Before a kernel can run, its inputs must be copied to the device, and after it finishes, the results must be copied back.
Managing that movement deliberately is much of what separates a fast GPU program from a slow one. Transfers are expensive, so copy data to the device once, do as much work as possible there, and copy back only the results.
This chapter covers the memory model and the tools Numba gives you to work with it:
CPU and GPU have separate address spaces bridged by PCIe; transfers are explicit and costly.
Stage inputs with to_device, allocate outputs with device_array, retrieve with copy_to_host.
cuda.grid(2) gives each thread a (row, col); 2D blocks tile the matrix; guard both dimensions.
Fast per-block on-chip memory; a barrier coordinates cooperative load-then-read across threads.