Chapter 2: Moving Data: GPU Memory with Numba

A kernel can only touch memory that lives on the GPU. Your NumPy arrays live in host (CPU) memory, in a completely separate address space connected to the GPU by the PCIe bus. Before a kernel can run, its inputs must be copied to the device, and after it finishes, the results must be copied back.

Managing that movement deliberately is most of what separates a fast GPU program from a slow one. Transfers are expensive, so you want to copy the inputs to the device once, do as much work as possible there, and copy only the results back.

This chapter covers the memory model and the tools Numba gives you to work with it:

Host vs Device Memory: two separate worlds, bridged by explicit copies
to_device & copy_to_host: staging inputs and retrieving results
2D Grids for Matrices: indexing rows and columns with cuda.grid(2)
Shared Memory & __syncthreads(): fast on-chip memory that lets a block's threads cooperate, with a barrier (cuda.syncthreads() in Numba) to keep them in step
