CUDA Vector Addition Kernel

Problem Statement

Write a CUDA kernel that adds two 1D float arrays element-wise: c = a + b.

Background

Every CUDA thread computes its own global index. In Numba, cuda.grid(1) returns it; the call is equivalent to blockIdx.x * blockDim.x + threadIdx.x. Because one thread handles one element, a launch with threads threads per block needs ceil(n / threads) blocks to cover all n elements.
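The index arithmetic can be checked on the CPU before writing any kernel. The sketch below is only an illustration in plain Python, not Numba code: global_index and launch_indices are hypothetical helper names that mirror the formula above for a 1D launch.

```python
# CPU-only illustration of the 1D index arithmetic; no GPU or Numba required.
# global_index and launch_indices are hypothetical helpers, not Numba APIs.

def global_index(block_idx, block_dim, thread_idx):
    """Mirror of blockIdx.x * blockDim.x + threadIdx.x for a 1D launch."""
    return block_idx * block_dim + thread_idx


def launch_indices(blocks, threads):
    """List the global index of every thread in a launch of blocks x threads."""
    return [
        global_index(block, threads, thread)
        for block in range(blocks)
        for thread in range(threads)
    ]


if __name__ == "__main__":
    # 3 blocks of 4 threads each: every index from 0 to 11 appears exactly once.
    print(launch_indices(blocks=3, threads=4))
```

Each (block, thread) pair maps to a distinct index, which is why one thread can own one array element without coordinating with any other thread.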

Your Task

Implement add_kernel and a run(n=1024) that launches it over a 1D grid and returns whether the result matches a + b.

How it is tested

Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.

Example:

Input:

n = 1024

Output:

True

Reasoning:

The input $n = 1024$ sets the length of the 1D float arrays a and b.
add_kernel is launched over a 1D grid of ceil(n / threads) blocks. Each thread handles one element and computes $c_i = a_i + b_i$.
The result of the kernel launch is stored in gpu_result, which is then compared with the reference computed by NumPy: $\text{reference} = a + b$.
The final output is True if the two arrays match within tolerance, as determined by np.allclose(gpu_result, reference).
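The flow described in the reasoning above can be emulated on the CPU to see what each thread does. This is a rough sketch under stated assumptions, not the expected solution: emulated_add and emulated_run are hypothetical names, nested loops stand in for the grid, plain lists stand in for device arrays, and math.isclose is only an approximate stand-in for np.allclose.

```python
import math


def emulated_add(a, b, threads=4):
    """Emulate a 1D launch on the CPU: one inner-loop iteration per GPU thread."""
    n = len(a)
    blocks = (n + threads - 1) // threads  # enough blocks to cover n elements
    c = [0.0] * n
    for block in range(blocks):
        for thread in range(threads):
            i = block * threads + thread  # the value cuda.grid(1) would give
            if i < n:  # guard: surplus threads in the last block do nothing
                c[i] = a[i] + b[i]
    return c


def emulated_run(n=1024):
    """Build inputs, 'launch' the emulated kernel, and compare to a reference."""
    a = [float(i) for i in range(n)]
    b = [2.0 * x for x in a]
    result = emulated_add(a, b)
    reference = [x + y for x, y in zip(a, b)]
    # Approximate stand-in for np.allclose(result, reference).
    return all(
        math.isclose(r, e, rel_tol=1e-5, abs_tol=1e-8)
        for r, e in zip(result, reference)
    )


if __name__ == "__main__":
    print(emulated_run())
```

On a real GPU the two loops disappear: the hardware runs the loop body once per thread, and the body is what goes inside the @cuda.jit kernel.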

Constraints:

Use @cuda.jit and i = cuda.grid(1)
Guard with if i < a.size before writing
Compute the block count as blocks = (n + threads - 1) // threads
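The block-count formula and the bounds guard are linked: rounding up can launch more threads than there are elements. The following sketch, in plain Python with the hypothetical helpers blocks_for and surplus_threads, shows the integer ceiling division and counts the threads the guard has to mask. The block size of 256 is an assumed value chosen for illustration.

```python
import math


def blocks_for(n, threads):
    """Integer ceiling division: the smallest block count covering n elements."""
    return (n + threads - 1) // threads


def surplus_threads(n, threads):
    """Threads launched beyond n; these are the ones the guard must skip."""
    return blocks_for(n, threads) * threads - n


if __name__ == "__main__":
    threads = 256  # assumed block size, chosen for illustration
    for n in (1024, 1000, 257, 1):
        blocks = blocks_for(n, threads)
        # The integer formula agrees with the floating-point ceiling.
        assert blocks == math.ceil(n / threads)
        print(n, blocks, surplus_threads(n, threads))
```

With n = 1024 and 256 threads per block there is no surplus, so the example input would pass even without the guard. Any n that is not a multiple of the block size would write out of bounds without it.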

Environment: Python 3.13.1, T4 GPU
