Write a CUDA kernel that adds two 1D float arrays element-wise: c = a + b.
Every CUDA thread computes its own global index. With Numba you get it from cuda.grid(1), which is equivalent to blockIdx.x * blockDim.x + threadIdx.x. One thread handles one element, so the launch needs ceil(n / threads) blocks, where threads is the number of threads per block.
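The indexing and block-count arithmetic above can be checked on the CPU before writing any GPU code. The sketch below is a plain-Python simulation, not Numba code: blocks_needed and simulate_add are hypothetical helper names, 256 threads per block is an assumed value, and the i < n bounds guard is an addition needed whenever n is not a multiple of the block size.

```python
def blocks_needed(n, threads_per_block):
    """Integer form of ceil(n / threads_per_block)."""
    return (n + threads_per_block - 1) // threads_per_block


def simulate_add(a, b, threads_per_block=256):
    """CPU stand-in for a 1D launch: one simulated thread per element."""
    n = len(a)
    c = [0.0] * n
    num_blocks = blocks_needed(n, threads_per_block)
    for block_idx in range(num_blocks):              # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            # Same formula as the global index described above.
            i = block_idx * threads_per_block + thread_idx
            # Guard (an assumption of this sketch): the last block may
            # contain surplus threads when n is not a multiple of the block size.
            if i < n:
                c[i] = a[i] + b[i]
    return c


# n = 1000 is deliberately not a multiple of 256, so the guard matters.
a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
c = simulate_add(a, b)
```

With 256 threads per block, n = 1024 needs exactly 4 blocks, while n = 1025 needs 5, and the fifth block does real work in only one of its threads.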
Implement add_kernel and a run(n=1024) that launches it over a 1D grid and returns whether the result matches a + b.
Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.
Example input: n = 1024
Expected output: True
n = 1024 sets the length of the 1D float input arrays a and b.
The add_kernel function is launched over a 1D grid with ceil(n / threads) blocks; each thread handles one element and performs the element-wise addition c[i] = a[i] + b[i].
The kernel's output is stored in the gpu_result array, which is then compared with the reference result computed by NumPy: reference = a + b.
run returns True if the two results match within a tolerance, as determined by np.allclose(gpu_result, reference).
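To make "match within a tolerance" concrete, here is a minimal pure-Python sketch of the element-wise test that np.allclose documents, |x - y| <= atol + rtol * |y|, using NumPy's default rtol=1e-5 and atol=1e-8. The name allclose_sketch is hypothetical, and the sketch leaves out NumPy's broadcasting and NaN/infinity handling.

```python
def allclose_sketch(xs, ys, rtol=1e-5, atol=1e-8):
    """Return True if every pair satisfies |x - y| <= atol + rtol * |y|."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    # all() over a generator yields a plain Python bool.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(xs, ys))


# A difference of 1e-7 on a value near 2.0 is inside the default tolerance.
close = allclose_sketch([1.0, 2.0], [1.0, 2.0 + 1e-7])
# A difference of 0.1 on a value near 1.0 is not.
far = allclose_sketch([1.0], [1.1])
```

For a single float addition per element, the GPU and NumPy results are normally expected to agree well inside this tolerance.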