Fill an output array with each element's own global thread index: out[i] = i. This is the CUDA equivalent of np.arange.
Compute the index the long way to prove you understand the mapping: i = blockIdx.x * blockDim.x + threadIdx.x. In Numba that's cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x.
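The mapping can be checked on the CPU before any kernel is written. The sketch below is a plain-Python emulation, not Numba code: the helper name global_indices and the launch shape of 4 blocks of 256 threads are assumptions chosen for illustration, with the two loop variables standing in for cuda.blockIdx.x and cuda.threadIdx.x.

```python
def global_indices(grid_dim, block_dim):
    """Emulate the index each thread of a 1D launch would compute.

    grid_dim  plays the role of the number of blocks (gridDim.x)
    block_dim plays the role of threads per block (blockDim.x)
    """
    indices = []
    for block_idx in range(grid_dim):        # stands in for cuda.blockIdx.x
        for thread_idx in range(block_dim):  # stands in for cuda.threadIdx.x
            indices.append(block_idx * block_dim + thread_idx)
    return indices


# Assumed launch shape: 4 blocks of 256 threads, i.e. 1024 threads in total.
idx = global_indices(grid_dim=4, block_dim=256)
print(idx[:3], idx[255:258], idx[-1])  # [0, 1, 2] [255, 256, 257] 1023
```

The values around 255 and 256 show the hand-off from the last thread of block 0 to the first thread of block 1: the formula yields consecutive indices with no gaps or repeats.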
Implement iota_kernel and run(n=1024), where run returns whether out equals np.arange(n).
Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.
Example input: n = 1024
Expected output: True
The input n = 1024 determines the size of the output array, which will have 1024 elements.
The iota_kernel function is launched with a block and thread configuration that covers all n elements; each thread computes its global index as i = blockIdx.x * blockDim.x + threadIdx.x.
Each thread writes i to the corresponding element of the output array out, producing the sequential integers from 0 to n-1.
The array out is compared with the reference array np.arange(n) using np.allclose, which returns True when the two arrays agree element-wise within tolerance. Here they are identical, so the output is True.
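One way to pick a configuration that covers all n elements is ceiling division on the block count, paired with a bounds check inside the kernel for sizes that are not a multiple of the block size. The sketch below emulates this on the CPU in plain Python. The names launch_config and emulate_iota and the default of 256 threads per block are illustrative assumptions, not part of the exercise.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) with blocks * threads_per_block >= n.

    threads_per_block=256 is an assumed default, not a requirement.
    """
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def emulate_iota(n, threads_per_block=256):
    """CPU emulation of what each thread in such a launch would do."""
    blocks, threads = launch_config(n, threads_per_block)
    out = [-1] * n
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            i = block_idx * threads + thread_idx
            if i < n:       # guard: surplus threads in the last block do nothing
                out[i] = i
    return out


print(launch_config(1024))                         # (4, 256)
print(launch_config(1000))                         # (4, 256)
print(emulate_iota(1000) == list(range(1000)))     # True
```

For n = 1024 the launch fits exactly, so the guard never fires. For n = 1000 the same four blocks are launched and the last 24 threads are skipped by the i < n check, which is what keeps a real kernel from writing past the end of out.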