Copy a 1D array using a grid-stride loop so the kernel is correct even when there are fewer total threads than elements: out = x.
A grid-stride loop lets each thread process multiple elements: it starts at its global thread index and steps by the total number of threads in the grid (gridDim.x * blockDim.x) until it passes the end of the array. This decouples the launch configuration from the data size, so the same kernel works for any n.
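The indexing pattern can be checked on the CPU before writing any kernel. The sketch below is a plain-Python simulation, not GPU code: the function name simulate_grid_stride_copy and its parameters are hypothetical, and the two outer loops stand in for the threads that a real launch would run in parallel.

```python
def simulate_grid_stride_copy(x, blocks, threads_per_block):
    """CPU simulation of a 1D grid-stride copy (illustrative only)."""
    n = len(x)
    out = [None] * n
    # Total number of threads in the grid: gridDim.x * blockDim.x.
    stride = blocks * threads_per_block
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Global index of this simulated thread.
            start = block_idx * threads_per_block + thread_idx
            # Grid-stride loop: visit start, start + stride, ... while < n.
            for i in range(start, n, stride):
                out[i] = x[i]
    return out


x = list(range(4096))
# 8 blocks x 128 threads = 1024 simulated threads for 4096 elements.
print(simulate_grid_stride_copy(x, 8, 128) == x)
```

Because the inner loop stops as soon as the index reaches n, the same simulation also handles the case where there are more threads than elements: the extra threads simply do no work.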
Implement copy_stride_kernel and a function run(n=4096) that launches the kernel with deliberately few blocks and returns whether the copy is exact.
Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.
n = 4096 (launched with 8 blocks x 128 threads)
True
The launch uses 8⋅128=1024 threads in total, so for n = 4096 each thread handles 4 elements, spaced 1024 apart, in its grid-stride loop.
The copy_stride_kernel function copies the input array x to the output array out using the grid-stride loop, resulting in an exact copy of the input array.
The run function compares the GPU result with the reference array using np.allclose and returns True if the copy is exact, which is the case for the given input.
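To make the per-thread workload concrete, the following sketch lists which indices a given thread would touch under the 8 x 128 launch from the example. It runs on the CPU only, and the helper name indices_for_thread is hypothetical.

```python
def indices_for_thread(global_id, n, total_threads):
    """Indices visited by one thread in a 1D grid-stride loop."""
    return list(range(global_id, n, total_threads))


total_threads = 8 * 128  # 1024 threads in the grid
n = 4096

print(indices_for_thread(0, n, total_threads))
# [0, 1024, 2048, 3072]
print(indices_for_thread(total_threads - 1, n, total_threads))
# [1023, 2047, 3071, 4095]
```

Every thread ends up with exactly 4 indices here because 4096 is a multiple of 1024; for an n that is not a multiple of the thread count, some threads would get one element fewer.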