CUDA Grid-Stride Loop

Problem Statement

Copy a 1D array (out = x) using a grid-stride loop, so that the kernel stays correct even when the launch has fewer total threads than elements.

Background

A grid-stride loop lets each thread process multiple elements: a thread starts at its global index and steps by the total number of threads in the grid (gridDim.x * blockDim.x) until it passes the end of the array. This decouples the launch configuration from the data size, so the same kernel is correct for any n.
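
To make the index pattern concrete, here is a small CPU-only sketch that mimics the loop in plain Python. It is an illustration only, not GPU code: the helper names grid_stride_indices and simulate_copy are hypothetical, and the "threads" simply run one after another.

```python
def grid_stride_indices(thread_id, total_threads, n):
    """Indices one thread visits: thread_id, thread_id + total_threads, ... below n."""
    return list(range(thread_id, n, total_threads))


def simulate_copy(x, total_threads):
    """Run every simulated thread in turn and copy the elements it owns."""
    out = [None] * len(x)
    for thread_id in range(total_threads):
        for i in grid_stride_indices(thread_id, total_threads, len(x)):
            out[i] = x[i]
    return out


x = list(range(10))
print(grid_stride_indices(1, 4, 10))  # [1, 5, 9]
print(simulate_copy(x, 4) == x)       # True: 4 threads cover all 10 elements
```

Threads whose starting index is already at or past n get an empty index list, so the same pattern also holds when there are more threads than elements.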

Your Task

Implement copy_stride_kernel and run(n=4096). run must launch the kernel with deliberately few blocks, so that there are fewer threads than elements, and return whether the copy is exact.

How it is tested

Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.

Example:

Input:

n = 4096 (launched with 8 blocks x 128 threads)

Output:

True

Reasoning:

The launch creates $8 \cdot 128 = 1024$ threads, fewer than the $n = 4096$ elements to copy.
Each thread starts at its global index and steps by the total number of threads ( $1024$ ), so thread $t$ copies elements $t$, $t + 1024$, $t + 2048$ and $t + 3072$: four elements per thread, covering all $4096$ elements exactly once.
copy_stride_kernel therefore writes every element of the input array x to the output array out, producing an exact copy.
The run function compares the GPU result with the reference array using np.allclose, which returns True because the copy is exact.
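
The same count can be written in general form. The symbols $t$ (global thread index) and $S$ (stride, the total number of threads) are introduced here for illustration only. A thread with $t < n$ visits $i = t + kS$ for $k = 0, 1, 2, \dots$ while $i < n$, so it runs

$$
\left\lceil \frac{n - t}{S} \right\rceil
$$

iterations. With $n = 4096$ and $S = 1024$, this gives $\lceil 4096 / 1024 \rceil = 4$ for $t = 0$ and $\lceil 3073 / 1024 \rceil = 4$ for $t = 1023$, so every thread copies exactly four elements and the total is $1024 \cdot 4 = 4096$.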

Constraints:

start = cuda.grid(1); stride = cuda.gridDim.x * cuda.blockDim.x
Loop i from start to n stepping by stride
Must work when total threads < n

Environment:

Python 3.13.1, GPU: T4
