CUDA Global Thread Index

Problem Statement

Fill an output array so that each element holds the global index of the thread that writes it: out[i] = i. This is the CUDA equivalent of np.arange.

Background

Compute the index the long way to prove you understand the mapping: i = blockIdx.x * blockDim.x + threadIdx.x. In Numba that's cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x.
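To see why this formula yields every index exactly once, the mapping can be emulated on the CPU with two nested loops. This is only a sketch in plain Python: simulate_global_indices is a hypothetical helper, not part of Numba, and real GPU threads run concurrently rather than in loop order.

```python
def simulate_global_indices(grid_dim, block_dim):
    """Enumerate i = blockIdx.x * blockDim.x + threadIdx.x for a 1D launch.

    grid_dim  plays the role of gridDim.x  (number of blocks)
    block_dim plays the role of blockDim.x (threads per block)
    """
    indices = []
    for block_idx in range(grid_dim):        # stands in for blockIdx.x
        for thread_idx in range(block_dim):  # stands in for threadIdx.x
            indices.append(block_idx * block_dim + thread_idx)
    return indices


# 4 blocks of 8 threads cover the indices 0..31 with no gaps or repeats.
indices = simulate_global_indices(grid_dim=4, block_dim=8)
```

Each block contributes a contiguous run of blockDim.x indices, and blockIdx.x * blockDim.x is the offset at which that run starts.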

Your Task

Implement iota_kernel and run(n=1024) returning whether out equals np.arange(n).

How it is tested

Your solution must define a top-level function run(...) that allocates the inputs, copies them to the GPU, launches your @cuda.jit kernel, and returns a Python bool from np.allclose(gpu_result, reference). The grader prints run(...); the expected output is True.

Example:

Input:

n = 1024

Output:

True

Reasoning:

The input value n = 1024 determines the size of the output array, which will have 1024 elements.
The iota_kernel function is launched with a suitable block and thread configuration to cover all n elements, using the formula i = blockIdx.x * blockDim.x + threadIdx.x to calculate the global thread index i.
Each thread then assigns its global thread index i to the corresponding element in the output array out, effectively creating an array of sequential integers from 0 to n-1.
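The resulting output array out is compared to the reference array np.arange(n) using np.allclose, which returns True when the two arrays match element-wise within a small tolerance. Integers this small are represented exactly in float32, so the arrays match exactly and the output is True.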
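The "suitable block and thread configuration" in the reasoning above is commonly chosen with a ceiling division, so that the grid holds at least n threads. The sketch below emulates such a launch in plain Python. The helper names and the block sizes are assumptions made for illustration; the problem does not prescribe them.

```python
def blocks_for(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_launch(n, threads_per_block):
    """CPU stand-in for launching the kernel over n elements."""
    out = [None] * n
    blocks = blocks_for(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i < n:              # surplus threads in the last block do nothing
                out[i] = float(i)  # stored as a float, like the float32 output
    return blocks, out


# 10 elements with 4 threads per block need 3 blocks (12 threads, 2 idle).
blocks, out = emulate_launch(n=10, threads_per_block=4)
```

When n is not a multiple of the block size, the last block has more threads than remaining elements, which is why the bounds check matters even though n = 1024 happens to divide evenly by common block sizes.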

Constraints:

Build i from cuda.blockIdx.x, cuda.blockDim.x, cuda.threadIdx.x
out[i] = i (stored as float32)
Bounds-check with if i < out.size
