Write a kernel that fills an output tensor with its own global indices: out[i] = i. This is the Triton equivalent of torch.arange.
The global index of each lane is exactly the offsets value you compute from program_id and tl.arange. Storing offsets (cast to float) demonstrates that you understand the program/block addressing scheme.
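As a rough illustration of that addressing scheme, the sketch below emulates it in plain Python, with no GPU or Triton required. The helper names program_offsets and emulate_iota and the block size of 4 are assumptions made for this example only; they are not part of the exercise.

```python
# Pure-Python emulation of how each Triton program derives its global indices.
# Helper names and the block size are illustrative assumptions.


def program_offsets(pid: int, block_size: int) -> list[int]:
    """Mimic `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` for one program."""
    block_start = pid * block_size
    return [block_start + lane for lane in range(block_size)]


def emulate_iota(n: int, block_size: int = 4) -> list[float]:
    """Fill out[i] = i by looping over programs the way a launch grid would."""
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):
        for offset in program_offsets(pid, block_size):
            if offset < n:  # mask: the last block may extend past n
                out[offset] = float(offset)
    return out


print(program_offsets(2, 4))  # [8, 9, 10, 11]
print(emulate_iota(10))
```

Program 2 with a block size of 4 owns indices 8 through 11, and the bounds check plays the role of the store mask when n is not a multiple of the block size.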
Implement iota_kernel and run(n=1024), where run returns whether out equals torch.arange(n).
Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.
Example input: n = 1024
Expected output: True
n = 1024 determines the size of the output tensor.
The iota_kernel function is launched and fills the output tensor with its own global indices: out[i] = i.
triton_out is compared to the reference tensor torch.arange(n) using torch.allclose.
run returns True if the two tensors are equal within the tolerance of torch.allclose, indicating that the kernel correctly filled the output tensor with its global indices.