
Triton Write Index Kernel

Problem Statement

Write a kernel that fills an output tensor with its own global indices: out[i] = i. This is the Triton equivalent of torch.arange.

Background

The global index of each lane is exactly the offsets value you compute from program_id and tl.arange. Storing offsets (cast to float) demonstrates that you understand the program/block addressing scheme.
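The addressing scheme can be checked without a GPU. The sketch below is a plain-Python emulation, not Triton code: the function name emulate_iota and the parameter block_size are hypothetical, the outer loop stands in for the program_id values of a one-dimensional launch grid, and the inner loop stands in for the lanes produced by tl.arange.

```python
def emulate_iota(n, block_size):
    """Emulate out[i] = i using the program/block addressing scheme.

    Hypothetical helper for illustration only; it mimics what each
    program instance and lane would compute, sequentially on the CPU.
    """
    # Ceiling division: enough programs to cover all n elements.
    num_programs = (n + block_size - 1) // block_size
    out = [None] * n
    for pid in range(num_programs):          # stands in for program_id
        for lane in range(block_size):       # stands in for tl.arange lanes
            offset = pid * block_size + lane # global index of this lane
            if offset < n:                   # mask: skip lanes past the end
                out[offset] = float(offset)  # store the index, cast to float
    return out
```

With n = 10 and block_size = 4, three programs are needed; the last one covers offsets 8 through 11, and the lanes at offsets 10 and 11 are skipped by the bounds check.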

Your Task

Implement iota_kernel and a function run(n=1024) that launches it and returns whether out equals torch.arange(n).

How it is tested

Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.

Example:

Input:

n = 1024

Output:

True

Reasoning:

The input value n = 1024 determines the size of the output tensor.
The iota_kernel function is launched, which fills the output tensor with its own global indices: out[i] = i.
The resulting output tensor triton_out is compared to the reference tensor torch.arange(n) using torch.allclose.
The comparison returns True if the two tensors are equal within the tolerance of torch.allclose (by default rtol=1e-05 and atol=1e-08), indicating that the kernel correctly filled the output tensor with its global indices.

Constraints:

out[i] = i (as float)
Derive i from program_id and tl.arange
Mask the tail block
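One way the three constraints could fit together is sketched below. This is a minimal sketch under stated assumptions, not a reference solution: the names out_ptr, BLOCK_SIZE, and block_size, the block size of 256, and the choice of float32 are assumptions not given in the problem statement. The reference tensor is created as float32 so that both arguments to torch.allclose share a dtype.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def iota_kernel(out_ptr, n, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE consecutive elements.
    pid = tl.program_id(axis=0)
    # Global index of every lane in this block.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The tail block may extend past n, so disable out-of-range lanes.
    mask = offsets < n
    # The value to store is the index itself, cast to float32 (assumed dtype).
    tl.store(out_ptr + offsets, offsets.to(tl.float32), mask=mask)


def run(n=1024, block_size=256):
    # block_size is an assumed default; tl.arange needs a power-of-two length.
    out = torch.empty(n, device="cuda", dtype=torch.float32)
    # One program per block, rounding up so the tail is covered.
    grid = (triton.cdiv(n, block_size),)
    iota_kernel[grid](out, n, BLOCK_SIZE=block_size)
    ref = torch.arange(n, device="cuda", dtype=torch.float32)
    return torch.allclose(out, ref)
```

With n = 1024 and a block size of 256 the grid has four programs and no lane is masked; a size such as n = 1000 exercises the tail mask in the last program.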
