Write a Triton kernel that adds two 1D float tensors element-wise: out = x + y.
A Triton program instance handles one block of BLOCK_SIZE elements. Use tl.program_id(0) to get the block index, build per-element offsets with tl.arange, guard out-of-bounds lanes with a mask, then tl.load / tl.store.
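The indexing described above can be illustrated without a GPU. What follows is a plain-Python sketch, not Triton code: the hypothetical helper program_offsets mimics what pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) and the mask offsets < n would produce for a single program instance. The block size of 4 and n = 10 are arbitrary values chosen to make the partially filled last block visible.

```python
def program_offsets(pid, block_size, n):
    """Emulate the offsets and mask of one program instance (hypothetical helper)."""
    # Stand-in for pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE).
    offsets = [pid * block_size + i for i in range(block_size)]
    # Lanes whose offset falls past the end of the tensor are masked off.
    mask = [off < n for off in offsets]
    return offsets, mask


# With n = 10 and a block size of 4, the last block (pid = 2) is only half full.
offsets, mask = program_offsets(pid=2, block_size=4, n=10)
print(offsets)  # [8, 9, 10, 11]
print(mask)     # [True, True, False, False]
```

In a real kernel the same mask would be passed to both tl.load and tl.store so that the two out-of-bounds lanes never touch memory.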
Implement add_kernel and a run(n=1024) that launches it over a 1D grid and returns whether the result matches x + y.
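To see how many program instances a 1D grid needs, the launch can be emulated in plain Python. This is only a sketch under stated assumptions: emulate_launch is a hypothetical helper that runs the program instances one after another (a GPU runs them in parallel), and the block size of 256 is an assumed value, not one given by the task. The cdiv helper is ordinary ceiling division; Triton provides triton.cdiv for the same purpose.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks of size b needed to cover a elements."""
    return (a + b - 1) // b


def emulate_launch(x, y, block_size):
    """Run every emulated program instance of a 1D grid in sequence (hypothetical helper)."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):   # one iteration per program instance
        for lane in range(block_size):
            off = pid * block_size + lane
            if off < n:                      # mask: skip out-of-bounds lanes
                out[off] = x[off] + y[off]
    return out


x = [float(i) for i in range(1024)]
y = [2.0 * v for v in x]
print(cdiv(1024, 256))  # 4
print(emulate_launch(x, y, 256) == [a + b for a, b in zip(x, y)])  # True
```

Because 1024 is a multiple of 256, no lane is masked off here; the mask only matters when n is not a multiple of the block size.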
Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.
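For intuition about what "matches within a tolerance" means, the element-wise test that torch.allclose documents is |a - b| <= atol + rtol * |b|, with defaults rtol = 1e-05 and atol = 1e-08. The sketch below reproduces that test on plain Python lists; it is a simplified stand-in that ignores NaN handling and shape broadcasting, and the sample values are made up for illustration.

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Plain-Python stand-in for the element-wise test behind torch.allclose."""
    return all(abs(p - q) <= atol + rtol * abs(q) for p, q in zip(a, b))


reference = [1.0, 2.0, 3.0]
print(allclose([1.0, 2.0, 3.0], reference))    # True
print(allclose([1.0, 2.0, 3.001], reference))  # False
```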
Example input: n = 1024
Expected output: True
n = 1024 is the total number of elements in the 1D float tensors x and y.
The add_kernel function is launched over a 1D grid, with each block handling BLOCK_SIZE elements, and performs the element-wise addition out[i] = x[i] + y[i].
The run function allocates x and y on the GPU, launches add_kernel, and stores the result in triton_out.
True indicates that triton_out matches the reference result torch_reference = x + y within a tolerance, as verified by torch.allclose.