Triton Vector Addition Kernel

Problem Statement

Write a Triton kernel that adds two 1D float tensors element-wise: out = x + y.

Background

A Triton program instance handles one block of BLOCK_SIZE elements. Use tl.program_id(0) to get the block index, build per-element offsets with tl.arange, guard out-of-bounds lanes with a mask, then tl.load / tl.store.
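The indexing described above can be modelled on the CPU. The following is a minimal plain-Python sketch, not Triton code: emulate_program is a hypothetical helper, and the sizes n = 10 and block_size = 4 are illustrative assumptions chosen so that the last block is only partly in bounds.

```python
def emulate_program(pid, x, y, out, n, block_size):
    """Plain-Python model of one program instance (not Triton code)."""
    # Mirrors: offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * block_size + lane for lane in range(block_size)]
    # Mirrors: mask = offsets < n
    mask = [off < n for off in offsets]
    for off, active in zip(offsets, mask):
        if active:  # masked-off lanes neither load nor store
            out[off] = x[off] + y[off]


# Illustrative sizes (assumptions): 10 elements, 4 lanes per block.
n, block_size = 10, 4
x = [float(i) for i in range(n)]
y = [2.0 * i for i in range(n)]
out = [0.0] * n

# One program instance per block; ceiling division covers the partial block.
num_programs = (n + block_size - 1) // block_size
for pid in range(num_programs):
    emulate_program(pid, x, y, out, n, block_size)
```

In this model the last program instance (pid = 2) builds offsets 8 through 11, and the mask disables lanes 10 and 11, which would otherwise index past the end of the inputs.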

Your Task

Implement add_kernel and a function run(n=1024) that launches it over a 1D grid and returns whether the result matches x + y.

How it is tested

Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.
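To make the comparison concrete, here is a hedged plain-Python model of the element-wise test that torch.allclose applies, |a - b| <= atol + rtol * |b|. The default tolerances shown (rtol = 1e-5, atol = 1e-8) are the torch.allclose defaults; the grader's actual tolerance arguments are elided in the statement above, so treat these values as an assumption. The function name allclose_model is hypothetical.

```python
def allclose_model(a, b, rtol=1e-5, atol=1e-8):
    """Plain-Python model of the torch.allclose criterion for flat lists.

    Simplification: requires equal lengths instead of broadcasting.
    """
    if len(a) != len(b):
        return False
    # Each pair must satisfy |a_i - b_i| <= atol + rtol * |b_i|.
    return all(abs(ai - bi) <= atol + rtol * abs(bi) for ai, bi in zip(a, b))


# A tiny rounding-sized difference passes; a 1e-3 difference does not.
close = allclose_model([1.0, 2.0], [1.0, 2.0 + 1e-7])
far = allclose_model([1.0], [1.001])
```

Note that the criterion is asymmetric: the relative term scales with the second argument, which is why the reference result is passed second.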

Example:

Input:

n = 1024

Output:

True

Reasoning:

The input n = 1024 represents the total number of elements in the 1D float tensors x and y.
The add_kernel function is launched over a 1D grid, with each program instance handling one block of BLOCK_SIZE elements, and performs element-wise addition: $\text{out}_i = x_i + y_i$.
The run function allocates x and y on the GPU, launches the add_kernel, and stores the result in triton_out.
The final output True indicates that triton_out matches the reference result torch_reference = x + y within a tolerance, as verified by torch.allclose.

Constraints:

Use @triton.jit and tl.program_id(0)
Compute offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
Mask out-of-bounds lanes with mask = offsets < n
Launch with grid = (triton.cdiv(n, BLOCK_SIZE),)
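The grid-size and masking constraints interact, and a short plain-Python sketch shows how. cdiv below models the ceiling division that triton.cdiv computes; grid_summary is a hypothetical helper, and BLOCK_SIZE = 256 is an assumed value that the problem does not fix.

```python
def cdiv(a, b):
    """Ceiling division for positive integers, as triton.cdiv computes."""
    return (a + b - 1) // b


def grid_summary(n, block_size):
    """Return (program instances launched, lanes disabled by the mask)."""
    num_programs = cdiv(n, block_size)
    # Lanes that exist in the grid but fall at or beyond index n.
    masked_lanes = num_programs * block_size - n
    return num_programs, masked_lanes


BLOCK_SIZE = 256  # assumed block size, for illustration only

exact = grid_summary(1024, BLOCK_SIZE)    # n is a multiple of BLOCK_SIZE
partial = grid_summary(1000, BLOCK_SIZE)  # last block is partly out of bounds
```

Under this assumed block size, n = 1024 needs 4 program instances with no masked lanes, while n = 1000 still needs 4 instances but leaves 24 lanes in the last block masked off. Without the mask, those lanes would read and write out of bounds.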
