Triton Fused Multiply-Add Kernel

Problem Statement

Implement a fused multiply-add kernel that computes out = a * x + b * y, where x and y are 1D tensors of the same length and a, b are scalars passed at runtime.

Background

Fusing multiple elementwise operations into one kernel means each input element is read once and each output element is written once, with no intermediate tensors stored in global memory. This saves memory bandwidth compared with separate multiply and add passes.
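
As a rough illustration of the bandwidth argument, the sketch below counts bytes moved under a simplified model. It assumes the unfused version runs three separate passes (a * x, b * y, then the add), that each pass materializes its result in global memory, that nothing is served from cache, and that elements are float32. The function name memory_traffic and the model itself are illustrative assumptions, not measurements.

```python
def memory_traffic(n, bytes_per_element=4):
    """Estimate global-memory traffic in bytes for out = a * x + b * y.

    Simplified model (an assumption, not a measurement): every pass reads its
    inputs from and writes its output to global memory, with no cache reuse.
    Returns (unfused_bytes, fused_bytes).
    """
    # Unfused: t1 = a * x     -> read n,  write n
    #          t2 = b * y     -> read n,  write n
    #          out = t1 + t2  -> read 2n, write n
    unfused_elements = (n + n) + (n + n) + (2 * n + n)
    # Fused: read x and y once, write out once.
    fused_elements = 2 * n + n
    return unfused_elements * bytes_per_element, fused_elements * bytes_per_element


unfused, fused = memory_traffic(1024)
print(unfused, fused)  # 28672 12288
```

Under these assumptions the fused kernel moves 3n elements instead of 7n. Real savings depend on caching and on how the unfused passes are actually written.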

Your Task

Implement fma_kernel and run(n=1024, a=2.0, b=-1.0).

How it is tested

Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.

Example:

Input:

n = 1024, a = 2.0, b = -1.0

Output:

True

Reasoning:

The run function allocates two 1D tensors x and y of size n=1024 on the GPU.
It then launches the fma_kernel with inputs x, y, and scalars a=2.0, b=-1.0, computing the output tensor out as $\text{out} = a \cdot x + b \cdot y = 2.0 \cdot x - 1.0 \cdot y$.
The result is compared to a reference output computed using PyTorch's built-in functions, torch_reference = 2.0 * x - 1.0 * y.
The function returns True if the two outputs are close, as determined by torch.allclose(triton_out, torch_reference), indicating that the Triton kernel produced the correct result.
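
To make the closeness check concrete, here is a minimal plain-Python sketch of the elementwise criterion that torch.allclose documents, |actual - expected| <= atol + rtol * |expected|, using its documented default tolerances. The helper name allclose_sketch is hypothetical, it works on plain lists of equal length, and it ignores NaN handling, so treat it as an illustration rather than a replacement for the PyTorch function.

```python
def allclose_sketch(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|.

    Assumes both sequences have the same length and contain finite floats.
    """
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# A difference of about 1e-6 on a value near 2.0 is within tolerance.
print(allclose_sketch([1.0, 2.0], [1.0, 2.000001]))  # True
# A difference of 1e-3 on a value near 1.0 is not.
print(allclose_sketch([1.0], [1.001]))  # False
```

Because the tolerance scales with the magnitude of the reference value, small float32 rounding differences between the kernel and the PyTorch reference would normally still pass.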

Constraints:

Single kernel computing ax + by
Two tensor pointers, two scalar args
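Mask the tail block, so that out-of-range offsets are neither read nor written when n is not a multiple of the block size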
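
The tail-block constraint can be illustrated with a plain-Python emulation of the blocked index arithmetic. This is not Triton code: the loop over pid stands in for the program instances of a 1D launch grid, the offsets list plays the role of a block of indices, and the mask decides which lanes may touch memory. In an actual Triton kernel these would correspond to tl.program_id, tl.arange, and the mask argument of tl.load and tl.store. The names fma_blocked, cdiv, and block_size are hypothetical, and the block size of 256 is an arbitrary choice.

```python
def cdiv(n, block):
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block


def fma_blocked(x, y, a, b, block_size=256):
    """Emulate out = a * x + b * y computed one block at a time on the CPU."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):       # one iteration per "program"
        offsets = [pid * block_size + i for i in range(block_size)]
        mask = [off < n for off in offsets]      # False for lanes past the end
        for off, valid in zip(offsets, mask):
            if valid:                            # masked lanes do nothing
                out[off] = a * x[off] + b * y[off]
    return out


n = 1000  # deliberately not a multiple of 256, so the last block is partial
x = [float(i) for i in range(n)]
y = [float(2 * i) for i in range(n)]
out = fma_blocked(x, y, a=2.0, b=-1.0)
print(cdiv(n, 256))  # 4
print(out == [2.0 * xi - 1.0 * yi for xi, yi in zip(x, y)])  # True
```

Without the mask, the fourth block would index elements 1000 through 1023, which lie outside the arrays. With n = 1024 every block is full and the mask is all True, so a test at that size alone would not exercise the tail.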

Environment

Python 3.13.1, GPU: T4
