Loading...
Implement a fused multiply-add kernel: out = a * x + b * y for two 1D tensors and two runtime scalars a, b.
Fusing multiple elementwise operations into one kernel means each element is read and written exactly once, saving memory bandwidth versus separate multiply and add passes.
Implement fma_kernel and run(n=1024, a=2.0, b=-1.0).
Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.
n = 1024, a = 2.0, b = -1.0
True
run function allocates two 1D tensors x and y of size n=1024 on the GPU.fma_kernel with inputs x, y, and scalars a=2.0, b=-1.0, computing the output tensor out as out=a⋅x+b⋅y=2.0⋅x−1.0⋅y.torch_reference = 2.0 * x - 1.0 * y.True if the two outputs are close, as determined by torch.allclose(triton_out, torch_reference), indicating that the Triton kernel produced the correct result.