Implement an elementwise square kernel: out = x * x.
A warm-up for elementwise math in registers. Compute the product of the loaded value with itself and store it.
Implement square_kernel and run(n=1024).
Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.
Example input: n = 1024
Expected output: True
n = 1024 determines the size of the input array x.
The square_kernel function computes the elementwise square of x, producing an output array out where each element is $x_i \cdot x_i = x_i^2$.
The run function allocates the input array x on the GPU, launches square_kernel, and computes a reference solution using PyTorch.
The result is True if the Triton output triton_out is close to the PyTorch reference torch_reference, as determined by torch.allclose.
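One possible shape for a solution is sketched below. It is an illustration rather than the reference answer. The problem statement fixes only the names square_kernel and run and the default n=1024; the parameter names x_ptr, out_ptr, and BLOCK_SIZE, the block size of 256, the float32 random input, and the default torch.allclose tolerances are all assumptions made for this sketch. It needs a CUDA-capable GPU with Triton installed.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def square_kernel(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Mask out lanes past the end, so n need not be a multiple of BLOCK_SIZE.
    mask = offsets < n
    # Load into registers, multiply the value by itself, and store the result.
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * x, mask=mask)


def run(n=1024):
    # Assumed input: random float32 values allocated directly on the GPU.
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    triton_out = torch.empty_like(x)

    # Assumed block size; one program instance per block of 256 elements.
    block_size = 256
    grid = (triton.cdiv(n, block_size),)
    square_kernel[grid](x, triton_out, n, BLOCK_SIZE=block_size)

    # PyTorch reference computed on the same device.
    torch_reference = x * x
    return torch.allclose(triton_out, torch_reference)
```

The grader calls print(run(...)), so run returns the boolean from torch.allclose directly instead of printing it. The mask is not strictly needed when n = 1024, since 1024 is a multiple of 256, but it keeps the kernel safe for other sizes.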