📘

Triton Square Kernel

EasyTriton Fundamentals

Problem Statement

Implement an elementwise square kernel: out = x * x.

Background

A warm-up for elementwise math in registers. Compute the product of the loaded value with itself and store it.

Your Task

Implement square_kernel and run(n=1024).

How it is tested

Your solution must define a top-level function run(...) that allocates inputs on the GPU, launches your Triton kernel, and returns a boolean from torch.allclose(triton_out, torch_reference, ...). The grader prints run(...); the expected output is True.

Example:

Input:

n = 1024

Output:

True

Reasoning:

The input value n = 1024 determines the size of the input array x.
The square_kernel function computes the elementwise square of x, resulting in an output array out where each element is $x_i \cdot x_i = x_i^2$ .
The run function allocates the input array x on the GPU, launches the square_kernel, and computes a reference solution using PyTorch.
The final output is True if the Triton output triton_out is close to the PyTorch reference solution torch_reference, as determined by torch.allclose.

Constraints:

out = x * x
Mask the tail block

Editor

Python 3.13.1

GPU · T4

Test Results

0/0

Run code to see test results.

📘