Implement multi-head attention by splitting Q, K, V into multiple heads.

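Given Q, K, V of shape (n, d) and a number of heads h, split each matrix column-wise into h heads of width d/h, apply scaled dot-product attention to each head independently, and concatenate the head outputs along the feature axis.

Assume d is divisible by h. No linear projections are needed: just split and concatenate.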
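The split-and-concatenate step can be sketched with NumPy reshapes. This is a minimal illustration, not a reference solution: the helper names `split_heads` and `concat_heads` and the toy array `x` are assumptions introduced here, and heads are taken as contiguous column blocks of width d/h.

```python
import numpy as np

def split_heads(x, h):
    """Split an (n, d) array into h heads, returned with shape (h, n, d // h)."""
    n, d = x.shape
    if d % h != 0:
        raise ValueError("d must be divisible by h")
    # Group the d columns into h contiguous blocks, then move the head axis first.
    return x.reshape(n, h, d // h).transpose(1, 0, 2)

def concat_heads(heads):
    """Inverse of split_heads: (h, n, d_head) back to (n, h * d_head)."""
    h, n, d_head = heads.shape
    return heads.transpose(1, 0, 2).reshape(n, h * d_head)

# Toy input (hypothetical): 2 rows, 4 features, split into 2 heads.
x = np.arange(8).reshape(2, 4)
heads = split_heads(x, 2)
print(heads[0])  # columns 0-1 of x
print(heads[1])  # columns 2-3 of x
```

Splitting and then concatenating returns the original array, which is a quick way to check that the head layout is the intended one.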

Input:

Output: Multi-head attention output (n, d), values rounded to 4 decimal places.

[1.7311 2.7311 3.2689 4.2689]
[4.2689 5.2689 6.7311 7.7311]

First, we split the input matrices Q, K, V column-wise into 2 heads: $Q_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $Q_2 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$; $K_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $K_2 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$; $V_1 = \begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}$, $V_2 = \begin{bmatrix} 3 & 4 \\ 7 & 8 \end{bmatrix}$.
Then, we apply scaled dot-product attention to each head, with per-head dimension $d_k = d/h = 2$. For the first head, $\text{Attention}(Q_1, K_1, V_1) = \text{softmax}\left(\frac{Q_1 K_1^T}{\sqrt{2}}\right) V_1 = \text{softmax}\left(\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}^T\right) \begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}$, where the softmax is taken over each row. The second head is handled the same way.
Next, we calculate the attention output for each head, $\text{Attention}_1 = \begin{bmatrix} 1.7311 & 2.7311 \\ 4.2689 & 5.2689 \end{bmatrix}$ and $\text{Attention}_2 = \begin{bmatrix} 3.2689 & 4.2689 \\ 6.7311 & 7.7311 \end{bmatrix}$, and concatenate them along the feature axis: $\text{Output} = \text{Concat}(\text{Attention}_1, \text{Attention}_2)$.
The final output is the concatenated output matrix, rounded to 4 decimal places: $\begin{bmatrix} 1.7311 & 2.7311 & 3.2689 & 4.2689 \\ 4.2689 & 5.2689 & 6.7311 & 7.7311 \end{bmatrix}$
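The procedure (split, attend per head, concatenate, round) can be sketched in NumPy as follows. This is one possible implementation under stated assumptions, not the platform's reference solution: the names `softmax` and `multi_head_attention` are chosen here, heads are assumed to be contiguous column blocks, the softmax runs over each row of the score matrix, and the usage inputs are random illustrative values rather than the problem's test case.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, h):
    """Multi-head attention without projections; returns an (n, d) array."""
    Q, K, V = (np.asarray(a, dtype=float) for a in (Q, K, V))
    n, d = Q.shape
    if d % h != 0:
        raise ValueError("d must be divisible by h")
    d_head = d // h
    outputs = []
    for i in range(h):
        # Head i uses columns [i * d_head, (i + 1) * d_head) of Q, K and V.
        cols = slice(i * d_head, (i + 1) * d_head)
        q, k, v = Q[:, cols], K[:, cols], V[:, cols]
        scores = q @ k.T / np.sqrt(d_head)   # (n, n), scaled by sqrt(d/h)
        outputs.append(softmax(scores) @ v)  # (n, d_head)
    # Concatenate heads along the feature axis and round to 4 decimals.
    return np.round(np.concatenate(outputs, axis=-1), 4)

# Illustrative inputs (hypothetical): n = 3 rows, d = 4 features, h = 2 heads.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
out = multi_head_attention(Q, K, V, h=2)
print(out.shape)  # (3, 4)
```

The loop over column slices keeps the per-head computation explicit; a reshape to (h, n, d/h) followed by a batched matrix product would be an equivalent, more vectorized choice.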
