Matrix Multiplication in Transformers
The transformer architecture -- the engine behind GPT, Claude, Gemini, and virtually every modern large language model -- is, at its mathematical core, a carefully orchestrated sequence of matrix multiplications. Understanding this reveals how these models actually process language and generate text.
The Transformer at a Glance
Introduced in the 2017 paper "Attention Is All You Need," the transformer replaced earlier recurrent architectures with a design built entirely on matrix operations. The key components are:
- Embedding layer: converts tokens (words or sub-words) into vectors
- Self-attention mechanism: determines which tokens should attend to which other tokens
- Feed-forward layers: apply further transformations to each position
- Output layer: produces predictions for the next token
Every single one of these components relies on matrix multiplication.
Query, Key, and Value: Projecting the Input
The attention mechanism begins by taking an input matrix X (where each row is a token's embedding vector) and projecting it into three separate matrices:
Q = X x W_Q (Queries)
K = X x W_K (Keys)
V = X x W_V (Values)
Here, W_Q, W_K, and W_V are learned weight matrices. Each of these three lines is a matrix multiplication. If X has shape (sequence_length x d_model), and the weight matrices have shape (d_model x d_k), then Q, K, and V each have shape (sequence_length x d_k).
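As a concrete (and deliberately tiny) illustration, here is a minimal NumPy sketch of the three projections. The dimensions and random weights are placeholders standing in for a real model's learned parameters, not values from any actual model:

```python
# Minimal sketch of the Q/K/V projections using NumPy.
# Dimensions and random weights are illustrative placeholders only.
import numpy as np

seq_len, d_model, d_k = 4, 8, 8            # tiny illustrative sizes

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))    # one embedding vector per token (row)

W_Q = rng.normal(size=(d_model, d_k))      # learned in a real model; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q    # (seq_len, d_k)
K = X @ W_K    # (seq_len, d_k)
V = X @ W_V    # (seq_len, d_k)

print(Q.shape, K.shape, V.shape)           # (4, 8) (4, 8) (4, 8)
```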
Think of it this way:
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: "What information do I provide?"
Computing Attention Scores: Q Times K Transpose
The core of self-attention is computing how much each token should attend to every other token. This is done with a single matrix multiplication:
Scores = Q x K^T
K^T is the transpose of K (rows and columns swapped). If Q is (seq_len x d_k) and K^T is (d_k x seq_len), the result is a (seq_len x seq_len) matrix. Each element [i][j] represents how strongly token i should attend to token j.
This computes the dot product between every query vector and every key vector -- all pairs at once through a single matrix multiplication rather than one pair at a time.
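Continuing the NumPy sketch above, the entire score matrix comes from one call:

```python
# Every query-key dot product in a single matrix multiplication.
Scores = Q @ K.T          # (seq_len, seq_len)
print(Scores.shape)       # (4, 4); Scores[i, j] = how strongly token i attends to token j
```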
Scaling and Softmax
When d_k is large, the query-key dot products grow large in magnitude, pushing softmax into regions where gradients are vanishingly small and training becomes unstable. The solution is to scale by the square root of the key dimension:
Scaled_Scores = Scores / sqrt(d_k)
Then softmax is applied row-by-row, converting each row into a probability distribution that sums to 1. After softmax, each row tells us: for this token, what fraction of attention should go to each other token?
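In the running sketch, scaling and a row-wise softmax might look like this (the softmax helper is our own, written for clarity, not a library function):

```python
# Scale by sqrt(d_k), then apply a numerically stable softmax to each row.
Scaled_Scores = Scores / np.sqrt(d_k)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Weights = softmax(Scaled_Scores)              # each row is a probability distribution
print(Weights.sum(axis=1))                    # [1. 1. 1. 1.]
```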
Computing Attention Output: Scores Times V
The final attention output is another matrix multiplication:
Attention_Output = softmax(Scaled_Scores) x V
The softmax scores matrix has shape (seq_len x seq_len), and V has shape (seq_len x d_k). The result is (seq_len x d_k) -- a new representation for each token that incorporates information from all the tokens it attended to.
The complete attention formula in one line:
Attention(Q, K, V) = softmax(Q x K^T / sqrt(d_k)) x V
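Putting the pieces of the sketch together, the whole formula fits in a short function (reusing the softmax helper defined above):

```python
# The complete attention computation, matching the one-line formula above.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len)
    weights = softmax(scores)          # row-wise attention weights
    return weights @ V                 # (seq_len, d_k)

out = attention(Q, K, V)
print(out.shape)                       # (4, 8)
```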
Multi-Head Attention: Parallel Perspectives
A single attention computation captures one "perspective" on the relationships between tokens. Transformers use multi-head attention -- multiple attention heads running in parallel, each with its own W_Q, W_K, and W_V matrices. The heads' outputs are concatenated and combined through one final output projection, which is itself a matrix multiplication.
| Component | Matrix Multiplications per Head | Total for h Heads |
|---|---|---|
| Q projection | 1 | h |
| K projection | 1 | h |
| V projection | 1 | h |
| Q x K^T | 1 | h |
| Scores x V | 1 | h |
| Output projection | -- | 1 |
| Total | 5 | 5h + 1 |
With 12 attention heads (common in base models such as BERT-base and GPT-2 small), that is 61 matrix multiplications in a single attention layer. GPT-3 uses 96 heads per layer and stacks 96 such layers.
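Here is a rough sketch of multi-head attention, continuing the toy example (two heads purely for readability; real models use many more, and production code batches the heads into single tensor operations rather than a Python loop):

```python
# Illustrative multi-head attention: h independent heads, each with its own
# projection matrices, concatenated and passed through one output projection.
h = 2                                          # tiny head count for illustration
d_head = d_model // h

W_Qs = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_Ks = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_Vs = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O  = rng.normal(size=(d_model, d_model))     # output projection

heads = [attention(X @ wq, X @ wk, X @ wv)
         for wq, wk, wv in zip(W_Qs, W_Ks, W_Vs)]
multi_head_out = np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)
print(multi_head_out.shape)                             # (4, 8)
```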
Feed-Forward Layers: Even More Matrix Multiplications
After attention, each token passes through a feed-forward network consisting of two matrix multiplications with an activation function in between:
FFN(x) = ReLU(x x W_1 + b_1) x W_2 + b_2
These weight matrices are typically large. In GPT-3, W_1 has shape (12288 x 49152) and W_2 has shape (49152 x 12288) -- and the activation is GELU rather than ReLU, though the two-multiplication structure is the same. Each of these multiplications involves billions of individual multiply-add operations.
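Continuing the toy sketch, the feed-forward block is just two more matrix multiplications; the 4x expansion factor mirrors GPT-3's 12288 -> 49152, but the tiny dimensions and random weights are illustrative only:

```python
# Illustrative feed-forward network: two matrix multiplications with a ReLU in between.
d_ff = 4 * d_model                             # hidden width is typically ~4 * d_model

W_1, b_1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0, x @ W_1 + b_1)      # ReLU activation between the two multiplications
    return hidden @ W_2 + b_2

print(ffn(multi_head_out).shape)               # (4, 8)
```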
The Full Pipeline
From input to output, a single forward pass through a transformer is a pipeline of matrix multiplications: token embedding, then for each layer (repeated dozens of times -- 96 in GPT-3) multi-head attention and a feed-forward network, and finally an output projection that produces next-token predictions. For GPT-3, a single forward pass involves tens of thousands of matrix multiplications, with non-linearities such as softmax and the activation function sprinkled between them.
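As a back-of-the-envelope check on that count, here is a quick tally using the per-layer figures from the table above and GPT-3-like dimensions (96 layers, 96 heads). The exact number depends on implementation details, such as whether the Q/K/V projections are fused into a single multiplication:

```python
# Rough count of matrix multiplications in one GPT-3-scale forward pass.
# Assumes the unfused per-head counts from the table above; real implementations
# often fuse projections, lowering the count without changing the total work.
layers, heads = 96, 96
per_attention_layer = 5 * heads + 1        # Q/K/V projections, Q @ K^T, scores @ V, output projection
per_ffn = 2                                # two feed-forward multiplications
total = layers * (per_attention_layer + per_ffn) + 1   # +1 for the final output projection
print(total)                               # 46369 -- on the order of tens of thousands
```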
Summary
- Transformers project inputs into Query, Key, and Value matrices using matrix multiplication
- Attention scores are computed as Q times K transpose -- a matrix multiplication that measures token-to-token relevance
- Scores are scaled by sqrt(d_k) and passed through softmax to get attention weights
- The attention output is another matrix multiplication: softmax(scores) times V
- Multi-head attention runs multiple attention computations in parallel, multiplying the number of matrix operations
- Feed-forward layers add two more large matrix multiplications per transformer layer
- The entire transformer is essentially a pipeline of matrix multiplications with non-linearities between them
With this understanding of how matrix multiplication powers transformers, the natural question becomes: how do we actually compute all of these operations efficiently? That is the subject of our next lesson on computational efficiency and GPUs.

