Matrix Multiplication in Transformers
The transformer architecture -- the engine behind GPT, Claude, Gemini, and virtually every modern large language model -- is, at its mathematical core, a carefully orchestrated sequence of matrix multiplications. Understanding this reveals how these models actually process language and generate text.
The Transformer at a Glance
Introduced in the 2017 paper "Attention Is All You Need," the transformer replaced earlier recurrent architectures with a design built entirely on matrix operations. The key components are:
- Embedding layer: converts tokens (words or sub-words) into vectors
- Self-attention mechanism: determines which tokens should attend to which other tokens
- Feed-forward layers: apply further transformations to each position
- Output layer: produces predictions for the next token
Every single one of these components relies on matrix multiplication.
Query, Key, and Value: Projecting the Input
The attention mechanism begins by taking an input matrix X (where each row is a token's embedding vector) and projecting it into three separate matrices:
Q = X x W_Q (Queries)
K = X x W_K (Keys)
V = X x W_V (Values)
Here, W_Q, W_K, and W_V are learned weight matrices. Each of these three lines is a matrix multiplication. If X has shape (sequence_length x d_model), and the weight matrices have shape (d_model x d_k), then Q, K, and V each have shape (sequence_length x d_k).
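As a concrete (and deliberately tiny) illustration, here is a minimal NumPy sketch of the three projections. The dimensions and random weights are placeholders standing in for a real model's learned parameters, not values from any actual model:

```python
# Minimal sketch of the Q/K/V projections using NumPy.
# Dimensions and random weights are illustrative placeholders only.
import numpy as np

seq_len, d_model, d_k = 4, 8, 8            # tiny illustrative sizes

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))    # one embedding vector per token (row)

W_Q = rng.normal(size=(d_model, d_k))      # learned in a real model; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q    # (seq_len, d_k)
K = X @ W_K    # (seq_len, d_k)
V = X @ W_V    # (seq_len, d_k)

print(Q.shape, K.shape, V.shape)           # (4, 8) (4, 8) (4, 8)
```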
Think of it this way:
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: "What information do I provide?"
Computing Attention Scores: Q Times K Transpose
The core of self-attention is computing how much each token should attend to every other token. This is done with a single matrix multiplication:
Scores = Q x K^T
K^T is the transpose of K (rows and columns swapped). If Q is (seq_len x d_k) and K^T is (d_k x seq_len), the result is a (seq_len x seq_len) matrix. Each element [i][j] represents how strongly token i should attend to token j.
This computes the dot product between every query vector and every key vector -- all pairs at once through a single matrix multiplication rather than one pair at a time.
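Continuing the NumPy sketch above, the entire score matrix comes from one call:

```python
# Every query-key dot product in a single matrix multiplication.
Scores = Q @ K.T          # (seq_len, seq_len)
print(Scores.shape)       # (4, 4); Scores[i, j] = how strongly token i attends to token j
```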
Scaling and Softmax
When d_k is large, the query-key dot products grow large in magnitude, pushing softmax into regions where gradients are vanishingly small and training becomes unstable. The solution is to scale by the square root of the key dimension:
Scaled_Scores = Scores / sqrt(d_k)
Then softmax is applied row-by-row, converting each row into a probability distribution that sums to 1. After softmax, each row tells us: for this token, what fraction of attention should go to each other token?
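In the running sketch, scaling and a row-wise softmax might look like this (the softmax helper is our own, written for clarity, not a library function):

```python
# Scale by sqrt(d_k), then apply a numerically stable softmax to each row.
Scaled_Scores = Scores / np.sqrt(d_k)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Weights = softmax(Scaled_Scores)              # each row is a probability distribution
print(Weights.sum(axis=1))                    # [1. 1. 1. 1.]
```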
Computing Attention Output: Scores Times V
The final attention output is another matrix multiplication:
Attention_Output = softmax(Scaled_Scores) x V
The softmax scores matrix has shape (seq_len x seq_len), and V has shape (seq_len x d_k). The result is (seq_len x d_k) -- a new representation for each token that incorporates information from all the tokens it attended to.
The complete attention formula in one line:
Attention(Q, K, V) = softmax(Q x K^T / sqrt(d_k)) x V
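Putting the pieces of the sketch together, the whole formula fits in a short function (reusing the softmax helper defined above):

```python
# The complete attention computation, matching the one-line formula above.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len)
    weights = softmax(scores)          # row-wise attention weights
    return weights @ V                 # (seq_len, d_k)

out = attention(Q, K, V)
print(out.shape)                       # (4, 8)
```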
Multi-Head Attention: Parallel Perspectives
A single attention computation captures one "perspective" on the relationships between tokens. Transformers use multi-head attention -- multiple attention heads running in parallel, each with its own W_Q, W_K, and W_V matrices. The heads' outputs are concatenated and combined through one final output projection, which is itself a matrix multiplication.
| Component | Matrix Multiplications per Head | Total for h Heads |
|---|---|---|
| Q projection | 1 | h |
| K projection | 1 | h |
| V projection | 1 | h |
| Q x K^T | 1 | h |
| Scores x V | 1 | h |
| Output projection | -- | 1 |
| Total | 5 | 5h + 1 |
With 12 attention heads (common in base models such as BERT-base and GPT-2 small), that is 61 matrix multiplications in a single attention layer. GPT-3 uses 96 heads per layer and stacks 96 such layers.
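Here is a rough sketch of multi-head attention, continuing the toy example (two heads purely for readability; real models use many more, and production code batches the heads into single tensor operations rather than a Python loop):

```python
# Illustrative multi-head attention: h independent heads, each with its own
# projection matrices, concatenated and passed through one output projection.
h = 2                                          # tiny head count for illustration
d_head = d_model // h

W_Qs = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_Ks = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_Vs = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O  = rng.normal(size=(d_model, d_model))     # output projection

heads = [attention(X @ wq, X @ wk, X @ wv)
         for wq, wk, wv in zip(W_Qs, W_Ks, W_Vs)]
multi_head_out = np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)
print(multi_head_out.shape)                             # (4, 8)
```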
Feed-Forward Layers: Even More Matrix Multiplications
After attention, each token passes through a feed-forward network consisting of two matrix multiplications with an activation function in between:
FFN(x) = ReLU(x x W_1 + b_1) x W_2 + b_2
These weight matrices are typically large. In GPT-3, W_1 has shape (12288 x 49152) and W_2 has shape (49152 x 12288) -- and the activation is GELU rather than ReLU, though the two-multiplication structure is the same. Each of these multiplications involves billions of individual multiply-add operations.
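Continuing the toy sketch, the feed-forward block is just two more matrix multiplications; the 4x expansion factor mirrors GPT-3's 12288 -> 49152, but the tiny dimensions and random weights are illustrative only:

```python
# Illustrative feed-forward network: two matrix multiplications with a ReLU in between.
d_ff = 4 * d_model                             # hidden width is typically ~4 * d_model

W_1, b_1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0, x @ W_1 + b_1)      # ReLU activation between the two multiplications
    return hidden @ W_2 + b_2

print(ffn(multi_head_out).shape)               # (4, 8)
```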
The Full Pipeline
From input to output, a single forward pass through a transformer is a pipeline of matrix multiplications: token embedding, then for each layer (repeated dozens of times -- 96 in GPT-3) multi-head attention and a feed-forward network, and finally an output projection that produces next-token predictions. For GPT-3, a single forward pass involves tens of thousands of matrix multiplications, with non-linearities such as softmax and the activation function sprinkled between them.
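As a back-of-the-envelope check on that count, here is a quick tally using the per-layer figures from the table above and GPT-3-like dimensions (96 layers, 96 heads). The exact number depends on implementation details, such as whether the Q/K/V projections are fused into a single multiplication:

```python
# Rough count of matrix multiplications in one GPT-3-scale forward pass.
# Assumes the unfused per-head counts from the table above; real implementations
# often fuse projections, lowering the count without changing the total work.
layers, heads = 96, 96
per_attention_layer = 5 * heads + 1        # Q/K/V projections, Q @ K^T, scores @ V, output projection
per_ffn = 2                                # two feed-forward multiplications
total = layers * (per_attention_layer + per_ffn) + 1   # +1 for the final output projection
print(total)                               # 46369 -- on the order of tens of thousands
```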
Summary
- Transformers project inputs into Query, Key, and Value matrices using matrix multiplication
- Attention scores are computed as Q times K transpose -- a matrix multiplication that measures token-to-token relevance
- Scores are scaled by sqrt(d_k) and passed through softmax to get attention weights
- The attention output is another matrix multiplication: softmax(scores) times V
- Multi-head attention runs multiple attention computations in parallel, multiplying the number of matrix operations
- Feed-forward layers add two more large matrix multiplications per transformer layer
- The entire transformer is essentially a pipeline of matrix multiplications with non-linearities between them
With this understanding of how matrix multiplication powers transformers, the natural question becomes: how do we actually compute all of these operations efficiently? That is the subject of our next lesson on computational efficiency and GPUs.

