Math Inside Transformers and LLMs
Transformers are the architecture behind modern AI's most impressive achievements: ChatGPT, Claude, Gemini, and many more. They are neural networks, but with a specific design that makes them exceptionally good at processing sequences such as text, code, and audio. In this lesson, you will see exactly where math appears inside a transformer and how it enables large language models to generate human-like text.
From Neural Networks to Transformers
A standard neural network processes fixed-size inputs: a 784-pixel image or a 10-feature data point. But language is sequential and variable-length. A sentence might be 5 words or 500 words, and the meaning of each word depends on the words around it.
Transformers solve this with a mechanism called attention, which allows every part of the input to interact with every other part. This is the key innovation, and it is built entirely from the math you have been learning.
The Transformer Architecture
A transformer processes text in three stages:
[Tokens] → [Embeddings] → [Attention + Feed-Forward Layers] → [Output Probabilities]
 "The"      vector          many layers of                      P(next token)
 "cat"      vector          matrix operations
 "sat"      vector
Let us trace through each stage.
Stage 1: Token Embeddings (Linear Algebra)
Text is first split into tokens (roughly words or word pieces), then each token is converted into a vector called an embedding:
"The" → [0.12, -0.45, 0.78, ..., 0.33] (768 numbers)
"cat" → [0.21, -0.18, 0.56, ..., -0.12] (768 numbers)
"sat" → [0.08, 0.67, -0.23, ..., 0.41] (768 numbers)
These embeddings are stored in an embedding matrix with shape (vocabulary_size × embedding_dim). GPT-4's exact dimensions are not public, but at a plausible scale of (100,000 × 12,288) this single matrix would contain over 1.2 billion numbers.
Positional encodings are added to tell the model where each token is in the sequence. Without these, the model would not know that "The cat sat" is different from "sat The cat."
Token embedding + Position encoding = Input to transformer
This is all linear algebra: vector addition and matrix lookups.
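The lookup-and-add step can be sketched in a few lines of NumPy. The sizes here are toy values, the embedding matrix is a random stand-in for learned weights, and the positional encoding shown is the sinusoidal variant (one common choice; many models instead learn the positions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 10, 8, 3   # toy sizes

# Learned in a real model; random here
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([4, 7, 2])                  # e.g. "The", "cat", "sat"
token_embeddings = embedding_matrix[token_ids]   # row lookup: shape (3, 8)

# Sinusoidal positional encodings: sin on even dims, cos on odd dims
pos = np.arange(seq_len)[:, None]            # (3, 1)
i = np.arange(embed_dim // 2)[None, :]       # (1, 4)
angles = pos / (10000 ** (2 * i / embed_dim))
pos_enc = np.zeros((seq_len, embed_dim))
pos_enc[:, 0::2] = np.sin(angles)
pos_enc[:, 1::2] = np.cos(angles)

x = token_embeddings + pos_enc   # input to the first transformer layer
print(x.shape)                   # (3, 8)
```

Note that the token at position 0 gets pos_enc row [0, 1, 0, 1, ...] (sin 0 = 0, cos 0 = 1), so identical tokens at different positions end up with different input vectors.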
Stage 2: Self-Attention (Linear Algebra)
Self-attention is the operation that makes transformers powerful. It allows each token to look at every other token and decide which ones are most relevant to it.
The Q, K, V Framework
For each token, the model creates three vectors by multiplying the token embedding by three different weight matrices:
Query (Q) = embedding × W_Q "What am I looking for?"
Key (K) = embedding × W_K "What do I contain?"
Value (V) = embedding × W_V "What information do I carry?"
Each of these is a matrix multiplication — pure linear algebra.
Computing Attention Scores
The attention score between two tokens is the dot product of one token's query with another token's key:
Score("sat", "cat") = Q_sat · K_cat
High score → "sat" should pay a lot of attention to "cat"
Low score → "sat" should mostly ignore this token
For a sequence of n tokens, this produces an n×n matrix of attention scores — every token compared with every other token. In practice each score is also divided by √d_k, the square root of the key dimension, which keeps the values in a stable range; this is why the operation is called scaled dot-product attention.
Softmax: Turning Scores into Weights (Probability)
The raw scores are converted into a probability distribution using softmax:
Raw scores for "sat": [1.2, 3.5, 0.8, 0.1, 2.4]
                       The  cat  sat  on   mat
After softmax: [0.07, 0.65, 0.04, 0.02, 0.22]
Now "sat" pays 65% of its attention to "cat" and 22% to "mat." These are probabilities — they sum to 1.0 and are all non-negative.
Weighted Combination
Each token's new representation is a weighted sum of all the value vectors, where the weights come from the attention probabilities:
new_sat = 0.07 × V_The + 0.65 × V_cat + 0.04 × V_sat + 0.02 × V_on + 0.22 × V_mat
This is a linear combination (linear algebra) using weights from a probability distribution (probability). The result is a new vector for "sat" that now contains information from the tokens it attended to most.
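The whole single-head attention pipeline — Q/K/V projections, scaled dot products, softmax, weighted sum — fits in a short NumPy sketch. Dimensions are toy values and the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 5, 8, 4   # toy sizes

x = rng.normal(size=(seq_len, d_model))   # token representations

# Learned projection matrices; random here
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V   # (5, 4) each

# Every token's query dotted with every token's key, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)       # (5, 5)

# Softmax over each row turns scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V   # each output row is a weighted sum of value vectors
print(weights.sum(axis=-1))   # every row sums to 1.0
```

The subtraction of the row maximum before exponentiating does not change the softmax result; it is the standard trick to avoid numerical overflow.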
Multi-Head Attention
Transformers do not run attention just once. They run it multiple times in parallel with different weight matrices (called heads). Each head learns to attend to different types of relationships:
Head 1: Might learn syntactic relationships (subject-verb agreement)
Head 2: Might learn semantic relationships (word meanings)
Head 3: Might learn positional relationships (nearby words)
...
Head 12: Might learn long-range dependencies
The outputs of all heads are concatenated and multiplied by another weight matrix:
MultiHead = Concat(head_1, head_2, ..., head_h) × W_O
More matrix multiplications, more linear algebra.
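A minimal multi-head sketch, assuming toy sizes and random stand-ins for the learned weights: each head runs the same attention computation with its own projections, and the results are concatenated and projected through W_O:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads   # each head works in a smaller subspace

x = rng.normal(size=(seq_len, d_model))

def attention_head(x, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(W_K.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

heads = []
for h in range(n_heads):   # each head has its own learned weights (random here)
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention_head(x, W_Q, W_K, W_V))

W_O = rng.normal(size=(d_model, d_model))
multi_head = np.concatenate(heads, axis=-1) @ W_O   # back to (4, 8)
```

Because d_k = d_model / n_heads, the concatenated heads have exactly d_model columns, so the output shape matches the input shape and layers can be stacked.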
Stage 3: Feed-Forward Networks (Linear Algebra + Calculus)
After attention, each token's vector passes through a feed-forward network — two matrix multiplications with an activation function in between:
FFN(x) = ReLU(x × W₁ + b₁) × W₂ + b₂
This is exactly the same operation you saw in standard neural networks. In a large language model, the weight matrices W₁ and W₂ are enormous. For a model with 768-dimensional embeddings, W₁ might be (768 × 3072) and W₂ might be (3072 × 768).
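The feed-forward step is a direct transcription of the formula above. The sizes below are toy stand-ins for the 768 × 3072 matrices mentioned in the text, and the weights are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_hidden = 3, 8, 32   # real models use e.g. 768 and 3072

x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def relu(z):
    # Zero out negatives; the only nonlinearity in this sub-block
    return np.maximum(z, 0)

hidden = relu(x @ W1 + b1)        # expand to (3, 32)
ffn_out = hidden @ W2 + b2        # project back to (3, 8)
```

The same shape comes out as went in, which is what lets transformers stack attention and feed-forward blocks dozens of times.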
Stage 4: Output Probabilities (Probability)
After passing through many layers of attention and feed-forward networks, the final token's vector is multiplied by the embedding matrix (transposed) to produce a score for every token in the vocabulary:
logits = final_vector × Embedding_matrix^T
Result: [score_for_"the", score_for_"cat", score_for_"is", ..., score_for_"Paris"]
50,000+ scores, one per vocabulary token
Softmax converts these scores into a probability distribution. After a prompt like "The capital of France is", it might look like this:
P("Paris") = 0.42
P("a") = 0.08
P("beautiful") = 0.06
P("the") = 0.05
...
The model then samples from this distribution to choose the next token. Sampling involves probability theory: should you always pick the most likely token, or should you sometimes pick less likely ones for variety?
The temperature parameter controls this: the logits are divided by the temperature before softmax. A low temperature sharpens the distribution, making the model more deterministic (almost always picking "Paris"). A high temperature flattens it, making the output more varied (sometimes picking unexpected words).
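Temperature sampling can be sketched in a few lines. The logits below are made-up illustrative scores, and `sample_next_token` is a hypothetical helper name, not a real library function:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Divide logits by temperature, softmax, then sample an index."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

# Made-up scores for "Paris", "a", "beautiful", "the"
logits = [2.5, 0.9, 0.6, 0.4]

_, p_low = sample_next_token(logits, temperature=0.1)
_, p_high = sample_next_token(logits, temperature=10.0)
print(p_low.round(3))    # nearly all probability mass on "Paris"
print(p_high.round(3))   # close to a uniform distribution
```

At temperature 0.1 the distribution collapses onto the top token; at temperature 10 the same logits give nearly equal probabilities, so any of the four tokens might be sampled.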
Training a Transformer (Calculus)
Training a transformer follows the same pattern as training any neural network, but at a much larger scale:
- Forward pass: Process a batch of text sequences through all layers
- Compute loss: Cross-entropy loss comparing predicted probabilities with actual next tokens
- Backward pass: Backpropagation computes gradients for every weight in every layer
- Update weights: Gradient descent adjusts all parameters
The difference is scale. A model like GPT-4 has hundreds of billions of parameters, and each training step requires computing gradients for all of them. This requires thousands of GPUs working in parallel, which in turn requires careful mathematical coordination of the distributed computation.
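The loss in step 2 can be made concrete: cross-entropy is the negative log of the probability the model assigned to the token that actually came next. A sketch with a made-up four-token vocabulary:

```python
import numpy as np

def cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed stably in log space."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_index]

logits = np.array([3.1, 0.2, -1.0, 1.5])   # made-up scores, 4-token vocab

loss_correct = cross_entropy(logits, 0)   # model favored the actual next token
loss_wrong = cross_entropy(logits, 2)     # actual next token got low probability
print(loss_correct < loss_wrong)          # True
```

Backpropagation then pushes every one of the model's parameters in the direction that lowers this number, averaged over huge batches of text.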
The Math by the Numbers
Here is a summary of the key mathematical operations in a transformer, using a model similar to GPT-3 as an example:
| Operation | Math Branch | Scale |
|---|---|---|
| Embedding lookup | Linear algebra | ~50K × 12,288 matrix |
| Q, K, V projections | Linear algebra | 3 matrix multiplications per layer × 96 layers |
| Attention scores | Linear algebra (dot product) | n² scores per head × 96 heads per layer |
| Softmax | Probability | n² values per head × 96 heads per layer |
| Weighted sum | Linear algebra | n vector combinations per head |
| Feed-forward | Linear algebra | 2 large matrix multiplications per layer |
| Output logits | Linear algebra | 12,288 × ~50K matrix multiplication |
| Loss computation | Probability | Cross-entropy over ~50K vocabulary |
| Backpropagation | Calculus | Gradients for 175 billion parameters |
Every single one of these operations comes from the three mathematical pillars you have been learning about.
Summary
Transformers and LLMs are built entirely from the mathematics covered in this course:
- Linear algebra powers embeddings, attention (Q, K, V projections and dot products), feed-forward layers, and output computation
- Probability appears in softmax attention weights, output token probabilities, temperature-based sampling, and the cross-entropy loss
- Calculus enables training through backpropagation and gradient descent across all layers
Understanding this math does not just help you use these models — it helps you understand why they work, when they fail, and how they can be improved. The mathematics is not decorative. It is the actual mechanism by which transformers process language and generate text.

