Math Inside Neural Networks
You have now seen the three mathematical pillars individually. In this lesson, you will see how they come together inside a neural network, the fundamental building block of modern AI. By tracing data through a network from input to output and back again during training, you will see every mathematical concept in action.
What Is a Neural Network?
A neural network is a mathematical function that takes an input (like an image or a sentence) and produces an output (like a classification or a prediction). It is built from simple, repeating blocks called layers, and each layer performs a sequence of mathematical operations.
Input → [Layer 1] → [Layer 2] → [Layer 3] → Output
          weights     weights     weights
          + bias      + bias      + bias
          + ReLU      + ReLU      + softmax
Let us trace through each step.
Step 1: Input as a Vector (Linear Algebra)
Every input to a neural network is a vector. If you are classifying handwritten digits (the classic MNIST dataset), each image is 28×28 pixels:
28 × 28 = 784 pixels
Flatten into a vector:
input = [0.0, 0.0, 0.12, 0.89, 0.94, ..., 0.0] (784 numbers)
Each number is a pixel intensity between 0 (black) and 1 (white). The image has been converted from a visual grid into a mathematical vector that the network can process.
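The flattening step can be sketched in NumPy (the image here is a synthetic stand-in, not a real MNIST digit):

```python
import numpy as np

# A synthetic 28x28 "image" with intensities in [0, 1] (stand-in for an MNIST digit)
image = np.zeros((28, 28))
image[5:23, 13:16] = 0.9  # a bright vertical stroke, roughly like a "1"

# Flatten the 2D pixel grid into a 784-dimensional input vector
x = image.reshape(-1)
print(x.shape)  # (784,)
```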
Step 2: Layer Computation (Linear Algebra)
Each layer performs two operations:
Matrix multiplication – multiply the input vector by the weight matrix:
z = W·x + b
Where:
x = input vector (784 numbers)
W = weight matrix (128 × 784 numbers)
b = bias vector (128 numbers)
z = pre-activation (128 numbers)
This takes 784 input features and produces 128 new features. The weight matrix determines which input features are important and how they should be combined.
Activation function – apply a non-linear function element by element:
a = ReLU(z)
ReLU(z) = max(0, z)
Before: [-0.5, 1.2, -0.1, 0.8, -2.0, 0.3]
After: [ 0.0, 1.2, 0.0, 0.8, 0.0, 0.3]
This combination of linear transformation (matrix multiply) and non-linear activation is what gives neural networks their power. Without the activation, a stack of layers would collapse into a single linear transformation, no more expressive than one layer alone.
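One layer's computation can be sketched in NumPy, using the shapes from the example above (the random weights are illustrative, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)                          # input vector (784 pixel intensities)
W = rng.standard_normal((128, 784)) * 0.01   # weight matrix (128 x 784)
b = np.zeros(128)                            # bias vector (128 values)

z = W @ x + b              # linear step: matrix-vector product plus bias
a = np.maximum(0.0, z)     # ReLU: replace every negative entry with 0

print(z.shape, a.shape)  # (128,) (128,)
```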
Step 3: Stacking Layers (Linear Algebra)
A typical network has multiple layers, each transforming the data further:
Layer 1: 784 → 128 features (W₁ is 128×784, plus b₁)
Layer 2: 128 → 64 features (W₂ is 64×128, plus b₂)
Layer 3: 64 → 10 features (W₃ is 10×64, plus b₃)
The final layer produces 10 numbers, one for each digit (0-9). Each layer extracts increasingly abstract features:
- Layer 1 might detect edges and strokes
- Layer 2 might detect curves and corners
- Layer 3 combines these into digit identities
The total number of parameters in this small network:
Layer 1: 128 × 784 + 128 = 100,480
Layer 2: 64 × 128 + 64 = 8,256
Layer 3: 10 × 64 + 10 = 650
Total: 109,386 parameters
Every single one of these 109,386 numbers was learned through training.
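The parameter count above can be verified with a few lines of Python:

```python
# Layer widths for the network described above: input, two hidden layers, output
layer_sizes = [784, 128, 64, 10]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_out * n_in + n_out  # one weight per connection, plus one bias per output
    total += params
    print(f"{n_in} -> {n_out}: {params} parameters")

print("Total:", total)  # Total: 109386
```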
Step 4: Output as Probabilities (Probability)
The final layer produces raw scores (logits). The softmax function converts these into probabilities:
Logits: [1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9]
Digit:    0    1     2    3    4    5     6    7    8    9
After softmax: [0.01, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.96, 0.00, 0.01]
The network predicts digit "7" with 96% confidence. The softmax function ensures all probabilities sum to 1.0 and every value is between 0 and 1.
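A minimal softmax implementation, applied to the logits above (the probabilities in the example are rounded, so the exact computed values will differ slightly):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max before exponentiating avoids overflow
    # and does not change the result
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9])
probs = softmax(logits)
print(probs.argmax())  # 7: the highest logit wins
print(probs.sum())     # sums to 1 (up to floating-point error)
```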
Step 5: Measuring Error (Probability + Calculus)
If the true label is "7", we compute the cross-entropy loss:
Loss = -log(P(correct class))
= -log(0.96)
= 0.04
This is a small loss because the model was very confident and correct. If the model had assigned only 10% probability to the correct class:
Loss = -log(0.10)
= 2.30
A much larger loss. The loss function translates probability into a single number that measures how wrong the model is.
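The loss arithmetic above, using the natural logarithm:

```python
import math

def cross_entropy(prob_correct):
    # Negative log of the probability assigned to the true class
    return -math.log(prob_correct)

print(round(cross_entropy(0.96), 2))  # 0.04 -- confident and correct: small loss
print(round(cross_entropy(0.10), 2))  # 2.3  -- unsure: much larger loss
```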
Step 6: Computing Gradients (Calculus)
Now comes the learning. We need to compute the gradient: how does the loss change when each of the 109,386 parameters changes?
Backpropagation uses the chain rule to compute these gradients efficiently, working backward from the loss through each layer:
Step 1: How does the loss depend on the softmax output?
∂L/∂softmax ← from the cross-entropy formula
Step 2: How does the softmax output depend on Layer 3's output?
∂softmax/∂z₃ ← from the softmax formula
Step 3: How does Layer 3's output depend on Layer 3's weights?
∂z₃/∂W₃ ← this is just the input to Layer 3
Step 4: How does Layer 3's output depend on Layer 2's output?
∂z₃/∂a₂ ← this is just W₃
... continue backward through all layers ...
The chain rule multiplies all these partial derivatives together:
∂L/∂W₁ = ∂L/∂softmax × ∂softmax/∂z₃ × ∂z₃/∂a₂ × ∂a₂/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂W₁
This chain of multiplications is why the algorithm is called "backpropagation" – the gradient signal propagates backward through the network.
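To make the chain rule concrete, here is backpropagation by hand for a toy one-weight network (squared error stands in for cross-entropy to keep the arithmetic short; all values are illustrative):

```python
# Tiny one-unit "network": z = w*x + b, a = ReLU(z), L = (a - y)^2
x, y = 2.0, 1.0    # input and target
w, b = 0.5, 0.1    # the two parameters

# forward pass
z = w * x + b          # pre-activation: 1.1
a = max(0.0, z)        # ReLU output: 1.1
L = (a - y) ** 2       # loss

# backward pass: one chain-rule link per step
dL_da = 2 * (a - y)              # dL/da, from the squared-error formula
da_dz = 1.0 if z > 0 else 0.0    # da/dz, the ReLU derivative
dz_dw = x                        # dz/dw is just the input

dL_dw = dL_da * da_dz * dz_dw    # multiply the links together
print(dL_dw)  # ≈ 0.4
```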
Step 7: Updating Weights (Calculus)
With all gradients computed, every weight is updated using gradient descent:
W₁ = W₁ - learning_rate × ∂L/∂W₁
W₂ = W₂ - learning_rate × ∂L/∂W₂
W₃ = W₃ - learning_rate × ∂L/∂W₃
b₁ = b₁ - learning_rate × ∂L/∂b₁
b₂ = b₂ - learning_rate × ∂L/∂b₂
b₃ = b₃ - learning_rate × ∂L/∂b₃
Each weight moves a tiny amount in the direction that reduces the loss. After seeing thousands of training examples, the weights converge to values that make accurate predictions.
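One gradient-descent update, sketched for Layer 3's weight matrix (the gradient here is a random stand-in for the real ∂L/∂W₃ that backpropagation would produce):

```python
import numpy as np

rng = np.random.default_rng(0)
W3 = rng.standard_normal((10, 64)) * 0.01   # Layer 3 weights (10 x 64)
grad_W3 = rng.standard_normal((10, 64))     # stand-in for dL/dW3 from backprop
learning_rate = 0.01

W3_new = W3 - learning_rate * grad_W3       # small step in the downhill direction
print(W3_new.shape)  # (10, 64)
```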
Step 8: Repeat (All Three Pillars)
The training loop repeats steps 1-7 for every example in the training dataset, typically multiple times (epochs):
For each epoch:
For each training example:
Forward pass (linear algebra)
Compute loss (probability)
Backward pass (calculus)
Update weights (calculus)
Evaluate on test set (statistics)
A typical training run might involve millions of forward and backward passes.
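The whole loop can be seen end to end in a deliberately tiny example: softmax regression (a one-layer network) trained by stochastic gradient descent on synthetic 2D data. The dataset, learning rate, and epoch count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 2D points, class 1 if the first coordinate is positive, else class 0
X = rng.standard_normal((200, 2))
y = (X[:, 0] > 0).astype(int)

W = np.zeros((2, 2))   # weights: 2 classes x 2 features
b = np.zeros(2)        # biases: one per class
lr = 0.1               # learning rate

for epoch in range(50):
    for x_i, y_i in zip(X, y):
        # forward pass (linear algebra)
        z = W @ x_i + b
        exps = np.exp(z - z.max())
        probs = exps / exps.sum()          # softmax (probability)
        # backward pass (calculus): for softmax + cross-entropy,
        # the gradient of the loss w.r.t. the logits is probs - one_hot(y)
        dz = probs.copy()
        dz[y_i] -= 1.0
        # update weights (gradient descent)
        W -= lr * np.outer(dz, x_i)
        b -= lr * dz

# evaluate on the training set (a real run would use a held-out test set)
preds = (X @ W.T + b).argmax(axis=1)
acc = (preds == y).mean()
print(acc)
```

Even this toy run touches all three pillars: the matrix products are linear algebra, the softmax and cross-entropy gradient are probability, and the update rule is calculus.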
The Math in Numbers
For a real-world model like GPT-3:
| Aspect | Quantity | Mathematical Branch |
|---|---|---|
| Parameters | 175 billion | Linear algebra (weight matrices) |
| Forward pass operations | ~300 billion multiplications per token | Linear algebra |
| Gradient computations | 175 billion per training step | Calculus |
| Output vocabulary | 50,257 token probabilities | Probability |
| Training examples | ~300 billion tokens | Statistics |
Every one of these numbers involves the mathematical concepts from this course.
Summary
A neural network brings all three mathematical pillars together:
- Linear algebra represents inputs as vectors, stores knowledge in weight matrices, and performs the core transformations at each layer
- Probability converts raw outputs into meaningful probabilities and defines the loss function that measures prediction quality
- Calculus computes gradients through backpropagation and updates weights through gradient descent
Understanding the math inside a neural network means understanding each of these steps. In the next lesson, you will see how these same concepts scale up to transformers and large language models.

