Math Inside Neural Networks
You have now seen the three mathematical pillars individually. In this lesson, you will see how they come together inside a neural network, the fundamental building block of modern AI. By tracing data through a network from input to output and back again during training, you will see every mathematical concept in action.
What Is a Neural Network?
A neural network is a mathematical function that takes an input (like an image or a sentence) and produces an output (like a classification or a prediction). It is built from simple, repeating blocks called layers, and each layer performs a sequence of mathematical operations.
```
Input → [Layer 1] → [Layer 2] → [Layer 3] → Output
         weights     weights     weights
         + bias      + bias      + bias
         + ReLU      + ReLU      + softmax
```
Let us trace through each step.
Step 1: Input as a Vector (Linear Algebra)
Every input to a neural network is a vector. If you are classifying handwritten digits (the classic MNIST dataset), each image is 28×28 pixels:
```
28 × 28 = 784 pixels
```

Flatten into a vector:

```
input = [0.0, 0.0, 0.12, 0.89, 0.94, ..., 0.0]   (784 numbers)
```
Each number is a pixel intensity between 0 (black) and 1 (white). The image has been converted from a visual grid into a mathematical vector that the network can process.
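In NumPy, the flattening step is a single call. The image below is random noise standing in for a real MNIST digit, just to show the shapes involved:

```python
import numpy as np

# A toy stand-in for a 28×28 grayscale image with values in [0, 1];
# a real MNIST image would come from a dataset loader.
image = np.random.rand(28, 28)

# Flatten the 2-D pixel grid into a 784-dimensional input vector.
x = image.flatten()

print(x.shape)  # (784,)
```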
Step 2: Layer Computation (Linear Algebra)
Each layer performs two operations:
1. Matrix multiplication: multiply the input vector by the weight matrix and add the bias vector:

```
z = W · x + b
```

Where:

```
x = input vector      (784 numbers)
W = weight matrix     (128 × 784 numbers)
b = bias vector       (128 numbers)
z = pre-activation    (128 numbers)
```
This takes 784 input features and produces 128 new features. The weight matrix determines which input features are important and how they should be combined.
2. Activation function: apply a non-linear function element by element:

```
a = ReLU(z)

ReLU(z) = max(0, z)

Before: [-0.5, 1.2, -0.1, 0.8, -2.0, 0.3]
After:  [ 0.0, 1.2,  0.0, 0.8,  0.0, 0.3]
```
This combination of a linear transformation (matrix multiply) and a non-linear activation is what gives neural networks their power. Without the activation, stacking layers would collapse into a single linear transformation: no matter how many layers you added, the network could compute nothing more than a single layer can.
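A single layer can be sketched in a few lines of NumPy. The weights here are randomly initialized stand-ins for learned values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized parameters stand in for learned values.
W = rng.normal(0, 0.01, size=(128, 784))   # weight matrix
b = np.zeros(128)                          # bias vector
x = rng.random(784)                        # input vector (e.g. a flattened image)

# Linear transformation: 784 features in, 128 features out.
z = W @ x + b

# ReLU activation, applied element by element.
a = np.maximum(0, z)

print(z.shape, a.shape)  # (128,) (128,)
```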
Step 3: Stacking Layers (Linear Algebra)
A typical network has multiple layers, each transforming the data further:
```
Layer 1: 784 → 128 features   (W₁ is 128 × 784, plus b₁)
Layer 2: 128 →  64 features   (W₂ is  64 × 128, plus b₂)
Layer 3:  64 →  10 features   (W₃ is  10 × 64,  plus b₃)
```
The final layer produces 10 numbers, one for each digit (0-9). Each layer extracts increasingly abstract features:
- Layer 1 might detect edges and strokes
- Layer 2 might detect curves and corners
- Layer 3 combines these into digit identities
The total number of parameters in this small network:
```
Layer 1: 128 × 784 + 128 = 100,480
Layer 2:  64 × 128 +  64 =   8,256
Layer 3:  10 ×  64 +  10 =     650
Total:                     109,386 parameters
```
Every single one of these 109,386 numbers was learned through training.
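That arithmetic is easy to verify in code; this short sketch recomputes the count from the layer sizes:

```python
# Parameter count for the 784 → 128 → 64 → 10 network described above:
# each layer contributes out_features × in_features weights plus out_features biases.
layer_sizes = [784, 128, 64, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_out * n_in + n_out
    print(f"{n_in:>4} -> {n_out:<3}: {params:>7,} parameters")
    total += params

print(f"Total: {total:,} parameters")  # Total: 109,386 parameters
```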
Step 4: Output as Probabilities (Probability)
The final layer produces raw scores (logits). The softmax function converts these into probabilities:
```
Logits:        [ 1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9]
Digit:            0    1     2    3    4    5     6    7    8    9

After softmax: [0.01, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.96, 0.00, 0.01]
```
The network predicts digit "7" with 96% confidence. The softmax function ensures all probabilities sum to 1.0 and every value is between 0 and 1.
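A minimal softmax can be sketched in NumPy. Applied to the logits above, it produces probabilities close to, though not exactly, the rounded values shown; digit 7 still dominates:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9])
probs = softmax(logits)

print(probs.argmax())  # 7 — the digit with the largest logit wins
print(probs.sum())     # 1.0 — a valid probability distribution
```

Subtracting the maximum logit before exponentiating is a standard trick to avoid overflow when logits are large.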
Step 5: Measuring Error (Probability + Calculus)
If the true label is "7", we compute the cross-entropy loss:
```
Loss = -log(P(correct class))
     = -log(0.96)
     = 0.04
```
This is a small loss because the model was very confident and correct. If the model had assigned only 10% probability to the correct class:
```
Loss = -log(0.10)
     = 2.30
```
A much larger loss. The loss function translates probability into a single number that measures how wrong the model is.
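The same calculation takes a few lines of NumPy, using the probabilities from the example above:

```python
import numpy as np

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(probs[true_class])

# Confident and correct: small loss.
confident = np.array([0.01, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.96, 0.0, 0.01])
loss_good = cross_entropy(confident, true_class=7)
print(round(loss_good, 2))   # 0.04

# Only 10% probability on the correct class: much larger loss.
loss_bad = -np.log(0.10)
print(round(loss_bad, 2))    # 2.3
```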
Step 6: Computing Gradients (Calculus)
Now comes the learning. We need to compute the gradient: how does the loss change when each of the 109,386 parameters changes?
Backpropagation uses the chain rule to compute these gradients efficiently, working backward from the loss through each layer:
```
Step 1: How does the loss depend on the softmax output?
        ∂L/∂softmax   → from the cross-entropy formula

Step 2: How does the softmax output depend on Layer 3's output?
        ∂softmax/∂z₃  → from the softmax formula

Step 3: How does Layer 3's output depend on Layer 3's weights?
        ∂z₃/∂W₃       → this is just the input to Layer 3

Step 4: How does Layer 3's output depend on Layer 2's output?
        ∂z₃/∂a₂       → this is just W₃

... continue backward through all layers ...
```
The chain rule multiplies all these partial derivatives together:
```
∂L/∂W₁ = ∂L/∂softmax × ∂softmax/∂z₃ × ∂z₃/∂a₂ × ∂a₂/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂W₁
```
This chain of multiplications is why the algorithm is called "backpropagation" — the gradient signal propagates backward through the network.
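Backpropagation can be sketched on a shrunken version of this network; the sizes here are made up so the arrays stay small enough to inspect. The final lines check one analytic gradient against a numerical estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny network: 4 -> 3 (ReLU) -> 2 (softmax), the same structure as in
# the text but shrunk for readability.
x = rng.random(4)
W1, b1 = rng.normal(0, 0.5, (3, 4)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (2, 3)), np.zeros(2)
y = 1  # true class index

# Forward pass.
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)           # ReLU
z2 = W2 @ a1 + b2
exps = np.exp(z2 - z2.max())
p = exps / exps.sum()            # softmax
loss = -np.log(p[y])             # cross-entropy

# Backward pass (chain rule, layer by layer).
dz2 = p.copy(); dz2[y] -= 1      # ∂L/∂z2: softmax + cross-entropy combine neatly
dW2 = np.outer(dz2, a1)          # ∂L/∂W2 = ∂L/∂z2 · ∂z2/∂W2
da1 = W2.T @ dz2                 # gradient flows backward through W2
dz1 = da1 * (z1 > 0)             # ReLU passes gradient only where z1 > 0
dW1 = np.outer(dz1, x)           # ∂L/∂W1, reached by the full chain

# Numerical check: nudge W1[0, 0] slightly and compare the loss change.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a1p = np.maximum(0, W1p @ x + b1)
z2p = W2 @ a1p + b2
pp = np.exp(z2p - z2p.max()); pp /= pp.sum()
numeric = (-np.log(pp[y]) - loss) / eps
assert abs(numeric - dW1[0, 0]) < 1e-4
```

The numerical check at the end is a common sanity test: if the chain rule was applied correctly, the analytic gradient and the finite-difference estimate agree.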
Step 7: Updating Weights (Calculus)
With all gradients computed, every weight is updated using gradient descent:
```
W₁ = W₁ - learning_rate × ∂L/∂W₁
W₂ = W₂ - learning_rate × ∂L/∂W₂
W₃ = W₃ - learning_rate × ∂L/∂W₃
b₁ = b₁ - learning_rate × ∂L/∂b₁
b₂ = b₂ - learning_rate × ∂L/∂b₂
b₃ = b₃ - learning_rate × ∂L/∂b₃
```
Each weight moves a tiny amount in the direction that reduces the loss. After seeing thousands of training examples, the weights converge to values that make accurate predictions.
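The update rule is easiest to see on a one-dimensional toy loss, where gradient descent visibly walks toward the minimum:

```python
# Gradient descent on a simple 1-D loss, loss(w) = (w - 3)^2,
# whose minimum is at w = 3. The same rule drives every weight in a network.
learning_rate = 0.1
w = 0.0

for step in range(100):
    grad = 2 * (w - 3)               # derivative of (w - 3)^2
    w = w - learning_rate * grad     # step downhill

print(round(w, 4))  # converges to 3.0
```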
Step 8: Repeat (All Three Pillars)
The training loop repeats steps 1-7 for every example in the training dataset, typically multiple times (epochs):
```
For each epoch:
    For each training example:
        Forward pass          (linear algebra)
        Compute loss          (probability)
        Backward pass         (calculus)
        Update weights        (calculus)
    Evaluate on test set      (statistics)
```
A typical training run might involve millions of forward and backward passes.
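The whole loop can be sketched end to end on a toy two-class problem; the data, layer sizes, and learning rate here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: 2 input features, 2 classes, separable by the sign of
# the first feature. A stand-in for a real dataset like MNIST.
X = rng.normal(size=(200, 2))
labels = (X[:, 0] > 0).astype(int)

# Tiny network: 2 -> 8 (ReLU) -> 2 (softmax).
W1, b1 = rng.normal(0, 0.5, (8, 2)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (2, 8)), np.zeros(2)
lr = 0.1

for epoch in range(30):
    for x, y in zip(X, labels):
        # Forward pass (linear algebra).
        z1 = W1 @ x + b1
        a1 = np.maximum(0, z1)
        z2 = W2 @ a1 + b2
        p = np.exp(z2 - z2.max()); p /= p.sum()
        # Backward pass (calculus, via the chain rule).
        dz2 = p.copy(); dz2[y] -= 1
        dW2, db2 = np.outer(dz2, a1), dz2
        dz1 = (W2.T @ dz2) * (z1 > 0)
        dW1, db1 = np.outer(dz1, x), dz1
        # Update weights (gradient descent).
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

# Evaluate (statistics): accuracy over the data.
hidden = np.maximum(0, X @ W1.T + b1)
preds = (hidden @ W2.T + b2).argmax(axis=1)
accuracy = (preds == labels).mean()
print(f"accuracy: {accuracy:.2f}")
```

Even this tiny network, trained with nothing but the forward pass, cross-entropy loss, backpropagation, and gradient descent described above, learns to classify the data well.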
The Math in Numbers
For a real-world model like GPT-3:
| Aspect | Quantity | Mathematical Branch |
|---|---|---|
| Parameters | 175 billion | Linear algebra (weight matrices) |
| Forward pass operations | ~300 billion multiplications per token | Linear algebra |
| Gradient computations | 175 billion per training step | Calculus |
| Output vocabulary | 50,257 token probabilities | Probability |
| Training examples | ~300 billion tokens | Statistics |
Every one of these numbers involves the mathematical concepts from this course.
Summary
A neural network brings all three mathematical pillars together:
- Linear algebra represents inputs as vectors, stores knowledge in weight matrices, and performs the core transformations at each layer
- Probability converts raw outputs into meaningful probabilities and defines the loss function that measures prediction quality
- Calculus computes gradients through backpropagation and updates weights through gradient descent
Understanding the math inside a neural network means understanding each of these steps. In the next lesson, you will see how these same concepts scale up to transformers and large language models.

