Math Inside Neural Networks
You have now seen the three mathematical pillars individually. In this lesson, you will see how they come together inside a neural network, the fundamental building block of modern AI. By tracing data through a network from input to output and back again during training, you will see every mathematical concept in action.
What Is a Neural Network?
A neural network is a mathematical function that takes an input (like an image or a sentence) and produces an output (like a classification or a prediction). It is built from simple, repeating blocks called layers, and each layer performs a sequence of mathematical operations.
Input → [Layer 1] → [Layer 2] → [Layer 3] → Output
         weights     weights     weights
         + bias      + bias      + bias
         + ReLU      + ReLU      + softmax
Let us trace through each step.
Step 1: Input as a Vector (Linear Algebra)
Every input to a neural network is a vector. If you are classifying handwritten digits (the classic MNIST dataset), each image is 28×28 pixels:
28 × 28 = 784 pixels
Flatten into a vector:
input = [0.0, 0.0, 0.12, 0.89, 0.94, ..., 0.0] (784 numbers)
Each number is a pixel intensity between 0 (black) and 1 (white). The image has been converted from a visual grid into a mathematical vector that the network can process.
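This flattening step can be sketched in a few lines of NumPy. The image here is a random stand-in for a real MNIST digit, not actual data:

```python
import numpy as np

# A toy 28x28 "image" with pixel intensities in [0, 1]
# (a hypothetical stand-in for a real MNIST digit).
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Flatten the 2-D grid into a 784-dimensional vector.
x = image.reshape(-1)
print(x.shape)  # (784,)
```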
Step 2: Layer Computation (Linear Algebra)
Each layer performs two operations:
Linear transformation — multiply the input vector by the weight matrix and add a bias vector:
z = W · x + b
Where:
x = input vector (784 numbers)
W = weight matrix (128 × 784 numbers)
b = bias vector (128 numbers)
z = pre-activation (128 numbers)
This takes 784 input features and produces 128 new features. The weight matrix determines which input features are important and how they should be combined.
Activation function — apply a non-linear function element by element:
a = ReLU(z)
ReLU(z) = max(0, z)
Before: [-0.5, 1.2, -0.1, 0.8, -2.0, 0.3]
After: [ 0.0, 1.2, 0.0, 0.8, 0.0, 0.3]
This combination of linear transformation (matrix multiply) and non-linear activation is what gives neural networks their power. Without the activation, stacking layers would collapse into a single linear transformation — no matter how many layers you stacked, the network could only compute what one layer already can.
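One full layer computation can be sketched as follows. The weight values here are random placeholders; in a trained network they would be learned:

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.random(784)                   # input vector (784 numbers)
W = rng.normal(0, 0.01, (128, 784))   # weight matrix (randomly initialized)
b = np.zeros(128)                     # bias vector

z = W @ x + b   # linear transformation: 784 features -> 128
a = relu(z)     # non-linear activation
print(z.shape, a.shape)  # (128,) (128,)
```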
Step 3: Stacking Layers (Linear Algebra)
A typical network has multiple layers, each transforming the data further:
Layer 1: 784 → 128 features (W₁ is 128×784, plus b₁)
Layer 2: 128 → 64 features (W₂ is 64×128, plus b₂)
Layer 3: 64 → 10 features (W₃ is 10×64, plus b₃)
The final layer produces 10 numbers, one for each digit (0-9). Each layer extracts increasingly abstract features:
- Layer 1 might detect edges and strokes
- Layer 2 might detect curves and corners
- Layer 3 combines these into digit identities
The total number of parameters in this small network:
Layer 1: 128 × 784 + 128 = 100,480
Layer 2: 64 × 128 + 64 = 8,256
Layer 3: 10 × 64 + 10 = 650
Total: 109,386 parameters
Every single one of these 109,386 numbers was learned through training.
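The parameter arithmetic above is easy to verify with a short loop over the layer sizes:

```python
# Layer sizes for the small MNIST network described above.
layer_sizes = [784, 128, 64, 10]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_out * n_in + n_out  # weight matrix entries plus biases
    total += params
    print(f"{n_in} -> {n_out}: {params:,} parameters")
print(f"Total: {total:,}")  # Total: 109,386
```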
Step 4: Output as Probabilities (Probability)
The final layer produces raw scores (logits). The softmax function converts these into probabilities:
Logits:        [1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9]
Digit:           0    1     2    3    4    5     6    7    8    9
After softmax: [0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.98, 0.00, 0.00]
The network predicts digit "7" with 98% confidence. The softmax function ensures every value is between 0 and 1 and all probabilities sum to 1.0.
Step 5: Measuring Error (Probability + Calculus)
If the true label is "7", we compute the cross-entropy loss:
Loss = -log(P(correct class))
     = -log(0.98)
     = 0.02
This is a small loss because the model was very confident and correct. If the model had assigned only 10% probability to the correct class:
Loss = -log(0.10)
= 2.30
A much larger loss. The loss function translates probability into a single number that measures how wrong the model is.
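Softmax and cross-entropy together take only a few lines of NumPy. Running this on the logits above reproduces the story numerically (the unrounded values are slightly more precise than the rounded figures shown):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating.
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

logits = np.array([1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9])
probs = softmax(logits)
print(probs.argmax())          # 7 -- the predicted digit
print(round(probs.sum(), 6))   # 1.0

# Cross-entropy loss for true label "7": small, because the model
# was confident and correct.
loss = -np.log(probs[7])
print(round(float(loss), 3))
```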
Step 6: Computing Gradients (Calculus)
Now comes the learning. We need to compute the gradient: how does the loss change when each of the 109,386 parameters changes?
Backpropagation uses the chain rule to compute these gradients efficiently, working backward from the loss through each layer:
Step 1: How does the loss depend on the softmax output?
∂L/∂softmax → from the cross-entropy formula
Step 2: How does the softmax output depend on Layer 3's output?
∂softmax/∂z₃ → from the softmax formula
Step 3: How does Layer 3's output depend on Layer 3's weights?
∂z₃/∂W₃ → this is just the input to Layer 3
Step 4: How does Layer 3's output depend on Layer 2's output?
∂z₃/∂a₂ → this is just W₃
... continue backward through all layers ...
The chain rule multiplies all these partial derivatives together:
∂L/∂W₁ = ∂L/∂softmax × ∂softmax/∂z₃ × ∂z₃/∂a₂ × ∂a₂/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂W₁
This chain of multiplications is why the algorithm is called "backpropagation" — the gradient signal propagates backward through the network.
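A useful sanity check on any gradient is a finite-difference comparison: nudge one input, measure how the loss changes, and compare to the analytic derivative. For cross-entropy through softmax, the gradient with respect to the logits simplifies to the standard identity "probabilities minus one-hot label", which backpropagation exploits. A minimal check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(logits, label):
    return -np.log(softmax(logits)[label])

logits = np.array([1.2, 0.3, -0.5, 0.1, 3.8, 0.2, -0.1, 8.1, 0.4, 0.9])
label = 7

# Analytic gradient of cross-entropy w.r.t. the logits: probs - one_hot(label).
probs = softmax(logits)
analytic = probs.copy()
analytic[label] -= 1.0

# Finite-difference check: bump each logit and measure the loss change.
eps = 1e-5
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    numeric[i] = (loss(bumped, label) - loss(logits, label)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```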
Step 7: Updating Weights (Calculus)
With all gradients computed, every weight is updated using gradient descent:
W₁ = W₁ - learning_rate × ∂L/∂W₁
W₂ = W₂ - learning_rate × ∂L/∂W₂
W₃ = W₃ - learning_rate × ∂L/∂W₃
b₁ = b₁ - learning_rate × ∂L/∂b₁
b₂ = b₂ - learning_rate × ∂L/∂b₂
b₃ = b₃ - learning_rate × ∂L/∂b₃
Each weight moves a tiny amount in the direction that reduces the loss. After seeing thousands of training examples, the weights converge to values that make accurate predictions.
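The same update rule is easiest to see on a single toy parameter. Here gradient descent minimizes f(w) = (w − 3)², whose gradient is 2(w − 3); the weight walks step by step toward the minimum at w = 3:

```python
# Gradient descent on one toy parameter: minimize f(w) = (w - 3)^2.
# Every weight in the network follows this same update rule.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)             # derivative of the loss at the current w
    w = w - learning_rate * grad   # move against the gradient
print(round(w, 4))  # 3.0
```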
Step 8: Repeat (All Three Pillars)
The training loop repeats steps 1-7 for every example in the training dataset, typically multiple times (epochs):
For each epoch:
    For each training example:
        Forward pass (linear algebra)
        Compute loss (probability)
        Backward pass (calculus)
        Update weights (calculus)
    Evaluate on test set (statistics)
A typical training run might involve millions of forward and backward passes.
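The whole loop fits in one short NumPy script. This is a miniature sketch, not MNIST: the dataset is a made-up 2-D classification problem (which side of the line x₀ + x₁ = 1 a point falls on), and the network sizes and learning rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in [0,1)^2, labeled by which side of x0 + x1 = 1 they fall on.
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)

# Tiny network: 2 -> 8 (ReLU) -> 2 (softmax)
W1 = rng.normal(0, 0.5, (8, 2)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (2, 8)); b2 = np.zeros(2)
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(50):
    for x, label in zip(X, y):
        # Forward pass (linear algebra)
        z1 = W1 @ x + b1
        a1 = np.maximum(0.0, z1)
        z2 = W2 @ a1 + b2
        probs = softmax(z2)                    # compute loss distribution (probability)
        # Backward pass (calculus): chain rule, layer by layer
        dz2 = probs.copy(); dz2[label] -= 1.0  # dL/dz2 = probs - one_hot
        dW2 = np.outer(dz2, a1); db2 = dz2
        da1 = W2.T @ dz2
        dz1 = da1 * (z1 > 0)                   # ReLU gradient mask
        dW1 = np.outer(dz1, x); db1 = dz1
        # Update weights (gradient descent)
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

# Evaluate (statistics): accuracy over the training set
correct = sum(
    softmax(W2 @ np.maximum(0.0, W1 @ x + b1) + b2).argmax() == label
    for x, label in zip(X, y)
)
print(correct / len(X))
```

Even this toy run performs thousands of forward and backward passes (200 examples × 50 epochs); a real model does the same thing millions or billions of times.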
The Math in Numbers
For a real-world model like GPT-3:
| Aspect | Quantity | Mathematical Branch |
|---|---|---|
| Parameters | 175 billion | Linear algebra (weight matrices) |
| Forward pass operations | ~300 billion multiplications per token | Linear algebra |
| Gradient computations | 175 billion per training step | Calculus |
| Output vocabulary | 50,257 token probabilities | Probability |
| Training examples | ~300 billion tokens | Statistics |
Every one of these numbers involves the mathematical concepts from this course.
Summary
A neural network brings all three mathematical pillars together:
- Linear algebra represents inputs as vectors, stores knowledge in weight matrices, and performs the core transformations at each layer
- Probability converts raw outputs into meaningful probabilities and defines the loss function that measures prediction quality
- Calculus computes gradients through backpropagation and updates weights through gradient descent
Understanding the math inside a neural network means understanding each of these steps. In the next lesson, you will see how these same concepts scale up to transformers and large language models.