How Neural Networks Learn
This lesson brings together everything from the previous modules — derivatives, partial derivatives, the chain rule, gradients, gradient descent, and loss functions — to explain how a neural network goes from random parameters to making accurate predictions. The process is systematic and relies entirely on calculus.
A Neural Network Is a Chain of Functions
A simple neural network with one hidden layer computes:
Input → Linear → Activation → Linear → Output → Loss
More precisely:
z₁ = W₁x + b₁ (first linear transformation)
a₁ = σ(z₁) (activation function)
z₂ = W₂a₁ + b₂ (second linear transformation)
ŷ = z₂ (output, for regression)
L = (1/2)(ŷ - y)² (loss)
Each step is a function whose input is the previous step's output. The entire network is a composition of functions — exactly the setting where the chain rule applies.
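This composition can be written directly as small Python functions. A minimal sketch with scalar weights — the names `forward` and `loss` are illustrative, not from the lesson:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # Input -> Linear -> Activation -> Linear -> Output
    z1 = w1 * x + b1      # first linear transformation
    a1 = sigmoid(z1)      # activation function
    z2 = w2 * a1 + b2     # second linear transformation
    return z2             # y_hat = z2 for regression

def loss(y_hat, y):
    # L = (1/2)(y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2
```

Because each function consumes the previous one's output, the whole network is literally `loss(forward(x, ...), y)` — a composition of functions the chain rule can differentiate.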
The Parameters
The trainable parameters are:
- W₁: weight matrix for the first layer
- b₁: bias vector for the first layer
- W₂: weight matrix for the second layer
- b₂: bias vector for the second layer
Training means finding values for these parameters that minimize the loss L over the training data.
The Training Process
Neural network training repeats three phases:
Phase 1: Forward Pass
Push input data through the network to compute a prediction:
x → z₁ → a₁ → z₂ → ŷ → L
Each intermediate value is saved because it will be needed during the backward pass.
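Phase 1 might look like this in Python; the `cache` dictionary that stores the saved intermediates is an illustrative choice, not part of the lesson:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_pass(x, params):
    # Compute the prediction while saving every intermediate
    # value, since the backward pass will need them all.
    z1 = params["w1"] * x + params["b1"]
    a1 = sigmoid(z1)
    z2 = params["w2"] * a1 + params["b2"]
    y_hat = z2
    cache = {"x": x, "z1": z1, "a1": a1, "y_hat": y_hat}
    return y_hat, cache
```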
Phase 2: Backward Pass (Backpropagation)
Compute the gradient of the loss with respect to every parameter by applying the chain rule from the output back to the input:
L → ŷ → z₂ → a₁ → z₁ → (W₁, b₁, W₂, b₂)
This gives: ∂L/∂W₂, ∂L/∂b₂, ∂L/∂W₁, ∂L/∂b₁.
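A sketch of Phase 2, assuming the forward pass saved its intermediate values in a dictionary (the `cache` layout here is an assumption for illustration):

```python
def backward_pass(cache, params, y):
    # Chain rule applied from the loss back to each parameter.
    dL_dyhat = cache["y_hat"] - y        # dL/dy_hat for L = (1/2)(y_hat - y)^2
    # y_hat = z2 = w2 * a1 + b2
    dL_dw2 = dL_dyhat * cache["a1"]
    dL_db2 = dL_dyhat * 1.0
    dL_da1 = dL_dyhat * params["w2"]     # chain back through the second layer
    # a1 = sigmoid(z1), whose derivative is a1 * (1 - a1)
    dL_dz1 = dL_da1 * cache["a1"] * (1.0 - cache["a1"])
    dL_dw1 = dL_dz1 * cache["x"]
    dL_db1 = dL_dz1 * 1.0
    return {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}
```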
Phase 3: Parameter Update
Use gradient descent to update each parameter:
W₂ = W₂ - α · ∂L/∂W₂
b₂ = b₂ - α · ∂L/∂b₂
W₁ = W₁ - α · ∂L/∂W₁
b₁ = b₁ - α · ∂L/∂b₁
Then repeat from Phase 1 with the next batch of data.
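Phase 3 is one subtraction per parameter. A sketch, with `lr` standing in for the learning rate α:

```python
def update_parameters(params, grads, lr=0.1):
    # Gradient descent: W := W - alpha * dL/dW, applied to
    # every parameter simultaneously using the stored gradients.
    return {name: value - lr * grads[name] for name, value in params.items()}
```

Returning a new dictionary rather than mutating in place keeps each training step easy to reason about; real frameworks typically update in place for efficiency.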
A Concrete Example
Consider a tiny network:
- Input: x = 1.0
- One hidden neuron with weight w₁ = 0.5, bias b₁ = 0.1, sigmoid activation
- One output neuron with weight w₂ = 0.8, bias b₂ = 0.2
- Target: y = 1.0
Forward Pass
z₁ = w₁ · x + b₁ = 0.5(1.0) + 0.1 = 0.6
a₁ = σ(0.6) = 1/(1 + e⁻⁰·⁶) ≈ 0.6457
z₂ = w₂ · a₁ + b₂ = 0.8(0.6457) + 0.2 = 0.7166
ŷ = z₂ = 0.7166
L = (1/2)(0.7166 - 1.0)² = (1/2)(0.0803) = 0.0401
The model predicted 0.717 when the target is 1.0. The loss is 0.04.
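These numbers are easy to verify in a few lines of Python (variable names mirror the symbols above):

```python
import math

x, y = 1.0, 1.0
w1, b1 = 0.5, 0.1
w2, b2 = 0.8, 0.2

z1 = w1 * x + b1                    # 0.6
a1 = 1.0 / (1.0 + math.exp(-z1))    # sigmoid(0.6) ~ 0.6457
z2 = w2 * a1 + b2                   # ~0.7166
y_hat = z2
L = 0.5 * (y_hat - y) ** 2          # ~0.0401
```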
Backward Pass
Now compute the gradient of L with respect to each parameter. We start from the loss and work backward.
Step 1: ∂L/∂ŷ
∂L/∂ŷ = ŷ - y = 0.7166 - 1.0 = -0.2834
Step 2: ∂L/∂w₂ and ∂L/∂b₂
Since ŷ = z₂ = w₂a₁ + b₂:
∂L/∂w₂ = ∂L/∂ŷ · ∂ŷ/∂w₂ = (-0.2834) · a₁ = (-0.2834)(0.6457) = -0.1830
∂L/∂b₂ = ∂L/∂ŷ · ∂ŷ/∂b₂ = (-0.2834) · 1 = -0.2834
Step 3: ∂L/∂a₁ (to chain back further)
∂L/∂a₁ = ∂L/∂ŷ · ∂ŷ/∂a₁ = (-0.2834) · w₂ = (-0.2834)(0.8) = -0.2267
Step 4: ∂L/∂w₁ and ∂L/∂b₁
Since a₁ = σ(z₁) and z₁ = w₁x + b₁:
∂a₁/∂z₁ = σ(z₁)(1 - σ(z₁)) = 0.6457 · 0.3543 = 0.2288
∂L/∂w₁ = ∂L/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂w₁
= (-0.2267)(0.2288)(x)
= (-0.2267)(0.2288)(1.0)
= -0.0519
∂L/∂b₁ = ∂L/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂b₁
= (-0.2267)(0.2288)(1)
= -0.0519
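The same backward-pass arithmetic, checked step by step in Python (variable names like `dL_dw1` are illustrative spellings of the partial derivatives above):

```python
import math

# Forward-pass values from the example
x, y, w2 = 1.0, 1.0, 0.8
a1 = 1.0 / (1.0 + math.exp(-0.6))   # ~0.6457
y_hat = w2 * a1 + 0.2               # ~0.7166

# Backward pass, step by step
dL_dyhat = y_hat - y                # ~-0.2834
dL_dw2 = dL_dyhat * a1              # ~-0.1830
dL_db2 = dL_dyhat                   # ~-0.2834
dL_da1 = dL_dyhat * w2              # ~-0.2267
da1_dz1 = a1 * (1.0 - a1)           # ~0.2288
dL_dw1 = dL_da1 * da1_dz1 * x       # ~-0.0519
dL_db1 = dL_da1 * da1_dz1           # ~-0.0519
```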
Parameter Update (learning rate α = 0.1)
w₂ = 0.8 - 0.1(-0.1830) = 0.8183
b₂ = 0.2 - 0.1(-0.2834) = 0.2283
w₁ = 0.5 - 0.1(-0.0519) = 0.5052
b₁ = 0.1 - 0.1(-0.0519) = 0.1052
All gradients were negative (increasing each parameter decreases the loss), so every parameter increased. After this update, the model's prediction will be slightly closer to 1.0.
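A quick check (not part of the lesson's arithmetic) confirms this: plugging the updated parameters back into the network moves the prediction toward the target.

```python
import math

def predict(x, w1, b1, w2, b2):
    # Forward pass only: sigmoid hidden unit, linear output
    a1 = 1.0 / (1.0 + math.exp(-(w1 * x + b1)))
    return w2 * a1 + b2

x = 1.0
before = predict(x, 0.5, 0.1, 0.8, 0.2)              # ~0.7166
after = predict(x, 0.5052, 0.1052, 0.8183, 0.2283)   # closer to 1.0
```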
What Makes This Efficient
Notice that when computing ∂L/∂w₁, we reused ∂L/∂a₁ from Step 3. Each backward step passes a "gradient signal" to the previous layer, which multiplies it by its local derivative and passes it further back. This reuse is what makes backpropagation efficient — each intermediate gradient is computed once and reused.
Without this reuse, computing the gradient for an early layer would mean redoing all of the work for every later layer, making training far more expensive as networks get deeper.
The Gradient Signal
Think of the backward pass as a signal flowing from the loss to each parameter:
L → ∂L/∂ŷ → ∂L/∂z₂ → ∂L/∂a₁ → ∂L/∂z₁ → ∂L/∂w₁
(at z₂, the signal also branches off to ∂L/∂w₂ and ∂L/∂b₂; at z₁, to ∂L/∂b₁)
At each node, the incoming signal is multiplied by the local derivative and passed backward. The final signal at each parameter tells it exactly how to change.
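In code, this signal-passing view reduces to a single running product. A minimal sketch (the function name and list representation are illustrative): each entry of `local_derivatives` is one node's local derivative, ordered from the loss backward.

```python
def backward_signal(local_derivatives):
    # Propagate the gradient signal backward: start from
    # dL/dL = 1 and multiply in each node's local derivative,
    # recording the signal seen at each node along the way.
    signal = 1.0
    signals = []
    for local in local_derivatives:
        signal *= local
        signals.append(signal)
    return signals
```

For the example's path L → ŷ → a₁ → z₁ → w₁, the local derivatives are (ŷ − y), w₂, a₁(1 − a₁), and x, and the final signal is ∂L/∂w₁ ≈ −0.0519.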
Summary
- Neural networks are compositions of functions: linear transformations and activations
- Training has three phases: forward pass, backward pass, and parameter update
- The forward pass computes the prediction and loss from inputs to output
- The backward pass computes gradients from the loss back to each parameter using the chain rule
- Each backward step multiplies the incoming gradient signal by the local derivative
- Intermediate gradients are computed once and reused, making the process efficient
- After computing all gradients, gradient descent updates every parameter simultaneously
- This process repeats for each batch of training data until the loss converges
The next lesson walks through backpropagation in detail, showing exactly how the chain rule is applied layer by layer in a multi-neuron network.

