How Neural Networks Learn
This lesson brings together everything from the previous modules — derivatives, partial derivatives, the chain rule, gradients, gradient descent, and loss functions — to explain how a neural network goes from random parameters to making accurate predictions. The process is systematic and relies entirely on calculus.
A Neural Network Is a Chain of Functions
A simple neural network with one hidden layer computes:
Input → Linear → Activation → Linear → Output → Loss
More precisely:
z₁ = W₁x + b₁ (first linear transformation)
a₁ = σ(z₁) (activation function)
z₂ = W₂a₁ + b₂ (second linear transformation)
ŷ = z₂ (output, for regression)
L = (1/2)(ŷ - y)² (loss)
Each step is a function whose input is the previous step's output. The entire network is a composition of functions — exactly the setting where the chain rule applies.
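This composition can be written directly as small Python functions. A minimal sketch with scalar weights — the names `forward` and `loss` are illustrative, not from the lesson:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # Input -> Linear -> Activation -> Linear -> Output
    z1 = w1 * x + b1      # first linear transformation
    a1 = sigmoid(z1)      # activation function
    z2 = w2 * a1 + b2     # second linear transformation
    return z2             # y_hat = z2 for regression

def loss(y_hat, y):
    # L = (1/2)(y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2
```

Because each function consumes the previous one's output, the whole network is literally `loss(forward(x, ...), y)` — a composition of functions the chain rule can differentiate.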
The Parameters
The trainable parameters are:
- W₁: weight matrix for the first layer
- b₁: bias vector for the first layer
- W₂: weight matrix for the second layer
- b₂: bias vector for the second layer
Training means finding values for these parameters that minimize the loss L over the training data.
The Training Process
Neural network training repeats three phases:
Phase 1: Forward Pass
Push input data through the network to compute a prediction:
x → z₁ → a₁ → z₂ → ŷ → L
Each intermediate value is saved because it will be needed during the backward pass.
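Phase 1 might look like this in Python; the `cache` dictionary that stores the saved intermediates is an illustrative choice, not part of the lesson:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_pass(x, params):
    # Compute the prediction while saving every intermediate
    # value, since the backward pass will need them all.
    z1 = params["w1"] * x + params["b1"]
    a1 = sigmoid(z1)
    z2 = params["w2"] * a1 + params["b2"]
    y_hat = z2
    cache = {"x": x, "z1": z1, "a1": a1, "y_hat": y_hat}
    return y_hat, cache
```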
Phase 2: Backward Pass (Backpropagation)
Compute the gradient of the loss with respect to every parameter by applying the chain rule from the output back to the input:
L → ŷ → z₂ → a₁ → z₁ → (W₁, b₁, W₂, b₂)
This gives: ∂L/∂W₂, ∂L/∂b₂, ∂L/∂W₁, ∂L/∂b₁.
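A sketch of Phase 2, assuming the forward pass saved its intermediate values in a dictionary (the `cache` layout here is an assumption for illustration):

```python
def backward_pass(cache, params, y):
    # Chain rule applied from the loss back to each parameter.
    dL_dyhat = cache["y_hat"] - y        # dL/dy_hat for L = (1/2)(y_hat - y)^2
    # y_hat = z2 = w2 * a1 + b2
    dL_dw2 = dL_dyhat * cache["a1"]
    dL_db2 = dL_dyhat * 1.0
    dL_da1 = dL_dyhat * params["w2"]     # chain back through the second layer
    # a1 = sigmoid(z1), whose derivative is a1 * (1 - a1)
    dL_dz1 = dL_da1 * cache["a1"] * (1.0 - cache["a1"])
    dL_dw1 = dL_dz1 * cache["x"]
    dL_db1 = dL_dz1 * 1.0
    return {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}
```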
Phase 3: Parameter Update
Use gradient descent to update each parameter:
W₂ = W₂ - α · ∂L/∂W₂
b₂ = b₂ - α · ∂L/∂b₂
W₁ = W₁ - α · ∂L/∂W₁
b₁ = b₁ - α · ∂L/∂b₁
Then repeat from Phase 1 with the next batch of data.
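Phase 3 is one subtraction per parameter. A sketch, with `lr` standing in for the learning rate α:

```python
def update_parameters(params, grads, lr=0.1):
    # Gradient descent: W := W - alpha * dL/dW, applied to
    # every parameter simultaneously using the stored gradients.
    return {name: value - lr * grads[name] for name, value in params.items()}
```

Returning a new dictionary rather than mutating in place keeps each training step easy to reason about; real frameworks typically update in place for efficiency.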
A Concrete Example
Consider a tiny network:
- Input: x = 1.0
- One hidden neuron with weight w₁ = 0.5, bias b₁ = 0.1, sigmoid activation
- One output neuron with weight w₂ = 0.8, bias b₂ = 0.2
- Target: y = 1.0
Forward Pass
z₁ = w₁ · x + b₁ = 0.5(1.0) + 0.1 = 0.6
a₁ = σ(0.6) = 1/(1 + e⁻⁰·⁶) ≈ 0.6457
z₂ = w₂ · a₁ + b₂ = 0.8(0.6457) + 0.2 = 0.7166
ŷ = z₂ = 0.7166
L = (1/2)(0.7166 - 1.0)² = (1/2)(0.0803) = 0.0401
The model predicted 0.717 when the target is 1.0. The loss is 0.04.
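These numbers are easy to verify in a few lines of Python (variable names mirror the symbols above):

```python
import math

x, y = 1.0, 1.0
w1, b1 = 0.5, 0.1
w2, b2 = 0.8, 0.2

z1 = w1 * x + b1                    # 0.6
a1 = 1.0 / (1.0 + math.exp(-z1))    # sigmoid(0.6) ~ 0.6457
z2 = w2 * a1 + b2                   # ~0.7166
y_hat = z2
L = 0.5 * (y_hat - y) ** 2          # ~0.0401
```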
Backward Pass
Now compute the gradient of L with respect to each parameter. We start from the loss and work backward.
Step 1: ∂L/∂ŷ
∂L/∂ŷ = ŷ - y = 0.7166 - 1.0 = -0.2834
Step 2: ∂L/∂w₂ and ∂L/∂b₂
Since ŷ = z₂ = w₂a₁ + b₂:
∂L/∂w₂ = ∂L/∂ŷ · ∂ŷ/∂w₂ = (-0.2834) · a₁ = (-0.2834)(0.6457) = -0.1830
∂L/∂b₂ = ∂L/∂ŷ · ∂ŷ/∂b₂ = (-0.2834) · 1 = -0.2834
Step 3: ∂L/∂a₁ (to chain back further)
∂L/∂a₁ = ∂L/∂ŷ · ∂ŷ/∂a₁ = (-0.2834) · w₂ = (-0.2834)(0.8) = -0.2267
Step 4: ∂L/∂w₁ and ∂L/∂b₁
Since a₁ = σ(z₁) and z₁ = w₁x + b₁:
∂a₁/∂z₁ = σ(z₁)(1 - σ(z₁)) = 0.6457 · 0.3543 = 0.2288
∂L/∂w₁ = ∂L/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂w₁
= (-0.2267)(0.2288)(x)
= (-0.2267)(0.2288)(1.0)
= -0.0519
∂L/∂b₁ = ∂L/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂b₁
= (-0.2267)(0.2288)(1)
= -0.0519
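The same backward-pass arithmetic, checked step by step in Python (variable names like `dL_dw1` are illustrative spellings of the partial derivatives above):

```python
import math

# Forward-pass values from the example
x, y, w2 = 1.0, 1.0, 0.8
a1 = 1.0 / (1.0 + math.exp(-0.6))   # ~0.6457
y_hat = w2 * a1 + 0.2               # ~0.7166

# Backward pass, step by step
dL_dyhat = y_hat - y                # ~-0.2834
dL_dw2 = dL_dyhat * a1              # ~-0.1830
dL_db2 = dL_dyhat                   # ~-0.2834
dL_da1 = dL_dyhat * w2              # ~-0.2267
da1_dz1 = a1 * (1.0 - a1)           # ~0.2288
dL_dw1 = dL_da1 * da1_dz1 * x       # ~-0.0519
dL_db1 = dL_da1 * da1_dz1           # ~-0.0519
```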
Parameter Update (learning rate α = 0.1)
w₂ = 0.8 - 0.1(-0.1830) = 0.8183
b₂ = 0.2 - 0.1(-0.2834) = 0.2283
w₁ = 0.5 - 0.1(-0.0519) = 0.5052
b₁ = 0.1 - 0.1(-0.0519) = 0.1052
All gradients were negative (increasing each parameter decreases the loss), so every parameter increased. After this update, the model's prediction will be slightly closer to 1.0.
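A quick check (not part of the lesson's arithmetic) confirms this: plugging the updated parameters back into the network moves the prediction toward the target.

```python
import math

def predict(x, w1, b1, w2, b2):
    # Forward pass only: sigmoid hidden unit, linear output
    a1 = 1.0 / (1.0 + math.exp(-(w1 * x + b1)))
    return w2 * a1 + b2

x = 1.0
before = predict(x, 0.5, 0.1, 0.8, 0.2)              # ~0.7166
after = predict(x, 0.5052, 0.1052, 0.8183, 0.2283)   # closer to 1.0
```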
What Makes This Efficient
Notice that when computing ∂L/∂w₁, we reused ∂L/∂a₁ from Step 3. Each backward step passes a "gradient signal" to the previous layer, which multiplies it by its local derivative and passes it further back. This reuse is what makes backpropagation efficient — each intermediate gradient is computed once and reused.
Without this reuse, computing the gradient for an early layer would mean redoing all of the work for every later layer, making training far more expensive as networks get deeper.
The Gradient Signal
Think of the backward pass as a signal flowing from the loss to each parameter:
L → ∂L/∂ŷ → ∂L/∂z₂ → ∂L/∂a₁ → ∂L/∂z₁ → ∂L/∂w₁
(at z₂, the signal also branches off to ∂L/∂w₂ and ∂L/∂b₂; at z₁, to ∂L/∂b₁)
At each node, the incoming signal is multiplied by the local derivative and passed backward. The final signal at each parameter tells it exactly how to change.
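In code, this signal-passing view reduces to a single running product. A minimal sketch (the function name and list representation are illustrative): each entry of `local_derivatives` is one node's local derivative, ordered from the loss backward.

```python
def backward_signal(local_derivatives):
    # Propagate the gradient signal backward: start from
    # dL/dL = 1 and multiply in each node's local derivative,
    # recording the signal seen at each node along the way.
    signal = 1.0
    signals = []
    for local in local_derivatives:
        signal *= local
        signals.append(signal)
    return signals
```

For the example's path L → ŷ → a₁ → z₁ → w₁, the local derivatives are (ŷ − y), w₂, a₁(1 − a₁), and x, and the final signal is ∂L/∂w₁ ≈ −0.0519.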
Summary
- Neural networks are compositions of functions: linear transformations and activations
- Training has three phases: forward pass, backward pass, and parameter update
- The forward pass computes the prediction and loss from inputs to output
- The backward pass computes gradients from the loss back to each parameter using the chain rule
- Each backward step multiplies the incoming gradient signal by the local derivative
- Intermediate gradients are computed once and reused, making the process efficient
- After computing all gradients, gradient descent updates every parameter simultaneously
- This process repeats for each batch of training data until the loss converges
The next lesson walks through backpropagation in detail, showing exactly how the chain rule is applied layer by layer in a multi-neuron network.

