Backpropagation Step by Step
Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in a neural network. It is not a separate concept from the chain rule — it is the chain rule applied systematically from the output layer back to the input layer. This lesson walks through the algorithm on a multi-neuron network.
The Network
Consider a network with:
- 2 input features: x₁, x₂
- 2 hidden neurons with sigmoid activation
- 1 output neuron (no activation, for regression)
- MSE loss
x₁ ──→ [h₁] ──→ [out] ──→ L
   ╲  ╱           ↑
    ╲╱            │
    ╱╲            │
   ╱  ╲           │
x₂ ──→ [h₂] ──────┘
Parameters
Layer 1 (input → hidden):
w₁₁ = 0.1   w₁₂ = 0.3   b₁ = 0.1   (weights and bias for h₁)
w₂₁ = 0.2   w₂₂ = 0.4   b₂ = 0.1   (weights and bias for h₂)
Layer 2 (hidden → output):
v₁ = 0.5    v₂ = 0.6    b₃ = 0.1   (weights and bias for the output neuron)
Inputs and Target
x₁ = 1.0, x₂ = 0.5, target y = 1.0
Step 1: Forward Pass
Hidden layer pre-activation:
z₁ = w₁₁x₁ + w₁₂x₂ + b₁ = 0.1(1.0) + 0.3(0.5) + 0.1 = 0.35
z₂ = w₂₁x₁ + w₂₂x₂ + b₂ = 0.2(1.0) + 0.4(0.5) + 0.1 = 0.50
Hidden layer activation (sigmoid):
a₁ = σ(0.35) = 1/(1 + e⁻⁰·³⁵) ≈ 0.5866
a₂ = σ(0.50) = 1/(1 + e⁻⁰·⁵⁰) ≈ 0.6225
Output:
ŷ = v₁a₁ + v₂a₂ + b₃ = 0.5(0.5866) + 0.6(0.6225) + 0.1 = 0.7668
Loss (MSE):
L = (1/2)(ŷ - y)² = (1/2)(0.7668 - 1.0)² = (1/2)(0.0544) = 0.0272
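The forward pass above can be reproduced in a few lines of Python (a sketch for checking the arithmetic; variable names like `w11` simply mirror the subscripted symbols):

```python
import math

# Parameters and inputs from the example above
w11, w12, b1 = 0.1, 0.3, 0.1   # weights/bias for h1
w21, w22, b2 = 0.2, 0.4, 0.1   # weights/bias for h2
v1, v2, b3 = 0.5, 0.6, 0.1     # weights/bias for the output neuron
x1, x2, y = 1.0, 0.5, 1.0      # inputs and target

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hidden layer pre-activations and activations
z1 = w11 * x1 + w12 * x2 + b1      # 0.35
z2 = w21 * x1 + w22 * x2 + b2      # 0.50
a1, a2 = sigmoid(z1), sigmoid(z2)  # ≈ 0.5866, 0.6225

# Output and MSE loss
y_hat = v1 * a1 + v2 * a2 + b3     # ≈ 0.7668
loss = 0.5 * (y_hat - y) ** 2      # ≈ 0.0272
```

Keeping the unrounded intermediate values in variables avoids the small rounding drift that creeps in when each printed number is re-entered by hand.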
Step 2: Backward Pass — Output Layer
Start with the derivative of the loss:
∂L/∂ŷ = ŷ - y = 0.7668 - 1.0 = -0.2332
Gradients for output layer weights:
∂L/∂v₁ = ∂L/∂ŷ · ∂ŷ/∂v₁ = (-0.2332) · a₁ = (-0.2332)(0.5866) = -0.1368
∂L/∂v₂ = ∂L/∂ŷ · ∂ŷ/∂v₂ = (-0.2332) · a₂ = (-0.2332)(0.6225) = -0.1452
∂L/∂b₃ = ∂L/∂ŷ · 1 = -0.2332
Pass the signal back to the hidden layer:
∂L/∂a₁ = ∂L/∂ŷ · ∂ŷ/∂a₁ = (-0.2332) · v₁ = (-0.2332)(0.5) = -0.1166
∂L/∂a₂ = ∂L/∂ŷ · ∂ŷ/∂a₂ = (-0.2332) · v₂ = (-0.2332)(0.6) = -0.1399
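As a sketch, the output-layer gradients can be checked in Python (recomputing the forward values rather than hard-coding the rounded ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward-pass values from the worked example
v1, v2 = 0.5, 0.6
a1, a2 = sigmoid(0.35), sigmoid(0.50)
y_hat = v1 * a1 + v2 * a2 + 0.1
y = 1.0

# Output-layer gradients
dL_dyhat = y_hat - y       # ≈ -0.2332
dL_dv1 = dL_dyhat * a1     # ≈ -0.1368
dL_dv2 = dL_dyhat * a2     # ≈ -0.1452
dL_db3 = dL_dyhat          # ≈ -0.2332

# Signal passed back to the hidden activations
dL_da1 = dL_dyhat * v1     # ≈ -0.1166
dL_da2 = dL_dyhat * v2     # ≈ -0.1399
```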
Step 3: Backward Pass — Hidden Layer
Through the sigmoid activation:
The sigmoid derivative is σ(z)(1 - σ(z)):
σ'(z₁) = a₁(1 - a₁) = 0.5866(1 - 0.5866) = 0.2425
σ'(z₂) = a₂(1 - a₂) = 0.6225(1 - 0.6225) = 0.2350
Chain rule through the activation:
∂L/∂z₁ = ∂L/∂a₁ · σ'(z₁) = (-0.1166)(0.2425) = -0.0283
∂L/∂z₂ = ∂L/∂a₂ · σ'(z₂) = (-0.1399)(0.2350) = -0.0329
Gradients for hidden layer weights:
∂L/∂w₁₁ = ∂L/∂z₁ · x₁ = (-0.0283)(1.0) = -0.0283
∂L/∂w₁₂ = ∂L/∂z₁ · x₂ = (-0.0283)(0.5) = -0.0141
∂L/∂b₁ = ∂L/∂z₁ · 1 = -0.0283
∂L/∂w₂₁ = ∂L/∂z₂ · x₁ = (-0.0329)(1.0) = -0.0329
∂L/∂w₂₂ = ∂L/∂z₂ · x₂ = (-0.0329)(0.5) = -0.0164
∂L/∂b₂ = ∂L/∂z₂ · 1 = -0.0329
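The hidden-layer step, sketched in the same style (the signals −0.1166 and −0.1399 are the rounded values carried over from the output-layer step above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values carried over from the forward pass and output-layer backward pass
x1, x2 = 1.0, 0.5
a1, a2 = sigmoid(0.35), sigmoid(0.50)
dL_da1, dL_da2 = -0.1166, -0.1399

# Through the sigmoid: sigma'(z) = sigma(z)(1 - sigma(z)) = a(1 - a)
dL_dz1 = dL_da1 * a1 * (1 - a1)   # ≈ -0.0283
dL_dz2 = dL_da2 * a2 * (1 - a2)   # ≈ -0.0329

# Weight and bias gradients: incoming signal times the layer input
dL_dw11 = dL_dz1 * x1             # ≈ -0.0283
dL_dw12 = dL_dz1 * x2             # ≈ -0.0141
dL_db1 = dL_dz1
dL_dw21 = dL_dz2 * x1             # ≈ -0.0329
dL_dw22 = dL_dz2 * x2             # ≈ -0.0164
dL_db2 = dL_dz2
```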
Step 4: Parameter Update
With learning rate α = 0.5:
Output layer:
v₁ = 0.5 - 0.5(-0.1368) = 0.5684
v₂ = 0.6 - 0.5(-0.1452) = 0.6726
b₃ = 0.1 - 0.5(-0.2332) = 0.2166
Hidden layer:
w₁₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₁₂ = 0.3 - 0.5(-0.0141) = 0.3071
b₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₂₁ = 0.2 - 0.5(-0.0329) = 0.2164
w₂₂ = 0.4 - 0.5(-0.0164) = 0.4082
b₂ = 0.1 - 0.5(-0.0329) = 0.1164
All parameters (weights and biases) increased because every gradient was negative: a negative gradient means the loss falls as that parameter rises, so the update pushes the prediction closer to the target of 1.0.
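The whole update can be sketched as one dictionary comprehension (parameter names are the ones from the example; the gradient values are the rounded ones computed above):

```python
# Gradient-descent update: theta <- theta - alpha * dL/dtheta, with alpha = 0.5
alpha = 0.5
params = {"v1": 0.5, "v2": 0.6, "b3": 0.1,
          "w11": 0.1, "w12": 0.3, "b1": 0.1,
          "w21": 0.2, "w22": 0.4, "b2": 0.1}
grads = {"v1": -0.1368, "v2": -0.1452, "b3": -0.2332,
         "w11": -0.0283, "w12": -0.0141, "b1": -0.0283,
         "w21": -0.0329, "w22": -0.0164, "b2": -0.0329}

updated = {name: value - alpha * grads[name] for name, value in params.items()}
# Every gradient is negative, so every parameter moves upward,
# e.g. v1: 0.5 - 0.5 * (-0.1368) = 0.5684
```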
The Pattern
Every layer's backward pass follows the same pattern:
1. Receive the gradient signal from the layer above: ∂L/∂(layer output)
2. Multiply by the local derivative of the activation: ∂L/∂z = ∂L/∂a · σ'(z)
3. Compute weight gradients: ∂L/∂W = ∂L/∂z · (input to this layer)ᵀ
4. Compute bias gradients: ∂L/∂b = ∂L/∂z
5. Pass the signal to the previous layer: ∂L/∂(layer input) = Wᵀ · ∂L/∂z
This pattern is the same for every layer, regardless of network depth. You just repeat it.
In Matrix Form
For a layer with weight matrix W, bias b, and activation σ:
Forward: z = Wx + b, a = σ(z)
Backward: δ = ∂L/∂a ⊙ σ'(z)
∂L/∂W = δ · xᵀ
∂L/∂b = δ
∂L/∂x = Wᵀ · δ (pass to previous layer)
where ⊙ denotes element-wise multiplication. This matrix formulation is what frameworks implement for efficient GPU computation.
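A minimal pure-Python sketch of this matrix form (helper names like `layer_forward` and `matvec` are illustrative, not from any framework; a real implementation would use NumPy or a deep learning library):

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    """Forward: z = Wx + b, a = sigma(z). Returns a and a cache for backward."""
    z = [zi + bi for zi, bi in zip(matvec(W, x), b)]
    a = [sigmoid(zi) for zi in z]
    return a, (x, a)

def layer_backward(W, dL_da, cache):
    """Backward pass for one sigmoid layer: returns dL/dW, dL/db, dL/dx."""
    x, a = cache
    delta = [g * ai * (1 - ai) for g, ai in zip(dL_da, a)]  # dL/da ⊙ sigma'(z)
    dW = [[d * xj for xj in x] for d in delta]              # delta · x^T
    db = delta                                              # dL/db = delta
    dx = matvec(transpose(W), delta)                        # W^T · delta
    return dW, db, dx

# The hidden layer of the worked example: rows are (w11, w12) and (w21, w22)
W = [[0.1, 0.3], [0.2, 0.4]]
b = [0.1, 0.1]
x = [1.0, 0.5]

a, cache = layer_forward(W, b, x)  # a ≈ [0.5866, 0.6225]
dW, db, dx = layer_backward(W, [-0.1166, -0.1399], cache)
# dW ≈ [[-0.0283, -0.0141], [-0.0329, -0.0164]], matching Step 3
```

Stacking layers is then just calling `layer_forward` left to right while saving the caches, and `layer_backward` right to left, feeding each layer's `dx` to the one before it.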
Why It Is Called "Back" Propagation
The algorithm propagates the error signal backward through the network:
Layer 1        Layer 2        Output        Loss
W₁, b₁   ──→   W₂, b₂   ──→   v, b₃   ──→    L
         ←──            ←──           ←──
∂L/∂W₁         ∂L/∂W₂         ∂L/∂v         ∂L/∂ŷ
Forward: left → right (compute predictions)
Backward: right → left (compute gradients)
The forward pass flows data from input to output. The backward pass flows gradients from output to input. Both passes visit every layer exactly once, making backpropagation O(n) in the number of layers.
Summary
- Backpropagation applies the chain rule systematically from the output layer to the input layer
- At each layer: receive gradient, multiply by local derivative, compute parameter gradients, pass signal backward
- The gradient at each weight equals the incoming gradient signal times the layer input
- The backward pass reuses intermediate results from the forward pass
- In matrix form, the backward pass uses transposed weight matrices to propagate gradients
- The algorithm visits each layer once in the forward pass and once in the backward pass
- This efficiency is what makes training deep networks with millions of parameters practical
The final lesson discusses backpropagation in practice — the challenges, solutions, and how modern frameworks handle it.