Backpropagation Step by Step
Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in a neural network. It is not a separate concept from the chain rule: it is the chain rule applied systematically from the output layer back to the input layer. This lesson walks through the algorithm on a multi-neuron network.
The Network
Consider a network with:
- 2 input features: x₁, x₂
- 2 hidden neurons with sigmoid activation
- 1 output neuron (no activation, for regression)
- MSE loss
x₁ ──→ [h₁] ──→ [out] ──→ L
    ╲  ╱       ╱
     ╲╱       ╱
     ╱╲      ╱
    ╱  ╲    ╱
x₂ ──→ [h₂]
Parameters
Layer 1 (input → hidden):
w₁₁ = 0.1   w₁₂ = 0.3   b₁ = 0.1   (weights to h₁)
w₂₁ = 0.2   w₂₂ = 0.4   b₂ = 0.1   (weights to h₂)
Layer 2 (hidden → output):
v₁ = 0.5   v₂ = 0.6   b₃ = 0.1   (weights to output)
Inputs and Target
x₁ = 1.0, x₂ = 0.5, target y = 1.0
Step 1: Forward Pass
Hidden layer pre-activation:
z₁ = w₁₁x₁ + w₁₂x₂ + b₁ = 0.1(1.0) + 0.3(0.5) + 0.1 = 0.35
z₂ = w₂₁x₁ + w₂₂x₂ + b₂ = 0.2(1.0) + 0.4(0.5) + 0.1 = 0.50
Hidden layer activation (sigmoid):
a₁ = σ(0.35) = 1/(1 + e⁻⁰·³⁵) ≈ 0.5866
a₂ = σ(0.50) = 1/(1 + e⁻⁰·⁵⁰) ≈ 0.6225
Output:
ŷ = v₁a₁ + v₂a₂ + b₃ = 0.5(0.5866) + 0.6(0.6225) + 0.1 = 0.7668
Loss (MSE):
L = (1/2)(ŷ - y)² = (1/2)(0.7668 - 1.0)² = (1/2)(0.0544) = 0.0272
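The forward pass can be reproduced in a few lines of Python. This is a minimal sketch of the computation above; the variable names (`w11`, `y_hat`, etc.) are introduced here for the snippet, not part of the lesson's notation:

```python
import math

# Parameters, inputs, and target from the lesson
w11, w12, b1 = 0.1, 0.3, 0.1        # weights and bias into h1
w21, w22, b2 = 0.2, 0.4, 0.1        # weights and bias into h2
v1, v2, b3 = 0.5, 0.6, 0.1          # weights and bias into the output
x1, x2, y = 1.0, 0.5, 1.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hidden layer
z1 = w11 * x1 + w12 * x2 + b1       # 0.35
z2 = w21 * x1 + w22 * x2 + b2       # 0.50
a1, a2 = sigmoid(z1), sigmoid(z2)   # ~0.5866, ~0.6225

# Output and loss
y_hat = v1 * a1 + v2 * a2 + b3      # ~0.7668
loss = 0.5 * (y_hat - y) ** 2       # ~0.0272
```

Note that z₁, z₂, a₁, a₂, and ŷ are all kept around: the backward pass will reuse them.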
Step 2: Backward Pass - Output Layer
Start with the derivative of the loss:
∂L/∂ŷ = ŷ - y = 0.7668 - 1.0 = -0.2332
Gradients for output layer weights:
∂L/∂v₁ = ∂L/∂ŷ · ∂ŷ/∂v₁ = (-0.2332) · a₁ = (-0.2332)(0.5866) = -0.1368
∂L/∂v₂ = ∂L/∂ŷ · ∂ŷ/∂v₂ = (-0.2332) · a₂ = (-0.2332)(0.6225) = -0.1452
∂L/∂b₃ = ∂L/∂ŷ · 1 = -0.2332
Pass the signal back to the hidden layer:
∂L/∂a₁ = ∂L/∂ŷ · ∂ŷ/∂a₁ = (-0.2332) · v₁ = (-0.2332)(0.5) = -0.1166
∂L/∂a₂ = ∂L/∂ŷ · ∂ŷ/∂a₂ = (-0.2332) · v₂ = (-0.2332)(0.6) = -0.1399
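The output-layer step can be sketched in Python. The forward values are recomputed so the snippet runs on its own; variable names are the snippet's own:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward-pass values, recomputed so this snippet is self-contained
v1, v2 = 0.5, 0.6
a1, a2 = sigmoid(0.35), sigmoid(0.50)
y_hat = v1 * a1 + v2 * a2 + 0.1
y = 1.0

# Start of the backward pass: derivative of the MSE loss
dL_dyhat = y_hat - y        # ~ -0.2332

# Output-layer parameter gradients: incoming signal times the layer input
dL_dv1 = dL_dyhat * a1      # ~ -0.1368
dL_dv2 = dL_dyhat * a2      # ~ -0.1452
dL_db3 = dL_dyhat           # ~ -0.2332

# Signal passed back to the hidden activations: incoming signal times the weights
dL_da1 = dL_dyhat * v1      # ~ -0.1166
dL_da2 = dL_dyhat * v2      # ~ -0.1399
```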
Step 3: Backward Pass - Hidden Layer
Through the sigmoid activation:
The sigmoid derivative is σ(z)(1 - σ(z)):
σ'(z₁) = a₁(1 - a₁) = 0.5866(1 - 0.5866) = 0.2425
σ'(z₂) = a₂(1 - a₂) = 0.6225(1 - 0.6225) = 0.2350
Chain rule through the activation:
∂L/∂z₁ = ∂L/∂a₁ · σ'(z₁) = (-0.1166)(0.2425) = -0.0283
∂L/∂z₂ = ∂L/∂a₂ · σ'(z₂) = (-0.1399)(0.2350) = -0.0329
Gradients for hidden layer weights:
∂L/∂w₁₁ = ∂L/∂z₁ · x₁ = (-0.0283)(1.0) = -0.0283
∂L/∂w₁₂ = ∂L/∂z₁ · x₂ = (-0.0283)(0.5) = -0.0141
∂L/∂b₁ = ∂L/∂z₁ · 1 = -0.0283
∂L/∂w₂₁ = ∂L/∂z₂ · x₁ = (-0.0329)(1.0) = -0.0329
∂L/∂w₂₂ = ∂L/∂z₂ · x₂ = (-0.0329)(0.5) = -0.0164
∂L/∂b₂ = ∂L/∂z₂ · 1 = -0.0329
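The hidden-layer step follows the same shape in code. This sketch hard-codes the rounded signals from the previous step and recomputes the activations, so it runs standalone:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Carried over: inputs, hidden activations, and the signal from the output layer
x1, x2 = 1.0, 0.5
a1, a2 = sigmoid(0.35), sigmoid(0.50)
dL_da1, dL_da2 = -0.1166, -0.1399   # rounded values from the previous step

# Chain rule through the sigmoid, reusing stored activations: sigma'(z) = a(1 - a)
dL_dz1 = dL_da1 * a1 * (1 - a1)     # ~ -0.0283
dL_dz2 = dL_da2 * a2 * (1 - a2)     # ~ -0.0329

# Weight gradient = incoming signal * layer input; bias gradient = incoming signal
dL_dw11, dL_dw12, dL_db1 = dL_dz1 * x1, dL_dz1 * x2, dL_dz1
dL_dw21, dL_dw22, dL_db2 = dL_dz2 * x1, dL_dz2 * x2, dL_dz2
```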
Step 4: Parameter Update
With learning rate α = 0.5:
Output layer:
v₁ = 0.5 - 0.5(-0.1368) = 0.5684
v₂ = 0.6 - 0.5(-0.1452) = 0.6726
b₃ = 0.1 - 0.5(-0.2332) = 0.2166
Hidden layer:
w₁₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₁₂ = 0.3 - 0.5(-0.0141) = 0.3071
b₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₂₁ = 0.2 - 0.5(-0.0329) = 0.2164
w₂₂ = 0.4 - 0.5(-0.0164) = 0.4082
b₂ = 0.1 - 0.5(-0.0329) = 0.1164
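The update is the same rule applied to every parameter, so it can be sketched as one loop. The dictionaries below hold the lesson's initial values and the rounded gradients computed above:

```python
# Gradient descent step, w <- w - alpha * dL/dw, for every parameter
alpha = 0.5
params = {"v1": 0.5, "v2": 0.6, "b3": 0.1,
          "w11": 0.1, "w12": 0.3, "b1": 0.1,
          "w21": 0.2, "w22": 0.4, "b2": 0.1}
grads = {"v1": -0.1368, "v2": -0.1452, "b3": -0.2332,
         "w11": -0.0283, "w12": -0.0141, "b1": -0.0283,
         "w21": -0.0329, "w22": -0.0164, "b2": -0.0329}

for name in params:
    params[name] -= alpha * grads[name]
```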
All weights increased because all gradients were negative: the loss decreases when the weights increase, pushing the prediction closer to the target of 1.0.
The Pattern
Every layer's backward pass follows the same pattern:
1. Receive the gradient signal from the layer above: ∂L/∂(layer output)
2. Multiply by the local derivative of the activation: ∂L/∂z = ∂L/∂a · σ'(z)
3. Compute weight gradients: ∂L/∂W = ∂L/∂z · (input to this layer)ᵀ
4. Compute bias gradients: ∂L/∂b = ∂L/∂z
5. Pass the signal to the previous layer: ∂L/∂(layer input) = Wᵀ · ∂L/∂z
This pattern is the same for every layer, regardless of network depth. You just repeat it.
In Matrix Form
For a layer with weight matrix W, bias b, and activation σ:
Forward: z = Wx + b, a = σ(z)
Backward: δ = ∂L/∂a ⊙ σ'(z) (element-wise multiply)
∂L/∂W = δ · xᵀ
∂L/∂b = δ
∂L/∂x = Wᵀ · δ (pass to previous layer)
where ⊙ denotes element-wise multiplication. This matrix formulation is what frameworks implement for efficient GPU computation.
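These four equations can be sketched with NumPy on this lesson's 2-2-1 network. The array layout (row i of W holds the weights into hidden neuron i) is an assumption of this snippet, chosen to match the numbers above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The lesson's 2-2-1 network in matrix form
x = np.array([1.0, 0.5])
y = 1.0
W = np.array([[0.1, 0.3],
              [0.2, 0.4]])            # row i: weights into h_i
b = np.array([0.1, 0.1])
v = np.array([0.5, 0.6])
b3 = 0.1

# Forward: z = Wx + b, a = sigma(z)
z = W @ x + b                         # [0.35, 0.50]
a = sigmoid(z)
y_hat = v @ a + b3                    # ~0.7668

# Backward
dL_dyhat = y_hat - y                  # ~ -0.2332
dL_dv = dL_dyhat * a                  # output-layer weight gradients
dL_db3 = dL_dyhat
dL_da = dL_dyhat * v                  # W^T . delta step for the (scalar) output layer
delta = dL_da * a * (1 - a)           # element-wise multiply by sigma'(z)
dL_dW = np.outer(delta, x)            # delta . x^T
dL_db = delta                         # bias gradients
dL_dx = W.T @ delta                   # signal to a previous layer (unused here)
```

The gradients land in the same places as the scalar derivation: dL_dW matches the four w gradients, dL_dv the two v gradients.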
Why It Is Called "Back" Propagation
The algorithm propagates the error signal backward through the network:
Layer 1       Layer 2       Output       Loss
W₁,b₁   ──→   W₂,b₂   ──→   v,b₃   ──→    L
   ↑             ↑            ↑
∂L/∂W₁       ∂L/∂W₂        ∂L/∂v        ∂L/∂ŷ
Forward: left → right (compute predictions)
Backward: right → left (compute gradients)
The forward pass flows data from input to output. The backward pass flows gradients from output to input. Both passes visit every layer exactly once, making backpropagation O(n) in the number of layers.
Summary
- Backpropagation applies the chain rule systematically from the output layer to the input layer
- At each layer: receive gradient, multiply by local derivative, compute parameter gradients, pass signal backward
- The gradient at each weight equals the incoming gradient signal times the layer input
- The backward pass reuses intermediate results from the forward pass
- In matrix form, the backward pass uses transposed weight matrices to propagate gradients
- The algorithm visits each layer once in the forward pass and once in the backward pass
- This efficiency is what makes training deep networks with millions of parameters practical
The final lesson discusses backpropagation in practice: the challenges, the solutions, and how modern frameworks handle it.

