Backpropagation Step by Step
Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in a neural network. It is not a separate concept from the chain rule — it is the chain rule applied systematically from the output layer back to the input layer. This lesson walks through the algorithm on a multi-neuron network.
The Network
Consider a network with:
- 2 input features: x₁, x₂
- 2 hidden neurons with sigmoid activation
- 1 output neuron (no activation, for regression)
- MSE loss
x₁ ──→ [h₁] ──→ [out] ──→ L
   ╲  ╱           ↑
    ╲╱            │
    ╱╲            │
   ╱  ╲           │
x₂ ──→ [h₂] ──────┘
Parameters
Layer 1 (input → hidden):
w₁₁ = 0.1   w₁₂ = 0.3   b₁ = 0.1   (weights and bias for h₁)
w₂₁ = 0.2   w₂₂ = 0.4   b₂ = 0.1   (weights and bias for h₂)
Layer 2 (hidden → output):
v₁ = 0.5    v₂ = 0.6    b₃ = 0.1   (weights and bias for the output neuron)
Inputs and Target
x₁ = 1.0, x₂ = 0.5, target y = 1.0
Step 1: Forward Pass
Hidden layer pre-activation:
z₁ = w₁₁x₁ + w₁₂x₂ + b₁ = 0.1(1.0) + 0.3(0.5) + 0.1 = 0.35
z₂ = w₂₁x₁ + w₂₂x₂ + b₂ = 0.2(1.0) + 0.4(0.5) + 0.1 = 0.50
Hidden layer activation (sigmoid):
a₁ = σ(0.35) = 1/(1 + e⁻⁰·³⁵) ≈ 0.5866
a₂ = σ(0.50) = 1/(1 + e⁻⁰·⁵⁰) ≈ 0.6225
Output:
ŷ = v₁a₁ + v₂a₂ + b₃ = 0.5(0.5866) + 0.6(0.6225) + 0.1 = 0.7668
Loss (MSE):
L = (1/2)(ŷ - y)² = (1/2)(0.7668 - 1.0)² = (1/2)(0.0544) = 0.0272
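The forward pass above can be reproduced in a few lines of Python (a sketch for checking the arithmetic; variable names like `w11` simply mirror the subscripted symbols):

```python
import math

# Parameters and inputs from the example above
w11, w12, b1 = 0.1, 0.3, 0.1   # weights/bias for h1
w21, w22, b2 = 0.2, 0.4, 0.1   # weights/bias for h2
v1, v2, b3 = 0.5, 0.6, 0.1     # weights/bias for the output neuron
x1, x2, y = 1.0, 0.5, 1.0      # inputs and target

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hidden layer pre-activations and activations
z1 = w11 * x1 + w12 * x2 + b1      # 0.35
z2 = w21 * x1 + w22 * x2 + b2      # 0.50
a1, a2 = sigmoid(z1), sigmoid(z2)  # ≈ 0.5866, 0.6225

# Output and MSE loss
y_hat = v1 * a1 + v2 * a2 + b3     # ≈ 0.7668
loss = 0.5 * (y_hat - y) ** 2      # ≈ 0.0272
```

Keeping the unrounded intermediate values in variables avoids the small rounding drift that creeps in when each printed number is re-entered by hand.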
Step 2: Backward Pass — Output Layer
Start with the derivative of the loss:
∂L/∂ŷ = ŷ - y = 0.7668 - 1.0 = -0.2332
Gradients for output layer weights:
∂L/∂v₁ = ∂L/∂ŷ · ∂ŷ/∂v₁ = (-0.2332) · a₁ = (-0.2332)(0.5866) = -0.1368
∂L/∂v₂ = ∂L/∂ŷ · ∂ŷ/∂v₂ = (-0.2332) · a₂ = (-0.2332)(0.6225) = -0.1452
∂L/∂b₃ = ∂L/∂ŷ · 1 = -0.2332
Pass the signal back to the hidden layer:
∂L/∂a₁ = ∂L/∂ŷ · ∂ŷ/∂a₁ = (-0.2332) · v₁ = (-0.2332)(0.5) = -0.1166
∂L/∂a₂ = ∂L/∂ŷ · ∂ŷ/∂a₂ = (-0.2332) · v₂ = (-0.2332)(0.6) = -0.1399
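As a sketch, the output-layer gradients can be checked in Python (recomputing the forward values rather than hard-coding the rounded ones):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward-pass values from the worked example
v1, v2 = 0.5, 0.6
a1, a2 = sigmoid(0.35), sigmoid(0.50)
y_hat = v1 * a1 + v2 * a2 + 0.1
y = 1.0

# Output-layer gradients
dL_dyhat = y_hat - y       # ≈ -0.2332
dL_dv1 = dL_dyhat * a1     # ≈ -0.1368
dL_dv2 = dL_dyhat * a2     # ≈ -0.1452
dL_db3 = dL_dyhat          # ≈ -0.2332

# Signal passed back to the hidden activations
dL_da1 = dL_dyhat * v1     # ≈ -0.1166
dL_da2 = dL_dyhat * v2     # ≈ -0.1399
```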
Step 3: Backward Pass — Hidden Layer
Through the sigmoid activation:
The sigmoid derivative is σ(z)(1 - σ(z)):
σ'(z₁) = a₁(1 - a₁) = 0.5866(1 - 0.5866) = 0.2425
σ'(z₂) = a₂(1 - a₂) = 0.6225(1 - 0.6225) = 0.2350
Chain rule through the activation:
∂L/∂z₁ = ∂L/∂a₁ · σ'(z₁) = (-0.1166)(0.2425) = -0.0283
∂L/∂z₂ = ∂L/∂a₂ · σ'(z₂) = (-0.1399)(0.2350) = -0.0329
Gradients for hidden layer weights:
∂L/∂w₁₁ = ∂L/∂z₁ · x₁ = (-0.0283)(1.0) = -0.0283
∂L/∂w₁₂ = ∂L/∂z₁ · x₂ = (-0.0283)(0.5) = -0.0141
∂L/∂b₁ = ∂L/∂z₁ · 1 = -0.0283
∂L/∂w₂₁ = ∂L/∂z₂ · x₁ = (-0.0329)(1.0) = -0.0329
∂L/∂w₂₂ = ∂L/∂z₂ · x₂ = (-0.0329)(0.5) = -0.0164
∂L/∂b₂ = ∂L/∂z₂ · 1 = -0.0329
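The hidden-layer step, sketched in the same style (the signals −0.1166 and −0.1399 are the rounded values carried over from the output-layer step above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values carried over from the forward pass and output-layer backward pass
x1, x2 = 1.0, 0.5
a1, a2 = sigmoid(0.35), sigmoid(0.50)
dL_da1, dL_da2 = -0.1166, -0.1399

# Through the sigmoid: sigma'(z) = sigma(z)(1 - sigma(z)) = a(1 - a)
dL_dz1 = dL_da1 * a1 * (1 - a1)   # ≈ -0.0283
dL_dz2 = dL_da2 * a2 * (1 - a2)   # ≈ -0.0329

# Weight and bias gradients: incoming signal times the layer input
dL_dw11 = dL_dz1 * x1             # ≈ -0.0283
dL_dw12 = dL_dz1 * x2             # ≈ -0.0141
dL_db1 = dL_dz1
dL_dw21 = dL_dz2 * x1             # ≈ -0.0329
dL_dw22 = dL_dz2 * x2             # ≈ -0.0164
dL_db2 = dL_dz2
```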
Step 4: Parameter Update
With learning rate α = 0.5:
Output layer:
v₁ = 0.5 - 0.5(-0.1368) = 0.5684
v₂ = 0.6 - 0.5(-0.1452) = 0.6726
b₃ = 0.1 - 0.5(-0.2332) = 0.2166
Hidden layer:
w₁₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₁₂ = 0.3 - 0.5(-0.0141) = 0.3071
b₁ = 0.1 - 0.5(-0.0283) = 0.1141
w₂₁ = 0.2 - 0.5(-0.0329) = 0.2164
w₂₂ = 0.4 - 0.5(-0.0164) = 0.4082
b₂ = 0.1 - 0.5(-0.0329) = 0.1164
All parameters (weights and biases) increased because every gradient was negative: a negative gradient means the loss falls as that parameter rises, so the update pushes the prediction closer to the target of 1.0.
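The whole update can be sketched as one dictionary comprehension (parameter names are the ones from the example; the gradient values are the rounded ones computed above):

```python
# Gradient-descent update: theta <- theta - alpha * dL/dtheta, with alpha = 0.5
alpha = 0.5
params = {"v1": 0.5, "v2": 0.6, "b3": 0.1,
          "w11": 0.1, "w12": 0.3, "b1": 0.1,
          "w21": 0.2, "w22": 0.4, "b2": 0.1}
grads = {"v1": -0.1368, "v2": -0.1452, "b3": -0.2332,
         "w11": -0.0283, "w12": -0.0141, "b1": -0.0283,
         "w21": -0.0329, "w22": -0.0164, "b2": -0.0329}

updated = {name: value - alpha * grads[name] for name, value in params.items()}
# Every gradient is negative, so every parameter moves upward,
# e.g. v1: 0.5 - 0.5 * (-0.1368) = 0.5684
```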
The Pattern
Every layer's backward pass follows the same pattern:
1. Receive the gradient signal from the layer above: ∂L/∂(layer output)
2. Multiply by the local derivative of the activation: ∂L/∂z = ∂L/∂a · σ'(z)
3. Compute weight gradients: ∂L/∂W = ∂L/∂z · (input to this layer)ᵀ
4. Compute bias gradients: ∂L/∂b = ∂L/∂z
5. Pass the signal to the previous layer: ∂L/∂(layer input) = Wᵀ · ∂L/∂z
This pattern is the same for every layer, regardless of network depth. You just repeat it.
In Matrix Form
For a layer with weight matrix W, bias b, and activation σ:
Forward: z = Wx + b, a = σ(z)
Backward: δ = ∂L/∂a ⊙ σ'(z)
∂L/∂W = δ · xᵀ
∂L/∂b = δ
∂L/∂x = Wᵀ · δ (pass to previous layer)
where ⊙ denotes element-wise multiplication. This matrix formulation is what frameworks implement for efficient GPU computation.
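A minimal pure-Python sketch of this matrix form (helper names like `layer_forward` and `matvec` are illustrative, not from any framework; a real implementation would use NumPy or a deep learning library):

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    """Forward: z = Wx + b, a = sigma(z). Returns a and a cache for backward."""
    z = [zi + bi for zi, bi in zip(matvec(W, x), b)]
    a = [sigmoid(zi) for zi in z]
    return a, (x, a)

def layer_backward(W, dL_da, cache):
    """Backward pass for one sigmoid layer: returns dL/dW, dL/db, dL/dx."""
    x, a = cache
    delta = [g * ai * (1 - ai) for g, ai in zip(dL_da, a)]  # dL/da ⊙ sigma'(z)
    dW = [[d * xj for xj in x] for d in delta]              # delta · x^T
    db = delta                                              # dL/db = delta
    dx = matvec(transpose(W), delta)                        # W^T · delta
    return dW, db, dx

# The hidden layer of the worked example: rows are (w11, w12) and (w21, w22)
W = [[0.1, 0.3], [0.2, 0.4]]
b = [0.1, 0.1]
x = [1.0, 0.5]

a, cache = layer_forward(W, b, x)  # a ≈ [0.5866, 0.6225]
dW, db, dx = layer_backward(W, [-0.1166, -0.1399], cache)
# dW ≈ [[-0.0283, -0.0141], [-0.0329, -0.0164]], matching Step 3
```

Stacking layers is then just calling `layer_forward` left to right while saving the caches, and `layer_backward` right to left, feeding each layer's `dx` to the one before it.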
Why It Is Called "Back" Propagation
The algorithm propagates the error signal backward through the network:
Layer 1        Layer 2        Output        Loss
W₁, b₁   ──→   W₂, b₂   ──→   v, b₃   ──→    L
         ←──            ←──           ←──
∂L/∂W₁         ∂L/∂W₂         ∂L/∂v         ∂L/∂ŷ
Forward: left → right (compute predictions)
Backward: right → left (compute gradients)
The forward pass flows data from input to output. The backward pass flows gradients from output to input. Both passes visit every layer exactly once, making backpropagation O(n) in the number of layers.
Summary
- Backpropagation applies the chain rule systematically from the output layer to the input layer
- At each layer: receive gradient, multiply by local derivative, compute parameter gradients, pass signal backward
- The gradient at each weight equals the incoming gradient signal times the layer input
- The backward pass reuses intermediate results from the forward pass
- In matrix form, the backward pass uses transposed weight matrices to propagate gradients
- The algorithm visits each layer once in the forward pass and once in the backward pass
- This efficiency is what makes training deep networks with millions of parameters practical
The final lesson discusses backpropagation in practice — the challenges, solutions, and how modern frameworks handle it.