Backpropagation in Practice
The mathematical foundation of backpropagation is clean and elegant. In practice, however, deep networks introduce challenges: gradients can vanish or explode, numerical precision matters, and architectural choices determine whether training succeeds. This lesson covers the practical realities of backpropagation in modern deep learning.
The Vanishing Gradient Problem
When gradients pass through many layers, each layer multiplies by its local derivative. If those local derivatives are consistently less than 1, the gradient shrinks exponentially:
∂L/∂w₁ = ∂L/∂a₃ · σ'(z₃) · w₃ · σ'(z₂) · w₂ · σ'(z₁) · x
If each σ'(z) ≈ 0.25 and each w ≈ 0.5, every layer contributes a factor of roughly 0.25 × 0.5 = 0.125:
gradient ≈ 0.125 × 0.125 × 0.125 × ... (one factor per layer) = tiny number
For a 10-layer network with sigmoid activations, the gradient reaching the first layer can be millions of times smaller than the gradient at the output layer. Early layers barely learn.
Gradient magnitude by layer (vanishing):
Layer 10: ████████████████ (strong signal)
Layer 8: ████████ (moderate)
Layer 6: ████ (weak)
Layer 4: ██ (very weak)
Layer 2: █ (almost nothing)
Layer 1: · (vanished)
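The decay sketched above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming each layer contributes the factor σ'(z) · w ≈ 0.25 × 0.5 = 0.125 from the example:

```python
# Hypothetical per-layer factor: sigma'(z) * w ≈ 0.25 * 0.5
factor = 0.25 * 0.5

# Gradient scale after passing backward through 1..10 layers
for depth in range(1, 11):
    print(f"{depth} layers back: gradient scale ≈ {factor ** depth:.3e}")
```

After ten multiplications the scale is below 10⁻⁹, matching the "vanished" bar for layer 1.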
Why Sigmoid Causes This
The sigmoid derivative σ'(z) = σ(z)(1 - σ(z)) has a maximum of 0.25 (at z = 0). The activation alone therefore scales the gradient by at most 0.25 per layer; unless the weights are large enough to compensate, the gradient decays exponentially through deep networks.
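A quick numerical check of that maximum, using the derivative identity above (a minimal sketch in plain Python):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum
print(sigmoid_grad(5.0))   # ≈ 0.0066: saturation nearly kills the gradient
```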
The Exploding Gradient Problem
The opposite can also occur. If local derivatives are consistently greater than 1, the gradient grows exponentially:
If each factor ≈ 2:
2 × 2 × 2 × ... × 2 (10 times) = 1024
Exploding gradients cause enormous parameter updates, leading to NaN values (numerical overflow) and training failure.
Solutions to Gradient Problems
ReLU Activation
ReLU (f(x) = max(0, x)) has a derivative of exactly 1 for positive inputs and 0 for negative inputs. For active neurons, the gradient passes through without shrinking:
ReLU derivative:
x < 0: f'(x) = 0 (neuron is "dead" — no gradient)
x > 0: f'(x) = 1 (perfect gradient flow)
This largely solved the vanishing gradient problem for feedforward networks and is a key reason ReLU became the dominant activation function.
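The piecewise derivative is trivial to write down. A minimal sketch, using the common convention that the subgradient at x = 0 is taken as 0:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    # (x = 0 is non-differentiable; frameworks conventionally use 0 there)
    return 1.0 if x > 0 else 0.0

# The gradient passes through active neurons unchanged, no matter the depth
chain = 1.0
for _ in range(10):
    chain *= relu_grad(3.0)   # an "active" neuron at every layer
print(chain)   # 1.0, no decay
```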
Gradient Clipping
To prevent exploding gradients, gradient clipping caps the gradient magnitude:
import numpy as np

# Clip by value: clamp each gradient component to [-threshold, threshold]
gradient = np.clip(gradient, -threshold, threshold)

# Clip by norm (more common): rescale the whole vector, keeping its direction
norm = np.linalg.norm(gradient)
if norm > threshold:
    gradient = threshold * (gradient / norm)
Gradient clipping preserves the direction of the gradient but limits its size. It is standard practice in training RNNs and large language models.
Skip Connections (Residual Networks)
Skip connections add the input of a layer block directly to its output:
x ──→ [layers] ──→ (+) ──→ output
└──────────────────┘
Mathematically: output = F(x) + x, where F is the layer block. The gradient of this is:
∂output/∂x = ∂F/∂x + 1
The +1 ensures the gradient always has a direct path backward (the identity connection), preventing vanishing. This innovation enabled training of networks with 100+ layers (ResNets).
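That +1 can be verified numerically without any framework. A sketch with a hypothetical block F(x) = 0.1x², checked against a central finite difference:

```python
def F(x):
    return 0.1 * x * x        # hypothetical layer block

def residual(x):
    return F(x) + x           # skip connection: output = F(x) + x

x, h = 2.0, 1e-6
numeric = (residual(x + h) - residual(x - h)) / (2 * h)
analytic = 0.2 * x + 1.0      # dF/dx + 1
print(numeric, analytic)      # both ≈ 1.4
```

Even if F's own derivative were tiny, the identity term keeps the total gradient near 1.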
Weight Initialization
Careful initialization of weights ensures gradients start in a reasonable range:
| Method | Scale | Used With |
|---|---|---|
| Xavier/Glorot | √(2 / (fan_in + fan_out)) | Sigmoid, Tanh |
| He/Kaiming | √(2 / fan_in) | ReLU |
| LeCun | √(1 / fan_in) | SELU |
where fan_in is the number of inputs to the layer and fan_out is the number of outputs.
These initializations ensure that the variance of activations and gradients is preserved across layers at the start of training.
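The variance-preservation claim is easy to spot-check. A sketch using NumPy, with He initialization feeding a ReLU layer (the layer sizes and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512

# He/Kaiming: weights drawn with std = sqrt(2 / fan_in)
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

x = rng.normal(size=fan_in)    # unit-variance input activations
pre = W @ x                    # pre-activations: variance ≈ 2
post = np.maximum(pre, 0.0)    # ReLU zeroes half the units, halving the energy

print(pre.var())               # ≈ 2
print((post ** 2).mean())      # ≈ 1: the signal scale survives the layer
```

The factor 2 in the He formula exists precisely to compensate for ReLU discarding half of each layer's output.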
Automatic Differentiation in Frameworks
Modern frameworks compute backpropagation automatically:
# PyTorch
import torch
x = torch.tensor([1.0, 0.5])
w = torch.tensor([0.3, 0.7], requires_grad=True)
y_target = torch.tensor(1.0)
# Forward pass (framework builds computational graph)
y_pred = (w * x).sum()
loss = (y_pred - y_target) ** 2
# Backward pass (framework traverses graph automatically)
loss.backward()
# Gradients are now available
print(w.grad) # ∂L/∂w for each weight
You never manually compute chain rule derivatives. The framework:
- Records every operation during the forward pass
- Builds a computational graph
- On .backward(), traverses the graph in reverse, applying the chain rule
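To see that no magic is involved, here is the same computation with the chain rule applied by hand in plain Python, producing the same gradients the framework stores in w.grad:

```python
x = [1.0, 0.5]
w = [0.3, 0.7]
y_target = 1.0

# Forward pass
y_pred = sum(wi * xi for wi, xi in zip(w, x))   # 0.65
loss = (y_pred - y_target) ** 2

# Backward pass, one chain-rule step at a time:
# dL/dy_pred = 2 * (y_pred - y_target),  dy_pred/dw_i = x_i
dL_dy = 2.0 * (y_pred - y_target)
grad_w = [dL_dy * xi for xi in x]
print(grad_w)   # ≈ [-0.7, -0.35]
```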
Static vs. Dynamic Graphs
| Framework | Graph Type | Description |
|---|---|---|
| PyTorch | Dynamic | Graph rebuilt each forward pass (flexible, easy to debug) |
| JAX | Functional | Transform functions to get gradient functions |
| TensorFlow (eager) | Dynamic | Similar to PyTorch |
| TensorFlow (graph) | Static | Graph compiled once (faster, less flexible) |
Mixed Precision Training
Modern GPUs support half-precision (16-bit) floating-point arithmetic, which is faster and uses less memory. But 16-bit numbers have less precision, which can cause gradient underflow (tiny gradients become exactly zero).
Mixed precision training uses 16-bit for most operations but keeps a 32-bit copy of weights for accumulation:
Forward pass: 16-bit (fast)
Loss scaling: multiply loss by large constant (prevents underflow)
Backward pass: 16-bit (fast)
Weight update: 32-bit (precise)
Loss scaling artificially increases gradient magnitudes before the backward pass, ensuring they do not underflow in 16-bit. The scale is removed before the parameter update.
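Both the underflow problem and the loss-scaling fix can be seen directly with NumPy's half-precision type (the specific gradient value and scale below are illustrative):

```python
import numpy as np

grad = 1e-8                          # a tiny but meaningful gradient
print(np.float16(grad))              # 0.0: underflows in 16-bit

scale = 1024.0                       # loss scale, typically a power of two
scaled = np.float16(grad * scale)    # nonzero: survives in 16-bit
print(float(scaled) / scale)         # unscaled in full precision before the update
```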
Gradient Accumulation
When the desired batch size is too large for GPU memory, gradient accumulation splits it into smaller mini-batches:
for i in range(accumulation_steps):
    loss = model(mini_batch[i])
    loss.backward()        # gradients accumulate in .grad
optimizer.step()           # update with the accumulated gradient
optimizer.zero_grad()      # reset for the next effective batch
The gradients from each mini-batch are summed before the parameter update, simulating a larger effective batch size.
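That equivalence can be checked with a toy squared-error model in plain Python (the data and weight below are arbitrary):

```python
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)]   # (x, y) pairs
w = 0.5

def batch_grad(batch, w):
    # d/dw of sum((w*x - y)^2) over the batch
    return sum(2.0 * (w * x - y) * x for x, y in batch)

full = batch_grad(data, w)                                # one big batch
accumulated = batch_grad(data[:2], w) + batch_grad(data[2:], w)
print(full, accumulated)   # identical up to float rounding
```

Because the gradient of a sum is the sum of the gradients, splitting the batch changes memory use but not the update.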
What You Have Learned
This course has covered the complete calculus pipeline behind machine learning:
| Module | Concept | Role in ML |
|---|---|---|
| 1 | Derivatives | Measure sensitivity of loss to parameters |
| 2 | Partial derivatives and gradients | Handle functions with millions of inputs |
| 3 | The chain rule | Enable gradient computation through composed functions |
| 4 | Gradient descent | Update parameters using gradient information |
| 5 | Loss functions and optimization | Define the objective and navigate the landscape |
| 6 | Backpropagation | Efficiently compute gradients for the entire network |
These concepts form the mathematical foundation of all neural network training. Whether you are working with a simple linear regression or a 100-billion-parameter language model, the same calculus principles apply: compute the loss, propagate gradients backward through the network using the chain rule, and update parameters to reduce the loss.
Where to Go Next
With this calculus foundation, you are prepared to:
- Understand training dynamics when reading ML papers and documentation
- Debug training failures (vanishing/exploding gradients, divergence, slow convergence)
- Make informed choices about optimizers, learning rates, and architectures
- Study more advanced topics: second-order optimization, natural gradient methods, and neural architecture search
If you have not already, consider taking the Linear Algebra for AI and Probability & Statistics for AI courses on this platform. Together with this calculus course, they provide the complete mathematical foundation for understanding modern machine learning.
Summary
- Vanishing gradients: gradient shrinks exponentially through many layers with sigmoid activations
- Exploding gradients: gradient grows exponentially, causing training instability
- ReLU solves vanishing gradients with a derivative of exactly 1 for active neurons
- Gradient clipping prevents exploding gradients by capping gradient magnitude
- Skip connections (ResNets) add identity paths that guarantee gradient flow
- Weight initialization (Xavier, He) keeps gradients stable at the start of training
- Automatic differentiation in frameworks computes backpropagation without manual calculus
- Mixed precision training uses 16-bit arithmetic with loss scaling for efficiency
- Gradient accumulation simulates large batches when memory is limited
- The calculus covered in this course — derivatives, chain rule, gradients, and optimization — is the complete foundation for understanding how all neural networks learn

