The Chain Rule for Multiple Variables
Real neural networks do not have a single chain of computation. Each layer has multiple weights, each neuron receives multiple inputs, and the loss depends on all parameters simultaneously. The multivariable chain rule handles this by summing contributions from all paths connecting an input to the output.
The Multivariable Chain Rule
If z depends on variables u₁, u₂, ..., uₘ, and each uᵢ depends on x, then:
dz/dx = (∂z/∂u₁)(du₁/dx) + (∂z/∂u₂)(du₂/dx) + ... + (∂z/∂uₘ)(duₘ/dx)
In compact form:
dz/dx = Σᵢ (∂z/∂uᵢ)(duᵢ/dx)
The key insight: when a variable influences the output through multiple paths, you sum the contributions from all paths.
Example: Two Paths
Suppose z = u² + v, where u = 2x and v = x³.
x ──→ u = 2x ──→ z = u² + v
└──→ v = x³ ──┘
Variable x affects z through both u and v. The total derivative is:
dz/dx = (∂z/∂u)(du/dx) + (∂z/∂v)(dv/dx)
= (2u)(2) + (1)(3x²)
= 4u + 3x²
= 4(2x) + 3x²
= 8x + 3x²
You can verify this by substituting directly: z = (2x)² + x³ = 4x² + x³, so dz/dx = 8x + 3x². The result matches.
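The same verification can be done numerically. The sketch below (function names are illustrative) compares the chain-rule result 8x + 3x² against a central finite difference of z evaluated as a function of x:

```python
# Numerically check dz/dx = 8x + 3x^2 for z = u^2 + v, u = 2x, v = x^3.

def z_of_x(x):
    u = 2 * x          # first path
    v = x ** 3         # second path
    return u ** 2 + v  # z = u^2 + v

def numeric_derivative(f, x, eps=1e-6):
    # central finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.5
analytic = 8 * x + 3 * x ** 2          # result from the chain rule above
numeric = numeric_derivative(z_of_x, x)
print(analytic, numeric)  # both approx. 18.75
```

The finite difference agrees with the analytic sum over both paths, which is exactly what the multivariable chain rule predicts.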
Why Summation Over Paths Matters
In a neural network, a weight in an early layer affects every neuron in subsequent layers. The loss depends on that weight through many different computational paths.
┌→ neuron 2a ─→ neuron 3a ─┐
w₁ → neuron 1 ─┤ ├→ Loss
└→ neuron 2b ─→ neuron 3b ─┘
The total derivative ∂L/∂w₁ must account for w₁'s influence through all paths from the first neuron to the loss. The multivariable chain rule does this by summing contributions along every path.
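A scalar toy version of this diagram can make the path summation concrete. In the sketch below, the intermediate functions (tanh for one branch, squaring for the other) are arbitrary choices for illustration, not part of any particular network:

```python
import math

def forward(w1, x=1.0):
    h = w1 * x            # neuron 1
    a = math.tanh(h)      # path through "neuron 2a"
    b = h ** 2            # path through "neuron 2b"
    return a + b          # loss combines both paths

def loss_grad_by_paths(w1, x=1.0):
    h = w1 * x
    # path A: L -> a -> h -> w1
    path_a = 1.0 * (1 - math.tanh(h) ** 2) * x
    # path B: L -> b -> h -> w1
    path_b = 1.0 * (2 * h) * x
    return path_a + path_b  # multivariable chain rule: sum over paths

w1 = 0.7
analytic = loss_grad_by_paths(w1)
eps = 1e-6
numeric = (forward(w1 + eps) - forward(w1 - eps)) / (2 * eps)
print(analytic, numeric)  # the two values agree
```

Each path contributes a product of local derivatives, and the total gradient is their sum.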
The Jacobian Matrix
When both the input and output are vectors, the chain rule involves matrices. If f: ℝⁿ → ℝᵐ maps an n-dimensional input to an m-dimensional output, the Jacobian matrix contains all partial derivatives:
J = | ∂f₁/∂x₁ ∂f₁/∂x₂ ... ∂f₁/∂xₙ |
| ∂f₂/∂x₁ ∂f₂/∂x₂ ... ∂f₂/∂xₙ |
| ... ... ... ... |
| ∂fₘ/∂x₁ ∂fₘ/∂x₂ ... ∂fₘ/∂xₙ |
Row i, column j tells you: how does the i-th output change when the j-th input changes?
For composed functions z = f(g(x)), the multivariable chain rule becomes matrix multiplication of Jacobians:
J_total = J_f(g(x)) · J_g(x)
Note that J_f is evaluated at the intermediate value g(x), not at x itself.
This is why linear algebra and calculus are deeply connected in ML — gradient computation is matrix multiplication.
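This can be checked numerically. The sketch below picks two small vector functions (chosen purely for illustration), multiplies their analytic Jacobians, and compares against a finite-difference Jacobian of the composition:

```python
import numpy as np

def g(x):                       # g: R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]])

def J_g(x):                     # analytic Jacobian of g
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

def f(u):                       # f: R^2 -> R^2
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_f(u):                     # analytic Jacobian of f
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1],         u[0]]])

def numeric_jacobian(func, x, eps=1e-6):
    # build the Jacobian one column at a time via central differences
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((func(x + e) - func(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.5, -1.2])
chain = J_f(g(x)) @ J_g(x)                    # Jacobians multiply
direct = numeric_jacobian(lambda t: f(g(t)), x)
print(np.allclose(chain, direct, atol=1e-5))  # True
```

Note that J_f is evaluated at the intermediate value g(x), mirroring how the scalar chain rule evaluates the outer derivative at the inner function's output.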
Example: A Two-Neuron Layer
Consider a layer with two neurons, each receiving two inputs:
a₁ = σ(w₁₁x₁ + w₁₂x₂ + b₁)
a₂ = σ(w₂₁x₁ + w₂₂x₂ + b₂)
The Jacobian of [a₁, a₂] with respect to [x₁, x₂] is:
J = | ∂a₁/∂x₁  ∂a₁/∂x₂ | = | σ'(z₁)·w₁₁  σ'(z₁)·w₁₂ |
    | ∂a₂/∂x₁  ∂a₂/∂x₂ |   | σ'(z₂)·w₂₁  σ'(z₂)·w₂₂ |
where z₁ = w₁₁x₁ + w₁₂x₂ + b₁ and z₂ = w₂₁x₁ + w₂₂x₂ + b₂.
Notice the pattern: each entry is the activation's derivative times the corresponding weight. This pattern repeats throughout backpropagation.
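The layer above can be sketched directly in code. The weights, biases, and inputs below are arbitrary illustrative values; the code builds the Jacobian entry-by-entry from the σ'(zᵢ)·wᵢⱼ pattern and checks one entry against a finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# weights, biases, inputs (illustrative values)
w11, w12, b1 = 0.4, -0.6, 0.1
w21, w22, b2 = 0.7,  0.2, -0.3
x1, x2 = 0.5, 1.5

z1 = w11 * x1 + w12 * x2 + b1
z2 = w21 * x1 + w22 * x2 + b2

# each entry: activation derivative times the matching weight
J = [[sigmoid_prime(z1) * w11, sigmoid_prime(z1) * w12],
     [sigmoid_prime(z2) * w21, sigmoid_prime(z2) * w22]]

# finite-difference check of the top-left entry, da1/dx1
eps = 1e-6
a1_of_x1 = lambda u: sigmoid(w11 * u + w12 * x2 + b1)
numeric = (a1_of_x1(x1 + eps) - a1_of_x1(x1 - eps)) / (2 * eps)
print(abs(J[0][0] - numeric) < 1e-6)  # True
```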
Partial Derivatives with Respect to Parameters
In ML, we typically want ∂L/∂wᵢⱼ — how the loss changes with respect to a specific weight. Using the multivariable chain rule:
∂L/∂wᵢⱼ = Σₖ (∂L/∂aₖ)(∂aₖ/∂wᵢⱼ)
summed over all outputs aₖ that depend on wᵢⱼ. In most architectures, each weight only directly affects one neuron's output, simplifying the sum to a single term.
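A minimal sketch of the single-term case, using the first neuron from the layer above with a squared-error loss (the loss and target value are illustrative assumptions): since w₁₁ only feeds a₁, the sum over outputs collapses to one product of local derivatives.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# illustrative weights, inputs, and target
w11, w12, b1 = 0.4, -0.6, 0.1
x1, x2, y1 = 0.5, 1.5, 1.0

def loss(w):
    a1 = sigmoid(w * x1 + w12 * x2 + b1)
    return (a1 - y1) ** 2

z1 = w11 * x1 + w12 * x2 + b1
a1 = sigmoid(z1)
# single-term chain rule: (dL/da1) * (da1/dz1) * (dz1/dw11)
grad = 2 * (a1 - y1) * a1 * (1 - a1) * x1

eps = 1e-6
numeric = (loss(w11 + eps) - loss(w11 - eps)) / (2 * eps)
print(abs(grad - numeric) < 1e-6)  # True
```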
Total Derivative vs. Partial Derivative
An important distinction:
- Partial derivative ∂f/∂x: How f changes with x, holding all other direct inputs constant
- Total derivative df/dx: How f changes with x, accounting for all indirect effects through other variables
In the example z = u² + v with u = 2x and v = x³:
- ∂z/∂x is not defined directly, since z is written in terms of u and v, not x
- dz/dx = 8x + 3x², accounting for x's influence through both u and v
In backpropagation, we compute total derivatives — the full effect of each parameter on the loss through all paths.
Summary
- When a variable influences the output through multiple paths, sum the derivatives along all paths
- The multivariable chain rule: dz/dx = Σᵢ (∂z/∂uᵢ)(duᵢ/dx)
- The Jacobian matrix collects all partial derivatives of a vector function
- For composed vector functions, the chain rule becomes Jacobian matrix multiplication
- Each weight in a neural network may affect the loss through many paths
- Backpropagation handles this summation automatically and efficiently
- Total derivatives account for all indirect effects; partial derivatives hold other variables constant
The next lesson introduces computational graphs, which make the chain rule visual and systematic — and are exactly how frameworks like PyTorch and TensorFlow compute gradients.

