The Chain Rule for Multiple Variables
Real neural networks do not have a single chain of computation. Each layer has multiple weights, each neuron receives multiple inputs, and the loss depends on all parameters simultaneously. The multivariable chain rule handles this by summing contributions from all paths connecting an input to the output.
The Multivariable Chain Rule
If z depends on variables u₁, u₂, ..., uₘ, and each uᵢ depends on x, then:
dz/dx = (∂z/∂u₁)(du₁/dx) + (∂z/∂u₂)(du₂/dx) + ... + (∂z/∂uₘ)(duₘ/dx)
In compact form:
dz/dx = Σᵢ (∂z/∂uᵢ)(duᵢ/dx)
The key insight: when a variable influences the output through multiple paths, you sum the contributions from all paths.
Example: Two Paths
Suppose z = u² + v, where u = 2x and v = x³.
x ──→ u = 2x ──→ z = u² + v
└──→ v = x³ ──┘
Variable x affects z through both u and v. The total derivative is:
dz/dx = (∂z/∂u)(du/dx) + (∂z/∂v)(dv/dx)
= (2u)(2) + (1)(3x²)
= 4u + 3x²
= 4(2x) + 3x²
= 8x + 3x²
You can verify this by substituting directly: z = (2x)² + x³ = 4x² + x³, so dz/dx = 8x + 3x². The result matches.
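The same verification can be done numerically. The sketch below (function names are illustrative) compares the chain-rule result 8x + 3x² against a central finite difference of z evaluated as a function of x:

```python
# Numerically check dz/dx = 8x + 3x^2 for z = u^2 + v, u = 2x, v = x^3.

def z_of_x(x):
    u = 2 * x          # first path
    v = x ** 3         # second path
    return u ** 2 + v  # z = u^2 + v

def numeric_derivative(f, x, eps=1e-6):
    # central finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.5
analytic = 8 * x + 3 * x ** 2          # result from the chain rule above
numeric = numeric_derivative(z_of_x, x)
print(analytic, numeric)  # both approx. 18.75
```

The finite difference agrees with the analytic sum over both paths, which is exactly what the multivariable chain rule predicts.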
Why Summation Over Paths Matters
In a neural network, a weight in an early layer affects every neuron in subsequent layers. The loss depends on that weight through many different computational paths.
┌→ neuron 2a ─→ neuron 3a ─┐
w₁ → neuron 1 ─┤ ├→ Loss
└→ neuron 2b ─→ neuron 3b ─┘
The total derivative ∂L/∂w₁ must account for w₁'s influence through all paths from the first neuron to the loss. The multivariable chain rule does this by summing contributions along every path.
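A scalar toy version of this diagram can make the path summation concrete. In the sketch below, the intermediate functions (tanh for one branch, squaring for the other) are arbitrary choices for illustration, not part of any particular network:

```python
import math

def forward(w1, x=1.0):
    h = w1 * x            # neuron 1
    a = math.tanh(h)      # path through "neuron 2a"
    b = h ** 2            # path through "neuron 2b"
    return a + b          # loss combines both paths

def loss_grad_by_paths(w1, x=1.0):
    h = w1 * x
    # path A: L -> a -> h -> w1
    path_a = 1.0 * (1 - math.tanh(h) ** 2) * x
    # path B: L -> b -> h -> w1
    path_b = 1.0 * (2 * h) * x
    return path_a + path_b  # multivariable chain rule: sum over paths

w1 = 0.7
analytic = loss_grad_by_paths(w1)
eps = 1e-6
numeric = (forward(w1 + eps) - forward(w1 - eps)) / (2 * eps)
print(analytic, numeric)  # the two values agree
```

Each path contributes a product of local derivatives, and the total gradient is their sum.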
The Jacobian Matrix
When both the input and output are vectors, the chain rule involves matrices. If f: ℝⁿ → ℝᵐ maps an n-dimensional input to an m-dimensional output, the Jacobian matrix contains all partial derivatives:
J = | ∂f₁/∂x₁ ∂f₁/∂x₂ ... ∂f₁/∂xₙ |
| ∂f₂/∂x₁ ∂f₂/∂x₂ ... ∂f₂/∂xₙ |
| ... ... ... ... |
| ∂fₘ/∂x₁ ∂fₘ/∂x₂ ... ∂fₘ/∂xₙ |
Row i, column j tells you: how does the i-th output change when the j-th input changes?
For composed functions z = f(g(x)), the multivariable chain rule becomes matrix multiplication of Jacobians:
J_total = J_f(g(x)) · J_g(x)
Note that J_f is evaluated at the intermediate value g(x), not at x itself.
This is why linear algebra and calculus are deeply connected in ML — gradient computation is matrix multiplication.
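This can be checked numerically. The sketch below picks two small vector functions (chosen purely for illustration), multiplies their analytic Jacobians, and compares against a finite-difference Jacobian of the composition:

```python
import numpy as np

def g(x):                       # g: R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]])

def J_g(x):                     # analytic Jacobian of g
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

def f(u):                       # f: R^2 -> R^2
    return np.array([np.sin(u[0]), u[0] * u[1]])

def J_f(u):                     # analytic Jacobian of f
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1],         u[0]]])

def numeric_jacobian(func, x, eps=1e-6):
    # build the Jacobian one column at a time via central differences
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((func(x + e) - func(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.5, -1.2])
chain = J_f(g(x)) @ J_g(x)                    # Jacobians multiply
direct = numeric_jacobian(lambda t: f(g(t)), x)
print(np.allclose(chain, direct, atol=1e-5))  # True
```

Note that J_f is evaluated at the intermediate value g(x), mirroring how the scalar chain rule evaluates the outer derivative at the inner function's output.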
Example: A Two-Neuron Layer
Consider a layer with two neurons, each receiving two inputs:
a₁ = σ(w₁₁x₁ + w₁₂x₂ + b₁)
a₂ = σ(w₂₁x₁ + w₂₂x₂ + b₂)
The Jacobian of [a₁, a₂] with respect to [x₁, x₂] is:
J = | ∂a₁/∂x₁  ∂a₁/∂x₂ | = | σ'(z₁)·w₁₁  σ'(z₁)·w₁₂ |
    | ∂a₂/∂x₁  ∂a₂/∂x₂ |   | σ'(z₂)·w₂₁  σ'(z₂)·w₂₂ |
where z₁ = w₁₁x₁ + w₁₂x₂ + b₁ and z₂ = w₂₁x₁ + w₂₂x₂ + b₂.
Notice the pattern: each entry is the activation's derivative times the corresponding weight. This pattern repeats throughout backpropagation.
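The layer above can be sketched directly in code. The weights, biases, and inputs below are arbitrary illustrative values; the code builds the Jacobian entry-by-entry from the σ'(zᵢ)·wᵢⱼ pattern and checks one entry against a finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# weights, biases, inputs (illustrative values)
w11, w12, b1 = 0.4, -0.6, 0.1
w21, w22, b2 = 0.7,  0.2, -0.3
x1, x2 = 0.5, 1.5

z1 = w11 * x1 + w12 * x2 + b1
z2 = w21 * x1 + w22 * x2 + b2

# each entry: activation derivative times the matching weight
J = [[sigmoid_prime(z1) * w11, sigmoid_prime(z1) * w12],
     [sigmoid_prime(z2) * w21, sigmoid_prime(z2) * w22]]

# finite-difference check of the top-left entry, da1/dx1
eps = 1e-6
a1_of_x1 = lambda u: sigmoid(w11 * u + w12 * x2 + b1)
numeric = (a1_of_x1(x1 + eps) - a1_of_x1(x1 - eps)) / (2 * eps)
print(abs(J[0][0] - numeric) < 1e-6)  # True
```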
Partial Derivatives with Respect to Parameters
In ML, we typically want ∂L/∂wᵢⱼ — how the loss changes with respect to a specific weight. Using the multivariable chain rule:
∂L/∂wᵢⱼ = Σₖ (∂L/∂aₖ)(∂aₖ/∂wᵢⱼ)
summed over all outputs aₖ that depend on wᵢⱼ. In most architectures, each weight only directly affects one neuron's output, simplifying the sum to a single term.
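A minimal sketch of the single-term case, using the first neuron from the layer above with a squared-error loss (the loss and target value are illustrative assumptions): since w₁₁ only feeds a₁, the sum over outputs collapses to one product of local derivatives.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# illustrative weights, inputs, and target
w11, w12, b1 = 0.4, -0.6, 0.1
x1, x2, y1 = 0.5, 1.5, 1.0

def loss(w):
    a1 = sigmoid(w * x1 + w12 * x2 + b1)
    return (a1 - y1) ** 2

z1 = w11 * x1 + w12 * x2 + b1
a1 = sigmoid(z1)
# single-term chain rule: (dL/da1) * (da1/dz1) * (dz1/dw11)
grad = 2 * (a1 - y1) * a1 * (1 - a1) * x1

eps = 1e-6
numeric = (loss(w11 + eps) - loss(w11 - eps)) / (2 * eps)
print(abs(grad - numeric) < 1e-6)  # True
```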
Total Derivative vs. Partial Derivative
An important distinction:
- Partial derivative ∂f/∂x: How f changes with x, holding all other direct inputs constant
- Total derivative df/dx: How f changes with x, accounting for all indirect effects through other variables
In the example z = u² + v with u = 2x and v = x³:
- ∂z/∂x is not defined directly, since z is written in terms of u and v, not x
- dz/dx = 8x + 3x², accounting for x's influence through both u and v
In backpropagation, we compute total derivatives — the full effect of each parameter on the loss through all paths.
Summary
- When a variable influences the output through multiple paths, sum the derivatives along all paths
- The multivariable chain rule: dz/dx = Σᵢ (∂z/∂uᵢ)(duᵢ/dx)
- The Jacobian matrix collects all partial derivatives of a vector function
- For composed vector functions, the chain rule becomes Jacobian matrix multiplication
- Each weight in a neural network may affect the loss through many paths
- Backpropagation handles this summation automatically and efficiently
- Total derivatives account for all indirect effects; partial derivatives hold other variables constant
The next lesson introduces computational graphs, which make the chain rule visual and systematic — and are exactly how frameworks like PyTorch and TensorFlow compute gradients.

