The Chain Rule for Single Variables
Neural networks are built by stacking layers — each layer takes the output of the previous one as input. Mathematically, this means neural networks are compositions of functions: functions inside functions. The chain rule is the calculus tool that tells you how to take the derivative of a composition.
Function Composition
When one function feeds into another, we call it composition. If g(x) = 2x and f(u) = u², then:
f(g(x)) = f(2x) = (2x)² = 4x²
The "inner function" g transforms the input first, then the "outer function" f processes the result.
In a neural network, this happens at every layer:
input → [layer 1: multiply by weights] → [activation function] → [layer 2: multiply by weights] → ...
Each arrow is a function, and the whole pipeline is a composition.
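The composition above can be sketched directly in code. This is a minimal illustration (the function names `g`, `f`, and `composed` are just labels chosen here, not anything standard):

```python
def g(x):
    # inner function: doubles the input
    return 2 * x

def f(u):
    # outer function: squares its input
    return u ** 2

def composed(x):
    # f(g(x)) = (2x)^2 = 4x^2
    return f(g(x))

# sanity check against the algebraic simplification 4x^2
assert composed(3) == 4 * 3 ** 2 == 36
```

The point is that `composed` never needed the simplified form 4x²; it just runs one function on the result of the other, exactly as a network pipeline does.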
The Chain Rule
If y = f(g(x)), the chain rule says:
dy/dx = f'(g(x)) · g'(x)
In words: the derivative of the outer function, evaluated at the inner function, times the derivative of the inner function.
An equivalent and more intuitive notation uses intermediate variables:
If y = f(u) and u = g(x), then:
dy/dx = dy/du · du/dx
This reads naturally: the rate of change of y with respect to x equals the rate of change of y with respect to u, times the rate of change of u with respect to x.
The Chain Analogy
Think of a chain of gears. If the first gear (x) turns, it makes the second gear (u) turn, which makes the third gear (y) turn.
x ──du/dx──→ u ──dy/du──→ y
How fast does y turn compared to x? Multiply the ratios:
dy/dx = dy/du × du/dx
If u turns twice as fast as x (du/dx = 2), and y turns three times as fast as u (dy/du = 3), then y turns six times as fast as x (dy/dx = 6).
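The gear example can be checked numerically. The sketch below uses a central-difference approximation (the helper name `numerical_derivative` is invented here for illustration): with du/dx = 2 and dy/du = 3, the derivative of the composition should come out to 6.

```python
def numerical_derivative(fn, x, h=1e-6):
    # central-difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

g = lambda x: 2 * x      # first gear: du/dx = 2
f = lambda u: 3 * u      # second gear: dy/du = 3
y = lambda x: f(g(x))    # composition: dy/dx should be 2 * 3 = 6

assert abs(numerical_derivative(y, 1.0) - 6.0) < 1e-6
```

The numerical estimate agrees with the product of the two ratios, which is all the chain rule claims.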
Example: Derivative of (2x + 1)³
Let u = 2x + 1 (inner function) and y = u³ (outer function).
dy/du = 3u² = 3(2x + 1)² (power rule on outer)
du/dx = 2 (derivative of inner)
dy/dx = 3(2x + 1)² · 2 = 6(2x + 1)²
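As a quick sanity check, the chain-rule answer 6(2x + 1)² can be compared against a finite-difference estimate of the original function:

```python
def y(x):
    # the composite function (2x + 1)^3
    return (2 * x + 1) ** 3

def dy_dx(x):
    # chain rule result: 6(2x + 1)^2
    return 6 * (2 * x + 1) ** 2

x, h = 1.5, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference
assert abs(dy_dx(x) - numeric) < 1e-3
```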
Example: Derivative of e⁻ˣ²
Let u = -x² (inner function) and y = eᵘ (outer function).
dy/du = eᵘ = e⁻ˣ² (exponential rule)
du/dx = -2x (power rule)
dy/dx = e⁻ˣ² · (-2x) = -2x · e⁻ˣ²
This function appears in Gaussian (normal) distributions, which are fundamental to ML.
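The same numerical check works here; this sketch verifies that -2x · e⁻ˣ² matches a finite-difference estimate:

```python
import math

def y(x):
    # the composite function e^(-x^2)
    return math.exp(-x ** 2)

def dy_dx(x):
    # chain rule result: -2x * e^(-x^2)
    return -2 * x * math.exp(-x ** 2)

x, h = 0.5, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference
assert abs(dy_dx(x) - numeric) < 1e-6
```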
Chaining More Than Two Functions
The chain rule extends to any number of composed functions. If y depends on u, which depends on v, which depends on x:
dy/dx = dy/du · du/dv · dv/dx
Example: A Three-Layer Computation
v = 3x (linear transformation)
u = v² (squaring)
y = sin(u) (activation-like function)
dv/dx = 3
du/dv = 2v = 6x
dy/du = cos(u) = cos(9x²)
dy/dx = cos(9x²) · 6x · 3 = 18x · cos(9x²)
Each factor in the chain corresponds to one "layer" of computation. This is exactly how backpropagation works in neural networks — it multiplies the local derivatives along the chain from output to input.
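The three-layer computation above can be written as a forward pass and a backward pass, mirroring how backpropagation is organized. This is a hand-written sketch, not a framework API:

```python
import math

def forward(x):
    # forward pass through the three "layers"
    v = 3 * x          # linear transformation
    u = v ** 2         # squaring
    return math.sin(u) # activation-like function

def dy_dx(x):
    # backward pass: multiply the local derivatives
    v = 3 * x
    u = v ** 2
    dv_dx = 3
    du_dv = 2 * v
    dy_du = math.cos(u)
    return dy_du * du_dv * dv_dx  # = 18x * cos(9x^2)

x, h = 0.3, 1e-6
numeric = (forward(x + h) - forward(x - h)) / (2 * h)
assert abs(dy_dx(x) - numeric) < 1e-4
```

Note that the backward pass recomputes the intermediate values v and u; real frameworks cache these during the forward pass instead.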
The Chain Rule in ML Context
Consider a single neuron with weight w, input x, and sigmoid activation:
z = wx (weighted input)
a = σ(z) (sigmoid activation)
L = (a - y)² (squared error loss)
To find how the loss L depends on the weight w, chain through each step:
dL/dw = dL/da · da/dz · dz/dw
Computing each piece:
dL/da = 2(a - y) (power rule)
da/dz = σ(z)(1 - σ(z)) (sigmoid derivative)
dz/dw = x (linear function)
Multiplying them together:
dL/dw = 2(a - y) · σ(z)(1 - σ(z)) · x
This tells us exactly how to update the weight w. Notice how the chain rule decomposes a complex derivative into simple, local derivatives multiplied together.
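The whole neuron example can be verified in a few lines. The sketch below computes dL/dw two ways — via the chain rule and via a finite difference — and checks that they agree (the variable names and the example values of w, x, and y are arbitrary choices for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    z = w * x            # weighted input
    a = sigmoid(z)       # sigmoid activation
    return (a - y) ** 2  # squared error loss

def dL_dw(w, x, y):
    # chain rule: dL/da * da/dz * dz/dw
    z = w * x
    a = sigmoid(z)
    dL_da = 2 * (a - y)   # power rule
    da_dz = a * (1 - a)   # sigmoid derivative
    dz_dw = x             # linear function
    return dL_da * da_dz * dz_dw

w, x, y, h = 0.8, 1.5, 1.0, 1e-6
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
assert abs(dL_dw(w, x, y) - numeric) < 1e-6
```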
Why the Chain Rule Is Essential
Without the chain rule, you would need to:
- Write out the entire composite function explicitly
- Simplify it algebraically
- Take the derivative of the resulting (potentially enormous) expression
For a neural network with millions of parameters and dozens of layers, this is impossible. The chain rule lets you compute the derivative locally at each layer and multiply the results — which is exactly what backpropagation does.
Summary
- The chain rule handles derivatives of composed functions (functions inside functions)
- If y = f(g(x)): dy/dx = f'(g(x)) · g'(x)
- Equivalently: dy/dx = (dy/du) · (du/dx)
- The chain extends to any number of composed functions: just multiply all the local derivatives
- Neural networks are chains of composed functions: linear transformations and activations
- The chain rule decomposes the derivative of a complex pipeline into a product of simple, local derivatives
- This decomposition is the mathematical basis of backpropagation
The next lesson extends the chain rule to functions with multiple variables — the reality of neural networks where every layer has many weights.

