The Chain Rule for Single Variables
Neural networks are built by stacking layers — each layer takes the output of the previous one as input. Mathematically, this means neural networks are compositions of functions: functions inside functions. The chain rule is the calculus tool that tells you how to take the derivative of a composition.
Function Composition
When one function feeds into another, we call it composition. If g(x) = 2x and f(u) = u², then:
f(g(x)) = f(2x) = (2x)² = 4x²
The "inner function" g transforms the input first, then the "outer function" f processes the result.
In a neural network, this happens at every layer:
input → [layer 1: multiply by weights] → [activation function] → [layer 2: multiply by weights] → ...
Each arrow is a function, and the whole pipeline is a composition.
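The composition above can be sketched directly in code. This is a minimal illustration (the function names `g`, `f`, and `composed` are just labels chosen here, not anything standard):

```python
def g(x):
    # inner function: doubles the input
    return 2 * x

def f(u):
    # outer function: squares its input
    return u ** 2

def composed(x):
    # f(g(x)) = (2x)^2 = 4x^2
    return f(g(x))

# sanity check against the algebraic simplification 4x^2
assert composed(3) == 4 * 3 ** 2 == 36
```

The point is that `composed` never needed the simplified form 4x²; it just runs one function on the result of the other, exactly as a network pipeline does.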
The Chain Rule
If y = f(g(x)), the chain rule says:
dy/dx = f'(g(x)) · g'(x)
In words: the derivative of the outer function, evaluated at the inner function, times the derivative of the inner function.
An equivalent and more intuitive notation uses intermediate variables:
If y = f(u) and u = g(x), then:
dy/dx = dy/du · du/dx
This reads naturally: the rate of change of y with respect to x equals the rate of change of y with respect to u, times the rate of change of u with respect to x.
The Chain Analogy
Think of a chain of gears. If the first gear (x) turns, it makes the second gear (u) turn, which makes the third gear (y) turn.
x ──du/dx──→ u ──dy/du──→ y
How fast does y turn compared to x? Multiply the ratios:
dy/dx = dy/du × du/dx
If u turns twice as fast as x (du/dx = 2), and y turns three times as fast as u (dy/du = 3), then y turns six times as fast as x (dy/dx = 6).
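The gear example can be checked numerically. The sketch below uses a central-difference approximation (the helper name `numerical_derivative` is invented here for illustration): with du/dx = 2 and dy/du = 3, the derivative of the composition should come out to 6.

```python
def numerical_derivative(fn, x, h=1e-6):
    # central-difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

g = lambda x: 2 * x      # first gear: du/dx = 2
f = lambda u: 3 * u      # second gear: dy/du = 3
y = lambda x: f(g(x))    # composition: dy/dx should be 2 * 3 = 6

assert abs(numerical_derivative(y, 1.0) - 6.0) < 1e-6
```

The numerical estimate agrees with the product of the two ratios, which is all the chain rule claims.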
Example: Derivative of (2x + 1)³
Let u = 2x + 1 (inner function) and y = u³ (outer function).
dy/du = 3u² = 3(2x + 1)² (power rule on outer)
du/dx = 2 (derivative of inner)
dy/dx = 3(2x + 1)² · 2 = 6(2x + 1)²
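As a quick sanity check, the chain-rule answer 6(2x + 1)² can be compared against a finite-difference estimate of the original function:

```python
def y(x):
    # the composite function (2x + 1)^3
    return (2 * x + 1) ** 3

def dy_dx(x):
    # chain rule result: 6(2x + 1)^2
    return 6 * (2 * x + 1) ** 2

x, h = 1.5, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference
assert abs(dy_dx(x) - numeric) < 1e-3
```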
Example: Derivative of e⁻ˣ²
Let u = -x² (inner function) and y = eᵘ (outer function).
dy/du = eᵘ = e⁻ˣ² (exponential rule)
du/dx = -2x (power rule)
dy/dx = e⁻ˣ² · (-2x) = -2x · e⁻ˣ²
This function appears in Gaussian (normal) distributions, which are fundamental to ML.
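The same numerical check works here; this sketch verifies that -2x · e⁻ˣ² matches a finite-difference estimate:

```python
import math

def y(x):
    # the composite function e^(-x^2)
    return math.exp(-x ** 2)

def dy_dx(x):
    # chain rule result: -2x * e^(-x^2)
    return -2 * x * math.exp(-x ** 2)

x, h = 0.5, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference
assert abs(dy_dx(x) - numeric) < 1e-6
```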
Chaining More Than Two Functions
The chain rule extends to any number of composed functions. If y depends on u, which depends on v, which depends on x:
dy/dx = dy/du · du/dv · dv/dx
Example: A Three-Layer Computation
v = 3x (linear transformation)
u = v² (squaring)
y = sin(u) (activation-like function)
dv/dx = 3
du/dv = 2v = 6x
dy/du = cos(u) = cos(9x²)
dy/dx = cos(9x²) · 6x · 3 = 18x · cos(9x²)
Each factor in the chain corresponds to one "layer" of computation. This is exactly how backpropagation works in neural networks — it multiplies the local derivatives along the chain from output to input.
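The three-layer computation above can be written as a forward pass and a backward pass, mirroring how backpropagation is organized. This is a hand-written sketch, not a framework API:

```python
import math

def forward(x):
    # forward pass through the three "layers"
    v = 3 * x          # linear transformation
    u = v ** 2         # squaring
    return math.sin(u) # activation-like function

def dy_dx(x):
    # backward pass: multiply the local derivatives
    v = 3 * x
    u = v ** 2
    dv_dx = 3
    du_dv = 2 * v
    dy_du = math.cos(u)
    return dy_du * du_dv * dv_dx  # = 18x * cos(9x^2)

x, h = 0.3, 1e-6
numeric = (forward(x + h) - forward(x - h)) / (2 * h)
assert abs(dy_dx(x) - numeric) < 1e-4
```

Note that the backward pass recomputes the intermediate values v and u; real frameworks cache these during the forward pass instead.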
The Chain Rule in ML Context
Consider a single neuron with weight w, input x, and sigmoid activation:
z = wx (weighted input)
a = σ(z) (sigmoid activation)
L = (a - y)² (squared error loss)
To find how the loss L depends on the weight w, chain through each step:
dL/dw = dL/da · da/dz · dz/dw
Computing each piece:
dL/da = 2(a - y) (power rule)
da/dz = σ(z)(1 - σ(z)) (sigmoid derivative)
dz/dw = x (linear function)
Multiplying them together:
dL/dw = 2(a - y) · σ(z)(1 - σ(z)) · x
This tells us exactly how to update the weight w. Notice how the chain rule decomposes a complex derivative into simple, local derivatives multiplied together.
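The whole neuron example can be verified in a few lines. The sketch below computes dL/dw two ways — via the chain rule and via a finite difference — and checks that they agree (the variable names and the example values of w, x, and y are arbitrary choices for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    z = w * x            # weighted input
    a = sigmoid(z)       # sigmoid activation
    return (a - y) ** 2  # squared error loss

def dL_dw(w, x, y):
    # chain rule: dL/da * da/dz * dz/dw
    z = w * x
    a = sigmoid(z)
    dL_da = 2 * (a - y)   # power rule
    da_dz = a * (1 - a)   # sigmoid derivative
    dz_dw = x             # linear function
    return dL_da * da_dz * dz_dw

w, x, y, h = 0.8, 1.5, 1.0, 1e-6
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
assert abs(dL_dw(w, x, y) - numeric) < 1e-6
```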
Why the Chain Rule Is Essential
Without the chain rule, you would need to:
- Write out the entire composite function explicitly
- Simplify it algebraically
- Take the derivative of the resulting (potentially enormous) expression
For a neural network with millions of parameters and dozens of layers, this is impossible. The chain rule lets you compute the derivative locally at each layer and multiply the results — which is exactly what backpropagation does.
Summary
- The chain rule handles derivatives of composed functions (functions inside functions)
- If y = f(g(x)): dy/dx = f'(g(x)) · g'(x)
- Equivalently: dy/dx = (dy/du) · (du/dx)
- The chain extends to any number of composed functions: just multiply all the local derivatives
- Neural networks are chains of composed functions: linear transformations and activations
- The chain rule decomposes the derivative of a complex pipeline into a product of simple, local derivatives
- This decomposition is the mathematical basis of backpropagation
The next lesson extends the chain rule to functions with multiple variables — the reality of neural networks where every layer has many weights.

