Computational Graphs and the Chain Rule
Computational graphs are the visual and algorithmic bridge between calculus and code. Every ML framework — PyTorch, TensorFlow, JAX — builds a computational graph of your model and then uses the chain rule on that graph to compute gradients. Understanding computational graphs makes backpropagation concrete and intuitive.
What Is a Computational Graph?
A computational graph is a directed graph where:
- Nodes represent operations (add, multiply, apply activation) or variables (inputs, weights)
- Edges represent data flow (which values feed into which operations)
Example: f(x, y) = (x + y) · y
x ──→ [+] ──a──→ [×] ──→ f
y ──→ [+]         ↑
y ─────────────────┘
Step by step:
- Compute a = x + y
- Compute f = a · y
For x = 2, y = 3:
- a = 2 + 3 = 5
- f = 5 × 3 = 15
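These two steps can be checked in a few lines of plain Python (the `forward` helper below is illustrative, not framework code):

```python
def forward(x, y):
    """f(x, y) = (x + y) * y, computed one graph node at a time."""
    a = x + y   # addition node
    f = a * y   # multiplication node
    return f

print(forward(2, 3))  # 15
```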
Forward Pass and Backward Pass
The forward pass computes the output by flowing values from inputs to output (left to right). The backward pass computes derivatives by flowing gradients from output to inputs (right to left), using the chain rule at each node.
Forward Pass
x = 2 ──→ [+] ──a = 5──→ [×] ──→ f = 15
y = 3 ──→ [+]             ↑
y = 3 ────────────────────┘
Backward Pass
Start with df/df = 1 (the output's derivative with respect to itself is always 1) and work backward.
At the multiplication node f = a · y:
∂f/∂a = y = 3
∂f/∂y_mult = a = 5 (y's contribution through multiplication)
At the addition node a = x + y:
∂a/∂x = 1
∂a/∂y_add = 1 (y's contribution through addition)
Now chain them together:
∂f/∂x = ∂f/∂a · ∂a/∂x = 3 · 1 = 3
For y, we must sum contributions from both paths (y flows into both the add and the multiply):
∂f/∂y = (∂f/∂a · ∂a/∂y_add) + ∂f/∂y_mult
= (3 · 1) + 5
= 8
You can verify: f(x, y) = (x + y) · y = xy + y². So ∂f/∂x = y = 3 and ∂f/∂y = x + 2y = 2 + 6 = 8.
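The whole backward pass for this example fits in a short hand-coded sketch (the `gradients` helper is illustrative, not a library function):

```python
def gradients(x, y):
    """Hand-derived gradients of f(x, y) = (x + y) * y via local derivatives."""
    a = x + y        # forward: addition node
    # local derivatives at each node
    df_da = y        # at the multiply: ∂f/∂a for f = a · y
    df_dy_mult = a   # at the multiply: y's direct contribution
    da_dx = 1        # at the add: ∂a/∂x
    da_dy = 1        # at the add: ∂a/∂y
    # chain rule; y's two paths are summed
    df_dx = df_da * da_dx
    df_dy = df_da * da_dy + df_dy_mult
    return df_dx, df_dy

print(gradients(2, 3))  # (3, 8)
```

The returned values match the analytic gradients ∂f/∂x = y and ∂f/∂y = x + 2y for any inputs, not just x = 2, y = 3.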
Local Derivatives
The power of computational graphs is that each node only needs to know its local derivative — how its output changes with respect to its immediate inputs. The chain rule connects these local derivatives into the full gradient.
| Operation | Output | ∂output/∂input₁ | ∂output/∂input₂ |
|---|---|---|---|
| a + b | a + b | 1 | 1 |
| a × b | a · b | b | a |
| a² | a² | 2a | — |
| eᵃ | eᵃ | eᵃ | — |
| σ(a) | σ(a) | σ(a)(1-σ(a)) | — |
| max(0, a) | ReLU(a) | 1 if a > 0, 0 if a ≤ 0 | — |
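The table above can be written down directly as a lookup of small functions — a plain-Python sketch, with `local_grads` as an illustrative name:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Each entry returns the local derivative(s) of one operation
# with respect to its immediate input(s), as in the table.
local_grads = {
    "add":     lambda a, b: (1.0, 1.0),                        # ∂(a+b)/∂a, ∂(a+b)/∂b
    "mul":     lambda a, b: (b, a),                            # ∂(ab)/∂a,  ∂(ab)/∂b
    "square":  lambda a: (2 * a,),                             # ∂(a²)/∂a
    "exp":     lambda a: (math.exp(a),),                       # ∂(eᵃ)/∂a
    "sigmoid": lambda a: (sigmoid(a) * (1 - sigmoid(a)),),     # ∂σ(a)/∂a
    "relu":    lambda a: (1.0 if a > 0 else 0.0,),             # ∂max(0,a)/∂a
}

print(local_grads["mul"](5, 3))  # (3, 5): the multiply node from the example
```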
A Neural Network Computational Graph
Consider a single-neuron network: input x, weight w, bias b, sigmoid activation, squared error loss with target t.
x ──→ [×] ─z─→ [+] ─s─→ [σ] ─a─→ [-] ─e─→ [²] ──→ L
w ──→ [×]       ↑                 ↑
b ──────────────┘      t ─────────┘
Forward pass (with x=2, w=0.5, b=0.1, t=1):
z = wx = 0.5 × 2 = 1.0
s = z + b = 1.0 + 0.1 = 1.1
a = σ(1.1) ≈ 0.75
e = a - t = 0.75 - 1 = -0.25
L = e² = 0.0625
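The same forward pass in plain Python, using math.exp for the sigmoid (the values match the rounded numbers above):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x, w, b, t = 2.0, 0.5, 0.1, 1.0

z = w * x        # 1.0
s = z + b        # 1.1
a = sigmoid(s)   # ≈ 0.7503
e = a - t        # ≈ -0.2497
L = e ** 2       # ≈ 0.0624
print(round(L, 4))  # 0.0624
```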
Backward pass:
∂L/∂e = 2e = 2(-0.25) = -0.5
∂e/∂a = 1
∂a/∂s = σ(s)(1-σ(s)) = 0.75 × 0.25 = 0.1875
∂s/∂z = 1
∂z/∂w = x = 2
∂L/∂w = ∂L/∂e · ∂e/∂a · ∂a/∂s · ∂s/∂z · ∂z/∂w
= (-0.5)(1)(0.1875)(1)(2)
= -0.1875
The gradient ∂L/∂w = -0.1875 tells us: increasing w slightly would decrease the loss. So gradient descent should increase w.
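The same chain can be checked in plain Python (a sketch mirroring the derivation above; the result differs slightly from -0.1875 because the hand calculation rounded σ(1.1) to 0.75):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x, w, b, t = 2.0, 0.5, 0.1, 1.0

# forward pass (the backward pass reuses these values)
a = sigmoid(w * x + b)
e = a - t

# backward pass: multiply local derivatives along the chain
dL_de = 2 * e        # ≈ -0.4995
de_da = 1.0
da_ds = a * (1 - a)  # ≈ 0.1874
ds_dz = 1.0
dz_dw = x

dL_dw = dL_de * de_da * da_ds * ds_dz * dz_dw
print(round(dL_dw, 4))  # -0.1872
```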
Automatic Differentiation
ML frameworks implement this backward-pass algorithm automatically. When you write:
```python
# PyTorch
import torch

x, t = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w * x
s = z + b
a = torch.sigmoid(s)
L = (a - t) ** 2
L.backward()   # computes all gradients automatically
print(w.grad)  # ∂L/∂w ≈ -0.1872
```
The framework:
- Records the computational graph during the forward pass
- On .backward(), traverses the graph in reverse
- Applies the chain rule at each node
- Accumulates gradients for each parameter
This is called automatic differentiation (autodiff), specifically reverse-mode autodiff. It is not numerical approximation or symbolic differentiation — it is exact and efficient.
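Reverse-mode autodiff can be sketched in a few dozen lines — a toy scalar version in the spirit of a framework's recorded graph, not PyTorch's actual implementation:

```python
class Value:
    """Toy scalar reverse-mode autodiff: record the graph, then backpropagate."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes that fed into this one
        self._local_grads = local_grads  # ∂self/∂parent for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # topologically order the recorded graph ...
        order, seen = [], set()

        def topo(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node._parents:
                    topo(p)
                order.append(node)

        topo(self)
        # ... then apply the chain rule from output back to inputs
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local  # accumulate over multiple paths

# the running example f(x, y) = (x + y) · y
x, y = Value(2.0), Value(3.0)
f = (x + y) * y
f.backward()
print(f.data, x.grad, y.grad)  # 15.0 3.0 8.0
```

Note how the multi-path rule falls out for free: because gradients are accumulated with +=, y's contributions through the add and the multiply sum to 8 automatically.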
Why Reverse Mode Is Efficient
In a neural network with millions of parameters and one loss value, reverse-mode autodiff computes all gradients in one backward pass through the graph. The cost is a small constant multiple of one forward pass, independent of the number of parameters.
The alternative — computing each ∂L/∂wᵢ separately — would require one pass per parameter. For a model with 1 million parameters, that would be 1 million times slower.
| Method | Passes Required | When to Use |
|---|---|---|
| Forward mode | One per input | Few inputs, many outputs |
| Reverse mode | One per output | Many inputs, few outputs |
ML always has one loss output and many parameter inputs, making reverse mode the right choice.
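The contrast shows up even at this tiny scale: numerical differentiation needs one extra forward pass per parameter, where the backward pass above gets every gradient at once (a sketch for illustration; real frameworks do not compute gradients this way):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def loss(w, b, x=2.0, t=1.0):
    """Forward pass of the single-neuron example."""
    return (sigmoid(w * x + b) - t) ** 2

# Numerical gradients: one extra forward pass per parameter.
eps = 1e-6
params = {"w": 0.5, "b": 0.1}
base = loss(**params)
grads = {}
for name in params:
    bumped = dict(params)
    bumped[name] += eps
    grads[name] = (loss(**bumped) - base) / eps  # finite difference

print(round(grads["w"], 3), round(grads["b"], 3))  # -0.187 -0.094
```

With two parameters this costs three forward passes; with a million parameters it would cost a million and one, which is exactly the blow-up reverse mode avoids.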
Summary
- A computational graph represents a computation as nodes (operations) connected by edges (data flow)
- The forward pass computes the output from inputs
- The backward pass computes gradients from output to inputs using the chain rule
- Each node only needs its local derivative — the chain rule connects them
- When a variable has multiple paths to the output, sum the gradient contributions
- ML frameworks build computational graphs automatically and traverse them in reverse to compute gradients
- Reverse-mode automatic differentiation computes all parameter gradients in a single backward pass
- This efficiency is what makes training neural networks with millions of parameters practical
The next module puts these gradient computations to work: gradient descent, the optimization algorithm that actually updates model parameters.

