Computational Graphs and the Chain Rule
Computational graphs are the visual and algorithmic bridge between calculus and code. Every ML framework — PyTorch, TensorFlow, JAX — builds a computational graph of your model and then uses the chain rule on that graph to compute gradients. Understanding computational graphs makes backpropagation concrete and intuitive.
What Is a Computational Graph?
A computational graph is a directed graph where:
- Nodes represent operations (add, multiply, apply activation) or variables (inputs, weights)
- Edges represent data flow (which values feed into which operations)
Example: f(x, y) = (x + y) · y
x ──→ [+] ──a──→ [×] ──→ f
y ──→ [+]         ↑
y ─────────────────┘
Step by step:
- Compute a = x + y
- Compute f = a · y
For x = 2, y = 3:
- a = 2 + 3 = 5
- f = 5 × 3 = 15
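These two steps can be checked in a few lines of plain Python (the `forward` helper below is illustrative, not framework code):

```python
def forward(x, y):
    """f(x, y) = (x + y) * y, computed one graph node at a time."""
    a = x + y   # addition node
    f = a * y   # multiplication node
    return f

print(forward(2, 3))  # 15
```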
Forward Pass and Backward Pass
The forward pass computes the output by flowing values from inputs to output (left to right). The backward pass computes derivatives by flowing gradients from output to inputs (right to left), using the chain rule at each node.
Forward Pass
x = 2 ──→ [+] ──a = 5──→ [×] ──→ f = 15
y = 3 ──→ [+]             ↑
y = 3 ────────────────────┘
Backward Pass
Start with df/df = 1 (the output's derivative with respect to itself is always 1) and work backward.
At the multiplication node f = a · y:
∂f/∂a = y = 3
∂f/∂y_mult = a = 5 (y's contribution through multiplication)
At the addition node a = x + y:
∂a/∂x = 1
∂a/∂y_add = 1 (y's contribution through addition)
Now chain them together:
∂f/∂x = ∂f/∂a · ∂a/∂x = 3 · 1 = 3
For y, we must sum contributions from both paths (y flows into both the add and the multiply):
∂f/∂y = (∂f/∂a · ∂a/∂y_add) + ∂f/∂y_mult
= (3 · 1) + 5
= 8
You can verify: f(x, y) = (x + y) · y = xy + y². So ∂f/∂x = y = 3 and ∂f/∂y = x + 2y = 2 + 6 = 8.
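The whole backward pass for this example fits in a short hand-coded sketch (the `gradients` helper is illustrative, not a library function):

```python
def gradients(x, y):
    """Hand-derived gradients of f(x, y) = (x + y) * y via local derivatives."""
    a = x + y        # forward: addition node
    # local derivatives at each node
    df_da = y        # at the multiply: ∂f/∂a for f = a · y
    df_dy_mult = a   # at the multiply: y's direct contribution
    da_dx = 1        # at the add: ∂a/∂x
    da_dy = 1        # at the add: ∂a/∂y
    # chain rule; y's two paths are summed
    df_dx = df_da * da_dx
    df_dy = df_da * da_dy + df_dy_mult
    return df_dx, df_dy

print(gradients(2, 3))  # (3, 8)
```

The returned values match the analytic gradients ∂f/∂x = y and ∂f/∂y = x + 2y for any inputs, not just x = 2, y = 3.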
Local Derivatives
The power of computational graphs is that each node only needs to know its local derivative — how its output changes with respect to its immediate inputs. The chain rule connects these local derivatives into the full gradient.
| Operation | Output | ∂output/∂input₁ | ∂output/∂input₂ |
|---|---|---|---|
| a + b | a + b | 1 | 1 |
| a × b | a · b | b | a |
| a² | a² | 2a | — |
| eᵃ | eᵃ | eᵃ | — |
| σ(a) | σ(a) | σ(a)(1-σ(a)) | — |
| max(0, a) | ReLU(a) | 1 if a > 0, 0 if a ≤ 0 | — |
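The table above can be written down directly as a lookup of small functions — a plain-Python sketch, with `local_grads` as an illustrative name:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Each entry returns the local derivative(s) of one operation
# with respect to its immediate input(s), as in the table.
local_grads = {
    "add":     lambda a, b: (1.0, 1.0),                        # ∂(a+b)/∂a, ∂(a+b)/∂b
    "mul":     lambda a, b: (b, a),                            # ∂(ab)/∂a,  ∂(ab)/∂b
    "square":  lambda a: (2 * a,),                             # ∂(a²)/∂a
    "exp":     lambda a: (math.exp(a),),                       # ∂(eᵃ)/∂a
    "sigmoid": lambda a: (sigmoid(a) * (1 - sigmoid(a)),),     # ∂σ(a)/∂a
    "relu":    lambda a: (1.0 if a > 0 else 0.0,),             # ∂max(0,a)/∂a
}

print(local_grads["mul"](5, 3))  # (3, 5): the multiply node from the example
```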
A Neural Network Computational Graph
Consider a single-neuron network: input x, weight w, bias b, sigmoid activation, squared error loss with target t.
x ──→ [×] ─z─→ [+] ─s─→ [σ] ─a─→ [-] ─e─→ [²] ──→ L
w ──→ [×]       ↑                 ↑
b ──────────────┘      t ─────────┘
Forward pass (with x=2, w=0.5, b=0.1, t=1):
z = wx = 0.5 × 2 = 1.0
s = z + b = 1.0 + 0.1 = 1.1
a = σ(1.1) ≈ 0.75
e = a - t = 0.75 - 1 = -0.25
L = e² = 0.0625
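The same forward pass in plain Python, using math.exp for the sigmoid (the values match the rounded numbers above):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x, w, b, t = 2.0, 0.5, 0.1, 1.0

z = w * x        # 1.0
s = z + b        # 1.1
a = sigmoid(s)   # ≈ 0.7503
e = a - t        # ≈ -0.2497
L = e ** 2       # ≈ 0.0624
print(round(L, 4))  # 0.0624
```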
Backward pass:
∂L/∂e = 2e = 2(-0.25) = -0.5
∂e/∂a = 1
∂a/∂s = σ(s)(1-σ(s)) = 0.75 × 0.25 = 0.1875
∂s/∂z = 1
∂z/∂w = x = 2
∂L/∂w = ∂L/∂e · ∂e/∂a · ∂a/∂s · ∂s/∂z · ∂z/∂w
= (-0.5)(1)(0.1875)(1)(2)
= -0.1875
The gradient ∂L/∂w = -0.1875 tells us: increasing w slightly would decrease the loss. So gradient descent should increase w.
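The same chain can be checked in plain Python (a sketch mirroring the derivation above; the result differs slightly from -0.1875 because the hand calculation rounded σ(1.1) to 0.75):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x, w, b, t = 2.0, 0.5, 0.1, 1.0

# forward pass (the backward pass reuses these values)
a = sigmoid(w * x + b)
e = a - t

# backward pass: multiply local derivatives along the chain
dL_de = 2 * e        # ≈ -0.4995
de_da = 1.0
da_ds = a * (1 - a)  # ≈ 0.1874
ds_dz = 1.0
dz_dw = x

dL_dw = dL_de * de_da * da_ds * ds_dz * dz_dw
print(round(dL_dw, 4))  # -0.1872
```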
Automatic Differentiation
ML frameworks implement this backward-pass algorithm automatically. When you write:
```python
# PyTorch
import torch

x, t = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w * x
s = z + b
a = torch.sigmoid(s)
L = (a - t) ** 2
L.backward()   # computes all gradients automatically
print(w.grad)  # ∂L/∂w ≈ -0.1872
```
The framework:
- Records the computational graph during the forward pass
- On .backward(), traverses the graph in reverse
- Applies the chain rule at each node
- Accumulates gradients for each parameter
This is called automatic differentiation (autodiff), specifically reverse-mode autodiff. It is not numerical approximation or symbolic differentiation — it is exact and efficient.
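Reverse-mode autodiff can be sketched in a few dozen lines — a toy scalar version in the spirit of a framework's recorded graph, not PyTorch's actual implementation:

```python
class Value:
    """Toy scalar reverse-mode autodiff: record the graph, then backpropagate."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes that fed into this one
        self._local_grads = local_grads  # ∂self/∂parent for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # topologically order the recorded graph ...
        order, seen = [], set()

        def topo(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node._parents:
                    topo(p)
                order.append(node)

        topo(self)
        # ... then apply the chain rule from output back to inputs
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local  # accumulate over multiple paths

# the running example f(x, y) = (x + y) · y
x, y = Value(2.0), Value(3.0)
f = (x + y) * y
f.backward()
print(f.data, x.grad, y.grad)  # 15.0 3.0 8.0
```

Note how the multi-path rule falls out for free: because gradients are accumulated with +=, y's contributions through the add and the multiply sum to 8 automatically.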
Why Reverse Mode Is Efficient
In a neural network with millions of parameters and one loss value, reverse-mode autodiff computes all gradients in one backward pass through the graph. The cost is a small constant multiple of one forward pass, independent of the number of parameters.
The alternative — computing each ∂L/∂wᵢ separately — would require one pass per parameter. For a model with 1 million parameters, that would be 1 million times slower.
| Method | Passes Required | When to Use |
|---|---|---|
| Forward mode | One per input | Few inputs, many outputs |
| Reverse mode | One per output | Many inputs, few outputs |
ML always has one loss output and many parameter inputs, making reverse mode the right choice.
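The contrast shows up even at this tiny scale: numerical differentiation needs one extra forward pass per parameter, where the backward pass above gets every gradient at once (a sketch for illustration; real frameworks do not compute gradients this way):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def loss(w, b, x=2.0, t=1.0):
    """Forward pass of the single-neuron example."""
    return (sigmoid(w * x + b) - t) ** 2

# Numerical gradients: one extra forward pass per parameter.
eps = 1e-6
params = {"w": 0.5, "b": 0.1}
base = loss(**params)
grads = {}
for name in params:
    bumped = dict(params)
    bumped[name] += eps
    grads[name] = (loss(**bumped) - base) / eps  # finite difference

print(round(grads["w"], 3), round(grads["b"], 3))  # -0.187 -0.094
```

With two parameters this costs three forward passes; with a million parameters it would cost a million and one, which is exactly the blow-up reverse mode avoids.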
Summary
- A computational graph represents a computation as nodes (operations) connected by edges (data flow)
- The forward pass computes the output from inputs
- The backward pass computes gradients from output to inputs using the chain rule
- Each node only needs its local derivative — the chain rule connects them
- When a variable has multiple paths to the output, sum the gradient contributions
- ML frameworks build computational graphs automatically and traverse them in reverse to compute gradients
- Reverse-mode automatic differentiation computes all parameter gradients in a single backward pass
- This efficiency is what makes training neural networks with millions of parameters practical
The next module puts these gradient computations to work: gradient descent, the optimization algorithm that actually updates model parameters.

