Partial Derivatives Explained
When a function has multiple inputs, a regular derivative is not enough. You need to know how the output changes when you adjust one input while keeping all others fixed. This is a partial derivative, and it is how ML models figure out which parameter to change and by how much.
The Core Idea
Given a function f(x, y), the partial derivative with respect to x measures how f changes when x changes, while y is held constant. It is written as:
∂f/∂x
The symbol ∂ (a curly "d") replaces the straight d to signal that there are other variables being held constant.
Similarly, ∂f/∂y measures how f changes when y changes, while x is held constant.
Computing Partial Derivatives
The rule is simple: treat all other variables as constants and differentiate normally.
Example 1: f(x, y) = x² + 3xy + y²
To find ∂f/∂x, treat y as a constant:
∂f/∂x = 2x + 3y + 0 = 2x + 3y
To find ∂f/∂y, treat x as a constant:
∂f/∂y = 0 + 3x + 2y = 3x + 2y
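These analytic answers can be sanity-checked numerically by nudging one input while literally holding the other fixed. A minimal sketch (the function names `partial_x`/`partial_y` are just illustrative):

```python
# Finite-difference check of the partials of f(x, y) = x^2 + 3xy + y^2.
def f(x, y):
    return x**2 + 3*x*y + y**2

def partial_x(x, y, h=1e-6):
    # Nudge x only; y is held constant, exactly as the definition says.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-6):
    # Nudge y only; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 2.0, 1.0
print(partial_x(x, y))  # ≈ 2x + 3y = 7
print(partial_y(x, y))  # ≈ 3x + 2y = 8
```

The central-difference formula (f(x+h) − f(x−h)) / 2h is used because it converges faster than a one-sided difference.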
Example 2: f(w₁, w₂) = (w₁ · 5 + w₂ · 3 - 10)²
This looks like a loss function for a two-weight model with feature values 5 and 3, and target 10. Let e = 5w₁ + 3w₂ - 10.
∂f/∂w₁ = 2(5w₁ + 3w₂ - 10) · 5 = 10e
∂f/∂w₂ = 2(5w₁ + 3w₂ - 10) · 3 = 6e
Each partial derivative tells you: if you nudge this weight slightly, how much does the loss change?
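The chain-rule partials above can be verified the same way. A sketch (variable names are illustrative):

```python
# f(w1, w2) = (5*w1 + 3*w2 - 10)^2, with e = 5*w1 + 3*w2 - 10.
def loss(w1, w2):
    e = 5*w1 + 3*w2 - 10
    return e**2

def analytic_grads(w1, w2):
    e = 5*w1 + 3*w2 - 10
    return 10*e, 6*e          # chain rule: 2e * 5 and 2e * 3

w1, w2 = 1.0, 2.0             # here e = 5 + 6 - 10 = 1
h = 1e-6
num_w1 = (loss(w1 + h, w2) - loss(w1 - h, w2)) / (2 * h)
num_w2 = (loss(w1, w2 + h) - loss(w1, w2 - h)) / (2 * h)
print(analytic_grads(w1, w2))  # (10.0, 6.0)
print(num_w1, num_w2)          # ≈ 10, 6
```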
Geometric Interpretation
For a function of two variables, the partial derivative ∂f/∂x at a point is the slope of the surface in the x-direction. It is what you get if you "slice" the surface with a plane of constant y (a plane parallel to the x–f plane) and measure the slope of the resulting curve.
    f (loss)
    ^
    |          /   ← slope in x-direction = ∂f/∂x
    |         /
    |        / · · · · surface
    |       /  · · · ·
    +──────┴────────────> x
         point
∂f/∂y is the slope in the y-direction at the same point. Together, they describe how the surface tilts in both axis directions.
Partial Derivatives in ML Training
In a neural network with parameters w₁, w₂, ..., wₙ, the loss function L depends on all of them:
L(w₁, w₂, ..., wₙ)
Training requires computing:
∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ
Each partial derivative answers: "If I increase this one weight by a tiny amount while keeping everything else the same, does the loss go up or down, and by how much?"
Why "Partial" Matters
If you have a network with 1 million weights, each weight's partial derivative is computed as if the other 999,999 weights are frozen. This simplifies an impossibly complex multidimensional problem into 1 million one-dimensional problems.
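The "freeze everything else" idea translates directly into code. A minimal sketch of a numerical gradient that solves one 1-D problem per weight (`numerical_gradient` is an illustrative name, not a library function):

```python
# Central-difference gradient: each weight gets its own 1-D problem.
def numerical_gradient(L, w, h=1e-6):
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h                 # nudge only weight i ...
        wm[i] -= h                 # ... every other weight stays frozen
        grad.append((L(wp) - L(wm)) / (2 * h))
    return grad

# Example loss: L(w) = w1^2 + w2^2 + w3^2, so ∂L/∂wi = 2*wi.
L = lambda w: sum(x * x for x in w)
print(numerical_gradient(L, [1.0, -2.0, 3.0]))  # ≈ [2, -4, 6]
```

Real frameworks compute the same quantities with backpropagation rather than finite differences, since one forward/backward pass is vastly cheaper than 2 million loss evaluations.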
Higher-Order Partial Derivatives
Just as you can take the derivative of a derivative in single-variable calculus, you can take partial derivatives of partial derivatives:
∂²f/∂x² = second partial derivative with respect to x
∂²f/∂x∂y = mixed partial derivative (first y, then x)
The second partial derivative ∂²f/∂x² tells you how the slope itself is changing: whether the surface is curving upward (convex) or downward (concave) in that direction. Newton's method uses this curvature directly to take smarter steps; optimizers like Adam do not compute true second derivatives, but approximate similar scaling information from running statistics of the first-order gradients.
Practical Example: Linear Regression Loss
Consider linear regression with two features:
prediction = w₁x₁ + w₂x₂ + b
L = (1/2)(prediction - y)²
The partial derivatives are:
∂L/∂w₁ = (prediction - y) · x₁
∂L/∂w₂ = (prediction - y) · x₂
∂L/∂b = (prediction - y) · 1
Notice the pattern: each partial derivative equals the error times the corresponding input. This makes intuitive sense:
- If the error is large, the gradient is large (make a big correction)
- If the input feature is large, the gradient is large (that feature contributed more to the error)
- If the error is zero, all gradients are zero (no update needed — the model is correct)
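The error-times-input pattern is short enough to write out directly. A sketch for the two-feature case (`lin_reg_grads` is an illustrative name):

```python
# Gradients of L = (1/2)(pred - y)^2 for pred = w1*x1 + w2*x2 + b.
def lin_reg_grads(w1, w2, b, x1, x2, y):
    pred = w1*x1 + w2*x2 + b
    err = pred - y                        # the 1/2 cancels the 2 from the square
    return err * x1, err * x2, err * 1.0  # the bias's "input" is always 1

# pred = 1*2 + 1*3 + 0 = 5, err = 1, so gradients = inputs.
print(lin_reg_grads(1.0, 1.0, 0.0, 2.0, 3.0, 4.0))  # (2.0, 3.0, 1.0)
```

Note how the 1/2 in the loss exists purely to cancel the exponent when differentiating, leaving the clean error-times-input form.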
Notation Variants
You will encounter several notations for the same thing:
| Notation | Meaning |
|---|---|
| ∂f/∂x | Partial derivative of f with respect to x |
| ∂L/∂wᵢ | Partial of loss with respect to weight i |
| fₓ | Shorthand for ∂f/∂x |
| ∇ᵢL | The i-th component of the gradient (covered next lesson) |
Summary
- A partial derivative measures how a function changes when one input varies and all others are held constant
- Notation: ∂f/∂x uses curly ∂ to indicate other variables exist
- Computation: treat all other variables as constants and differentiate normally
- Geometrically: the slope of the surface in one direction
- In ML: each partial derivative tells the model how sensitive the loss is to one specific parameter
- The error-times-input pattern in linear regression gradients appears throughout ML
- Higher-order partial derivatives capture curvature information used by advanced optimizers
The next lesson combines all partial derivatives into a single powerful object: the gradient vector.