Partial Derivatives Explained
When a function has multiple inputs, a regular derivative is not enough. You need to know how the output changes when you adjust one input while keeping all others fixed. This is a partial derivative, and it is how ML models figure out which parameter to change and by how much.
The Core Idea
Given a function f(x, y), the partial derivative with respect to x measures how f changes when x changes, while y is held constant. It is written as:
∂f/∂x
The symbol ∂ (a curly "d") replaces the straight d to signal that there are other variables being held constant.
Similarly, ∂f/∂y measures how f changes when y changes, while x is held constant.
Computing Partial Derivatives
The rule is simple: treat all other variables as constants and differentiate normally.
Example 1: f(x, y) = x² + 3xy + y²
To find ∂f/∂x, treat y as a constant:
∂f/∂x = 2x + 3y + 0 = 2x + 3y
To find ∂f/∂y, treat x as a constant:
∂f/∂y = 0 + 3x + 2y = 3x + 2y
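These analytic answers can be sanity-checked numerically by nudging one input while literally holding the other fixed. A minimal sketch (the function names `partial_x`/`partial_y` are just illustrative):

```python
# Finite-difference check of the partials of f(x, y) = x^2 + 3xy + y^2.
def f(x, y):
    return x**2 + 3*x*y + y**2

def partial_x(x, y, h=1e-6):
    # Nudge x only; y is held constant, exactly as the definition says.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-6):
    # Nudge y only; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 2.0, 1.0
print(partial_x(x, y))  # ≈ 2x + 3y = 7
print(partial_y(x, y))  # ≈ 3x + 2y = 8
```

The central-difference formula (f(x+h) − f(x−h)) / 2h is used because it converges faster than a one-sided difference.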
Example 2: f(w₁, w₂) = (w₁ · 5 + w₂ · 3 - 10)²
This looks like a loss function for a two-weight model with feature values 5 and 3, and target 10. Let e = 5w₁ + 3w₂ - 10.
∂f/∂w₁ = 2(5w₁ + 3w₂ - 10) · 5 = 10e
∂f/∂w₂ = 2(5w₁ + 3w₂ - 10) · 3 = 6e
Each partial derivative tells you: if you nudge this weight slightly, how much does the loss change?
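The chain-rule partials above can be verified the same way. A sketch (variable names are illustrative):

```python
# f(w1, w2) = (5*w1 + 3*w2 - 10)^2, with e = 5*w1 + 3*w2 - 10.
def loss(w1, w2):
    e = 5*w1 + 3*w2 - 10
    return e**2

def analytic_grads(w1, w2):
    e = 5*w1 + 3*w2 - 10
    return 10*e, 6*e          # chain rule: 2e * 5 and 2e * 3

w1, w2 = 1.0, 2.0             # here e = 5 + 6 - 10 = 1
h = 1e-6
num_w1 = (loss(w1 + h, w2) - loss(w1 - h, w2)) / (2 * h)
num_w2 = (loss(w1, w2 + h) - loss(w1, w2 - h)) / (2 * h)
print(analytic_grads(w1, w2))  # (10.0, 6.0)
print(num_w1, num_w2)          # ≈ 10, 6
```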
Geometric Interpretation
For a function of two variables, the partial derivative ∂f/∂x at a point is the slope of the surface in the x-direction. It is what you get if you "slice" the surface with a plane of constant y (a plane parallel to the x–f plane) and measure the slope of the resulting curve.
    f (loss)
    ^
    |          /   ← slope in x-direction = ∂f/∂x
    |         /
    |        / · · · · surface
    |       /  · · · ·
    +──────┴────────────> x
         point
∂f/∂y is the slope in the y-direction at the same point. Together, they describe how the surface tilts in both axis directions.
Partial Derivatives in ML Training
In a neural network with parameters w₁, w₂, ..., wₙ, the loss function L depends on all of them:
L(w₁, w₂, ..., wₙ)
Training requires computing:
∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ
Each partial derivative answers: "If I increase this one weight by a tiny amount while keeping everything else the same, does the loss go up or down, and by how much?"
Why "Partial" Matters
If you have a network with 1 million weights, each weight's partial derivative is computed as if the other 999,999 weights are frozen. This simplifies an impossibly complex multidimensional problem into 1 million one-dimensional problems.
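The "freeze everything else" idea translates directly into code. A minimal sketch of a numerical gradient that solves one 1-D problem per weight (`numerical_gradient` is an illustrative name, not a library function):

```python
# Central-difference gradient: each weight gets its own 1-D problem.
def numerical_gradient(L, w, h=1e-6):
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h                 # nudge only weight i ...
        wm[i] -= h                 # ... every other weight stays frozen
        grad.append((L(wp) - L(wm)) / (2 * h))
    return grad

# Example loss: L(w) = w1^2 + w2^2 + w3^2, so ∂L/∂wi = 2*wi.
L = lambda w: sum(x * x for x in w)
print(numerical_gradient(L, [1.0, -2.0, 3.0]))  # ≈ [2, -4, 6]
```

Real frameworks compute the same quantities with backpropagation rather than finite differences, since one forward/backward pass is vastly cheaper than 2 million loss evaluations.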
Higher-Order Partial Derivatives
Just as you can take the derivative of a derivative in single-variable calculus, you can take partial derivatives of partial derivatives:
∂²f/∂x² = second partial derivative with respect to x
∂²f/∂x∂y = mixed partial derivative (first y, then x)
The second partial derivative ∂²f/∂x² tells you how the slope itself is changing: whether the surface is curving upward (convex) or downward (concave) in that direction. Newton's method uses this curvature directly to take smarter steps; optimizers like Adam do not compute true second derivatives, but approximate similar scaling information from running statistics of the first-order gradients.
Practical Example: Linear Regression Loss
Consider linear regression with two features:
prediction = w₁x₁ + w₂x₂ + b
L = (1/2)(prediction - y)²
The partial derivatives are:
∂L/∂w₁ = (prediction - y) · x₁
∂L/∂w₂ = (prediction - y) · x₂
∂L/∂b = (prediction - y) · 1
Notice the pattern: each partial derivative equals the error times the corresponding input. This makes intuitive sense:
- If the error is large, the gradient is large (make a big correction)
- If the input feature is large, the gradient is large (that feature contributed more to the error)
- If the error is zero, all gradients are zero (no update needed — the model is correct)
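The error-times-input pattern is short enough to write out directly. A sketch for the two-feature case (`lin_reg_grads` is an illustrative name):

```python
# Gradients of L = (1/2)(pred - y)^2 for pred = w1*x1 + w2*x2 + b.
def lin_reg_grads(w1, w2, b, x1, x2, y):
    pred = w1*x1 + w2*x2 + b
    err = pred - y                        # the 1/2 cancels the 2 from the square
    return err * x1, err * x2, err * 1.0  # the bias's "input" is always 1

# pred = 1*2 + 1*3 + 0 = 5, err = 1, so gradients = inputs.
print(lin_reg_grads(1.0, 1.0, 0.0, 2.0, 3.0, 4.0))  # (2.0, 3.0, 1.0)
```

Note how the 1/2 in the loss exists purely to cancel the exponent when differentiating, leaving the clean error-times-input form.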
Notation Variants
You will encounter several notations for the same thing:
| Notation | Meaning |
|---|---|
| ∂f/∂x | Partial derivative of f with respect to x |
| ∂L/∂wᵢ | Partial of loss with respect to weight i |
| fₓ | Shorthand for ∂f/∂x |
| ∇ᵢL | The i-th component of the gradient (covered next lesson) |
Summary
- A partial derivative measures how a function changes when one input varies and all others are held constant
- Notation: ∂f/∂x uses curly ∂ to indicate other variables exist
- Computation: treat all other variables as constants and differentiate normally
- Geometrically: the slope of the surface in one direction
- In ML: each partial derivative tells the model how sensitive the loss is to one specific parameter
- The error-times-input pattern in linear regression gradients appears throughout ML
- Higher-order partial derivatives capture curvature information used by advanced optimizers
The next lesson combines all partial derivatives into a single powerful object: the gradient vector.