Derivative Rules You Actually Need
You do not need to memorize dozens of derivative formulas for machine learning. A small set of rules covers the vast majority of what you will encounter in loss functions, activation functions, and optimization. This lesson covers exactly those rules.
The Power Rule
If f(x) = xⁿ, then f'(x) = n · xⁿ⁻¹.
| Function | Derivative | Why It Matters |
|---|---|---|
| x² | 2x | Squared error loss |
| x³ | 3x² | Polynomial features |
| x¹ = x | 1 | Linear functions |
| x⁰ = 1 | 0 | Constants have zero derivative |
| x⁻¹ = 1/x | -1/x² | Inverse relationships |
| x^(1/2) = √x | 1/(2√x) | Square root normalization |
The power rule is the workhorse of calculus. Whenever you see a polynomial expression in a loss function, this is the rule you apply.
Example: Squared Error
The squared error for a single prediction is:
L = (prediction - target)²
Let e = prediction - target. Then L = e², and dL/de = 2e. This simple derivative drives the training of every regression model.
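As a quick sanity check, a finite-difference estimate should match dL/de = 2e. A minimal sketch (the helper names here are illustrative):

```python
def squared_error(prediction, target):
    """Squared error for a single prediction."""
    e = prediction - target
    return e * e

def d_squared_error(prediction, target):
    """Analytic derivative dL/de = 2e, where e = prediction - target."""
    return 2 * (prediction - target)

# Central finite difference: (L(e + h) - L(e - h)) / (2h) should be close to 2e
h = 1e-6
pred, target = 3.0, 1.0
numeric = (squared_error(pred + h, target) - squared_error(pred - h, target)) / (2 * h)
analytic = d_squared_error(pred, target)  # 2 * (3.0 - 1.0) = 4.0
```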
The Constant Multiple Rule
If f(x) = c · g(x), then f'(x) = c · g'(x). Constants just "come along for the ride."
f(x) = 5x³
f'(x) = 5 · 3x² = 15x²
This is why dividing by n in the Mean Squared Error (MSE) does not complicate the gradient: the factor 1/n is a constant that scales every component of the gradient by the same amount without changing its direction.
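A small sketch of the idea: dividing a loss by n divides every component of its gradient by n, leaving the direction unchanged (the error values here are arbitrary):

```python
# Per-error gradients of a sum of squared errors vs. a mean of squared errors
errors = [2.0, -1.0, 3.0]
n = len(errors)

# Gradient of the sum of squared errors with respect to each error: 2e
grad_sse = [2 * e for e in errors]
# Gradient of the mean squared error: (1/n) * 2e — same direction, scaled magnitude
grad_mse = [2 * e / n for e in errors]
```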
The Sum Rule
The derivative of a sum is the sum of the derivatives:
If f(x) = g(x) + h(x), then f'(x) = g'(x) + h'(x)
Example: Total Loss
When you compute total loss over a batch of training examples:
L_total = L₁ + L₂ + L₃ + ... + Lₙ
The derivative of the total loss with respect to a parameter is simply the sum of each individual derivative:
dL_total/dw = dL₁/dw + dL₂/dw + dL₃/dw + ... + dLₙ/dw
This is why batch training works: you can compute gradients for each example independently and add them up.
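This can be checked numerically: summing per-example gradients gives the same number as differentiating the total loss directly. A minimal sketch using the squared-error gradient dLᵢ/dw = 2(wxᵢ - yᵢ)xᵢ (the data values are arbitrary):

```python
w = 0.5
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 5.0]

# Sum rule: per-example gradients, added up
per_example = [2 * (w * x - y) * x for x, y in zip(xs, ys)]
total_grad = sum(per_example)

# Direct route: finite-difference gradient of the total loss
def total_loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

h = 1e-6
numeric = (total_loss(w + h) - total_loss(w - h)) / (2 * h)
```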
The Exponential Rule
If f(x) = eˣ, then f'(x) = eˣ. The exponential function is its own derivative.
This makes eˣ incredibly convenient for ML. It appears in:
- The softmax function (converts scores to probabilities)
- The sigmoid function (binary classification)
- Log-likelihood loss functions
The Sigmoid Function
The sigmoid function is one of the most important in ML:
σ(x) = 1 / (1 + e⁻ˣ)
Its derivative has an elegant form:
σ'(x) = σ(x) · (1 - σ(x))
σ(x)
  1 |                 ________
    |              /
0.5 | - - - - - -/
    |          /
  0 |_________/
    +---------+------------ x
              0
The derivative is largest at x = 0 (where σ = 0.5) and approaches zero at the extremes. This means the sigmoid is most sensitive to changes around its midpoint and nearly flat at large positive or negative values — a property called saturation that can cause training problems.
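A short sketch confirming both the identity σ'(x) = σ(x)(1 - σ(x)) and the saturation behavior:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # σ'(x) = σ(x) · (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Largest at the midpoint, vanishingly small at the extremes (saturation)
mid = d_sigmoid(0.0)   # 0.25
far = d_sigmoid(10.0)  # ≈ 4.5e-5

# Finite-difference check of the identity at an arbitrary point
x, h = 0.8, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```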
The Logarithm Rule
If f(x) = ln(x), then f'(x) = 1/x.
The natural logarithm appears constantly in ML because of cross-entropy loss:
L = -ln(p)
where p is the predicted probability of the correct class. Its derivative is:
dL/dp = -1/p
When p is close to 1 (correct prediction), the gradient is small. When p is close to 0 (wrong prediction), the gradient is very large. This is exactly the behavior you want: penalize confident wrong predictions heavily.
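A small sketch of this asymmetry (the probability values are arbitrary):

```python
import math

def cross_entropy(p):
    """Cross-entropy loss for the correct class with predicted probability p."""
    return -math.log(p)

def d_cross_entropy(p):
    """dL/dp = -1/p"""
    return -1.0 / p

confident_right = d_cross_entropy(0.99)  # ≈ -1.01: gentle gradient
confident_wrong = d_cross_entropy(0.01)  # ≈ -100: very strong gradient
```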
The Product Rule
If f(x) = g(x) · h(x), then f'(x) = g'(x) · h(x) + g(x) · h'(x).
The product rule appears when computing derivatives of expressions where two variable terms are multiplied together. In ML, this comes up in:
- Gated units (a learned gate multiplied by a candidate signal)
- Attention mechanisms (query × key)
Example
f(x) = x² · eˣ
f'(x) = 2x · eˣ + x² · eˣ = eˣ(2x + x²)
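A finite-difference check of this result, as a minimal sketch:

```python
import math

def f(x):
    return x**2 * math.exp(x)

def df(x):
    # Product rule: 2x·eˣ + x²·eˣ = eˣ(2x + x²)
    return math.exp(x) * (2 * x + x**2)

# Central finite difference at an arbitrary point
x, h = 1.5, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = df(x)
```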
The Quotient Rule
If f(x) = g(x) / h(x), then:
f'(x) = [g'(x) · h(x) - g(x) · h'(x)] / [h(x)]²
This appears in the derivation of the softmax gradient and in normalization layers. In practice, you can often avoid the quotient rule by rewriting g/h as g · h⁻¹ and using the product and chain rules instead.
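As a sketch, the quotient rule can be verified numerically on a simple ratio (the function chosen here is illustrative, not from the text above):

```python
def f(x):
    # g(x) / h(x) with g = x, h = 1 + x²
    return x / (1 + x**2)

def df(x):
    # Quotient rule: [g'·h - g·h'] / h²
    g, dg = x, 1.0
    h_val, dh = 1 + x**2, 2 * x
    return (dg * h_val - g * dh) / h_val**2

x, eps = 0.7, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
```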
Common Activation Function Derivatives
Here is a quick reference for activation functions you will encounter:
| Activation | f(x) | f'(x) | Used In |
|---|---|---|---|
| ReLU | max(0, x) | 0 if x < 0, 1 if x > 0 (undefined at 0; 0 by convention) | Most hidden layers |
| Sigmoid | 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) | Binary classification output |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | 1 - tanh²(x) | RNNs, some hidden layers |
| Leaky ReLU | x if x > 0, 0.01x if x < 0 | 1 if x > 0, 0.01 if x < 0 | Alternative to ReLU |
ReLU is dominant in modern networks because its derivative is trivially simple: either 0 or 1. This makes gradient computation fast and avoids the saturation problem of sigmoid and tanh.
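Sketch implementations of the table above, with derivatives written directly from the formulas (treating the ReLU derivative at x = 0 as 0, a common convention since it is undefined there):

```python
import math

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0  # convention: 0 at x = 0

def leaky_relu(x):
    return x if x > 0 else 0.01 * x

def d_leaky_relu(x):
    return 1.0 if x > 0 else 0.01

def d_tanh(x):
    # 1 - tanh²(x)
    return 1.0 - math.tanh(x)**2
```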
Putting It Together: A Simple Loss Derivative
Suppose you have a linear model with one weight w predicting y = wx, with squared error loss:
L = (wx - target)²
Using the chain rule (covered in detail in Module 3) and the power rule:
dL/dw = 2(wx - target) · x
This tells the model:
- Direction: If (wx - target) and x have the same sign, increasing w increases the loss, so decrease w
- Magnitude: The gradient is proportional to both the error and the input — larger errors and larger inputs produce larger updates
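The steps above can be sketched as a single gradient-descent update (the learning rate and data values are illustrative choices):

```python
# One gradient-descent step for y = w·x with squared error,
# using dL/dw = 2(wx - target)·x
w, x, target, lr = 0.0, 2.0, 4.0, 0.1

grad = 2 * (w * x - target) * x        # 2·(0 - 4)·2 = -16
w_new = w - lr * grad                  # 0 - 0.1·(-16) = 1.6

loss_before = (w * x - target) ** 2    # 16
loss_after = (w_new * x - target) ** 2 # (3.2 - 4)² ≈ 0.64
```

The negative gradient pushed w upward, and the loss dropped accordingly.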
Summary
- Power rule: d/dx(xⁿ) = nxⁿ⁻¹ — used in polynomial loss functions
- Constant multiple: constants pass through the derivative unchanged
- Sum rule: derivative of a sum = sum of derivatives — enables batch training
- Exponential: d/dx(eˣ) = eˣ — appears in sigmoid, softmax, log-likelihood
- Logarithm: d/dx(ln x) = 1/x — drives cross-entropy loss
- Product rule: d/dx(g·h) = g'h + gh' — used in attention and regularization
- Sigmoid derivative: σ(x)(1 - σ(x)) — elegant but saturates at extremes
- ReLU derivative: 0 or 1 — simple and fast, dominant in modern networks
- These rules, combined with the chain rule from Module 3, are sufficient for understanding all gradient computations in machine learning
Next, we move to functions with multiple inputs — the reality of every ML model — starting with functions of several variables.

