Derivative Rules You Actually Need
You do not need to memorize dozens of derivative formulas for machine learning. A small set of rules covers the vast majority of what you will encounter in loss functions, activation functions, and optimization. This lesson covers exactly those rules.
The Power Rule
If f(x) = xⁿ, then f'(x) = n · xⁿ⁻¹.
| Function | Derivative | Why It Matters |
|---|---|---|
| x² | 2x | Squared error loss |
| x³ | 3x² | Polynomial features |
| x¹ = x | 1 | Linear functions |
| x⁰ = 1 | 0 | Constants have zero derivative |
| x⁻¹ = 1/x | -1/x² | Inverse relationships |
| x^(1/2) = √x | 1/(2√x) | Square root normalization |
The power rule is the workhorse of calculus. Whenever you see a polynomial expression in a loss function, this is the rule you apply.
Example: Squared Error
The squared error for a single prediction is:
L = (prediction - target)²
Let e = prediction - target. Then L = e², and dL/de = 2e. This simple derivative drives the training of every regression model.
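As a quick sanity check, a finite-difference estimate should match dL/de = 2e. A minimal sketch (the helper names here are illustrative):

```python
def squared_error(prediction, target):
    """Squared error for a single prediction."""
    e = prediction - target
    return e * e

def d_squared_error(prediction, target):
    """Analytic derivative dL/de = 2e, where e = prediction - target."""
    return 2 * (prediction - target)

# Central finite difference: (L(e + h) - L(e - h)) / (2h) should be close to 2e
h = 1e-6
pred, target = 3.0, 1.0
numeric = (squared_error(pred + h, target) - squared_error(pred - h, target)) / (2 * h)
analytic = d_squared_error(pred, target)  # 2 * (3.0 - 1.0) = 4.0
```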
The Constant Multiple Rule
If f(x) = c · g(x), then f'(x) = c · g'(x). Constants just "come along for the ride."
f(x) = 5x³
f'(x) = 5 · 3x² = 15x²
This is why dividing by n in the Mean Squared Error (MSE) does not complicate the gradient: the factor 1/n is a constant that scales every component of the gradient by the same amount without changing its direction.
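A small sketch of the idea: dividing a loss by n divides every component of its gradient by n, leaving the direction unchanged (the error values here are arbitrary):

```python
# Per-error gradients of a sum of squared errors vs. a mean of squared errors
errors = [2.0, -1.0, 3.0]
n = len(errors)

# Gradient of the sum of squared errors with respect to each error: 2e
grad_sse = [2 * e for e in errors]
# Gradient of the mean squared error: (1/n) * 2e — same direction, scaled magnitude
grad_mse = [2 * e / n for e in errors]
```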
The Sum Rule
The derivative of a sum is the sum of the derivatives:
If f(x) = g(x) + h(x), then f'(x) = g'(x) + h'(x)
Example: Total Loss
When you compute total loss over a batch of training examples:
L_total = L₁ + L₂ + L₃ + ... + Lₙ
The derivative of the total loss with respect to a parameter is simply the sum of each individual derivative:
dL_total/dw = dL₁/dw + dL₂/dw + dL₃/dw + ... + dLₙ/dw
This is why batch training works: you can compute gradients for each example independently and add them up.
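This can be checked numerically: summing per-example gradients gives the same number as differentiating the total loss directly. A minimal sketch using the squared-error gradient dLᵢ/dw = 2(wxᵢ - yᵢ)xᵢ (the data values are arbitrary):

```python
w = 0.5
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 5.0]

# Sum rule: per-example gradients, added up
per_example = [2 * (w * x - y) * x for x, y in zip(xs, ys)]
total_grad = sum(per_example)

# Direct route: finite-difference gradient of the total loss
def total_loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

h = 1e-6
numeric = (total_loss(w + h) - total_loss(w - h)) / (2 * h)
```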
The Exponential Rule
If f(x) = eˣ, then f'(x) = eˣ. The exponential function is its own derivative.
This makes eˣ incredibly convenient for ML. It appears in:
- The softmax function (converts scores to probabilities)
- The sigmoid function (binary classification)
- Log-likelihood loss functions
The Sigmoid Function
The sigmoid function is one of the most important in ML:
σ(x) = 1 / (1 + e⁻ˣ)
Its derivative has an elegant form:
σ'(x) = σ(x) · (1 - σ(x))
σ(x)
  1 |                 ________
    |              /
0.5 | - - - - - -/
    |          /
  0 |_________/
    +---------+------------ x
              0
The derivative is largest at x = 0 (where σ = 0.5) and approaches zero at the extremes. This means the sigmoid is most sensitive to changes around its midpoint and nearly flat at large positive or negative values — a property called saturation that can cause training problems.
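A short sketch confirming both the identity σ'(x) = σ(x)(1 - σ(x)) and the saturation behavior:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # σ'(x) = σ(x) · (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Largest at the midpoint, vanishingly small at the extremes (saturation)
mid = d_sigmoid(0.0)   # 0.25
far = d_sigmoid(10.0)  # ≈ 4.5e-5

# Finite-difference check of the identity at an arbitrary point
x, h = 0.8, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```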
The Logarithm Rule
If f(x) = ln(x), then f'(x) = 1/x.
The natural logarithm appears constantly in ML because of cross-entropy loss:
L = -ln(p)
where p is the predicted probability of the correct class. Its derivative is:
dL/dp = -1/p
When p is close to 1 (correct prediction), the gradient is small. When p is close to 0 (wrong prediction), the gradient is very large. This is exactly the behavior you want: penalize confident wrong predictions heavily.
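A small sketch of this asymmetry (the probability values are arbitrary):

```python
import math

def cross_entropy(p):
    """Cross-entropy loss for the correct class with predicted probability p."""
    return -math.log(p)

def d_cross_entropy(p):
    """dL/dp = -1/p"""
    return -1.0 / p

confident_right = d_cross_entropy(0.99)  # ≈ -1.01: gentle gradient
confident_wrong = d_cross_entropy(0.01)  # ≈ -100: very strong gradient
```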
The Product Rule
If f(x) = g(x) · h(x), then f'(x) = g'(x) · h(x) + g(x) · h'(x).
The product rule appears when computing derivatives of expressions where two variable terms are multiplied together. In ML, this comes up in:
- Gated units (a learned gate multiplied by a candidate signal)
- Attention mechanisms (query × key)
Example
f(x) = x² · eˣ
f'(x) = 2x · eˣ + x² · eˣ = eˣ(2x + x²)
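A finite-difference check of this result, as a minimal sketch:

```python
import math

def f(x):
    return x**2 * math.exp(x)

def df(x):
    # Product rule: 2x·eˣ + x²·eˣ = eˣ(2x + x²)
    return math.exp(x) * (2 * x + x**2)

# Central finite difference at an arbitrary point
x, h = 1.5, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = df(x)
```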
The Quotient Rule
If f(x) = g(x) / h(x), then:
f'(x) = [g'(x) · h(x) - g(x) · h'(x)] / [h(x)]²
This appears in the derivation of the softmax gradient and in normalization layers. In practice, you can often avoid the quotient rule by rewriting g/h as g · h⁻¹ and using the product and chain rules instead.
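As a sketch, the quotient rule can be verified numerically on a simple ratio (the function chosen here is illustrative, not from the text above):

```python
def f(x):
    # g(x) / h(x) with g = x, h = 1 + x²
    return x / (1 + x**2)

def df(x):
    # Quotient rule: [g'·h - g·h'] / h²
    g, dg = x, 1.0
    h_val, dh = 1 + x**2, 2 * x
    return (dg * h_val - g * dh) / h_val**2

x, eps = 0.7, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
```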
Common Activation Function Derivatives
Here is a quick reference for activation functions you will encounter:
| Activation | f(x) | f'(x) | Used In |
|---|---|---|---|
| ReLU | max(0, x) | 0 if x < 0, 1 if x > 0 (undefined at 0; 0 by convention) | Most hidden layers |
| Sigmoid | 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) | Binary classification output |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | 1 - tanh²(x) | RNNs, some hidden layers |
| Leaky ReLU | x if x > 0, 0.01x if x < 0 | 1 if x > 0, 0.01 if x < 0 | Alternative to ReLU |
ReLU is dominant in modern networks because its derivative is trivially simple: either 0 or 1. This makes gradient computation fast and avoids the saturation problem of sigmoid and tanh.
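Sketch implementations of the table above, with derivatives written directly from the formulas (treating the ReLU derivative at x = 0 as 0, a common convention since it is undefined there):

```python
import math

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0  # convention: 0 at x = 0

def leaky_relu(x):
    return x if x > 0 else 0.01 * x

def d_leaky_relu(x):
    return 1.0 if x > 0 else 0.01

def d_tanh(x):
    # 1 - tanh²(x)
    return 1.0 - math.tanh(x)**2
```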
Putting It Together: A Simple Loss Derivative
Suppose you have a linear model with one weight w predicting y = wx, with squared error loss:
L = (wx - target)²
Using the chain rule (covered in detail in Module 3) and the power rule:
dL/dw = 2(wx - target) · x
This tells the model:
- Direction: If (wx - target) and x have the same sign, increasing w increases the loss, so decrease w
- Magnitude: The gradient is proportional to both the error and the input — larger errors and larger inputs produce larger updates
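The steps above can be sketched as a single gradient-descent update (the learning rate and data values are illustrative choices):

```python
# One gradient-descent step for y = w·x with squared error,
# using dL/dw = 2(wx - target)·x
w, x, target, lr = 0.0, 2.0, 4.0, 0.1

grad = 2 * (w * x - target) * x        # 2·(0 - 4)·2 = -16
w_new = w - lr * grad                  # 0 - 0.1·(-16) = 1.6

loss_before = (w * x - target) ** 2    # 16
loss_after = (w_new * x - target) ** 2 # (3.2 - 4)² ≈ 0.64
```

The negative gradient pushed w upward, and the loss dropped accordingly.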
Summary
- Power rule: d/dx(xⁿ) = nxⁿ⁻¹ — used in polynomial loss functions
- Constant multiple: constants pass through the derivative unchanged
- Sum rule: derivative of a sum = sum of derivatives — enables batch training
- Exponential: d/dx(eˣ) = eˣ — appears in sigmoid, softmax, log-likelihood
- Logarithm: d/dx(ln x) = 1/x — drives cross-entropy loss
- Product rule: d/dx(g·h) = g'h + gh' — used in attention and regularization
- Sigmoid derivative: σ(x)(1 - σ(x)) — elegant but saturates at extremes
- ReLU derivative: 0 or 1 — simple and fast, dominant in modern networks
- These rules, combined with the chain rule from Module 3, are sufficient for understanding all gradient computations in machine learning
Next, we move to functions with multiple inputs — the reality of every ML model — starting with functions of several variables.

