Common Loss Functions in Machine Learning
A loss function measures how wrong a model's predictions are. It is the function that gradient descent minimizes. Choosing the right loss function determines what the model optimizes for, and understanding its gradient tells you how the model learns from its mistakes.
Why Loss Functions Matter
The loss function defines the objective of training. Two models with the same architecture but different loss functions will learn different things. The gradient of the loss function drives every parameter update, so its mathematical properties directly affect training dynamics.
Mean Squared Error (MSE)
MSE is the standard loss for regression problems (predicting continuous values).
MSE = (1/N) Σᵢ (ŷᵢ - yᵢ)²
where ŷᵢ is the prediction and yᵢ is the true value.
For a single example:
L = (ŷ - y)²
Gradient of MSE
∂L/∂ŷ = 2(ŷ - y)
The gradient is proportional to the error. This means:
- Large errors produce large gradients → big updates (good: the model corrects quickly)
- Small errors produce small gradients → small updates (good: fine-tuning near the optimum)
- Zero error produces zero gradient → no update (good: don't change what is already correct)
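The loss and its gradient can be sketched in a few lines of plain Python (the function names here are ours, not from any particular library):

```python
def mse(preds, targets):
    """Mean squared error over a batch of predictions."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mse_grad(pred, target):
    """Gradient of the single-example loss L = (pred - target)^2."""
    return 2 * (pred - target)

print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # ≈ 0.1667
print(mse_grad(5.0, 3.0))   # error +2 -> gradient +4 (big update)
print(mse_grad(3.0, 3.0))   # zero error -> gradient 0 (no update)
```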
Why the Square?
The squaring serves two purposes:
- Makes all errors positive (an error of -5 is just as bad as +5)
- Penalizes large errors disproportionately (an error of 10 contributes 100, while an error of 2 contributes 4)
Loss
^
|*             *
| *           *
|  *         *
|   *       *
|    **   **
|      ***
+───────┼───────────> prediction error
        0
Mean Absolute Error (MAE)
MAE uses absolute values instead of squares:
MAE = (1/N) Σᵢ |ŷᵢ - yᵢ|
Gradient of MAE
∂L/∂ŷ = +1         if ŷ > y
        -1         if ŷ < y
        undefined  at ŷ = y
The gradient is always ±1, regardless of the error size. This makes MAE more robust to outliers than MSE (a massive outlier does not dominate the gradient), but it can cause training instability near the optimum because the gradient does not shrink as the prediction improves.
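A minimal sketch of MAE and its subgradient (returning 0 at the kink, a common convention; function names are ours):

```python
def mae(preds, targets):
    """Mean absolute error over a batch of predictions."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mae_grad(pred, target):
    """Subgradient of L = |pred - target|; 0 at the kink by convention."""
    if pred > target:
        return 1.0
    if pred < target:
        return -1.0
    return 0.0

# The gradient magnitude is 1 whether the error is huge or tiny:
print(mae_grad(100.5, 0.5))  # 1.0 (outlier does not dominate)
print(mae_grad(0.6, 0.5))    # 1.0 (same magnitude near the optimum)
```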
Binary Cross-Entropy
Binary cross-entropy is the standard loss for binary classification (predicting yes/no, 0/1).
L = -[y · ln(p) + (1 - y) · ln(1 - p)]
where y ∈ {0, 1} is the true label and p is the predicted probability.
When y = 1: L = -ln(p). As p → 1, loss → 0; as p → 0, loss → ∞.
When y = 0: L = -ln(1 - p). As p → 0, loss → 0; as p → 1, loss → ∞.
Gradient of Binary Cross-Entropy
∂L/∂p = -(y/p) + (1-y)/(1-p)
The gradient is very large when the model is confidently wrong (predicting p ≈ 0 when y = 1, or p ≈ 1 when y = 0). This is exactly right: confidently wrong predictions should produce strong learning signals.
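This behavior is easy to verify numerically (a sketch in plain Python; the small `eps` clamp is our addition to guard against ln(0)):

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one example; eps guards ln(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_grad(y, p):
    """dL/dp = -y/p + (1 - y)/(1 - p)."""
    return -y / p + (1 - y) / (1 - p)

print(bce(1, 0.9))        # ≈ 0.105: confident and correct -> small loss
print(bce(1, 0.01))       # ≈ 4.6: confidently wrong -> large loss
print(bce_grad(1, 0.01))  # -100.0: huge gradient when confidently wrong
```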
Loss
^
|* *
| * *
| * *
| ** **
| *** ***
| **********
+────────────────────────> predicted probability p
0 1
y=1: -ln(p) y=0: -ln(1-p)
Categorical Cross-Entropy
For multi-class classification (e.g., classifying images into 10 categories), the loss extends to multiple classes:
L = -Σⱼ yⱼ · ln(pⱼ)
where yⱼ is 1 for the correct class and 0 otherwise (one-hot encoding), and pⱼ is the predicted probability for class j (typically from softmax).
Since only the correct class has yⱼ = 1, this simplifies to:
L = -ln(p_correct)
The gradient pushes the model to increase the predicted probability of the correct class.
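With one-hot targets the sum collapses to a single term, as a short sketch shows (the probabilities here are made up for illustration):

```python
import math

def categorical_ce(probs, correct_class):
    """Cross-entropy with a one-hot target: just -ln(p_correct)."""
    return -math.log(probs[correct_class])

probs = [0.1, 0.7, 0.2]          # e.g. softmax output over 3 classes
print(categorical_ce(probs, 1))  # -ln(0.7) ≈ 0.357
```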
Softmax and Cross-Entropy Together
In practice, the softmax function and cross-entropy loss are combined for numerical stability. Given raw scores (logits) z₁, z₂, ..., zₖ:
pⱼ = exp(zⱼ) / Σₖ exp(zₖ) (softmax)
L = -ln(p_correct) (cross-entropy)
The gradient of the combined softmax + cross-entropy with respect to the logit zⱼ has an elegant form:
∂L/∂zⱼ = pⱼ - yⱼ
This is remarkably simple: the gradient for each class is just the difference between the predicted probability and the target. For the correct class, it is p - 1 (push probability up). For incorrect classes, it is p - 0 = p (push probability down).
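The identity ∂L/∂zⱼ = pⱼ - yⱼ can be checked against a finite-difference gradient (a self-contained sketch; the logits are arbitrary):

```python
import math

def softmax(z):
    m = max(z)                              # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, correct):
    """Combined softmax + cross-entropy on raw logits."""
    return -math.log(softmax(z)[correct])

z, correct = [2.0, 1.0, 0.1], 0
p = softmax(z)

# Analytic gradient: p_j - y_j
analytic = [p[j] - (1.0 if j == correct else 0.0) for j in range(len(z))]

# Compare with central finite differences on each logit
h = 1e-6
for j in range(len(z)):
    zp = list(z); zp[j] += h
    zm = list(z); zm[j] -= h
    numeric = (loss(zp, correct) - loss(zm, correct)) / (2 * h)
    assert abs(numeric - analytic[j]) < 1e-5

print([round(g, 4) for g in analytic])  # correct class negative, others positive
```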
Huber Loss
Huber loss combines MSE and MAE. It behaves like MSE for small errors (smooth gradient near zero) and like MAE for large errors (robust to outliers):
L = (1/2)(ŷ - y)²       if |ŷ - y| ≤ δ
    δ|ŷ - y| - (1/2)δ²  if |ŷ - y| > δ
where δ is a threshold parameter (commonly 1.0).
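The two regimes are visible in a direct sketch (function name and test values are ours):

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic near zero, linear for large errors."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2                   # MSE-like near the optimum
    return delta * err - 0.5 * delta ** 2       # MAE-like for outliers

print(huber(1.2, 1.0))   # small error: 0.5 * 0.2^2 ≈ 0.02
print(huber(11.0, 1.0))  # large error: 1.0 * 10 - 0.5 = 9.5, not 50 as with MSE
```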
Choosing the Right Loss Function
| Task | Loss Function | Why |
|---|---|---|
| Regression | MSE | Standard, smooth gradient, penalizes large errors |
| Regression with outliers | MAE or Huber | Robust to extreme values |
| Binary classification | Binary cross-entropy | Natural fit for probabilities |
| Multi-class classification | Categorical cross-entropy | Works with softmax outputs |
| Ranking / similarity | Contrastive or triplet loss | Compares pairs or triplets |
Summary
- The loss function defines what the model optimizes — it is the function that gradient descent minimizes
- MSE penalizes errors quadratically, giving larger gradients for larger errors — standard for regression
- Cross-entropy penalizes confident wrong predictions severely — standard for classification
- The softmax + cross-entropy gradient simplifies to (prediction - target) for each class
- MAE gives constant gradient magnitude, robust to outliers but less stable near the optimum
- Huber loss combines the benefits of MSE and MAE
- The loss function's gradient determines how strongly the model responds to different types of errors
The next lesson explores the optimization landscape — the shape of the loss surface that gradient descent must navigate.

