Common Loss Functions in Machine Learning
A loss function measures how wrong a model's predictions are. It is the function that gradient descent minimizes. Choosing the right loss function determines what the model optimizes for, and understanding its gradient tells you how the model learns from its mistakes.
Why Loss Functions Matter
The loss function defines the objective of training. Two models with the same architecture but different loss functions will learn different things. The gradient of the loss function drives every parameter update, so its mathematical properties directly affect training dynamics.
Mean Squared Error (MSE)
MSE is the standard loss for regression problems (predicting continuous values).
MSE = (1/N) Σᵢ (ŷᵢ - yᵢ)²
where ŷᵢ is the prediction and yᵢ is the true value.
For a single example:
L = (ŷ - y)²
Gradient of MSE
∂L/∂ŷ = 2(ŷ - y)
The gradient is proportional to the error. This means:
- Large errors produce large gradients → big updates (good: the model corrects quickly)
- Small errors produce small gradients → small updates (good: fine-tuning near the optimum)
- Zero error produces zero gradient → no update (good: don't change what is already correct)
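The loss and its gradient can be sketched in a few lines of plain Python (the function names here are ours, not from any particular library):

```python
def mse(preds, targets):
    """Mean squared error over a batch of predictions."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mse_grad(pred, target):
    """Gradient of the single-example loss L = (pred - target)^2."""
    return 2 * (pred - target)

print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # ≈ 0.1667
print(mse_grad(5.0, 3.0))   # error +2 -> gradient +4 (big update)
print(mse_grad(3.0, 3.0))   # zero error -> gradient 0 (no update)
```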
Why the Square?
The squaring serves two purposes:
- Makes all errors positive (an error of -5 is just as bad as +5)
- Penalizes large errors disproportionately (an error of 10 contributes 100, while an error of 2 contributes 4)
Loss
^
|*             *
| *           *
|  *         *
|   *       *
|    **   **
|      ***
+───────┼───────────> prediction error
        0
Mean Absolute Error (MAE)
MAE uses absolute values instead of squares:
MAE = (1/N) Σᵢ |ŷᵢ - yᵢ|
Gradient of MAE
∂L/∂ŷ = +1         if ŷ > y
        -1         if ŷ < y
        undefined  at ŷ = y
The gradient is always ±1, regardless of the error size. This makes MAE more robust to outliers than MSE (a massive outlier does not dominate the gradient), but it can cause training instability near the optimum because the gradient does not shrink as the prediction improves.
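A minimal sketch of MAE and its subgradient (returning 0 at the kink, a common convention; function names are ours):

```python
def mae(preds, targets):
    """Mean absolute error over a batch of predictions."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mae_grad(pred, target):
    """Subgradient of L = |pred - target|; 0 at the kink by convention."""
    if pred > target:
        return 1.0
    if pred < target:
        return -1.0
    return 0.0

# The gradient magnitude is 1 whether the error is huge or tiny:
print(mae_grad(100.5, 0.5))  # 1.0 (outlier does not dominate)
print(mae_grad(0.6, 0.5))    # 1.0 (same magnitude near the optimum)
```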
Binary Cross-Entropy
Binary cross-entropy is the standard loss for binary classification (predicting yes/no, 0/1).
L = -[y · ln(p) + (1 - y) · ln(1 - p)]
where y ∈ {0, 1} is the true label and p is the predicted probability.
When y = 1: L = -ln(p). As p → 1, loss → 0; as p → 0, loss → ∞.
When y = 0: L = -ln(1 - p). As p → 0, loss → 0; as p → 1, loss → ∞.
Gradient of Binary Cross-Entropy
∂L/∂p = -(y/p) + (1-y)/(1-p)
The gradient is very large when the model is confidently wrong (predicting p ≈ 0 when y = 1, or p ≈ 1 when y = 0). This is exactly right: confidently wrong predictions should produce strong learning signals.
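This behavior is easy to verify numerically (a sketch in plain Python; the small `eps` clamp is our addition to guard against ln(0)):

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one example; eps guards ln(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_grad(y, p):
    """dL/dp = -y/p + (1 - y)/(1 - p)."""
    return -y / p + (1 - y) / (1 - p)

print(bce(1, 0.9))        # ≈ 0.105: confident and correct -> small loss
print(bce(1, 0.01))       # ≈ 4.6: confidently wrong -> large loss
print(bce_grad(1, 0.01))  # -100.0: huge gradient when confidently wrong
```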
Loss
^
|* *
| * *
| * *
| ** **
| *** ***
| **********
+────────────────────────> predicted probability p
0 1
y=1: -ln(p) y=0: -ln(1-p)
Categorical Cross-Entropy
For multi-class classification (e.g., classifying images into 10 categories), the loss extends to multiple classes:
L = -Σⱼ yⱼ · ln(pⱼ)
where yⱼ is 1 for the correct class and 0 otherwise (one-hot encoding), and pⱼ is the predicted probability for class j (typically from softmax).
Since only the correct class has yⱼ = 1, this simplifies to:
L = -ln(p_correct)
The gradient pushes the model to increase the predicted probability of the correct class.
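With one-hot targets the sum collapses to a single term, as a short sketch shows (the probabilities here are made up for illustration):

```python
import math

def categorical_ce(probs, correct_class):
    """Cross-entropy with a one-hot target: just -ln(p_correct)."""
    return -math.log(probs[correct_class])

probs = [0.1, 0.7, 0.2]          # e.g. softmax output over 3 classes
print(categorical_ce(probs, 1))  # -ln(0.7) ≈ 0.357
```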
Softmax and Cross-Entropy Together
In practice, the softmax function and cross-entropy loss are combined for numerical stability. Given raw scores (logits) z₁, z₂, ..., zₖ:
pⱼ = exp(zⱼ) / Σₖ exp(zₖ) (softmax)
L = -ln(p_correct) (cross-entropy)
The gradient of the combined softmax + cross-entropy with respect to the logit zⱼ has an elegant form:
∂L/∂zⱼ = pⱼ - yⱼ
This is remarkably simple: the gradient for each class is just the difference between the predicted probability and the target. For the correct class, it is p - 1 (push probability up). For incorrect classes, it is p - 0 = p (push probability down).
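The identity ∂L/∂zⱼ = pⱼ - yⱼ can be checked against a finite-difference gradient (a self-contained sketch; the logits are arbitrary):

```python
import math

def softmax(z):
    m = max(z)                              # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, correct):
    """Combined softmax + cross-entropy on raw logits."""
    return -math.log(softmax(z)[correct])

z, correct = [2.0, 1.0, 0.1], 0
p = softmax(z)

# Analytic gradient: p_j - y_j
analytic = [p[j] - (1.0 if j == correct else 0.0) for j in range(len(z))]

# Compare with central finite differences on each logit
h = 1e-6
for j in range(len(z)):
    zp = list(z); zp[j] += h
    zm = list(z); zm[j] -= h
    numeric = (loss(zp, correct) - loss(zm, correct)) / (2 * h)
    assert abs(numeric - analytic[j]) < 1e-5

print([round(g, 4) for g in analytic])  # correct class negative, others positive
```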
Huber Loss
Huber loss combines MSE and MAE. It behaves like MSE for small errors (smooth gradient near zero) and like MAE for large errors (robust to outliers):
L = (1/2)(ŷ - y)²       if |ŷ - y| ≤ δ
    δ|ŷ - y| - (1/2)δ²  if |ŷ - y| > δ
where δ is a threshold parameter (commonly 1.0).
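The two regimes are visible in a direct sketch (function name and test values are ours):

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic near zero, linear for large errors."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2                   # MSE-like near the optimum
    return delta * err - 0.5 * delta ** 2       # MAE-like for outliers

print(huber(1.2, 1.0))   # small error: 0.5 * 0.2^2 ≈ 0.02
print(huber(11.0, 1.0))  # large error: 1.0 * 10 - 0.5 = 9.5, not 50 as with MSE
```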
Choosing the Right Loss Function
| Task | Loss Function | Why |
|---|---|---|
| Regression | MSE | Standard, smooth gradient, penalizes large errors |
| Regression with outliers | MAE or Huber | Robust to extreme values |
| Binary classification | Binary cross-entropy | Natural fit for probabilities |
| Multi-class classification | Categorical cross-entropy | Works with softmax outputs |
| Ranking / similarity | Contrastive or triplet loss | Compares pairs or triplets |
Summary
- The loss function defines what the model optimizes — it is the function that gradient descent minimizes
- MSE penalizes errors quadratically, giving larger gradients for larger errors — standard for regression
- Cross-entropy penalizes confident wrong predictions severely — standard for classification
- The softmax + cross-entropy gradient simplifies to (prediction - target) for each class
- MAE gives constant gradient magnitude, robust to outliers but less stable near the optimum
- Huber loss combines the benefits of MSE and MAE
- The loss function's gradient determines how strongly the model responds to different types of errors
The next lesson explores the optimization landscape — the shape of the loss surface that gradient descent must navigate.

