Loss Functions and Optimization
Loss functions define what "good" means for a model, and optimization algorithms find the parameters that achieve it. Both are deeply connected to probability and statistics. This lesson explores the key loss functions used in AI and how they relate to probabilistic reasoning.
What is a Loss Function?
A loss function (or cost function, objective function) measures how wrong the model's predictions are:
L(y_true, y_pred) → scalar value
- Lower loss = better predictions
- Training minimizes total loss over the dataset
Loss Functions and Probability
Most loss functions have probabilistic interpretations:
| Loss | Probabilistic Interpretation |
|---|---|
| Cross-entropy | Negative log-likelihood (classification) |
| Mean squared error | Negative log-likelihood (Gaussian) |
| Binary cross-entropy | Negative log-likelihood (Bernoulli) |
| KL divergence | Asymmetric measure of difference between distributions |
Cross-Entropy Loss
The most common loss for classification:
L = -Σ yᵢ × log(p̂ᵢ)
For single-class labels (one-hot), this simplifies to:
L = -log(p̂_true_class)
Why It Works
True class = 2 (zero-indexed), prediction = [0.1, 0.1, 0.7, 0.1]
L = -log(0.7) ≈ 0.36
If the model improves:
Better prediction = [0.05, 0.05, 0.85, 0.05]
L = -log(0.85) ≈ 0.16
Loss decreases as the model assigns more probability to the correct class.
Multi-Class Cross-Entropy
For K classes:
L = -Σₖ yₖ × log(p̂ₖ)
This is the categorical cross-entropy used in softmax classifiers.
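As a concrete sketch in plain Python (no framework assumed), here is one way to compute it, with a numerically stable softmax:

import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_class, logits):
    # -log of the probability the model assigns to the correct class
    return -math.log(softmax(logits)[true_class])

# Reproduces the worked example above: class 2 gets probability 0.7
print(categorical_cross_entropy(2, [0.0, 0.0, math.log(7.0), 0.0]))  # ≈ 0.357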
Binary Cross-Entropy
For binary classification:
L = -[y × log(p̂) + (1-y) × log(1-p̂)]
- If y = 1: L = -log(p̂)
- If y = 0: L = -log(1-p̂)
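A matching sketch for the binary case (the eps clamp is an implementation detail added here to avoid log(0)):

import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    # y is 0 or 1; p_hat is the predicted P(y = 1), clamped away from 0 and 1
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))  # -log(0.9) ≈ 0.105
print(binary_cross_entropy(0, 0.9))  # -log(0.1) ≈ 2.303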
Mean Squared Error (MSE)
For regression:
L = (1/n) × Σ (yᵢ - ŷᵢ)²
Probabilistic View
MSE assumes Gaussian noise:
y = f(x) + ε, where ε ~ Normal(0, σ²)
Minimizing MSE = Maximizing likelihood under this assumption.
When to Use MSE
- Continuous targets
- When errors are roughly symmetric
- When outliers are rare (squaring the error lets outliers dominate)
MSE Variants
Mean Absolute Error (MAE):
L = (1/n) × Σ |yᵢ - ŷᵢ|
More robust to outliers; assumes Laplace noise.
Huber Loss:
L = { ½(y - ŷ)²         if |y - ŷ| ≤ δ
    { δ|y - ŷ| - ½δ²    otherwise
Combines MSE (small errors) with MAE (large errors).
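To make the outlier behavior concrete, here is a small plain-Python comparison of the three losses on a single residual r = y - ŷ (δ = 1.0 is an illustrative choice):

def mse(r):
    return r * r                      # quadratic: large residuals dominate

def mae(r):
    return abs(r)                     # linear: robust to outliers

def huber(r, delta=1.0):
    # Quadratic near zero, linear in the tails
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta

for r in [0.5, 2.0, 10.0]:
    print(r, mse(r), mae(r), huber(r))  # at r = 10: MSE = 100, MAE = 10, Huber = 9.5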
KL Divergence
Measures how different two probability distributions are:
KL(P || Q) = Σ P(x) × log(P(x) / Q(x))
Properties:
- Always ≥ 0
- = 0 only when P = Q
- Not symmetric: KL(P||Q) ≠ KL(Q||P)
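A quick numeric sketch (distributions as probability lists over the same support) shows both the computation and the asymmetry:

import math

def kl_divergence(p, q):
    # Σ P(x) log(P(x)/Q(x)); terms with P(x) = 0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ≈ 0.184
print(kl_divergence(q, p))  # ≈ 0.192 — different: KL is not symmetric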
Cross-Entropy and KL Divergence
Cross-Entropy(P, Q) = Entropy(P) + KL(P || Q)
When training, P (true distribution) is fixed, so:
- Minimizing cross-entropy = Minimizing KL divergence
KL in Variational Autoencoders
VAEs minimize:
L = Reconstruction_Loss + β × KL(q(z|x) || p(z))
The KL term keeps the learned latent space close to a prior (usually standard normal).
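For the common choice where q(z|x) is a diagonal Gaussian Normal(μ, σ²) and the prior is standard normal, this KL term has a well-known closed form:

KL(q(z|x) || p(z)) = ½ Σⱼ (μⱼ² + σⱼ² - log σⱼ² - 1)

so it can be computed directly from the encoder's μ and log σ² outputs rather than estimated by sampling.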
Focal Loss
For imbalanced classification:
L = -(1 - p̂)^γ × log(p̂)
- γ = 0: Standard cross-entropy
- γ > 0: Down-weights easy examples
Example with γ = 2:
- Easy example (p̂ = 0.9): L = -(0.1)² × log(0.9) ≈ 0.001
- Hard example (p̂ = 0.1): L = -(0.9)² × log(0.1) ≈ 1.87
Focus shifts to hard examples.
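The worked numbers above follow directly from the formula; a small plain-Python sketch reproduces them:

import math

def focal_loss(p_hat, gamma=2.0):
    # Down-weight well-classified examples by the factor (1 - p̂)^γ
    return -((1 - p_hat) ** gamma) * math.log(p_hat)

print(focal_loss(0.9))  # ≈ 0.001 — easy example, almost ignored
print(focal_loss(0.1))  # ≈ 1.87  — hard example, dominates the loss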
Contrastive Losses
For learning representations:
Triplet Loss
L = max(0, d(anchor, positive) - d(anchor, negative) + margin)
Push similar items together, dissimilar items apart.
InfoNCE (used in CLIP, SimCLR)
L = -log(exp(sim(xᵢ, xⱼ⁺) / τ) / Σₖ exp(sim(xᵢ, xₖ) / τ))
Treats similarity as probability; cross-entropy over positive vs. negatives.
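A minimal plain-Python sketch over precomputed similarity scores (the similarity values and τ = 0.1 here are illustrative; the positive is included among the candidates):

import math

def info_nce(pos_sim, all_sims, tau=0.1):
    # Softmax over scaled similarities; loss is -log P(positive)
    logits = [s / tau for s in all_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / tau - log_denom)

# One positive (0.8) among three negatives
print(info_nce(0.8, [0.8, 0.6, 0.7, 0.5]))  # ≈ 0.44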
Optimization Algorithms
Gradient Descent
Basic update:
θ ← θ - lr × ∇L(θ)
Move in the direction of steepest descent.
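As a toy example in plain Python, gradient descent on f(θ) = (θ - 3)², whose gradient is 2(θ - 3), converges to the minimum at θ = 3:

theta, lr = 0.0, 0.1
for step in range(100):
    grad = 2 * (theta - 3)       # ∇f for f(θ) = (θ - 3)²
    theta = theta - lr * grad
print(theta)                      # ≈ 3.0, the minimum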
Stochastic Gradient Descent (SGD)
Use mini-batch estimates:
θ ← θ - lr × ∇L_batch(θ)
Faster but noisier. The noise can help escape local minima.
Momentum
Accumulate velocity:
v ← β×v + ∇L(θ)
θ ← θ - lr × v
Smooths out noisy gradients, accelerates convergence.
RMSprop
Adapt learning rate per parameter:
s ← β×s + (1-β)×(∇L(θ))²
θ ← θ - lr × ∇L(θ) / (√s + ε)
Larger gradients → smaller effective learning rate.
Adam
Combines momentum and RMSprop:
m ← β₁×m + (1-β₁)×∇L(θ) # First moment
v ← β₂×v + (1-β₂)×(∇L(θ))² # Second moment
m̂ = m / (1 - β₁ᵗ) # Bias correction
v̂ = v / (1 - β₂ᵗ)
θ ← θ - lr × m̂ / (√v̂ + ε)
Works well across many problems with default hyperparameters.
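A minimal single-parameter Adam sketch in plain Python (using the standard defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the learning rate here is an illustrative choice):

def adam_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad * grad    # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps), (m, v, t)

theta, state = 0.0, (0.0, 0.0, 0)
for _ in range(1000):
    grad = 2 * (theta - 3)                 # same toy objective as before
    theta, state = adam_step(theta, grad, state)
print(theta)                               # converges close to 3.0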
AdamW
Adam with decoupled weight decay:
θ ← θ - lr × (m̂ / (√v̂ + ε) + λ × θ)
Better regularization; preferred for transformers.
Learning Rate Schedules
Constant
lr(t) = lr₀
Simple but may not converge optimally.
Step Decay
lr(t) = lr₀ × γ^(t // step_size)
Decrease by factor γ every step_size epochs.
Cosine Annealing
lr(t) = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))
Smooth decay following cosine curve.
Warmup
lr(t) = lr₀ × (t / warmup_steps) for t < warmup_steps
Start small and increase; helps with large batch training.
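These schedules are simple enough to express directly; a plain-Python sketch with illustrative hyperparameter values:

import math

def step_decay(t, lr0=0.1, gamma=0.5, step_size=10):
    # Multiply by gamma every step_size steps
    return lr0 * gamma ** (t // step_size)

def cosine_annealing(t, T=100, lr_max=0.1, lr_min=0.0):
    # Smooth decay from lr_max to lr_min over T steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(t, warmup_steps=10, lr0=0.1):
    # Linear ramp up, then hand off to cosine decay
    if t < warmup_steps:
        return lr0 * t / warmup_steps
    return cosine_annealing(t - warmup_steps)

print([round(warmup_then_cosine(t), 3) for t in range(0, 110, 10)])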
One-Cycle
Increase then decrease learning rate:
↗↗↗↗ peak ↘↘↘↘
Often achieves better results faster.
The Loss Landscape
Neural network loss functions define high-dimensional landscapes:
(Diagram: a bumpy curve of loss versus parameters, with a shallow local minimum beside a deeper global minimum.)
Challenges
- Local minima: Not the global best
- Saddle points: Gradient is zero, but the surface curves up in some directions and down in others
- Plateaus: Very flat regions
Why Deep Learning Works
Despite non-convexity:
- Most local minima are similar quality
- SGD noise helps escape bad minima
- Overparameterization creates many paths to low-loss solutions
- Modern optimizers handle saddle points
Gradient Clipping
Prevent exploding gradients:
import math

# Clip by value: clamp each gradient element to [-max_val, max_val]
gradients = [max(-max_val, min(max_val, g)) for g in gradients]

# Clip by norm: rescale the whole gradient vector if its L2 norm exceeds max_norm
norm = math.sqrt(sum(g * g for g in gradients))
if norm > max_norm:
    gradients = [g * max_norm / norm for g in gradients]
Essential for RNNs and transformers.
Summary
- Loss functions measure prediction quality
- Cross-entropy = MLE for classification
- MSE = MLE for regression with Gaussian noise
- KL divergence measures distribution difference
- Focal loss helps with imbalanced data
- Optimizers (SGD, Adam) navigate the loss landscape
- Learning rate schedules control the optimization trajectory
- Gradient clipping prevents training instability
This completes Module 5! Next, we'll explore sampling, randomness, and evaluation metrics—how AI generates outputs and how we measure success.

