Loss Functions and Optimization
Loss functions define what "good" means for a model, and optimization algorithms find the parameters that achieve it. Both are deeply connected to probability and statistics. This lesson explores the key loss functions used in AI and how they relate to probabilistic reasoning.
What is a Loss Function?
A loss function (or cost function, objective function) measures how wrong the model's predictions are:
L(y_true, y_pred) → scalar value
- Lower loss = better predictions
- Training minimizes total loss over the dataset
Loss Functions and Probability
Most loss functions have probabilistic interpretations:
| Loss | Probabilistic Interpretation |
|---|---|
| Cross-entropy | Negative log-likelihood (classification) |
| Mean squared error | Negative log-likelihood (Gaussian) |
| Binary cross-entropy | Negative log-likelihood (Bernoulli) |
| KL divergence | Asymmetric measure of difference between distributions |
Cross-Entropy Loss
The most common loss for classification:
L = -Σ yᵢ × log(p̂ᵢ)
For single-class labels (one-hot), this simplifies to:
L = -log(p̂_true_class)
Why It Works
True class = 2 (zero-indexed), prediction = [0.1, 0.1, 0.7, 0.1]
L = -log(0.7) ≈ 0.36
If the model improves:
Better prediction = [0.05, 0.05, 0.85, 0.05]
L = -log(0.85) ≈ 0.16
Loss decreases as the model assigns more probability to the correct class.
Multi-Class Cross-Entropy
For K classes:
L = -Σₖ yₖ × log(p̂ₖ)
This is the categorical cross-entropy used in softmax classifiers.
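As a concrete sketch in plain Python (no framework assumed), here is one way to compute it, with a numerically stable softmax:

import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_class, logits):
    # -log of the probability the model assigns to the correct class
    return -math.log(softmax(logits)[true_class])

# Reproduces the worked example above: class 2 gets probability 0.7
print(categorical_cross_entropy(2, [0.0, 0.0, math.log(7.0), 0.0]))  # ≈ 0.357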
Binary Cross-Entropy
For binary classification:
L = -[y × log(p̂) + (1-y) × log(1-p̂)]
- If y = 1: L = -log(p̂)
- If y = 0: L = -log(1-p̂)
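A matching sketch for the binary case (the eps clamp is an implementation detail added here to avoid log(0)):

import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    # y is 0 or 1; p_hat is the predicted P(y = 1), clamped away from 0 and 1
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))  # -log(0.9) ≈ 0.105
print(binary_cross_entropy(0, 0.9))  # -log(0.1) ≈ 2.303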
Mean Squared Error (MSE)
For regression:
L = (1/n) × Σ (yᵢ - ŷᵢ)²
Probabilistic View
MSE assumes Gaussian noise:
y = f(x) + ε, where ε ~ Normal(0, σ²)
Minimizing MSE = Maximizing likelihood under this assumption.
When to Use MSE
- Continuous targets
- When errors are roughly symmetric
- When outliers are rare (squaring the error lets outliers dominate)
MSE Variants
Mean Absolute Error (MAE):
L = (1/n) × Σ |yᵢ - ŷᵢ|
More robust to outliers; assumes Laplace noise.
Huber Loss:
L = { ½(y - ŷ)²         if |y - ŷ| ≤ δ
    { δ|y - ŷ| - ½δ²    otherwise
Combines MSE (small errors) with MAE (large errors).
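To make the outlier behavior concrete, here is a small plain-Python comparison of the three losses on a single residual r = y - ŷ (δ = 1.0 is an illustrative choice):

def mse(r):
    return r * r                      # quadratic: large residuals dominate

def mae(r):
    return abs(r)                     # linear: robust to outliers

def huber(r, delta=1.0):
    # Quadratic near zero, linear in the tails
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta

for r in [0.5, 2.0, 10.0]:
    print(r, mse(r), mae(r), huber(r))  # at r = 10: MSE = 100, MAE = 10, Huber = 9.5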
KL Divergence
Measures how different two probability distributions are:
KL(P || Q) = Σ P(x) × log(P(x) / Q(x))
Properties:
- Always ≥ 0
- = 0 only when P = Q
- Not symmetric: KL(P||Q) ≠ KL(Q||P)
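A quick numeric sketch (distributions as probability lists over the same support) shows both the computation and the asymmetry:

import math

def kl_divergence(p, q):
    # Σ P(x) log(P(x)/Q(x)); terms with P(x) = 0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ≈ 0.184
print(kl_divergence(q, p))  # ≈ 0.192 — different: KL is not symmetric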
Cross-Entropy and KL Divergence
Cross-Entropy(P, Q) = Entropy(P) + KL(P || Q)
When training, P (true distribution) is fixed, so:
- Minimizing cross-entropy = Minimizing KL divergence
KL in Variational Autoencoders
VAEs minimize:
L = Reconstruction_Loss + β × KL(q(z|x) || p(z))
The KL term keeps the learned latent space close to a prior (usually standard normal).
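For the common choice where q(z|x) is a diagonal Gaussian Normal(μ, σ²) and the prior is standard normal, this KL term has a well-known closed form:

KL(q(z|x) || p(z)) = ½ Σⱼ (μⱼ² + σⱼ² - log σⱼ² - 1)

so it can be computed directly from the encoder's μ and log σ² outputs rather than estimated by sampling.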
Focal Loss
For imbalanced classification:
L = -(1 - p̂)^γ × log(p̂)
- γ = 0: Standard cross-entropy
- γ > 0: Down-weights easy examples
Example with γ = 2:
- Easy example (p̂ = 0.9): L = -(0.1)² × log(0.9) ≈ 0.001
- Hard example (p̂ = 0.1): L = -(0.9)² × log(0.1) ≈ 1.87
Focus shifts to hard examples.
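The worked numbers above follow directly from the formula; a small plain-Python sketch reproduces them:

import math

def focal_loss(p_hat, gamma=2.0):
    # Down-weight well-classified examples by the factor (1 - p̂)^γ
    return -((1 - p_hat) ** gamma) * math.log(p_hat)

print(focal_loss(0.9))  # ≈ 0.001 — easy example, almost ignored
print(focal_loss(0.1))  # ≈ 1.87  — hard example, dominates the loss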
Contrastive Losses
For learning representations:
Triplet Loss
L = max(0, d(anchor, positive) - d(anchor, negative) + margin)
Push similar items together, dissimilar items apart.
InfoNCE (used in CLIP, SimCLR)
L = -log(exp(sim(xᵢ, xⱼ⁺) / τ) / Σₖ exp(sim(xᵢ, xₖ) / τ))
Treats similarity as probability; cross-entropy over positive vs. negatives.
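A minimal plain-Python sketch over precomputed similarity scores (the similarity values and τ = 0.1 here are illustrative; the positive is included among the candidates):

import math

def info_nce(pos_sim, all_sims, tau=0.1):
    # Softmax over scaled similarities; loss is -log P(positive)
    logits = [s / tau for s in all_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / tau - log_denom)

# One positive (0.8) among three negatives
print(info_nce(0.8, [0.8, 0.6, 0.7, 0.5]))  # ≈ 0.44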
Optimization Algorithms
Gradient Descent
Basic update:
θ ← θ - lr × ∇L(θ)
Move in the direction of steepest descent.
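As a toy example in plain Python, gradient descent on f(θ) = (θ - 3)², whose gradient is 2(θ - 3), converges to the minimum at θ = 3:

theta, lr = 0.0, 0.1
for step in range(100):
    grad = 2 * (theta - 3)       # ∇f for f(θ) = (θ - 3)²
    theta = theta - lr * grad
print(theta)                      # ≈ 3.0, the minimum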
Stochastic Gradient Descent (SGD)
Use mini-batch estimates:
θ ← θ - lr × ∇L_batch(θ)
Faster but noisier. The noise can help escape local minima.
Momentum
Accumulate velocity:
v ← β×v + ∇L(θ)
θ ← θ - lr × v
Smooths out noisy gradients, accelerates convergence.
RMSprop
Adapt learning rate per parameter:
s ← β×s + (1-β)×(∇L(θ))²
θ ← θ - lr × ∇L(θ) / (√s + ε)
Larger gradients → smaller effective learning rate.
Adam
Combines momentum and RMSprop:
m ← β₁×m + (1-β₁)×∇L(θ) # First moment
v ← β₂×v + (1-β₂)×(∇L(θ))² # Second moment
m̂ = m / (1 - β₁ᵗ) # Bias correction
v̂ = v / (1 - β₂ᵗ)
θ ← θ - lr × m̂ / (√v̂ + ε)
Works well across many problems with default hyperparameters.
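A minimal single-parameter Adam sketch in plain Python (using the standard defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the learning rate here is an illustrative choice):

def adam_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad * grad    # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps), (m, v, t)

theta, state = 0.0, (0.0, 0.0, 0)
for _ in range(1000):
    grad = 2 * (theta - 3)                 # same toy objective as before
    theta, state = adam_step(theta, grad, state)
print(theta)                               # converges close to 3.0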
AdamW
Adam with decoupled weight decay:
θ ← θ - lr × (m̂ / (√v̂ + ε) + λ × θ)
Better regularization; preferred for transformers.
Learning Rate Schedules
Constant
lr(t) = lr₀
Simple but may not converge optimally.
Step Decay
lr(t) = lr₀ × γ^(t // step_size)
Decrease by factor γ every step_size epochs.
Cosine Annealing
lr(t) = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))
Smooth decay following cosine curve.
Warmup
lr(t) = lr₀ × (t / warmup_steps) for t < warmup_steps
Start small and increase; helps with large batch training.
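These schedules are simple enough to express directly; a plain-Python sketch with illustrative hyperparameter values:

import math

def step_decay(t, lr0=0.1, gamma=0.5, step_size=10):
    # Multiply by gamma every step_size steps
    return lr0 * gamma ** (t // step_size)

def cosine_annealing(t, T=100, lr_max=0.1, lr_min=0.0):
    # Smooth decay from lr_max to lr_min over T steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(t, warmup_steps=10, lr0=0.1):
    # Linear ramp up, then hand off to cosine decay
    if t < warmup_steps:
        return lr0 * t / warmup_steps
    return cosine_annealing(t - warmup_steps)

print([round(warmup_then_cosine(t), 3) for t in range(0, 110, 10)])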
One-Cycle
Increase then decrease learning rate:
↗↗↗↗ peak ↘↘↘↘
Often achieves better results faster.
The Loss Landscape
Neural network loss functions define high-dimensional landscapes:
(Diagram: a bumpy curve of loss versus parameters, with a shallow local minimum beside a deeper global minimum.)
Challenges
- Local minima: Not the global best
- Saddle points: Gradient is zero, but the surface curves up in some directions and down in others
- Plateaus: Very flat regions
Why Deep Learning Works
Despite non-convexity:
- Most local minima are similar quality
- SGD noise helps escape bad minima
- Overparameterization creates many paths to low-loss solutions
- Modern optimizers handle saddle points
Gradient Clipping
Prevent exploding gradients:
import math

# Clip by value: clamp each gradient element to [-max_val, max_val]
gradients = [max(-max_val, min(max_val, g)) for g in gradients]

# Clip by norm: rescale the whole gradient vector if its L2 norm exceeds max_norm
norm = math.sqrt(sum(g * g for g in gradients))
if norm > max_norm:
    gradients = [g * max_norm / norm for g in gradients]
Essential for RNNs and transformers.
Summary
- Loss functions measure prediction quality
- Cross-entropy = MLE for classification
- MSE = MLE for regression with Gaussian noise
- KL divergence measures distribution difference
- Focal loss helps with imbalanced data
- Optimizers (SGD, Adam) navigate the loss landscape
- Learning rate schedules control the optimization trajectory
- Gradient clipping prevents training instability
This completes Module 5! Next, we'll explore sampling, randomness, and evaluation metrics—how AI generates outputs and how we measure success.

