Variants of Gradient Descent
Plain gradient descent has limitations: it can be slow in flat regions, oscillate in narrow valleys, and struggle with different parameter scales. Over the years, researchers have developed variants that address these issues. Understanding these optimizers helps you choose the right one and debug training problems.
Momentum
The key idea behind momentum is to accumulate past gradients, like a ball rolling downhill that builds speed.
In plain SGD, the update depends only on the current gradient:
w = w - α · ∇L
With momentum, we maintain a velocity v that accumulates past gradients:
v = β · v + ∇L
w = w - α · v
where β (typically 0.9) is the momentum coefficient.
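The momentum update can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: the `grad` function below stands in for ∇L and uses a toy quadratic loss L(w) = ½‖w‖², and the hyperparameters are the commonly cited defaults.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2 (illustrative only)
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9      # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v + grad(w)  # accumulate velocity from past gradients
    w = w - alpha * v       # step along the velocity, not the raw gradient
```

Note that the step uses the velocity `v`, not the current gradient, so consistent gradient directions compound across iterations.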
Why Momentum Helps
Consider a narrow valley in the loss surface:
- Without momentum: oscillates side-to-side across the valley; slow forward progress
- With momentum: rolls straight through the valley; fast convergence
The gradient in the narrow direction alternates sign (pointing up, then down), so these components cancel out in the velocity. The gradient in the downhill direction is consistent, so it accumulates in the velocity. The result: momentum dampens oscillations and accelerates movement along consistent gradient directions.
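The cancellation is easy to check numerically. In this sketch, an alternating ±1 gradient stands in for the narrow (oscillating) direction and a constant +1 gradient for the downhill direction; both the gradient pattern and the step count are illustrative assumptions.

```python
import numpy as np

v = np.zeros(2)
beta = 0.9

for t in range(50):
    # dim 0: gradient alternates sign (across the valley)
    # dim 1: gradient is consistently +1 (down the valley)
    g = np.array([(-1.0) ** t, 1.0])
    v = beta * v + g
```

The alternating component settles near 1/(1+β) ≈ 0.53, while the consistent component accumulates toward 1/(1−β) = 10, roughly a 20× difference in step size between the two directions.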
The Ball-on-a-Hill Analogy
Imagine rolling a ball down a hilly surface:
- Without momentum: the ball stops as soon as it encounters an uphill slope
- With momentum: the ball has inertia and can roll over small bumps and through flat regions
This helps the optimizer escape shallow local minima and plateaus.
Nesterov Accelerated Gradient (NAG)
NAG is a refinement of momentum. Instead of computing the gradient at the current position, it computes the gradient at the position where momentum would take you:
v = β · v + ∇L(w - α · β · v) ← "look ahead"
w = w - α · v
The intuition: if momentum is going to overshoot, Nesterov looks ahead and sees the gradient at the future position, allowing it to correct course before arriving there.
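A sketch of the look-ahead step, again on an assumed toy quadratic loss (the `grad` function and hyperparameters are illustrative, not canonical):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9

for _ in range(200):
    lookahead = w - alpha * beta * v   # where momentum alone would take us
    v = beta * v + grad(lookahead)     # gradient evaluated at the look-ahead point
    w = w - alpha * v
```

The only change from plain momentum is where the gradient is evaluated: at the projected future position rather than the current one.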
AdaGrad: Adaptive Learning Rates
Different parameters may need different learning rates. A parameter that rarely updates (sparse gradient) might need a larger step, while one that updates frequently might need a smaller step.
AdaGrad adapts the learning rate for each parameter based on the sum of its past squared gradients:
G = G + (∇L)² (accumulate squared gradients, element-wise)
w = w - (α / √(G + ε)) · ∇L (divide learning rate by accumulated magnitude)
where ε ≈ 10⁻⁸ prevents division by zero.
Parameters with large accumulated gradients get smaller effective learning rates. Parameters with small accumulated gradients get larger effective learning rates.
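The element-wise accumulation can be sketched as follows, using the same assumed toy quadratic loss (the `grad` function, learning rate, and iteration count are illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
G = np.zeros_like(w)
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = grad(w)
    G = G + g ** 2                         # accumulate squared gradients, element-wise
    w = w - alpha / np.sqrt(G + eps) * g   # per-parameter effective learning rate
```

Because `G` is a vector, each parameter gets its own effective step size α / √(G + ε).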
The Limitation
AdaGrad's accumulated gradient G only grows — it never shrinks. Over time, the effective learning rate can become extremely small, causing training to stall. This is fine for convex problems but problematic for deep learning.
RMSProp: Fixing AdaGrad's Decay
RMSProp (Root Mean Square Propagation) fixes AdaGrad by using an exponential moving average of squared gradients instead of a sum:
G = γ · G + (1 - γ) · (∇L)² (moving average, γ ≈ 0.9)
w = w - (α / √(G + ε)) · ∇L
The moving average lets recent gradients matter more than old ones, preventing the learning rate from decaying to zero.
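The change from AdaGrad is a single line: the sum becomes an exponential moving average. As before, the `grad` function and hyperparameters below are illustrative assumptions on a toy quadratic loss.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
G = np.zeros_like(w)
alpha, gamma, eps = 0.01, 0.9, 1e-8

for _ in range(1000):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2   # moving average instead of a running sum
    w = w - alpha / np.sqrt(G + eps) * g
```

Old squared gradients decay geometrically (weight γᵏ after k steps), so the denominator tracks recent gradient magnitudes rather than growing without bound.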
Adam: The Default Optimizer
Adam (Adaptive Moment Estimation) combines momentum and RMSProp. It maintains two running averages:
m = β₁ · m + (1 - β₁) · ∇L (first moment: mean of gradients = momentum)
v = β₂ · v + (1 - β₂) · (∇L)² (second moment: mean of squared gradients)
With bias correction (important in early steps when m and v are initialized to zero):
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
The update:
w = w - α · m̂ / (√v̂ + ε)
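Putting the two moments, bias correction, and the update together gives a compact sketch. The `grad` function is an assumed toy quadratic; the hyperparameters are the standard defaults, with a slightly smaller α for this toy problem.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
alpha, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):               # t starts at 1 for bias correction
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2     # second moment (adaptive scaling)
    m_hat = m / (1 - b1 ** t)          # bias correction: undo zero-initialization
    v_hat = v / (1 - b2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Without the bias correction, m and v would be biased toward zero in the first few steps (since both start at zero), making early updates too small.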
Why Adam Is Popular
| Feature | Benefit |
|---|---|
| Momentum (m) | Accelerates through consistent gradient directions |
| Adaptive rates (v) | Different learning rate for each parameter |
| Bias correction | Accurate estimates even in early training |
| Minimal tuning | Works well with default settings (α=0.001, β₁=0.9, β₂=0.999) |
Adam is the default choice for most deep learning tasks. It handles varying gradient magnitudes, sparse gradients, and noisy data well.
AdamW: Weight Decay Done Right
Standard Adam applies weight decay (L2 regularization) by adding it to the gradient, which interacts poorly with the adaptive learning rate. AdamW decouples the weight decay from the gradient-based update:
w = w - α · (m̂ / (√v̂ + ε) + λ · w)
where λ is the weight decay coefficient. AdamW has become the standard optimizer for training large language models and vision transformers.
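The decoupling is visible as a one-line difference from Adam: λ·w is added directly to the update rather than to the gradient. As before, this is a sketch on an assumed toy quadratic loss with illustrative hyperparameters.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
alpha, b1, b2, eps, lam = 0.01, 0.9, 0.999, 1e-8, 0.01

for t in range(1, 1001):
    g = grad(w)                        # weight decay is NOT added to the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # decoupled decay: lam * w bypasses the adaptive scaling by sqrt(v_hat)
    w = w - alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)
```

Because λ·w sits outside the division by √v̂, every weight decays at the same rate regardless of its gradient history, which is the point of the decoupling.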
Choosing an Optimizer
| Scenario | Recommended Optimizer |
|---|---|
| Starting a new project | Adam or AdamW |
| Computer vision (CNNs) | SGD with momentum (often achieves better final accuracy) |
| Large language models | AdamW |
| Sparse data (NLP, recommendations) | Adam |
| When SGD works but is slow | SGD with momentum + learning rate schedule |
Comparison Summary
| Optimizer | Key Idea | Learning Rate | Momentum |
|---|---|---|---|
| SGD | Basic gradient step | Fixed, global | No |
| SGD + Momentum | Accumulate velocity | Fixed, global | Yes |
| AdaGrad | Adapt per parameter | Adaptive (decays) | No |
| RMSProp | Moving average adaptation | Adaptive (stable) | No |
| Adam | Momentum + adaptation | Adaptive (stable) | Yes |
| AdamW | Adam + decoupled weight decay | Adaptive (stable) | Yes |
Summary
- Momentum accelerates SGD by accumulating past gradients, dampening oscillations
- AdaGrad adapts learning rates per parameter but can decay too aggressively
- RMSProp fixes AdaGrad with exponential moving averages of squared gradients
- Adam combines momentum and adaptive rates — the default for most deep learning
- AdamW decouples weight decay from the adaptive update, preferred for transformers and LLMs
- The choice of optimizer affects training speed and final model quality
- Adam with default settings is a strong starting point for most problems
The next module covers loss functions — the mathematical objectives that gradient descent actually optimizes — and the broader optimization landscape.

