Variants of Gradient Descent
Plain gradient descent has limitations: it can be slow in flat regions, oscillate in narrow valleys, and struggle with different parameter scales. Over the years, researchers have developed variants that address these issues. Understanding these optimizers helps you choose the right one and debug training problems.
Momentum
The key idea behind momentum is to accumulate past gradients, like a ball rolling downhill that builds speed.
In plain SGD, the update depends only on the current gradient:
w = w - α · ∇L
With momentum, we maintain a velocity v that accumulates past gradients:
v = β · v + ∇L
w = w - α · v
where β (typically 0.9) is the momentum coefficient.
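The momentum update can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: the `grad` function below stands in for ∇L and uses a toy quadratic loss L(w) = ½‖w‖², and the hyperparameters are the commonly cited defaults.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2 (illustrative only)
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9      # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v + grad(w)  # accumulate velocity from past gradients
    w = w - alpha * v       # step along the velocity, not the raw gradient
```

Note that the step uses the velocity `v`, not the current gradient, so consistent gradient directions compound across iterations.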
Why Momentum Helps
Consider a narrow valley in the loss surface:
- Without momentum: oscillates side-to-side across the valley; slow forward progress
- With momentum: rolls straight through the valley; fast convergence
The gradient in the narrow direction alternates sign (pointing up, then down), so these components cancel out in the velocity. The gradient in the downhill direction is consistent, so it accumulates in the velocity. The result: momentum dampens oscillations and accelerates movement along consistent gradient directions.
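The cancellation is easy to check numerically. In this sketch, an alternating ±1 gradient stands in for the narrow (oscillating) direction and a constant +1 gradient for the downhill direction; both the gradient pattern and the step count are illustrative assumptions.

```python
import numpy as np

v = np.zeros(2)
beta = 0.9

for t in range(50):
    # dim 0: gradient alternates sign (across the valley)
    # dim 1: gradient is consistently +1 (down the valley)
    g = np.array([(-1.0) ** t, 1.0])
    v = beta * v + g
```

The alternating component settles near 1/(1+β) ≈ 0.53, while the consistent component accumulates toward 1/(1−β) = 10, roughly a 20× difference in step size between the two directions.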
The Ball-on-a-Hill Analogy
Imagine rolling a ball down a hilly surface:
- Without momentum: the ball stops as soon as it encounters an uphill slope
- With momentum: the ball has inertia and can roll over small bumps and through flat regions
This helps the optimizer escape shallow local minima and plateaus.
Nesterov Accelerated Gradient (NAG)
NAG is a refinement of momentum. Instead of computing the gradient at the current position, it computes the gradient at the position where momentum would take you:
v = β · v + ∇L(w - α · β · v) ← "look ahead"
w = w - α · v
The intuition: if momentum is going to overshoot, Nesterov looks ahead and sees the gradient at the future position, allowing it to correct course before arriving there.
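A sketch of the look-ahead step, again on an assumed toy quadratic loss (the `grad` function and hyperparameters are illustrative, not canonical):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9

for _ in range(200):
    lookahead = w - alpha * beta * v   # where momentum alone would take us
    v = beta * v + grad(lookahead)     # gradient evaluated at the look-ahead point
    w = w - alpha * v
```

The only change from plain momentum is where the gradient is evaluated: at the projected future position rather than the current one.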
AdaGrad: Adaptive Learning Rates
Different parameters may need different learning rates. A parameter that rarely updates (sparse gradient) might need a larger step, while one that updates frequently might need a smaller step.
AdaGrad adapts the learning rate for each parameter based on the sum of its past squared gradients:
G = G + (∇L)² (accumulate squared gradients, element-wise)
w = w - (α / √(G + ε)) · ∇L (divide learning rate by accumulated magnitude)
where ε ≈ 10⁻⁸ prevents division by zero.
Parameters with large accumulated gradients get smaller effective learning rates. Parameters with small accumulated gradients get larger effective learning rates.
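The element-wise accumulation can be sketched as follows, using the same assumed toy quadratic loss (the `grad` function, learning rate, and iteration count are illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
G = np.zeros_like(w)
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = grad(w)
    G = G + g ** 2                         # accumulate squared gradients, element-wise
    w = w - alpha / np.sqrt(G + eps) * g   # per-parameter effective learning rate
```

Because `G` is a vector, each parameter gets its own effective step size α / √(G + ε).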
The Limitation
AdaGrad's accumulated gradient G only grows — it never shrinks. Over time, the effective learning rate can become extremely small, causing training to stall. This is fine for convex problems but problematic for deep learning.
RMSProp: Fixing AdaGrad's Decay
RMSProp (Root Mean Square Propagation) fixes AdaGrad by using an exponential moving average of squared gradients instead of a sum:
G = γ · G + (1 - γ) · (∇L)² (moving average, γ ≈ 0.9)
w = w - (α / √(G + ε)) · ∇L
The moving average lets recent gradients matter more than old ones, preventing the learning rate from decaying to zero.
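The change from AdaGrad is a single line: the sum becomes an exponential moving average. As before, the `grad` function and hyperparameters below are illustrative assumptions on a toy quadratic loss.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
G = np.zeros_like(w)
alpha, gamma, eps = 0.01, 0.9, 1e-8

for _ in range(1000):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2   # moving average instead of a running sum
    w = w - alpha / np.sqrt(G + eps) * g
```

Old squared gradients decay geometrically (weight γᵏ after k steps), so the denominator tracks recent gradient magnitudes rather than growing without bound.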
Adam: The Default Optimizer
Adam (Adaptive Moment Estimation) combines momentum and RMSProp. It maintains two running averages:
m = β₁ · m + (1 - β₁) · ∇L (first moment: mean of gradients = momentum)
v = β₂ · v + (1 - β₂) · (∇L)² (second moment: mean of squared gradients)
With bias correction (important in early steps when m and v are initialized to zero):
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
The update:
w = w - α · m̂ / (√v̂ + ε)
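Putting the two moments, bias correction, and the update together gives a compact sketch. The `grad` function is an assumed toy quadratic; the hyperparameters are the standard defaults, with a slightly smaller α for this toy problem.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
alpha, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):               # t starts at 1 for bias correction
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2     # second moment (adaptive scaling)
    m_hat = m / (1 - b1 ** t)          # bias correction: undo zero-initialization
    v_hat = v / (1 - b2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Without the bias correction, m and v would be biased toward zero in the first few steps (since both start at zero), making early updates too small.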
Why Adam Is Popular
| Feature | Benefit |
|---|---|
| Momentum (m) | Accelerates through consistent gradient directions |
| Adaptive rates (v) | Different learning rate for each parameter |
| Bias correction | Accurate estimates even in early training |
| Minimal tuning | Works well with default settings (α=0.001, β₁=0.9, β₂=0.999) |
Adam is the default choice for most deep learning tasks. It handles varying gradient magnitudes, sparse gradients, and noisy data well.
AdamW: Weight Decay Done Right
Standard Adam applies weight decay (L2 regularization) by adding it to the gradient, which interacts poorly with the adaptive learning rate. AdamW decouples the weight decay from the gradient-based update:
w = w - α · (m̂ / (√v̂ + ε) + λ · w)
where λ is the weight decay coefficient. AdamW has become the standard optimizer for training large language models and vision transformers.
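The decoupling is visible as a one-line difference from Adam: λ·w is added directly to the update rather than to the gradient. As before, this is a sketch on an assumed toy quadratic loss with illustrative hyperparameters.

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
alpha, b1, b2, eps, lam = 0.01, 0.9, 0.999, 1e-8, 0.01

for t in range(1, 1001):
    g = grad(w)                        # weight decay is NOT added to the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # decoupled decay: lam * w bypasses the adaptive scaling by sqrt(v_hat)
    w = w - alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)
```

Because λ·w sits outside the division by √v̂, every weight decays at the same rate regardless of its gradient history, which is the point of the decoupling.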
Choosing an Optimizer
| Scenario | Recommended Optimizer |
|---|---|
| Starting a new project | Adam or AdamW |
| Computer vision (CNNs) | SGD with momentum (often achieves better final accuracy) |
| Large language models | AdamW |
| Sparse data (NLP, recommendations) | Adam |
| When SGD works but is slow | SGD with momentum + learning rate schedule |
Comparison Summary
| Optimizer | Key Idea | Learning Rate | Momentum |
|---|---|---|---|
| SGD | Basic gradient step | Fixed, global | No |
| SGD + Momentum | Accumulate velocity | Fixed, global | Yes |
| AdaGrad | Adapt per parameter | Adaptive (decays) | No |
| RMSProp | Moving average adaptation | Adaptive (stable) | No |
| Adam | Momentum + adaptation | Adaptive (stable) | Yes |
| AdamW | Adam + decoupled weight decay | Adaptive (stable) | Yes |
Summary
- Momentum accelerates SGD by accumulating past gradients, dampening oscillations
- AdaGrad adapts learning rates per parameter but can decay too aggressively
- RMSProp fixes AdaGrad with exponential moving averages of squared gradients
- Adam combines momentum and adaptive rates — the default for most deep learning
- AdamW decouples weight decay from the adaptive update, preferred for transformers and LLMs
- The choice of optimizer affects training speed and final model quality
- Adam with default settings is a strong starting point for most problems
The next module covers loss functions — the mathematical objectives that gradient descent actually optimizes — and the broader optimization landscape.

