Learning Rate and Convergence
The learning rate α is the single most important hyperparameter in gradient descent. Too large, and the model overshoots the minimum and diverges. Too small, and training takes forever. Understanding how the learning rate interacts with the gradient is essential for training any model effectively.
What the Learning Rate Controls
The update rule is:
w_new = w_old - α · ∂L/∂w
The learning rate α scales the gradient before subtracting it. It determines how far you step in the direction indicated by the gradient.
Step size = α × |gradient|
The gradient tells you which direction to go. The learning rate tells you how far to go in that direction.
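As a minimal sketch, here is a single update step on the toy loss L(w) = (w − 3)², whose gradient is ∂L/∂w = 2(w − 3); the loss function and starting values are illustrative:

```python
# One gradient descent step on L(w) = (w - 3)^2, a toy loss with its minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

alpha = 0.1   # learning rate
w = 0.0       # initial parameter value

# The update rule: w_new = w_old - alpha * dL/dw.
# The gradient at w = 0 is -6 (pointing uphill), so we step +0.6 toward the minimum.
w = w - alpha * grad(w)
print(w)  # step size = alpha * |gradient| = 0.1 * 6 = 0.6
```

Note that the step size shrinks automatically as w approaches the minimum, because the gradient itself shrinks there.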
Too Large: Overshooting
If the learning rate is too large, the parameter jumps past the minimum and lands on the other side, potentially at a higher loss. In the worst case, each step increases the loss, and the model diverges.
L(w)
^
| * *
| \←step 2 / ←step 1
| \ /
| \_____/
+────────────────> w
minimum
The parameter bounces back and forth around the minimum, never settling down. A loss that oscillates or climbs from step to step is a common sign that the learning rate is too high.
Too Small: Slow Convergence
If the learning rate is too small, each step makes minimal progress. Training might require millions of steps to reach the minimum, or it might get stuck in a flat region where gradients are tiny.
L(w)
^
| * tiny steps: * * * * * * * * * * *
| \ ↘
| \ *
| \_________________________________/
+─────────────────────────────────────> w
Just Right: Convergence
A well-chosen learning rate makes steady progress toward the minimum:
L(w)
^
| *
| \
| *
| \
| *
| \_*__*
+────────────────> w
The loss decreases consistently and the parameters approach their optimal values.
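The three regimes can be reproduced with a few lines of code. This sketch runs gradient descent on L(w) = w² (gradient 2w) with three learning rates; the specific values are illustrative, not recommendations:

```python
def descend(alpha, w=1.0, steps=20):
    """Run gradient descent on L(w) = w^2 (gradient 2w) and return the final w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # each step multiplies w by (1 - 2*alpha)
    return w

print(descend(1.1))    # too large: |w| grows each step -> divergence
print(descend(0.001))  # too small: after 20 steps, w has barely moved from 1.0
print(descend(0.4))    # well chosen: w shrinks rapidly toward the minimum at 0
```

On this quadratic the update multiplies w by (1 − 2α) each step, so any α above 1 makes the iterates grow without bound, which is exactly the overshooting picture above.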
Learning Rate in Practice
Typical learning rate ranges for common optimizers:
| Optimizer | Typical Learning Rate |
|---|---|
| SGD | 0.01 - 0.1 |
| SGD with momentum | 0.001 - 0.01 |
| Adam | 0.0001 - 0.001 |
| AdaGrad | 0.01 |
These are starting points. The optimal learning rate depends on the model architecture, dataset, and batch size.
Learning Rate Schedules
Instead of using a fixed learning rate, modern training often changes the learning rate over time:
Step Decay
Reduce the learning rate by a factor at specific epochs:
Epoch 1-30: α = 0.01
Epoch 31-60: α = 0.001
Epoch 61-90: α = 0.0001
Cosine Annealing
Gradually reduce the learning rate following a cosine curve:
α(t) = α_min + (α_max - α_min) / 2 · (1 + cos(πt / T))
α
^
| *
| *
| *
| *
| *
| *
| *
| *
+─────────────────────> epoch
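The cosine formula above translates directly to code; α_min, α_max, and the period T are hyperparameters, and the values below are illustrative:

```python
import math

def cosine_annealing(t, T, alpha_min=0.0, alpha_max=0.01):
    """alpha(t) = alpha_min + (alpha_max - alpha_min) / 2 * (1 + cos(pi * t / T))."""
    return alpha_min + (alpha_max - alpha_min) / 2 * (1 + math.cos(math.pi * t / T))

print(cosine_annealing(0, 100))    # 0.01  -- starts at alpha_max
print(cosine_annealing(50, 100))   # 0.005 -- exactly halfway down
print(cosine_annealing(100, 100))  # ~0.0  -- ends at alpha_min
```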
Warm-up
Start with a very small learning rate and increase it linearly for the first few epochs:
α
^
| *──────*──────*──*──*
| / \
| / \
| /
| /
| *
+─────────────────────────────────> epoch
warm-up stable decay
Warm-up helps stabilize training in the early stages when gradients are large and unpredictable.
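A linear warm-up can be sketched as follows; the warm-up length and target rate here are assumptions, and in practice a decay schedule would take over after the ramp:

```python
def warmup_lr(epoch, warmup_epochs=5, target_alpha=0.01):
    """Linearly ramp the learning rate up to target_alpha over warmup_epochs,
    then hold it constant (a decay schedule would typically follow)."""
    if epoch < warmup_epochs:
        return target_alpha * (epoch + 1) / warmup_epochs
    return target_alpha

print([round(warmup_lr(e), 4) for e in range(7)])
# [0.002, 0.004, 0.006, 0.008, 0.01, 0.01, 0.01]
```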
Convergence Criteria
How do you know when to stop training? Common approaches:
| Criterion | Description |
|---|---|
| Loss threshold | Stop when loss drops below a target value |
| Gradient norm | Stop when ‖∇L‖ is close to zero |
| Validation loss | Stop when loss on held-out data stops improving |
| Fixed epochs | Train for a predetermined number of epochs |
| Early stopping | Stop when validation loss increases for several consecutive epochs |
Early stopping is the most common in practice. It not only determines when to stop but also prevents overfitting — if the model starts memorizing training data, the validation loss increases, and training stops.
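Early stopping reduces to tracking the best validation loss seen so far plus a patience counter. This is a minimal sketch; the patience value is an assumption:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch where validation
    loss has failed to improve for `patience` consecutive epochs, else the last epoch."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # new best: reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1

# Validation loss improves, then rises for 3 consecutive epochs -> stop at epoch 5.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 1.1]))  # 5
```

In a real training loop you would also checkpoint the model at each new best, so that stopping restores the weights from the best epoch rather than the last one.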
The Loss Curve
During training, you monitor the loss curve — loss plotted against training steps or epochs:
Loss
^
| *
| *
| **
| **
| ***
| ****
| *****
| ********
+─────────────────────────────> Epoch
A healthy loss curve:
- Decreases steeply at first (large gradients, quick progress)
- Gradually flattens (approaching a minimum, smaller gradients)
- Levels off (converged or near-converged)
Warning signs:
- Loss exploding (increasing): learning rate too high
- Loss stuck (flat from the start): learning rate too low, or model architecture problem
- Loss oscillating wildly: learning rate too high or batch size too small
The Relationship Between Learning Rate and Batch Size
Larger batch sizes produce more stable gradient estimates (less noise). This allows larger learning rates. A common heuristic:
If you double the batch size, multiply the learning rate by √2
This is why distributed training across many GPUs (which effectively increases batch size) often uses learning rate scaling.
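The square-root heuristic above can be written as a one-liner; note that a linear scaling rule (multiply the learning rate by k when the batch size grows by k) is another common choice, and the base values here are illustrative:

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root learning-rate scaling: alpha' = alpha * sqrt(new_batch / base_batch)."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(0.01, 256, 512))   # doubling the batch: 0.01 * sqrt(2) ~ 0.0141
print(scale_lr(0.01, 256, 1024))  # quadrupling the batch: 0.01 * 2 = 0.02
```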
Summary
- The learning rate α determines step size: too large causes divergence, too small causes slow convergence
- The gradient provides direction; the learning rate provides magnitude
- Learning rate schedules (step decay, cosine annealing, warm-up) improve training
- Early stopping monitors validation loss to prevent overfitting and determine when to stop
- The loss curve reveals whether the learning rate and training are working correctly
- Larger batch sizes can support larger learning rates due to more stable gradient estimates
- The learning rate is typically the first hyperparameter to tune when training any model
Next, we explore variants of gradient descent that go beyond the basic update rule — using momentum, adaptive learning rates, and other techniques that make optimization faster and more robust.