Learning Rate and Convergence
The learning rate α is the single most important hyperparameter in gradient descent. Too large, and the model overshoots the minimum and diverges. Too small, and training takes forever. Understanding how the learning rate interacts with the gradient is essential for training any model effectively.
What the Learning Rate Controls
The update rule is:
w_new = w_old - α · ∂L/∂w
The learning rate α scales the gradient before subtracting it. It determines how far you step in the direction indicated by the gradient.
Step size = α × |gradient|
The gradient tells you which direction to go. The learning rate tells you how far to go in that direction.
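As a minimal sketch, here is a single update step on the toy loss L(w) = (w − 3)², whose gradient is ∂L/∂w = 2(w − 3); the loss function and starting values are illustrative:

```python
# One gradient descent step on L(w) = (w - 3)^2, a toy loss with its minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

alpha = 0.1   # learning rate
w = 0.0       # initial parameter value

# The update rule: w_new = w_old - alpha * dL/dw.
# The gradient at w = 0 is -6 (pointing uphill), so we step +0.6 toward the minimum.
w = w - alpha * grad(w)
print(w)  # step size = alpha * |gradient| = 0.1 * 6 = 0.6
```

Note that the step size shrinks automatically as w approaches the minimum, because the gradient itself shrinks there.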
Too Large: Overshooting
If the learning rate is too large, the parameter jumps past the minimum and lands on the other side, potentially at a higher loss. In the worst case, each step increases the loss, and the model diverges.
L(w)
^
| * *
| \←step 2 / ←step 1
| \ /
| \_____/
+────────────────> w
minimum
The parameter bounces back and forth around the minimum, never settling down. A loss that oscillates or climbs from step to step is a common sign that the learning rate is too high.
Too Small: Slow Convergence
If the learning rate is too small, each step makes minimal progress. Training might require millions of steps to reach the minimum, or it might get stuck in a flat region where gradients are tiny.
L(w)
^
| * tiny steps: * * * * * * * * * * *
| \ ↘
| \ *
| \_________________________________/
+─────────────────────────────────────> w
Just Right: Convergence
A well-chosen learning rate makes steady progress toward the minimum:
L(w)
^
| *
| \
| *
| \
| *
| \_*__*
+────────────────> w
The loss decreases consistently and the parameters approach their optimal values.
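The three regimes can be reproduced with a few lines of code. This sketch runs gradient descent on L(w) = w² (gradient 2w) with three learning rates; the specific values are illustrative, not recommendations:

```python
def descend(alpha, w=1.0, steps=20):
    """Run gradient descent on L(w) = w^2 (gradient 2w) and return the final w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # each step multiplies w by (1 - 2*alpha)
    return w

print(descend(1.1))    # too large: |w| grows each step -> divergence
print(descend(0.001))  # too small: after 20 steps, w has barely moved from 1.0
print(descend(0.4))    # well chosen: w shrinks rapidly toward the minimum at 0
```

On this quadratic the update multiplies w by (1 − 2α) each step, so any α above 1 makes the iterates grow without bound, which is exactly the overshooting picture above.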
Learning Rate in Practice
Typical learning rate ranges for common optimizers:
| Optimizer | Typical Learning Rate |
|---|---|
| SGD | 0.01 - 0.1 |
| SGD with momentum | 0.001 - 0.01 |
| Adam | 0.0001 - 0.001 |
| AdaGrad | 0.01 |
These are starting points. The optimal learning rate depends on the model architecture, dataset, and batch size.
Learning Rate Schedules
Instead of using a fixed learning rate, modern training often changes the learning rate over time:
Step Decay
Reduce the learning rate by a factor at specific epochs:
Epoch 1-30: α = 0.01
Epoch 31-60: α = 0.001
Epoch 61-90: α = 0.0001
Cosine Annealing
Gradually reduce the learning rate following a cosine curve:
α(t) = α_min + (α_max - α_min) / 2 · (1 + cos(πt / T))
α
^
| *
| *
| *
| *
| *
| *
| *
| *
+─────────────────────> epoch
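The cosine formula above translates directly to code; α_min, α_max, and the period T are hyperparameters, and the values below are illustrative:

```python
import math

def cosine_annealing(t, T, alpha_min=0.0, alpha_max=0.01):
    """alpha(t) = alpha_min + (alpha_max - alpha_min) / 2 * (1 + cos(pi * t / T))."""
    return alpha_min + (alpha_max - alpha_min) / 2 * (1 + math.cos(math.pi * t / T))

print(cosine_annealing(0, 100))    # 0.01  -- starts at alpha_max
print(cosine_annealing(50, 100))   # 0.005 -- exactly halfway down
print(cosine_annealing(100, 100))  # ~0.0  -- ends at alpha_min
```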
Warm-up
Start with a very small learning rate and increase it linearly for the first few epochs:
α
^
| *──────*──────*──*──*
| / \
| / \
| /
| /
| *
+─────────────────────────────────> epoch
warm-up stable decay
Warm-up helps stabilize training in the early stages when gradients are large and unpredictable.
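A linear warm-up can be sketched as follows; the warm-up length and target rate here are assumptions, and in practice a decay schedule would take over after the ramp:

```python
def warmup_lr(epoch, warmup_epochs=5, target_alpha=0.01):
    """Linearly ramp the learning rate up to target_alpha over warmup_epochs,
    then hold it constant (a decay schedule would typically follow)."""
    if epoch < warmup_epochs:
        return target_alpha * (epoch + 1) / warmup_epochs
    return target_alpha

print([round(warmup_lr(e), 4) for e in range(7)])
# [0.002, 0.004, 0.006, 0.008, 0.01, 0.01, 0.01]
```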
Convergence Criteria
How do you know when to stop training? Common approaches:
| Criterion | Description |
|---|---|
| Loss threshold | Stop when loss drops below a target value |
| Gradient norm | Stop when ‖∇L‖ is close to zero |
| Validation loss | Stop when loss on held-out data stops improving |
| Fixed epochs | Train for a predetermined number of epochs |
| Early stopping | Stop when validation loss increases for several consecutive epochs |
Early stopping is the most common in practice. It not only determines when to stop but also prevents overfitting — if the model starts memorizing training data, the validation loss increases, and training stops.
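Early stopping reduces to tracking the best validation loss seen so far plus a patience counter. This is a minimal sketch; the patience value is an assumption:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch where validation
    loss has failed to improve for `patience` consecutive epochs, else the last epoch."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # new best: reset the patience counter
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1

# Validation loss improves, then rises for 3 consecutive epochs -> stop at epoch 5.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 1.1]))  # 5
```

In a real training loop you would also checkpoint the model at each new best, so that stopping restores the weights from the best epoch rather than the last one.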
The Loss Curve
During training, you monitor the loss curve — loss plotted against training steps or epochs:
Loss
^
| *
| *
| **
| **
| ***
| ****
| *****
| ********
+─────────────────────────────> Epoch
A healthy loss curve:
- Decreases steeply at first (large gradients, quick progress)
- Gradually flattens (approaching a minimum, smaller gradients)
- Levels off (converged or near-converged)
Warning signs:
- Loss exploding (increasing): learning rate too high
- Loss stuck (flat from the start): learning rate too low, or model architecture problem
- Loss oscillating wildly: learning rate too high or batch size too small
The Relationship Between Learning Rate and Batch Size
Larger batch sizes produce more stable gradient estimates (less noise). This allows larger learning rates. A common heuristic:
If you double the batch size, multiply the learning rate by √2
This is why distributed training across many GPUs (which effectively increases batch size) often uses learning rate scaling.
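The square-root heuristic above can be written as a one-liner; note that a linear scaling rule (multiply the learning rate by k when the batch size grows by k) is another common choice, and the base values here are illustrative:

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root learning-rate scaling: alpha' = alpha * sqrt(new_batch / base_batch)."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(0.01, 256, 512))   # doubling the batch: 0.01 * sqrt(2) ~ 0.0141
print(scale_lr(0.01, 256, 1024))  # quadrupling the batch: 0.01 * 2 = 0.02
```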
Summary
- The learning rate α determines step size: too large causes divergence, too small causes slow convergence
- The gradient provides direction; the learning rate provides magnitude
- Learning rate schedules (step decay, cosine annealing, warm-up) improve training
- Early stopping monitors validation loss to prevent overfitting and determine when to stop
- The loss curve reveals whether the learning rate and training are working correctly
- Larger batch sizes can support larger learning rates due to more stable gradient estimates
- The learning rate is typically the first hyperparameter to tune when training any model
Next, we explore variants of gradient descent that go beyond the basic update rule — using momentum, adaptive learning rates, and other techniques that make optimization faster and more robust.