Introduction to Gradient Descent
Gradient descent is the workhorse algorithm of machine learning. It takes the gradients computed by calculus and uses them to iteratively improve model parameters. Every time you train a neural network, a logistic regression model, or a linear regression model, gradient descent (or a variant of it) is running under the hood.
The Algorithm
Gradient descent is remarkably simple. Starting from random parameter values:
1. Compute the loss L for the current parameters
2. Compute the gradient ∇L (how the loss changes with each parameter)
3. Update each parameter: w_new = w_old - α · ∂L/∂w
4. Repeat until the loss is small enough
The learning rate α (alpha) controls the step size. That is the entire algorithm.
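The four steps can be sketched as a short loop (a minimal sketch, assuming the gradient is available as a callable; the function and parameter names are illustrative):

```python
def gradient_descent(grad, w, alpha=0.1, num_steps=100):
    """Minimize a one-parameter loss, given its gradient function."""
    for _ in range(num_steps):
        w = w - alpha * grad(w)   # step against the gradient
    return w
```

For example, `gradient_descent(lambda w: 2 * (w - 3), w=0.0)` walks the weight toward 3, the minimizer of (w − 3)².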
A One-Parameter Example
Suppose your model has one weight w, and the loss function is:
L(w) = (w - 3)²
This is a parabola with its minimum at w = 3 (where L = 0).
```
L(w)
 ^
 |  *               *
 |   \             /
 |    \           /
 |     \         /
 |      \       /
 |       \_____/
 +────────────────────> w
            3
```
The gradient is: dL/dw = 2(w - 3).
Starting at w = 0 with learning rate α = 0.1:
| Step | w | L(w) | dL/dw | Update |
|---|---|---|---|---|
| 0 | 0.00 | 9.00 | -6.00 | w = 0 - 0.1(-6) = 0.60 |
| 1 | 0.60 | 5.76 | -4.80 | w = 0.6 - 0.1(-4.8) = 1.08 |
| 2 | 1.08 | 3.69 | -3.84 | w = 1.08 - 0.1(-3.84) = 1.46 |
| 3 | 1.46 | 2.36 | -3.07 | w = 1.46 - 0.1(-3.07) = 1.77 |
| ... | ... | ... | ... | ... |
| 20 | 2.97 | 0.001 | -0.06 | approaching w = 3 |
The weight moves toward 3, where the loss is minimized. Each step reduces the loss.
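The table can be reproduced in a few lines (a minimal sketch of the same iteration):

```python
def loss(w):
    return (w - 3) ** 2

w, alpha = 0.0, 0.1
for step in range(21):
    g = 2 * (w - 3)       # dL/dw at the current weight
    w = w - alpha * g     # gradient step
# closed form of this recurrence: w_k = 3 - 3 * 0.8**k, so w → 3
```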
Two-Parameter Gradient Descent
With two parameters, gradient descent navigates a loss surface over the (w₁, w₂) plane. At each step, the gradient points uphill, and the algorithm steps in the opposite direction (downhill).
```
w₂
 ^    start
 |      *
 |      ↓ step 1
 |      *
 |       ↘ step 2
 |        *
 |        ↓ step 3
 |        *   ← approaching minimum
 +──────────────> w₁
```
On a contour plot, the path zigzags toward the center (minimum):
```
w₂
 ^
 |    ╭──────────╮
 |   ╭┤  ╭────╮  ├╮
 |   │ ╭─┤ *  ├─╮ │
 |   │ │ ╰──↙─╯ │ │    Gradient descent path
 |   │ ╰─┤ *  ├─╯ │    spirals inward toward
 |   ╰┤  ╰↙───╯  ├╯    the minimum
 |    ╰──↙──────╯
 +──────────────> w₁
```
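The path can be traced numerically. A minimal sketch on a hypothetical quadratic bowl L(w₁, w₂) = w₁² + 4w₂², whose elliptical contours bend the path toward the minimum at the origin:

```python
import numpy as np

def grad(w):
    # ∇L for L(w₁, w₂) = w₁² + 4·w₂²
    return np.array([2 * w[0], 8 * w[1]])

w = np.array([4.0, 2.0])      # starting point
alpha = 0.1
path = [w.copy()]
for _ in range(30):
    w = w - alpha * grad(w)   # step downhill in both coordinates
    path.append(w.copy())
# w approaches the minimum at (0, 0)
```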
Why It Works: The Descent Guarantee
The gradient ∇L points in the direction of steepest increase. Moving in the direction -∇L guarantees that, for a small enough step size, the loss decreases.
Mathematically, for a small step α:
L(w - α∇L) < L(w) (as long as α is small enough and ∇L ≠ 0)
This is the descent property: every step reduces the loss (when the learning rate is appropriate). The algorithm converges to a point where ∇L = 0 — a stationary point, which in practice is usually a local minimum (though it can also be a saddle point).
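The descent property can be checked numerically on the one-parameter loss L(w) = (w − 3)² from earlier (a minimal sketch; the specific step sizes are illustrative):

```python
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

w = 0.0
# for small enough steps, the loss strictly decreases
for alpha in (0.01, 0.1, 0.5):
    assert loss(w - alpha * grad(w)) < loss(w)

# a step that is too large overshoots and *increases* the loss
w_new = w - 1.5 * grad(w)     # w jumps from 0 to 9
assert loss(w_new) > loss(w)  # 36 > 9
```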
Batch Gradient Descent
In practice, the loss is computed over a dataset of N training examples:
L = (1/N) Σᵢ Lᵢ(w)
Batch gradient descent computes the gradient using all training examples before making one update:
∇L = (1/N) Σᵢ ∇Lᵢ(w)
This gives the most accurate gradient estimate but is slow for large datasets because you must process every example before taking a single step.
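A minimal sketch on a toy one-parameter linear model y ≈ w·x (the data and names are illustrative): the batch gradient averages every per-example gradient before making a single update.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                          # targets; the true weight is 2

def grad_i(w, i):
    # gradient of L_i(w) = (w·x_i − y_i)² with respect to w
    return 2 * (w * x[i] - y[i]) * x[i]

def batch_grad(w):
    # average over all N examples → one accurate (here exact) gradient
    return np.mean([grad_i(w, i) for i in range(len(x))])

w = 0.0
for _ in range(100):
    w -= 0.05 * batch_grad(w)        # one update per full pass over the data
```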
Stochastic Gradient Descent (SGD)
Stochastic gradient descent uses a single random training example to estimate the gradient:
∇L ≈ ∇Lᵢ(w) for a randomly chosen i
This is noisy (each example gives a different gradient), but it is much faster per step. The noise can actually help by allowing the optimizer to escape shallow local minima.
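On the same kind of toy model, SGD updates after every single randomly chosen example (a minimal sketch; the data and step counts are illustrative):

```python
import random

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * v for v in x]            # targets; the true weight is 2

def grad_i(w, i):
    # gradient of the single-example loss (w·x_i − y_i)²
    return 2 * (w * x[i] - y[i]) * x[i]

random.seed(0)
w = 0.0
for _ in range(500):
    i = random.randrange(len(x))    # pick one example at random
    w -= 0.02 * grad_i(w, i)        # noisy but cheap update
```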
Mini-Batch Gradient Descent
Mini-batch gradient descent is the standard compromise used in practice. It uses a small random subset (typically 32-256 examples) to estimate the gradient:
∇L ≈ (1/B) Σᵢ∈batch ∇Lᵢ(w)
where B is the batch size.
| Method | Examples per Step | Speed | Gradient Quality |
|---|---|---|---|
| Batch | All N | Slow | Exact |
| Stochastic | 1 | Fast | Very noisy |
| Mini-batch | B (32-256) | Moderate | Reasonably accurate |
Mini-batch SGD is what "training" means in practice. When people say "SGD," they usually mean mini-batch SGD.
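A minimal mini-batch sketch on the same kind of toy model (batch size 2 here only because the dataset is tiny; 32–256 is typical in practice):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                          # targets; the true weight is 2
rng = np.random.default_rng(0)

def grad_i(w, i):
    # gradient of the single-example loss (w·x_i − y_i)²
    return 2 * (w * x[i] - y[i]) * x[i]

w, B = 0.0, 2
for _ in range(200):
    batch = rng.choice(len(x), size=B, replace=False)  # random subset
    g = np.mean([grad_i(w, i) for i in batch])         # (1/B) Σ ∇Lᵢ
    w -= 0.05 * g
```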
The Training Loop in Code
The complete training loop in pseudocode:
```
for epoch in range(num_epochs):           # repeat over the dataset
    for batch in get_batches(data):       # iterate through mini-batches
        predictions = model(batch.inputs)                 # forward pass
        loss = loss_function(predictions, batch.targets)
        gradients = compute_gradients(loss)               # backward pass
        for param in model.parameters:
            param = param - learning_rate * gradients[param]  # update
```
Each pass through the entire dataset is called an epoch. A model might train for 10-100 epochs, with hundreds or thousands of mini-batch updates per epoch.
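The same loop can be made runnable on a toy linear model (all data and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)              # 8 training examples
y = 2.0 * x                          # targets; the true weight is 2
w = 0.0
learning_rate, batch_size, num_epochs = 0.01, 4, 50

for epoch in range(num_epochs):          # one epoch = one pass over the data
    order = rng.permutation(len(x))      # reshuffle examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]            # one mini-batch
        preds = w * x[idx]                               # forward pass
        g = np.mean(2 * (preds - y[idx]) * x[idx])       # backward pass
        w -= learning_rate * g                           # parameter update
```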
Summary
- Gradient descent iteratively updates parameters by subtracting the gradient scaled by a learning rate
- The update rule: w_new = w_old - α · ∇L
- The gradient points uphill; subtracting it moves downhill toward lower loss
- Batch gradient descent uses all data (accurate but slow)
- Stochastic gradient descent uses one example (fast but noisy)
- Mini-batch gradient descent (the standard) uses a small random subset
- Training loops through the data in epochs, updating parameters with each mini-batch
The next lesson explores the critical role of the learning rate and what determines whether gradient descent converges successfully.

