Introduction to Gradient Descent
Gradient descent is the workhorse algorithm of machine learning. It takes the gradients computed by calculus and uses them to iteratively improve model parameters. Every time you train a neural network, a logistic regression model, or a linear regression model, gradient descent (or a variant of it) is running under the hood.
The Algorithm
Gradient descent is remarkably simple. Starting from random parameter values:
1. Compute the loss L for the current parameters
2. Compute the gradient ∇L (how the loss changes with each parameter)
3. Update each parameter: w_new = w_old - α · ∂L/∂w
4. Repeat until the loss is small enough
The learning rate α (alpha) controls the step size. That is the entire algorithm.
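The four steps can be sketched as a short loop (a minimal sketch, assuming the gradient is available as a callable; the function and parameter names are illustrative):

```python
def gradient_descent(grad, w, alpha=0.1, num_steps=100):
    """Minimize a one-parameter loss, given its gradient function."""
    for _ in range(num_steps):
        w = w - alpha * grad(w)   # step against the gradient
    return w
```

For example, `gradient_descent(lambda w: 2 * (w - 3), w=0.0)` walks the weight toward 3, the minimizer of (w − 3)².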
A One-Parameter Example
Suppose your model has one weight w, and the loss function is:
L(w) = (w - 3)²
This is a parabola with its minimum at w = 3 (where L = 0).
```
L(w)
 ^
 |  *               *
 |   \             /
 |    \           /
 |     \         /
 |      \       /
 |       \_____/
 +────────────────────> w
            3
```
The gradient is: dL/dw = 2(w - 3).
Starting at w = 0 with learning rate α = 0.1:
| Step | w | L(w) | dL/dw | Update |
|---|---|---|---|---|
| 0 | 0.00 | 9.00 | -6.00 | w = 0 - 0.1(-6) = 0.60 |
| 1 | 0.60 | 5.76 | -4.80 | w = 0.6 - 0.1(-4.8) = 1.08 |
| 2 | 1.08 | 3.69 | -3.84 | w = 1.08 - 0.1(-3.84) = 1.46 |
| 3 | 1.46 | 2.36 | -3.07 | w = 1.46 - 0.1(-3.07) = 1.77 |
| ... | ... | ... | ... | ... |
| 20 | 2.97 | 0.001 | -0.06 | approaching w = 3 |
The weight moves toward 3, where the loss is minimized. Each step reduces the loss.
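The table can be reproduced in a few lines (a minimal sketch of the same iteration):

```python
def loss(w):
    return (w - 3) ** 2

w, alpha = 0.0, 0.1
for step in range(21):
    g = 2 * (w - 3)       # dL/dw at the current weight
    w = w - alpha * g     # gradient step
# closed form of this recurrence: w_k = 3 - 3 * 0.8**k, so w → 3
```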
Two-Parameter Gradient Descent
With two parameters, gradient descent navigates a loss surface over the (w₁, w₂) plane. At each step, the gradient points uphill, and the algorithm steps in the opposite direction (downhill).
```
w₂
 ^    start
 |      *
 |      ↓ step 1
 |      *
 |       ↘ step 2
 |        *
 |        ↓ step 3
 |        *   ← approaching minimum
 +──────────────> w₁
```
On a contour plot, the path zigzags toward the center (minimum):
```
w₂
 ^
 |    ╭──────────╮
 |   ╭┤  ╭────╮  ├╮
 |   │ ╭─┤ *  ├─╮ │
 |   │ │ ╰──↙─╯ │ │    Gradient descent path
 |   │ ╰─┤ *  ├─╯ │    spirals inward toward
 |   ╰┤  ╰↙───╯  ├╯    the minimum
 |    ╰──↙──────╯
 +──────────────> w₁
```
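The path can be traced numerically. A minimal sketch on a hypothetical quadratic bowl L(w₁, w₂) = w₁² + 4w₂², whose elliptical contours bend the path toward the minimum at the origin:

```python
import numpy as np

def grad(w):
    # ∇L for L(w₁, w₂) = w₁² + 4·w₂²
    return np.array([2 * w[0], 8 * w[1]])

w = np.array([4.0, 2.0])      # starting point
alpha = 0.1
path = [w.copy()]
for _ in range(30):
    w = w - alpha * grad(w)   # step downhill in both coordinates
    path.append(w.copy())
# w approaches the minimum at (0, 0)
```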
Why It Works: The Descent Guarantee
The gradient ∇L points in the direction of steepest increase. Moving in the direction -∇L guarantees that, for a small enough step size, the loss decreases.
Mathematically, for a small step α:
L(w - α∇L) < L(w) (as long as α is small enough and ∇L ≠ 0)
This is the descent property: every step reduces the loss (when the learning rate is appropriate). The algorithm converges to a point where ∇L = 0 — a stationary point, which in practice is usually a local minimum (though it can also be a saddle point).
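The descent property can be checked numerically on the one-parameter loss L(w) = (w − 3)² from earlier (a minimal sketch; the specific step sizes are illustrative):

```python
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

w = 0.0
# for small enough steps, the loss strictly decreases
for alpha in (0.01, 0.1, 0.5):
    assert loss(w - alpha * grad(w)) < loss(w)

# a step that is too large overshoots and *increases* the loss
w_new = w - 1.5 * grad(w)     # w jumps from 0 to 9
assert loss(w_new) > loss(w)  # 36 > 9
```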
Batch Gradient Descent
In practice, the loss is computed over a dataset of N training examples:
L = (1/N) Σᵢ Lᵢ(w)
Batch gradient descent computes the gradient using all training examples before making one update:
∇L = (1/N) Σᵢ ∇Lᵢ(w)
This gives the most accurate gradient estimate but is slow for large datasets because you must process every example before taking a single step.
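A minimal sketch on a toy one-parameter linear model y ≈ w·x (the data and names are illustrative): the batch gradient averages every per-example gradient before making a single update.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                          # targets; the true weight is 2

def grad_i(w, i):
    # gradient of L_i(w) = (w·x_i − y_i)² with respect to w
    return 2 * (w * x[i] - y[i]) * x[i]

def batch_grad(w):
    # average over all N examples → one accurate (here exact) gradient
    return np.mean([grad_i(w, i) for i in range(len(x))])

w = 0.0
for _ in range(100):
    w -= 0.05 * batch_grad(w)        # one update per full pass over the data
```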
Stochastic Gradient Descent (SGD)
Stochastic gradient descent uses a single random training example to estimate the gradient:
∇L ≈ ∇Lᵢ(w) for a randomly chosen i
This is noisy (each example gives a different gradient), but it is much faster per step. The noise can actually help by allowing the optimizer to escape shallow local minima.
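On the same kind of toy model, SGD updates after every single randomly chosen example (a minimal sketch; the data and step counts are illustrative):

```python
import random

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * v for v in x]            # targets; the true weight is 2

def grad_i(w, i):
    # gradient of the single-example loss (w·x_i − y_i)²
    return 2 * (w * x[i] - y[i]) * x[i]

random.seed(0)
w = 0.0
for _ in range(500):
    i = random.randrange(len(x))    # pick one example at random
    w -= 0.02 * grad_i(w, i)        # noisy but cheap update
```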
Mini-Batch Gradient Descent
Mini-batch gradient descent is the standard compromise used in practice. It uses a small random subset (typically 32-256 examples) to estimate the gradient:
∇L ≈ (1/B) Σᵢ∈batch ∇Lᵢ(w)
where B is the batch size.
| Method | Examples per Step | Speed | Gradient Quality |
|---|---|---|---|
| Batch | All N | Slow | Exact |
| Stochastic | 1 | Fast | Very noisy |
| Mini-batch | B (32-256) | Moderate | Reasonably accurate |
Mini-batch SGD is what "training" means in practice. When people say "SGD," they usually mean mini-batch SGD.
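A minimal mini-batch sketch on the same kind of toy model (batch size 2 here only because the dataset is tiny; 32–256 is typical in practice):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                          # targets; the true weight is 2
rng = np.random.default_rng(0)

def grad_i(w, i):
    # gradient of the single-example loss (w·x_i − y_i)²
    return 2 * (w * x[i] - y[i]) * x[i]

w, B = 0.0, 2
for _ in range(200):
    batch = rng.choice(len(x), size=B, replace=False)  # random subset
    g = np.mean([grad_i(w, i) for i in batch])         # (1/B) Σ ∇Lᵢ
    w -= 0.05 * g
```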
The Training Loop in Code
The complete training loop in pseudocode:
```
for epoch in range(num_epochs):           # repeat over the dataset
    for batch in get_batches(data):       # iterate through mini-batches
        predictions = model(batch.inputs)                 # forward pass
        loss = loss_function(predictions, batch.targets)
        gradients = compute_gradients(loss)               # backward pass
        for param in model.parameters:
            param = param - learning_rate * gradients[param]  # update
```
Each pass through the entire dataset is called an epoch. A model might train for 10-100 epochs, with hundreds or thousands of mini-batch updates per epoch.
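The same loop can be made runnable on a toy linear model (all data and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)              # 8 training examples
y = 2.0 * x                          # targets; the true weight is 2
w = 0.0
learning_rate, batch_size, num_epochs = 0.01, 4, 50

for epoch in range(num_epochs):          # one epoch = one pass over the data
    order = rng.permutation(len(x))      # reshuffle examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]            # one mini-batch
        preds = w * x[idx]                               # forward pass
        g = np.mean(2 * (preds - y[idx]) * x[idx])       # backward pass
        w -= learning_rate * g                           # parameter update
```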
Summary
- Gradient descent iteratively updates parameters by subtracting the gradient scaled by a learning rate
- The update rule: w_new = w_old - α · ∇L
- The gradient points uphill; subtracting it moves downhill toward lower loss
- Batch gradient descent uses all data (accurate but slow)
- Stochastic gradient descent uses one example (fast but noisy)
- Mini-batch gradient descent (the standard) uses a small random subset
- Training loops through the data in epochs, updating parameters with each mini-batch
The next lesson explores the critical role of the learning rate and what determines whether gradient descent converges successfully.

