The Gradient Vector
Each partial derivative tells you how the loss changes in one parameter's direction. But models have many parameters, and we need all this information packaged together. The gradient is a vector that collects all partial derivatives into a single object, pointing in the direction of steepest increase.
Defining the Gradient
For a function f(x₁, x₂, ..., xₙ), the gradient is:
∇f = [ ∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ ]
The symbol ∇ is called "nabla" or "del." The gradient is a vector with the same number of components as the function has inputs.
Example
For f(x, y) = x² + 3xy + y²:
∂f/∂x = 2x + 3y
∂f/∂y = 3x + 2y
∇f = [2x + 3y, 3x + 2y]
At the point (1, 2):
∇f(1, 2) = [2(1) + 3(2), 3(1) + 2(2)] = [8, 7]
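The computation above can be checked in code. The sketch below evaluates the analytic gradient and compares it against a central finite difference (the function and point are the ones from the example; the step size `h` is an arbitrary small value chosen for illustration):

```python
def f(x, y):
    return x**2 + 3*x*y + y**2

def grad_f(x, y):
    """Analytic gradient: [∂f/∂x, ∂f/∂y] = [2x + 3y, 3x + 2y]."""
    return [2*x + 3*y, 3*x + 2*y]

# Sanity check against a central finite difference at (1, 2)
h = 1e-6
x, y = 1.0, 2.0
num = [(f(x + h, y) - f(x - h, y)) / (2*h),
       (f(x, y + h) - f(x, y - h)) / (2*h)]

print(grad_f(x, y))  # [8.0, 7.0]
print(num)           # approximately [8.0, 7.0]
```

Comparing an analytic gradient against a finite-difference estimate like this is a standard "gradient check" and is useful well beyond this toy example.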
The Direction of Steepest Ascent
The gradient vector has a remarkable property: it points in the direction of steepest increase of the function.
y
^ ∇f = [8, 7]
| ↗
| /
| /
| * (1, 2)
+──────────> x
If you start at point (1, 2) and move in the direction [8, 7], the function f increases as fast as possible. This means:
- ∇f points toward steepest ascent (increasing loss)
- -∇f points toward steepest descent (decreasing loss)
This is why gradient descent subtracts the gradient: it moves in the direction that reduces the loss most rapidly.
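The steepest-ascent claim can be verified numerically: take a small step of fixed length in several directions from (1, 2) and see which one changes f the most. The set of comparison directions below is an arbitrary choice for illustration:

```python
import math

def f(x, y):
    return x**2 + 3*x*y + y**2

# Gradient at (1, 2) is [8, 7]; compare f after a small fixed-length
# step in the gradient direction vs. a few other unit directions.
x, y = 1.0, 2.0
eps = 1e-3
g = [8.0, 7.0]
norm = math.hypot(g[0], g[1])
directions = {
    "+grad": (g[0]/norm, g[1]/norm),
    "-grad": (-g[0]/norm, -g[1]/norm),
    "+x":    (1.0, 0.0),
    "+y":    (0.0, 1.0),
}
base = f(x, y)
changes = {name: f(x + eps*d[0], y + eps*d[1]) - base
           for name, d in directions.items()}

# +grad yields the largest increase, -grad the largest decrease
print(changes)
```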
Gradient in ML Training
For a loss function L with parameters w = [w₁, w₂, ..., wₙ]:
∇L = [ ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ ]
The gradient descent update rule is:
w_new = w_old - α · ∇L
where α (alpha) is the learning rate. Each parameter gets updated by its own partial derivative:
w₁_new = w₁_old - α · ∂L/∂w₁
w₂_new = w₂_old - α · ∂L/∂w₂
...
wₙ_new = wₙ_old - α · ∂L/∂wₙ
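A minimal sketch of this update rule in code, using an assumed toy loss L(w) = (w₁ − 3)² + (w₂ + 1)² (not from the text; chosen because its minimum is known to be at w = [3, −1], so convergence is easy to verify):

```python
def grad_L(w):
    """Gradient of the toy loss L(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return [2*(w[0] - 3), 2*(w[1] + 1)]

w = [0.0, 0.0]   # initial parameters
alpha = 0.1      # learning rate

# Repeatedly apply w_new = w_old - α · ∇L, one component per parameter
for _ in range(100):
    g = grad_L(w)
    w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]

print(w)  # converges toward [3, -1]
```

Note that the earlier example f(x, y) = x² + 3xy + y² would not work here: its only critical point is a saddle, so gradient descent on it diverges. That is why a convex bowl is used instead.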
Magnitude of the Gradient
The length (magnitude) of the gradient vector tells you how steep the surface is at that point:
|∇f| = √( (∂f/∂x₁)² + (∂f/∂x₂)² + ... + (∂f/∂xₙ)² )
- Large gradient: the surface is steep — the model is far from a minimum and should take large steps
- Small gradient: the surface is nearly flat — the model may be close to a minimum
- Zero gradient: the surface is flat — the model is at a critical point (minimum, maximum, or saddle point)
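The magnitude formula applied to the running example, as a short sketch:

```python
import math

# Gradient magnitude for f(x, y) = x² + 3xy + y²
def grad_f(x, y):
    return [2*x + 3*y, 3*x + 2*y]

def grad_norm(x, y):
    return math.sqrt(sum(c**2 for c in grad_f(x, y)))

print(grad_norm(1, 2))   # √(8² + 7²) = √113 ≈ 10.63 — steep here
print(grad_norm(0, 0))   # 0.0 — a critical point (for this f, a saddle)
```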
Visualizing the Gradient on a Contour Plot
On a contour plot, the gradient at each point is perpendicular to the contour line and points toward increasing values:
w₂
^
| ╭──────╮
| ╭─┤ ├─╮
| ╭┤ ╭────╮ ├╮
| │ ╭─┤ * ├─╮ │ * = minimum
| ╰┤←←╰────╯ ├╯ ← = -∇L (gradient descent direction)
| ╰─┤ ├─╯
| ╰──────╯
+──────────────> w₁
The arrows point inward (toward the minimum) because gradient descent moves in the negative gradient direction.
Gradient of a Neural Network
In a neural network, the gradient has one component for every trainable parameter:
| Component | Meaning |
|---|---|
| ∂L/∂W₁ | How the loss changes with respect to the weights in layer 1 |
| ∂L/∂b₁ | How the loss changes with respect to the biases in layer 1 |
| ∂L/∂W₂ | How the loss changes with respect to the weights in layer 2 |
| ∂L/∂b₂ | How the loss changes with respect to the biases in layer 2 |
| ... | One entry for every parameter in the network |
A network with 1 million parameters produces a gradient vector with 1 million components. Computing this gradient efficiently is exactly what backpropagation (Module 6) accomplishes.
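The gradient's size is easy to compute for a concrete architecture: one component per weight plus one per bias. The layer sizes below are a hypothetical example (a common MNIST-style shape), not anything specified in the text:

```python
# One ∂L/∂ entry per weight and per bias in a fully connected network
layer_sizes = [784, 128, 10]   # assumed example architecture

n_params = sum(n_in * n_out + n_out          # weights + biases per layer
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(n_params)  # 101770 components in the gradient vector
```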
Directional Derivatives
The partial derivative ∂f/∂x measures the rate of change along the x-axis. But what about an arbitrary direction? The directional derivative measures the rate of change in any direction v:
D_v f = ∇f · v̂
where v̂ is the unit vector in direction v. This is a dot product between the gradient and the direction. The directional derivative is maximized when v̂ points in the same direction as ∇f — confirming that the gradient points in the direction of steepest ascent.
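The dot-product formula can be checked directly at (1, 2), where ∇f = [8, 7]. Note that choosing v along an axis recovers the corresponding partial derivative, and choosing v along the gradient itself recovers |∇f|:

```python
import math

g = [8.0, 7.0]  # ∇f at (1, 2) for f(x, y) = x² + 3xy + y²

def directional_derivative(g, v):
    norm = math.hypot(v[0], v[1])
    v_hat = [v[0] / norm, v[1] / norm]          # unit vector in direction v
    return g[0] * v_hat[0] + g[1] * v_hat[1]    # dot product ∇f · v̂

print(directional_derivative(g, [1, 0]))  # 8.0 — recovers ∂f/∂x
print(directional_derivative(g, g))       # √113 ≈ 10.63 — the maximum, |∇f|
```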
Properties of the Gradient
| Property | Meaning |
|---|---|
| ∇f points toward steepest ascent | -∇f is the best direction to minimize f |
| \|∇f\| measures steepness | A large magnitude means the surface is changing rapidly at that point |
| ∇f = 0 at critical points | Minima, maxima, and saddle points all have zero gradient |
| ∇f is perpendicular to level curves | The gradient cuts across contour lines at right angles |
Summary
- The gradient ∇f is a vector of all partial derivatives
- It points in the direction of steepest increase of the function
- Gradient descent moves in the opposite direction: -∇f
- The magnitude |∇f| indicates how steep the surface is
- Each component of the gradient tells one parameter how to change
- The gradient is perpendicular to contour lines on the loss surface
- Computing the gradient efficiently for neural networks is the job of backpropagation
The next module covers the chain rule — the mathematical machinery that makes it possible to compute gradients through layers of composition, which is essential for neural networks.