The Gradient Vector
Each partial derivative tells you how the loss changes in one parameter's direction. But models have many parameters, and we need all this information packaged together. The gradient is a vector that collects all partial derivatives into a single object, pointing in the direction of steepest increase.
Defining the Gradient
For a function f(x₁, x₂, ..., xₙ), the gradient is:
∇f = [ ∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ ]
The symbol ∇ is called "nabla" or "del." The gradient is a vector with the same number of components as the function has inputs.
Example
For f(x, y) = x² + 3xy + y²:
∂f/∂x = 2x + 3y
∂f/∂y = 3x + 2y
∇f = [2x + 3y, 3x + 2y]
At the point (1, 2):
∇f(1, 2) = [2(1) + 3(2), 3(1) + 2(2)] = [8, 7]
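The computation above can be checked in code. The sketch below evaluates the analytic gradient and compares it against a central finite difference (the function and point are the ones from the example; the step size `h` is an arbitrary small value chosen for illustration):

```python
def f(x, y):
    return x**2 + 3*x*y + y**2

def grad_f(x, y):
    """Analytic gradient: [∂f/∂x, ∂f/∂y] = [2x + 3y, 3x + 2y]."""
    return [2*x + 3*y, 3*x + 2*y]

# Sanity check against a central finite difference at (1, 2)
h = 1e-6
x, y = 1.0, 2.0
num = [(f(x + h, y) - f(x - h, y)) / (2*h),
       (f(x, y + h) - f(x, y - h)) / (2*h)]

print(grad_f(x, y))  # [8.0, 7.0]
print(num)           # approximately [8.0, 7.0]
```

Comparing an analytic gradient against a finite-difference estimate like this is a standard "gradient check" and is useful well beyond this toy example.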
The Direction of Steepest Ascent
The gradient vector has a remarkable property: it points in the direction of steepest increase of the function.
y
^ ∇f = [8, 7]
| ↗
| /
| /
| * (1, 2)
+──────────> x
If you start at point (1, 2) and move in the direction [8, 7], the function f increases as fast as possible. This means:
- ∇f points toward steepest ascent (increasing loss)
- -∇f points toward steepest descent (decreasing loss)
This is why gradient descent subtracts the gradient: it moves in the direction that reduces the loss most rapidly.
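The steepest-ascent claim can be verified numerically: take a small step of fixed length in several directions from (1, 2) and see which one changes f the most. The set of comparison directions below is an arbitrary choice for illustration:

```python
import math

def f(x, y):
    return x**2 + 3*x*y + y**2

# Gradient at (1, 2) is [8, 7]; compare f after a small fixed-length
# step in the gradient direction vs. a few other unit directions.
x, y = 1.0, 2.0
eps = 1e-3
g = [8.0, 7.0]
norm = math.hypot(g[0], g[1])
directions = {
    "+grad": (g[0]/norm, g[1]/norm),
    "-grad": (-g[0]/norm, -g[1]/norm),
    "+x":    (1.0, 0.0),
    "+y":    (0.0, 1.0),
}
base = f(x, y)
changes = {name: f(x + eps*d[0], y + eps*d[1]) - base
           for name, d in directions.items()}

# +grad yields the largest increase, -grad the largest decrease
print(changes)
```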
Gradient in ML Training
For a loss function L with parameters w = [w₁, w₂, ..., wₙ]:
∇L = [ ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ ]
The gradient descent update rule is:
w_new = w_old - α · ∇L
where α (alpha) is the learning rate. Each parameter gets updated by its own partial derivative:
w₁_new = w₁_old - α · ∂L/∂w₁
w₂_new = w₂_old - α · ∂L/∂w₂
...
wₙ_new = wₙ_old - α · ∂L/∂wₙ
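A minimal sketch of this update rule in code, using an assumed toy loss L(w) = (w₁ − 3)² + (w₂ + 1)² (not from the text; chosen because its minimum is known to be at w = [3, −1], so convergence is easy to verify):

```python
def grad_L(w):
    """Gradient of the toy loss L(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return [2*(w[0] - 3), 2*(w[1] + 1)]

w = [0.0, 0.0]   # initial parameters
alpha = 0.1      # learning rate

# Repeatedly apply w_new = w_old - α · ∇L, one component per parameter
for _ in range(100):
    g = grad_L(w)
    w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]

print(w)  # converges toward [3, -1]
```

Note that the earlier example f(x, y) = x² + 3xy + y² would not work here: its only critical point is a saddle, so gradient descent on it diverges. That is why a convex bowl is used instead.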
Magnitude of the Gradient
The length (magnitude) of the gradient vector tells you how steep the surface is at that point:
|∇f| = √( (∂f/∂x₁)² + (∂f/∂x₂)² + ... + (∂f/∂xₙ)² )
- Large gradient: the surface is steep — the model is far from a minimum and should take large steps
- Small gradient: the surface is nearly flat — the model may be close to a minimum
- Zero gradient: the surface is flat — the model is at a critical point (minimum, maximum, or saddle point)
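The magnitude formula applied to the running example, as a short sketch:

```python
import math

# Gradient magnitude for f(x, y) = x² + 3xy + y²
def grad_f(x, y):
    return [2*x + 3*y, 3*x + 2*y]

def grad_norm(x, y):
    return math.sqrt(sum(c**2 for c in grad_f(x, y)))

print(grad_norm(1, 2))   # √(8² + 7²) = √113 ≈ 10.63 — steep here
print(grad_norm(0, 0))   # 0.0 — a critical point (for this f, a saddle)
```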
Visualizing the Gradient on a Contour Plot
On a contour plot, the gradient at each point is perpendicular to the contour line and points toward increasing values:
w₂
^
| ╭──────╮
| ╭─┤ ├─╮
| ╭┤ ╭────╮ ├╮
| │ ╭─┤ * ├─╮ │ * = minimum
| ╰┤←←╰────╯ ├╯ ← = -∇L (gradient descent direction)
| ╰─┤ ├─╯
| ╰──────╯
+──────────────> w₁
The arrows point inward (toward the minimum) because gradient descent moves in the negative gradient direction.
Gradient of a Neural Network
In a neural network, the gradient has one component for every trainable parameter:
| Component | Meaning |
|---|---|
| ∂L/∂W₁ | How the loss changes with respect to the weights in layer 1 |
| ∂L/∂b₁ | How the loss changes with respect to the biases in layer 1 |
| ∂L/∂W₂ | How the loss changes with respect to the weights in layer 2 |
| ∂L/∂b₂ | How the loss changes with respect to the biases in layer 2 |
| ... | One entry for every parameter in the network |
A network with 1 million parameters produces a gradient vector with 1 million components. Computing this gradient efficiently is exactly what backpropagation (Module 6) accomplishes.
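The gradient's size is easy to compute for a concrete architecture: one component per weight plus one per bias. The layer sizes below are a hypothetical example (a common MNIST-style shape), not anything specified in the text:

```python
# One ∂L/∂ entry per weight and per bias in a fully connected network
layer_sizes = [784, 128, 10]   # assumed example architecture

n_params = sum(n_in * n_out + n_out          # weights + biases per layer
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(n_params)  # 101770 components in the gradient vector
```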
Directional Derivatives
The partial derivative ∂f/∂x measures the rate of change along the x-axis. But what about an arbitrary direction? The directional derivative measures the rate of change in any direction v:
D_v f = ∇f · v̂
where v̂ is the unit vector in direction v. This is a dot product between the gradient and the direction. The directional derivative is maximized when v̂ points in the same direction as ∇f — confirming that the gradient points in the direction of steepest ascent.
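The dot-product formula can be checked directly at (1, 2), where ∇f = [8, 7]. Note that choosing v along an axis recovers the corresponding partial derivative, and choosing v along the gradient itself recovers |∇f|:

```python
import math

g = [8.0, 7.0]  # ∇f at (1, 2) for f(x, y) = x² + 3xy + y²

def directional_derivative(g, v):
    norm = math.hypot(v[0], v[1])
    v_hat = [v[0] / norm, v[1] / norm]          # unit vector in direction v
    return g[0] * v_hat[0] + g[1] * v_hat[1]    # dot product ∇f · v̂

print(directional_derivative(g, [1, 0]))  # 8.0 — recovers ∂f/∂x
print(directional_derivative(g, g))       # √113 ≈ 10.63 — the maximum, |∇f|
```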
Properties of the Gradient
| Property | Meaning |
|---|---|
| ∇f points toward steepest ascent | -∇f is the best direction to minimize f |
| \|∇f\| measures steepness | A large magnitude means the surface is changing rapidly at that point |
| ∇f = 0 at critical points | Minima, maxima, and saddle points all have zero gradient |
| ∇f is perpendicular to level curves | The gradient cuts across contour lines at right angles |
Summary
- The gradient ∇f is a vector of all partial derivatives
- It points in the direction of steepest increase of the function
- Gradient descent moves in the opposite direction: -∇f
- The magnitude |∇f| indicates how steep the surface is
- Each component of the gradient tells one parameter how to change
- The gradient is perpendicular to contour lines on the loss surface
- Computing the gradient efficiently for neural networks is the job of backpropagation
The next module covers the chain rule — the mathematical machinery that makes it possible to compute gradients through layers of composition, which is essential for neural networks.