Regularization and Overfitting
A model that fits the training data perfectly but performs poorly on new data has overfitted. Regularization modifies the loss function to discourage overly complex models, and the calculus of regularization reveals exactly how it shapes the optimization process.
What Is Overfitting?
Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern.
Underfitting: a straight line that misses the pattern.
Good fit: a smooth curve that captures the trend.
Overfitting: a wiggly curve that passes through every training point.
The overfitted model memorizes training data but fails to generalize. In calculus terms: the model found a minimum of the training loss that does not correspond to low test loss.
L2 Regularization (Weight Decay)
L2 regularization adds a penalty proportional to the square of each weight:
L_total = L_data + λ Σᵢ wᵢ²
where λ (lambda) is the regularization strength and the sum is over all weights.
The Gradient of L2 Regularization
∂L_total/∂wᵢ = ∂L_data/∂wᵢ + 2λwᵢ
The extra term 2λwᵢ pushes each weight toward zero. Larger weights get pushed harder. This is why it is called weight decay — at each step, weights shrink slightly:
wᵢ_new = wᵢ_old - α(∂L_data/∂wᵢ + 2λwᵢ)
= wᵢ_old(1 - 2αλ) - α · ∂L_data/∂wᵢ
The factor (1 - 2αλ) multiplies the old weight, shrinking it by a small fraction each step. Only the data gradient term can grow weights; the regularization term always shrinks them.
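The shrink-then-step structure of the update can be checked numerically. A minimal NumPy sketch (function name and hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def l2_step(w, data_grad, lr=0.1, lam=0.01):
    """One gradient step with L2 regularization.

    Algebraically identical to w * (1 - 2*lr*lam) - lr * data_grad.
    """
    return w - lr * (data_grad + 2 * lam * w)

w = np.array([2.0, -1.0, 0.5])
w_new = l2_step(w, data_grad=np.zeros(3))
# With a zero data gradient, only the decay term acts:
# each weight shrinks by the factor (1 - 2*0.1*0.01) = 0.998
```

Setting the data gradient to zero isolates the decay: the weights shrink multiplicatively, exactly as the factored update above predicts.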
Why Smaller Weights Help
Large weights amplify small input differences, making the model sensitive to noise. Small weights produce smoother, more generalizable functions.
With large weights (no regularization), the fitted function develops sharp features around individual training points; with small weights (regularization on), it traces a smooth curve through the same data.
L1 Regularization (Lasso)
L1 regularization penalizes the absolute value of weights:
L_total = L_data + λ Σᵢ |wᵢ|
The Gradient of L1 Regularization
∂L_total/∂wᵢ = ∂L_data/∂wᵢ + λ · sign(wᵢ)
where sign(wᵢ) is +1 if wᵢ > 0 and -1 if wᵢ < 0. The absolute value is not differentiable at wᵢ = 0, so in practice a subgradient (typically 0) is used there.
Unlike L2, the penalty is constant regardless of weight magnitude. This means L1 can push small weights all the way to exactly zero, effectively removing features from the model. This produces sparse models.
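Plain sign-based gradient steps tend to oscillate around zero rather than landing on it, so in practice exact sparsity is usually obtained with the soft-thresholding (proximal) update for the L1 penalty. A minimal sketch, with illustrative names and values:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink each weight
    toward zero by t, and clip any weight that would cross zero
    to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6, 0.02])
w_new = soft_threshold(w, t=0.1)  # t plays the role of lr * lambda
# Weights with |w| <= 0.1 become exactly 0; larger ones shrink by 0.1
```

The clipping step is what produces exact zeros, and hence sparse models, rather than merely small weights.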
L1 vs. L2 Comparison
| Property | L1 | L2 |
|---|---|---|
| Penalty term | λΣ|wᵢ| | λΣwᵢ² |
| Gradient contribution | λ · sign(w) | 2λw |
| Effect on weights | Pushes to exact zero (sparse) | Shrinks toward zero (small but non-zero) |
| Feature selection | Yes (removes irrelevant features) | No (keeps all features) |
| Use case | When many features are irrelevant | General regularization |
Dropout (Regularization Without Changing the Loss)
Dropout randomly sets neuron activations to zero during training:
Training: h = [0.5, 0, 0.3, 0, 0.7] (dropped neurons zeroed out)
Inference: h = [0.5, 0.2, 0.3, 0.4, 0.7] × (1 - dropout_rate)
From a calculus perspective, dropout changes which paths in the computational graph are active. The gradient flows only through non-dropped neurons, and the model is forced to learn redundant representations — no single neuron can be solely responsible for a feature.
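The mechanism maps directly to a few lines of NumPy. One detail worth noting: the listing above scales activations at inference by (1 - dropout_rate); the equivalent "inverted" formulation sketched here rescales during training instead, so inference needs no change. Names and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate, training):
    """Inverted dropout: zero each activation with probability `rate`
    during training and rescale the survivors by 1/(1 - rate) so the
    expected activation is unchanged; inference passes h through."""
    if not training:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

h = np.array([0.5, 0.2, 0.3, 0.4, 0.7])
h_train = dropout(h, rate=0.4, training=True)   # some entries zeroed, rest scaled
h_infer = dropout(h, rate=0.4, training=False)  # unchanged
```

Because the mask multiplies the activations, the gradient through a dropped neuron is zero for that step, which is exactly the "inactive path" behavior described above.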
Batch Normalization
Batch normalization normalizes layer activations to have zero mean and unit variance:
x̂ = (x - μ_batch) / √(σ²_batch + ε)
y = γx̂ + β
where γ and β are learnable parameters.
Batch norm affects the loss landscape by:
- Smoothing the surface (reducing sharp curvature)
- Reducing sensitivity to weight initialization
- Allowing higher learning rates
The derivatives through batch norm are more complex because the mean and variance depend on the entire batch. Frameworks handle this automatically, but the effect is to add an implicit regularization by coupling each example's gradient to the entire batch.
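The batch coupling is visible in a minimal forward-pass sketch: the mean and variance are computed per feature across the whole batch, so every output depends on every example. Names are illustrative, and real frameworks additionally track running statistics for inference:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass; rows of x are examples, columns are
    features. mu and var are per-feature batch statistics, which is
    why each example's gradient depends on the entire batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
y = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
# Each column of y has (approximately) zero mean and unit variance
```

With γ = 1 and β = 0 the output is just the normalized activations; learnable γ and β let the network undo the normalization where that helps.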
The Bias-Variance Tradeoff
Regularization controls the tradeoff between two types of error:
| Error Type | Description | Caused By |
|---|---|---|
| Bias | Model is too simple to capture the pattern | Underfitting (too much regularization) |
| Variance | Model is too sensitive to training data noise | Overfitting (too little regularization) |
Plotted against regularization strength λ (from none to strong), variance decreases while bias increases; their sum, the total error, traces a U shape, and the optimal λ sits at the bottom of the U.
The optimal λ minimizes total error. Too little regularization and variance dominates (overfitting). Too much regularization and bias dominates (underfitting).
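The tradeoff can be observed with a small experiment. A hedged sketch using closed-form ridge regression (L2-regularized least squares) on noisy polynomial features; the target function, noise level, and degree here are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth target on [0, 1]
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=30)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, size=200)

def features(x, degree=9):
    # Polynomial features: columns x^degree, ..., x, 1
    return np.vander(x, degree + 1)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    w = ridge_fit(features(x_train), y_train, lam)
    test_mse = np.mean((features(x_test) @ w - y_test) ** 2)
    # test MSE typically falls and then rises again as lam grows
```

Sweeping λ and plotting the test error reproduces the U-shaped total-error curve: tiny λ lets the degree-9 polynomial chase noise, while very large λ flattens the fit into underfitting.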
Regularization in Practice
Common regularization strategies for different architectures:
| Architecture | Typical Regularization |
|---|---|
| Linear models | L1 or L2 |
| Small neural networks | L2 + dropout |
| Large CNNs | Dropout + data augmentation |
| Transformers | AdamW (weight decay) + dropout |
| Very large models (LLMs) | Weight decay + gradient clipping |
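The table mentions AdamW, whose defining feature is decoupled weight decay: the decay term is subtracted from the weights directly rather than folded into the adaptive gradient. A minimal single-step sketch; the hyperparameter values are common defaults, not prescriptions:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: the decay term lr * weight_decay * w is
    applied directly to the weights, decoupled from the moment
    estimates used for the data gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
w, m, v = adamw_step(w, np.array([0.1, -0.1]), m, v, t=1)
```

With a zero gradient the update reduces to pure multiplicative shrinkage, mirroring the L2 weight-decay factor derived earlier; the difference from plain L2-in-the-loss is that the decay is not rescaled by the adaptive denominator.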
Summary
- Overfitting occurs when a model memorizes training data instead of learning general patterns
- L2 regularization adds λΣwᵢ² to the loss, pushing weights toward zero and producing smoother models
- L1 regularization adds λΣ|wᵢ| to the loss, driving some weights to exactly zero for feature selection
- The gradient of L2 is proportional to the weight (2λw); the gradient of L1 is constant (λ · sign(w))
- Dropout randomly zeros activations during training, forcing redundant representations
- Batch normalization smooths the loss landscape and adds implicit regularization
- The bias-variance tradeoff governs the choice of regularization strength
- Regularization shapes the loss function to favor solutions that generalize, not just solutions that minimize training error
The next module puts everything together: how neural networks use the chain rule, gradients, loss functions, and optimization to learn through backpropagation.

