Regularization and Overfitting
A model that fits the training data perfectly but performs poorly on new data has overfitted. Regularization modifies the loss function to discourage overly complex models, and the calculus of regularization reveals exactly how it shapes the optimization process.
What Is Overfitting?
Overfitting occurs when a model learns the noise in the training data rather than the underlying pattern.
Underfitting: a straight line that misses the pattern.
Good fit: a smooth curve that captures the trend.
Overfitting: a wiggly curve that passes through every training point.
The overfitted model memorizes training data but fails to generalize. In calculus terms: the model found a minimum of the training loss that does not correspond to low test loss.
L2 Regularization (Weight Decay)
L2 regularization adds a penalty proportional to the square of each weight:
L_total = L_data + λ Σᵢ wᵢ²
where λ (lambda) is the regularization strength and the sum is over all weights.
The Gradient of L2 Regularization
∂L_total/∂wᵢ = ∂L_data/∂wᵢ + 2λwᵢ
The extra term 2λwᵢ pushes each weight toward zero. Larger weights get pushed harder. This is why it is called weight decay — at each step, weights shrink slightly:
wᵢ_new = wᵢ_old - α(∂L_data/∂wᵢ + 2λwᵢ)
= wᵢ_old(1 - 2αλ) - α · ∂L_data/∂wᵢ
The factor (1 - 2αλ) multiplies the old weight, shrinking it by a small fraction each step. Only the data gradient term can grow weights; the regularization term always shrinks them.
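The shrink-then-step structure of the update can be checked numerically. A minimal NumPy sketch (function name and hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def l2_step(w, data_grad, lr=0.1, lam=0.01):
    """One gradient step with L2 regularization.

    Algebraically identical to w * (1 - 2*lr*lam) - lr * data_grad.
    """
    return w - lr * (data_grad + 2 * lam * w)

w = np.array([2.0, -1.0, 0.5])
w_new = l2_step(w, data_grad=np.zeros(3))
# With a zero data gradient, only the decay term acts:
# each weight shrinks by the factor (1 - 2*0.1*0.01) = 0.998
```

Setting the data gradient to zero isolates the decay: the weights shrink multiplicatively, exactly as the factored update above predicts.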
Why Smaller Weights Help
Large weights amplify small input differences, making the model sensitive to noise. Small weights produce smoother, more generalizable functions.
With large weights (no regularization), the fitted function develops sharp features around individual training points; with small weights (regularization on), it traces a smooth curve through the same data.
L1 Regularization (Lasso)
L1 regularization penalizes the absolute value of weights:
L_total = L_data + λ Σᵢ |wᵢ|
The Gradient of L1 Regularization
∂L_total/∂wᵢ = ∂L_data/∂wᵢ + λ · sign(wᵢ)
where sign(wᵢ) is +1 if wᵢ > 0 and -1 if wᵢ < 0. The absolute value is not differentiable at wᵢ = 0, so in practice a subgradient (typically 0) is used there.
Unlike L2, the penalty is constant regardless of weight magnitude. This means L1 can push small weights all the way to exactly zero, effectively removing features from the model. This produces sparse models.
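Plain sign-based gradient steps tend to oscillate around zero rather than landing on it, so in practice exact sparsity is usually obtained with the soft-thresholding (proximal) update for the L1 penalty. A minimal sketch, with illustrative names and values:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink each weight
    toward zero by t, and clip any weight that would cross zero
    to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6, 0.02])
w_new = soft_threshold(w, t=0.1)  # t plays the role of lr * lambda
# Weights with |w| <= 0.1 become exactly 0; larger ones shrink by 0.1
```

The clipping step is what produces exact zeros, and hence sparse models, rather than merely small weights.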
L1 vs. L2 Comparison
| Property | L1 | L2 |
|---|---|---|
| Penalty term | λΣ|wᵢ| | λΣwᵢ² |
| Gradient contribution | λ · sign(w) | 2λw |
| Effect on weights | Pushes to exact zero (sparse) | Shrinks toward zero (small but non-zero) |
| Feature selection | Yes (removes irrelevant features) | No (keeps all features) |
| Use case | When many features are irrelevant | General regularization |
Dropout (Regularization Without Changing the Loss)
Dropout randomly sets neuron activations to zero during training:
Training: h = [0.5, 0, 0.3, 0, 0.7] (dropped neurons zeroed out)
Inference: h = [0.5, 0.2, 0.3, 0.4, 0.7] × (1 - dropout_rate)
From a calculus perspective, dropout changes which paths in the computational graph are active. The gradient flows only through non-dropped neurons, and the model is forced to learn redundant representations — no single neuron can be solely responsible for a feature.
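The mechanism maps directly to a few lines of NumPy. One detail worth noting: the listing above scales activations at inference by (1 - dropout_rate); the equivalent "inverted" formulation sketched here rescales during training instead, so inference needs no change. Names and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate, training):
    """Inverted dropout: zero each activation with probability `rate`
    during training and rescale the survivors by 1/(1 - rate) so the
    expected activation is unchanged; inference passes h through."""
    if not training:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

h = np.array([0.5, 0.2, 0.3, 0.4, 0.7])
h_train = dropout(h, rate=0.4, training=True)   # some entries zeroed, rest scaled
h_infer = dropout(h, rate=0.4, training=False)  # unchanged
```

Because the mask multiplies the activations, the gradient through a dropped neuron is zero for that step, which is exactly the "inactive path" behavior described above.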
Batch Normalization
Batch normalization normalizes layer activations to have zero mean and unit variance:
x̂ = (x - μ_batch) / √(σ²_batch + ε)
y = γx̂ + β
where γ and β are learnable parameters.
Batch norm affects the loss landscape by:
- Smoothing the surface (reducing sharp curvature)
- Reducing sensitivity to weight initialization
- Allowing higher learning rates
The derivatives through batch norm are more complex because the mean and variance depend on the entire batch. Frameworks handle this automatically, but the effect is to add an implicit regularization by coupling each example's gradient to the entire batch.
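The batch coupling is visible in a minimal forward-pass sketch: the mean and variance are computed per feature across the whole batch, so every output depends on every example. Names are illustrative, and real frameworks additionally track running statistics for inference:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass; rows of x are examples, columns are
    features. mu and var are per-feature batch statistics, which is
    why each example's gradient depends on the entire batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
y = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
# Each column of y has (approximately) zero mean and unit variance
```

With γ = 1 and β = 0 the output is just the normalized activations; learnable γ and β let the network undo the normalization where that helps.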
The Bias-Variance Tradeoff
Regularization controls the tradeoff between two types of error:
| Error Type | Description | Caused By |
|---|---|---|
| Bias | Model is too simple to capture the pattern | Underfitting (too much regularization) |
| Variance | Model is too sensitive to training data noise | Overfitting (too little regularization) |
Plotted against regularization strength λ (from none to strong), variance decreases while bias increases; their sum, the total error, traces a U shape, and the optimal λ sits at the bottom of the U.
The optimal λ minimizes total error. Too little regularization and variance dominates (overfitting). Too much regularization and bias dominates (underfitting).
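The tradeoff can be observed with a small experiment. A hedged sketch using closed-form ridge regression (L2-regularized least squares) on noisy polynomial features; the target function, noise level, and degree here are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth target on [0, 1]
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=30)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, size=200)

def features(x, degree=9):
    # Polynomial features: columns x^degree, ..., x, 1
    return np.vander(x, degree + 1)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    w = ridge_fit(features(x_train), y_train, lam)
    test_mse = np.mean((features(x_test) @ w - y_test) ** 2)
    # test MSE typically falls and then rises again as lam grows
```

Sweeping λ and plotting the test error reproduces the U-shaped total-error curve: tiny λ lets the degree-9 polynomial chase noise, while very large λ flattens the fit into underfitting.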
Regularization in Practice
Common regularization strategies for different architectures:
| Architecture | Typical Regularization |
|---|---|
| Linear models | L1 or L2 |
| Small neural networks | L2 + dropout |
| Large CNNs | Dropout + data augmentation |
| Transformers | AdamW (weight decay) + dropout |
| Very large models (LLMs) | Weight decay + gradient clipping |
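The table mentions AdamW, whose defining feature is decoupled weight decay: the decay term is subtracted from the weights directly rather than folded into the adaptive gradient. A minimal single-step sketch; the hyperparameter values are common defaults, not prescriptions:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: the decay term lr * weight_decay * w is
    applied directly to the weights, decoupled from the moment
    estimates used for the data gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
w, m, v = adamw_step(w, np.array([0.1, -0.1]), m, v, t=1)
```

With a zero gradient the update reduces to pure multiplicative shrinkage, mirroring the L2 weight-decay factor derived earlier; the difference from plain L2-in-the-loss is that the decay is not rescaled by the adaptive denominator.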
Summary
- Overfitting occurs when a model memorizes training data instead of learning general patterns
- L2 regularization adds λΣwᵢ² to the loss, pushing weights toward zero and producing smoother models
- L1 regularization adds λΣ|wᵢ| to the loss, driving some weights to exactly zero for feature selection
- The gradient of L2 is proportional to the weight (2λw); the gradient of L1 is constant (λ · sign(w))
- Dropout randomly zeros activations during training, forcing redundant representations
- Batch normalization smooths the loss landscape and adds implicit regularization
- The bias-variance tradeoff governs the choice of regularization strength
- Regularization shapes the loss function to favor solutions that generalize, not just solutions that minimize training error
The next module puts everything together: how neural networks use the chain rule, gradients, loss functions, and optimization to learn through backpropagation.

