Variance and Standard Deviation
While expected value tells us the "center" of a distribution, variance tells us how spread out the values are. Understanding variance is crucial for AI systems that need to quantify uncertainty and risk.
The Problem Expected Value Doesn't Solve
Consider two games:
- Game A: Always win $10
- Game B: Win $20 with P=0.5, win $0 with P=0.5
Both have E[Winnings] = $10, but they feel very different!
Game A is certain. Game B is risky. Variance captures this difference.
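A quick sketch in Python makes this concrete (the `mean_and_variance` helper below is just for illustration):

```python
def mean_and_variance(values, probabilities):
    """Expected value and variance of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probabilities))
    return mean, var

print(mean_and_variance([10], [1.0]))          # Game A: (10.0, 0.0)
print(mean_and_variance([20, 0], [0.5, 0.5]))  # Game B: (10.0, 100.0)
```

Same mean, very different spread: Game A's variance is 0, Game B's is 100.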
Variance Definition
Variance measures the average squared deviation from the mean:
Var(X) = E[(X - μ)²]
Equivalently:
Var(X) = E[X²] - (E[X])²
The second form is often easier to compute.
Computing Variance
Step 1: Find the expected value μ = E[X]
Step 2: For each outcome, compute (x - μ)²
Step 3: Take the expected value of these squared deviations
Dice Example
μ = E[X] = 3.5
| x | P(x) | (x - 3.5)² | P(x) × (x - 3.5)² |
|---|---|---|---|
| 1 | 1/6 | 6.25 | 1.042 |
| 2 | 1/6 | 2.25 | 0.375 |
| 3 | 1/6 | 0.25 | 0.042 |
| 4 | 1/6 | 0.25 | 0.042 |
| 5 | 1/6 | 2.25 | 0.375 |
| 6 | 1/6 | 6.25 | 1.042 |
Var(X) = 1.042 + 0.375 + 0.042 + 0.042 + 0.375 + 1.042 ≈ 2.917 (exactly 35/12)
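As a cross-check with the shortcut formula: E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 ≈ 15.167, so Var(X) = E[X²] - (E[X])² ≈ 15.167 - 12.25 ≈ 2.917, which matches.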
Standard Deviation
The standard deviation is the square root of variance:
σ = √Var(X)
Standard deviation has the same units as the original data, making it more interpretable.
For our dice: σ = √2.917 ≈ 1.71
Why Squared Deviations?
Why not just average |x - μ|?
- Mathematical convenience: Squares are differentiable everywhere (absolute value is not at 0)
- Penalizes outliers: Squaring makes large deviations contribute more
- Nice properties: Variance has clean formulas for sums of random variables
Properties of Variance
Scaling
Var(aX) = a² × Var(X)
Var(X + b) = Var(X)
Adding a constant shifts the distribution but doesn't change its spread.
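For the fair die, for example: Var(2X) = 4 × 2.917 ≈ 11.67, while Var(X + 10) = Var(X) ≈ 2.917.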
Sum of Independent Variables
If X and Y are independent:
Var(X + Y) = Var(X) + Var(Y)
Example: Two independent dice
Var(Sum) = Var(Die1) + Var(Die2) = 2.917 + 2.917 = 5.833
General Sum (Correlated)
For any X and Y:
Var(X + Y) = Var(X) + Var(Y) + 2×Cov(X, Y)
Where Cov(X, Y) is the covariance, which measures how X and Y vary together (it is zero when they are uncorrelated).
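A small simulation illustrates both rules (a sketch; the sample size and the perfectly correlated case Y = X are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)

# Independent dice: variances add, so expect about 2.917 + 2.917 = 5.833
print(np.var(die1 + die2))

# Perfectly correlated case Y = X: Var(X + X) = 2*Var(X) + 2*Cov(X, X) = 4*Var(X) ≈ 11.67
print(np.var(die1 + die1))
```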
Variance of Common Distributions
| Distribution | Variance |
|---|---|
| Bernoulli(p) | p(1-p) |
| Binomial(n, p) | np(1-p) |
| Poisson(λ) | λ |
| Uniform(a, b) | (b-a)²/12 |
| Normal(μ, σ²) | σ² |
| Exponential(λ) | 1/λ² |
Note on Bernoulli
Variance p(1-p) is maximized when p = 0.5:
- p = 0.5: Var = 0.25 (maximum uncertainty)
- p = 0.9: Var = 0.09 (less uncertainty)
- p = 1.0: Var = 0 (certain outcome)
Variance in AI
Model Confidence
A classifier with output [0.98, 0.01, 0.01] is confident: the class it predicts, viewed as a random variable, has low variance.
A classifier with output [0.34, 0.33, 0.33] is uncertain: the predicted class has high variance.
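One way to make this concrete (a sketch; treating the class index 0, 1, 2 as a numeric random variable is just an illustration, not a standard confidence metric) is to compute the variance of the predicted class under each output distribution:

```python
def class_variance(probabilities):
    """Variance of the class index drawn from a classifier's output distribution."""
    mean = sum(c * p for c, p in enumerate(probabilities))
    return sum(p * (c - mean) ** 2 for c, p in enumerate(probabilities))

print(class_variance([0.98, 0.01, 0.01]))  # about 0.05: confident
print(class_variance([0.34, 0.33, 0.33]))  # about 0.67: uncertain
```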
Ensemble Methods
Combining multiple models reduces variance:
Var(average of n independent models) = Var(single model) / n
This is why random forests and ensemble methods work!
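A quick simulation shows the 1/n reduction (a sketch; each "model" here is just an independent, unit-variance noisy estimate, an idealized assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 10, 100_000

# Each row is one trial; each column is one model's independent, unit-variance prediction error
predictions = rng.normal(size=(n_trials, n_models))
ensemble = predictions.mean(axis=1)

print(np.var(predictions[:, 0]))  # single model: about 1.0
print(np.var(ensemble))           # average of 10 models: about 0.1
```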
Bias-Variance Tradeoff
Model error can be decomposed:
Error = Bias² + Variance + Irreducible Noise
- Bias: Systematic error (model is consistently wrong)
- Variance: Sensitivity to training data (model changes a lot with different samples)
- Simple models: High bias, low variance
- Complex models: Low bias, high variance
Gradient Variance
In training neural networks, gradients have variance:
- Stochastic Gradient Descent (SGD): High variance (single sample)
- Mini-batch: Medium variance
- Full batch: Low variance (but expensive)
Optimizers like momentum and Adam average gradients over time, which smooths out this variance and helps stabilize training.
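The same 1/n effect explains why bigger batches give steadier gradients. A sketch (the per-example gradients are simulated as noisy values around a "true" gradient of 1.0, an artificial setup for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-example gradients: noisy values centered on the true gradient 1.0
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=10_000)

for batch_size in (1, 32, 1024):
    # Draw many random mini-batches and measure how much the batch-mean gradient fluctuates
    batches = rng.choice(per_example_grads, size=(5_000, batch_size))
    print(batch_size, np.var(batches.mean(axis=1)))  # roughly 4 / batch_size
```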
Coefficient of Variation
The coefficient of variation (CV) is a normalized measure of spread:
CV = σ / μ
Useful for comparing spread across different scales.
Example:
- Heights: μ = 170 cm, σ = 10 cm → CV = 5.9%
- Weights: μ = 70 kg, σ = 10 kg → CV = 14.3%
Weights are more variable relative to their mean.
Sample Variance
When estimating variance from data, we use sample variance:
s² = (1/(n-1)) × Σ(xᵢ - x̄)²
Why n-1 instead of n? This is Bessel's correction—it gives an unbiased estimate.
Degrees of Freedom
We divide by (n-1) because:
- We used one degree of freedom to estimate the mean
- Only (n-1) values are free to vary independently
- Dividing by n would underestimate the true variance
Computing Variance
From Distribution
```python
def variance(values, probabilities):
    """Variance of a discrete distribution, given outcomes and their probabilities."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    return sum(p * (v - mean)**2 for v, p in zip(values, probabilities))
```
From Samples
```python
def sample_variance(samples):
    """Unbiased sample variance, dividing by n - 1 (Bessel's correction)."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean)**2 for x in samples) / (n - 1)
```
Using NumPy
```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
variance = np.var(data, ddof=1)  # ddof=1 for sample variance
std_dev = np.std(data, ddof=1)
```
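To see why Bessel's correction matters, a small experiment (a sketch; the normal distribution and the sample size of 5 are arbitrary choices) compares the two estimators against a known true variance of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(100_000, 5))  # many small samples of size 5 from a unit-variance normal

print(np.var(samples, axis=1, ddof=0).mean())  # about 0.8: dividing by n underestimates
print(np.var(samples, axis=1, ddof=1).mean())  # about 1.0: dividing by n-1 is unbiased
```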
Variance in Neural Networks
Weight Initialization
Proper weight initialization controls variance through layers:
- Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
- He: Var(W) = 2 / fan_in (for ReLU)
This prevents variance from exploding or vanishing during forward/backward passes.
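A sketch of why this matters (the 20-layer stack of random linear + ReLU layers below is a toy setup, and the 0.01 scale for the naive baseline is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 20
x_he = x_naive = rng.normal(size=(1_000, width))

for _ in range(depth):
    w_he = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)  # He: Var(W) = 2 / fan_in
    w_naive = rng.normal(size=(width, width)) * 0.01               # naive small scale
    x_he = np.maximum(x_he @ w_he, 0.0)          # linear layer followed by ReLU
    x_naive = np.maximum(x_naive @ w_naive, 0.0)

# He init keeps activations at a stable scale; the naive init collapses toward zero
print(x_he.std(), x_naive.std())
```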
Batch Normalization
Explicitly normalizes to unit variance:
x_normalized = (x - μ) / σ
Then learns new mean and variance:
output = γ × x_normalized + β
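A minimal NumPy sketch of the forward pass (ignoring the running statistics used at inference time; eps is the usual small constant for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then apply learned scale and shift."""
    mean = x.mean(axis=0)  # per-feature mean across the batch
    var = x.var(axis=0)    # per-feature variance across the batch
    x_normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normalized + beta

x = np.random.randn(32, 4) * 10 + 3  # batch of 32 examples, 4 features, far from zero mean / unit variance
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))  # roughly 0 and 1 per feature
```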
Layer Normalization
Similar to batch norm but normalizes across features instead of batch:
- More stable for transformers
- Works with any batch size
Confidence Intervals
Variance relates to uncertainty about estimates:
95% Confidence Interval for Mean:
x̄ ± 1.96 × (s / √n)
Higher variance means wider confidence intervals (more uncertainty).
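As a quick sketch (the data values here are made up, and 1.96 assumes a normal approximation for the sample mean):

```python
import numpy as np

data = np.array([4.8, 5.1, 5.4, 4.9, 5.3, 5.0, 5.2, 4.7])
n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean: s / sqrt(n)

print(f"95% CI: {mean - 1.96 * sem:.3f} to {mean + 1.96 * sem:.3f}")
```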
Summary
- Variance measures spread around the expected value
- Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
- Standard deviation σ = √Var(X) has the same units as data
- Independent variances add: Var(X + Y) = Var(X) + Var(Y)
- Bias-variance tradeoff is fundamental to machine learning
- Proper variance initialization is crucial for deep networks
- Sample variance uses (n-1) for unbiased estimation
Next, we'll see how uncertainty is measured and used in AI predictions.

