Variance and Standard Deviation
While expected value tells us the "center" of a distribution, variance tells us how spread out the values are. Understanding variance is crucial for AI systems that need to quantify uncertainty and risk.
The Problem Expected Value Doesn't Solve
Consider two games:
- Game A: Always win $10
- Game B: Win $20 with P=0.5, win $0 with P=0.5
Both have E[Winnings] = $10, but they feel very different!
Game A is certain. Game B is risky. Variance captures this difference.
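A quick sketch in Python makes this concrete (the `mean_and_variance` helper below is just for illustration):

```python
def mean_and_variance(values, probabilities):
    """Expected value and variance of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probabilities))
    return mean, var

print(mean_and_variance([10], [1.0]))          # Game A: (10.0, 0.0)
print(mean_and_variance([20, 0], [0.5, 0.5]))  # Game B: (10.0, 100.0)
```

Same mean, very different spread: Game A's variance is 0, Game B's is 100.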
Variance Definition
Variance measures the average squared deviation from the mean:
Var(X) = E[(X - μ)²]
Equivalently:
Var(X) = E[X²] - (E[X])²
The second form is often easier to compute.
Computing Variance
Step 1: Find the expected value μ = E[X]
Step 2: For each outcome, compute (x - μ)²
Step 3: Take the expected value of these squared deviations
Dice Example
μ = E[X] = 3.5
| x | P(x) | (x - 3.5)² | P(x) × (x - 3.5)² |
|---|---|---|---|
| 1 | 1/6 | 6.25 | 1.042 |
| 2 | 1/6 | 2.25 | 0.375 |
| 3 | 1/6 | 0.25 | 0.042 |
| 4 | 1/6 | 0.25 | 0.042 |
| 5 | 1/6 | 2.25 | 0.375 |
| 6 | 1/6 | 6.25 | 1.042 |
Var(X) = 1.042 + 0.375 + 0.042 + 0.042 + 0.375 + 1.042 ≈ 2.917 (exactly 35/12)
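As a cross-check with the shortcut formula: E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 ≈ 15.167, so Var(X) = E[X²] - (E[X])² ≈ 15.167 - 12.25 ≈ 2.917, which matches.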
Standard Deviation
The standard deviation is the square root of variance:
σ = √Var(X)
Standard deviation has the same units as the original data, making it more interpretable.
For our dice: σ = √2.917 ≈ 1.71
Why Squared Deviations?
Why not just average |x - μ|?
- Mathematical convenience: Squares are differentiable everywhere (absolute value is not at 0)
- Penalizes outliers: Squaring makes large deviations contribute more
- Nice properties: Variance has clean formulas for sums of random variables
Properties of Variance
Scaling
Var(aX) = a² × Var(X)
Var(X + b) = Var(X)
Adding a constant shifts the distribution but doesn't change its spread.
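For the fair die, for example: Var(2X) = 4 × 2.917 ≈ 11.67, while Var(X + 10) = Var(X) ≈ 2.917.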
Sum of Independent Variables
If X and Y are independent:
Var(X + Y) = Var(X) + Var(Y)
Example: Two independent dice
Var(Sum) = Var(Die1) + Var(Die2) = 2.917 + 2.917 = 5.833
General Sum (Correlated)
For any X and Y:
Var(X + Y) = Var(X) + Var(Y) + 2×Cov(X, Y)
Where Cov(X, Y) is the covariance, which measures how X and Y vary together (it is zero when they are uncorrelated).
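A small simulation illustrates both rules (a sketch; the sample size and the perfectly correlated case Y = X are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)

# Independent dice: variances add, so expect about 2.917 + 2.917 = 5.833
print(np.var(die1 + die2))

# Perfectly correlated case Y = X: Var(X + X) = 2*Var(X) + 2*Cov(X, X) = 4*Var(X) ≈ 11.67
print(np.var(die1 + die1))
```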
Variance of Common Distributions
| Distribution | Variance |
|---|---|
| Bernoulli(p) | p(1-p) |
| Binomial(n, p) | np(1-p) |
| Poisson(λ) | λ |
| Uniform(a, b) | (b-a)²/12 |
| Normal(μ, σ²) | σ² |
| Exponential(λ) | 1/λ² |
Note on Bernoulli
Variance p(1-p) is maximized when p = 0.5:
- p = 0.5: Var = 0.25 (maximum uncertainty)
- p = 0.9: Var = 0.09 (less uncertainty)
- p = 1.0: Var = 0 (certain outcome)
Variance in AI
Model Confidence
A classifier with output [0.98, 0.01, 0.01] is confident: the class it predicts, viewed as a random variable, has low variance.
A classifier with output [0.34, 0.33, 0.33] is uncertain: the predicted class has high variance.
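One way to make this concrete (a sketch; treating the class index 0, 1, 2 as a numeric random variable is just an illustration, not a standard confidence metric) is to compute the variance of the predicted class under each output distribution:

```python
def class_variance(probabilities):
    """Variance of the class index drawn from a classifier's output distribution."""
    mean = sum(c * p for c, p in enumerate(probabilities))
    return sum(p * (c - mean) ** 2 for c, p in enumerate(probabilities))

print(class_variance([0.98, 0.01, 0.01]))  # about 0.05: confident
print(class_variance([0.34, 0.33, 0.33]))  # about 0.67: uncertain
```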
Ensemble Methods
Combining multiple models reduces variance:
Var(average of n independent models) = Var(single model) / n
This is why random forests and ensemble methods work!
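A quick simulation shows the 1/n reduction (a sketch; each "model" here is just an independent, unit-variance noisy estimate, an idealized assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 10, 100_000

# Each row is one trial; each column is one model's independent, unit-variance prediction error
predictions = rng.normal(size=(n_trials, n_models))
ensemble = predictions.mean(axis=1)

print(np.var(predictions[:, 0]))  # single model: about 1.0
print(np.var(ensemble))           # average of 10 models: about 0.1
```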
Bias-Variance Tradeoff
Model error can be decomposed:
Error = Bias² + Variance + Irreducible Noise
- Bias: Systematic error (model is consistently wrong)
- Variance: Sensitivity to training data (model changes a lot with different samples)
- Simple models: High bias, low variance
- Complex models: Low bias, high variance
Gradient Variance
In training neural networks, gradients have variance:
- Stochastic Gradient Descent (SGD): High variance (single sample)
- Mini-batch: Medium variance
- Full batch: Low variance (but expensive)
Optimizers like momentum and Adam average gradients over time, which smooths out this variance and helps stabilize training.
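The same 1/n effect explains why bigger batches give steadier gradients. A sketch (the per-example gradients are simulated as noisy values around a "true" gradient of 1.0, an artificial setup for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-example gradients: noisy values centered on the true gradient 1.0
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=10_000)

for batch_size in (1, 32, 1024):
    # Draw many random mini-batches and measure how much the batch-mean gradient fluctuates
    batches = rng.choice(per_example_grads, size=(5_000, batch_size))
    print(batch_size, np.var(batches.mean(axis=1)))  # roughly 4 / batch_size
```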
Coefficient of Variation
The coefficient of variation (CV) is a normalized measure of spread:
CV = σ / μ
Useful for comparing spread across different scales.
Example:
- Heights: μ = 170 cm, σ = 10 cm → CV = 5.9%
- Weights: μ = 70 kg, σ = 10 kg → CV = 14.3%
Weights are more variable relative to their mean.
Sample Variance
When estimating variance from data, we use sample variance:
s² = (1/(n-1)) × Σ(xᵢ - x̄)²
Why n-1 instead of n? This is Bessel's correction—it gives an unbiased estimate.
Degrees of Freedom
We divide by (n-1) because:
- We used one degree of freedom to estimate the mean
- Only (n-1) values are free to vary independently
- Dividing by n would underestimate the true variance
Computing Variance
From Distribution
```python
def variance(values, probabilities):
    """Variance of a discrete distribution, given outcomes and their probabilities."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    return sum(p * (v - mean)**2 for v, p in zip(values, probabilities))
```
From Samples
```python
def sample_variance(samples):
    """Unbiased sample variance, dividing by n - 1 (Bessel's correction)."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean)**2 for x in samples) / (n - 1)
```
Using NumPy
```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
variance = np.var(data, ddof=1)  # ddof=1 for sample variance
std_dev = np.std(data, ddof=1)
```
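To see why Bessel's correction matters, a small experiment (a sketch; the normal distribution and the sample size of 5 are arbitrary choices) compares the two estimators against a known true variance of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(100_000, 5))  # many small samples of size 5 from a unit-variance normal

print(np.var(samples, axis=1, ddof=0).mean())  # about 0.8: dividing by n underestimates
print(np.var(samples, axis=1, ddof=1).mean())  # about 1.0: dividing by n-1 is unbiased
```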
Variance in Neural Networks
Weight Initialization
Proper weight initialization controls variance through layers:
- Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
- He: Var(W) = 2 / fan_in (for ReLU)
This prevents variance from exploding or vanishing during forward/backward passes.
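A sketch of why this matters (the 20-layer stack of random linear + ReLU layers below is a toy setup, and the 0.01 scale for the naive baseline is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 20
x_he = x_naive = rng.normal(size=(1_000, width))

for _ in range(depth):
    w_he = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)  # He: Var(W) = 2 / fan_in
    w_naive = rng.normal(size=(width, width)) * 0.01               # naive small scale
    x_he = np.maximum(x_he @ w_he, 0.0)          # linear layer followed by ReLU
    x_naive = np.maximum(x_naive @ w_naive, 0.0)

# He init keeps activations at a stable scale; the naive init collapses toward zero
print(x_he.std(), x_naive.std())
```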
Batch Normalization
Explicitly normalizes to unit variance:
x_normalized = (x - μ) / σ
Then learns new mean and variance:
output = γ × x_normalized + β
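A minimal NumPy sketch of the forward pass (ignoring the running statistics used at inference time; eps is the usual small constant for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then apply learned scale and shift."""
    mean = x.mean(axis=0)  # per-feature mean across the batch
    var = x.var(axis=0)    # per-feature variance across the batch
    x_normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normalized + beta

x = np.random.randn(32, 4) * 10 + 3  # batch of 32 examples, 4 features, far from zero mean / unit variance
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))  # roughly 0 and 1 per feature
```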
Layer Normalization
Similar to batch norm but normalizes across features instead of batch:
- More stable for transformers
- Works with any batch size
Confidence Intervals
Variance relates to uncertainty about estimates:
95% Confidence Interval for Mean:
x̄ ± 1.96 × (s / √n)
Higher variance means wider confidence intervals (more uncertainty).
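As a quick sketch (the data values here are made up, and 1.96 assumes a normal approximation for the sample mean):

```python
import numpy as np

data = np.array([4.8, 5.1, 5.4, 4.9, 5.3, 5.0, 5.2, 4.7])
n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean: s / sqrt(n)

print(f"95% CI: {mean - 1.96 * sem:.3f} to {mean + 1.96 * sem:.3f}")
```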
Summary
- Variance measures spread around the expected value
- Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
- Standard deviation σ = √Var(X) has the same units as data
- Independent variances add: Var(X + Y) = Var(X) + Var(Y)
- Bias-variance tradeoff is fundamental to machine learning
- Proper variance initialization is crucial for deep networks
- Sample variance uses (n-1) for unbiased estimation
Next, we'll see how uncertainty is measured and used in AI predictions.

