Continuous Distributions and the Normal Distribution
While discrete distributions handle countable outcomes, continuous distributions describe quantities that can take any value in a range—like temperatures, heights, or neural network weights. The normal distribution is the most important of these and appears throughout AI and machine learning.
Discrete vs. Continuous
Discrete: Countable outcomes (dice roll: 1, 2, 3, 4, 5, 6)
- P(X = 3) makes sense
Continuous: Infinite outcomes in a range (temperature: 20.0°C, 20.001°C, 20.0001°C, ...)
- P(X = exactly 20.0000...°C) = 0, because a single infinitely precise value carries zero probability
- Instead, we ask: P(19.9 < X < 20.1)
Probability Density Function (PDF)
For continuous distributions, we use a probability density function f(x):
- f(x) ≥ 0 for all x
- Area under the entire curve = 1
- P(a < X < b) = area under curve from a to b
f(x)
 |           ****
 |         **    **
 |       **        **
 |      *            *
 |     *              *
 +--------|--------|----------
          a        b
P(a < X < b) = shaded area under the curve between a and b
The Normal (Gaussian) Distribution
The most important continuous distribution in statistics and AI.
The Formula
f(x) = (1 / (σ√(2π))) × e^(-(x-μ)²/(2σ²))
Parameters:
- μ (mu): Mean (center of the distribution)
- σ (sigma): Standard deviation (spread)
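As a quick check, here is a minimal sketch that translates the formula directly into code and compares it against scipy's built-in implementation (the helper name normal_pdf is just for illustration):

import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    # Direct translation of the formula above
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Compare against scipy's implementation at a few points
x = np.array([-1.0, 0.0, 2.5])
print(normal_pdf(x, mu=0.0, sigma=1.0))
print(stats.norm(loc=0.0, scale=1.0).pdf(x))  # should match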
The Bell Curve
                      μ
                      ↓
                    █████
                 ███     ███
               ██           ██
             ██               ██
          ███                   ███
     ──────────────────────────────────
     μ-3σ  μ-2σ  μ-σ   μ   μ+σ  μ+2σ  μ+3σ
The 68-95-99.7 Rule
- 68% of data falls within 1 standard deviation of the mean
- 95% falls within 2 standard deviations
- 99.7% falls within 3 standard deviations
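These percentages can be verified numerically; a short sketch using scipy's CDF for the standard normal:

from scipy import stats

z = stats.norm(loc=0, scale=1)
for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean
    print(k, z.cdf(k) - z.cdf(-k))
# ≈ 0.683, 0.954, 0.997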
Why Normal Distributions Appear Everywhere
The Central Limit Theorem
One of the most important theorems in statistics:
The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distributions (provided each has finite variance).
This is why:
- Heights are approximately normal (the combined effect of many genetic and environmental factors)
- Measurement errors are approximately normal (the sum of many small independent errors)
- Stock returns over long horizons are approximately normal
- Neural network weight updates are approximately normal (they aggregate many per-example gradients)
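A quick simulation makes the theorem concrete: averaging many draws from a uniform distribution (which looks nothing like a bell curve) produces values that cluster in a normal-looking way. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)

# Average 50 draws from a (decidedly non-normal) uniform distribution,
# and repeat that experiment 100,000 times.
averages = rng.uniform(0, 1, size=(100_000, 50)).mean(axis=1)

# The averages concentrate around 0.5 with the spread predicted by the CLT
print(averages.mean())   # ≈ 0.5
print(averages.std())    # ≈ sqrt(1/12) / sqrt(50) ≈ 0.041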
In AI Specifically
- Weight Initialization: Networks are initialized with normally distributed weights
- Noise Injection: Gaussian noise is added for regularization and privacy
- Latent Spaces: VAEs assume normally distributed latent variables
- Diffusion Models: Work by gradually adding and removing Gaussian noise
- Uncertainty: Prediction uncertainty is often modeled as Gaussian
Standard Normal Distribution
The standard normal has μ = 0 and σ = 1.
Any normal distribution can be converted to standard normal:
Z = (X - μ) / σ
This z-score tells you how many standard deviations from the mean a value is.
Example
If heights are normally distributed with μ = 170 cm, σ = 10 cm:
- A height of 190 cm: z = (190 - 170) / 10 = 2.0
- This is 2 standard deviations above average
- Only about 2.3% of people are taller
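The same calculation in code, a sketch of the height example above using scipy:

from scipy import stats

heights = stats.norm(loc=170, scale=10)
z = (190 - 170) / 10
print(z)                      # 2.0
print(1 - heights.cdf(190))   # ≈ 0.023, fraction taller than 190 cm
print(stats.norm.sf(2.0))     # same answer via the standard normal's survival function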
Other Important Continuous Distributions
Uniform Distribution
All values in a range are equally likely.
Parameters: a (minimum), b (maximum)
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
AI Uses:
- Random initialization
- Random sampling
- Baseline for comparison
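A small sketch with scipy and numpy; note that scipy parameterizes the uniform by loc = a and scale = b − a, and the weight initialization here is purely illustrative, not any particular scheme:

import numpy as np
from scipy import stats

u = stats.uniform(loc=-0.1, scale=0.2)   # uniform on [a, b] = [-0.1, 0.1]
print(u.pdf(0.0))                        # 1 / (b - a) = 5.0
print(u.pdf(0.5))                        # 0 outside the range

# e.g. a uniform weight initialization
weights = np.random.default_rng(0).uniform(-0.1, 0.1, size=(4, 3))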
Exponential Distribution
Models the time until an event occurs. It is memoryless: the remaining wait does not depend on how long you have already waited.
Parameter: λ (rate)
f(x) = λ × e^(-λx) for x ≥ 0
AI Uses:
- Modeling wait times
- Lifetime analysis
- Certain attention mechanisms
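The memoryless property can be checked numerically. A sketch (scipy parameterizes the exponential by scale = 1/λ):

from scipy import stats

lam = 2.0
wait = stats.expon(scale=1 / lam)

# Memorylessness: P(X > s + t | X > s) = P(X > t)
s, t = 1.0, 0.5
print(wait.sf(s + t) / wait.sf(s))  # ≈ wait.sf(t)
print(wait.sf(t))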
Beta Distribution
Describes values between 0 and 1, such as probabilities and rates.
Parameters: α, β (shape parameters)
AI Uses:
- Prior for probabilities in Bayesian models
- Modeling uncertainty about rates
- Thompson sampling for bandits
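As one concrete use, here is a minimal Thompson-sampling sketch for a two-armed bandit; the true click rates and the Beta(1, 1) priors are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5]            # hypothetical, unknown to the agent
alpha = np.ones(2)                 # Beta(1, 1) prior on each arm's rate
beta = np.ones(2)

for _ in range(1000):
    # Sample a plausible rate for each arm from its Beta posterior, pick the best
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    # Update that arm's posterior with the observed success/failure
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))      # posterior means; the better arm gets pulled more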
Gamma Distribution
Generalizes the exponential: with integer shape α = k, a Gamma random variable is the sum of k independent exponentials.
Parameters: α (shape), β (rate)
AI Uses:
- Prior for positive quantities
- Modeling variances
- Neural network hyperparameters
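The "sum of exponentials" view can be checked by simulation; a sketch assuming integer shape α = k:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, rate = 3, 2.0

# Sum k independent Exponential(rate) draws, many times
sums = rng.exponential(scale=1 / rate, size=(100_000, k)).sum(axis=1)

# Compare against the matching Gamma(shape=k, rate) distribution
gamma = stats.gamma(a=k, scale=1 / rate)
print(sums.mean(), gamma.mean())   # both ≈ k / rate = 1.5
print(sums.var(), gamma.var())     # both ≈ k / rate² = 0.75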
Multivariate Normal Distribution
When we have multiple continuous variables, we use the multivariate normal.
Parameters:
- μ: Mean vector
- Σ: Covariance matrix
In 2D, visualized as elliptical contours:
         ╭────────╮
       ╭╱          ╲╮
      ╱              ╲
     (       ●        )   ← Center at μ
      ╲              ╱
       ╰╲          ╱╯
         ╰────────╯
Ellipse orientation shows correlation
Ellipse size shows variance
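Sampling from a 2D multivariate normal with numpy shows how the covariance matrix controls the ellipse; the numbers here are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])         # strong positive correlation tilts the ellipse

samples = rng.multivariate_normal(mu, cov, size=10_000)
print(np.cov(samples.T))             # ≈ cov
print(np.corrcoef(samples.T)[0, 1])  # ≈ 0.8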
In AI
- Word embeddings are often assumed to be normally distributed
- Variational autoencoders use multivariate normal latent distributions
- Gaussian processes model functions as multivariate normal
- Uncertainty in multi-output predictions
Probability from Continuous Distributions
Computing Probabilities
For P(a < X < b), integrate the PDF:
P(a < X < b) = ∫[a to b] f(x) dx
In practice, we use:
- Standard normal tables (historical)
- Software functions (modern)
Python Example
from scipy import stats
# Normal distribution with μ=100, σ=15 (like IQ)
dist = stats.norm(loc=100, scale=15)
# P(X < 115)
p = dist.cdf(115) # ≈ 0.841
# P(85 < X < 115)
p = dist.cdf(115) - dist.cdf(85) # ≈ 0.683
# Value where 95% of data falls below
x = dist.ppf(0.95) # ≈ 124.7
Mixture Distributions
Real data often doesn't follow a single distribution. Mixture models combine multiple distributions:
p(x) = π₁ × N(x; μ₁, σ₁) + π₂ × N(x; μ₂, σ₂) + ...
where π₁, π₂, ... are mixing weights that sum to 1.
Gaussian Mixture Models (GMMs)
Used for:
- Clustering
- Density estimation
- Generative modeling
Two-component mixture:
       ╭──╮            ╭──╮
     ╭╱    ╲╮        ╭╱    ╲╮
    ╱        ╲      ╱        ╲
   ╱          ╲ ╳  ╱          ╲
  ──────────────────────────────────
      Mode 1           Mode 2
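Sampling from a two-component mixture takes two steps: pick a component with probability π, then draw from that component's normal. A sketch with made-up parameters:

import numpy as np

rng = np.random.default_rng(0)
weights = [0.4, 0.6]               # π₁, π₂
means = [-2.0, 3.0]                # μ₁, μ₂
stds = [0.5, 1.0]                  # σ₁, σ₂

# Step 1: choose a component for each sample; step 2: draw from it
components = rng.choice(2, size=10_000, p=weights)
samples = rng.normal(np.take(means, components), np.take(stds, components))

print(samples.mean())              # ≈ 0.4 × (-2) + 0.6 × 3 = 1.0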
Transformations
Box-Cox Transform
Convert non-normal data to approximately normal:
- Often used as preprocessing for ML algorithms
- Makes data more symmetric
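scipy provides a Box-Cox implementation that also estimates the transform's λ parameter from the data; a minimal sketch on right-skewed synthetic data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=5_000)      # strictly positive, right-skewed

transformed, lmbda = stats.boxcox(skewed)            # requires positive data
print(stats.skew(skewed), stats.skew(transformed))   # skewness drops toward 0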
Log-Normal Distribution
When log(X) is normal, X is log-normal:
- Always positive
- Right-skewed
- Common for: incomes, file sizes, word frequencies
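The defining property is easy to check: take the log of log-normal samples and you get approximately normal data. A sketch:

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # parameters of the underlying normal

logs = np.log(x)
print(logs.mean(), logs.std())   # ≈ 0.0 and 1.0, matching the underlying normal
print(x.min() > 0)               # log-normal values are always positive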
Summary
- Continuous distributions describe quantities that can take any value in a range
- The PDF gives probability density; integrate to get probability
- The normal distribution is central due to the Central Limit Theorem
- The 68-95-99.7 rule describes spread around the mean
- Z-scores standardize any normal distribution
- Multivariate normal extends to multiple correlated variables
- Mixture models combine multiple distributions
Next, we'll explore softmax and temperature—how neural networks convert raw outputs into probability distributions.

