Continuous Distributions and the Normal Distribution
While discrete distributions handle countable outcomes, continuous distributions describe quantities that can take any value in a range—like temperatures, heights, or neural network weights. The normal distribution is the most important of these and appears throughout AI and machine learning.
Discrete vs. Continuous
Discrete: Countable outcomes (dice roll: 1, 2, 3, 4, 5, 6)
- P(X = 3) makes sense
Continuous: Infinite outcomes in a range (temperature: 20.0°C, 20.001°C, 20.0001°C, ...)
- P(X = exactly 20.0000...°C) = 0, because a single infinitely precise value carries zero probability
- Instead, we ask: P(19.9 < X < 20.1)
Probability Density Function (PDF)
For continuous distributions, we use a probability density function f(x):
- f(x) ≥ 0 for all x
- Area under the entire curve = 1
- P(a < X < b) = area under curve from a to b
f(x)
 |           ****
 |         **    **
 |       **        **
 |      *            *
 |     *              *
 +--------|--------|----------
          a        b
P(a < X < b) = shaded area under the curve between a and b
The Normal (Gaussian) Distribution
The most important continuous distribution in statistics and AI.
The Formula
f(x) = (1 / (σ√(2π))) × e^(-(x-μ)²/(2σ²))
Parameters:
- μ (mu): Mean (center of the distribution)
- σ (sigma): Standard deviation (spread)
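As a quick check, here is a minimal sketch that translates the formula directly into code and compares it against scipy's built-in implementation (the helper name normal_pdf is just for illustration):

import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    # Direct translation of the formula above
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Compare against scipy's implementation at a few points
x = np.array([-1.0, 0.0, 2.5])
print(normal_pdf(x, mu=0.0, sigma=1.0))
print(stats.norm(loc=0.0, scale=1.0).pdf(x))  # should match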
The Bell Curve
                      μ
                      ↓
                    █████
                 ███     ███
               ██           ██
             ██               ██
          ███                   ███
     ──────────────────────────────────
     μ-3σ  μ-2σ  μ-σ   μ   μ+σ  μ+2σ  μ+3σ
The 68-95-99.7 Rule
- 68% of data falls within 1 standard deviation of the mean
- 95% falls within 2 standard deviations
- 99.7% falls within 3 standard deviations
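These percentages can be verified numerically; a short sketch using scipy's CDF for the standard normal:

from scipy import stats

z = stats.norm(loc=0, scale=1)
for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean
    print(k, z.cdf(k) - z.cdf(-k))
# ≈ 0.683, 0.954, 0.997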
Why Normal Distributions Appear Everywhere
The Central Limit Theorem
One of the most important theorems in statistics:
The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distributions (provided each has finite variance).
This is why:
- Heights are approximately normal (the combined effect of many genetic and environmental factors)
- Measurement errors are approximately normal (the sum of many small independent errors)
- Stock returns over long horizons are approximately normal
- Neural network weight updates are approximately normal (they aggregate many per-example gradients)
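A quick simulation makes the theorem concrete: averaging many draws from a uniform distribution (which looks nothing like a bell curve) produces values that cluster in a normal-looking way. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)

# Average 50 draws from a (decidedly non-normal) uniform distribution,
# and repeat that experiment 100,000 times.
averages = rng.uniform(0, 1, size=(100_000, 50)).mean(axis=1)

# The averages concentrate around 0.5 with the spread predicted by the CLT
print(averages.mean())   # ≈ 0.5
print(averages.std())    # ≈ sqrt(1/12) / sqrt(50) ≈ 0.041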
In AI Specifically
- Weight Initialization: Networks are initialized with normally distributed weights
- Noise Injection: Gaussian noise is added for regularization and privacy
- Latent Spaces: VAEs assume normally distributed latent variables
- Diffusion Models: Work by gradually adding and removing Gaussian noise
- Uncertainty: Prediction uncertainty is often modeled as Gaussian
Standard Normal Distribution
The standard normal has μ = 0 and σ = 1.
Any normal distribution can be converted to standard normal:
Z = (X - μ) / σ
This z-score tells you how many standard deviations from the mean a value is.
Example
If heights are normally distributed with μ = 170 cm, σ = 10 cm:
- A height of 190 cm: z = (190 - 170) / 10 = 2.0
- This is 2 standard deviations above average
- Only about 2.3% of people are taller
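The same calculation in code, a sketch of the height example above using scipy:

from scipy import stats

heights = stats.norm(loc=170, scale=10)
z = (190 - 170) / 10
print(z)                      # 2.0
print(1 - heights.cdf(190))   # ≈ 0.023, fraction taller than 190 cm
print(stats.norm.sf(2.0))     # same answer via the standard normal's survival function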
Other Important Continuous Distributions
Uniform Distribution
All values in a range are equally likely.
Parameters: a (minimum), b (maximum)
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
AI Uses:
- Random initialization
- Random sampling
- Baseline for comparison
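A small sketch with scipy and numpy; note that scipy parameterizes the uniform by loc = a and scale = b − a, and the weight initialization here is purely illustrative, not any particular scheme:

import numpy as np
from scipy import stats

u = stats.uniform(loc=-0.1, scale=0.2)   # uniform on [a, b] = [-0.1, 0.1]
print(u.pdf(0.0))                        # 1 / (b - a) = 5.0
print(u.pdf(0.5))                        # 0 outside the range

# e.g. a uniform weight initialization
weights = np.random.default_rng(0).uniform(-0.1, 0.1, size=(4, 3))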
Exponential Distribution
Models the time until an event occurs. It is memoryless: the remaining wait does not depend on how long you have already waited.
Parameter: λ (rate)
f(x) = λ × e^(-λx) for x ≥ 0
AI Uses:
- Modeling wait times
- Lifetime analysis
- Certain attention mechanisms
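The memoryless property can be checked numerically. A sketch (scipy parameterizes the exponential by scale = 1/λ):

from scipy import stats

lam = 2.0
wait = stats.expon(scale=1 / lam)

# Memorylessness: P(X > s + t | X > s) = P(X > t)
s, t = 1.0, 0.5
print(wait.sf(s + t) / wait.sf(s))  # ≈ wait.sf(t)
print(wait.sf(t))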
Beta Distribution
Describes values between 0 and 1, such as probabilities and rates.
Parameters: α, β (shape parameters)
AI Uses:
- Prior for probabilities in Bayesian models
- Modeling uncertainty about rates
- Thompson sampling for bandits
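As one concrete use, here is a minimal Thompson-sampling sketch for a two-armed bandit; the true click rates and the Beta(1, 1) priors are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5]            # hypothetical, unknown to the agent
alpha = np.ones(2)                 # Beta(1, 1) prior on each arm's rate
beta = np.ones(2)

for _ in range(1000):
    # Sample a plausible rate for each arm from its Beta posterior, pick the best
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    # Update that arm's posterior with the observed success/failure
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))      # posterior means; the better arm gets pulled more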
Gamma Distribution
Generalizes the exponential: with integer shape α = k, a Gamma random variable is the sum of k independent exponentials.
Parameters: α (shape), β (rate)
AI Uses:
- Prior for positive quantities
- Modeling variances
- Neural network hyperparameters
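The "sum of exponentials" view can be checked by simulation; a sketch assuming integer shape α = k:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, rate = 3, 2.0

# Sum k independent Exponential(rate) draws, many times
sums = rng.exponential(scale=1 / rate, size=(100_000, k)).sum(axis=1)

# Compare against the matching Gamma(shape=k, rate) distribution
gamma = stats.gamma(a=k, scale=1 / rate)
print(sums.mean(), gamma.mean())   # both ≈ k / rate = 1.5
print(sums.var(), gamma.var())     # both ≈ k / rate² = 0.75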
Multivariate Normal Distribution
When we have multiple continuous variables, we use the multivariate normal.
Parameters:
- μ: Mean vector
- Σ: Covariance matrix
In 2D, visualized as elliptical contours:
         ╭────────╮
       ╭╱          ╲╮
      ╱              ╲
     (       ●        )   ← Center at μ
      ╲              ╱
       ╰╲          ╱╯
         ╰────────╯
Ellipse orientation shows correlation
Ellipse size shows variance
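Sampling from a 2D multivariate normal with numpy shows how the covariance matrix controls the ellipse; the numbers here are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])         # strong positive correlation tilts the ellipse

samples = rng.multivariate_normal(mu, cov, size=10_000)
print(np.cov(samples.T))             # ≈ cov
print(np.corrcoef(samples.T)[0, 1])  # ≈ 0.8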
In AI
- Word embeddings are often assumed to be normally distributed
- Variational autoencoders use multivariate normal latent distributions
- Gaussian processes model functions as multivariate normal
- Uncertainty in multi-output predictions
Probability from Continuous Distributions
Computing Probabilities
For P(a < X < b), integrate the PDF:
P(a < X < b) = ∫[a to b] f(x) dx
In practice, we use:
- Standard normal tables (historical)
- Software functions (modern)
Python Example
from scipy import stats
# Normal distribution with μ=100, σ=15 (like IQ)
dist = stats.norm(loc=100, scale=15)
# P(X < 115)
p = dist.cdf(115) # ≈ 0.841
# P(85 < X < 115)
p = dist.cdf(115) - dist.cdf(85) # ≈ 0.683
# Value where 95% of data falls below
x = dist.ppf(0.95) # ≈ 124.7
Mixture Distributions
Real data often doesn't follow a single distribution. Mixture models combine multiple distributions:
p(x) = π₁ × N(x; μ₁, σ₁) + π₂ × N(x; μ₂, σ₂) + ...
where π₁, π₂, ... are mixing weights that sum to 1.
Gaussian Mixture Models (GMMs)
Used for:
- Clustering
- Density estimation
- Generative modeling
Two-component mixture:
       ╭──╮            ╭──╮
     ╭╱    ╲╮        ╭╱    ╲╮
    ╱        ╲      ╱        ╲
   ╱          ╲ ╳  ╱          ╲
  ──────────────────────────────────
      Mode 1           Mode 2
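Sampling from a two-component mixture takes two steps: pick a component with probability π, then draw from that component's normal. A sketch with made-up parameters:

import numpy as np

rng = np.random.default_rng(0)
weights = [0.4, 0.6]               # π₁, π₂
means = [-2.0, 3.0]                # μ₁, μ₂
stds = [0.5, 1.0]                  # σ₁, σ₂

# Step 1: choose a component for each sample; step 2: draw from it
components = rng.choice(2, size=10_000, p=weights)
samples = rng.normal(np.take(means, components), np.take(stds, components))

print(samples.mean())              # ≈ 0.4 × (-2) + 0.6 × 3 = 1.0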
Transformations
Box-Cox Transform
Convert non-normal data to approximately normal:
- Often used as preprocessing for ML algorithms
- Makes data more symmetric
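scipy provides a Box-Cox implementation that also estimates the transform's λ parameter from the data; a minimal sketch on right-skewed synthetic data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=5_000)      # strictly positive, right-skewed

transformed, lmbda = stats.boxcox(skewed)            # requires positive data
print(stats.skew(skewed), stats.skew(transformed))   # skewness drops toward 0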
Log-Normal Distribution
When log(X) is normal, X is log-normal:
- Always positive
- Right-skewed
- Common for: incomes, file sizes, word frequencies
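The defining property is easy to check: take the log of log-normal samples and you get approximately normal data. A sketch:

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # parameters of the underlying normal

logs = np.log(x)
print(logs.mean(), logs.std())   # ≈ 0.0 and 1.0, matching the underlying normal
print(x.min() > 0)               # log-normal values are always positive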
Summary
- Continuous distributions describe quantities that can take any value in a range
- The PDF gives probability density; integrate to get probability
- The normal distribution is central due to the Central Limit Theorem
- The 68-95-99.7 rule describes spread around the mean
- Z-scores standardize any normal distribution
- Multivariate normal extends to multiple correlated variables
- Mixture models combine multiple distributions
Next, we'll explore softmax and temperature—how neural networks convert raw outputs into probability distributions.

