Discrete Probability Distributions
A probability distribution describes all possible outcomes and their probabilities. For discrete distributions, we're dealing with countable outcomes—like dice rolls, classification labels, or word choices. These distributions are fundamental to how AI systems represent and compute with uncertainty.
What is a Discrete Distribution?
A discrete probability distribution assigns a probability to each possible outcome in a countable set.
Requirements:
- Each probability is between 0 and 1
- All probabilities sum to 1
P(X = x₁) + P(X = x₂) + ... + P(X = xₙ) = 1
Common Discrete Distributions
Bernoulli Distribution
The simplest distribution: a single yes/no outcome.
Parameters: p (probability of success)
Examples in AI:
- Single binary classification: spam or not
- Single coin flip: heads or tails
- Click or no click
P(X = 1) = p
P(X = 0) = 1 - p
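As a quick illustration, here is a minimal Python sketch of a Bernoulli(p) variable; p = 0.3 and the spam framing are assumed values for the example.

```python
import random

p = 0.3  # assumed probability of "success" (e.g., the email is spam)

# PMF of a Bernoulli(p) random variable
def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

print(bernoulli_pmf(1, p))  # P(X = 1) = 0.3
print(bernoulli_pmf(0, p))  # P(X = 0) = 0.7

# Draw one sample: 1 with probability p, 0 otherwise
sample = 1 if random.random() < p else 0
```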
Binomial Distribution
The number of successes in n independent Bernoulli trials.
Parameters: n (number of trials), p (probability of success per trial)
Formula:
P(X = k) = C(n,k) × p^k × (1-p)^(n-k)
Where C(n,k) is "n choose k" = n! / (k! × (n-k)!)
Examples in AI:
- Number of spam emails in 100 emails
- Number of correct predictions out of 50 test cases
- Number of users who click in a batch
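To make the formula concrete, here is a small sketch that evaluates the Binomial PMF directly; the numbers (n = 100, p = 0.3) are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g., probability of exactly 30 spam emails out of 100 when p = 0.3
print(binomial_pmf(30, 100, 0.3))  # ≈ 0.087

# The probabilities over all possible k sum to 1
print(sum(binomial_pmf(k, 100, 0.3) for k in range(101)))  # ≈ 1.0
```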
Categorical Distribution
Generalization of Bernoulli to multiple categories.
Parameters: p₁, p₂, ..., pₖ where all sum to 1
Examples in AI:
- Classification output: [P(cat)=0.7, P(dog)=0.2, P(bird)=0.1]
- Next word prediction in language models
- Multi-class sentiment: positive/negative/neutral
This is the output of the softmax function in neural networks!
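Here is a minimal sketch of sampling from that categorical output using NumPy (an assumed dependency); the class names and probabilities are the illustrative values above.

```python
import numpy as np

classes = ["cat", "dog", "bird"]
probs = [0.7, 0.2, 0.1]  # a categorical distribution: each in [0, 1], sums to 1

# Draw a single label according to the distribution
label = np.random.choice(classes, p=probs)

# Empirical frequencies approach the probabilities as the sample grows
samples = np.random.choice(classes, size=10_000, p=probs)
print({c: round(float(np.mean(samples == c)), 2) for c in classes})
# ≈ {'cat': 0.7, 'dog': 0.2, 'bird': 0.1}
```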
Multinomial Distribution
The count of outcomes in each category across n independent trials (a generalization of the Binomial).
Parameters: n trials, probabilities p₁, p₂, ..., pₖ
Examples in AI:
- Word counts in a document
- Class distribution in a dataset
- Token frequencies in generated text
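A minimal sketch of one multinomial draw with NumPy (assumed dependency); the toy three-word vocabulary and its probabilities are made up for illustration.

```python
import numpy as np

vocab_probs = [0.5, 0.3, 0.2]  # e.g., P("the"), P("cat"), P("sat") -- sums to 1

# One draw: how many times each word appears over n = 100 independent trials
counts = np.random.multinomial(100, vocab_probs)
print(counts)  # e.g., [52 28 20]; the counts always sum to 100

# Expected counts are n * p_i
print([100 * p for p in vocab_probs])  # [50.0, 30.0, 20.0]
```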
Poisson Distribution
Number of events in a fixed interval when events occur independently at a constant rate.
Parameters: λ (average rate)
Formula:
P(X = k) = (λ^k × e^(-λ)) / k!
Examples in AI:
- Number of requests per second to a server
- Number of anomalies in a time window
- Rare event counts
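Here is a short sketch evaluating the Poisson PMF from the formula; λ = 4 requests per second is an assumed rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0  # assumed average of 4 requests per second
print(poisson_pmf(0, lam))  # ≈ 0.018 (a quiet second)
print(poisson_pmf(4, lam))  # ≈ 0.195 (exactly the average rate)
print(sum(poisson_pmf(k, lam) for k in range(50)))  # ≈ 1.0
```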
Geometric Distribution
Number of trials until first success.
Parameters: p (probability of success)
Formula:
P(X = k) = (1-p)^(k-1) × p
Examples in AI:
- Trials until a model correctly classifies an image
- Attempts until a generated password is accepted
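And a sketch of the Geometric PMF; p = 0.8 is an assumed per-trial success probability.

```python
def geometric_pmf(k, p):
    """P(X = k) = (1-p)^(k-1) * p, for k = 1, 2, 3, ..."""
    return (1 - p)**(k - 1) * p

p = 0.8  # assumed probability the model classifies an image correctly
print(geometric_pmf(1, p))  # 0.8   -> correct on the first try
print(geometric_pmf(2, p))  # 0.16  -> one failure, then success
print(geometric_pmf(3, p))  # 0.032
```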
Visualizing Distributions
Probability Mass Function (PMF)
The PMF shows P(X = x) for each possible value:
Binomial(n=10, p=0.3) PMF (peak at k = 3)

k     P(X=k)
0     0.028  |█
1     0.121  |████
2     0.233  |███████
3     0.267  |████████
4     0.200  |██████
5     0.103  |███
6     0.037  |█
7     0.009  |
8-10  <0.002 |
Cumulative Distribution Function (CDF)
The CDF shows P(X ≤ x)—probability up to and including x:
Binomial(n=10, p=0.3) CDF

k     P(X≤k)
0     0.028  |█
1     0.149  |████
2     0.383  |███████████
3     0.650  |████████████████████
4     0.850  |██████████████████████████
5     0.953  |█████████████████████████████
6     0.989  |██████████████████████████████
7     0.998  |██████████████████████████████
8-10  1.000  |██████████████████████████████
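If SciPy is available (an assumption here), the numbers behind both charts can be reproduced in a few lines:

```python
from scipy.stats import binom

n, p = 10, 0.3
for k in range(n + 1):
    print(k, round(binom.pmf(k, n, p), 3), round(binom.cdf(k, n, p), 3))
# The PMF peaks at k = 3 (≈ 0.267); the CDF rises from ≈ 0.028 at k = 0 to 1.0 at k = 10
```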
Distributions in Neural Network Outputs
Classification Layer
A neural network's classification layer produces a categorical distribution:
Input → Hidden Layers → Logits [2.1, 0.5, -1.2] → Softmax → [0.81, 0.16, 0.03]
                                                              cat   dog   bird
The output [0.81, 0.16, 0.03] is a valid categorical distribution:
- All values between 0 and 1 ✓
- Sum = 0.81 + 0.16 + 0.03 = 1.00 ✓
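A minimal NumPy sketch of the softmax step above, confirming the probabilities:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is mathematically unchanged
    z = np.asarray(logits, dtype=float) - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax([2.1, 0.5, -1.2])
print(probs.round(2))  # ≈ [0.81 0.16 0.03]
print(probs.sum())     # 1.0 -- a valid categorical distribution
```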
Language Model Next-Token Distribution
When a language model predicts the next word:
Input: "The cat sat on the"
Output distribution over vocabulary:
P("mat") = 0.35
P("floor") = 0.25
P("chair") = 0.15
P("bed") = 0.10
P("couch") = 0.08
... (all 50,000 tokens have some probability)
Sum of all = 1.0
Sampling from Distributions
To generate outputs, AI systems sample from distributions:
Argmax (Greedy)
Always pick the highest probability:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
output = "mat" # Always the same
Random Sampling
Pick according to probabilities:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
output = random_choice(distribution) # Varies each time
Top-k Sampling
Only consider top k options:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
top_3 = [0.35, 0.25, 0.15]
top_3_normalized = [p / sum(top_3) for p in top_3]  # renormalize to sum to 1
output = random_choice(top_3_normalized)
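The three strategies can be put side by side in a short NumPy sketch; the token list and probabilities are the illustrative values from the language-model example.

```python
import numpy as np

tokens = ["mat", "floor", "chair", "bed", "couch"]
probs = np.array([0.35, 0.25, 0.15, 0.10, 0.08])
probs = probs / probs.sum()  # renormalize the truncated list so it sums to 1

# Argmax (greedy): always the single most probable token
greedy = tokens[int(np.argmax(probs))]       # "mat"

# Random sampling: draw according to the probabilities
sampled = np.random.choice(tokens, p=probs)  # varies run to run

# Top-k sampling (k = 3): keep the k most probable tokens, renormalize, then sample
k = 3
top_idx = np.argsort(probs)[::-1][:k]
top_probs = probs[top_idx] / probs[top_idx].sum()
top_k_sample = np.random.choice(np.array(tokens)[top_idx], p=top_probs)

print(greedy, sampled, top_k_sample)
```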
Parameters and Statistics
Mean (Expected Value)
The average outcome if we repeated the experiment many times:
| Distribution | Mean |
|---|---|
| Bernoulli(p) | p |
| Binomial(n, p) | n × p |
| Poisson(λ) | λ |
| Geometric(p) | 1/p |
Variance
How spread out the outcomes are:
| Distribution | Variance |
|---|---|
| Bernoulli(p) | p(1-p) |
| Binomial(n, p) | n × p × (1-p) |
| Poisson(λ) | λ |
| Geometric(p) | (1-p)/p² |
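A quick empirical check of the Binomial row, assuming NumPy is available:

```python
import numpy as np

n, p = 10, 0.3
samples = np.random.binomial(n, p, size=100_000)

print(samples.mean())  # ≈ n * p = 3.0
print(samples.var())   # ≈ n * p * (1 - p) = 2.1
```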
Using Distributions in Practice
Training: Learning Distributions
During training, we learn the parameters of distributions:
- What's P(spam | this email)?
- What's P(next word | context)?
Cross-entropy loss measures how well our learned distribution matches the true distribution.
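As a sketch of what that means in code, here is cross-entropy between a one-hot "true" label and two hypothetical model outputs (all values are illustrative):

```python
import numpy as np

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """H(true, predicted) = -sum_x true(x) * log(predicted(x))"""
    predicted = np.clip(np.asarray(predicted_dist, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(true_dist) * np.log(predicted))

true_label = [1.0, 0.0, 0.0]   # one-hot: the example really is class 0
good_model = [0.9, 0.05, 0.05]
bad_model  = [0.3, 0.4, 0.3]

print(cross_entropy(true_label, good_model))  # ≈ 0.105 (low loss)
print(cross_entropy(true_label, bad_model))   # ≈ 1.204 (high loss)
```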
Inference: Using Distributions
During inference, we use the learned distribution:
- Sample to generate text
- Take argmax to classify
- Use probabilities to rank options
Comparing Distributions
Entropy
Measures how "spread out" or uncertain a distribution is:
H(p) = -∑ p(x) × log p(x)
- Low entropy: Peaked distribution (confident)
- High entropy: Flat distribution (uncertain)
Examples:
- Confident classifier: [0.98, 0.01, 0.01] → Low entropy
- Uncertain classifier: [0.34, 0.33, 0.33] → High entropy
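A minimal sketch computing entropy (in nats) for the two classifiers above, assuming NumPy:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(p) = -sum_x p(x) * log p(x)"""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(p))

print(entropy([0.98, 0.01, 0.01]))  # ≈ 0.112 -> peaked, low entropy
print(entropy([0.34, 0.33, 0.33]))  # ≈ 1.099 -> flat, near the maximum log(3) ≈ 1.099
```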
KL Divergence
Measures how different two distributions are:
KL(P || Q) = ∑ P(x) × log(P(x) / Q(x))
Used in:
- Training (minimize KL between model output and true label)
- Evaluating model calibration
- Variational inference
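And a matching sketch for KL divergence (same assumptions; the distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))"""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(p / q))

target  = [1.0, 0.0, 0.0]    # one-hot "true" label
predict = [0.9, 0.05, 0.05]  # model's categorical output

print(kl_divergence(target, predict))   # ≈ 0.105 (for a one-hot target, this equals the cross-entropy)
print(kl_divergence(predict, predict))  # 0.0 -- identical distributions
```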
Summary
- Discrete distributions assign probabilities to countable outcomes
- Key distributions: Bernoulli, Binomial, Categorical, Multinomial, Poisson, Geometric
- Neural network classification outputs are categorical distributions
- Sampling strategies control how we select from distributions
- Entropy measures uncertainty; KL divergence measures difference
Next, we'll explore continuous distributions and the famous normal distribution.

