Discrete Probability Distributions
A probability distribution describes all possible outcomes and their probabilities. For discrete distributions, we're dealing with countable outcomes—like dice rolls, classification labels, or word choices. These distributions are fundamental to how AI systems represent and compute with uncertainty.
What is a Discrete Distribution?
A discrete probability distribution assigns a probability to each possible outcome in a countable set.
Requirements:
- Each probability is between 0 and 1
- All probabilities sum to 1
P(X = x₁) + P(X = x₂) + ... + P(X = xₙ) = 1
Common Discrete Distributions
Bernoulli Distribution
The simplest distribution: a single yes/no outcome.
Parameters: p (probability of success)
Examples in AI:
- Single binary classification: spam or not
- Single coin flip: heads or tails
- Click or no click
P(X = 1) = p
P(X = 0) = 1 - p
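As a quick illustration, here is a minimal Python sketch of a Bernoulli(p) variable; p = 0.3 and the spam framing are assumed values for the example.

```python
import random

p = 0.3  # assumed probability of "success" (e.g., the email is spam)

# PMF of a Bernoulli(p) random variable
def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

print(bernoulli_pmf(1, p))  # P(X = 1) = 0.3
print(bernoulli_pmf(0, p))  # P(X = 0) = 0.7

# Draw one sample: 1 with probability p, 0 otherwise
sample = 1 if random.random() < p else 0
```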
Binomial Distribution
The number of successes in n independent Bernoulli trials.
Parameters: n (number of trials), p (probability of success per trial)
Formula:
P(X = k) = C(n,k) × p^k × (1-p)^(n-k)
Where C(n,k) is "n choose k" = n! / (k! × (n-k)!)
Examples in AI:
- Number of spam emails in 100 emails
- Number of correct predictions out of 50 test cases
- Number of users who click in a batch
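To make the formula concrete, here is a small sketch that evaluates the Binomial PMF directly; the numbers (n = 100, p = 0.3) are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g., probability of exactly 30 spam emails out of 100 when p = 0.3
print(binomial_pmf(30, 100, 0.3))  # ≈ 0.087

# The probabilities over all possible k sum to 1
print(sum(binomial_pmf(k, 100, 0.3) for k in range(101)))  # ≈ 1.0
```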
Categorical Distribution
Generalization of Bernoulli to multiple categories.
Parameters: p₁, p₂, ..., pₖ where all sum to 1
Examples in AI:
- Classification output: [P(cat)=0.7, P(dog)=0.2, P(bird)=0.1]
- Next word prediction in language models
- Multi-class sentiment: positive/negative/neutral
This is the output of the softmax function in neural networks!
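Here is a minimal sketch of sampling from that categorical output using NumPy (an assumed dependency); the class names and probabilities are the illustrative values above.

```python
import numpy as np

classes = ["cat", "dog", "bird"]
probs = [0.7, 0.2, 0.1]  # a categorical distribution: each in [0, 1], sums to 1

# Draw a single label according to the distribution
label = np.random.choice(classes, p=probs)

# Empirical frequencies approach the probabilities as the sample grows
samples = np.random.choice(classes, size=10_000, p=probs)
print({c: round(float(np.mean(samples == c)), 2) for c in classes})
# ≈ {'cat': 0.7, 'dog': 0.2, 'bird': 0.1}
```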
Multinomial Distribution
The count of outcomes in each category across n independent trials (a generalization of the Binomial).
Parameters: n trials, probabilities p₁, p₂, ..., pₖ
Examples in AI:
- Word counts in a document
- Class distribution in a dataset
- Token frequencies in generated text
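A minimal sketch of one multinomial draw with NumPy (assumed dependency); the toy three-word vocabulary and its probabilities are made up for illustration.

```python
import numpy as np

vocab_probs = [0.5, 0.3, 0.2]  # e.g., P("the"), P("cat"), P("sat") -- sums to 1

# One draw: how many times each word appears over n = 100 independent trials
counts = np.random.multinomial(100, vocab_probs)
print(counts)  # e.g., [52 28 20]; the counts always sum to 100

# Expected counts are n * p_i
print([100 * p for p in vocab_probs])  # [50.0, 30.0, 20.0]
```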
Poisson Distribution
Number of events in a fixed interval when events occur independently at a constant rate.
Parameters: λ (average rate)
Formula:
P(X = k) = (λ^k × e^(-λ)) / k!
Examples in AI:
- Number of requests per second to a server
- Number of anomalies in a time window
- Rare event counts
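Here is a short sketch evaluating the Poisson PMF from the formula; λ = 4 requests per second is an assumed rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0  # assumed average of 4 requests per second
print(poisson_pmf(0, lam))  # ≈ 0.018 (a quiet second)
print(poisson_pmf(4, lam))  # ≈ 0.195 (exactly the average rate)
print(sum(poisson_pmf(k, lam) for k in range(50)))  # ≈ 1.0
```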
Geometric Distribution
Number of trials until first success.
Parameters: p (probability of success)
Formula:
P(X = k) = (1-p)^(k-1) × p
Examples in AI:
- Trials until a model correctly classifies an image
- Attempts until a generated password is accepted
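And a sketch of the Geometric PMF; p = 0.8 is an assumed per-trial success probability.

```python
def geometric_pmf(k, p):
    """P(X = k) = (1-p)^(k-1) * p, for k = 1, 2, 3, ..."""
    return (1 - p)**(k - 1) * p

p = 0.8  # assumed probability the model classifies an image correctly
print(geometric_pmf(1, p))  # 0.8   -> correct on the first try
print(geometric_pmf(2, p))  # 0.16  -> one failure, then success
print(geometric_pmf(3, p))  # 0.032
```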
Visualizing Distributions
Probability Mass Function (PMF)
The PMF shows P(X = x) for each possible value:
Binomial(n=10, p=0.3) PMF (peak at k = 3)

k     P(X=k)
0     0.028  |█
1     0.121  |████
2     0.233  |███████
3     0.267  |████████
4     0.200  |██████
5     0.103  |███
6     0.037  |█
7     0.009  |
8-10  <0.002 |
Cumulative Distribution Function (CDF)
The CDF shows P(X ≤ x)—probability up to and including x:
Binomial(n=10, p=0.3) CDF

k     P(X≤k)
0     0.028  |█
1     0.149  |████
2     0.383  |███████████
3     0.650  |████████████████████
4     0.850  |██████████████████████████
5     0.953  |█████████████████████████████
6     0.989  |██████████████████████████████
7     0.998  |██████████████████████████████
8-10  1.000  |██████████████████████████████
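If SciPy is available (an assumption here), the numbers behind both charts can be reproduced in a few lines:

```python
from scipy.stats import binom

n, p = 10, 0.3
for k in range(n + 1):
    print(k, round(binom.pmf(k, n, p), 3), round(binom.cdf(k, n, p), 3))
# The PMF peaks at k = 3 (≈ 0.267); the CDF rises from ≈ 0.028 at k = 0 to 1.0 at k = 10
```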
Distributions in Neural Network Outputs
Classification Layer
A neural network's classification layer produces a categorical distribution:
Input → Hidden Layers → Logits [2.1, 0.5, -1.2] → Softmax → [0.81, 0.16, 0.03]
                                                              cat   dog   bird
The output [0.81, 0.16, 0.03] is a valid categorical distribution:
- All values between 0 and 1 ✓
- Sum = 0.81 + 0.16 + 0.03 = 1.00 ✓
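A minimal NumPy sketch of the softmax step above, confirming the probabilities:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is mathematically unchanged
    z = np.asarray(logits, dtype=float) - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax([2.1, 0.5, -1.2])
print(probs.round(2))  # ≈ [0.81 0.16 0.03]
print(probs.sum())     # 1.0 -- a valid categorical distribution
```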
Language Model Next-Token Distribution
When a language model predicts the next word:
Input: "The cat sat on the"
Output distribution over vocabulary:
P("mat") = 0.35
P("floor") = 0.25
P("chair") = 0.15
P("bed") = 0.10
P("couch") = 0.08
... (all 50,000 tokens have some probability)
Sum of all = 1.0
Sampling from Distributions
To generate outputs, AI systems sample from distributions:
Argmax (Greedy)
Always pick the highest probability:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
output = "mat" # Always the same
Random Sampling
Pick according to probabilities:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
output = random_choice(distribution) # Varies each time
Top-k Sampling
Only consider top k options:
distribution = [0.35, 0.25, 0.15, 0.10, 0.08, ...]
top_3 = [0.35, 0.25, 0.15]
top_3_normalized = [p / sum(top_3) for p in top_3]  # renormalize to sum to 1
output = random_choice(top_3_normalized)
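The three strategies can be put side by side in a short NumPy sketch; the token list and probabilities are the illustrative values from the language-model example.

```python
import numpy as np

tokens = ["mat", "floor", "chair", "bed", "couch"]
probs = np.array([0.35, 0.25, 0.15, 0.10, 0.08])
probs = probs / probs.sum()  # renormalize the truncated list so it sums to 1

# Argmax (greedy): always the single most probable token
greedy = tokens[int(np.argmax(probs))]       # "mat"

# Random sampling: draw according to the probabilities
sampled = np.random.choice(tokens, p=probs)  # varies run to run

# Top-k sampling (k = 3): keep the k most probable tokens, renormalize, then sample
k = 3
top_idx = np.argsort(probs)[::-1][:k]
top_probs = probs[top_idx] / probs[top_idx].sum()
top_k_sample = np.random.choice(np.array(tokens)[top_idx], p=top_probs)

print(greedy, sampled, top_k_sample)
```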
Parameters and Statistics
Mean (Expected Value)
The average outcome if we repeated the experiment many times:
| Distribution | Mean |
|---|---|
| Bernoulli(p) | p |
| Binomial(n, p) | n × p |
| Poisson(λ) | λ |
| Geometric(p) | 1/p |
Variance
How spread out the outcomes are:
| Distribution | Variance |
|---|---|
| Bernoulli(p) | p(1-p) |
| Binomial(n, p) | n × p × (1-p) |
| Poisson(λ) | λ |
| Geometric(p) | (1-p)/p² |
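A quick empirical check of the Binomial row, assuming NumPy is available:

```python
import numpy as np

n, p = 10, 0.3
samples = np.random.binomial(n, p, size=100_000)

print(samples.mean())  # ≈ n * p = 3.0
print(samples.var())   # ≈ n * p * (1 - p) = 2.1
```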
Using Distributions in Practice
Training: Learning Distributions
During training, we learn the parameters of distributions:
- What's P(spam | this email)?
- What's P(next word | context)?
Cross-entropy loss measures how well our learned distribution matches the true distribution.
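As a sketch of what that means in code, here is cross-entropy between a one-hot "true" label and two hypothetical model outputs (all values are illustrative):

```python
import numpy as np

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """H(true, predicted) = -sum_x true(x) * log(predicted(x))"""
    predicted = np.clip(np.asarray(predicted_dist, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(true_dist) * np.log(predicted))

true_label = [1.0, 0.0, 0.0]   # one-hot: the example really is class 0
good_model = [0.9, 0.05, 0.05]
bad_model  = [0.3, 0.4, 0.3]

print(cross_entropy(true_label, good_model))  # ≈ 0.105 (low loss)
print(cross_entropy(true_label, bad_model))   # ≈ 1.204 (high loss)
```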
Inference: Using Distributions
During inference, we use the learned distribution:
- Sample to generate text
- Take argmax to classify
- Use probabilities to rank options
Comparing Distributions
Entropy
Measures how "spread out" or uncertain a distribution is:
H(p) = -∑ p(x) × log p(x)
- Low entropy: Peaked distribution (confident)
- High entropy: Flat distribution (uncertain)
Examples:
- Confident classifier: [0.98, 0.01, 0.01] → Low entropy
- Uncertain classifier: [0.34, 0.33, 0.33] → High entropy
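A minimal sketch computing entropy (in nats) for the two classifiers above, assuming NumPy:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(p) = -sum_x p(x) * log p(x)"""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(p))

print(entropy([0.98, 0.01, 0.01]))  # ≈ 0.112 -> peaked, low entropy
print(entropy([0.34, 0.33, 0.33]))  # ≈ 1.099 -> flat, near the maximum log(3) ≈ 1.099
```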
KL Divergence
Measures how different two distributions are:
KL(P || Q) = ∑ P(x) × log(P(x) / Q(x))
Used in:
- Training (minimize KL between model output and true label)
- Evaluating model calibration
- Variational inference
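And a matching sketch for KL divergence (same assumptions; the distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))"""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(p / q))

target  = [1.0, 0.0, 0.0]    # one-hot "true" label
predict = [0.9, 0.05, 0.05]  # model's categorical output

print(kl_divergence(target, predict))   # ≈ 0.105 (for a one-hot target, this equals the cross-entropy)
print(kl_divergence(predict, predict))  # 0.0 -- identical distributions
```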
Summary
- Discrete distributions assign probabilities to countable outcomes
- Key distributions: Bernoulli, Binomial, Categorical, Multinomial, Poisson, Geometric
- Neural network classification outputs are categorical distributions
- Sampling strategies control how we select from distributions
- Entropy measures uncertainty; KL divergence measures difference
Next, we'll explore continuous distributions and the famous normal distribution.

