Measuring Uncertainty in AI Predictions
AI systems don't just make predictions—they should also tell us how confident they are. Understanding and quantifying uncertainty is crucial for building trustworthy AI systems that know when to ask for help or defer to humans.
Why Uncertainty Matters
Consider an AI medical diagnosis system:
- High confidence, correct: "This is definitely pneumonia" (and it is) → Good
- High confidence, wrong: "This is definitely not cancer" (but it is) → Dangerous
- Low confidence: "I'm not sure—please consult a specialist" → Safe
A well-calibrated system knows what it doesn't know.
Types of Uncertainty
Aleatoric Uncertainty (Data Uncertainty)
Inherent randomness that can't be reduced with more data.
Examples:
- Rolling a die—even with perfect knowledge, it's random
- Quantum mechanics—fundamentally probabilistic
- Future stock prices—inherently unpredictable
Aleatoric uncertainty is irreducible.
Epistemic Uncertainty (Model Uncertainty)
Uncertainty due to lack of knowledge—reducible with more data.
Examples:
- Is this a new species of bird? (Not in training data)
- What's the weather in an unexplored region? (No historical data)
- How will this new drug interact? (Not yet studied)
Epistemic uncertainty can be reduced by gathering more data or improving the model.
Distinguishing Them
| Aspect | Aleatoric | Epistemic |
|---|---|---|
| Source | Data noise | Model limitations |
| Reducible? | No | Yes (with more data) |
| Example | Coin flip outcome | Coin's bias |
| AI strategy | Model it | Quantify and communicate |
Measuring Prediction Confidence
Classification Entropy
Entropy measures how "spread out" a probability distribution is:
H(p) = -Σ pᵢ × log(pᵢ)
Low entropy: Confident (peaked distribution)
[0.95, 0.03, 0.02] → H ≈ 0.23
High entropy: Uncertain (flat distribution)
[0.34, 0.33, 0.33] → H ≈ 1.10
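A minimal sketch of this calculation with NumPy (natural log and a small epsilon to avoid log(0) are choices made here, not requirements):

import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

print(entropy([0.95, 0.03, 0.02]))  # ~0.23 (confident)
print(entropy([0.34, 0.33, 0.33]))  # ~1.10 (uncertain)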
Maximum Probability
Simplest confidence measure:
Confidence = max(probabilities)
[0.95, 0.03, 0.02] → Confidence = 0.95
[0.40, 0.35, 0.25] → Confidence = 0.40
Margin
Difference between top two probabilities:
Margin = p_top1 - p_top2
[0.95, 0.03, 0.02] → Margin = 0.92 (very confident)
[0.40, 0.35, 0.25] → Margin = 0.05 (uncertain)
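The same two measures in code, as a small sketch (the function names are illustrative):

import numpy as np

def max_prob_confidence(probs):
    """Confidence as the highest class probability."""
    return float(np.max(probs))

def margin(probs):
    """Gap between the top-1 and top-2 class probabilities."""
    top2 = np.sort(probs)[::-1][:2]
    return float(top2[0] - top2[1])

probs = [0.40, 0.35, 0.25]
print(max_prob_confidence(probs))  # 0.40
print(margin(probs))               # 0.05 → uncertain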
Calibration
A model is well-calibrated if its confidence matches its accuracy:
- When it says "90% confident," it should be correct 90% of the time
- When it says "50% confident," it should be correct 50% of the time
Calibration Plot
A calibration plot shows accuracy (%) on the y-axis against confidence (%) on the x-axis, with predictions grouped into confidence bins.
- Perfect calibration: Points on the diagonal
- Overconfident: Points below the diagonal
- Underconfident: Points above the diagonal
Expected Calibration Error (ECE)
Measures how far from perfectly calibrated:
ECE = Σ (n_bin / n_total) × |accuracy_bin - confidence_bin|
Lower ECE = better calibrated.
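A sketch of ECE using equal-width confidence bins (10 bins and max-probability confidence are assumptions; other binning schemes exist):

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Example: per-prediction confidence and whether each prediction was correct
print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))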
Temperature Scaling for Calibration
A simple post-hoc calibration method:
- Hold out a validation set
- Learn a temperature T that minimizes calibration error
- Scale logits by 1/T before softmax
calibrated_probs = softmax(logits / T)
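A minimal sketch of fitting T on the validation set by grid search, minimizing negative log-likelihood (the grid and the NLL objective are assumptions; gradient-based fitting is also common):

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Return the temperature that minimizes NLL on held-out data."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# Usage: calibrated_probs = softmax(test_logits, T=fit_temperature(val_logits, val_labels))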
Uncertainty in Regression
For continuous predictions, uncertainty is usually expressed as an interval around the point estimate:
Prediction Intervals
Instead of predicting just μ, predict μ ± interval:
Prediction: 42.5 ± 3.2 (95% confidence)
This means: "I'm 95% sure the true value is between 39.3 and 45.7"
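Assuming an (approximately) Gaussian predictive distribution, a 95% interval is about μ ± 1.96σ. A quick sketch matching the example above, where σ ≈ 1.63 so that 1.96σ ≈ 3.2:

mu, sigma = 42.5, 1.63  # predicted mean and standard deviation (illustrative)
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
print(f"95% interval: [{lower:.1f}, {upper:.1f}]")  # ~[39.3, 45.7]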
Predicting Distributions
Neural networks can output distribution parameters:
Input → Network → [μ, σ]
The output represents a normal distribution N(μ, σ²).
Loss function: Negative log-likelihood
Loss = log(σ) + (y - μ)² / (2σ²)
The network learns to output larger σ when uncertain.
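A sketch of this loss as a plain function (additive constants dropped; having the network predict log σ for numerical stability is an assumption, though a common one):

import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), up to a constant."""
    sigma = np.exp(log_sigma)
    return log_sigma + (y - mu) ** 2 / (2 * sigma ** 2)

# Small error, small sigma -> low loss
print(gaussian_nll(y=5.0, mu=5.2, log_sigma=np.log(0.5)))
# Large error, same small sigma (overconfident) -> much higher loss
print(gaussian_nll(y=5.0, mu=8.0, log_sigma=np.log(0.5)))

Increasing σ on the second example would lower the loss, which is exactly the incentive that teaches the network to report more uncertainty when its point prediction is likely to be off.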
Bayesian Approaches
Bayesian Neural Networks
Instead of fixed weights, maintain distributions over weights:
P(weights | data)
Prediction uncertainty comes from:
- Sampling multiple weight configurations
- Making predictions with each
- Observing the variance in predictions
Monte Carlo Dropout
A practical approximation:
- Keep dropout enabled at inference time
- Run the same input multiple times
- Observe variance in outputs
import numpy as np

predictions = []
for _ in range(100):
    pred = model(x, training=True)  # call the model with dropout still active (Keras-style)
    predictions.append(np.asarray(pred))

mean_prediction = np.mean(predictions, axis=0)  # averaged prediction
uncertainty = np.std(predictions, axis=0)       # spread across stochastic forward passes
Deep Ensembles
Train multiple models and aggregate:
import numpy as np

# Train several independently initialized models and aggregate their predictions
models = [train_model() for _ in range(5)]  # train_model() is assumed to return a fitted model
predictions = [np.asarray(m.predict(x)) for m in models]

mean = np.mean(predictions, axis=0)         # ensemble prediction
uncertainty = np.std(predictions, axis=0)   # disagreement between ensemble members
High variance across models = high epistemic uncertainty.
Out-of-Distribution Detection
Models should recognize when inputs are unlike training data:
Softmax Confidence Issues
Standard softmax can be overconfident on OOD inputs:
- A cat classifier might output [0.85, 0.15] for a car image
- The model has never seen a car, but it's forced to choose
Solutions
Outlier detection:
- Train on "normal" data
- Flag inputs that look different
Abstaining classifiers:
- Add a "don't know" class
- Train on OOD examples
Temperature scaling for OOD:
- OOD inputs often have different logit distributions
- Temperature adjustment can help detect them
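A minimal sketch of the common max-softmax baseline with a temperature (the temperature and threshold values here are placeholders to be tuned on validation data):

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_out_of_distribution(logits, T=2.0, threshold=0.7):
    """Flag an input whose temperature-scaled max softmax score is low."""
    score = softmax(np.asarray(logits, dtype=float), T).max()
    return score < threshold

print(is_out_of_distribution([4.0, 0.1, -1.0]))  # peaked logits -> likely in-distribution (False)
print(is_out_of_distribution([0.6, 0.5, 0.4]))   # flat logits -> flagged as OOD (True)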
Uncertainty in Language Models
Token-Level Uncertainty
LLMs have uncertainty about each token:
"The capital of Burkina Faso is ___"
P("Ouagadougou") = 0.65
P("Ouaga") = 0.10
P("Unknown") = 0.08
...
The model is fairly confident but not certain.
Sequence-Level Uncertainty
For full responses, uncertainty compounds:
P(response) = P(token1) × P(token2|token1) × ...
Longer responses generally have lower overall probability.
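In practice this product is computed as a sum of log-probabilities to avoid numerical underflow; a short sketch with illustrative per-token probabilities:

import numpy as np

token_probs = [0.65, 0.9, 0.8, 0.95]            # illustrative probabilities of each generated token

log_prob = float(np.sum(np.log(token_probs)))   # log P(response)
seq_prob = float(np.exp(log_prob))              # P(response) itself
avg_log_prob = log_prob / len(token_probs)      # length-normalized score

print(seq_prob, avg_log_prob)

Because the raw probability shrinks with every extra token, the length-normalized average log-probability is often used when comparing responses of different lengths.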
Verbalized Uncertainty
Modern LLMs can express uncertainty in words:
- "I'm confident that..."
- "I'm not entirely sure, but..."
- "I don't have enough information to..."
Ideally these verbal expressions are themselves calibrated: the model's stated confidence should track how often it is actually correct.
When to Defer to Humans
Use uncertainty to decide when AI should step back:
def make_decision(x):
    # predict_with_uncertainty is assumed to return a prediction plus an uncertainty score
    prediction, uncertainty = model.predict_with_uncertainty(x)
    if uncertainty > threshold:  # threshold tuned on validation data
        return "Please consult a human expert"
    return prediction
This is called selective prediction or learning to defer.
Summary
- Aleatoric uncertainty: Inherent data randomness (irreducible)
- Epistemic uncertainty: Model's lack of knowledge (reducible)
- Entropy, max probability, and margin measure classification confidence
- Calibration ensures confidence matches accuracy
- Temperature scaling is a simple calibration technique
- Bayesian methods and ensembles quantify epistemic uncertainty
- Monte Carlo dropout is a practical approximation
- LLMs can verbalize uncertainty
- High uncertainty should trigger deferral to humans
This completes Module 4! Next, we'll explore Maximum Likelihood Estimation—how AI learns from data.

