Measuring Uncertainty in AI Predictions
AI systems don't just make predictions—they should also tell us how confident they are. Understanding and quantifying uncertainty is crucial for building trustworthy AI systems that know when to ask for help or defer to humans.
Why Uncertainty Matters
Consider an AI medical diagnosis system:
- High confidence, correct: "This is definitely pneumonia" (and it is) → Good
- High confidence, wrong: "This is definitely not cancer" (but it is) → Dangerous
- Low confidence: "I'm not sure—please consult a specialist" → Safe
A well-calibrated system knows what it doesn't know.
Types of Uncertainty
Aleatoric Uncertainty (Data Uncertainty)
Inherent randomness that can't be reduced with more data.
Examples:
- Rolling a die—even with perfect knowledge, it's random
- Quantum mechanics—fundamentally probabilistic
- Future stock prices—inherently unpredictable
Aleatoric uncertainty is irreducible.
Epistemic Uncertainty (Model Uncertainty)
Uncertainty due to lack of knowledge—reducible with more data.
Examples:
- Is this a new species of bird? (Not in training data)
- What's the weather in an unexplored region? (No historical data)
- How will this new drug interact? (Not yet studied)
Epistemic uncertainty can be reduced by gathering more data or improving the model.
Distinguishing Them
| Aspect | Aleatoric | Epistemic |
|---|---|---|
| Source | Data noise | Model limitations |
| Reducible? | No | Yes (with more data) |
| Example | Coin flip outcome | Coin's bias |
| AI strategy | Model it | Quantify and communicate |
Measuring Prediction Confidence
Classification Entropy
Entropy measures how "spread out" a probability distribution is:
H(p) = -Σ pᵢ × log(pᵢ)
Low entropy: Confident (peaked distribution)
[0.95, 0.03, 0.02] → H ≈ 0.23
High entropy: Uncertain (flat distribution)
[0.34, 0.33, 0.33] → H ≈ 1.10
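A minimal sketch of this calculation with NumPy (natural log and a small epsilon to avoid log(0) are choices made here, not requirements):

import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

print(entropy([0.95, 0.03, 0.02]))  # ~0.23 (confident)
print(entropy([0.34, 0.33, 0.33]))  # ~1.10 (uncertain)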
Maximum Probability
Simplest confidence measure:
Confidence = max(probabilities)
[0.95, 0.03, 0.02] → Confidence = 0.95
[0.40, 0.35, 0.25] → Confidence = 0.40
Margin
Difference between top two probabilities:
Margin = p_top1 - p_top2
[0.95, 0.03, 0.02] → Margin = 0.92 (very confident)
[0.40, 0.35, 0.25] → Margin = 0.05 (uncertain)
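The same two measures in code, as a small sketch (the function names are illustrative):

import numpy as np

def max_prob_confidence(probs):
    """Confidence as the highest class probability."""
    return float(np.max(probs))

def margin(probs):
    """Gap between the top-1 and top-2 class probabilities."""
    top2 = np.sort(probs)[::-1][:2]
    return float(top2[0] - top2[1])

probs = [0.40, 0.35, 0.25]
print(max_prob_confidence(probs))  # 0.40
print(margin(probs))               # 0.05 → uncertain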
Calibration
A model is well-calibrated if its confidence matches its accuracy:
- When it says "90% confident," it should be correct 90% of the time
- When it says "50% confident," it should be correct 50% of the time
Calibration Plot
A calibration plot shows accuracy (%) on the y-axis against confidence (%) on the x-axis, with predictions grouped into confidence bins.
- Perfect calibration: Points on the diagonal
- Overconfident: Points below the diagonal
- Underconfident: Points above the diagonal
Expected Calibration Error (ECE)
Measures how far from perfectly calibrated:
ECE = Σ (n_bin / n_total) × |accuracy_bin - confidence_bin|
Lower ECE = better calibrated.
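A sketch of ECE using equal-width confidence bins (10 bins and max-probability confidence are assumptions; other binning schemes exist):

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Example: per-prediction confidence and whether each prediction was correct
print(expected_calibration_error([0.9, 0.8, 0.6], [1, 1, 0]))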
Temperature Scaling for Calibration
A simple post-hoc calibration method:
- Hold out a validation set
- Learn a temperature T that minimizes calibration error
- Scale logits by 1/T before softmax
calibrated_probs = softmax(logits / T)
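A minimal sketch of fitting T on the validation set by grid search, minimizing negative log-likelihood (the grid and the NLL objective are assumptions; gradient-based fitting is also common):

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Return the temperature that minimizes NLL on held-out data."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# Usage: calibrated_probs = softmax(test_logits, T=fit_temperature(val_logits, val_labels))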
Uncertainty in Regression
For continuous predictions, uncertainty is usually expressed as an interval around the point estimate:
Prediction Intervals
Instead of predicting just μ, predict μ ± interval:
Prediction: 42.5 ± 3.2 (95% confidence)
This means: "I'm 95% sure the true value is between 39.3 and 45.7"
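Assuming an (approximately) Gaussian predictive distribution, a 95% interval is about μ ± 1.96σ. A quick sketch matching the example above, where σ ≈ 1.63 so that 1.96σ ≈ 3.2:

mu, sigma = 42.5, 1.63  # predicted mean and standard deviation (illustrative)
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
print(f"95% interval: [{lower:.1f}, {upper:.1f}]")  # ~[39.3, 45.7]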
Predicting Distributions
Neural networks can output distribution parameters:
Input → Network → [μ, σ]
The output represents a normal distribution N(μ, σ²).
Loss function: Negative log-likelihood
Loss = log(σ) + (y - μ)² / (2σ²)
The network learns to output larger σ when uncertain.
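A sketch of this loss as a plain function (additive constants dropped; having the network predict log σ for numerical stability is an assumption, though a common one):

import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), up to a constant."""
    sigma = np.exp(log_sigma)
    return log_sigma + (y - mu) ** 2 / (2 * sigma ** 2)

# Small error, small sigma -> low loss
print(gaussian_nll(y=5.0, mu=5.2, log_sigma=np.log(0.5)))
# Large error, same small sigma (overconfident) -> much higher loss
print(gaussian_nll(y=5.0, mu=8.0, log_sigma=np.log(0.5)))

Increasing σ on the second example would lower the loss, which is exactly the incentive that teaches the network to report more uncertainty when its point prediction is likely to be off.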
Bayesian Approaches
Bayesian Neural Networks
Instead of fixed weights, maintain distributions over weights:
P(weights | data)
Prediction uncertainty comes from:
- Sampling multiple weight configurations
- Making predictions with each
- Observing the variance in predictions
Monte Carlo Dropout
A practical approximation:
- Keep dropout enabled at inference time
- Run the same input multiple times
- Observe variance in outputs
import numpy as np

predictions = []
for _ in range(100):
    pred = model(x, training=True)  # call the model with dropout still active (Keras-style)
    predictions.append(np.asarray(pred))

mean_prediction = np.mean(predictions, axis=0)  # averaged prediction
uncertainty = np.std(predictions, axis=0)       # spread across stochastic forward passes
Deep Ensembles
Train multiple models and aggregate:
import numpy as np

# Train several independently initialized models and aggregate their predictions
models = [train_model() for _ in range(5)]  # train_model() is assumed to return a fitted model
predictions = [np.asarray(m.predict(x)) for m in models]

mean = np.mean(predictions, axis=0)         # ensemble prediction
uncertainty = np.std(predictions, axis=0)   # disagreement between ensemble members
High variance across models = high epistemic uncertainty.
Out-of-Distribution Detection
Models should recognize when inputs are unlike training data:
Softmax Confidence Issues
Standard softmax can be overconfident on OOD inputs:
- A cat classifier might output [0.85, 0.15] for a car image
- The model has never seen a car, but it's forced to choose
Solutions
Outlier detection:
- Train on "normal" data
- Flag inputs that look different
Abstaining classifiers:
- Add a "don't know" class
- Train on OOD examples
Temperature scaling for OOD:
- OOD inputs often have different logit distributions
- Temperature adjustment can help detect them
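A minimal sketch of the common max-softmax baseline with a temperature (the temperature and threshold values here are placeholders to be tuned on validation data):

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_out_of_distribution(logits, T=2.0, threshold=0.7):
    """Flag an input whose temperature-scaled max softmax score is low."""
    score = softmax(np.asarray(logits, dtype=float), T).max()
    return score < threshold

print(is_out_of_distribution([4.0, 0.1, -1.0]))  # peaked logits -> likely in-distribution (False)
print(is_out_of_distribution([0.6, 0.5, 0.4]))   # flat logits -> flagged as OOD (True)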
Uncertainty in Language Models
Token-Level Uncertainty
LLMs have uncertainty about each token:
"The capital of Burkina Faso is ___"
P("Ouagadougou") = 0.65
P("Ouaga") = 0.10
P("Unknown") = 0.08
...
The model is fairly confident but not certain.
Sequence-Level Uncertainty
For full responses, uncertainty compounds:
P(response) = P(token1) × P(token2|token1) × ...
Longer responses generally have lower overall probability.
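In practice this product is computed as a sum of log-probabilities to avoid numerical underflow; a short sketch with illustrative per-token probabilities:

import numpy as np

token_probs = [0.65, 0.9, 0.8, 0.95]            # illustrative probabilities of each generated token

log_prob = float(np.sum(np.log(token_probs)))   # log P(response)
seq_prob = float(np.exp(log_prob))              # P(response) itself
avg_log_prob = log_prob / len(token_probs)      # length-normalized score

print(seq_prob, avg_log_prob)

Because the raw probability shrinks with every extra token, the length-normalized average log-probability is often used when comparing responses of different lengths.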
Verbalized Uncertainty
Modern LLMs can express uncertainty in words:
- "I'm confident that..."
- "I'm not entirely sure, but..."
- "I don't have enough information to..."
Ideally these verbal expressions are themselves calibrated: the model's stated confidence should track how often it is actually correct.
When to Defer to Humans
Use uncertainty to decide when AI should step back:
def make_decision(x):
    # predict_with_uncertainty is assumed to return a prediction plus an uncertainty score
    prediction, uncertainty = model.predict_with_uncertainty(x)
    if uncertainty > threshold:  # threshold tuned on validation data
        return "Please consult a human expert"
    return prediction
This is called selective prediction or learning to defer.
Summary
- Aleatoric uncertainty: Inherent data randomness (irreducible)
- Epistemic uncertainty: Model's lack of knowledge (reducible)
- Entropy, max probability, and margin measure classification confidence
- Calibration ensures confidence matches accuracy
- Temperature scaling is a simple calibration technique
- Bayesian methods and ensembles quantify epistemic uncertainty
- Monte Carlo dropout is a practical approximation
- LLMs can verbalize uncertainty
- High uncertainty should trigger deferral to humans
This completes Module 4! Next, we'll explore Maximum Likelihood Estimation—how AI learns from data.

