Understanding Confusion Matrices
A confusion matrix is a table that visualizes classification performance by showing exactly where a model gets confused. It's one of the most useful tools for understanding what's going right and wrong with a classifier.
What is a Confusion Matrix?
A confusion matrix shows the count of predictions for each combination of actual vs. predicted class:
                     PREDICTED
              Cat   Dog   Bird
        Cat    45     5     2   ← 45 cats correctly identified
ACTUAL  Dog     3    38     4   ← 3 dogs misclassified as cats
        Bird    2     7    41   ← 7 birds misclassified as dogs
Each row represents actual classes; each column represents predictions.
Reading a Binary Confusion Matrix
For binary classification (positive/negative):
                        PREDICTED
                   Positive   Negative
ACTUAL  Positive      TP         FN
        Negative      FP         TN
Where:
- TP (True Positive): Correctly predicted positive
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Incorrectly predicted positive (Type I error)
- FN (False Negative): Incorrectly predicted negative (Type II error)
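As a quick illustration (with made-up y_true / y_pred arrays, where 1 = positive and 0 = negative), the four counts can be computed directly with NumPy:
import numpy as np

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted positive
tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly predicted negative
fp = np.sum((y_true == 0) & (y_pred == 1))  # Type I error
fn = np.sum((y_true == 1) & (y_pred == 0))  # Type II error
print(tp, tn, fp, fn)  # 3 3 1 1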
Example: Medical Diagnosis
                       PREDICTED
                  Disease   Healthy
ACTUAL  Disease      85        15    ← 15 missed cases (FN)
        Healthy      10       890    ← 10 false alarms (FP)
From this matrix:
- TP = 85: Correctly diagnosed diseases
- TN = 890: Correctly identified healthy patients
- FP = 10: Healthy patients incorrectly told they're sick
- FN = 15: Sick patients missed (dangerous!)
Extracting Metrics
All classification metrics come from the confusion matrix:
Accuracy = (TP + TN) / Total = (85 + 890) / 1000 = 97.5%
Precision = TP / (TP + FP) = 85 / 95 = 89.5%
Recall = TP / (TP + FN) = 85 / 100 = 85%
F1 = 2 × (0.895 × 0.85) / (0.895 + 0.85) = 87.2%
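The same calculations in code, using the counts from the medical example above:
tp, tn, fp, fn = 85, 890, 10, 15

accuracy  = (tp + tn) / (tp + tn + fp + fn)          # 0.975
precision = tp / (tp + fp)                           # ≈ 0.895
recall    = tp / (tp + fn)                           # 0.85
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.872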
Multi-Class Confusion Matrices
For K classes, the matrix is K × K:
Example: Digit Recognition (0-9)
PREDICTED:    0     1     2     3     4     5     6     7     8     9
ACTUAL 0:   980     0     2     1     0     4     5     1     3     4
       1:     0  1120     3     2     0     1     2     0     5     2
       2:     6     5   970     8     3     2     3     4     6     5
       3:     2     3     7   960     0    10     0     4     8     6
       4:     1     2     4     0   950     0     6     3     4    12
       5:     4     2     1    12     3   850     9     1     5     5
       6:     3     3     0     1     4     5   942     0     0     0
       7:     0     4    10     4     0     1     0   990     2    17
       8:     4     3     5     7     3     6     3     2   940     1
       9:     3     3     1     7    15     5     1     6     5   963
What We Can Learn
- Diagonal: Correct predictions (want these high)
- Off-diagonal: Errors (want these low)
- Cell (row i, column j): How often actual class i is predicted as class j
From the example:
- 4 and 9 are often confused (look similar)
- 7 and 9 are confused (similar top curve)
- 1 is rarely confused (distinctive shape)
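One way to surface these pairs automatically is to rank the off-diagonal cells. A minimal sketch, assuming the matrix is available as a NumPy array cm (rows = actual, columns = predicted):
import numpy as np

def top_confusions(cm, k=3):
    # Zero out the diagonal, then rank the remaining (actual, predicted) cells
    errors = cm.copy()
    np.fill_diagonal(errors, 0)
    order = np.argsort(errors, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, errors.shape)
    return [(int(i), int(j), int(errors[i, j])) for i, j in zip(rows, cols)]
For the digit matrix above, this would surface 7→9 (17) and 9→4 (15) first.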
Normalized Confusion Matrix
Normalize each row by its total to see per-class recall (how often each actual class is correctly identified):
Raw:
              Cat   Dog
       Cat     45     5   → 50 total cats
       Dog      3    47   → 50 total dogs
Normalized:
              Cat   Dog
       Cat   0.90  0.10   → 90% of cats correctly identified
       Dog   0.06  0.94   → 94% of dogs correctly identified
This shows:
- Cats are correctly classified 90% of the time
- Dogs are correctly classified 94% of the time
- Cats are more often confused with dogs than vice versa
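Row normalization is a one-liner with NumPy; a sketch using the raw cat/dog counts above:
import numpy as np

cm = np.array([[45,  5],
               [ 3, 47]])

# Divide each row by its total so every row sums to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized)  # → [[0.90, 0.10], [0.06, 0.94]]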
Visualizing Confusion Matrices
import matplotlib.pyplot as plt
import seaborn as sns

# cm is a 2D array of counts: rows = actual classes, columns = predictions
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
                  Predicted
               Cat    Dog    Bird
        Cat   [███]  [░░░]  [   ]
Actual  Dog   [ ░ ]  [███]  [ ░ ]
        Bird  [   ]  [ ░░]  [███]
Dark = Many predictions
Light = Few predictions
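If scikit-learn is available, the count matrix itself can be built from labels before plotting (a sketch with hypothetical y_true / y_pred lists); passing xticklabels and yticklabels to sns.heatmap then makes the axes readable:
from sklearn.metrics import confusion_matrix
import seaborn as sns

labels = ['Cat', 'Dog', 'Bird']
y_true = ['Cat', 'Dog', 'Bird', 'Cat', 'Dog', 'Bird']
y_pred = ['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Bird']

# Rows are actual classes, columns are predictions, ordered by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)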
Per-Class Metrics
From a multi-class confusion matrix, compute metrics for each class:
For class "Cat":
- TP: Predicted cat, actually cat (diagonal)
- FP: Predicted cat, actually not cat (column sum - diagonal)
- FN: Actually cat, predicted not cat (row sum - diagonal)
def per_class_metrics(cm, class_idx):
    # cm is a K x K NumPy array: rows = actual, columns = predicted
    tp = cm[class_idx, class_idx]
    fp = cm[:, class_idx].sum() - tp   # predicted as this class, actually another
    fn = cm[class_idx, :].sum() - tp   # actually this class, predicted as another
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    return precision, recall
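Applied to the cat/dog matrix from the normalization example:
import numpy as np

cm = np.array([[45,  5],
               [ 3, 47]])

precision, recall = per_class_metrics(cm, 0)  # class 0 = Cat
print(precision, recall)  # 0.9375 0.9  (45/48 and 45/50)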
Micro vs. Macro Averaging
For multi-class, we can aggregate metrics two ways:
Macro-Average
Calculate metric for each class, then average:
macro_precision = sum(precision[c] for c in classes) / num_classes
Treats all classes equally regardless of size.
Micro-Average
Pool all predictions, then calculate:
total_tp = sum(tp[c] for c in classes)
total_fp = sum(fp[c] for c in classes)
micro_precision = total_tp / (total_tp + total_fp)
Larger classes dominate.
When to Use Which
- Macro: When you care equally about all classes
- Micro: When you care about overall performance
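A sketch of both aggregations from a count matrix cm (a NumPy array with rows = actual, columns = predicted), assuming every class is predicted at least once so no division by zero occurs:
import numpy as np

def macro_micro_precision(cm):
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                  # column totals minus the diagonal
    per_class = tp / (tp + fp)                # precision for each class
    macro = per_class.mean()                  # every class weighted equally
    micro = tp.sum() / (tp.sum() + fp.sum())  # pooled counts; large classes dominate
    return macro, micro
For single-label classification, the pooled (micro) precision equals overall accuracy, because every false positive for one class is simultaneously a false negative for another.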
Common Patterns in Confusion Matrices
Diagonal Dominance (Good)
         A     B     C
   A   [95]    3     2
   B     2   [91]    7
   C     3     5   [92]
Most predictions are correct (on diagonal).
Off-Diagonal Cluster (Confusing Classes)
         A     B     C
   A   [90]  [ 8]    2
   B   [ 9]  [85]    6
   C     3     4   [93]
A and B are often confused with each other.
Row/Column Bias (Prediction Imbalance)
         A     B     C
   A    80    15     5
   B    10    75    15
   C    40    20    40
Class C is under-predicted (row sum > column sum).
Dominant Column (Class Collapse)
         A     B     C
   A    95     3     2
   B    85    10     5
   C    80    15     5
The model barely predicts B or C; it collapses almost everything to A.
Using Confusion Matrices for Debugging
1. Identify Problematic Classes
Look for classes whose diagonal count is low relative to their row total.
2. Find Confusion Patterns
Which classes are confused? Often visually or semantically similar.
3. Check for Bias
Are some classes over- or under-predicted? (A quick check is sketched after this list.)
4. Guide Data Collection
If A→B confusion is high, collect more A/B training examples.
5. Consider Class Merging
If two classes are always confused, maybe they should be one class.
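For the bias check in step 3, compare how often each class actually occurs (row totals) with how often it is predicted (column totals); a sketch:
import numpy as np

def prediction_bias(cm, class_names):
    actual = cm.sum(axis=1)      # row totals: how often each class occurs
    predicted = cm.sum(axis=0)   # column totals: how often each class is predicted
    for name, a, p in zip(class_names, actual, predicted):
        print(f"{name}: actual={a}, predicted={p}, ratio={p / a:.2f}")  # ratio < 1 → under-predicted
For the "Row/Column Bias" matrix above, class C has ratio 60/100 = 0.60.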
Threshold Analysis
For binary classification, the confusion matrix changes with threshold:
Threshold = 0.5:
                   Pos   Neg
      Actual Pos    80    20
      Actual Neg    10    90

Threshold = 0.3 (more aggressive):
                   Pos   Neg
      Actual Pos    95     5   ← More TP, fewer FN
      Actual Neg    30    70   ← More FP

Threshold = 0.7 (more conservative):
                   Pos   Neg
      Actual Pos    60    40   ← Fewer TP, more FN
      Actual Neg     5    95   ← Fewer FP
Plotting metrics across thresholds gives ROC and PR curves.
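A sketch of sweeping the threshold over predicted probabilities (hypothetical y_true / y_prob arrays):
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.35, 0.8, 0.6, 0.1])

def confusion_at_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp, fp, fn, tn

for t in (0.3, 0.5, 0.7):
    print(t, confusion_at_threshold(y_true, y_prob, t))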
Cost-Sensitive Analysis
Different errors have different costs:
Medical screening:
- FN (miss disease): Cost = $50,000 (late treatment)
- FP (false alarm): Cost = $500 (extra tests)
Total cost = FN × $50,000 + FP × $500
= 15 × $50,000 + 10 × $500
= $750,000 + $5,000
= $755,000
Optimize for minimum cost, not just accuracy.
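Reusing confusion_at_threshold and the arrays from the previous sketch, the threshold can be chosen to minimize total cost instead of maximizing accuracy (the cost figures are the illustrative ones above):
COST_FN = 50_000   # missed disease: late treatment
COST_FP = 500      # false alarm: extra tests

def total_cost(y_true, y_prob, threshold):
    tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, threshold)
    return fn * COST_FN + fp * COST_FP

thresholds = np.linspace(0.05, 0.95, 19)
best_threshold = min(thresholds, key=lambda t: total_cost(y_true, y_prob, t))
With these costs, a false negative is 100× more expensive than a false positive, so the search favors lower thresholds.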
Confusion Matrix in Production
Monitor confusion matrices over time:
January:      Cat   Dog        June:      Cat   Dog
        Cat    90    10              Cat    85    15   ← Performance dropping
        Dog     5    95              Dog    25    75   ← Dog detection degraded
Possible causes:
- Data drift (new dog breeds?)
- Season change (winter coats?)
- Bug in preprocessing
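A simple automated check is to compare per-class recall (the row-normalized diagonal) between a baseline matrix and the current one, and flag classes whose recall dropped by more than some margin; a sketch using the January/June numbers:
import numpy as np

def recall_drift(cm_baseline, cm_current, tolerance=0.10):
    recall_old = np.diag(cm_baseline) / cm_baseline.sum(axis=1)
    recall_new = np.diag(cm_current) / cm_current.sum(axis=1)
    return np.where(recall_old - recall_new > tolerance)[0]  # indices of degraded classes

cm_january = np.array([[90, 10], [5, 95]])
cm_june    = np.array([[85, 15], [25, 75]])
print(recall_drift(cm_january, cm_june))  # [1] → Dog recall fell from 0.95 to 0.75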
Summary
- Confusion matrices show exactly where models make mistakes
- Diagonal = correct, off-diagonal = errors
- Derive precision, recall, F1 from the matrix
- Normalize by row to see per-class performance
- Use for debugging: find confusing classes, bias, and patterns
- Different thresholds give different matrices
- Monitor in production to catch degradation
This completes the Probability & Statistics for AI course! You now have the probabilistic foundations to understand how AI models learn, predict, and handle uncertainty.

