Understanding Confusion Matrices
A confusion matrix is a table that visualizes classification performance by showing exactly where a model gets confused. It's one of the most useful tools for understanding what's going right and wrong with a classifier.
What is a Confusion Matrix?
A confusion matrix shows the count of predictions for each combination of actual vs. predicted class:
                     PREDICTED
              Cat   Dog   Bird
        Cat    45     5     2   ← 45 cats correctly identified
ACTUAL  Dog     3    38     4   ← 3 dogs misclassified as cats
        Bird    2     7    41   ← 7 birds misclassified as dogs
Each row represents actual classes; each column represents predictions.
Reading a Binary Confusion Matrix
For binary classification (positive/negative):
                        PREDICTED
                   Positive   Negative
ACTUAL  Positive      TP         FN
        Negative      FP         TN
Where:
- TP (True Positive): Correctly predicted positive
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Incorrectly predicted positive (Type I error)
- FN (False Negative): Incorrectly predicted negative (Type II error)
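As a quick illustration (with made-up y_true / y_pred arrays, where 1 = positive and 0 = negative), the four counts can be computed directly with NumPy:
import numpy as np

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted positive
tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly predicted negative
fp = np.sum((y_true == 0) & (y_pred == 1))  # Type I error
fn = np.sum((y_true == 1) & (y_pred == 0))  # Type II error
print(tp, tn, fp, fn)  # 3 3 1 1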
Example: Medical Diagnosis
                       PREDICTED
                  Disease   Healthy
ACTUAL  Disease      85        15    ← 15 missed cases (FN)
        Healthy      10       890    ← 10 false alarms (FP)
From this matrix:
- TP = 85: Correctly diagnosed diseases
- TN = 890: Correctly identified healthy patients
- FP = 10: Healthy patients incorrectly told they're sick
- FN = 15: Sick patients missed (dangerous!)
Extracting Metrics
All classification metrics come from the confusion matrix:
Accuracy = (TP + TN) / Total = (85 + 890) / 1000 = 97.5%
Precision = TP / (TP + FP) = 85 / 95 = 89.5%
Recall = TP / (TP + FN) = 85 / 100 = 85%
F1 = 2 × (0.895 × 0.85) / (0.895 + 0.85) = 87.2%
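The same calculations in code, using the counts from the medical example above:
tp, tn, fp, fn = 85, 890, 10, 15

accuracy  = (tp + tn) / (tp + tn + fp + fn)          # 0.975
precision = tp / (tp + fp)                           # ≈ 0.895
recall    = tp / (tp + fn)                           # 0.85
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.872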
Multi-Class Confusion Matrices
For K classes, the matrix is K × K:
Example: Digit Recognition (0-9)
PREDICTED:    0     1     2     3     4     5     6     7     8     9
ACTUAL 0:   980     0     2     1     0     4     5     1     3     4
       1:     0  1120     3     2     0     1     2     0     5     2
       2:     6     5   970     8     3     2     3     4     6     5
       3:     2     3     7   960     0    10     0     4     8     6
       4:     1     2     4     0   950     0     6     3     4    12
       5:     4     2     1    12     3   850     9     1     5     5
       6:     3     3     0     1     4     5   942     0     0     0
       7:     0     4    10     4     0     1     0   990     2    17
       8:     4     3     5     7     3     6     3     2   940     1
       9:     3     3     1     7    15     5     1     6     5   963
What We Can Learn
- Diagonal: Correct predictions (want these high)
- Off-diagonal: Errors (want these low)
- Cell (row i, column j): How often actual class i is predicted as class j
From the example:
- 4 and 9 are often confused (look similar)
- 7 and 9 are confused (similar top curve)
- 1 is rarely confused (distinctive shape)
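One way to surface these pairs automatically is to rank the off-diagonal cells. A minimal sketch, assuming the matrix is available as a NumPy array cm (rows = actual, columns = predicted):
import numpy as np

def top_confusions(cm, k=3):
    # Zero out the diagonal, then rank the remaining (actual, predicted) cells
    errors = cm.copy()
    np.fill_diagonal(errors, 0)
    order = np.argsort(errors, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, errors.shape)
    return [(int(i), int(j), int(errors[i, j])) for i, j in zip(rows, cols)]
For the digit matrix above, this would surface 7→9 (17) and 9→4 (15) first.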
Normalized Confusion Matrix
Normalize each row by its total to see per-class recall (how often each actual class is correctly identified):
Raw:
              Cat   Dog
       Cat     45     5   → 50 total cats
       Dog      3    47   → 50 total dogs
Normalized:
              Cat   Dog
       Cat   0.90  0.10   → 90% of cats correctly identified
       Dog   0.06  0.94   → 94% of dogs correctly identified
This shows:
- Cats are correctly classified 90% of the time
- Dogs are correctly classified 94% of the time
- Cats are more often confused with dogs than vice versa
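Row normalization is a one-liner with NumPy; a sketch using the raw cat/dog counts above:
import numpy as np

cm = np.array([[45,  5],
               [ 3, 47]])

# Divide each row by its total so every row sums to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized)  # → [[0.90, 0.10], [0.06, 0.94]]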
Visualizing Confusion Matrices
import matplotlib.pyplot as plt
import seaborn as sns

# cm is a 2D array of counts: rows = actual classes, columns = predictions
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
                  Predicted
               Cat    Dog    Bird
        Cat   [███]  [░░░]  [   ]
Actual  Dog   [ ░ ]  [███]  [ ░ ]
        Bird  [   ]  [ ░░]  [███]
Dark = Many predictions
Light = Few predictions
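If scikit-learn is available, the count matrix itself can be built from labels before plotting (a sketch with hypothetical y_true / y_pred lists); passing xticklabels and yticklabels to sns.heatmap then makes the axes readable:
from sklearn.metrics import confusion_matrix
import seaborn as sns

labels = ['Cat', 'Dog', 'Bird']
y_true = ['Cat', 'Dog', 'Bird', 'Cat', 'Dog', 'Bird']
y_pred = ['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Bird']

# Rows are actual classes, columns are predictions, ordered by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)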
Per-Class Metrics
From a multi-class confusion matrix, compute metrics for each class:
For class "Cat":
- TP: Predicted cat, actually cat (diagonal)
- FP: Predicted cat, actually not cat (column sum - diagonal)
- FN: Actually cat, predicted not cat (row sum - diagonal)
def per_class_metrics(cm, class_idx):
    # cm is a K x K NumPy array: rows = actual, columns = predicted
    tp = cm[class_idx, class_idx]
    fp = cm[:, class_idx].sum() - tp   # predicted as this class, actually another
    fn = cm[class_idx, :].sum() - tp   # actually this class, predicted as another
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    return precision, recall
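Applied to the cat/dog matrix from the normalization example:
import numpy as np

cm = np.array([[45,  5],
               [ 3, 47]])

precision, recall = per_class_metrics(cm, 0)  # class 0 = Cat
print(precision, recall)  # 0.9375 0.9  (45/48 and 45/50)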
Micro vs. Macro Averaging
For multi-class, we can aggregate metrics two ways:
Macro-Average
Calculate metric for each class, then average:
macro_precision = sum(precision[c] for c in classes) / num_classes
Treats all classes equally regardless of size.
Micro-Average
Pool all predictions, then calculate:
total_tp = sum(tp[c] for c in classes)
total_fp = sum(fp[c] for c in classes)
micro_precision = total_tp / (total_tp + total_fp)
Larger classes dominate.
When to Use Which
- Macro: When you care equally about all classes
- Micro: When you care about overall performance
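A sketch of both aggregations from a count matrix cm (a NumPy array with rows = actual, columns = predicted), assuming every class is predicted at least once so no division by zero occurs:
import numpy as np

def macro_micro_precision(cm):
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                  # column totals minus the diagonal
    per_class = tp / (tp + fp)                # precision for each class
    macro = per_class.mean()                  # every class weighted equally
    micro = tp.sum() / (tp.sum() + fp.sum())  # pooled counts; large classes dominate
    return macro, micro
For single-label classification, the pooled (micro) precision equals overall accuracy, because every false positive for one class is simultaneously a false negative for another.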
Common Patterns in Confusion Matrices
Diagonal Dominance (Good)
         A     B     C
   A   [95]    3     2
   B     2   [91]    7
   C     3     5   [92]
Most predictions are correct (on diagonal).
Off-Diagonal Cluster (Confusing Classes)
         A     B     C
   A   [90]  [ 8]    2
   B   [ 9]  [85]    6
   C     3     4   [93]
A and B are often confused with each other.
Row/Column Bias (Prediction Imbalance)
         A     B     C
   A    80    15     5
   B    10    75    15
   C    40    20    40
Class C is under-predicted (row sum > column sum).
Dominant Column (Class Collapse)
         A     B     C
   A    95     3     2
   B    85    10     5
   C    80    15     5
The model barely predicts B or C; it collapses almost everything to A.
Using Confusion Matrices for Debugging
1. Identify Problematic Classes
Look for classes whose diagonal count is low relative to their row total.
2. Find Confusion Patterns
Which classes are confused? Often visually or semantically similar.
3. Check for Bias
Are some classes over- or under-predicted? (A quick check is sketched after this list.)
4. Guide Data Collection
If A→B confusion is high, collect more A/B training examples.
5. Consider Class Merging
If two classes are always confused, maybe they should be one class.
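For the bias check in step 3, compare how often each class actually occurs (row totals) with how often it is predicted (column totals); a sketch:
import numpy as np

def prediction_bias(cm, class_names):
    actual = cm.sum(axis=1)      # row totals: how often each class occurs
    predicted = cm.sum(axis=0)   # column totals: how often each class is predicted
    for name, a, p in zip(class_names, actual, predicted):
        print(f"{name}: actual={a}, predicted={p}, ratio={p / a:.2f}")  # ratio < 1 → under-predicted
For the "Row/Column Bias" matrix above, class C has ratio 60/100 = 0.60.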
Threshold Analysis
For binary classification, the confusion matrix changes with threshold:
Threshold = 0.5:
                   Pos   Neg
      Actual Pos    80    20
      Actual Neg    10    90

Threshold = 0.3 (more aggressive):
                   Pos   Neg
      Actual Pos    95     5   ← More TP, fewer FN
      Actual Neg    30    70   ← More FP

Threshold = 0.7 (more conservative):
                   Pos   Neg
      Actual Pos    60    40   ← Fewer TP, more FN
      Actual Neg     5    95   ← Fewer FP
Plotting metrics across thresholds gives ROC and PR curves.
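A sketch of sweeping the threshold over predicted probabilities (hypothetical y_true / y_prob arrays):
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.35, 0.8, 0.6, 0.1])

def confusion_at_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp, fp, fn, tn

for t in (0.3, 0.5, 0.7):
    print(t, confusion_at_threshold(y_true, y_prob, t))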
Cost-Sensitive Analysis
Different errors have different costs:
Medical screening:
- FN (miss disease): Cost = $50,000 (late treatment)
- FP (false alarm): Cost = $500 (extra tests)
Total cost = FN × $50,000 + FP × $500
= 15 × $50,000 + 10 × $500
= $750,000 + $5,000
= $755,000
Optimize for minimum cost, not just accuracy.
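Reusing confusion_at_threshold and the arrays from the previous sketch, the threshold can be chosen to minimize total cost instead of maximizing accuracy (the cost figures are the illustrative ones above):
COST_FN = 50_000   # missed disease: late treatment
COST_FP = 500      # false alarm: extra tests

def total_cost(y_true, y_prob, threshold):
    tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, threshold)
    return fn * COST_FN + fp * COST_FP

thresholds = np.linspace(0.05, 0.95, 19)
best_threshold = min(thresholds, key=lambda t: total_cost(y_true, y_prob, t))
With these costs, a false negative is 100× more expensive than a false positive, so the search favors lower thresholds.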
Confusion Matrix in Production
Monitor confusion matrices over time:
January:      Cat   Dog        June:      Cat   Dog
        Cat    90    10              Cat    85    15   ← Performance dropping
        Dog     5    95              Dog    25    75   ← Dog detection degraded
Possible causes:
- Data drift (new dog breeds?)
- Season change (winter coats?)
- Bug in preprocessing
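A simple automated check is to compare per-class recall (the row-normalized diagonal) between a baseline matrix and the current one, and flag classes whose recall dropped by more than some margin; a sketch using the January/June numbers:
import numpy as np

def recall_drift(cm_baseline, cm_current, tolerance=0.10):
    recall_old = np.diag(cm_baseline) / cm_baseline.sum(axis=1)
    recall_new = np.diag(cm_current) / cm_current.sum(axis=1)
    return np.where(recall_old - recall_new > tolerance)[0]  # indices of degraded classes

cm_january = np.array([[90, 10], [5, 95]])
cm_june    = np.array([[85, 15], [25, 75]])
print(recall_drift(cm_january, cm_june))  # [1] → Dog recall fell from 0.95 to 0.75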
Summary
- Confusion matrices show exactly where models make mistakes
- Diagonal = correct, off-diagonal = errors
- Derive precision, recall, F1 from the matrix
- Normalize by row to see per-class performance
- Use for debugging: find confusing classes, bias, and patterns
- Different thresholds give different matrices
- Monitor in production to catch degradation
This completes the Probability & Statistics for AI course! You now have the probabilistic foundations to understand how AI models learn, predict, and handle uncertainty.

