Evaluation Metrics: Precision, Recall, and More
How do we know if an AI model is good? Evaluation metrics quantify model performance. The right metric depends on what matters for your application—catching all positives, avoiding false alarms, or balancing both. This lesson covers the essential metrics every AI practitioner should know.
The Fundamental Trade-off
Consider a spam filter:
- Be aggressive: Catch all spam, but some real emails get flagged
- Be conservative: Never flag real emails, but some spam gets through
You can't have both. Metrics help us measure and balance this trade-off.
Basic Classification Outcomes
For binary classification, every prediction falls into one of four categories:
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Example: Spam Detection
- TP: Correctly flagged spam
- FP: Real email flagged as spam (false alarm)
- FN: Spam not caught (missed)
- TN: Real email correctly let through
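To make the four outcomes concrete, here is a minimal sketch that counts them with scikit-learn (an assumption; a hand count works just as well), using hypothetical labels where 1 = spam and 0 = legitimate:

```python
# A minimal sketch (scikit-learn assumed); labels are hypothetical, 1 = spam, 0 = legitimate.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

# For labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")  # TP=3  FP=1  FN=1  TN=3
```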
Accuracy
The most intuitive metric—proportion correct:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: Balanced classes, all errors equally bad
When NOT to use: Imbalanced classes
The Accuracy Trap
If 99% of emails are legitimate:
- A model that says "not spam" for everything gets 99% accuracy
- But it catches 0% of spam!
Accuracy is misleading for imbalanced data.
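A quick sketch of the trap on hypothetical data: 1,000 emails, 10 of them spam, and a model that never flags anything (scikit-learn assumed):

```python
# Illustrative only: 1,000 hypothetical emails, 10 of which are spam,
# and a model that never flags anything.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 10 + [0] * 990  # 1 = spam, 0 = legitimate
y_pred = [0] * 1000            # the "always not-spam" model

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- catches no spam at all
```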
Precision
Of the things we predicted positive, how many actually were?
Precision = TP / (TP + FP)
- High precision: When we flag something, we're usually right
- Low precision: Many false alarms
Example: Precision = 0.9 means 90% of flagged emails are actually spam
When precision matters:
- When false positives are costly
- Flagging financial fraud (don't accuse innocent people)
- Medical screening follow-ups (expensive confirmatory tests)
Recall (Sensitivity, True Positive Rate)
Of the things that were actually positive, how many did we catch?
Recall = TP / (TP + FN)
- High recall: We catch most positives
- Low recall: Many positives slip through
Example: Recall = 0.8 means we catch 80% of spam
When recall matters:
- When false negatives are costly
- Cancer screening (don't miss cancer)
- Security threats (don't miss attacks)
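Continuing the hypothetical labels from the confusion-matrix sketch above, precision and recall can be computed directly (scikit-learn assumed):

```python
# Same hypothetical labels as the confusion-matrix sketch (scikit-learn assumed).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```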
Precision vs. Recall Trade-off
You can usually trade one for the other by adjusting the classification threshold:
Increase recall: Lower the threshold for flagging
- More true positives, but also more false positives
- Precision drops
Increase precision: Raise the threshold
- Fewer false positives, but miss some true positives
- Recall drops
- High threshold (conservative): precision ↑, recall ↓
- Low threshold (aggressive): precision ↓, recall ↑
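A sketch of the same effect on hypothetical model scores: sweeping the cut-off shows precision rising and recall falling as the threshold climbs (scikit-learn assumed):

```python
# Hypothetical model scores; sweeping the threshold shows the trade-off.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.8):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# Higher thresholds push precision up and recall down.
```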
F1 Score
Harmonic mean of precision and recall—balances both:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Properties:
- Ranges from 0 to 1
- High only if BOTH precision and recall are high
- For example, F1 = 0.9 is only possible if both precision and recall are at least about 0.82
Weighted Variants
F-beta score: Adjust the precision/recall trade-off
F_β = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
- β = 1: Standard F1 (equal weight)
- β = 0.5: Precision weighted more
- β = 2: Recall weighted more
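A small sketch of F1 versus F-beta on hypothetical predictions where precision (about 0.67) exceeds recall (0.50), so the effect of β is visible (scikit-learn assumed):

```python
# Hypothetical predictions where precision (~0.67) exceeds recall (0.50),
# so the effect of beta is visible (scikit-learn assumed).
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(f1_score(y_true, y_pred))               # ~0.57, the harmonic mean
print(fbeta_score(y_true, y_pred, beta=0.5))  # higher, pulled toward precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # lower, pulled toward recall
```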
Specificity
Of the things that were actually negative, how many did we correctly identify?
Specificity = TN / (TN + FP)
The "recall for negatives"—also called True Negative Rate.
Example: Specificity = 0.95 means 95% of legitimate emails correctly pass
ROC Curve and AUC
ROC Curve
Plot True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at various thresholds:
[ROC curve sketch: TPR (recall) on the y-axis against FPR on the x-axis, with a good classifier's curve bowing toward the top-left corner]
- Diagonal: Random guessing
- Top-left corner: Perfect classifier
AUC (Area Under Curve)
Single number summarizing the ROC curve:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random
Interpretation: The probability that a randomly chosen positive example is scored higher than a randomly chosen negative example.
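A minimal sketch of computing the ROC curve and its AUC from hypothetical scores (scikit-learn assumed):

```python
# Hypothetical scores (scikit-learn assumed).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points on the ROC curve
print(roc_auc_score(y_true, scores))              # ~0.94: most positives outrank negatives
```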
Precision-Recall Curve
For imbalanced datasets, PR curves are often more informative:
[Precision-recall curve sketch: precision on the y-axis against recall on the x-axis; precision typically falls as recall increases]
Average Precision (AP): Area under PR curve
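The same hypothetical scores can produce a precision-recall curve and an average-precision summary (scikit-learn assumed):

```python
# Same hypothetical scores as the ROC sketch (scikit-learn assumed).
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(average_precision_score(y_true, scores))  # single-number summary of the PR curve
```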
Metrics for Multi-Class
For more than two classes, we have choices:
Macro-Average
Calculate metric for each class, then average:
Macro-Precision = (1/K) × Σ Precision_k
Treats all classes equally, regardless of size.
Micro-Average
Pool all predictions, then calculate:
Micro-Precision = Total TP / Total Predicted Positive
Larger classes dominate.
Weighted Average
Weight by class frequency:
Weighted-Precision = Σ (n_k / n) × Precision_k
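A sketch of how the three averages diverge on a hypothetical 3-class problem with one dominant class (scikit-learn assumed):

```python
# Hypothetical 3-class problem (classes 0, 1, 2) where class 0 dominates.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

print(precision_score(y_true, y_pred, average="macro"))     # every class counts equally
print(precision_score(y_true, y_pred, average="micro"))     # pooled over all predictions
print(precision_score(y_true, y_pred, average="weighted"))  # weighted by class frequency
```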
Log Loss (Cross-Entropy)
Measures probability calibration:
Log Loss = -(1/n) × Σ [yᵢ × log(p̂ᵢ) + (1-yᵢ) × log(1-p̂ᵢ)]
Properties:
- Penalizes confident wrong predictions heavily
- Rewards well-calibrated probabilities
- Used for training AND evaluation
Example:
- Correct prediction with p̂ = 0.99: Loss ≈ 0.01
- Correct prediction with p̂ = 0.51: Loss ≈ 0.67
- Wrong prediction with p̂ = 0.99: Loss ≈ 4.6 (huge penalty!)
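A sketch reproducing these single-example losses, then averaging over a small hypothetical batch (scikit-learn assumed for the batch version):

```python
# Reproducing the single-example losses, then averaging a hypothetical batch.
import math
from sklearn.metrics import log_loss

print(-math.log(0.99))  # ~0.01: confident and correct
print(-math.log(0.51))  # ~0.67: barely correct
print(-math.log(0.01))  # ~4.6:  confident and wrong

y_true = [1, 1, 0, 0]
p_hat  = [0.99, 0.51, 0.01, 0.40]  # predicted probability of the positive class
print(log_loss(y_true, p_hat))     # average over the batch
```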
Brier Score
Mean squared error of probability predictions:
Brier = (1/n) × Σ (p̂ᵢ - yᵢ)²
- Ranges from 0 (perfect) to 1 (worst)
- Lower is better
- Measures both calibration and accuracy
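A minimal sketch on the same hypothetical batch as the log-loss example (scikit-learn assumed):

```python
# Same hypothetical batch as the log-loss sketch (scikit-learn assumed).
from sklearn.metrics import brier_score_loss

y_true = [1, 1, 0, 0]
p_hat  = [0.99, 0.51, 0.01, 0.40]

print(brier_score_loss(y_true, p_hat))  # mean of (p_hat - y)^2; lower is better
```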
Regression Metrics
Mean Squared Error (MSE)
MSE = (1/n) × Σ (yᵢ - ŷᵢ)²
- Penalizes large errors more (squared)
- Units are squared
Root Mean Squared Error (RMSE)
RMSE = √MSE
- Same units as target
- More interpretable than MSE
Mean Absolute Error (MAE)
MAE = (1/n) × Σ |yᵢ - ŷᵢ|
- More robust to outliers
- Linear penalty for errors
R² (Coefficient of Determination)
R² = 1 - (Σ (yᵢ - ŷᵢ)² / Σ (yᵢ - ȳ)²)
- 1.0: Perfect prediction
- 0.0: Predicts mean
- Negative: Worse than mean
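A sketch computing all four regression metrics on hypothetical targets and predictions (scikit-learn assumed):

```python
# Hypothetical targets and predictions (scikit-learn assumed).
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # 0.875, in squared units
print(mse ** 0.5)                           # RMSE ~0.94, same units as the target
print(mean_absolute_error(y_true, y_pred))  # 0.75
print(r2_score(y_true, y_pred))             # ~0.72
```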
Ranking Metrics
For recommendations and search:
Mean Average Precision (MAP)
Average the precision at each relevant position in the ranking:
AP = (1/R) × Σᵢ Precision@i × relevance(i), where R is the number of relevant items
MAP = average AP across queries
Normalized Discounted Cumulative Gain (NDCG)
Accounts for position (higher = better):
DCG = Σ relevanceᵢ / log₂(i + 1)
NDCG = DCG / Ideal DCG
Mean Reciprocal Rank (MRR)
Based on the rank of the first relevant result for each query:
MRR = (1/n) × Σ (1 / rank of first relevant)
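A sketch of MRR computed by hand and NDCG via scikit-learn (assumed available), on hypothetical relevance judgments:

```python
# MRR by hand on hypothetical queries, plus NDCG via scikit-learn (assumed available).
import numpy as np
from sklearn.metrics import ndcg_score

# Rank of the first relevant result for each of three hypothetical queries.
first_relevant_ranks = [1, 3, 2]
mrr = sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)
print(mrr)  # (1 + 1/3 + 1/2) / 3 ~ 0.61

# NDCG: true relevance grades vs. the scores the model used to rank the items.
true_relevance = np.asarray([[3, 2, 0, 1, 0]])
model_scores   = np.asarray([[0.9, 0.7, 0.6, 0.4, 0.1]])
print(ndcg_score(true_relevance, model_scores))
```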
Choosing the Right Metric
| Scenario | Recommended Metric |
|---|---|
| Balanced classes, all errors equal | Accuracy |
| Imbalanced classes | F1, PR-AUC |
| False positives costly | Precision |
| False negatives costly | Recall |
| Need calibrated probabilities | Log Loss, Brier |
| Ranking/recommendations | MAP, NDCG |
| Regression | RMSE, MAE, R² |
Summary
- Accuracy is intuitive but misleading for imbalanced data
- Precision measures how many of our positive predictions are correct (low precision means many false alarms)
- Recall measures how many actual positives we catch (low recall means many misses)
- F1 balances precision and recall
- AUC-ROC summarizes the TPR/FPR trade-off across thresholds (PR-AUC summarizes the precision-recall trade-off)
- Log Loss measures probability calibration
- Choose metrics based on what errors cost in your application
Next, we'll dive deeper into confusion matrices—the foundation for understanding classification performance.

