Evaluation Metrics: Precision, Recall, and More
How do we know if an AI model is good? Evaluation metrics quantify model performance. The right metric depends on what matters for your application—catching all positives, avoiding false alarms, or balancing both. This lesson covers the essential metrics every AI practitioner should know.
The Fundamental Trade-off
Consider a spam filter:
- Be aggressive: Catch all spam, but some real emails get flagged
- Be conservative: Never flag real emails, but some spam gets through
You can't have both. Metrics help us measure and balance this trade-off.
Basic Classification Outcomes
For binary classification, every prediction falls into one of four categories:
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Example: Spam Detection
- TP: Correctly flagged spam
- FP: Real email flagged as spam (false alarm)
- FN: Spam not caught (missed)
- TN: Real email correctly let through
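To make the four outcomes concrete, here is a minimal sketch that counts them with scikit-learn (an assumption; a hand count works just as well), using hypothetical labels where 1 = spam and 0 = legitimate:

```python
# A minimal sketch (scikit-learn assumed); labels are hypothetical, 1 = spam, 0 = legitimate.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

# For labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")  # TP=3  FP=1  FN=1  TN=3
```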
Accuracy
The most intuitive metric—proportion correct:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: Balanced classes, all errors equally bad
When NOT to use: Imbalanced classes
The Accuracy Trap
If 99% of emails are legitimate:
- A model that says "not spam" for everything gets 99% accuracy
- But it catches 0% of spam!
Accuracy is misleading for imbalanced data.
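A quick sketch of the trap on hypothetical data: 1,000 emails, 10 of them spam, and a model that never flags anything (scikit-learn assumed):

```python
# Illustrative only: 1,000 hypothetical emails, 10 of which are spam,
# and a model that never flags anything.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 10 + [0] * 990  # 1 = spam, 0 = legitimate
y_pred = [0] * 1000            # the "always not-spam" model

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- catches no spam at all
```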
Precision
Of the things we predicted positive, how many actually were?
Precision = TP / (TP + FP)
- High precision: When we flag something, we're usually right
- Low precision: Many false alarms
Example: Precision = 0.9 means 90% of flagged emails are actually spam
When precision matters:
- When false positives are costly
- Flagging financial fraud (don't accuse innocent people)
- Medical screening follow-ups (expensive confirmatory tests)
Recall (Sensitivity, True Positive Rate)
Of the things that were actually positive, how many did we catch?
Recall = TP / (TP + FN)
- High recall: We catch most positives
- Low recall: Many positives slip through
Example: Recall = 0.8 means we catch 80% of spam
When recall matters:
- When false negatives are costly
- Cancer screening (don't miss cancer)
- Security threats (don't miss attacks)
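Continuing the hypothetical labels from the confusion-matrix sketch above, precision and recall can be computed directly (scikit-learn assumed):

```python
# Same hypothetical labels as the confusion-matrix sketch (scikit-learn assumed).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```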
Precision vs. Recall Trade-off
You can usually trade one for the other by adjusting the classification threshold:
Increase recall: Lower the threshold for flagging
- More true positives, but also more false positives
- Precision drops
Increase precision: Raise the threshold
- Fewer false positives, but miss some true positives
- Recall drops
- High threshold (conservative): precision ↑, recall ↓
- Low threshold (aggressive): precision ↓, recall ↑
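A sketch of the same effect on hypothetical model scores: sweeping the cut-off shows precision rising and recall falling as the threshold climbs (scikit-learn assumed):

```python
# Hypothetical model scores; sweeping the threshold shows the trade-off.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.8):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# Higher thresholds push precision up and recall down.
```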
F1 Score
Harmonic mean of precision and recall—balances both:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Properties:
- Ranges from 0 to 1
- High only if BOTH precision and recall are high
- For example, F1 = 0.9 is only possible if both precision and recall are at least about 0.82
Weighted Variants
F-beta score: Adjust the precision/recall trade-off
F_β = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
- β = 1: Standard F1 (equal weight)
- β = 0.5: Precision weighted more
- β = 2: Recall weighted more
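A small sketch of F1 versus F-beta on hypothetical predictions where precision (about 0.67) exceeds recall (0.50), so the effect of β is visible (scikit-learn assumed):

```python
# Hypothetical predictions where precision (~0.67) exceeds recall (0.50),
# so the effect of beta is visible (scikit-learn assumed).
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(f1_score(y_true, y_pred))               # ~0.57, the harmonic mean
print(fbeta_score(y_true, y_pred, beta=0.5))  # higher, pulled toward precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # lower, pulled toward recall
```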
Specificity
Of the things that were actually negative, how many did we correctly identify?
Specificity = TN / (TN + FP)
The "recall for negatives"—also called True Negative Rate.
Example: Specificity = 0.95 means 95% of legitimate emails correctly pass
ROC Curve and AUC
ROC Curve
Plot True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at various thresholds:
[ROC curve sketch: TPR (recall) on the y-axis against FPR on the x-axis, with a good classifier's curve bowing toward the top-left corner]
- Diagonal: Random guessing
- Top-left corner: Perfect classifier
AUC (Area Under Curve)
Single number summarizing the ROC curve:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random
Interpretation: The probability that a randomly chosen positive example is scored higher than a randomly chosen negative example.
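A minimal sketch of computing the ROC curve and its AUC from hypothetical scores (scikit-learn assumed):

```python
# Hypothetical scores (scikit-learn assumed).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points on the ROC curve
print(roc_auc_score(y_true, scores))              # ~0.94: most positives outrank negatives
```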
Precision-Recall Curve
For imbalanced datasets, PR curves are often more informative:
[Precision-recall curve sketch: precision on the y-axis against recall on the x-axis; precision typically falls as recall increases]
Average Precision (AP): Area under PR curve
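The same hypothetical scores can produce a precision-recall curve and an average-precision summary (scikit-learn assumed):

```python
# Same hypothetical scores as the ROC sketch (scikit-learn assumed).
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(average_precision_score(y_true, scores))  # single-number summary of the PR curve
```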
Metrics for Multi-Class
For more than two classes, we have choices:
Macro-Average
Calculate metric for each class, then average:
Macro-Precision = (1/K) × Σ Precision_k
Treats all classes equally, regardless of size.
Micro-Average
Pool all predictions, then calculate:
Micro-Precision = Total TP / Total Predicted Positive
Larger classes dominate.
Weighted Average
Weight by class frequency:
Weighted-Precision = Σ (n_k / n) × Precision_k
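A sketch of how the three averages diverge on a hypothetical 3-class problem with one dominant class (scikit-learn assumed):

```python
# Hypothetical 3-class problem (classes 0, 1, 2) where class 0 dominates.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

print(precision_score(y_true, y_pred, average="macro"))     # every class counts equally
print(precision_score(y_true, y_pred, average="micro"))     # pooled over all predictions
print(precision_score(y_true, y_pred, average="weighted"))  # weighted by class frequency
```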
Log Loss (Cross-Entropy)
Measures probability calibration:
Log Loss = -(1/n) × Σ [yᵢ × log(p̂ᵢ) + (1-yᵢ) × log(1-p̂ᵢ)]
Properties:
- Penalizes confident wrong predictions heavily
- Rewards well-calibrated probabilities
- Used for training AND evaluation
Example:
- Correct prediction with p̂ = 0.99: Loss ≈ 0.01
- Correct prediction with p̂ = 0.51: Loss ≈ 0.67
- Wrong prediction with p̂ = 0.99: Loss ≈ 4.6 (huge penalty!)
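A sketch reproducing these single-example losses, then averaging over a small hypothetical batch (scikit-learn assumed for the batch version):

```python
# Reproducing the single-example losses, then averaging a hypothetical batch.
import math
from sklearn.metrics import log_loss

print(-math.log(0.99))  # ~0.01: confident and correct
print(-math.log(0.51))  # ~0.67: barely correct
print(-math.log(0.01))  # ~4.6:  confident and wrong

y_true = [1, 1, 0, 0]
p_hat  = [0.99, 0.51, 0.01, 0.40]  # predicted probability of the positive class
print(log_loss(y_true, p_hat))     # average over the batch
```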
Brier Score
Mean squared error of probability predictions:
Brier = (1/n) × Σ (p̂ᵢ - yᵢ)²
- Ranges from 0 (perfect) to 1 (worst)
- Lower is better
- Measures both calibration and accuracy
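A minimal sketch on the same hypothetical batch as the log-loss example (scikit-learn assumed):

```python
# Same hypothetical batch as the log-loss sketch (scikit-learn assumed).
from sklearn.metrics import brier_score_loss

y_true = [1, 1, 0, 0]
p_hat  = [0.99, 0.51, 0.01, 0.40]

print(brier_score_loss(y_true, p_hat))  # mean of (p_hat - y)^2; lower is better
```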
Regression Metrics
Mean Squared Error (MSE)
MSE = (1/n) × Σ (yᵢ - ŷᵢ)²
- Penalizes large errors more (squared)
- Units are squared
Root Mean Squared Error (RMSE)
RMSE = √MSE
- Same units as target
- More interpretable than MSE
Mean Absolute Error (MAE)
MAE = (1/n) × Σ |yᵢ - ŷᵢ|
- More robust to outliers
- Linear penalty for errors
R² (Coefficient of Determination)
R² = 1 - (Σ (yᵢ - ŷᵢ)² / Σ (yᵢ - ȳ)²)
- 1.0: Perfect prediction
- 0.0: Predicts mean
- Negative: Worse than mean
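A sketch computing all four regression metrics on hypothetical targets and predictions (scikit-learn assumed):

```python
# Hypothetical targets and predictions (scikit-learn assumed).
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # 0.875, in squared units
print(mse ** 0.5)                           # RMSE ~0.94, same units as the target
print(mean_absolute_error(y_true, y_pred))  # 0.75
print(r2_score(y_true, y_pred))             # ~0.72
```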
Ranking Metrics
For recommendations and search:
Mean Average Precision (MAP)
Average the precision at each relevant position in the ranking:
AP = (1/R) × Σᵢ Precision@i × relevance(i), where R is the number of relevant items
MAP = average AP across queries
Normalized Discounted Cumulative Gain (NDCG)
Accounts for position (higher = better):
DCG = Σ relevanceᵢ / log₂(i + 1)
NDCG = DCG / Ideal DCG
Mean Reciprocal Rank (MRR)
Based on the rank of the first relevant result for each query:
MRR = (1/n) × Σ (1 / rank of first relevant)
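A sketch of MRR computed by hand and NDCG via scikit-learn (assumed available), on hypothetical relevance judgments:

```python
# MRR by hand on hypothetical queries, plus NDCG via scikit-learn (assumed available).
import numpy as np
from sklearn.metrics import ndcg_score

# Rank of the first relevant result for each of three hypothetical queries.
first_relevant_ranks = [1, 3, 2]
mrr = sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)
print(mrr)  # (1 + 1/3 + 1/2) / 3 ~ 0.61

# NDCG: true relevance grades vs. the scores the model used to rank the items.
true_relevance = np.asarray([[3, 2, 0, 1, 0]])
model_scores   = np.asarray([[0.9, 0.7, 0.6, 0.4, 0.1]])
print(ndcg_score(true_relevance, model_scores))
```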
Choosing the Right Metric
| Scenario | Recommended Metric |
|---|---|
| Balanced classes, all errors equal | Accuracy |
| Imbalanced classes | F1, PR-AUC |
| False positives costly | Precision |
| False negatives costly | Recall |
| Need calibrated probabilities | Log Loss, Brier |
| Ranking/recommendations | MAP, NDCG |
| Regression | RMSE, MAE, R² |
Summary
- Accuracy is intuitive but misleading for imbalanced data
- Precision measures how many of our positive predictions are correct (low precision means many false alarms)
- Recall measures how many actual positives we catch (low recall means many misses)
- F1 balances precision and recall
- AUC-ROC summarizes the TPR/FPR trade-off across thresholds (PR-AUC summarizes the precision-recall trade-off)
- Log Loss measures probability calibration
- Choose metrics based on what errors cost in your application
Next, we'll dive deeper into confusion matrices—the foundation for understanding classification performance.

