Evaluating and Trusting ML Models
A model that "looks accurate" can be silently catastrophic. One that looks "only okay" can be exactly what you need. The skill that separates an AI-literate professional from someone who just plays with chatbots is the ability to evaluate ML output — to ask "how good is this, really, and where might it fail?" In this lesson you'll learn the core metrics, how to read them, how to spot the most common failure modes, and a checklist you can apply to any model — including AI tools you didn't build.
What You'll Learn
- The four most important metrics for ML models, in plain English
- The difference between training accuracy and real-world accuracy (and why it matters)
- How to spot overfitting, underfitting, and biased models
- A 7-question evaluation checklist you can use anywhere
Why Evaluation Matters More Than Building
Building a model is easier than ever. Trusting one is harder than ever. A few real examples:
- Amazon scrapped an internal hiring AI in 2018 because it had learned to penalize resumes from women.
- The Apple Card was investigated in 2019 after reports that men received higher credit limits than equally qualified women.
- Several COVID diagnosis models built in 2020 turned out to be detecting artifacts like text markers and patient positioning in the scans, not the disease itself.
In every case, the training numbers looked great. The real-world numbers were the problem.
Metric 1: Accuracy (Useful, Sometimes Misleading)
Accuracy is the simplest metric: out of all the predictions, what fraction were correct?
Accuracy = (correct predictions) / (total predictions)
Easy to understand, easy to misinterpret. The classic gotcha: imbalanced data. Imagine a model that predicts whether emails are spam. If 95% of emails are not spam and your model just predicts "not spam" for everything, you get 95% accuracy — and a useless model.
Lesson: Always ask, "What's the accuracy of just predicting the majority class?" If your model only narrowly beats that baseline, you don't have a good model.
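Here's a minimal sketch of that baseline check in Python (the email counts are made up for illustration):

```python
# Toy illustration: 95 legitimate emails, 5 spam (invented counts).
y_true = ["not spam"] * 95 + ["spam"] * 5

# A "model" that always predicts the majority class.
y_pred = ["not spam"] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Majority-class baseline accuracy: {accuracy:.0%}")  # 95%
```

Any real model on this data has to clear 95% before its accuracy means anything at all.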
Metric 2: Precision and Recall (The Power Couple)
When the cost of being wrong is uneven, you need precision and recall:
- Precision — of all the things the model said were positive, how many actually were?
- Recall — of all the actual positives, how many did the model catch?
Examples:
- Spam filter — high precision matters most (you don't want real emails marked as spam)
- Cancer screening — high recall matters most (missing real cancers is much worse than a few false alarms)
- Fraud detection — needs balance; both false positives (annoyed customers) and false negatives (missed fraud) hurt
A useful intuition: precision asks "was I right when I raised the alarm?" while recall asks "did I catch all the things worth alarming about?"
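Written as formulas (the terms are defined in the confusion matrix section below):

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)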
Metric 3: F1 Score (When You Need One Number)
F1 combines precision and recall into a single number (technically the harmonic mean). It's useful when you need one metric to compare models, but precision and recall by themselves are usually more interpretable.
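As a formula:

F1 = 2 × (precision × recall) / (precision + recall)

Because it's a harmonic mean, F1 is pulled toward the worse of the two numbers, so a model can't hide terrible recall behind great precision.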
Metric 4: Confusion Matrix (The Truth Layout)
For a binary classifier, a confusion matrix is a 2×2 table showing all four possible outcomes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Every other classification metric is just a ratio of these four cells. When in doubt, look at the confusion matrix — it shows you where the model is failing.
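A minimal sketch of that idea using scikit-learn, with the four cells recombined into the earlier metrics (the labels are invented for illustration; 1 = spam):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Invented ground truth and predictions for a spam classifier (1 = spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# scikit-learn returns the matrix as [[TN, FP], [FN, TP]] for 0/1 labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

# Every classification metric is a ratio of these four cells.
print("precision:", tp / (tp + fp), "=", precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), "=", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```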
Training vs Real-World Accuracy
A model can be 99% accurate on training data and useless in production. The reasons:
- Overfitting — the model memorized the training data instead of learning the pattern. Like a student who memorizes practice answers but fails when questions are reworded.
- Distribution shift — the real world no longer looks like the training data. Many supply-chain forecasting models trained before COVID broke in 2020.
- Data leakage — information that wouldn't be available at prediction time accidentally got into the training data. The model "cheats" without you noticing.
The most important defense is to test on data the model has never seen — your held-out test set.
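A minimal sketch of that defense with scikit-learn, on synthetic data (exact numbers will vary by run):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained decision tree happily memorizes its training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.00
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```

A large gap between those two numbers is the classic overfitting signal from the table below.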
Spotting Overfitting (and Underfitting)
| Symptom | Likely cause | Fix |
|---|---|---|
| Great on training, terrible on test data | Overfitting | More data, simpler model, regularization |
| Mediocre on training AND test | Underfitting | More features, more complex model, longer training |
| Great everywhere except in production | Distribution shift | Retrain on recent data; monitor over time |
| Suspiciously high accuracy | Data leakage | Audit the features; remove future-knowledge inputs |
Evaluating AI Tools (ChatGPT, Gemini, Claude)
You can apply the same evaluation mindset to AI assistants — even though you didn't train them. A useful test:
"Run this prompt in two different AI tools. Compare the answers along: factual accuracy, source citations, completeness, and the tool's expressed confidence. Score each from 1–10 in each category."
Then independently verify the most important claims (using Perplexity or a search engine). You're now doing AI evaluation — a skill many companies hire for explicitly.
The 7-Question Evaluation Checklist
Apply this to any ML system, including AI tools, before trusting its output:
- What is the baseline? (How well would the simplest possible approach do?)
- What metrics matter? (Accuracy, precision, recall, fairness, latency, cost?)
- What's the test data? (Is it similar to real-world conditions?)
- Where will it fail? (Edge cases, rare classes, under-represented groups?)
- What's the cost of wrong predictions? (False positives vs false negatives — are they symmetric?)
- Is there bias? (Does performance differ across demographics, regions, or other groups?)
- How will it stay good over time? (Monitoring, retraining, re-evaluation cadence?)
Run this checklist on any model you build OR any AI tool you rely on. It will save you embarrassment.
Hands-On: Evaluate ChatGPT on a Specific Task
Try this:
- Pick a task with ground truth (say, "summarize the lead section of this Wikipedia article").
- Run it 5 times in ChatGPT, 5 times in Claude.
- Score each summary 1–5 on accuracy, completeness, and clarity.
- Calculate the average for each tool.
- Note the variance — how often did the same prompt give different answers?
You've now run a tiny but real evaluation study. Repeat it for any task you do regularly with AI to discover which tool actually wins for your use case.
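A minimal sketch of the averaging and spread steps using Python's statistics module (the scores are invented for illustration):

```python
import statistics

# Invented 1-5 scores from five runs of the same prompt in each tool.
scores = {
    "ChatGPT": [4, 5, 3, 4, 4],
    "Claude":  [4, 4, 4, 5, 4],
}

for tool, s in scores.items():
    # Mean tells you which tool wins on average; standard deviation
    # tells you how consistent it is from run to run.
    print(f"{tool}: mean={statistics.mean(s):.2f}, stdev={statistics.stdev(s):.2f}")
```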
Cost vs Quality (A Quiet Reality)
Modern AI tools have a cost. Free tiers typically run smaller, faster, cheaper model variants; paid tiers run larger, slower, more expensive models that are generally more accurate. Part of evaluation is asking:
- Is the cheap version good enough for this task?
- Is the expensive version's improvement worth the price for my use case?
- Can I use a cheaper model to draft and a stronger model to check?
This is a real engineering trade-off, even at the no-code level.
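One way to sketch the draft-then-check pattern from the last question; `call_model` is a hypothetical placeholder, not a real API, so swap in whatever client or tool you actually use:

```python
def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical placeholder: replace with your real API client."""
    raise NotImplementedError

def draft_then_check(task: str) -> str:
    # The cheap model does the bulk of the work: producing a first draft.
    draft = call_model("cheap-model", f"Draft a response to this task: {task}")
    # The stronger, pricier model only reviews and corrects that draft,
    # which usually costs far fewer tokens than generating from scratch.
    return call_model("strong-model", f"Review and fix this draft:\n{draft}")
```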
Today's Hands-On Mini-Project
Pick one and complete it before moving on:
- Run the 7-question evaluation checklist on your Teachable Machine model from Lesson 7.
- Run the head-to-head ChatGPT vs Claude experiment described above on a task you do often.
- Find one news story about an ML failure (Amazon hiring, COMPAS recidivism, etc.) and write a 5-bullet summary of what went wrong using the metrics in this lesson.
Key Takeaways
- A model that looks accurate can be silently broken; evaluation is the most undervalued ML skill
- Accuracy alone is misleading on imbalanced data — use precision, recall, and the confusion matrix
- Overfitting, distribution shift, and data leakage are the main reasons "good" models fail in production
- Always evaluate on data the model never saw during training
- The 7-question checklist works for both models you build and AI tools you use
- Pair models or tools (cheap + expensive, ChatGPT + Claude) for quality control
Next: the ethical and bias issues that make all of this matter — because models with great metrics can still cause real harm if they encode the wrong assumptions.

