Data: The Fuel of Machine Learning
There's a saying in the ML world: "Garbage in, garbage out." It's been repeated so many times it has lost some of its punch, but it remains the single most important rule in the entire field. The fanciest model in the world will produce useless predictions if you feed it bad data. Conversely, a simple model on great data will outperform a complicated model on lousy data — almost every time.
This lesson is about data: what it actually is, what makes it good or bad, and how to evaluate it before you trust any ML system built on top of it.
What You'll Learn
- The difference between rows, columns, features, and labels
- The four qualities of "good" ML data
- Common data problems that quietly ruin models
- How to assess data quality with ChatGPT or Claude
What "Data" Actually Looks Like
For most ML, data is a table — exactly like a spreadsheet. Rows are examples, columns are attributes:
| Square Feet | Bedrooms | Neighborhood | Year Built | Price |
|---|---|---|---|---|
| 1200 | 2 | Downtown | 1998 | 425000 |
| 2400 | 4 | Suburb | 2010 | 680000 |
| 850 | 1 | Downtown | 1965 | 310000 |
Vocabulary you should know:
- Row = one example (one house)
- Column = one attribute (square footage)
- Feature = an input column (square feet, bedrooms, neighborhood, year built)
- Label = the answer column you're trying to predict (price)
- Dataset = the whole table
You'll hear "rows and columns" from spreadsheet people and "samples and features" from ML people. Same thing.
Not all data is tabular — images, audio, and text are also data — but tables are the easiest place to start, and they're what you'll use in this course.
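To make the vocabulary concrete, here is the house table above expressed as plain Python. This is just a sketch; the snake_case column names like `sqft` are shorthand I've chosen, not part of the original table headers.

```python
# The house-price table as plain Python. Each dict is one row (one example);
# the keys are columns.
dataset = [
    {"sqft": 1200, "bedrooms": 2, "neighborhood": "Downtown", "year_built": 1998, "price": 425000},
    {"sqft": 2400, "bedrooms": 4, "neighborhood": "Suburb", "year_built": 2010, "price": 680000},
    {"sqft": 850, "bedrooms": 1, "neighborhood": "Downtown", "year_built": 1965, "price": 310000},
]

LABEL = "price"  # the column we want the model to predict

# Features are the input columns; the label is the answer column.
features = [{k: v for k, v in row.items() if k != LABEL} for row in dataset]
labels = [row[LABEL] for row in dataset]

print(features[0])  # inputs for the first house
print(labels)       # [425000, 680000, 310000]
```

Every ML tool you'll meet later makes this same features/label distinction, whether or not it shows you the code.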
What Makes Data "Good"?
Four qualities matter most:
1. Relevant
Every column should plausibly help predict the answer. House price is influenced by size, location, and age. It's not influenced by the seller's birthday. Feeding irrelevant features to a model adds noise and can actively hurt accuracy.
2. Representative
Your training data needs to look like the real world the model will face. If you train a face-recognition model only on light-skinned faces, it will fail on dark-skinned faces. If you train a fraud detector only on data from one country, it won't catch fraud patterns from another. Bias in data becomes bias in predictions. This isn't theoretical — it has caused major real-world harms.
3. Sufficient
You need enough examples for the model to find a real pattern, not just memorize the few cases you showed it. Rule of thumb: hundreds for simple tasks, thousands for moderate ones, millions for complex ones (like image recognition). With modern AI tools you can sometimes get away with less, but more is almost always better.
4. Clean
Real-world data is messy. Common problems:
- Missing values — empty cells where data was never collected
- Inconsistent formats — "USA" vs "U.S.A." vs "United States"
- Duplicates — the same row appearing twice
- Outliers — values that are unusually large or small (sometimes errors, sometimes important)
- Wrong types — numbers stored as text, dates stored as strings
A huge chunk of any data scientist's day is cleaning data. Even no-code ML tools assume your data is reasonably tidy.
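These problems are easy to detect programmatically. Here is a minimal sketch using only Python's standard library, run on a small made-up CSV that exhibits three of the issues above (a missing value, a duplicate row, and a number stored as text):

```python
import csv
import io

# Hypothetical messy CSV: a "USA" vs "U.S.A." inconsistency, a missing
# year, a non-numeric value, and a duplicate of the first row.
raw = """country,year,value
USA,2021,100
U.S.A.,2021,100
USA,,105
USA,2022,abc
USA,2021,100
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Rows with at least one empty cell.
missing = [i for i, r in enumerate(rows) if any(v == "" for v in r.values())]

# Rows identical to an earlier row.
seen = []
dupes = []
for i, r in enumerate(rows):
    key = tuple(r.items())
    if key in seen:
        dupes.append(i)
    seen.append(key)

# Rows where "value" is not a plain number.
non_numeric = [i for i, r in enumerate(rows) if not r["value"].isdigit()]

print(missing, dupes, non_numeric)  # [2] [4] [3]
```

Note that the "USA" vs "U.S.A." inconsistency slips through these checks; format inconsistencies usually need human judgment or explicit normalization rules.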
The Three Stages of Data in an ML Project
- Training data — the examples the model learns from (typically 70–80% of your dataset)
- Validation data — used to tune the model and pick the best version (10–15%)
- Test data — held back until the very end to check real performance (10–15%)
Why split it? Because a model can memorize examples it has already seen. The only honest test of whether it learned the pattern is to check it against examples it has never seen before.
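The split above can be sketched in a few lines of plain Python. This illustrates the 80/10/10 proportions from the list; in practice you'd typically use a library helper such as scikit-learn's `train_test_split`.

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

examples = list(range(100))  # stand-ins for 100 rows of a dataset
random.shuffle(examples)     # shuffle first so each split is representative

n = len(examples)
train = examples[:int(0.8 * n)]            # 80%: what the model learns from
val = examples[int(0.8 * n):int(0.9 * n)]  # 10%: for tuning and model selection
test = examples[int(0.9 * n):]             # 10%: held back for the final check

print(len(train), len(val), len(test))  # 80 10 10
```

The shuffle matters: if your rows are sorted (say, by date or by price), slicing without shuffling gives the model a skewed view of the world in every split.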
Hands-On: Audit a Dataset with AI
Let's practice spotting data problems. Open ChatGPT or Claude and paste this prompt:
"I'm going to share a small dataset. Act as a data quality auditor. List every potential issue you can spot — missing values, inconsistencies, irrelevant columns, biases, suspicious outliers — and rate the dataset's overall ML-readiness from 1–10 with a one-line justification.
Dataset (CSV):
Name,Age,Email,Salary,JoinedYear,FavoriteColor
Alice Chen,29,alice@x.com,72000,2021,Blue
Bob,,bob@,68000,2022,
Carol Lee,34,carol@y.com,72000,2021,Green
Carol Lee,34,carol@y.com,72000,2021,Green
David Park,1992,david@z.com,1850000,2023,Red
Eve,28,eve@y.com,69000,not sure,blue"
Within seconds you'll get a list pointing out:
- Missing age for Bob, a malformed email ("bob@"), and a missing favorite color
- Duplicate row for Carol Lee
- David's age looks like a birth year (1992)
- David's salary is suspiciously high — likely typo
- "Blue" vs "blue" — case inconsistency
- "JoinedYear" is sometimes "not sure"
- FavoriteColor has nothing to do with salary — irrelevant
Try this prompt with any messy dataset you have. It's one of the most useful no-code data skills you can develop.
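If you'd rather script part of this audit, the outlier check can be done with Python's standard library. A minimal sketch using the salaries from the sample dataset; the 2-standard-deviation cutoff is just one common heuristic, not a universal rule.

```python
import statistics

# Salaries from the sample dataset above; David's 1,850,000 is the suspect value.
salaries = [72000, 68000, 72000, 72000, 1850000, 69000]

mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)

# Rough heuristic: flag anything more than 2 standard deviations from the mean.
outliers = [s for s in salaries if abs(s - mean) > 2 * stdev]
print(outliers)  # [1850000]
```

Remember the caveat from the cleaning section: an outlier is sometimes an error and sometimes the most important row in the dataset, so flag first, then investigate.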
How Data Volume Affects Modern Tools
A myth worth busting: not every ML problem needs Big Data. With tools you'll use later in this course:
- Google Teachable Machine — works with as few as 10–20 images per class
- ChatGPT data analysis — happily works with a small CSV
- AI auto-classifiers — modern foundation models can classify with just a handful of examples ("few-shot learning")
The "you need millions of examples" advice was true for old ML. It's only partly true now.
Data Privacy: A Five-Second Sanity Check
Before you upload any dataset to an AI tool, ask:
- Does it contain personally identifiable information (PII): names, emails, phone numbers, IDs?
- Is it confidential to your employer or school?
- Would you be okay if it leaked publicly?
If any answer makes you nervous, anonymize first (replace names with "Person 1", strip emails, redact IDs) or use a tool that supports private workspaces. We'll come back to this in Module 4.
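A basic anonymization pass can be scripted too. Here is a minimal sketch on two made-up CSV rows; the names and emails are invented for illustration.

```python
# Hypothetical rows containing PII (name and email) alongside safe columns.
rows = [
    "Alice Chen,29,alice@x.com,72000",
    "Bob Diaz,31,bob@y.com,68000",
]

anonymized = []
for i, row in enumerate(rows, start=1):
    name, age, email, salary = row.split(",")
    # Swap the name for a numbered placeholder and redact the email.
    anonymized.append(f"Person {i},{age},REDACTED,{salary}")

print(anonymized)
```

For real datasets you'd want something more robust (IDs can hide in free-text columns, and combinations of "safe" columns can still identify someone), but the principle is the same: strip or replace identifying fields before anything leaves your machine.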
Key Takeaways
- ML data is usually a table where rows are examples and columns are features (inputs) and labels (answers)
- Good data is relevant, representative, sufficient, and clean — in that order of impact
- Bias in training data becomes bias in predictions, often invisibly
- Always split data into training, validation, and test sets to honestly measure performance
- ChatGPT and Claude are excellent first-pass data auditors — use them before building any model
- Never upload sensitive data to public AI tools without anonymizing first
You now have the conceptual foundation. In Module 2 we get hands-on: using ChatGPT, Claude, Gemini, and Perplexity to actually do ML work.

