Data: The Fuel of Machine Learning
There's a saying in the ML world: "Garbage in, garbage out." It's been repeated so many times it has lost some of its punch, but it remains the single most important rule in the entire field. The fanciest model in the world will produce useless predictions if you feed it bad data. Conversely, a simple model on great data will outperform a complicated model on lousy data — almost every time.
This lesson is about data: what it actually is, what makes it good or bad, and how to evaluate it before you trust any ML system built on top of it.
What You'll Learn
- The difference between rows, columns, features, and labels
- The four qualities of "good" ML data
- Common data problems that quietly ruin models
- How to assess data quality with ChatGPT or Claude
What "Data" Actually Looks Like
For most ML, data is a table — exactly like a spreadsheet. Rows are examples, columns are attributes:
| Square Feet | Bedrooms | Neighborhood | Year Built | Price |
|---|---|---|---|---|
| 1200 | 2 | Downtown | 1998 | 425000 |
| 2400 | 4 | Suburb | 2010 | 680000 |
| 850 | 1 | Downtown | 1965 | 310000 |
Vocabulary you should know:
- Row = one example (one house)
- Column = one attribute (square footage)
- Feature = an input column (square feet, bedrooms, neighborhood, year built)
- Label = the answer column you're trying to predict (price)
- Dataset = the whole table
You'll hear "rows and columns" from spreadsheet people and "samples and features" from ML people. Same thing.
Not all data is tabular — images, audio, and text are also data — but tables are the easiest place to start, and they're what you'll use in this course.
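To make the vocabulary concrete, here is the house table above expressed as plain Python. This is just a sketch; the snake_case column names like `sqft` are shorthand I've chosen, not part of the original table headers.

```python
# The house-price table as plain Python. Each dict is one row (one example);
# the keys are columns.
dataset = [
    {"sqft": 1200, "bedrooms": 2, "neighborhood": "Downtown", "year_built": 1998, "price": 425000},
    {"sqft": 2400, "bedrooms": 4, "neighborhood": "Suburb", "year_built": 2010, "price": 680000},
    {"sqft": 850, "bedrooms": 1, "neighborhood": "Downtown", "year_built": 1965, "price": 310000},
]

LABEL = "price"  # the column we want the model to predict

# Features are the input columns; the label is the answer column.
features = [{k: v for k, v in row.items() if k != LABEL} for row in dataset]
labels = [row[LABEL] for row in dataset]

print(features[0])  # inputs for the first house
print(labels)       # [425000, 680000, 310000]
```

Every ML tool you'll meet later makes this same features/label distinction, whether or not it shows you the code.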
What Makes Data "Good"?
Four qualities matter most:
1. Relevant
Every column should plausibly help predict the answer. House price is influenced by size, location, and age. It's not influenced by the seller's birthday. Feeding irrelevant features to a model adds noise and can actively hurt accuracy.
2. Representative
Your training data needs to look like the real world the model will face. If you train a face-recognition model only on light-skinned faces, it will fail on dark-skinned faces. If you train a fraud detector only on data from one country, it won't catch fraud patterns from another. Bias in data becomes bias in predictions. This isn't theoretical — it has caused major real-world harms.
3. Sufficient
You need enough examples for the model to find a real pattern, not just memorize the few cases you showed it. Rule of thumb: hundreds for simple tasks, thousands for moderate ones, millions for complex ones (like image recognition). With modern AI tools you can sometimes get away with less, but more is almost always better.
4. Clean
Real-world data is messy. Common problems:
- Missing values — empty cells where data was never collected
- Inconsistent formats — "USA" vs "U.S.A." vs "United States"
- Duplicates — the same row appearing twice
- Outliers — values that are unusually large or small (sometimes errors, sometimes important)
- Wrong types — numbers stored as text, dates stored as strings
A huge chunk of any data scientist's day is cleaning data. Even no-code ML tools assume your data is reasonably tidy.
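These problems are easy to detect programmatically. Here is a minimal sketch using only Python's standard library, run on a small made-up CSV that exhibits three of the issues above (a missing value, a duplicate row, and a number stored as text):

```python
import csv
import io

# Hypothetical messy CSV: a "USA" vs "U.S.A." inconsistency, a missing
# year, a non-numeric value, and a duplicate of the first row.
raw = """country,year,value
USA,2021,100
U.S.A.,2021,100
USA,,105
USA,2022,abc
USA,2021,100
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Rows with at least one empty cell.
missing = [i for i, r in enumerate(rows) if any(v == "" for v in r.values())]

# Rows identical to an earlier row.
seen = []
dupes = []
for i, r in enumerate(rows):
    key = tuple(r.items())
    if key in seen:
        dupes.append(i)
    seen.append(key)

# Rows where "value" is not a plain number.
non_numeric = [i for i, r in enumerate(rows) if not r["value"].isdigit()]

print(missing, dupes, non_numeric)  # [2] [4] [3]
```

Note that the "USA" vs "U.S.A." inconsistency slips through these checks; format inconsistencies usually need human judgment or explicit normalization rules.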
The Three Stages of Data in an ML Project
- Training data — the examples the model learns from (typically 70–80% of your dataset)
- Validation data — used to tune the model and pick the best version (10–15%)
- Test data — held back until the very end to check real performance (10–15%)
Why split it? Because a model can memorize examples it has already seen. The only honest test of whether it learned the pattern is to check it against examples it has never seen before.
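The split above can be sketched in a few lines of plain Python. This illustrates the 80/10/10 proportions from the list; in practice you'd typically use a library helper such as scikit-learn's `train_test_split`.

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

examples = list(range(100))  # stand-ins for 100 rows of a dataset
random.shuffle(examples)     # shuffle first so each split is representative

n = len(examples)
train = examples[:int(0.8 * n)]            # 80%: what the model learns from
val = examples[int(0.8 * n):int(0.9 * n)]  # 10%: for tuning and model selection
test = examples[int(0.9 * n):]             # 10%: held back for the final check

print(len(train), len(val), len(test))  # 80 10 10
```

The shuffle matters: if your rows are sorted (say, by date or by price), slicing without shuffling gives the model a skewed view of the world in every split.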
Hands-On: Audit a Dataset with AI
Let's practice spotting data problems. Open ChatGPT or Claude and paste this prompt:
"I'm going to share a small dataset. Act as a data quality auditor. List every potential issue you can spot — missing values, inconsistencies, irrelevant columns, biases, suspicious outliers — and rate the dataset's overall ML-readiness from 1–10 with a one-line justification.
Dataset (CSV):
Name,Age,Email,Salary,JoinedYear,FavoriteColor
Alice Chen,29,alice@x.com,72000,2021,Blue
Bob,,bob@,68000,2022,
Carol Lee,34,carol@y.com,72000,2021,Green
Carol Lee,34,carol@y.com,72000,2021,Green
David Park,1992,david@z.com,1850000,2023,Red
Eve,28,eve@y.com,69000,not sure,blue"
Within seconds you'll get a list pointing out:
- Missing age for Bob, a malformed email ("bob@"), and a missing favorite color
- Duplicate row for Carol Lee
- David's age looks like a birth year (1992)
- David's salary is suspiciously high — likely typo
- "Blue" vs "blue" — case inconsistency
- "JoinedYear" is sometimes "not sure"
- FavoriteColor has nothing to do with salary — irrelevant
Try this prompt with any messy dataset you have. It's one of the most useful no-code data skills you can develop.
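If you'd rather script part of this audit, the outlier check can be done with Python's standard library. A minimal sketch using the salaries from the sample dataset; the 2-standard-deviation cutoff is just one common heuristic, not a universal rule.

```python
import statistics

# Salaries from the sample dataset above; David's 1,850,000 is the suspect value.
salaries = [72000, 68000, 72000, 72000, 1850000, 69000]

mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)

# Rough heuristic: flag anything more than 2 standard deviations from the mean.
outliers = [s for s in salaries if abs(s - mean) > 2 * stdev]
print(outliers)  # [1850000]
```

Remember the caveat from the cleaning section: an outlier is sometimes an error and sometimes the most important row in the dataset, so flag first, then investigate.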
How Data Volume Affects Modern Tools
A myth worth busting: not every ML problem needs Big Data. With tools you'll use later in this course:
- Google Teachable Machine — works with as few as 10–20 images per class
- ChatGPT data analysis — happily works with a small CSV
- AI auto-classifiers — modern foundation models can classify with just a handful of examples ("few-shot learning")
The "you need millions of examples" advice was true for old ML. It's only partly true now.
Data Privacy: A Five-Second Sanity Check
Before you upload any dataset to an AI tool, ask:
- Does it contain personally identifiable information (PII): names, emails, phone numbers, IDs?
- Is it confidential to your employer or school?
- Would you be okay if it leaked publicly?
If any answer makes you nervous, anonymize first (replace names with "Person 1", strip emails, redact IDs) or use a tool that supports private workspaces. We'll come back to this in Module 4.
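A basic anonymization pass can be scripted too. Here is a minimal sketch on two made-up CSV rows; the names and emails are invented for illustration.

```python
# Hypothetical rows containing PII (name and email) alongside safe columns.
rows = [
    "Alice Chen,29,alice@x.com,72000",
    "Bob Diaz,31,bob@y.com,68000",
]

anonymized = []
for i, row in enumerate(rows, start=1):
    name, age, email, salary = row.split(",")
    # Swap the name for a numbered placeholder and redact the email.
    anonymized.append(f"Person {i},{age},REDACTED,{salary}")

print(anonymized)
```

For real datasets you'd want something more robust (IDs can hide in free-text columns, and combinations of "safe" columns can still identify someone), but the principle is the same: strip or replace identifying fields before anything leaves your machine.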
Key Takeaways
- ML data is usually a table where rows are examples and columns are features (inputs) and labels (answers)
- Good data is relevant, representative, sufficient, and clean — in that order of impact
- Bias in training data becomes bias in predictions, often invisibly
- Always split data into training, validation, and test sets to honestly measure performance
- ChatGPT and Claude are excellent first-pass data auditors — use them before building any model
- Never upload sensitive data to public AI tools without anonymizing first
You now have the conceptual foundation. In Module 2 we get hands-on: using ChatGPT, Claude, Gemini, and Perplexity to actually do ML work.

