Loading Real Data with pandas
The app starts with data. In this lesson you load a dataset into pandas, look at it, and clean it just enough to be useful. We are not going deep on pandas here. We use only the handful of methods the app needs, because pandas is a tool in service of the project, not the subject of the project.
You can run every example below right in the page. The first run loads the Python engine, so give it a few seconds.
What You'll Learn
- Read a CSV into a pandas DataFrame
- Inspect a dataset quickly with head, shape, info, and describe
- Select and filter the columns and rows you care about
- Handle the two most common messes: missing values and wrong types
- Build a small, reusable function that loads and cleans data
A DataFrame is just a table
A DataFrame is pandas' name for a table: rows and columns, like a spreadsheet. In the real app you will read it from a file with pd.read_csv("yourfile.csv"). Here in the playground there is no file system, so we create the same kind of table directly from a dictionary. Everything you learn applies identically to a loaded CSV.
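As a minimal sketch of building a table from a dictionary, the example below uses made-up sales rows. The column names (name, region, revenue) and the values are assumptions chosen for illustration, not data from a real file.

```python
import pandas as pd

# Hypothetical sales data: each key is a column name, each list is that column's values.
# In the real app this table would come from pd.read_csv("yourfile.csv") instead.
data = {
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", "South", "North", "West"],
    "revenue": [1200.0, 850.0, 430.0, 990.0],
}

df = pd.DataFrame(data)
print(df)
```

Every list must have the same length, because each position across the lists forms one row.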
Look before you leap
Before you do anything with data, look at it. Four methods cover almost every first glance.
- df.head() shows the first few rows.
- df.shape gives (rows, columns).
- df.info() lists columns, their types, and how many values are missing.
- df.describe() gives quick statistics for the numeric columns.
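The four calls can be tried on a small table like this. The rows are invented for illustration. Note that shape is an attribute (no parentheses), while the other three are methods.

```python
import pandas as pd

# Hypothetical sales table used only for this demonstration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", "South", "North", "West"],
    "revenue": [1200.0, 850.0, 430.0, 990.0],
})

print(df.head(2))     # first two rows (the default is five)
print(df.shape)       # (4, 3): four rows, three columns
df.info()             # prints column names, types, and non-null counts itself
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(summary)
```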
That describe() output (count, mean, min, max, and so on) is gold for the AI step later. A compact statistical summary is exactly the kind of small, information-dense slice you want to send to a model.
Select the columns you care about
You rarely need every column. Pick the ones that matter by passing a list of names.
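A short sketch of column selection, again on made-up rows. The doubled square brackets matter: the outer pair indexes the DataFrame and the inner pair is the list of names.

```python
import pandas as pd

# Hypothetical sales table used only for this demonstration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", "South", "North", "West"],
    "revenue": [1200.0, 850.0, 430.0, 990.0],
})

# A list of names returns a smaller DataFrame with just those columns.
subset = df[["name", "revenue"]]
print(subset)

# A single name (no inner list) returns one column as a Series.
revenue = df["revenue"]
print(revenue)
```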
Filter the rows you care about
Filtering uses a condition inside the square brackets. The condition produces True/False for each row, and pandas keeps the True rows.
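The sketch below separates the two steps so the True/False column is visible before it is used. The data and the 900 threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sales table used only for this demonstration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", "South", "North", "West"],
    "revenue": [1200.0, 850.0, 430.0, 990.0],
})

# Step 1: the condition yields one True/False value per row.
is_big = df["revenue"] > 900
print(is_big)

# Step 2: indexing with that boolean Series keeps only the True rows.
big = df[is_big]
print(big)  # the Ana and Dee rows

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
north_big = df[(df["region"] == "North") & (df["revenue"] > 900)]
print(north_big)  # the Ana row only
```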
Clean the two most common messes
Real CSVs are messy. The two problems you will hit constantly are missing values and numbers stored as text.
Missing values show up as NaN. You can drop those rows with dropna() or fill them with fillna().
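Here is a small sketch of both options on invented rows with two gaps. The fill values ("Unknown" and 0.0) are assumptions for illustration; whether filling or dropping is right depends on what the column means in your data.

```python
import pandas as pd

# Hypothetical table with two gaps: Ben has no region, Dee has no revenue.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Dee"],
    "region": ["North", None, "West"],
    "revenue": [1200.0, 850.0, None],
})

# Count missing values per column before deciding what to do.
print(df.isna().sum())

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()
print(dropped)  # only the Ana row remains

# Option 2: fill gaps, with a different replacement per column.
filled = df.fillna({"region": "Unknown", "revenue": 0.0})
print(filled)
```

Both methods return a new DataFrame and leave df unchanged.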
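Wrong types happen when a number column arrives as text (often because of stray symbols). pd.to_numeric converts text that looks like a number, and errors="coerce" turns anything it cannot convert (including values that still carry stray symbols, such as "$1,200") into NaN instead of raising an error.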
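The following is a minimal sketch that combines type coercion with dropping incomplete rows. The rows are invented for illustration: "Chen" has a revenue value that cannot be parsed, and "Ben" has no region.

```python
import pandas as pd

# Hypothetical raw data: revenue arrived as text, and two rows are flawed.
raw = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", None, "North", "West"],
    "revenue": ["1200", "850", "n/a", "990"],
})

# Convert text to numbers; "n/a" cannot be parsed, so it becomes NaN.
raw["revenue"] = pd.to_numeric(raw["revenue"], errors="coerce")

# Drop any row that now has a missing value in any column.
clean = raw.dropna()
print(clean)
```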
Notice the "Chen" row disappeared (its revenue could not be parsed) and the "Ben" row disappeared (missing region). That is the point: cleaning decides what the rest of your app gets to trust.
Wrap it in a function
Professional code packages a job into a function so you can reuse and test it. Here is a small load_and_clean that takes a DataFrame, fixes types, drops bad rows, and returns the clean result. In the real app, the first line would read from a file instead.
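One possible shape for such a function is sketched below. The numeric_cols parameter and the sample rows are assumptions made for this illustration; the lesson's own version may differ in its arguments.

```python
import pandas as pd

def load_and_clean(raw: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Return a cleaned copy of raw: numeric columns parsed, incomplete rows dropped."""
    df = raw.copy()  # work on a copy so the caller's data is not modified
    for col in numeric_cols:
        # Unparseable values become NaN rather than raising an error.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Remove rows with any missing value, then renumber the rows from 0.
    return df.dropna().reset_index(drop=True)

# Hypothetical raw data standing in for the contents of a CSV file.
raw = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "region": ["North", None, "North", "West"],
    "revenue": ["1200", "850", "n/a", "990"],
})

clean = load_and_clean(raw, numeric_cols=["revenue"])
print(clean)
```

Copying the input is a deliberate choice here: the function has no side effects, which makes it easier to test and to call more than once on the same raw table.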
In your real project, the only change is the source. Instead of building raw by hand you would write raw = pd.read_csv("sales.csv") and then call load_and_clean(raw, ...) exactly as above. That clean DataFrame is what we will summarize and hand to the AI model in the next lessons.
Key Takeaways
- A DataFrame is a table; load real files with pd.read_csv("file.csv").
- Always look first: head, shape, info, describe.
- Select columns with df[["a", "b"]] and filter rows with a boolean condition like df[df["x"] > 0].
- The two big cleanups are missing values (dropna/fillna) and bad types (pd.to_numeric(..., errors="coerce")).
- Package load-and-clean into a function so the rest of the app gets clean, trustworthy data.

