Pandas for Data Wrangling with AI
Pandas is the library that turns Python into a data tool. Once you can load a CSV into a DataFrame, you can filter, group, aggregate, join, and transform data faster than you ever could in Excel — and you have an audit trail of every step. Pandas is the workhorse of every data scientist, every data analyst, and every quantitative researcher.
This lesson teaches you the 80% of pandas you will use 80% of the time, with AI prompts to fill the gaps.
What You'll Learn
- How to load a CSV and inspect it in three lines
- The six pandas operations you cannot live without: select, filter, sort, group, aggregate, merge
- How to handle missing values without silently corrupting your analysis
- AI prompts that turn pandas questions into working code
Loading Data
Three ways you will load data in this course:
import pandas as pd
# From a URL
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
# From a local file
df = pd.read_csv("tips.csv")
# Upload to Colab first, then read
from google.colab import files
files.upload() # opens a file picker in your browser
df = pd.read_csv("tips.csv")
The first thing to do after loading anything is inspect it. Always.
df.head() # first 5 rows
df.tail() # last 5 rows
df.info() # column names, types, missing-value counts
df.describe() # quick stats for numeric columns
df.shape # (rows, columns)
df.columns # column names (an Index, not a list)
If you skip these checks, you will compute averages over the wrong column, miss missing data, and end up debugging "the data" when the real problem is that you never looked at it. Always inspect.
Selecting Columns and Rows
df["total_bill"] # one column (a Series)
df[["total_bill", "tip"]] # two columns (a DataFrame)
df.iloc[0] # first row by position
df.iloc[0:5] # first 5 rows by position
df.loc[df["sex"] == "Female"] # all rows where sex is Female
iloc is by integer position. loc is by label or boolean condition. The most common mistake is mixing them up — when in doubt, ask AI:
What is the difference between df.iloc[0] and df.loc[0]? When would each give a different result?
The answer reveals that iloc always uses position, while loc uses the index — which may not be sequential numbers if you have done filtering or grouping.
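To see the difference concretely, here is a minimal sketch; which index labels survive depends on your data, so the label is fetched from the subset rather than assumed:
subset = df[df["tip"] > 5] # filtered rows keep their original index labels
subset.iloc[0] # first row of the subset by position, whatever its label is
subset.loc[subset.index[0]] # the same row, fetched by its surviving label
# subset.loc[0] raises a KeyError if no row labeled 0 survived the filter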
Filtering Rows
The same boolean mask trick from NumPy works here:
big_tippers = df[df["tip"] > 5]
female_smokers = df[(df["sex"] == "Female") & (df["smoker"] == "Yes")]
weekend = df[df["day"].isin(["Sat", "Sun"])]
& is and, | is or, and parentheses around each condition are required — leaving them off gives a cryptic error.
Sorting
df.sort_values("tip", ascending=False).head(10)
df.sort_values(["day", "tip"], ascending=[True, False])
The first sorts by tip descending. The second sorts by day ascending, then within each day by tip descending — multi-level sorts are common when ranking inside categories.
Group By and Aggregate
This is the most powerful pandas operation, and the one that most often replaces an Excel pivot table.
df.groupby("day")["total_bill"].mean()
That returns the average bill per day of the week. Read it like a sentence: "Group by day. Take the total_bill column. Compute the mean for each group."
You can compute multiple aggregations at once:
df.groupby("day").agg({
"total_bill": ["mean", "median", "count"],
"tip": ["mean", "max"],
})
You can group by multiple columns:
df.groupby(["day", "sex"])["tip"].mean().reset_index()
The .reset_index() at the end converts the grouped result back into a flat DataFrame, which is what you want for plotting and exporting.
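To make the payoff concrete, here is a small sketch of exporting the flat result; the file name is just an example:
by_day_sex = df.groupby(["day", "sex"])["tip"].mean().reset_index()
by_day_sex.to_csv("tip_by_day_sex.csv", index=False) # flat columns export cleanly
# Without reset_index(), day and sex would form a MultiIndex rather than columns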
Adding and Modifying Columns
df["tip_pct"] = df["tip"] / df["total_bill"] * 100
df["big_tipper"] = df["tip"] > 5
df["meal_size"] = df["size"].apply(
lambda s: "small" if s <= 2 else "medium" if s <= 4 else "large"
)
The third example uses .apply() with a lambda — apply a function to each value in a column. Useful for one-off transformations.
For real production code, you would use np.where() or pd.cut() instead, but .apply() with a lambda is fine for learning.
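If you are curious what those production-style versions look like, here is a sketch using the same cutoffs (the big_party column is just for illustration):
import numpy as np
# np.where handles a simple two-way split in one vectorized call
df["big_party"] = np.where(df["size"] > 4, "large", "small or medium")
# pd.cut handles the full three-way binning without a Python-level loop
df["meal_size"] = pd.cut(
    df["size"],
    bins=[0, 2, 4, float("inf")],
    labels=["small", "medium", "large"],
)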
Handling Missing Values
Real data has holes. Pandas represents missing values as NaN (not a number). Your job is to know when they exist and what to do about them.
df.isna().sum() # count of missing values per column
df.dropna() # drop rows with any missing value
df.dropna(subset=["tip"]) # drop only if tip is missing
df["age"].fillna(df["age"].mean()) # fill with the mean
df["category"].fillna("Unknown") # fill with a placeholder
The dangerous pattern is calling .mean() on a column with missing values without realizing it. By default, pandas skips them, so df["score"].mean() averages over the non-missing scores only. If you have a lot of missing values, that average can be misleading.
Always run df.isna().sum() after loading and again after any complex transformation. Make missing values visible.
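To see the NaN-skipping behavior in isolation, here is a tiny sketch with toy data:
scores = pd.Series([80, 90, None]) # None becomes NaN on construction
scores.mean() # 85.0: the mean of the two non-missing values, not of three
scores.isna().sum() # 1, the signal to check before trusting that mean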
Merging Two DataFrames
merge is pandas' equivalent of a SQL JOIN. You combine two DataFrames on a shared key.
customers = pd.DataFrame({
"customer_id": [1, 2, 3],
"name": ["Alice", "Bob", "Cara"],
})
orders = pd.DataFrame({
"customer_id": [1, 1, 2, 4],
"amount": [100, 200, 50, 300],
})
joined = orders.merge(customers, on="customer_id", how="left")
how="left" keeps every row in the left DataFrame (orders), filling missing customer info with NaN where it does not match. The four how values: left, right, inner, outer — same meaning as SQL. inner is the default.
After every merge, count rows. If you started with 1,000 orders and the joined DataFrame has 1,500, you have unintentional duplicates from a many-to-many join.
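You can automate both checks. The validate argument to merge raises a MergeError when the key relationship is not what you expect:
joined = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one", # fails loudly if customer_id repeats in customers
)
assert len(joined) == len(orders), "left merge should not add rows"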
A Real Example: Group, Filter, Visualize
Run this end-to-end on the tips dataset:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
df["tip_pct"] = df["tip"] / df["total_bill"] * 100
by_day = (
df.groupby("day")
.agg(
avg_bill=("total_bill", "mean"),
avg_tip_pct=("tip_pct", "mean"),
count=("total_bill", "count"),
)
.reset_index()
.sort_values("avg_tip_pct", ascending=False)
)
print(by_day)
The table is sorted by average tip percentage, so the most generous day comes first. Sanity-check it: the weekend days (Sat and Sun) should show the largest counts and average bills. Read the output. Numbers look right? Move on.
AI Prompts for Pandas
The pandas API is huge. Use AI when you forget the syntax:
I have a DataFrame df with columns name, department, salary, hired_date. Please write pandas code that finds, for each department, the median salary of employees hired in the last 12 months. Return a sorted DataFrame.
The AI will give you a one-line groupby plus a date filter. Read every line, test it, and modify if the answer is not exactly what you wanted.
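For the prompt above, the answer might look roughly like this sketch, assuming df is the DataFrame described in the prompt; the cutoff logic is one common way to express "the last 12 months", so verify it matches your definition:
cutoff = pd.Timestamp.now() - pd.DateOffset(months=12)
recent = df[pd.to_datetime(df["hired_date"]) >= cutoff] # parse dates, then filter
result = (
    recent.groupby("department")["salary"]
    .median()
    .reset_index()
    .sort_values("salary", ascending=False)
)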
When working on private or sensitive data, you can describe the columns and a few synthetic rows without uploading the real data:
My data has columns [list] and looks like this small synthetic example: [paste fake rows]. Write pandas code to do [task].
The AI never sees your real data; it just learns the shape.
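A synthetic example can be as small as this sketch; the values are fake, and only the column names and rough types need to match your real data:
fake = pd.DataFrame({
    "name": ["A. Person", "B. Person"],
    "department": ["Sales", "Engineering"],
    "salary": [50000, 60000],
    "hired_date": ["2024-03-01", "2023-11-15"],
})
print(fake.to_string(index=False)) # paste this output into your prompt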
Key Takeaways
- Always inspect with head, info, describe, and isna().sum() immediately after loading
- Boolean masks with & and | (with parentheses around each condition) drive most filtering
- groupby plus .agg() replaces nearly every Excel pivot table
- Missing values are silent by default; make them visible with isna().sum()
- After every merge, count rows to catch duplicates from many-to-many joins
- Use AI to look up syntax, but always read and test the resulting code yourself

