Pandas for Data Wrangling with AI
Pandas is the library that turns Python into a data tool. Once you can load a CSV into a DataFrame, you can filter, group, aggregate, join, and transform data faster than you ever could in Excel — and you have an audit trail of every step. Pandas is the workhorse of every data scientist, every data analyst, and every quantitative researcher.
This lesson teaches you the 80% of pandas you will use 80% of the time, with AI prompts to fill the gaps.
What You'll Learn
- How to load a CSV and inspect it in three lines
- The six pandas operations you cannot live without: select, filter, sort, group, aggregate, merge
- How to handle missing values without silently corrupting your analysis
- AI prompts that turn pandas questions into working code
Loading Data
Three ways you will load data in this course:
import pandas as pd
# From a URL
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
# From a local file
df = pd.read_csv("tips.csv")
# Upload to Colab first, then read
from google.colab import files
files.upload() # opens a file picker in your browser
df = pd.read_csv("tips.csv")
The first thing to do after loading anything is inspect it. Always.
df.head() # first 5 rows
df.tail() # last 5 rows
df.info() # column names, types, missing-value counts
df.describe() # quick stats for numeric columns
df.shape # (rows, columns)
df.columns # column names (an Index, not a list)
If you skip these checks, you will compute averages over the wrong column, miss missing data, and end up debugging "the data" when the real problem is that you never looked at it. Always inspect.
Selecting Columns and Rows
df["total_bill"] # one column (a Series)
df[["total_bill", "tip"]] # two columns (a DataFrame)
df.iloc[0] # first row by position
df.iloc[0:5] # first 5 rows by position
df.loc[df["sex"] == "Female"] # all rows where sex is Female
iloc is by integer position. loc is by label or boolean condition. The most common mistake is mixing them up — when in doubt, ask AI:
What is the difference between df.iloc[0] and df.loc[0]? When would each give a different result?
The answer reveals that iloc always uses position, while loc uses the index — which may not be sequential numbers if you have done filtering or grouping.
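To see the difference concretely, here is a minimal sketch; which index labels survive depends on your data, so the label is fetched from the subset rather than assumed:
subset = df[df["tip"] > 5] # filtered rows keep their original index labels
subset.iloc[0] # first row of the subset by position, whatever its label is
subset.loc[subset.index[0]] # the same row, fetched by its surviving label
# subset.loc[0] raises a KeyError if no row labeled 0 survived the filter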
Filtering Rows
The same boolean mask trick from NumPy works here:
big_tippers = df[df["tip"] > 5]
female_smokers = df[(df["sex"] == "Female") & (df["smoker"] == "Yes")]
weekend = df[df["day"].isin(["Sat", "Sun"])]
& is and, | is or, and parentheses around each condition are required — leaving them off gives a cryptic error.
Sorting
df.sort_values("tip", ascending=False).head(10)
df.sort_values(["day", "tip"], ascending=[True, False])
The first sorts by tip descending. The second sorts by day ascending, then within each day by tip descending — multi-level sorts are common when ranking inside categories.
Group By and Aggregate
This is the most powerful pandas operation, and the one that most often replaces an Excel pivot table.
df.groupby("day")["total_bill"].mean()
That returns the average bill per day of the week. Read it like a sentence: "Group by day. Take the total_bill column. Compute the mean for each group."
You can compute multiple aggregations at once:
df.groupby("day").agg({
"total_bill": ["mean", "median", "count"],
"tip": ["mean", "max"],
})
You can group by multiple columns:
df.groupby(["day", "sex"])["tip"].mean().reset_index()
The .reset_index() at the end converts the grouped result back into a flat DataFrame, which is what you want for plotting and exporting.
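To make the payoff concrete, here is a small sketch of exporting the flat result; the file name is just an example:
by_day_sex = df.groupby(["day", "sex"])["tip"].mean().reset_index()
by_day_sex.to_csv("tip_by_day_sex.csv", index=False) # flat columns export cleanly
# Without reset_index(), day and sex would form a MultiIndex rather than columns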
Adding and Modifying Columns
df["tip_pct"] = df["tip"] / df["total_bill"] * 100
df["big_tipper"] = df["tip"] > 5
df["meal_size"] = df["size"].apply(
lambda s: "small" if s <= 2 else "medium" if s <= 4 else "large"
)
The third example uses .apply() with a lambda — apply a function to each value in a column. Useful for one-off transformations.
For real production code, you would use np.where() or pd.cut() instead, but .apply() with a lambda is fine for learning.
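If you are curious what those production-style versions look like, here is a sketch using the same cutoffs (the big_party column is just for illustration):
import numpy as np
# np.where handles a simple two-way split in one vectorized call
df["big_party"] = np.where(df["size"] > 4, "large", "small or medium")
# pd.cut handles the full three-way binning without a Python-level loop
df["meal_size"] = pd.cut(
    df["size"],
    bins=[0, 2, 4, float("inf")],
    labels=["small", "medium", "large"],
)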
Handling Missing Values
Real data has holes. Pandas represents missing values as NaN (not a number). Your job is to know when they exist and what to do about them.
df.isna().sum() # count of missing values per column
df.dropna() # drop rows with any missing value
df.dropna(subset=["tip"]) # drop only if tip is missing
df["age"].fillna(df["age"].mean()) # fill with the mean
df["category"].fillna("Unknown") # fill with a placeholder
The dangerous pattern is calling .mean() on a column with missing values without realizing it. By default, pandas skips them, so df["score"].mean() averages over the non-missing scores only. If you have a lot of missing values, that average can be misleading.
Always run df.isna().sum() after loading and again after any complex transformation. Make missing values visible.
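To see the NaN-skipping behavior in isolation, here is a tiny sketch with toy data:
scores = pd.Series([80, 90, None]) # None becomes NaN on construction
scores.mean() # 85.0: the mean of the two non-missing values, not of three
scores.isna().sum() # 1, the signal to check before trusting that mean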
Merging Two DataFrames
merge is pandas' equivalent of a SQL JOIN. You combine two DataFrames on a shared key.
customers = pd.DataFrame({
"customer_id": [1, 2, 3],
"name": ["Alice", "Bob", "Cara"],
})
orders = pd.DataFrame({
"customer_id": [1, 1, 2, 4],
"amount": [100, 200, 50, 300],
})
joined = orders.merge(customers, on="customer_id", how="left")
how="left" keeps every row in the left DataFrame (orders), filling missing customer info with NaN where it does not match. The four how values: left, right, inner, outer — same meaning as SQL. inner is the default.
After every merge, count rows. If you started with 1,000 orders and the joined DataFrame has 1,500, you have unintentional duplicates from a many-to-many join.
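You can automate both checks. The validate argument to merge raises a MergeError when the key relationship is not what you expect:
joined = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one", # fails loudly if customer_id repeats in customers
)
assert len(joined) == len(orders), "left merge should not add rows"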
A Real Example: Group, Filter, Visualize
Run this end-to-end on the tips dataset:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
df["tip_pct"] = df["tip"] / df["total_bill"] * 100
by_day = (
df.groupby("day")
.agg(
avg_bill=("total_bill", "mean"),
avg_tip_pct=("tip_pct", "mean"),
count=("total_bill", "count"),
)
.reset_index()
.sort_values("avg_tip_pct", ascending=False)
)
print(by_day)
The table is sorted by average tip percentage, so the most generous day comes first. Sanity-check it: the weekend days (Sat and Sun) should show the largest counts and average bills. Read the output. Numbers look right? Move on.
AI Prompts for Pandas
The pandas API is huge. Use AI when you forget the syntax:
I have a DataFrame df with columns name, department, salary, hired_date. Please write pandas code that finds, for each department, the median salary of employees hired in the last 12 months. Return a sorted DataFrame.
The AI will give you a one-line groupby plus a date filter. Read every line, test it, and modify if the answer is not exactly what you wanted.
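For the prompt above, the answer might look roughly like this sketch, assuming df is the DataFrame described in the prompt; the cutoff logic is one common way to express "the last 12 months", so verify it matches your definition:
cutoff = pd.Timestamp.now() - pd.DateOffset(months=12)
recent = df[pd.to_datetime(df["hired_date"]) >= cutoff] # parse dates, then filter
result = (
    recent.groupby("department")["salary"]
    .median()
    .reset_index()
    .sort_values("salary", ascending=False)
)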
When working on private or sensitive data, you can describe the columns and a few synthetic rows without uploading the real data:
My data has columns [list] and looks like this small synthetic example: [paste fake rows]. Write pandas code to do [task].
The AI never sees your real data; it just learns the shape.
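A synthetic example can be as small as this sketch; the values are fake, and only the column names and rough types need to match your real data:
fake = pd.DataFrame({
    "name": ["A. Person", "B. Person"],
    "department": ["Sales", "Engineering"],
    "salary": [50000, 60000],
    "hired_date": ["2024-03-01", "2023-11-15"],
})
print(fake.to_string(index=False)) # paste this output into your prompt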
Key Takeaways
- Always inspect with head, info, describe, and isna().sum() immediately after loading
- Boolean masks with & and | (with parentheses around each condition) drive most filtering
- groupby plus .agg() replaces nearly every Excel pivot table
- Missing values are silent by default; make them visible with isna().sum()
- After every merge, count rows to catch duplicates from many-to-many joins
- Use AI to look up syntax, but always read and test the resulting code yourself

