Writing Pandas and Python Code with AI
If you have ever spent twenty minutes trying to remember the difference between apply, map, and transform, or whether a pandas operation returns a view or a copy, AI is about to become your best tool. The pandas API is wide, its idioms are inconsistent, and its documentation is scattered. LLMs have read all of it and can write idiomatic, tested code for common analyst tasks in seconds.
This lesson covers how to work with ChatGPT and Claude to generate, refactor, and debug pandas code — and when to use a code interpreter for end-to-end analysis.
What You'll Learn
- Writing pandas code from natural-language descriptions, correctly
- Refactoring slow or unreadable pandas into clean pipelines
- Using ChatGPT Advanced Data Analysis (code interpreter) end-to-end on a CSV
- Moving analysis into Jupyter with AI-assisted notebook scaffolding
Writing Pandas Code from a Description
The template:
pandas 2.2. I have a DataFrame `df` with columns:

- `user_id` (int64)
- `event_name` (string, one of `view`, `click`, `purchase`)
- `timestamp` (datetime64[ns, UTC])
- `revenue` (float64, only present for `purchase` events, otherwise NaN)

About 8M rows. I want a DataFrame with one row per user, columns:

- `user_id`
- `total_purchases` (count of purchase events)
- `total_revenue` (sum of revenue)
- `first_event` (earliest timestamp)
- `last_event` (latest timestamp)
- `days_active` (number of distinct calendar days with any event)

Use a single `groupby(...).agg(...)` if possible. Do not use `for` loops. Name the result `user_summary`.
This prompt produces clean, correct code on the first try. Note the three specific details that make it work: the pandas version, exact dtypes, and a constraint (one groupby).
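A correct answer to that prompt looks roughly like the sketch below, shown here against a tiny synthetic frame rather than the real 8M rows (the sample data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real events frame
df = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2],
    "event_name": ["view", "purchase", "view", "click", "purchase"],
    "timestamp": pd.to_datetime([
        "2026-01-01 10:00", "2026-01-01 12:00", "2026-01-02 09:00",
        "2026-01-03 11:00", "2026-01-02 15:00"], utc=True),
    "revenue": [np.nan, 25.0, np.nan, np.nan, 40.0],
})

# One groupby, no loops: helper columns feed named aggregations
user_summary = (
    df.assign(is_purchase=df["event_name"].eq("purchase"),
              day=df["timestamp"].dt.date)
      .groupby("user_id")
      .agg(total_purchases=("is_purchase", "sum"),
           total_revenue=("revenue", "sum"),
           first_event=("timestamp", "min"),
           last_event=("timestamp", "max"),
           days_active=("day", "nunique"))
      .reset_index()
)
```

The `assign` step is the usual trick for fitting "count of purchase events" and "distinct calendar days" into a single `agg` call.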
Ten Pandas Patterns That Are Worth Knowing
AI is particularly reliable at these recurring analyst patterns. Ask for them by name.
- Groupby with multiple aggregations. `df.groupby("col").agg(total=("x", "sum"), avg=("y", "mean"))`.
- Window functions. `df.sort_values("ts").groupby("user")["revenue"].cumsum()`.
- Pivot and melt. Reshaping long to wide and back.
- Merge with indicator. `pd.merge(a, b, on="k", how="outer", indicator=True)` so you can see which rows matched.
- Resample time series. `df.set_index("ts").resample("1W").agg(...)`.
- Rolling and expanding. 7-day moving average, cumulative sums.
- Categorical types. Memory savings and proper sorting for columns with low cardinality.
- Type-safe chaining with `.pipe()`. Each step is a named function.
- Query-string filtering. `df.query("status == 'completed' and revenue > 10")`.
- Explode and join back. For list-valued columns.
If your prompt says "use pattern X" the AI will write canonical code.
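As an example, the merge-with-indicator pattern from the list lets you audit join coverage in one line. A minimal sketch with invented data:

```python
import pandas as pd

# Two small frames that only partially overlap on the key "k"
orders = pd.DataFrame({"k": [1, 2, 3], "order": ["a", "b", "c"]})
refunds = pd.DataFrame({"k": [2, 3, 4], "amount": [10, 20, 30]})

# indicator=True adds a "_merge" column: left_only / right_only / both
merged = pd.merge(orders, refunds, on="k", how="outer", indicator=True)
coverage = merged["_merge"].value_counts()
```

Checking `coverage` before trusting a join catches silent key mismatches early.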
Refactoring Slow Pandas
Slow pandas is usually caused by row-by-row `.apply()` calls, `iterrows()`, or repeated concatenation. Diagnose with:
Here is a pandas function that takes 40 seconds on 500,000 rows. Identify the three slowest operations and suggest vectorized replacements. Do not change the output. Explain the speedup mechanism for each (e.g., "vectorized string operations use the C code path").
[Paste code.]
Typical speed-ups the AI will identify:
- `.apply(lambda row: ...)` replaced with `np.where`, `pd.Series.str.*`, or a vectorized expression
- `iterrows()` replaced with boolean indexing
- `pd.concat` inside a loop replaced with a list-then-concat-once pattern
- `dict` lookups via `.map()` instead of `apply` with a lookup
A 10x or 100x speedup is common on the first refactor.
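The first replacement in that list is worth seeing side by side. A toy before/after sketch (the column and labels are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [5.0, 15.0, 50.0]})

# Slow: one Python function call per row
df["tier_slow"] = df.apply(
    lambda row: "high" if row["revenue"] > 20 else "low", axis=1)

# Fast: one vectorized expression over the whole column (C code path)
df["tier_fast"] = np.where(df["revenue"] > 20, "high", "low")
```

Both columns are identical; the `np.where` version avoids the per-row Python overhead, which is where most of the 40 seconds typically goes.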
Using a Code Interpreter End-to-End
Both ChatGPT Advanced Data Analysis and Claude's analysis tool can read a CSV and actually run code on it. This unlocks workflows like:
- Upload a CSV of orders.
- "Load the file, tell me the shape, dtypes, and a 5-row sample."
- "Generate a data-quality report with exact counts."
- "For each country, show the monthly trend in orders and revenue for the last 12 months. Output as a matplotlib chart with two subplots."
- "Fit a simple linear regression of revenue against month for the UK cohort only. Report R-squared, the slope, and whether the slope is statistically distinguishable from zero."
- "Export the cleaned DataFrame as an Excel file with three sheets: raw, cleaned, summary."
Each step runs real code you can inspect. The intermediate CSV, cleaned file, and chart are all downloadable. This is by far the fastest way to go from raw CSV to shareable deliverable.
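Behind step 3, the interpreter runs code along these lines; this is a sketch with hypothetical column names, useful as a sanity check on what "exact counts" should mean:

```python
import pandas as pd

# Invented sample of an orders CSV
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "revenue": [10.0, None, 20.0, -5.0],
})

# Exact counts for a minimal data-quality report
report = {
    "rows": len(orders),
    "duplicate_order_ids": int(orders["order_id"].duplicated().sum()),
    "missing_revenue": int(orders["revenue"].isna().sum()),
    "negative_revenue": int((orders["revenue"] < 0).sum()),
}
```

Asking for the report as exact counts (rather than "looks fine") is what makes the output auditable.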
When to trust the code interpreter
The code is real and the numbers are real, but the interpretation layered on top can still be wrong. If the AI says "revenue grew 14% year over year," verify the claim yourself: the underlying number is checkable, and code-interpreter output is reproducible because you can download and rerun the code it generated.
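Verifying a claim like "revenue grew 14% year over year" is a few lines of pandas you can run against the raw rows yourself (the data below is invented to make the arithmetic visible):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2025-03-01", "2025-06-10", "2026-02-05", "2026-08-20"]),
    "revenue": [100.0, 100.0, 110.0, 118.0],
})

# Recompute yearly totals from the raw rows, then the YoY change
by_year = orders.groupby(orders["order_date"].dt.year)["revenue"].sum()
growth = by_year.pct_change().iloc[-1]  # 228 / 200 - 1 = 0.14
```

If your recomputed `growth` disagrees with the AI's prose, trust the number you computed.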
Notebook Scaffolding
For larger analyses where you want to keep the work in your own Jupyter environment:
Draft a Jupyter notebook for the following analysis: customer churn in Q1 2026.
Required sections:
- Imports and configuration (pandas, numpy, matplotlib, seaborn)
- Data loading from `s3://data-lake/orders/2026/` (parquet, last 18 months)
- Feature engineering (RFM scores)
- Churn definition and labeling
- Exploratory charts (histograms, churn rate by segment)
- A simple logistic regression baseline with statsmodels
- Discussion section with bullets to fill in
For each code cell, add a markdown cell above it explaining what the cell does and why. Include `assert` statements after data loads to catch schema drift.
The AI produces a notebook skeleton you can fill in step by step. This makes the work reproducible and shareable instead of a stream-of-consciousness mess.
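The schema-drift asserts the prompt asks for might look like this sketch; the expected columns and the `validate_orders` helper are hypothetical, to be adapted to your actual schema:

```python
import pandas as pd

EXPECTED_COLUMNS = {"order_id", "user_id", "order_date", "revenue"}

def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast, right after the load cell, if the schema drifted."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), \
        "order_date is not a datetime column"
    assert df["order_id"].is_unique, "duplicate order_id values"
    return df

# In the notebook, wrap the load:
# df = validate_orders(pd.read_parquet(path_to_orders))
```

A failed assert in cell 2 is far cheaper than a subtly wrong chart in cell 20.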
Debugging Pandas Errors
Pandas errors are notoriously cryptic. Feed them to AI verbatim:
I got this error in pandas 2.2:
`SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.`

Here is the code that triggered it: {paste}
Explain what the warning means, why my code triggered it, and show two safe ways to rewrite it. Which one should I prefer and why?
Other common errors AI handles cleanly: `KeyError` on a groupby, `ValueError: cannot convert float NaN to integer`, `TypeError: '<' not supported between instances of 'str' and 'float'`, and `MergeError` on mismatched keys.
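For the `SettingWithCopyWarning` case above, the two safe rewrites the AI typically offers look like this sketch (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"status": ["a", "b"], "score": [1.0, 2.0]})

# Triggers the warning: chained indexing may write to a temporary copy
# df[df["status"] == "a"]["score"] = 0.0

# Safe rewrite 1: one .loc call that selects rows and column together,
# so the assignment hits df itself
df.loc[df["status"] == "a", "score"] = 0.0

# Safe rewrite 2: take an explicit .copy() when you intend to work on
# a separate frame and leave df untouched
subset = df[df["status"] == "b"].copy()
subset["score"] = 99.0
```

Prefer rewrite 1 when you mean to mutate the original frame, rewrite 2 when you mean to branch off a new one; the warning exists precisely because chained indexing leaves that intent ambiguous.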
Pandas Alternatives AI Knows About
For larger datasets, AI can translate a pandas pipeline to faster alternatives:
Translate this pandas pipeline to Polars 1.0. Preserve exact semantics. Point out any behavioral differences I should be aware of (e.g., null handling, string ops).
The same works for DuckDB SQL (run SQL on a local parquet file) and for PySpark (for cluster-scale data). When your dataset outgrows pandas, AI smooths the migration.
A Working Example
Try this end-to-end in ChatGPT with code interpreter:
- Upload any CSV with a date column and a numeric column.
- "Parse the date column correctly, compute a 7-day rolling average of the numeric column, and plot raw values plus the smoothed line on the same chart."
- "What is the average value for the last 30 days versus the preceding 30 days? Is the difference statistically significant using a Welch's t-test?"
- "Export the cleaned DataFrame and the chart as a zipped file I can download."
From raw CSV to shareable artifact in under three minutes — and the code is all inspectable.
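The code the interpreter writes for steps 2 and 3 is worth recognizing; a sketch with synthetic data in place of your CSV (assumes SciPy is available for the t-test):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the uploaded CSV: 90 daily values
rng = np.random.default_rng(42)
dates = pd.date_range("2026-01-01", periods=90, freq="D")
s = pd.Series(rng.normal(100, 10, size=90), index=dates)

# Step 2: 7-day rolling average alongside the raw series
smoothed = s.rolling(7).mean()

# Step 3: last 30 days vs the preceding 30,
# Welch's t-test (equal_var=False allows unequal variances)
recent, prior = s.iloc[-30:], s.iloc[-60:-30]
t_stat, p_value = stats.ttest_ind(recent, prior, equal_var=False)
```

Seeing the `equal_var=False` flag in the generated code is how you confirm it actually ran Welch's variant rather than the default Student's t-test.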
Key Takeaways
- Give AI pandas version, exact dtypes, and constraints for clean code first try
- Know ten pandas patterns by name so you can request them directly
- Refactor slow code by asking for vectorized replacements with an explained mechanism
- Code interpreter (ChatGPT or Claude) lets you run real pandas end-to-end on your CSV
- Use notebook scaffolding prompts to keep larger analyses reproducible

