Writing Pandas and Python Code with AI
If you have ever spent twenty minutes trying to remember the difference between apply, map, and transform, or whether a pandas operation returns a view or a copy, AI is about to become your best tool. The pandas API is wide, its idioms are inconsistent, and its documentation is scattered. LLMs have read all of it and can write idiomatic, tested code for common analyst tasks in seconds.
This lesson covers how to work with ChatGPT and Claude to generate, refactor, and debug pandas code — and when to use a code interpreter for end-to-end analysis.
What You'll Learn
- Writing pandas code from natural-language descriptions, correctly
- Refactoring slow or unreadable pandas into clean pipelines
- Using ChatGPT Advanced Data Analysis (code interpreter) end-to-end on a CSV
- Moving analysis into Jupyter with AI-assisted notebook scaffolding
Writing Pandas Code from a Description
The template:
pandas 2.2. I have a DataFrame `df` with columns:

- `user_id` (int64)
- `event_name` (string, one of `view`, `click`, `purchase`)
- `timestamp` (datetime64[ns, UTC])
- `revenue` (float64, only present for `purchase` events, otherwise NaN)

About 8M rows. I want a DataFrame with one row per user, columns:

- `user_id`
- `total_purchases` (count of purchase events)
- `total_revenue` (sum of revenue)
- `first_event` (earliest timestamp)
- `last_event` (latest timestamp)
- `days_active` (number of distinct calendar days with any event)

Use a single `groupby(...).agg(...)` if possible. Do not use `for` loops. Name the result `user_summary`.
This prompt produces clean, correct code on the first try. Note the three specific details that make it work: the pandas version, exact dtypes, and a constraint (one groupby).
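A correct answer to that prompt looks roughly like the sketch below, shown here against a tiny synthetic frame rather than the real 8M rows (the sample data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real events frame
df = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2],
    "event_name": ["view", "purchase", "view", "click", "purchase"],
    "timestamp": pd.to_datetime([
        "2026-01-01 10:00", "2026-01-01 12:00", "2026-01-02 09:00",
        "2026-01-03 11:00", "2026-01-02 15:00"], utc=True),
    "revenue": [np.nan, 25.0, np.nan, np.nan, 40.0],
})

# One groupby, no loops: helper columns feed named aggregations
user_summary = (
    df.assign(is_purchase=df["event_name"].eq("purchase"),
              day=df["timestamp"].dt.date)
      .groupby("user_id")
      .agg(total_purchases=("is_purchase", "sum"),
           total_revenue=("revenue", "sum"),
           first_event=("timestamp", "min"),
           last_event=("timestamp", "max"),
           days_active=("day", "nunique"))
      .reset_index()
)
```

The `assign` step is the usual trick for fitting "count of purchase events" and "distinct calendar days" into a single `agg` call.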
Ten Pandas Patterns That Are Worth Knowing
AI is particularly reliable at these recurring analyst patterns. Ask for them by name.
- Groupby with multiple aggregations. `df.groupby("col").agg(total=("x", "sum"), avg=("y", "mean"))`.
- Window functions. `df.sort_values("ts").groupby("user")["revenue"].cumsum()`.
- Pivot and melt. Reshaping long to wide and back.
- Merge with indicator. `pd.merge(a, b, on="k", how="outer", indicator=True)` so you can see which rows matched.
- Resample time series. `df.set_index("ts").resample("1W").agg(...)`.
- Rolling and expanding. 7-day moving average, cumulative sums.
- Categorical types. Memory savings and proper sorting for columns with low cardinality.
- Type-safe chaining with `.pipe()`. Each step is a named function.
- Query-string filtering. `df.query("status == 'completed' and revenue > 10")`.
- Explode and join back. For list-valued columns.
If your prompt says "use pattern X" the AI will write canonical code.
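As an example, the merge-with-indicator pattern from the list lets you audit join coverage in one line. A minimal sketch with invented data:

```python
import pandas as pd

# Two small frames that only partially overlap on the key "k"
orders = pd.DataFrame({"k": [1, 2, 3], "order": ["a", "b", "c"]})
refunds = pd.DataFrame({"k": [2, 3, 4], "amount": [10, 20, 30]})

# indicator=True adds a "_merge" column: left_only / right_only / both
merged = pd.merge(orders, refunds, on="k", how="outer", indicator=True)
coverage = merged["_merge"].value_counts()
```

Checking `coverage` before trusting a join catches silent key mismatches early.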
Refactoring Slow Pandas
Slow pandas is usually caused by row-by-row `.apply()` calls, `iterrows()`, or repeated concatenation. Diagnose with:
Here is a pandas function that takes 40 seconds on 500,000 rows. Identify the three slowest operations and suggest vectorized replacements. Do not change the output. Explain the speedup mechanism for each (e.g., "vectorized string operations use the C code path").
[Paste code.]
Typical speed-ups the AI will identify:
- `.apply(lambda row: ...)` replaced with `np.where`, `pd.Series.str.*`, or a vectorized expression
- `iterrows()` replaced with boolean indexing
- `pd.concat` inside a loop replaced with a list-then-concat-once pattern
- `dict` lookups via `.map()` instead of `apply` with a lookup
A 10x or 100x speedup is common on the first refactor.
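The first replacement in that list is worth seeing side by side. A toy before/after sketch (the column and labels are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [5.0, 15.0, 50.0]})

# Slow: one Python function call per row
df["tier_slow"] = df.apply(
    lambda row: "high" if row["revenue"] > 20 else "low", axis=1)

# Fast: one vectorized expression over the whole column (C code path)
df["tier_fast"] = np.where(df["revenue"] > 20, "high", "low")
```

Both columns are identical; the `np.where` version avoids the per-row Python overhead, which is where most of the 40 seconds typically goes.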
Using a Code Interpreter End-to-End
Both ChatGPT Advanced Data Analysis and Claude's analysis tool can read a CSV and actually run code on it. This unlocks workflows like:
- Upload a CSV of orders.
- "Load the file, tell me the shape, dtypes, and a 5-row sample."
- "Generate a data-quality report with exact counts."
- "For each country, show the monthly trend in orders and revenue for the last 12 months. Output as a matplotlib chart with two subplots."
- "Fit a simple linear regression of revenue against month for the UK cohort only. Report R-squared, the slope, and whether the slope is statistically distinguishable from zero."
- "Export the cleaned DataFrame as an Excel file with three sheets: raw, cleaned, summary."
Each step runs real code you can inspect. The intermediate CSV, cleaned file, and chart are all downloadable. This is by far the fastest way to go from raw CSV to shareable deliverable.
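Behind step 3, the interpreter runs code along these lines; this is a sketch with hypothetical column names, useful as a sanity check on what "exact counts" should mean:

```python
import pandas as pd

# Invented sample of an orders CSV
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "revenue": [10.0, None, 20.0, -5.0],
})

# Exact counts for a minimal data-quality report
report = {
    "rows": len(orders),
    "duplicate_order_ids": int(orders["order_id"].duplicated().sum()),
    "missing_revenue": int(orders["revenue"].isna().sum()),
    "negative_revenue": int((orders["revenue"] < 0).sum()),
}
```

Asking for the report as exact counts (rather than "looks fine") is what makes the output auditable.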
When to trust the code interpreter
The code is real and the numbers are real, but the interpretation layered on top can still be wrong. If the AI says "revenue grew 14% year over year," verify the claim yourself: the underlying number is checkable, and code-interpreter output is reproducible because you can download and rerun the code it generated.
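Verifying a claim like "revenue grew 14% year over year" is a few lines of pandas you can run against the raw rows yourself (the data below is invented to make the arithmetic visible):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2025-03-01", "2025-06-10", "2026-02-05", "2026-08-20"]),
    "revenue": [100.0, 100.0, 110.0, 118.0],
})

# Recompute yearly totals from the raw rows, then the YoY change
by_year = orders.groupby(orders["order_date"].dt.year)["revenue"].sum()
growth = by_year.pct_change().iloc[-1]  # 228 / 200 - 1 = 0.14
```

If your recomputed `growth` disagrees with the AI's prose, trust the number you computed.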
Notebook Scaffolding
For larger analyses where you want to keep the work in your own Jupyter environment:
Draft a Jupyter notebook for the following analysis: customer churn in Q1 2026.
Required sections:
- Imports and configuration (pandas, numpy, matplotlib, seaborn)
- Data loading from `s3://data-lake/orders/2026/` (parquet, last 18 months)
- Feature engineering (RFM scores)
- Churn definition and labeling
- Exploratory charts (histograms, churn rate by segment)
- A simple logistic regression baseline with statsmodels
- Discussion section with bullets to fill in
For each code cell, add a markdown cell above it explaining what the cell does and why. Include `assert` statements after data loads to catch schema drift.
The AI produces a notebook skeleton you can fill in step by step. This makes the work reproducible and shareable instead of a stream-of-consciousness mess.
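The schema-drift asserts the prompt asks for might look like this sketch; the expected columns and the `validate_orders` helper are hypothetical, to be adapted to your actual schema:

```python
import pandas as pd

EXPECTED_COLUMNS = {"order_id", "user_id", "order_date", "revenue"}

def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast, right after the load cell, if the schema drifted."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), \
        "order_date is not a datetime column"
    assert df["order_id"].is_unique, "duplicate order_id values"
    return df

# In the notebook, wrap the load:
# df = validate_orders(pd.read_parquet(path_to_orders))
```

A failed assert in cell 2 is far cheaper than a subtly wrong chart in cell 20.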
Debugging Pandas Errors
Pandas errors are notoriously cryptic. Feed them to AI verbatim:
I got this error in pandas 2.2:
`SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.`

Here is the code that triggered it: {paste}
Explain what the warning means, why my code triggered it, and show two safe ways to rewrite it. Which one should I prefer and why?
Other common errors AI handles cleanly: `KeyError` on a groupby, `ValueError: cannot convert float NaN to integer`, `TypeError: '<' not supported between instances of 'str' and 'float'`, and `MergeError` on mismatched keys.
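For the `SettingWithCopyWarning` case above, the two safe rewrites the AI typically offers look like this sketch (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"status": ["a", "b"], "score": [1.0, 2.0]})

# Triggers the warning: chained indexing may write to a temporary copy
# df[df["status"] == "a"]["score"] = 0.0

# Safe rewrite 1: one .loc call that selects rows and column together,
# so the assignment hits df itself
df.loc[df["status"] == "a", "score"] = 0.0

# Safe rewrite 2: take an explicit .copy() when you intend to work on
# a separate frame and leave df untouched
subset = df[df["status"] == "b"].copy()
subset["score"] = 99.0
```

Prefer rewrite 1 when you mean to mutate the original frame, rewrite 2 when you mean to branch off a new one; the warning exists precisely because chained indexing leaves that intent ambiguous.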
Pandas Alternatives AI Knows About
For larger datasets, AI can translate a pandas pipeline to faster alternatives:
Translate this pandas pipeline to Polars 1.0. Preserve exact semantics. Point out any behavioral differences I should be aware of (e.g., null handling, string ops).
The same works for DuckDB SQL (run SQL on a local parquet file) and for PySpark (for cluster-scale data). When your dataset outgrows pandas, AI smooths the migration.
A Working Example
Try this end-to-end in ChatGPT with code interpreter:
- Upload any CSV with a date column and a numeric column.
- "Parse the date column correctly, compute a 7-day rolling average of the numeric column, and plot raw values plus the smoothed line on the same chart."
- "What is the average value for the last 30 days versus the preceding 30 days? Is the difference statistically significant using a Welch's t-test?"
- "Export the cleaned DataFrame and the chart as a zipped file I can download."
From raw CSV to shareable artifact in under three minutes — and the code is all inspectable.
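The code the interpreter writes for steps 2 and 3 is worth recognizing; a sketch with synthetic data in place of your CSV (assumes SciPy is available for the t-test):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the uploaded CSV: 90 daily values
rng = np.random.default_rng(42)
dates = pd.date_range("2026-01-01", periods=90, freq="D")
s = pd.Series(rng.normal(100, 10, size=90), index=dates)

# Step 2: 7-day rolling average alongside the raw series
smoothed = s.rolling(7).mean()

# Step 3: last 30 days vs the preceding 30,
# Welch's t-test (equal_var=False allows unequal variances)
recent, prior = s.iloc[-30:], s.iloc[-60:-30]
t_stat, p_value = stats.ttest_ind(recent, prior, equal_var=False)
```

Seeing the `equal_var=False` flag in the generated code is how you confirm it actually ran Welch's variant rather than the default Student's t-test.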
Key Takeaways
- Give AI pandas version, exact dtypes, and constraints for clean code first try
- Know ten pandas patterns by name so you can request them directly
- Refactor slow code by asking for vectorized replacements with an explained mechanism
- Code interpreter (ChatGPT or Claude) lets you run real pandas end-to-end on your CSV
- Use notebook scaffolding prompts to keep larger analyses reproducible

