Data Cleaning and Transformation with AI
Data cleaning is the part of the job that takes 60% of the time and nobody writes blog posts about. Messy dates, inconsistent capitalization, silent nulls, survey data entered seventeen different ways — every real dataset needs cleanup before you can trust a single chart.
AI compresses this work dramatically. This lesson covers how to use ChatGPT and Claude as cleaning copilots on CSVs, pandas DataFrames, and SQL-based staging tables.
What You'll Learn
- A repeatable workflow for diagnosing data quality issues with AI
- How to turn a raw CSV into a pandas cleaning pipeline in minutes
- SQL patterns for staging and cleaning in dbt or plain warehouse jobs
- Techniques for deduplication, standardization, and imputation you can trust
Step 1: The Data-Quality Audit
Before you clean anything, you need to see what is broken. Upload your CSV or paste a sample and use this prompt:
I have a CSV with 12,000 rows and the following columns:
{paste schema}. Here are the first 30 rows and the last 10: {paste}. Generate a data-quality report with these sections:
- Missing values per column (count and percent)
- Obvious type mismatches (numbers stored as strings, etc.)
- Format inconsistencies (date formats, casing, whitespace, encoding)
- Suspected duplicates and the criteria that suggest them
- Outliers and suspicious values
- Columns that look like they should be foreign keys to something else
For each issue, include severity (blocker, warning, cosmetic) and a proposed fix.
In ChatGPT with code interpreter or Claude with the analysis tool, you can skip the paste step — upload the file directly and the AI will compute exact counts, not estimates.
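If you want to sanity-check the AI's report (or compute it locally before uploading anything sensitive), the first few sections are cheap to produce yourself. A minimal pandas sketch, with placeholder column names:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing counts/percent, and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

# Tiny illustrative frame; replace with pd.read_csv("your_file.csv")
df = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "country": ["USA", "usa", " US ", None],
})
print(quality_report(df))
print("duplicate rows:", df.duplicated().sum())
```

Exact counts like these are what the code-interpreter tools compute for you; the value of the prompt is the severity triage and proposed fixes layered on top.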
Step 2: Generate the Cleaning Pipeline
Once the issues are identified, ask for an executable fix:
Generate a pandas 2.2 cleaning pipeline for the issues above. Requirements:
- Use `.pipe()` so each step is a reusable function
- Add a comment above each step explaining why the cleaning is needed (not just what it does)
- Preserve the original DataFrame — return a new one
- Log row counts before and after each step so I can tell if a filter was too aggressive
- Do not drop rows silently; collect them in a `rejected_df` I can inspect later
Also include a test cell that asserts the cleaned DataFrame has exactly 11,842 rows and zero nulls in `user_id` and `signup_date`.
This gives you code that is not just correct but auditable — every step is documented, nothing disappears without a trace.
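The generated pipeline typically has a shape like the following. This is an illustrative sketch, not the exact output — the step names, columns, and rules are invented, but the `.pipe()` structure, row-count logging, and `rejected_df` collection are the parts worth insisting on:

```python
import pandas as pd

rejected_frames = []  # rows removed by any step, kept for inspection

def log_counts(step):
    """Wrap a cleaning step to print row counts before and after it."""
    def wrapper(df, *args, **kwargs):
        before = len(df)
        out = step(df, *args, **kwargs)
        print(f"{step.__name__}: {before} -> {len(out)} rows")
        return out
    return wrapper

@log_counts
def drop_missing_ids(df):
    # Why: rows without a user_id cannot be joined downstream.
    bad = df["user_id"].isna()
    rejected_frames.append(df[bad].assign(reject_reason="missing user_id"))
    return df[~bad].copy()

@log_counts
def trim_strings(df):
    # Why: stray whitespace breaks grouping and joins.
    out = df.copy()
    for col in out.select_dtypes("object"):
        out[col] = out[col].str.strip()
    return out

raw = pd.DataFrame({"user_id": [1, None, 3], "country": [" US", "FR", "DE "]})
clean = raw.pipe(drop_missing_ids).pipe(trim_strings)
rejected_df = pd.concat(rejected_frames)
```

Because each step is a named function, you can re-run, reorder, or ask the AI to regenerate a single step without touching the rest.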
Common Cleaning Patterns
Standardizing messy text
Survey data famously comes in as "United States", "USA", "U.S.A.", "us", "America". AI handles this well:
Here are the unique values in my `country` column: {paste}. Map each to the ISO 3166-1 alpha-2 code. Return the mapping as a Python dict. For values that are ambiguous or non-countries (e.g. "Other", "N/A", "earth"), map them to `None` and list them separately so I can decide.
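Once the AI returns the dict, applying it is a single `map` call. A sketch with an abbreviated, hypothetical mapping — note that normalizing case and whitespace first keeps the dict small, and that anything not in the dict comes back as NaN, so surface those values for review:

```python
import pandas as pd

# Illustrative subset of the mapping the prompt would return.
COUNTRY_MAP = {
    "united states": "US", "usa": "US", "u.s.a.": "US", "us": "US",
    "america": "US",
}

def standardize_country(s: pd.Series) -> pd.Series:
    """Normalize case/whitespace, then map to ISO codes; print leftovers."""
    key = s.str.strip().str.lower()
    mapped = key.map(COUNTRY_MAP)
    unmapped = sorted(key[mapped.isna() & key.notna()].unique())
    if unmapped:
        print("review these values:", unmapped)
    return mapped

col = pd.Series(["USA", "U.S.A.", " us ", "earth"])
print(standardize_country(col).tolist())
```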
Parsing inconsistent dates
My `signup_date` column has these formats mixed together: `03/14/2026`, `2026-03-14`, `14-Mar-2026`, `March 14, 2026`. Write a pandas function that parses all of them into a single `datetime64` column. Use `dateutil.parser` as a fallback for odd cases. Flag rows that cannot be parsed in a separate `date_parse_failed` boolean column instead of silently coercing to `NaT`.
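A function along the lines the prompt asks for might look like this sketch. One caveat: `dateutil` guesses on ambiguous day/month orderings (it defaults to month-first), so spot-check a sample of the output:

```python
import pandas as pd
from dateutil import parser

def parse_mixed_dates(s: pd.Series) -> pd.DataFrame:
    """Parse each value with dateutil; flag failures instead of hiding them."""
    def try_parse(v):
        try:
            return parser.parse(str(v))
        except (ValueError, OverflowError):
            return pd.NaT

    parsed = s.map(try_parse)
    return pd.DataFrame({
        "signup_date": pd.to_datetime(parsed),
        # True only where a real value failed to parse (not where input was null)
        "date_parse_failed": parsed.isna() & s.notna(),
    })

raw = pd.Series(["03/14/2026", "2026-03-14", "14-Mar-2026",
                 "March 14, 2026", "not a date"])
print(parse_mixed_dates(raw))
```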
Deduplication with business logic
Simple `drop_duplicates()` rarely matches business rules. Try:
I have a DataFrame of customer records. Duplicates exist because customers sometimes re-sign-up with variations of the same email (`jane.doe@example.com`, `JaneDoe@example.com`, `jane.doe+1@example.com`). Write a pandas function that:
- Normalizes emails: lowercase, trim whitespace, strip the `+tag` from the local part
- Groups by normalized email and keeps the earliest signup
- Records the count of merged duplicates in a new column `merged_count`
- Returns both the deduped DataFrame and a `merged_audit` DataFrame showing which originals were merged
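A sketch of what such a function can look like; the grouping and audit logic here is one reasonable reading of the prompt, and the sample data is invented:

```python
import pandas as pd

def normalize_email(email: str) -> str:
    """Lowercase, trim, and strip a +tag from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+")[0]
    return f"{local}@{domain}"

def dedupe_customers(df: pd.DataFrame):
    out = df.assign(email_norm=df["email"].map(normalize_email))
    out = out.sort_values("signup_date")  # earliest signup first
    counts = out.groupby("email_norm").size().rename("merged_count")
    deduped = out.drop_duplicates("email_norm", keep="first").merge(
        counts, left_on="email_norm", right_index=True
    )
    # Rows that were folded into an earlier record, kept for auditing.
    merged_audit = out[out.duplicated("email_norm", keep="first")]
    return deduped, merged_audit

df = pd.DataFrame({
    "email": ["jane.doe@example.com", "JaneDoe@example.com",
              "jane.doe+1@example.com"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-01"]),
})
deduped, audit = dedupe_customers(df)
```

Note that with only lowercase/trim/+tag normalization, `JaneDoe@example.com` stays distinct from `jane.doe@example.com` (dots in the local part are significant outside Gmail-style addresses), which is exactly the kind of business decision worth making explicit.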
Imputation you can defend
Do not let AI silently fill nulls. Ask it to surface the choice:
My `age` column has 4% missing values. Give me three defensible imputation strategies (median by cohort, predictive imputation, or flag-and-exclude), with the pandas code for each and a short note on when each is appropriate. Do not pick one — explain the trade-offs so I can decide.
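For reference, median-by-cohort tends to be the easiest of the three to defend. A sketch, with `signup_year` as a hypothetical cohort column and an explicit flag so imputed rows stay visible:

```python
import numpy as np
import pandas as pd

def impute_age_by_cohort(df: pd.DataFrame,
                         cohort_col: str = "signup_year") -> pd.DataFrame:
    """Fill missing ages with the cohort median; flag every filled row."""
    out = df.copy()
    out["age_was_imputed"] = out["age"].isna()
    out["age"] = out["age"].fillna(
        out.groupby(cohort_col)["age"].transform("median")
    )
    return out

df = pd.DataFrame({
    "signup_year": [2023, 2023, 2023, 2024, 2024],
    "age": [30, 40, np.nan, 25, np.nan],
})
print(impute_age_by_cohort(df))
```

The `age_was_imputed` flag is the auditability piece: any downstream analysis can exclude or down-weight those rows instead of treating them as observed data.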
Cleaning in SQL or dbt
If your cleaning happens in the warehouse rather than pandas, AI is equally good at dbt-style staging models:
Write a dbt staging model `stg_orders` for our raw Shopify data in `raw.shopify.orders`. Apply these cleaning rules:
- Cast `created_at` from string to `timestamp_tz`, assuming UTC
- Lowercase and trim `email`
- Null out `customer_id` when it equals `'guest'`
- Exclude rows where `financial_status IN ('cancelled', 'voided')`
- Add an `is_test_order` boolean based on whether `email` ends with `@test.example`
- Include `dbt_valid_from` and `dbt_valid_to` if any row appears more than once (late-arriving data)
Use CTEs, trailing commas, and lowercase keywords. Add `{{ config(materialized='view') }}` at the top. Include the accompanying schema.yml tests asserting `order_id` is unique and not null.
Validation: Does the Cleaning Actually Work?
Every cleaning pipeline needs a validation layer. Ask AI to generate it alongside the pipeline:
For the cleaning pipeline you just wrote, generate a validation notebook that:
- Compares row counts before and after
- Asserts no negative values in `revenue`
- Asserts `signup_date` is between 2020-01-01 and today
- Shows a before/after histogram for numeric columns
- Shows a before/after value count for categorical columns
If any assertion fails, raise a clear error with the row that failed, not a generic message.
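A hand-rolled version of those checks is only a few lines of pandas; the design choice that matters is putting the offending rows in the error message. A sketch with hypothetical column names:

```python
import pandas as pd

def validate(clean: pd.DataFrame, raw_rows: int) -> None:
    """Fail loudly, showing the rows that broke each rule."""
    assert len(clean) <= raw_rows, "cleaning should never add rows"

    bad_rev = clean[clean["revenue"] < 0]
    if not bad_rev.empty:
        raise ValueError(f"negative revenue in these rows:\n{bad_rev}")

    out_of_range = clean[
        (clean["signup_date"] < "2020-01-01")
        | (clean["signup_date"] > pd.Timestamp.today())
    ]
    if not out_of_range.empty:
        raise ValueError(f"signup_date out of range:\n{out_of_range}")

clean = pd.DataFrame({
    "revenue": [10.0, 20.0],
    "signup_date": pd.to_datetime(["2023-05-01", "2024-07-09"]),
})
validate(clean, raw_rows=2)
print("all checks passed")
```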
Run this validation every time the raw data refreshes. It will catch upstream breakage before your dashboards do.
Anti-Patterns
A few cleaning approaches that feel quick but will hurt later:
- Dropping rows without logging. You will lose evidence of upstream problems.
- Blanket `.fillna(0)`. Treats missing as zero, which distorts averages and totals.
- Regex that over-matches. Always test on edge cases before applying to the whole column.
- One giant mega-function. Multiple small `.pipe()` steps are easier to debug and easier for AI to regenerate when rules change.
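The `.fillna(0)` distortion is easy to demonstrate on a toy column:

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, 40, np.nan, np.nan])
print(ages.mean())            # 35.0 (pandas skips NaN by default)
print(ages.fillna(0).mean())  # 17.5 (the fake zeros drag the average down)
```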
Key Takeaways
- Start with an AI-generated data-quality report, then generate a pipeline to fix each issue
- Use `.pipe()` and explicit before/after counts so every transformation is auditable
- Let AI surface imputation trade-offs — do not let it silently pick a strategy
- Cleaning in SQL or dbt works the same way as pandas; the pattern transfers
- Always pair a cleaning pipeline with a validation step that fails loudly on bad input

