Data Cleaning and Transformation with AI
Data cleaning is the part of the job that takes 60% of the time and nobody writes blog posts about. Messy dates, inconsistent capitalization, silent nulls, survey data entered seventeen different ways — every real dataset needs cleanup before you can trust a single chart.
AI compresses this work dramatically. This lesson covers how to use ChatGPT and Claude as cleaning copilots on CSVs, pandas DataFrames, and SQL-based staging tables.
What You'll Learn
- A repeatable workflow for diagnosing data quality issues with AI
- How to turn a raw CSV into a pandas cleaning pipeline in minutes
- SQL patterns for staging and cleaning in dbt or plain warehouse jobs
- Techniques for deduplication, standardization, and imputation you can trust
Step 1: The Data-Quality Audit
Before you clean anything, you need to see what is broken. Upload your CSV or paste a sample and use this prompt:
I have a CSV with 12,000 rows and the following columns:
{paste schema}. Here are the first 30 rows and the last 10: {paste}. Generate a data-quality report with these sections:
- Missing values per column (count and percent)
- Obvious type mismatches (numbers stored as strings, etc.)
- Format inconsistencies (date formats, casing, whitespace, encoding)
- Suspected duplicates and the criteria that suggest them
- Outliers and suspicious values
- Columns that look like they should be foreign keys to something else
For each issue, include severity (blocker, warning, cosmetic) and a proposed fix.
In ChatGPT with code interpreter or Claude with the analysis tool, you can skip the paste step — upload the file directly and the AI will compute exact counts, not estimates.
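If you want to sanity-check the AI's report (or compute it locally before uploading anything sensitive), the first few sections are cheap to produce yourself. A minimal pandas sketch, with placeholder column names:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing counts/percent, and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

# Tiny illustrative frame; replace with pd.read_csv("your_file.csv")
df = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "country": ["USA", "usa", " US ", None],
})
print(quality_report(df))
print("duplicate rows:", df.duplicated().sum())
```

Exact counts like these are what the code-interpreter tools compute for you; the value of the prompt is the severity triage and proposed fixes layered on top.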
Step 2: Generate the Cleaning Pipeline
Once the issues are identified, ask for an executable fix:
Generate a pandas 2.2 cleaning pipeline for the issues above. Requirements:
- Use `.pipe()` so each step is a reusable function
- Add a comment above each step explaining why the cleaning is needed (not just what it does)
- Preserve the original DataFrame — return a new one
- Log row counts before and after each step so I can tell if a filter was too aggressive
- Do not drop rows silently; collect them in a `rejected_df` I can inspect later
Also include a test cell that asserts the cleaned DataFrame has exactly 11,842 rows and zero nulls in `user_id` and `signup_date`.
This gives you code that is not just correct but auditable — every step is documented, nothing disappears without a trace.
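The generated pipeline typically has a shape like the following. This is an illustrative sketch, not the exact output — the step names, columns, and rules are invented, but the `.pipe()` structure, row-count logging, and `rejected_df` collection are the parts worth insisting on:

```python
import pandas as pd

rejected_frames = []  # rows removed by any step, kept for inspection

def log_counts(step):
    """Wrap a cleaning step to print row counts before and after it."""
    def wrapper(df, *args, **kwargs):
        before = len(df)
        out = step(df, *args, **kwargs)
        print(f"{step.__name__}: {before} -> {len(out)} rows")
        return out
    return wrapper

@log_counts
def drop_missing_ids(df):
    # Why: rows without a user_id cannot be joined downstream.
    bad = df["user_id"].isna()
    rejected_frames.append(df[bad].assign(reject_reason="missing user_id"))
    return df[~bad].copy()

@log_counts
def trim_strings(df):
    # Why: stray whitespace breaks grouping and joins.
    out = df.copy()
    for col in out.select_dtypes("object"):
        out[col] = out[col].str.strip()
    return out

raw = pd.DataFrame({"user_id": [1, None, 3], "country": [" US", "FR", "DE "]})
clean = raw.pipe(drop_missing_ids).pipe(trim_strings)
rejected_df = pd.concat(rejected_frames)
```

Because each step is a named function, you can re-run, reorder, or ask the AI to regenerate a single step without touching the rest.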
Common Cleaning Patterns
Standardizing messy text
Survey data famously comes in as "United States", "USA", "U.S.A.", "us", "America". AI handles this well:
Here are the unique values in my `country` column: {paste}. Map each to the ISO 3166-1 alpha-2 code. Return the mapping as a Python dict. For values that are ambiguous or non-countries (e.g. "Other", "N/A", "earth"), map them to `None` and list them separately so I can decide.
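Once the AI returns the dict, applying it is a single `map` call. A sketch with an abbreviated, hypothetical mapping — note that normalizing case and whitespace first keeps the dict small, and that anything not in the dict comes back as NaN, so surface those values for review:

```python
import pandas as pd

# Illustrative subset of the mapping the prompt would return.
COUNTRY_MAP = {
    "united states": "US", "usa": "US", "u.s.a.": "US", "us": "US",
    "america": "US",
}

def standardize_country(s: pd.Series) -> pd.Series:
    """Normalize case/whitespace, then map to ISO codes; print leftovers."""
    key = s.str.strip().str.lower()
    mapped = key.map(COUNTRY_MAP)
    unmapped = sorted(key[mapped.isna() & key.notna()].unique())
    if unmapped:
        print("review these values:", unmapped)
    return mapped

col = pd.Series(["USA", "U.S.A.", " us ", "earth"])
print(standardize_country(col).tolist())
```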
Parsing inconsistent dates
My `signup_date` column has these formats mixed together: `03/14/2026`, `2026-03-14`, `14-Mar-2026`, `March 14, 2026`. Write a pandas function that parses all of them into a single `datetime64` column. Use `dateutil.parser` as a fallback for odd cases. Flag rows that cannot be parsed in a separate `date_parse_failed` boolean column instead of silently coercing to `NaT`.
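A function along the lines the prompt asks for might look like this sketch. One caveat: `dateutil` guesses on ambiguous day/month orderings (it defaults to month-first), so spot-check a sample of the output:

```python
import pandas as pd
from dateutil import parser

def parse_mixed_dates(s: pd.Series) -> pd.DataFrame:
    """Parse each value with dateutil; flag failures instead of hiding them."""
    def try_parse(v):
        try:
            return parser.parse(str(v))
        except (ValueError, OverflowError):
            return pd.NaT

    parsed = s.map(try_parse)
    return pd.DataFrame({
        "signup_date": pd.to_datetime(parsed),
        # True only where a real value failed to parse (not where input was null)
        "date_parse_failed": parsed.isna() & s.notna(),
    })

raw = pd.Series(["03/14/2026", "2026-03-14", "14-Mar-2026",
                 "March 14, 2026", "not a date"])
print(parse_mixed_dates(raw))
```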
Deduplication with business logic
Simple `drop_duplicates()` rarely matches business rules. Try:
I have a DataFrame of customer records. Duplicates exist because customers sometimes re-sign-up with variations of the same email (`jane.doe@example.com`, `JaneDoe@example.com`, `jane.doe+1@example.com`). Write a pandas function that:
- Normalizes emails: lowercase, trim whitespace, strip the `+tag` from the local part
- Groups by normalized email and keeps the earliest signup
- Records the count of merged duplicates in a new column `merged_count`
- Returns both the deduped DataFrame and a `merged_audit` DataFrame showing which originals were merged
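A sketch of what such a function can look like; the grouping and audit logic here is one reasonable reading of the prompt, and the sample data is invented:

```python
import pandas as pd

def normalize_email(email: str) -> str:
    """Lowercase, trim, and strip a +tag from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+")[0]
    return f"{local}@{domain}"

def dedupe_customers(df: pd.DataFrame):
    out = df.assign(email_norm=df["email"].map(normalize_email))
    out = out.sort_values("signup_date")  # earliest signup first
    counts = out.groupby("email_norm").size().rename("merged_count")
    deduped = out.drop_duplicates("email_norm", keep="first").merge(
        counts, left_on="email_norm", right_index=True
    )
    # Rows that were folded into an earlier record, kept for auditing.
    merged_audit = out[out.duplicated("email_norm", keep="first")]
    return deduped, merged_audit

df = pd.DataFrame({
    "email": ["jane.doe@example.com", "JaneDoe@example.com",
              "jane.doe+1@example.com"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-01"]),
})
deduped, audit = dedupe_customers(df)
```

Note that with only lowercase/trim/+tag normalization, `JaneDoe@example.com` stays distinct from `jane.doe@example.com` (dots in the local part are significant outside Gmail-style addresses), which is exactly the kind of business decision worth making explicit.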
Imputation you can defend
Do not let AI silently fill nulls. Ask it to surface the choice:
My `age` column has 4% missing values. Give me three defensible imputation strategies (median by cohort, predictive imputation, or flag-and-exclude), with the pandas code for each and a short note on when each is appropriate. Do not pick one — explain the trade-offs so I can decide.
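For reference, median-by-cohort tends to be the easiest of the three to defend. A sketch, with `signup_year` as a hypothetical cohort column and an explicit flag so imputed rows stay visible:

```python
import numpy as np
import pandas as pd

def impute_age_by_cohort(df: pd.DataFrame,
                         cohort_col: str = "signup_year") -> pd.DataFrame:
    """Fill missing ages with the cohort median; flag every filled row."""
    out = df.copy()
    out["age_was_imputed"] = out["age"].isna()
    out["age"] = out["age"].fillna(
        out.groupby(cohort_col)["age"].transform("median")
    )
    return out

df = pd.DataFrame({
    "signup_year": [2023, 2023, 2023, 2024, 2024],
    "age": [30, 40, np.nan, 25, np.nan],
})
print(impute_age_by_cohort(df))
```

The `age_was_imputed` flag is the auditability piece: any downstream analysis can exclude or down-weight those rows instead of treating them as observed data.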
Cleaning in SQL or dbt
If your cleaning happens in the warehouse rather than pandas, AI is equally good at dbt-style staging models:
Write a dbt staging model `stg_orders` for our raw Shopify data in `raw.shopify.orders`. Apply these cleaning rules:
- Cast `created_at` from string to `timestamp_tz`, assuming UTC
- Lowercase and trim `email`
- Null out `customer_id` when it equals `'guest'`
- Exclude rows where `financial_status IN ('cancelled', 'voided')`
- Add an `is_test_order` boolean based on whether `email` ends with `@test.example`
- Include `dbt_valid_from` and `dbt_valid_to` if any row appears more than once (late-arriving data)
Use CTEs, trailing commas, and lowercase keywords. Add `{{ config(materialized='view') }}` at the top. Include the accompanying schema.yml tests asserting `order_id` is unique and not null.
Validation: Does the Cleaning Actually Work?
Every cleaning pipeline needs a validation layer. Ask AI to generate it alongside the pipeline:
For the cleaning pipeline you just wrote, generate a validation notebook that:
- Compares row counts before and after
- Asserts no negative values in `revenue`
- Asserts `signup_date` is between 2020-01-01 and today
- Shows a before/after histogram for numeric columns
- Shows a before/after value count for categorical columns
If any assertion fails, raise a clear error with the row that failed, not a generic message.
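A hand-rolled version of those checks is only a few lines of pandas; the design choice that matters is putting the offending rows in the error message. A sketch with hypothetical column names:

```python
import pandas as pd

def validate(clean: pd.DataFrame, raw_rows: int) -> None:
    """Fail loudly, showing the rows that broke each rule."""
    assert len(clean) <= raw_rows, "cleaning should never add rows"

    bad_rev = clean[clean["revenue"] < 0]
    if not bad_rev.empty:
        raise ValueError(f"negative revenue in these rows:\n{bad_rev}")

    out_of_range = clean[
        (clean["signup_date"] < "2020-01-01")
        | (clean["signup_date"] > pd.Timestamp.today())
    ]
    if not out_of_range.empty:
        raise ValueError(f"signup_date out of range:\n{out_of_range}")

clean = pd.DataFrame({
    "revenue": [10.0, 20.0],
    "signup_date": pd.to_datetime(["2023-05-01", "2024-07-09"]),
})
validate(clean, raw_rows=2)
print("all checks passed")
```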
Run this validation every time the raw data refreshes. It will catch upstream breakage before your dashboards do.
Anti-Patterns
A few cleaning approaches that feel quick but will hurt later:
- Dropping rows without logging. You will lose evidence of upstream problems.
- Blanket `.fillna(0)`. Treats missing as zero, which distorts averages and totals.
- Regex that over-matches. Always test on edge cases before applying to the whole column.
- One giant mega-function. Multiple small `.pipe()` steps are easier to debug and easier for AI to regenerate when rules change.
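The `.fillna(0)` distortion is easy to demonstrate on a toy column:

```python
import numpy as np
import pandas as pd

ages = pd.Series([30, 40, np.nan, np.nan])
print(ages.mean())            # 35.0 (pandas skips NaN by default)
print(ages.fillna(0).mean())  # 17.5 (the fake zeros drag the average down)
```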
Key Takeaways
- Start with an AI-generated data-quality report, then generate a pipeline to fix each issue
- Use `.pipe()` and explicit before/after counts so every transformation is auditable
- Let AI surface imputation trade-offs — do not let it silently pick a strategy
- Cleaning in SQL or dbt works the same way as pandas; the pattern transfers
- Always pair a cleaning pipeline with a validation step that fails loudly on bad input

