Exploratory Data Analysis with AI
Exploratory data analysis (EDA) is the part of the job that sets good analysts apart from great ones. It is about developing intuition for a dataset before you commit to a chart, a model, or a recommendation. AI cannot replace that intuition, but it can compress the manual work of calculating summary statistics, scanning for anomalies, and sketching first-draft charts from hours to minutes.
This lesson shows how to use AI as an EDA copilot: fast, systematic, and trustworthy.
What You'll Learn
- A repeatable EDA workflow you can run on any new dataset
- How to prompt AI for summary statistics, distributions, and correlations
- Using AI to find outliers and anomalies you would have missed
- Communicating EDA findings without dumping a wall of charts on stakeholders
The Five-Pass EDA Workflow
Here is a workflow that works for almost any new dataset. Run it in ChatGPT with code interpreter or Claude with the analysis tool, or translate it into your own Jupyter notebook.
Pass 1 — Shape and types
For the uploaded file, report:
- Shape (rows, columns)
- Column names and dtypes
- Memory usage
- Number of unique values per column
- Suggested dtype changes for memory savings (e.g., object → category)
This is five minutes of AI work that would otherwise be a long stretch of pressing tab in Jupyter.
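If you prefer to run this pass yourself, the same report is a few lines of pandas. A minimal sketch; the 50%-cardinality cutoff for the category suggestion is an arbitrary illustration, not a standard threshold:

```python
import pandas as pd

def shape_and_types_report(df: pd.DataFrame) -> pd.DataFrame:
    """Pass 1: per-column dtype, uniqueness, and memory profile."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "memory_bytes": df.memory_usage(deep=True, index=False),
    })
    # Suggest object -> category when cardinality is low relative to row count
    low_card = (report["dtype"] == "object") & (report["n_unique"] < 0.5 * len(df))
    report["suggestion"] = low_card.map({True: "convert to category", False: ""})
    return report

df = pd.DataFrame({
    "country": ["DE", "FR", "DE", "DE", "FR"],
    "amount": [10.0, 12.5, 9.9, 11.0, 10.5],
})
print(df.shape)  # (5, 2)
print(shape_and_types_report(df))
```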
Pass 2 — Missingness and duplicates
For each column:
- Count and percent of null values
- Pattern of missingness (are nulls concentrated in specific rows? are they correlated across columns?)
- Exact duplicate row count
- Near-duplicate detection on primary-key-like columns
Then suggest an imputation or exclusion strategy for each column with missing data, with reasoning.
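Pass 2 is equally mechanical in pandas. A sketch on a hypothetical id/email table; note that `duplicated` treats matching nulls as equal, which is what you want for exact-duplicate counting:

```python
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and percentage per column."""
    nulls = df.isna().sum()
    return pd.DataFrame({
        "n_null": nulls,
        "pct_null": (nulls / len(df) * 100).round(1),
    })

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "email": ["a@x.com", None, None, "c@x.com"],
})
print(missingness_report(df))
print("exact duplicate rows:", df.duplicated().sum())  # 1
# Near-duplicate check on a primary-key-like column
print("duplicated ids:", df["id"].duplicated().sum())  # 1
```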
Pass 3 — Univariate distributions
For each numeric column, show:
- Five-number summary (min, Q1, median, Q3, max)
- Mean and standard deviation
- Histogram with a reasonable bin count
- Detection of suspicious values (negative when positive expected, exact zeros when zero is implausible)
For each categorical column:
- Top 10 values with counts and percentages
- Number of unique values
- Flag columns where a single value accounts for more than 90% of rows
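In pandas, the numeric half of this pass is mostly `describe`, and the dominance flag is one `value_counts` call. A sketch on synthetic columns:

```python
import pandas as pd

order_value = pd.Series([3, 7, 8, 5, 12, 14, 21, 13, 18, -4], name="order_value")

# Five-number summary plus mean and standard deviation
print(order_value.describe()[["min", "25%", "50%", "75%", "max", "mean", "std"]])

# Suspicious values: negatives where only positive amounts make sense
suspicious = order_value[order_value < 0]
print("negatives:", suspicious.tolist())  # [-4]

# Dominance flag for a categorical column: top value above 90%
currency = pd.Series(["EUR"] * 95 + ["USD"] * 5, name="currency")
top_share = currency.value_counts(normalize=True).iloc[0]
print("dominated:", top_share > 0.9)  # True (EUR is 95%)
```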
Pass 4 — Bivariate relationships
Compute and visualize:
- Correlation matrix for numeric columns (Pearson for linear relationships, Spearman for monotonic or rank-based ones)
- Key crosstabs for categorical columns I care about: country x segment, product x country
- Scatter plots for pairs with correlation |r| > 0.3 to spot non-linearities
- Grouped boxplots for one numeric vs one categorical column
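The correlation shortlist translates directly to pandas. A sketch on synthetic columns; the |r| > 0.3 cutoff comes from the prompt above and is a screening heuristic, not a significance test:

```python
import pandas as pd

df = pd.DataFrame({
    "orders": [10, 20, 30, 40, 50],
    "revenue": [100, 210, 290, 400, 520],
    "returns": [5, 4, 6, 5, 4],
})

# Both correlation flavors from Pass 4
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Shortlist pairs with |r| > 0.3 as scatter-plot candidates
# (exclude the diagonal, where r is exactly 1.0)
strong = (pearson.abs() > 0.3) & (pearson.abs() < 1.0)
pairs = [(a, b) for a in df.columns for b in df.columns
         if a < b and strong.loc[a, b]]
print(pairs)  # [('orders', 'revenue')]
```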
Pass 5 — Time dynamics
If there is a timestamp column, show:
- Daily / weekly / monthly row counts (was data collection steady or spiky?)
- Key metrics over time (totals, averages by month)
- Day-of-week and hour-of-day patterns
- Anomalous days with unusually high or low values
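Pass 5 maps onto `resample` and a day-of-week groupby. A minimal sketch with synthetic data and one planted zero-count day; the |z| > 3 cutoff is the same screening rule used for outliers later in this lesson:

```python
import pandas as pd

idx = pd.date_range("2026-01-01", periods=60, freq="D")
orders = pd.Series(100 + 10 * (idx.dayofweek < 5), index=idx)  # weekday bump
orders.iloc[14] = 0  # one day where collection broke

# Was collection steady? Weekly totals and day-of-week averages
weekly = orders.resample("W").sum()
by_dow = orders.groupby(orders.index.dayofweek).mean()
print(by_dow.round(1))

# Anomalous days: simple z-score against the overall mean
z = (orders - orders.mean()) / orders.std()
anomalies = orders[z.abs() > 3]
print(anomalies)  # only the zero-count day
```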
Five passes, fifteen minutes, and you know more about the dataset than most analysts would after two hours of manual exploration.
Finding Outliers and Anomalies
AI is good at outlier detection because the techniques are standardized. A strong prompt:
Identify outliers in the numeric columns using both:
- The IQR rule (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR)
- Z-score rule (|z| > 3)
Report, for each column:
- How many outliers each method found
- The 10 most extreme values
- Whether the two methods agree or disagree
Also flag rows where multiple columns are simultaneously extreme — those are the most interesting.
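Both rules in the prompt are one-liners, and running them side by side shows why the prompt asks whether the methods agree: on small samples, a single extreme value inflates the standard deviation enough that the z-score rule can miss it. A sketch with one planted outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])  # 95 is the planted outlier

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: |z| > 3
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist())  # [95]
print(z_outliers.tolist())    # [] -- the outlier itself inflates the std
```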
For anomaly detection across time, use:
Given a daily time series of orders_count for the last 18 months, identify the 20 most anomalous days using seasonal decomposition (STL) plus residual z-scores. For each, report the date, the observed value, the expected value, and the z-score. Exclude known holidays I list below.
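The prompt asks for STL (for example `statsmodels.tsa.seasonal.STL`). As a dependency-free stand-in, day-of-week means can play the role of the seasonal component (STL would also model a trend term), with z-scores on the deseasonalized residuals. Everything below is synthetic, including the planted spike:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2025-01-01", periods=120, freq="D")
weekly_pattern = 30 * np.sin(np.arange(120) * 2 * np.pi / 7)  # weekly seasonality
orders_count = pd.Series(200 + weekly_pattern + rng.normal(0, 5, 120), index=idx)
orders_count.loc["2025-03-01"] += 150  # planted anomaly

# Seasonal component: day-of-week means (a crude stand-in for STL)
seasonal = orders_count.groupby(orders_count.index.dayofweek).transform("mean")
resid = orders_count - seasonal
z = (resid - resid.mean()) / resid.std()

# Report date, observed, expected, and z for the most anomalous days
report = pd.DataFrame({"observed": orders_count, "expected": seasonal, "z": z})
print(report.reindex(z.abs().nlargest(3).index).round(1))
```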
Asking "What Is Driving This?"
EDA is incomplete without hypotheses for what you are seeing. Try:
I see that March 2026 revenue is 22% higher than February 2026. Decompose the increase into:
- Volume effect (more orders)
- Price effect (higher average order value)
- Mix effect (shift toward higher-price products)
- Geography effect (shift in country distribution)
For each, show the percentage point contribution to the total increase and one sentence of interpretation.
This "decomposition" prompt is a secret weapon for explaining movement in any metric.
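The arithmetic behind a volume/price decomposition is simple enough to check by hand. A sketch with hypothetical February and March figures (mix and geography effects work the same way, one level of grouping deeper):

```python
# revenue = orders * average order value; decompose the month-over-month change
feb = {"orders": 1000, "aov": 50.0}  # hypothetical February figures
mar = {"orders": 1150, "aov": 53.0}  # hypothetical March figures

rev_feb = feb["orders"] * feb["aov"]  # 50,000
rev_mar = mar["orders"] * mar["aov"]  # 60,950
total = rev_mar - rev_feb             # +10,950 (~22%)

volume_effect = (mar["orders"] - feb["orders"]) * feb["aov"]  # more orders at old AOV
price_effect = (mar["aov"] - feb["aov"]) * feb["orders"]      # higher AOV on old volume
interaction = (mar["orders"] - feb["orders"]) * (mar["aov"] - feb["aov"])

assert volume_effect + price_effect + interaction == total  # decomposition is exact
print(f"volume {volume_effect / rev_feb:+.1%}, price {price_effect / rev_feb:+.1%}, "
      f"interaction {interaction / rev_feb:+.1%}")
```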
Finding Patterns You Would Have Missed
A good EDA move is to ask the AI to generate hypotheses you may not have considered:
Here is the univariate and bivariate analysis I just ran. Based on the patterns visible, list 10 hypotheses I should investigate further — ranked by how surprising and actionable they would be if true. For each, explain the pattern that suggested the hypothesis and the specific follow-up analysis I should run.
You will get suggestions like "purchases drop sharply at 30 days since signup — check whether trial ends match that pattern" or "90% of refunds come from 4% of products — investigate those products."
Avoiding Wall-of-Chart Syndrome
A classic mistake is to paste 30 charts into a report. Stakeholders cannot absorb that. Use AI to prune:
Here are 22 EDA findings and charts. For a VP of sales who has five minutes, pick the three most important findings to lead with. For each, explain in one sentence why it matters and what the implication is.
The AI will force-rank the findings, highlighting the three that deserve airtime. You keep the other 19 in an appendix.
Building Intuition with Sample Stories
One of the best EDA prompts is to ask for stories at the row level:
Pick five "interesting" customers from this dataset and write a short paragraph describing each: who they are (demographics), what they did (behavior), and why they are interesting (high value, churned, anomalous pattern). Use real row data, not fictional examples.
This anchors your understanding of the dataset in concrete examples and gives you quotes you can use in the final report.
When EDA Signals a Data Problem
Sometimes EDA reveals that the data itself is broken upstream:
- Row counts drop to zero for a day — check the ETL logs
- A metric shifts by 10x on a specific date — suspect a schema change
- Nulls appear only after a certain date — a field was renamed or removed
- Categorical values explode in count — someone changed a dropdown to free text
Prompt AI to flag these:
Review this dataset and list any patterns that suggest an upstream data pipeline problem rather than a business change. Include the date of the anomaly and the specific signal that made you suspect a pipeline issue.
This saves you from publishing a report based on broken data.
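Two of these signals, zero-row days and a null-rate step change, are easy to pre-screen in pandas before involving the AI at all. A sketch on hypothetical daily pipeline stats; the 20-point jump threshold is an arbitrary illustration:

```python
import pandas as pd

# Hypothetical daily pipeline stats for one table
daily = pd.DataFrame({
    "rows": [980, 1010, 0, 995, 1005],
    "email_null_pct": [1.2, 1.1, 0.0, 48.0, 51.0],
}, index=pd.date_range("2026-03-01", periods=5, freq="D"))

# Zero-row days: collection outage, check the ETL logs
zero_days = daily.index[daily["rows"] == 0]

# Null-rate step change: a field was likely renamed or dropped upstream
jump = daily["email_null_pct"].diff() > 20
onset = daily.index[jump]

print("zero-row days:", list(zero_days.date))   # [2026-03-03]
print("null-rate jump on:", list(onset.date))   # [2026-03-04]
```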
Ending EDA: The Hypothesis Shortlist
A good EDA produces a hypothesis shortlist, not a dump. Prompt:
Based on everything above, produce a one-page summary with:
- Three confirmed findings (strong evidence)
- Three hypotheses worth testing (suggestive evidence)
- Two open questions (insufficient evidence, need more data or context)
- One upstream data concern worth flagging
Use bullet points. No charts. Every bullet should be a single sentence.
That one page is what you share with your manager. The 22 charts live in an appendix notebook.
Key Takeaways
- Use the five-pass workflow: shape, missingness, univariate, bivariate, time
- AI finds outliers systematically using IQR + z-score + cross-column detection
- Ask for decomposition prompts to explain metric movement
- Have AI generate surprising hypotheses you would have missed
- End every EDA with a one-page shortlist, not a 30-chart dump

