Exploratory Data Analysis with AI
Exploratory data analysis (EDA) is the part of the job that sets good analysts apart from great ones. It is about developing intuition for a dataset before you commit to a chart, a model, or a recommendation. AI cannot replace that intuition, but it can compress the manual work of calculating summary statistics, scanning for anomalies, and sketching first-draft charts from hours to minutes.
This lesson shows how to use AI as an EDA copilot: fast, systematic, and trustworthy.
What You'll Learn
- A repeatable EDA workflow you can run on any new dataset
- How to prompt AI for summary statistics, distributions, and correlations
- Using AI to find outliers and anomalies you would have missed
- Communicating EDA findings without dumping a wall of charts on stakeholders
The Five-Pass EDA Workflow
Here is a workflow that works for almost any new dataset. Run it in ChatGPT with code interpreter or Claude with the analysis tool, or translate it into your own Jupyter notebook.
Pass 1 — Shape and types
For the uploaded file, report:
- Shape (rows, columns)
- Column names and dtypes
- Memory usage
- Number of unique values per column
- Suggested dtype changes for memory savings (e.g., object → category)
This is five minutes of AI work that would otherwise be a long stretch of pressing tab in Jupyter.
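If you prefer to run this pass yourself, the same report is a few lines of pandas. A minimal sketch; the 50%-cardinality cutoff for the category suggestion is an arbitrary illustration, not a standard threshold:

```python
import pandas as pd

def shape_and_types_report(df: pd.DataFrame) -> pd.DataFrame:
    """Pass 1: per-column dtype, uniqueness, and memory profile."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "memory_bytes": df.memory_usage(deep=True, index=False),
    })
    # Suggest object -> category when cardinality is low relative to row count
    low_card = (report["dtype"] == "object") & (report["n_unique"] < 0.5 * len(df))
    report["suggestion"] = low_card.map({True: "convert to category", False: ""})
    return report

df = pd.DataFrame({
    "country": ["DE", "FR", "DE", "DE", "FR"],
    "amount": [10.0, 12.5, 9.9, 11.0, 10.5],
})
print(df.shape)  # (5, 2)
print(shape_and_types_report(df))
```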
Pass 2 — Missingness and duplicates
For each column:
- Count and percent of null values
- Pattern of missingness (are nulls concentrated in specific rows? are they correlated across columns?)
- Exact duplicate row count
- Near-duplicate detection on primary-key-like columns
Then suggest an imputation or exclusion strategy for each column with missing data, with reasoning.
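Pass 2 is equally mechanical in pandas. A sketch on a hypothetical id/email table; note that `duplicated` treats matching nulls as equal, which is what you want for exact-duplicate counting:

```python
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and percentage per column."""
    nulls = df.isna().sum()
    return pd.DataFrame({
        "n_null": nulls,
        "pct_null": (nulls / len(df) * 100).round(1),
    })

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "email": ["a@x.com", None, None, "c@x.com"],
})
print(missingness_report(df))
print("exact duplicate rows:", df.duplicated().sum())  # 1
# Near-duplicate check on a primary-key-like column
print("duplicated ids:", df["id"].duplicated().sum())  # 1
```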
Pass 3 — Univariate distributions
For each numeric column, show:
- Five-number summary (min, Q1, median, Q3, max)
- Mean and standard deviation
- Histogram with a reasonable bin count
- Detection of suspicious values (negative when positive expected, exact zeros when zero is implausible)
For each categorical column:
- Top 10 values with counts and percentages
- Number of unique values
- Flag columns where a single value accounts for more than 90% of rows
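In pandas, the numeric half of this pass is mostly `describe`, and the dominance flag is one `value_counts` call. A sketch on synthetic columns:

```python
import pandas as pd

order_value = pd.Series([3, 7, 8, 5, 12, 14, 21, 13, 18, -4], name="order_value")

# Five-number summary plus mean and standard deviation
print(order_value.describe()[["min", "25%", "50%", "75%", "max", "mean", "std"]])

# Suspicious values: negatives where only positive amounts make sense
suspicious = order_value[order_value < 0]
print("negatives:", suspicious.tolist())  # [-4]

# Dominance flag for a categorical column: top value above 90%
currency = pd.Series(["EUR"] * 95 + ["USD"] * 5, name="currency")
top_share = currency.value_counts(normalize=True).iloc[0]
print("dominated:", top_share > 0.9)  # True (EUR is 95%)
```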
Pass 4 — Bivariate relationships
Compute and visualize:
- Correlation matrix for numeric columns (Pearson for linear relationships, Spearman for monotonic or rank-based ones)
- Key crosstabs for categorical columns I care about: country x segment, product x country
- Scatter plots for pairs with correlation |r| > 0.3 to spot non-linearities
- Grouped boxplots for one numeric vs one categorical column
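The correlation shortlist translates directly to pandas. A sketch on synthetic columns; the |r| > 0.3 cutoff comes from the prompt above and is a screening heuristic, not a significance test:

```python
import pandas as pd

df = pd.DataFrame({
    "orders": [10, 20, 30, 40, 50],
    "revenue": [100, 210, 290, 400, 520],
    "returns": [5, 4, 6, 5, 4],
})

# Both correlation flavors from Pass 4
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Shortlist pairs with |r| > 0.3 as scatter-plot candidates
# (exclude the diagonal, where r is exactly 1.0)
strong = (pearson.abs() > 0.3) & (pearson.abs() < 1.0)
pairs = [(a, b) for a in df.columns for b in df.columns
         if a < b and strong.loc[a, b]]
print(pairs)  # [('orders', 'revenue')]
```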
Pass 5 — Time dynamics
If there is a timestamp column, show:
- Daily / weekly / monthly row counts (was data collection steady or spiky?)
- Key metrics over time (totals, averages by month)
- Day-of-week and hour-of-day patterns
- Anomalous days with unusually high or low values
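Pass 5 maps onto `resample` and a day-of-week groupby. A minimal sketch with synthetic data and one planted zero-count day; the |z| > 3 cutoff is the same screening rule used for outliers later in this lesson:

```python
import pandas as pd

idx = pd.date_range("2026-01-01", periods=60, freq="D")
orders = pd.Series(100 + 10 * (idx.dayofweek < 5), index=idx)  # weekday bump
orders.iloc[14] = 0  # one day where collection broke

# Was collection steady? Weekly totals and day-of-week averages
weekly = orders.resample("W").sum()
by_dow = orders.groupby(orders.index.dayofweek).mean()
print(by_dow.round(1))

# Anomalous days: simple z-score against the overall mean
z = (orders - orders.mean()) / orders.std()
anomalies = orders[z.abs() > 3]
print(anomalies)  # only the zero-count day
```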
Five passes, fifteen minutes, and you know more about the dataset than most analysts would after two hours of manual exploration.
Finding Outliers and Anomalies
AI is good at outlier detection because the techniques are standardized. A strong prompt:
Identify outliers in the numeric columns using both:
- The IQR rule (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR)
- Z-score rule (|z| > 3)
Report, for each column:
- How many outliers each method found
- The 10 most extreme values
- Whether the two methods agree or disagree
Also flag rows where multiple columns are simultaneously extreme — those are the most interesting.
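Both rules in the prompt are one-liners, and running them side by side shows why the prompt asks whether the methods agree: on small samples, a single extreme value inflates the standard deviation enough that the z-score rule can miss it. A sketch with one planted outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])  # 95 is the planted outlier

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: |z| > 3
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist())  # [95]
print(z_outliers.tolist())    # [] -- the outlier itself inflates the std
```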
For anomaly detection across time, use:
Given a daily time series of orders_count for the last 18 months, identify the 20 most anomalous days using seasonal decomposition (STL) plus residual z-scores. For each, report the date, the observed value, the expected value, and the z-score. Exclude known holidays I list below.
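The prompt asks for STL (for example `statsmodels.tsa.seasonal.STL`). As a dependency-free stand-in, day-of-week means can play the role of the seasonal component (STL would also model a trend term), with z-scores on the deseasonalized residuals. Everything below is synthetic, including the planted spike:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2025-01-01", periods=120, freq="D")
weekly_pattern = 30 * np.sin(np.arange(120) * 2 * np.pi / 7)  # weekly seasonality
orders_count = pd.Series(200 + weekly_pattern + rng.normal(0, 5, 120), index=idx)
orders_count.loc["2025-03-01"] += 150  # planted anomaly

# Seasonal component: day-of-week means (a crude stand-in for STL)
seasonal = orders_count.groupby(orders_count.index.dayofweek).transform("mean")
resid = orders_count - seasonal
z = (resid - resid.mean()) / resid.std()

# Report date, observed, expected, and z for the most anomalous days
report = pd.DataFrame({"observed": orders_count, "expected": seasonal, "z": z})
print(report.reindex(z.abs().nlargest(3).index).round(1))
```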
Asking "What Is Driving This?"
EDA is incomplete without hypotheses for what you are seeing. Try:
I see that March 2026 revenue is 22% higher than February 2026. Decompose the increase into:
- Volume effect (more orders)
- Price effect (higher average order value)
- Mix effect (shift toward higher-price products)
- Geography effect (shift in country distribution)
For each, show the percentage point contribution to the total increase and one sentence of interpretation.
This "decomposition" prompt is a secret weapon for explaining movement in any metric.
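The arithmetic behind a volume/price decomposition is simple enough to check by hand. A sketch with hypothetical February and March figures (mix and geography effects work the same way, one level of grouping deeper):

```python
# revenue = orders * average order value; decompose the month-over-month change
feb = {"orders": 1000, "aov": 50.0}  # hypothetical February figures
mar = {"orders": 1150, "aov": 53.0}  # hypothetical March figures

rev_feb = feb["orders"] * feb["aov"]  # 50,000
rev_mar = mar["orders"] * mar["aov"]  # 60,950
total = rev_mar - rev_feb             # +10,950 (~22%)

volume_effect = (mar["orders"] - feb["orders"]) * feb["aov"]  # more orders at old AOV
price_effect = (mar["aov"] - feb["aov"]) * feb["orders"]      # higher AOV on old volume
interaction = (mar["orders"] - feb["orders"]) * (mar["aov"] - feb["aov"])

assert volume_effect + price_effect + interaction == total  # decomposition is exact
print(f"volume {volume_effect / rev_feb:+.1%}, price {price_effect / rev_feb:+.1%}, "
      f"interaction {interaction / rev_feb:+.1%}")
```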
Finding Patterns You Would Have Missed
A good EDA move is to ask the AI to generate hypotheses you may not have considered:
Here is the univariate and bivariate analysis I just ran. Based on the patterns visible, list 10 hypotheses I should investigate further — ranked by how surprising and actionable they would be if true. For each, explain the pattern that suggested the hypothesis and the specific follow-up analysis I should run.
You will get suggestions like "purchases drop sharply at 30 days since signup — check whether trial ends match that pattern" or "90% of refunds come from 4% of products — investigate those products."
Avoiding Wall-of-Chart Syndrome
A classic mistake is to paste 30 charts into a report. Stakeholders cannot absorb that. Use AI to prune:
Here are 22 EDA findings and charts. For a VP of sales who has five minutes, pick the three most important findings to lead with. For each, explain in one sentence why it matters and what the implication is.
The AI will force-rank the findings, highlighting the three that deserve airtime. You keep the other 19 in an appendix.
Building Intuition with Sample Stories
One of the best EDA prompts is to ask for stories at the row level:
Pick five "interesting" customers from this dataset and write a short paragraph describing each: who they are (demographics), what they did (behavior), and why they are interesting (high value, churned, anomalous pattern). Use real row data, not fictional examples.
This anchors your understanding of the dataset in concrete examples and gives you quotes you can use in the final report.
When EDA Signals a Data Problem
Sometimes EDA reveals that the data itself is broken upstream:
- Row counts drop to zero for a day — check the ETL logs
- A metric shifts by 10x on a specific date — suspect a schema change
- Nulls appear only after a certain date — a field was renamed or removed
- Categorical values explode in count — someone changed a dropdown to free text
Prompt AI to flag these:
Review this dataset and list any patterns that suggest an upstream data pipeline problem rather than a business change. Include the date of the anomaly and the specific signal that made you suspect a pipeline issue.
This saves you from publishing a report based on broken data.
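Two of these signals, zero-row days and a null-rate step change, are easy to pre-screen in pandas before involving the AI at all. A sketch on hypothetical daily pipeline stats; the 20-point jump threshold is an arbitrary illustration:

```python
import pandas as pd

# Hypothetical daily pipeline stats for one table
daily = pd.DataFrame({
    "rows": [980, 1010, 0, 995, 1005],
    "email_null_pct": [1.2, 1.1, 0.0, 48.0, 51.0],
}, index=pd.date_range("2026-03-01", periods=5, freq="D"))

# Zero-row days: collection outage, check the ETL logs
zero_days = daily.index[daily["rows"] == 0]

# Null-rate step change: a field was likely renamed or dropped upstream
jump = daily["email_null_pct"].diff() > 20
onset = daily.index[jump]

print("zero-row days:", list(zero_days.date))   # [2026-03-03]
print("null-rate jump on:", list(onset.date))   # [2026-03-04]
```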
Ending EDA: The Hypothesis Shortlist
A good EDA produces a hypothesis shortlist, not a dump. Prompt:
Based on everything above, produce a one-page summary with:
- Three confirmed findings (strong evidence)
- Three hypotheses worth testing (suggestive evidence)
- Two open questions (insufficient evidence, need more data or context)
- One upstream data concern worth flagging
Use bullet points. No charts. Every bullet should be a single sentence.
That one page is what you share with your manager. The 22 charts live in an appendix notebook.
Key Takeaways
- Use the five-pass workflow: shape, missingness, univariate, bivariate, time
- AI finds outliers systematically using IQR + z-score + cross-column detection
- Ask for decomposition prompts to explain metric movement
- Have AI generate surprising hypotheses you would have missed
- End every EDA with a one-page shortlist, not a 30-chart dump

