Build a Mini Data Analysis Project
So far you have learned the building blocks. This lesson is where you put them together. You will load a real dataset, ask three real business questions, and produce three charts that answer them. The whole project takes 30 to 60 minutes and is short enough to drop into a portfolio or a job application.
What You'll Learn
- How to structure a data analysis project from start to finish
- The "question first, code second" workflow
- How to use AI as a project partner without losing your reasoning
- How to summarize findings the way a data scientist would
The Dataset
We will use the Titanic dataset — well known, small, and rich enough to ask interesting questions. You can load it in one line:
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()
You will see columns like survived, pclass, sex, age, fare, embarked, class, who, adult_male, alive. 891 rows in total.
Step 1: Inspect Before You Touch
Always begin by looking at what you have.
df.shape
df.info()
df.describe()
df.isna().sum()
You will notice that age has 177 missing values, and embarked has a couple. Decide what to do about those — drop them, fill them, or carry them through. Document your choice.
For this project we will keep age as-is (pandas skips NaN for .mean() automatically) and ignore the two missing embarked values.
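If you want to see the handling options side by side before committing, here is a quick sketch (the counts assume the full 891-row dataset):

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Option 1: drop rows where age is missing (loses 177 rows)
dropped = df.dropna(subset=["age"])

# Option 2: fill missing ages with the median (keeps rows, flattens the distribution)
filled = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 3 (our choice): carry NaN through -- pandas skips it in aggregations
print(len(df), len(dropped))      # 891 vs 714
print(df["age"].mean().round(1))  # mean computed over the 714 known ages
```

Whichever option you pick, state it in a markdown cell so a reader knows the choice was deliberate.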
Step 2: Pick Three Questions
Before you write any code, write three questions in plain English. These guide everything else.
For Titanic, three good ones:
- Did women have a higher survival rate than men?
- Did passenger class affect survival?
- Did age affect survival, and did the effect differ by sex?
Pick questions you actually want answers to. The discipline of writing them down first stops you from doing aimless plotting.
Step 3: Answer Each Question with Code + Chart
Question 1: Survival by sex
import matplotlib.pyplot as plt
survival_by_sex = df.groupby("sex")["survived"].mean()
print(survival_by_sex)
ax = survival_by_sex.plot(kind="bar", figsize=(6, 4), color=["#ff6b6b", "#4ecdc4"])
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by sex on the Titanic")
ax.set_ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
You will see female survival around 74 percent and male around 19 percent. Striking.
Question 2: Survival by class
survival_by_class = df.groupby("class")["survived"].mean().reindex(["First", "Second", "Third"])
print(survival_by_class)
ax = survival_by_class.plot(kind="bar", figsize=(6, 4), color="#3a86ff")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by passenger class")
ax.set_ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
First class around 63 percent, second around 47 percent, third around 24 percent. Class matters.
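The two effects combine, and it is worth checking how before you write up. A pivot table (a natural extension of the groupby calls above) gives the survival rate for every sex and class combination:

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Survival rate for each sex x class cell of the table
rates = df.pivot_table(values="survived", index="sex", columns="class", observed=True)
print(rates.round(2))
```

You should see first-class women near 97 percent and third-class men near 14 percent, which is the "class amplified the gap" finding in numbers.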
Question 3: Age and sex combined
import seaborn as sns
plt.figure(figsize=(10, 5))
sns.violinplot(data=df, x="sex", y="age", hue="survived", split=True, palette="Set2")
plt.title("Age distribution by sex, split by survival")
plt.show()
A violin plot shows the distribution. You will see that young children (under 10) survived at much higher rates regardless of sex, and that older men were the least likely to survive. Compare the shapes.
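The violin plot is visual evidence; a quick groupby puts numbers on the child-survival claim (using the same under-10 cutoff, and noting that rows with missing age drop out of the comparison):

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Restrict to the 714 rows where age is known, then split on the under-10 cutoff
known_age = df.dropna(subset=["age"])
is_child = (known_age["age"] < 10).rename("child")

# Survival rate by sex, children vs everyone else
rates = known_age.groupby(["sex", is_child])["survived"].mean()
print(rates.round(2))
```

In both sexes the child rows should come out well above the adult rows, which is the numeric version of "compare the shapes."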
Step 4: Write Your Findings
Open a markdown cell in Colab (Insert → Text cell) and write a short summary. Three or four bullet points, no more:
- Sex was the strongest predictor of survival. Women survived at roughly 74 percent, men at roughly 19 percent — a gap of 55 points.
- Class amplified the gap. Among first-class passengers, the survival rate was 63 percent; among third-class, only 24 percent.
- Children survived at higher rates regardless of sex. Passengers under 10 had visibly higher survival in both sexes.
- Limitations. Age is missing for 177 passengers, so the age-based conclusion is based on the 80 percent of records where age is known.
Notice the structure: each finding is one sentence, with the number that supports it. The "limitations" bullet is what separates an analysis from a stunt.
Step 5: Use AI as a Reviewer
After you have a draft of the analysis, ask AI to critique it. Use this prompt:
I just finished a beginner data analysis project on the Titanic dataset. Below is my code and my written summary. Please review it as a senior data scientist would. Tell me:
- Are my findings supported by the analysis?
- What did I miss?
- What is one chart I should add?
- What would you say to a stakeholder reading this for the first time?
Code:
[paste]

Summary:
[paste]
You will get suggestions like "consider adding a scatter showing fare vs survival probability" or "your child survival claim is based on a small subset — flag the sample size." Apply the feedback you agree with.
What Makes This Project Portfolio-Ready
A working data scientist would expect five things in a project this size:
- A clear question. Not "explore the data" — three specific questions.
- Evidence for each finding. A number, a table, or a chart.
- Honest limitations. Missing data, sample size issues, things you could not test.
- Reproducible code. Anyone can run your notebook and get the same results.
- A short writeup. Two paragraphs at most.
Drop those into a notebook, save it to GitHub, and you have a project to point to in a job application.
Step 6: Save Your Work
In Colab: File → Save a copy in GitHub to push it to a repository, or File → Download → .ipynb for a local copy.
Add a README.md to the GitHub repo with your three questions and the bullet-point findings. That is the version a recruiter or grader will read first.
Try Three More Datasets
Once Titanic is done, repeat the workflow on:
- sns.load_dataset("tips") — restaurant tips data, simpler than Titanic
- sns.load_dataset("penguins") — three species of penguins, great for comparison plots
- sns.load_dataset("diamonds") — bigger dataset, good for histograms and scatter plots
Each one takes 30 to 60 minutes once you have the workflow down. Three datasets in a weekend is a real portfolio.
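The Step 1 inspection works identically on all three, so one loop gives you a first look at each before you pick your questions:

```python
import seaborn as sns

# First look at each follow-up dataset: size and missing-value count
for name in ["tips", "penguins", "diamonds"]:
    data = sns.load_dataset(name)
    print(name, data.shape, "| missing values:", data.isna().sum().sum())
```

From there the workflow is the same: three questions, three charts, a bulleted summary with limitations.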
A Common Trap: Letting AI Write the Whole Project
You can ask ChatGPT or Claude "write a Titanic analysis," and it will spit out a complete notebook. Resist that. The point is for you to internalize the workflow.
A good middle ground: write the questions yourself, write a first draft of the code yourself, then ask AI to review and improve it. The pattern is AI as reviewer, not author — the same way a senior engineer reviews a junior's code rather than writing it for them.
Key Takeaways
- Always inspect data before touching it: shape, info, describe, isna().sum()
- Write three plain-English questions before any code
- One chart per question; one sentence per finding
- Include limitations — missing data, small samples, untested confounders
- Use AI as a reviewer; the workflow itself must come from you
- Three short projects in a weekend is a real portfolio you can share