Build a Mini Data Analysis Project
So far you have learned the building blocks. This lesson is where you put them together. You will load a real dataset, ask three real business questions, and produce three charts that answer them. The whole project takes 30 to 60 minutes and is short enough to drop into a portfolio or a job application.
What You'll Learn
- How to structure a data analysis project from start to finish
- The "question first, code second" workflow
- How to use AI as a project partner without losing your reasoning
- How to summarize findings the way a data scientist would
The Dataset
We will use the Titanic dataset — well known, small, and rich enough to ask interesting questions. You can load it in one line:
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()
You will see columns like survived, pclass, sex, age, fare, embarked, class, who, adult_male, alive. 891 rows in total.
Step 1: Inspect Before You Touch
Always begin by looking at what you have.
df.shape
df.info()
df.describe()
df.isna().sum()
You will notice that age has 177 missing values, and embarked has a couple. Decide what to do about those — drop them, fill them, or carry them through. Document your choice.
For this project we will keep age as-is (pandas skips NaN for .mean() automatically) and ignore the two missing embarked values.
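If you want to see the handling options side by side before committing, here is a quick sketch (the counts assume the full 891-row dataset):

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Option 1: drop rows where age is missing (loses 177 rows)
dropped = df.dropna(subset=["age"])

# Option 2: fill missing ages with the median (keeps rows, flattens the distribution)
filled = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 3 (our choice): carry NaN through -- pandas skips it in aggregations
print(len(df), len(dropped))      # 891 vs 714
print(df["age"].mean().round(1))  # mean computed over the 714 known ages
```

Whichever option you pick, state it in a markdown cell so a reader knows the choice was deliberate.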
Step 2: Pick Three Questions
Before you write any code, write three questions in plain English. These guide everything else.
For Titanic, three good ones:
- Did women have a higher survival rate than men?
- Did passenger class affect survival?
- Did age affect survival, and did the effect differ by sex?
Pick questions you actually want answers to. The discipline of writing them down first stops you from doing aimless plotting.
Step 3: Answer Each Question with Code + Chart
Question 1: Survival by sex
import matplotlib.pyplot as plt
survival_by_sex = df.groupby("sex")["survived"].mean()
print(survival_by_sex)
ax = survival_by_sex.plot(kind="bar", figsize=(6, 4), color=["#ff6b6b", "#4ecdc4"])
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by sex on the Titanic")
ax.set_ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
You will see female survival around 74 percent and male around 19 percent. Striking.
Question 2: Survival by class
survival_by_class = df.groupby("class")["survived"].mean().reindex(["First", "Second", "Third"])
print(survival_by_class)
ax = survival_by_class.plot(kind="bar", figsize=(6, 4), color="#3a86ff")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by passenger class")
ax.set_ylim(0, 1)
plt.xticks(rotation=0)
plt.show()
First class around 63 percent, second around 47 percent, third around 24 percent. Class matters.
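The two effects combine, and it is worth checking how before you write up. A pivot table (a natural extension of the groupby calls above) gives the survival rate for every sex and class combination:

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Survival rate for each sex x class cell of the table
rates = df.pivot_table(values="survived", index="sex", columns="class", observed=True)
print(rates.round(2))
```

You should see first-class women near 97 percent and third-class men near 14 percent, which is the "class amplified the gap" finding in numbers.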
Question 3: Age and sex combined
import seaborn as sns
plt.figure(figsize=(10, 5))
sns.violinplot(data=df, x="sex", y="age", hue="survived", split=True, palette="Set2")
plt.title("Age distribution by sex, split by survival")
plt.show()
A violin plot shows the distribution. You will see that young children (under 10) survived at much higher rates regardless of sex, and that older men were the least likely to survive. Compare the shapes.
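The violin plot is visual evidence; a quick groupby puts numbers on the child-survival claim (using the same under-10 cutoff, and noting that rows with missing age drop out of the comparison):

```python
import seaborn as sns

df = sns.load_dataset("titanic")

# Restrict to the 714 rows where age is known, then split on the under-10 cutoff
known_age = df.dropna(subset=["age"])
is_child = (known_age["age"] < 10).rename("child")

# Survival rate by sex, children vs everyone else
rates = known_age.groupby(["sex", is_child])["survived"].mean()
print(rates.round(2))
```

In both sexes the child rows should come out well above the adult rows, which is the numeric version of "compare the shapes."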
Step 4: Write Your Findings
Open a markdown cell in Colab (Insert → Text cell) and write a short summary. Three or four bullet points, no more:
- Sex was the strongest predictor of survival. Women survived at roughly 74 percent, men at roughly 19 percent — a gap of 55 points.
- Class amplified the gap. Among first-class passengers, the survival rate was 63 percent; among third-class, only 24 percent.
- Children survived at higher rates regardless of sex. Passengers under 10 had visibly higher survival in both sexes.
- Limitations. Age is missing for 177 passengers, so the age-based conclusion is based on the 80 percent of records where age is known.
Notice the structure: each finding is one sentence, with the number that supports it. The "limitations" bullet is what separates an analysis from a stunt.
Step 5: Use AI as a Reviewer
After you have a draft of the analysis, ask AI to critique it. Use this prompt:
I just finished a beginner data analysis project on the Titanic dataset. Below is my code and my written summary. Please review it as a senior data scientist would. Tell me:
- Are my findings supported by the analysis?
- What did I miss?
- What is one chart I should add?
- What would you say to a stakeholder reading this for the first time?
Code:
[paste]

Summary:
[paste]
You will get suggestions like "consider adding a scatter showing fare vs survival probability" or "your child survival claim is based on a small subset — flag the sample size." Apply the feedback you agree with.
What Makes This Project Portfolio-Ready
A working data scientist would expect five things in a project this size:
- A clear question. Not "explore the data" — three specific questions.
- Evidence for each finding. A number, a table, or a chart.
- Honest limitations. Missing data, sample size issues, things you could not test.
- Reproducible code. Anyone can run your notebook and get the same results.
- A short writeup. Two paragraphs at most.
Drop those into a notebook, save it to GitHub, and you have a project to point to in a job application.
Step 6: Save Your Work
In Colab: File → Save a copy in GitHub to push it to a repository, or File → Download → .ipynb for a local copy.
Add a README.md to the GitHub repo with your three questions and the bullet-point findings. That is the version a recruiter or grader will read first.
Try Three More Datasets
Once Titanic is done, repeat the workflow on:
- sns.load_dataset("tips") — restaurant tips data, simpler than Titanic
- sns.load_dataset("penguins") — three species of penguins, great for comparison plots
- sns.load_dataset("diamonds") — bigger dataset, good for histograms and scatter plots
Each one takes 30 to 60 minutes once you have the workflow down. Three datasets in a weekend is a real portfolio.
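The Step 1 inspection works identically on all three, so one loop gives you a first look at each before you pick your questions:

```python
import seaborn as sns

# First look at each follow-up dataset: size and missing-value count
for name in ["tips", "penguins", "diamonds"]:
    data = sns.load_dataset(name)
    print(name, data.shape, "| missing values:", data.isna().sum().sum())
```

From there the workflow is the same: three questions, three charts, a bulleted summary with limitations.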
A Common Trap: Letting AI Write the Whole Project
You can ask ChatGPT or Claude "write a Titanic analysis," and it will spit out a complete notebook. Resist that. The point is for you to internalize the workflow.
A good middle ground: write the questions yourself, write a first draft of the code yourself, then ask AI to review and improve it. The pattern is AI as reviewer, not author — the same way a senior engineer reviews a junior's code rather than writing it for them.
Key Takeaways
- Always inspect data before touching it: shape, info, describe, isna().sum()
- Write three plain-English questions before any code
- One chart per question; one sentence per finding
- Include limitations — missing data, small samples, untested confounders
- Use AI as a reviewer; the workflow itself must come from you
- Three short projects in a weekend is a real portfolio you can share