Bias & Fairness Case Studies
Theory is helpful. Cases are sticky. In this lesson, you'll walk through six famous AI bias incidents — what happened, why it happened, what was fixed, and what you can learn from it. By the end, you'll have a vocabulary for talking about bias that recruiters, professors, and regulators take seriously.
What You'll Learn
- Six landmark AI bias cases worth knowing by name
- Patterns that show up across industries
- Specific lessons each case teaches a junior practitioner
- How to use these stories in interviews and policy discussions
Case 1: Amazon's Resume-Screening AI (2014–2018)
What happened: Amazon built an internal AI to score resumes 1–5 stars. It learned from 10 years of resumes, mostly from men. It downgraded resumes that contained the word "women's" (as in "women's chess club captain"), penalized graduates of two all-women colleges, and favored masculine-coded language.
Why: Historical hiring patterns at Amazon had skewed male. The model learned that "successful candidates look like past hires" — and past hires were mostly men.
What was fixed: Amazon scrapped the project entirely.
Lesson: When you train AI on historical decisions, you teach it to repeat past discrimination. "Just remove the gender column" doesn't help — the model finds proxies (sports clubs, college names, word choices).
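One practical check for proxies: drop the protected column, then see how well the remaining features predict it. The sketch below uses a tiny invented resume table (not Amazon's data); all column names and values are hypothetical.

```python
# Hypothetical proxy audit with a tiny invented dataset: if the remaining
# features can predict the column you removed, the model can still "see" it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "gender":     ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "college":    ["all_womens", "state", "all_womens", "state", "all_womens", "tech",
                   "all_womens", "state", "all_womens", "tech", "all_womens", "state"],
    "chess_club": ["womens", "open", "womens", "open", "womens", "open",
                   "womens", "open", "womens", "open", "womens", "open"],
    "years_exp":  [3, 4, 2, 5, 4, 3, 6, 2, 3, 5, 4, 4],
})

# Drop the protected column, then try to recover it from what is left.
y = (df["gender"] == "F").astype(int)
X = pd.get_dummies(df.drop(columns=["gender"]))

proxy_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=3, scoring="roc_auc"
).mean()

print(f"'Removed' attribute recoverable with AUC = {proxy_auc:.2f}")
# AUC near 0.5: little proxy signal. AUC near 1.0: the attribute is still
# fully encoded in features like club names and colleges.
```

If a simple classifier can recover the dropped attribute almost perfectly, deleting the column changed nothing about what the resume model can infer.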
Case 2: COMPAS Recidivism Algorithm (2016)
What happened: ProPublica investigated COMPAS, a tool used in U.S. courts to predict the risk that a defendant would re-offend. The journalists found that Black defendants were nearly twice as likely as white defendants to be falsely flagged as high-risk; white defendants were more likely to be falsely flagged as low-risk.
Why: The model used variables that correlated with race even though race was not a direct input. It also used historical arrest data — but arrests do not equal crimes; they reflect policing patterns.
What was fixed: Several U.S. states have since restricted use of risk-assessment AI in sentencing. The case is studied in every algorithmic fairness course.
Lesson: Proxy variables matter. And different definitions of fairness (equal false-positive rates, equal accuracy, equal precision) can mathematically conflict.
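The conflict is easy to see with invented numbers. In the sketch below (made-up counts, not real COMPAS data), the same classifier gives both groups equal precision among people flagged high-risk, yet the group with the higher base rate ends up with a false-positive rate eight times higher. Equalizing one fairness definition leaves the other unequal.

```python
# Toy counts (invented, not COMPAS data). Both groups see the same precision
# among people flagged "high risk", yet very different false-positive rates,
# purely because their base rates in the historical data differ.

def fairness_metrics(tp, fp, tn, fn):
    fpr = fp / (fp + tn)        # share of non-re-offenders wrongly flagged
    precision = tp / (tp + fp)  # share of flagged people who actually re-offend
    return fpr, precision

# Group A: base rate 50% (50 of 100 re-offend); 50 flagged, 40 of them correctly.
fpr_a, prec_a = fairness_metrics(tp=40, fp=10, tn=40, fn=10)
# Group B: base rate 20% (20 of 100 re-offend); 10 flagged, 8 of them correctly.
fpr_b, prec_b = fairness_metrics(tp=8, fp=2, tn=78, fn=12)

print(f"Group A: precision {prec_a:.2f}, false-positive rate {fpr_a:.2f}")
print(f"Group B: precision {prec_b:.2f}, false-positive rate {fpr_b:.2f}")
# Equal precision (0.80 vs 0.80) but FPR 0.20 vs 0.03: equalizing one
# fairness definition does not equalize the other.
```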
Case 3: Facial Recognition & Joy Buolamwini's Gender Shades (2018)
What happened: MIT researcher Joy Buolamwini (with co-author Timnit Gebru) tested commercial facial-analysis systems from IBM, Microsoft, and Face++. Error rates for light-skinned men were under 1%. For dark-skinned women, error rates reached roughly 35%. The systems were essentially broken for one group.
Why: Training datasets were dominated by lighter-skinned faces, and the test sets used to evaluate accuracy did not stratify by skin tone or gender.
What was fixed: IBM exited the facial recognition market entirely. Microsoft restricted access. Several U.S. cities banned police use of facial recognition. The "Gender Shades" methodology became a standard for AI auditing.
Lesson: A model that is "97% accurate on average" can be 65% accurate for a subgroup. Disaggregated evaluation is non-negotiable.
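Disaggregated evaluation is mechanically simple: compute the metric per subgroup instead of once over the whole test set. Here is a minimal sketch with toy numbers and invented group labels (not the actual Gender Shades data):

```python
# Disaggregated evaluation sketch. The 12 predictions below are toy data for a
# hypothetical face-analysis model; the pattern, not the numbers, is the point.
import pandas as pd

results = pd.DataFrame({
    "y_true":    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "y_pred":    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "skin_tone": ["lighter"] * 8 + ["darker"] * 4,
    "gender":    ["male", "female"] * 6,
})

overall = (results["y_true"] == results["y_pred"]).mean()
print(f"Overall accuracy: {overall:.0%}")   # 75% in this toy sample

# Accuracy per intersectional subgroup, the Gender Shades design.
per_group = (
    results.assign(correct=results["y_true"] == results["y_pred"])
           .groupby(["skin_tone", "gender"])["correct"]
           .mean()
)
print(per_group)
# The gap between the best- and worst-served subgroup, not the average,
# is the finding that matters.
```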
Case 4: The Apple Card Credit Limit (2019)
What happened: Tech entrepreneur David Heinemeier Hansson tweeted that his Apple Card credit limit was 20× higher than his wife's, despite shared assets and her higher credit score. Apple co-founder Steve Wozniak said the same happened to him and his wife. New York's Department of Financial Services investigated.
Why: The full underlying model was never publicly disclosed (a transparency issue in itself). The issuing bank's algorithm appeared to use proxy variables that disadvantaged women applicants.
What was fixed: The investigation closed without finding intentional discrimination, but the case forced the industry to confront the transparency gap. New York and several other states tightened rules on automated credit decisions.
Lesson: When users cannot see why a decision was made, even non-biased decisions become impossible to defend.
Case 5: Healthcare Risk Algorithm (Obermeyer et al., 2019)
What happened: A widely used U.S. healthcare algorithm used annual healthcare spending as a proxy for healthcare need. Because less money had historically been spent on Black patients than on equally sick white patients, the model systematically under-rated how sick Black patients were and kept them out of extra-care programs.
Why: The proxy variable (spending) was shaped by historical inequalities in access to care.
What was fixed: When the researchers replaced the "cost" label with a measure of active chronic conditions, the bias dropped dramatically. The algorithm's developers worked with the researchers to change the prediction target.
Lesson: The metric you optimize is the metric you embed. "Costs less to treat" is not the same as "needs less treatment."
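A tiny invented example makes the lesson concrete: rank the same patients by historical cost and then by a direct measure of need, and different people qualify for the program. The patient table and program size below are illustrative only, not the Obermeyer dataset.

```python
# Invented example of how the choice of prediction target changes who
# qualifies for an extra-care program. Numbers are illustrative only.
import pandas as pd

patients = pd.DataFrame({
    "patient":            ["A", "B", "C", "D"],
    "chronic_conditions": [5,    5,    2,    1],      # how sick they actually are
    "annual_cost_usd":    [9000, 4000, 8000, 2000],   # shaped by unequal access to care
})

K = 2  # hypothetical number of program slots

by_cost = patients.nlargest(K, "annual_cost_usd")["patient"].tolist()
by_need = patients.nlargest(K, "chronic_conditions")["patient"].tolist()

print("Selected when the target is cost:", by_cost)   # ['A', 'C']
print("Selected when the target is need:", by_need)   # ['A', 'B']
# Patient B is as sick as A but had less spent on them historically;
# optimizing for cost quietly drops them from the program.
```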
Case 6: AI Image Generators and Default Faces (2022–2024)
What happened: Early text-to-image models like DALL-E and Stable Diffusion would, by default, depict "CEO," "doctor," or "engineer" as white men, and "nurse," "teacher," or "housekeeper" as women. Researchers also found that prompts for "Black person" sometimes produced caricatured imagery.
Why: Training data (huge web image scrapes) reflected stereotypes. Default outputs reproduced them.
What was fixed: Providers added prompt-rewriting layers (Google's Gemini famously overcorrected in early 2024). OpenAI, Stability AI, and other providers published model and system cards describing these limitations. Bias is still present, but it is more visible and better documented.
Lesson: Default outputs are political. The "no instruction" answer reflects whoever's data dominated training.
Patterns Across the Cases
Take a step back. Notice the recurring themes:
| Pattern | Cases where it appears |
|---|---|
| Historical data encodes past discrimination | Amazon, COMPAS, Apple Card |
| Proxy variables sneak protected attributes back in | Amazon, COMPAS, healthcare, Apple Card |
| Aggregate accuracy hides subgroup failure | Facial recognition, healthcare |
| Lack of transparency makes harm hard to challenge | COMPAS, Apple Card |
| Training data lacks diversity | Facial recognition, image generators |
These five patterns are basically the whole game. If you can name them, you understand what AI fairness research is about.
How to Use These in an Interview
If you're asked "tell me about an AI ethics case you find interesting," do not say "the lawyer who used ChatGPT" (everyone says that). Pick Gender Shades or the healthcare algorithm and tell the story in 90 seconds:
- What happened
- The mechanism (which pattern from above?)
- What was fixed
- One lesson for the company you're interviewing with
That story shows fluency, not name-dropping.
Hands-on: Apply a Case to a Tool You Use
Pick one of the six cases. Now pick an AI tool you use — Grammarly, GitHub Copilot, Notion AI, Midjourney, ChatGPT, an AI tutor app. In a chatbot, ask:
"I'm analyzing whether a tool similar to [TOOL] could exhibit the same kind of bias as the [CASE NAME] case. Walk through the parallels and differences, and identify three specific tests I could run to check."
Run the suggested tests. Document your findings. You now have something concrete for your portfolio.
Key Takeaways
- Six landmark cases cover the most important fairness lessons: Amazon hiring, COMPAS, Gender Shades, Apple Card, healthcare risk, image generators.
- Five recurring patterns: historical data, proxy variables, aggregate accuracy hides subgroups, missing transparency, missing diversity.
- "Just remove the protected attribute" almost never solves bias — proxies show up.
- Disaggregated evaluation (testing per subgroup) is the single most important technique.
- Knowing these cases by name dramatically improves how you talk about AI ethics professionally.

