Conditional Probability
Conditional probability answers the question: "How does the probability of an event change when we know something else has happened?" This concept is fundamental to how AI systems update their predictions based on new information.
The Core Idea
Conditional probability is the probability of event A occurring, given that event B has already occurred.
Notation: P(A | B) — read as "probability of A given B"
This is different from P(A), which is the probability of A without any additional information.
A Simple Example
Imagine a deck of 52 cards:
- P(drawing a King) = 4/52 = 1/13
But what if you know the card is a face card (Jack, Queen, or King)?
- There are 12 face cards total
- 4 of them are Kings
- P(King | Face card) = 4/12 = 1/3
Knowing additional information changed our probability!
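As a sanity check, here is a minimal Python sketch that builds a deck and counts directly:

```python
from itertools import product

# Build a 52-card deck as (rank, suit) pairs
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

face_cards = [card for card in deck if card[0] in ("J", "Q", "K")]
kings = [card for card in face_cards if card[0] == "K"]

print(len(kings) / len(face_cards))  # P(King | Face card) = 4/12 ≈ 0.333
```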
The Formula
The conditional probability formula is:
P(A | B) = P(A and B) / P(B)
Where:
- P(A | B) = probability of A given B
- P(A and B) = probability of both A and B occurring
- P(B) = probability of B occurring
Intuition
When we condition on B, we're restricting our universe to only cases where B happened. We then ask: of those cases, how many also have A?
Step-by-Step Calculation
Example: Weather and Umbrella
Given:
- P(Rain) = 0.3
- P(Umbrella) = 0.4
- P(Rain and Umbrella) = 0.25
What's the probability of rain, given someone has an umbrella?
P(Rain | Umbrella) = P(Rain and Umbrella) / P(Umbrella)
= 0.25 / 0.4
= 0.625
If you see someone with an umbrella, there's a 62.5% chance it's raining!
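In code, the update is a single division; a minimal sketch with the numbers above:

```python
p_rain = 0.3                 # P(Rain), before seeing any evidence
p_umbrella = 0.4             # P(Umbrella)
p_rain_and_umbrella = 0.25   # P(Rain and Umbrella)

# P(Rain | Umbrella) = P(Rain and Umbrella) / P(Umbrella)
p_rain_given_umbrella = p_rain_and_umbrella / p_umbrella
print(p_rain, "->", p_rain_given_umbrella)  # 0.3 -> 0.625
```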
Conditional Probability in AI
Spam Filtering
An email spam filter uses conditional probability constantly:
- P(Spam | contains "free money") = very high
- P(Spam | contains your name and references previous conversation) = low
The filter calculates P(Spam | email features) and classifies accordingly.
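A count-based sketch of how such an estimate might be formed; the dataset and counts here are hypothetical, purely for illustration:

```python
# Hypothetical counts from a labeled training set
emails_with_phrase = 1_000   # training emails containing "free money"
spam_with_phrase = 970       # of those, how many were labeled spam

# Estimate P(Spam | contains "free money") from the counts
print(spam_with_phrase / emails_with_phrase)  # 0.97: very high, as expected
```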
Medical Diagnosis AI
Consider an AI diagnosing diseases:
- P(Disease | Positive test) is what we really want to know
- This is different from P(Positive test | Disease), which is how tests are usually characterized (a test's sensitivity)
This distinction is crucial and leads us to Bayes' theorem in the next module.
Language Models
When a language model generates text, it constantly uses conditional probability:
P(next_word | previous_words)
For example:
- P("sat" | "The cat") — probability "sat" comes after "The cat"
- P("the" | "The cat sat on") — probability "the" comes after "The cat sat on"
Each word is predicted conditioned on all previous words.
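A toy version of this idea using bigram counts, where each word is conditioned only on the single previous word (a big simplification of the full-context conditioning above, but the structure is the same):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: how often each word follows each other word
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(word, prev):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("sat", "cat"))  # 0.5: "cat" is followed by "sat" once and "ran" once
```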
The Chain Rule
We can express joint probabilities using conditional probabilities:
P(A and B) = P(A | B) × P(B)
= P(B | A) × P(A)
This extends to multiple events:
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
Language Model Example
The probability of a sentence "The cat sat" can be decomposed:
P("The cat sat") = P("The") × P("cat" | "The") × P("sat" | "The cat")
This is exactly how autoregressive language models work!
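As a concrete sketch, here are the three factors multiplied out with made-up probabilities (a real model would produce these numbers from its learned distribution):

```python
# Hypothetical conditional probabilities from a language model
p_the = 0.05        # P("The") as the first word
p_cat_given = 0.02  # P("cat" | "The")
p_sat_given = 0.30  # P("sat" | "The cat")

# Chain rule: multiply the per-word conditionals
print(p_the * p_cat_given * p_sat_given)  # ≈ 0.0003
```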
Common Mistakes
Mistake 1: Confusing P(A | B) with P(B | A)
These are generally NOT equal:
- P(Wet ground | Rain) ≈ 1.0 (rain almost always wets the ground)
- P(Rain | Wet ground) < 1.0 (ground could be wet from sprinklers)
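A quick numeric check with an assumed joint distribution (illustrative numbers only):

```python
p_rain_and_wet = 0.28   # P(Rain and Wet ground)
p_rain = 0.30           # P(Rain)
p_wet = 0.45            # P(Wet ground): sprinklers etc. also wet the ground

print(p_rain_and_wet / p_rain)  # P(Wet | Rain) ≈ 0.93
print(p_rain_and_wet / p_wet)   # P(Rain | Wet) ≈ 0.62: not the same
```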
Mistake 2: Ignoring the Condition
P(A | B) and P(A) are only equal when A and B are independent (covered in the next lesson).
Mistake 3: Base Rate Neglect
Even if P(Positive test | Disease) is high (say 99%), P(Disease | Positive test) might be low if the disease is rare.
If only 1 in 10,000 people have the disease:
- Most positive tests will be false positives
- Even with a false-positive rate of just 1%, P(Disease | Positive test) falls below 1% (worked out in the sketch below)
This is why understanding conditional probability is critical for medical AI!
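The arithmetic behind this, as a minimal sketch assuming a 99% detection rate and a 1% false-positive rate (the false-positive rate is an assumption for illustration; the general recipe is Bayes' theorem, coming in the next module):

```python
p_disease = 1 / 10_000        # prior: 1 in 10,000 people have the disease
p_pos_given_disease = 0.99    # P(Positive | Disease)
p_pos_given_healthy = 0.01    # P(Positive | No disease), assumed

# P(Positive): a positive comes from a sick person or a healthy one
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P(Disease | Positive) = P(Disease and Positive) / P(Positive)
print(p_pos_given_disease * p_disease / p_pos)  # ≈ 0.0098, under 1%
```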
Probability Tables
We can organize conditional probabilities in tables:
Spam Detection Example
| | Spam | Not Spam | Total |
|---|---|---|---|
| Contains "Free" | 0.15 | 0.05 | 0.20 |
| No "Free" | 0.05 | 0.75 | 0.80 |
| Total | 0.20 | 0.80 | 1.00 |
From this table:
- P(Spam | Contains "Free") = 0.15 / 0.20 = 0.75
- P(Spam | No "Free") = 0.05 / 0.80 = 0.0625
Containing "Free" increases spam probability from 20% to 75%!
Updating Beliefs
Conditional probability is the mathematical foundation for:
- Bayesian updating — how we update beliefs with new evidence
- Filtering — how AI systems update predictions in real-time
- Context understanding — how language models use previous words to predict the next
Programming Perspective
In Python/pseudocode, conditional probability looks like:
```python
# Count-based approach: estimate P(A | B) from observed counts
def conditional_probability(a_and_b_count, b_count):
    return a_and_b_count / b_count

# From a confusion matrix:
# precision is P(Actually Positive | Predicted Positive)
def precision(true_positives, predicted_positives):
    return true_positives / predicted_positives
```
Precision in machine learning is actually a conditional probability!
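For example, with hypothetical confusion-matrix counts:

```python
tp, fp = 90, 30   # hypothetical true/false positive counts

print(precision(tp, tp + fp))  # 0.75 = P(Actually Positive | Predicted Positive)
```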
Summary
- Conditional probability P(A | B) is the probability of A given B has occurred
- Formula: P(A | B) = P(A and B) / P(B)
- P(A | B) is generally different from P(B | A)
- The chain rule lets us decompose joint probabilities
- Language models use conditional probability at every step
- Many AI metrics (like precision) are conditional probabilities
In the next lesson, we'll explore independence—when knowing B tells us nothing about A.

