Independence and Joint Probability
Understanding when events are independent—and when they're not—is crucial for building accurate AI models. This lesson explores how we determine whether events influence each other and how we calculate the probability of multiple events occurring together.
What is Independence?
Two events A and B are independent if knowing one tells us nothing about the other:
P(A | B) = P(A)
Equivalently:
P(B | A) = P(B)
The occurrence of one event doesn't change the probability of the other.
Testing for Independence
Events A and B are independent if and only if:
P(A and B) = P(A) × P(B)
This is the multiplication rule for independent events.
Example: Coin Flips
Two fair coin flips are independent:
- P(First flip heads) = 0.5
- P(Second flip heads) = 0.5
- P(Both heads) = 0.5 × 0.5 = 0.25
Knowing the first flip was heads doesn't change the probability of the second.
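As a sanity check, simulating two fair coin flips should give a heads-heads frequency close to 0.5 × 0.5 = 0.25. A minimal Python sketch (the trial count and random seed are arbitrary choices):

import random

random.seed(0)            # arbitrary seed for reproducibility
trials = 100_000
both_heads = 0
for _ in range(trials):
    first_heads = random.random() < 0.5
    second_heads = random.random() < 0.5
    if first_heads and second_heads:
        both_heads += 1

print(both_heads / trials)   # should be close to 0.25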
Example: Drawing Cards (Without Replacement)
Successive card draws WITHOUT replacement are not independent:
- P(First card is Ace) = 4/52
- P(Second card is Ace | First was Ace) = 3/51 ≠ 4/52
The first draw affects the second!
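The same numbers can be checked exactly with Python's fractions module; this sketch just encodes the counting argument above:

from fractions import Fraction

p_first_ace = Fraction(4, 52)                # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)   # 3 aces left among 51 remaining cards

print(p_first_ace)                               # 1/13
print(p_second_ace_given_first)                  # 1/17
print(p_second_ace_given_first == p_first_ace)   # False: the draws are dependent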
Dependent Events
When events are dependent (not independent):
P(A and B) = P(A) × P(B | A)
Or equivalently:
P(A and B) = P(B) × P(A | B)
Example: Weather and Picnic
- P(Sunny) = 0.7
- P(Picnic | Sunny) = 0.9
- P(Picnic | Not Sunny) = 0.1
The probability of a picnic depends on the weather:
P(Sunny and Picnic) = P(Sunny) × P(Picnic | Sunny)
= 0.7 × 0.9
= 0.63
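In code this is a single multiplication; the values below are the ones from the example:

p_sunny = 0.7
p_picnic_given_sunny = 0.9

# P(Sunny and Picnic) = P(Sunny) * P(Picnic | Sunny)
p_sunny_and_picnic = p_sunny * p_picnic_given_sunny
print(round(p_sunny_and_picnic, 2))   # 0.63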
Joint Probability
Joint probability P(A, B) or P(A and B) is the probability that both events occur.
For multiple events:
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
This chain of conditional probabilities can extend to any number of events.
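A small helper makes the pattern explicit: multiply the first marginal by each successive conditional probability. The numbers in the example call are purely hypothetical:

from math import prod

def chain_rule(probabilities):
    # probabilities = [P(A), P(B | A), P(C | A, B), ...]
    return prod(probabilities)

# P(A, B, C) with hypothetical values for P(A), P(B | A), P(C | A, B)
print(round(chain_rule([0.5, 0.4, 0.3]), 2))   # 0.06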
Independence in Machine Learning
The Naive Bayes Assumption
The Naive Bayes classifier assumes that features are independent given the class:
P(features | class) = P(feature1 | class) × P(feature2 | class) × ...
This is called "naive" because features are rarely truly independent in practice. Yet it often works well!
For spam detection:
P("free", "money", "click" | Spam) ≈ P("free" | Spam) × P("money" | Spam) × P("click" | Spam)
Why the Naive Assumption Works
Even when features aren't truly independent, Naive Bayes often performs well because:
- We only need the correct ranking of classes, not exact probability values
- The simplification greatly reduces the data needed
- Errors often cancel out across features
When Independence Fails
Consider sentiment analysis:
- "not good" is negative
- "not bad" is positive (often)
The words "not" and "good"/"bad" are highly dependent—treating them independently would miss the negation pattern.
This is why modern NLP moved to models that capture dependencies (like transformers).
Marginal Probability
Marginal probability is the probability of a single event, obtained by summing over all possibilities of other events.
If we know the joint probability P(A, B), we can find:
P(A) = ∑ P(A, B) for all values of B
Example: Medical Test
| | Disease | No Disease | P(Test Result) |
|---|---|---|---|
| Positive Test | 0.009 | 0.099 | 0.108 |
| Negative Test | 0.001 | 0.891 | 0.892 |
| P(Disease Status) | 0.01 | 0.99 | 1.0 |
Marginal probabilities:
- P(Positive Test) = 0.009 + 0.099 = 0.108
- P(Disease) = 0.009 + 0.001 = 0.01
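These marginals can be computed directly from the joint table. A sketch assuming NumPy, with rows as test results and columns as disease status:

import numpy as np

# Rows: Positive Test, Negative Test; Columns: Disease, No Disease
joint = np.array([[0.009, 0.099],
                  [0.001, 0.891]])

p_test_result = joint.sum(axis=1)       # ≈ [0.108, 0.892]
p_disease_status = joint.sum(axis=0)    # ≈ [0.01, 0.99]
print(p_test_result, p_disease_status)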
Conditional Independence
Events A and B are conditionally independent given C if:
P(A, B | C) = P(A | C) × P(B | C)
Things can be dependent overall but independent once we know C.
Example: Height and Vocabulary
Height and vocabulary size are correlated in the general population (taller people tend to have larger vocabularies).
But conditioned on age, they're independent! The correlation exists because both height and vocabulary increase with age in children.
P(Height, Vocabulary | Age) = P(Height | Age) × P(Vocabulary | Age)
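The definition can be verified numerically. This sketch uses made-up probabilities for two events given a fixed C, purely to illustrate the test:

# Hypothetical probabilities, all conditioned on the same event C
p_a_given_c = 0.6
p_b_given_c = 0.5
p_a_and_b_given_c = 0.30   # observed joint probability, given C

# Conditionally independent if P(A, B | C) = P(A | C) * P(B | C)
print(abs(p_a_and_b_given_c - p_a_given_c * p_b_given_c) < 1e-9)   # True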
Independence in Neural Networks
Feature Learning
Deep neural networks learn hidden-layer representations whose features are more independent than the raw inputs. This is often more effective than assuming the input features themselves are independent.
The network transforms dependent raw inputs (pixels) into more independent high-level features (edges, shapes, objects).
Dropout
Dropout regularization works partly by forcing neurons to be more independent (a minimal sketch follows this list):
- Neurons are randomly "dropped" (zeroed) during training
- This prevents co-adaptation, where neurons only work in specific combinations
- Each neuron must be independently useful
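A minimal sketch of the dropout mechanism, assuming NumPy and the common "inverted dropout" scaling (not tied to any particular framework):

import numpy as np

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice

def dropout(activations, p_drop=0.5):
    # Randomly zero each activation with probability p_drop (training-time only)
    mask = rng.random(activations.shape) >= p_drop
    # Rescale survivors so the expected activation is unchanged ("inverted dropout")
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.5, -0.7, 0.9])   # hypothetical hidden-layer activations
print(dropout(h))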
Checking Independence in Data
To test if two features are independent in real data:
- Calculate P(A) and P(B) from data
- Calculate P(A and B) from data
- Compare P(A and B) with P(A) × P(B)
If they're close, the events are approximately independent.
# Estimate the probabilities from counts in the data
def approximately_independent(count_a, count_b, count_a_and_b, total, threshold=0.01):
    p_a = count_a / total
    p_b = count_b / total
    p_a_and_b = count_a_and_b / total
    # Independent events satisfy P(A and B) = P(A) * P(B)
    return abs(p_a_and_b - p_a * p_b) < threshold

# Hypothetical example counts
if approximately_independent(count_a=520, count_b=480, count_a_and_b=250, total=1000):
    print("Approximately independent")
else:
    print("Dependent")
Joint Probability Tables
For discrete variables, we can represent joint probabilities in tables:
Sentiment vs. Length
| | Short | Medium | Long |
|---|---|---|---|
| Positive | 0.10 | 0.20 | 0.05 |
| Negative | 0.15 | 0.25 | 0.10 |
| Neutral | 0.05 | 0.05 | 0.05 |
From this table:
- P(Positive and Medium) = 0.20
- P(Positive) = 0.10 + 0.20 + 0.05 = 0.35
- P(Medium) = 0.20 + 0.25 + 0.05 = 0.50
Are Sentiment and Length independent?
- P(Positive) × P(Medium) = 0.35 × 0.50 = 0.175
- P(Positive and Medium) = 0.20
- 0.175 ≠ 0.20, so they're not independent
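The same check in code, with the joint table above stored as a NumPy array (rows: Positive, Negative, Neutral; columns: Short, Medium, Long):

import numpy as np

joint = np.array([[0.10, 0.20, 0.05],
                  [0.15, 0.25, 0.10],
                  [0.05, 0.05, 0.05]])

p_positive = joint[0].sum()            # ≈ 0.35
p_medium = joint[:, 1].sum()           # ≈ 0.50
p_positive_and_medium = joint[0, 1]    # 0.20

print(p_positive * p_medium)           # ≈ 0.175, not 0.20, so dependent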
Summary
- Independent events: Knowing one doesn't change the probability of the other
- Test for independence: P(A and B) = P(A) × P(B)
- Dependent events: Use P(A and B) = P(A) × P(B | A)
- Naive Bayes assumes feature independence given the class
- Marginal probability: Sum over all values of other variables
- Conditional independence: Independent given some other information
- Neural networks learn representations with more independent features
This completes Module 1! Next, we dive into Bayes' theorem—the most important formula in probabilistic AI.

