Independence and Joint Probability
Understanding when events are independent—and when they're not—is crucial for building accurate AI models. This lesson explores how we determine whether events influence each other and how we calculate the probability of multiple events occurring together.
What is Independence?
Two events A and B are independent if knowing one tells us nothing about the other:
P(A | B) = P(A)
Equivalently:
P(B | A) = P(B)
The occurrence of one event doesn't change the probability of the other.
Testing for Independence
Events A and B are independent if and only if:
P(A and B) = P(A) × P(B)
This is the multiplication rule for independent events.
Example: Coin Flips
Two fair coin flips are independent:
- P(First flip heads) = 0.5
- P(Second flip heads) = 0.5
- P(Both heads) = 0.5 × 0.5 = 0.25
Knowing the first flip was heads doesn't change the probability of the second.
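As a sanity check, simulating two fair coin flips should give a heads-heads frequency close to 0.5 × 0.5 = 0.25. A minimal Python sketch (the trial count and random seed are arbitrary choices):

import random

random.seed(0)            # arbitrary seed for reproducibility
trials = 100_000
both_heads = 0
for _ in range(trials):
    first_heads = random.random() < 0.5
    second_heads = random.random() < 0.5
    if first_heads and second_heads:
        both_heads += 1

print(both_heads / trials)   # should be close to 0.25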
Example: Drawing Cards (Without Replacement)
Successive card draws WITHOUT replacement are not independent:
- P(First card is Ace) = 4/52
- P(Second card is Ace | First was Ace) = 3/51 ≠ 4/52
The first draw affects the second!
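The same numbers can be checked exactly with Python's fractions module; this sketch just encodes the counting argument above:

from fractions import Fraction

p_first_ace = Fraction(4, 52)                # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)   # 3 aces left among 51 remaining cards

print(p_first_ace)                               # 1/13
print(p_second_ace_given_first)                  # 1/17
print(p_second_ace_given_first == p_first_ace)   # False: the draws are dependent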
Dependent Events
When events are dependent (not independent):
P(A and B) = P(A) × P(B | A)
Or equivalently:
P(A and B) = P(B) × P(A | B)
Example: Weather and Picnic
- P(Sunny) = 0.7
- P(Picnic | Sunny) = 0.9
- P(Picnic | Not Sunny) = 0.1
The probability of a picnic depends on the weather:
P(Sunny and Picnic) = P(Sunny) × P(Picnic | Sunny)
= 0.7 × 0.9
= 0.63
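In code this is a single multiplication; the values below are the ones from the example:

p_sunny = 0.7
p_picnic_given_sunny = 0.9

# P(Sunny and Picnic) = P(Sunny) * P(Picnic | Sunny)
p_sunny_and_picnic = p_sunny * p_picnic_given_sunny
print(round(p_sunny_and_picnic, 2))   # 0.63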
Joint Probability
Joint probability P(A, B) or P(A and B) is the probability that both events occur.
For multiple events:
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
This chain of conditional probabilities can extend to any number of events.
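A small helper makes the pattern explicit: multiply the first marginal by each successive conditional probability. The numbers in the example call are purely hypothetical:

from math import prod

def chain_rule(probabilities):
    # probabilities = [P(A), P(B | A), P(C | A, B), ...]
    return prod(probabilities)

# P(A, B, C) with hypothetical values for P(A), P(B | A), P(C | A, B)
print(round(chain_rule([0.5, 0.4, 0.3]), 2))   # 0.06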
Independence in Machine Learning
The Naive Bayes Assumption
The Naive Bayes classifier assumes that features are independent given the class:
P(features | class) = P(feature1 | class) × P(feature2 | class) × ...
This is called "naive" because features are rarely truly independent in practice. Yet it often works well!
For spam detection:
P("free", "money", "click" | Spam) ≈ P("free" | Spam) × P("money" | Spam) × P("click" | Spam)
Why the Naive Assumption Works
Even when features aren't truly independent, Naive Bayes often performs well because:
- We only need the correct ranking of classes, not exact probability values
- The simplification greatly reduces the data needed
- Errors often cancel out across features
When Independence Fails
Consider sentiment analysis:
- "not good" is negative
- "not bad" is positive (often)
The words "not" and "good"/"bad" are highly dependent—treating them independently would miss the negation pattern.
This is why modern NLP moved to models that capture dependencies (like transformers).
Marginal Probability
Marginal probability is the probability of a single event, obtained by summing over all possibilities of other events.
If we know the joint probability P(A, B), we can find:
P(A) = ∑ P(A, B) for all values of B
Example: Medical Test
| | Disease | No Disease | P(Test Result) |
|---|---|---|---|
| Positive Test | 0.009 | 0.099 | 0.108 |
| Negative Test | 0.001 | 0.891 | 0.892 |
| P(Disease Status) | 0.01 | 0.99 | 1.0 |
Marginal probabilities:
- P(Positive Test) = 0.009 + 0.099 = 0.108
- P(Disease) = 0.009 + 0.001 = 0.01
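These marginals can be computed directly from the joint table. A sketch assuming NumPy, with rows as test results and columns as disease status:

import numpy as np

# Rows: Positive Test, Negative Test; Columns: Disease, No Disease
joint = np.array([[0.009, 0.099],
                  [0.001, 0.891]])

p_test_result = joint.sum(axis=1)       # ≈ [0.108, 0.892]
p_disease_status = joint.sum(axis=0)    # ≈ [0.01, 0.99]
print(p_test_result, p_disease_status)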
Conditional Independence
Events A and B are conditionally independent given C if:
P(A, B | C) = P(A | C) × P(B | C)
Things can be dependent overall but independent once we know C.
Example: Height and Vocabulary
Height and vocabulary size are correlated in the general population (taller people tend to have larger vocabularies).
But conditioned on age, they're independent! The correlation exists because both height and vocabulary increase with age in children.
P(Height, Vocabulary | Age) = P(Height | Age) × P(Vocabulary | Age)
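The definition can be verified numerically. This sketch uses made-up probabilities for two events given a fixed C, purely to illustrate the test:

# Hypothetical probabilities, all conditioned on the same event C
p_a_given_c = 0.6
p_b_given_c = 0.5
p_a_and_b_given_c = 0.30   # observed joint probability, given C

# Conditionally independent if P(A, B | C) = P(A | C) * P(B | C)
print(abs(p_a_and_b_given_c - p_a_given_c * p_b_given_c) < 1e-9)   # True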
Independence in Neural Networks
Feature Learning
Deep neural networks learn hidden-layer representations whose features are more independent than the raw inputs. This is often more effective than assuming the input features themselves are independent.
The network transforms dependent raw inputs (pixels) into more independent high-level features (edges, shapes, objects).
Dropout
Dropout regularization works partly by forcing neurons to be more independent (a minimal sketch follows this list):
- Neurons are randomly "dropped" (zeroed) during training
- This prevents co-adaptation, where neurons only work in specific combinations
- Each neuron must be independently useful
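A minimal sketch of the dropout mechanism, assuming NumPy and the common "inverted dropout" scaling (not tied to any particular framework):

import numpy as np

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice

def dropout(activations, p_drop=0.5):
    # Randomly zero each activation with probability p_drop (training-time only)
    mask = rng.random(activations.shape) >= p_drop
    # Rescale survivors so the expected activation is unchanged ("inverted dropout")
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.5, -0.7, 0.9])   # hypothetical hidden-layer activations
print(dropout(h))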
Checking Independence in Data
To test if two features are independent in real data:
- Calculate P(A) and P(B) from data
- Calculate P(A and B) from data
- Compare P(A and B) with P(A) × P(B)
If they're close, the events are approximately independent.
# Estimate the probabilities from counts in the data
def approximately_independent(count_a, count_b, count_a_and_b, total, threshold=0.01):
    p_a = count_a / total
    p_b = count_b / total
    p_a_and_b = count_a_and_b / total
    # Independent events satisfy P(A and B) = P(A) * P(B)
    return abs(p_a_and_b - p_a * p_b) < threshold

# Hypothetical example counts
if approximately_independent(count_a=520, count_b=480, count_a_and_b=250, total=1000):
    print("Approximately independent")
else:
    print("Dependent")
Joint Probability Tables
For discrete variables, we can represent joint probabilities in tables:
Sentiment vs. Length
| | Short | Medium | Long |
|---|---|---|---|
| Positive | 0.10 | 0.20 | 0.05 |
| Negative | 0.15 | 0.25 | 0.10 |
| Neutral | 0.05 | 0.05 | 0.05 |
From this table:
- P(Positive and Medium) = 0.20
- P(Positive) = 0.10 + 0.20 + 0.05 = 0.35
- P(Medium) = 0.20 + 0.25 + 0.05 = 0.50
Are Sentiment and Length independent?
- P(Positive) × P(Medium) = 0.35 × 0.50 = 0.175
- P(Positive and Medium) = 0.20
- 0.175 ≠ 0.20, so they're not independent
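The same check in code, with the joint table above stored as a NumPy array (rows: Positive, Negative, Neutral; columns: Short, Medium, Long):

import numpy as np

joint = np.array([[0.10, 0.20, 0.05],
                  [0.15, 0.25, 0.10],
                  [0.05, 0.05, 0.05]])

p_positive = joint[0].sum()            # ≈ 0.35
p_medium = joint[:, 1].sum()           # ≈ 0.50
p_positive_and_medium = joint[0, 1]    # 0.20

print(p_positive * p_medium)           # ≈ 0.175, not 0.20, so dependent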
Summary
- Independent events: Knowing one doesn't change the probability of the other
- Test for independence: P(A and B) = P(A) × P(B)
- Dependent events: Use P(A and B) = P(A) × P(B | A)
- Naive Bayes assumes feature independence given the class
- Marginal probability: Sum over all values of other variables
- Conditional independence: Independent given some other information
- Neural networks learn representations with more independent features
This completes Module 1! Next, we dive into Bayes' theorem—the most important formula in probabilistic AI.

