Conditional Probability
Conditional probability answers the question: "How does the probability of an event change when we know something else has happened?" This concept is fundamental to how AI systems update their predictions based on new information.
The Core Idea
Conditional probability is the probability of event A occurring, given that event B has already occurred.
Notation: P(A | B) — read as "probability of A given B"
This is different from P(A), which is the probability of A without any additional information.
A Simple Example
Imagine a deck of 52 cards:
- P(drawing a King) = 4/52 = 1/13
But what if you know the card is a face card (Jack, Queen, or King)?
- There are 12 face cards total
- 4 of them are Kings
- P(King | Face card) = 4/12 = 1/3
Knowing additional information changed our probability!
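As a sanity check, here is a minimal Python sketch that builds a deck and counts directly:

```python
from itertools import product

# Build a 52-card deck as (rank, suit) pairs
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

face_cards = [card for card in deck if card[0] in ("J", "Q", "K")]
kings = [card for card in face_cards if card[0] == "K"]

print(len(kings) / len(face_cards))  # P(King | Face card) = 4/12 ≈ 0.333
```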
The Formula
The conditional probability formula is:
P(A | B) = P(A and B) / P(B)
Where:
- P(A | B) = probability of A given B
- P(A and B) = probability of both A and B occurring
- P(B) = probability of B occurring
Intuition
When we condition on B, we're restricting our universe to only cases where B happened. We then ask: of those cases, how many also have A?
Step-by-Step Calculation
Example: Weather and Umbrella
Given:
- P(Rain) = 0.3
- P(Umbrella) = 0.4
- P(Rain and Umbrella) = 0.25
What's the probability of rain, given someone has an umbrella?
P(Rain | Umbrella) = P(Rain and Umbrella) / P(Umbrella)
= 0.25 / 0.4
= 0.625
If you see someone with an umbrella, there's a 62.5% chance it's raining!
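In code, the update is a single division; a minimal sketch with the numbers above:

```python
p_rain = 0.3                 # P(Rain), before seeing any evidence
p_umbrella = 0.4             # P(Umbrella)
p_rain_and_umbrella = 0.25   # P(Rain and Umbrella)

# P(Rain | Umbrella) = P(Rain and Umbrella) / P(Umbrella)
p_rain_given_umbrella = p_rain_and_umbrella / p_umbrella
print(p_rain, "->", p_rain_given_umbrella)  # 0.3 -> 0.625
```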
Conditional Probability in AI
Spam Filtering
An email spam filter uses conditional probability constantly:
- P(Spam | contains "free money") = very high
- P(Spam | contains your name and references previous conversation) = low
The filter calculates P(Spam | email features) and classifies accordingly.
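A count-based sketch of how such an estimate might be formed; the dataset and counts here are hypothetical, purely for illustration:

```python
# Hypothetical counts from a labeled training set
emails_with_phrase = 1_000   # training emails containing "free money"
spam_with_phrase = 970       # of those, how many were labeled spam

# Estimate P(Spam | contains "free money") from the counts
print(spam_with_phrase / emails_with_phrase)  # 0.97: very high, as expected
```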
Medical Diagnosis AI
Consider an AI diagnosing diseases:
- P(Disease | Positive test) is what we really want to know
- This is different from P(Positive test | Disease), which is how tests are usually characterized (a test's sensitivity)
This distinction is crucial and leads us to Bayes' theorem in the next module.
Language Models
When a language model generates text, it constantly uses conditional probability:
P(next_word | previous_words)
For example:
- P("sat" | "The cat") — probability "sat" comes after "The cat"
- P("the" | "The cat sat on") — probability "the" comes after "The cat sat on"
Each word is predicted conditioned on all previous words.
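A toy version of this idea using bigram counts, where each word is conditioned only on the single previous word (a big simplification of the full-context conditioning above, but the structure is the same):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: how often each word follows each other word
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(word, prev):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("sat", "cat"))  # 0.5: "cat" is followed by "sat" once and "ran" once
```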
The Chain Rule
We can express joint probabilities using conditional probabilities:
P(A and B) = P(A | B) × P(B)
= P(B | A) × P(A)
This extends to multiple events:
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
Language Model Example
The probability of a sentence "The cat sat" can be decomposed:
P("The cat sat") = P("The") × P("cat" | "The") × P("sat" | "The cat")
This is exactly how autoregressive language models work!
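As a concrete sketch, here are the three factors multiplied out with made-up probabilities (a real model would produce these numbers from its learned distribution):

```python
# Hypothetical conditional probabilities from a language model
p_the = 0.05        # P("The") as the first word
p_cat_given = 0.02  # P("cat" | "The")
p_sat_given = 0.30  # P("sat" | "The cat")

# Chain rule: multiply the per-word conditionals
print(p_the * p_cat_given * p_sat_given)  # ≈ 0.0003
```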
Common Mistakes
Mistake 1: Confusing P(A | B) with P(B | A)
These are generally NOT equal:
- P(Wet ground | Rain) ≈ 1.0 (rain almost always wets the ground)
- P(Rain | Wet ground) < 1.0 (ground could be wet from sprinklers)
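A quick numeric check with an assumed joint distribution (illustrative numbers only):

```python
p_rain_and_wet = 0.28   # P(Rain and Wet ground)
p_rain = 0.30           # P(Rain)
p_wet = 0.45            # P(Wet ground): sprinklers etc. also wet the ground

print(p_rain_and_wet / p_rain)  # P(Wet | Rain) ≈ 0.93
print(p_rain_and_wet / p_wet)   # P(Rain | Wet) ≈ 0.62: not the same
```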
Mistake 2: Ignoring the Condition
P(A | B) and P(A) are only equal when A and B are independent (covered in the next lesson).
Mistake 3: Base Rate Neglect
Even if P(Positive test | Disease) is high (say 99%), P(Disease | Positive test) might be low if the disease is rare.
If only 1 in 10,000 people have the disease:
- Most positive tests will be false positives
- Even with a false-positive rate of just 1%, P(Disease | Positive test) falls below 1% (worked out in the sketch below)
This is why understanding conditional probability is critical for medical AI!
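The arithmetic behind this, as a minimal sketch assuming a 99% detection rate and a 1% false-positive rate (the false-positive rate is an assumption for illustration; the general recipe is Bayes' theorem, coming in the next module):

```python
p_disease = 1 / 10_000        # prior: 1 in 10,000 people have the disease
p_pos_given_disease = 0.99    # P(Positive | Disease)
p_pos_given_healthy = 0.01    # P(Positive | No disease), assumed

# P(Positive): a positive comes from a sick person or a healthy one
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P(Disease | Positive) = P(Disease and Positive) / P(Positive)
print(p_pos_given_disease * p_disease / p_pos)  # ≈ 0.0098, under 1%
```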
Probability Tables
We can organize conditional probabilities in tables:
Spam Detection Example
| | Spam | Not Spam | Total |
|---|---|---|---|
| Contains "Free" | 0.15 | 0.05 | 0.20 |
| No "Free" | 0.05 | 0.75 | 0.80 |
| Total | 0.20 | 0.80 | 1.00 |
From this table:
- P(Spam | Contains "Free") = 0.15 / 0.20 = 0.75
- P(Spam | No "Free") = 0.05 / 0.80 = 0.0625
Containing "Free" increases spam probability from 20% to 75%!
Updating Beliefs
Conditional probability is the mathematical foundation for:
- Bayesian updating — how we update beliefs with new evidence
- Filtering — how AI systems update predictions in real-time
- Context understanding — how language models use previous words to predict the next
Programming Perspective
In Python/pseudocode, conditional probability looks like:
```python
# Count-based approach: estimate P(A | B) from observed counts
def conditional_probability(a_and_b_count, b_count):
    return a_and_b_count / b_count

# From a confusion matrix:
# precision is P(Actually Positive | Predicted Positive)
def precision(true_positives, predicted_positives):
    return true_positives / predicted_positives
```
Precision in machine learning is actually a conditional probability!
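For example, with hypothetical confusion-matrix counts:

```python
tp, fp = 90, 30   # hypothetical true/false positive counts

print(precision(tp, tp + fp))  # 0.75 = P(Actually Positive | Predicted Positive)
```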
Summary
- Conditional probability P(A | B) is the probability of A given B has occurred
- Formula: P(A | B) = P(A and B) / P(B)
- P(A | B) is generally different from P(B | A)
- The chain rule lets us decompose joint probabilities
- Language models use conditional probability at every step
- Many AI metrics (like precision) are conditional probabilities
In the next lesson, we'll explore independence—when knowing B tells us nothing about A.

