Conditional probability
Let \((S, \mathcal{S}, P)\) be a probability space and let \(A \in \mathcal{S}\) be an event with \(P(A) > 0\). The conditional probability of any event \(E\in\mathcal{S}\) given \(A\) is defined as: \[ P(E\;|A) = \frac{P(E\cap A)}{P(A)} \]
This is interpreted as the chance that \(E\) occurs given that \(A\) has occurred. Importantly, \(E|A\) is not an event; rather, \(P(\cdot\;| A)\) is a new probability measure. To see this, check the axioms:
- (A1) Since \(P\) is a probability measure, \(P(E\cap A) \geq 0\), so it follows \(P(E\;|A) = \frac{P(E\cap A)}{P(A)} \geq 0\).
- (A2) \(P(S\;|A) = \frac{P(S\cap A)}{P(A)} = \frac{P(A)}{P(A)} = 1\)
- (A3) If \(\{E_i\}\) is a disjoint collection, then \(\{E_i \cap A\}\) is also a disjoint collection, so by countable additivity of \(P\), one has: \[ P\left(\bigcup_i E_i \;\big|\; A\right) = \frac{P\left(\left[\bigcup_i E_i\right]\cap A\right)}{P(A)} = \frac{P\left(\bigcup_i (E_i\cap A)\right)}{P(A)} = \sum_i \frac{P(E_i\cap A)}{P(A)} = \sum_i P(E_i\;|A) \]
One can view \(P(\cdot\;|A)\) as a probability measure on \((S, \mathcal{S})\), or as a probability measure on \(\left(A, \mathcal{S}^A\right)\) where \(\mathcal{S}^A = \{E\cap A: E \in \mathcal{S}\}\). Some prefer the latter view, since it aligns with the interpretation that by conditioning on \(A\) one is redefining the sample space.
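To make the construction concrete, here is a minimal Python sketch (the fair-die space, the event \(A\), and the helper names `prob` and `cond_prob` are chosen just for illustration) that builds \(P(\cdot\;|A)\) on a finite sample space and spot-checks the axioms:

```python
from fractions import Fraction

# A small finite probability space: a fair six-sided die
S = {1, 2, 3, 4, 5, 6}
P = {s: Fraction(1, 6) for s in S}   # each outcome gets probability 1/6

A = {2, 4, 6}                        # condition on "the roll is even"

def prob(event):
    """Probability of an event (a subset of S) under P."""
    return sum(P[s] for s in event)

def cond_prob(E, A):
    """P(E | A) = P(E ∩ A) / P(A); only defined when P(A) > 0."""
    return prob(E & A) / prob(A)

# Spot-check the axioms for the new measure P(. | A):
print(cond_prob(S, A) == 1)          # (A2): P(S | A) = 1
E1, E2 = {1, 2}, {3, 4}              # disjoint events
print(cond_prob(E1 | E2, A) == cond_prob(E1, A) + cond_prob(E2, A))  # (A3) for two events
```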
Basic properties
An immediate consequence of the definition is that: \[ P(E\cap A) = P(E\;| A) P(A) \]
In fact, this multiplication rule for conditional probabilities can be generalized to an arbitrary finite collection of events: \[ P\left(\bigcap_{i = 1}^n E_i\right) = P(E_1) \times P(E_2\;|E_1) \times P(E_3\;|E_1 \cap E_2) \times\cdots\times P(E_n\;| E_1 \cap \cdots \cap E_{n - 1}) \]
Or, written more compactly: \[ P\left(\bigcap_{i = 1}^n E_i\right) = P(E_1) \times \prod_{i = 2}^n P\left(E_i \;\Bigg| \bigcap_{j = 1}^{i - 1} E_j \right) \]
Apply the definition of conditional probability to the terms in the product on the right-hand side to see that: \[ \begin{align*} P(E_1) \times \prod_{i = 2}^n P\left(E_i \;\Bigg| \bigcap_{j = 1}^{i - 1} E_j \right) &= P(E_1) \times \prod_{i = 2}^n \left[\frac{P\left(\bigcap_{j = 1}^i E_j\right)}{P\left(\bigcap_{j = 1}^{i - 1} E_j \right)}\right] \\ &= P(E_1) \times \frac{P\left(\bigcap_{j = 1}^2 E_j\right)}{P\left( E_1 \right)} \times \frac{P\left(\bigcap_{j = 1}^3 E_j\right)}{P\left(\bigcap_{j = 1}^{2} E_j \right)} \times\cdots \times \frac{P\left(\bigcap_{j = 1}^n E_j\right)}{P\left(\bigcap_{j = 1}^{n - 1} E_j \right)} \end{align*} \] Then notice that the product telescopes: each denominator cancels with the factor before it, leaving only \(P\left(\bigcap_{j = 1}^n E_j\right)\) and establishing the result.
The multiplication rule provides a convenient way to compute certain probabilities, as in some problems it’s easier to find a conditional probability than an unconditional one.
Consider drawing 5 cards at random, without replacement, from a standard 52-card deck. What’s the probability that all 5 are diamonds?
This could be found by dividing the number of ways to draw 5 diamonds by the total number of ways to draw 5 cards: \[ \frac{{13 \choose 5}}{{52 \choose 5}} = \frac{13!}{5!8!}\times\frac{5!47!}{52!} \] Or, one can avoid evaluating the factorials and instead notice that the conditional probability of drawing a diamond, given that \(k\) diamonds have already been drawn, is \(\frac{13 - k}{52 - k}\) — the number of diamonds left as a proportion of the number of cards left — so: \[ P\left(\text{5 diamonds}\right) = \frac{13}{52}\times\frac{12}{51}\times\frac{11}{50}\times\frac{10}{49}\times\frac{9}{48} \approx 0.000495 \]
For a quick exercise, check the result by simplifying the factorials in the first solution.
To see more formally why this is an application of the multiplication rule for conditional probabilities, let \(E_i = \{\text{draw a diamond on the $i$th draw}\}\). Then: \[ \begin{align*} P\left(\text{5 diamonds}\right) &= P(E_1 \cap E_2 \cap E_3 \cap E_4 \cap E_5) \\ &= P(E_1) \times P(E_2\;| E_1) \times P(E_3\;| E_1 \cap E_2) \\ &\qquad\times P(E_4 \;| E_1 \cap E_2 \cap E_3) \times P(E_5\;| E_1 \cap E_2 \cap E_3 \cap E_4) \end{align*} \]
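As a quick numerical check, a short Python sketch (standard library only) confirms that the counting solution and the multiplication-rule solution agree:

```python
from math import comb, prod

# Counting solution: ways to choose 5 diamonds / ways to choose 5 cards
p_counting = comb(13, 5) / comb(52, 5)

# Multiplication rule: (13/52) * (12/51) * (11/50) * (10/49) * (9/48)
p_sequential = prod((13 - k) / (52 - k) for k in range(5))

print(p_counting, p_sequential)   # both ≈ 0.000495
```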
Another useful property is the law of total probability: if \(\{A_i\}\) is a partition of the sample space \(S\), then for any event \(E \in \mathcal{S}\) one has: \[ P(E) = \sum_i P(E\;| A_i) P(A_i) \]
Note that \(E = E \cap S = E\cap \left[\bigcup_i A_i\right] = \bigcup_i (E \cap A_i)\). Since \(\{A_i\}\) is disjoint, so is \(\{E \cap A_i\}\). Then by countable additivity: \[ P(E) = P\left[\bigcup_i (E \cap A_i)\right] = \sum_i P(E \cap A_i) \] And by the multiplication rule for conditional probabilities \(P(E\cap A_i) = P(E\;| A_i) P(A_i)\) so: \[ P(E) = \sum_i P(E \cap A_i) = \sum_i P(E\;| A_i) P(A_i) \]
It’s estimated that 8% of men and 0.5% of women have color vision deficiency (CVD). Supposing that exactly these proportions appear in a group of 400 women and 200 men, what’s the probability that a randomly selected individual has CVD?
From the problem set-up:
- \(P(\text{CVD}\;| M) = 0.08\) and \(P(\text{CVD}\;| F) = 0.005\)
- \(P(M) = \frac{1}{3}\) and \(P(F) = \frac{2}{3}\)
Note that these are the probabilities of selecting a person with the specified attributes from this group, assuming all individuals are equally likely to be selected. This is consistent with how we’ve defined probabilities for finite sample spaces. In this case, the sample space is the collection of 600 individuals, and each individual is assigned selection probability \(\frac{1}{600}\). Thus, the meaning of the probability here comes from sampling from this group of people at random; importantly, these are not statements about the chance of having CVD, or being of one sex or the other, or the like.
The law of total probability yields: \[ \begin{align*} P(\text{CVD}) &= P(\text{CVD}\;| M) P(M) + P(\text{CVD}\;| F) P(F) \\ &= 0.08 \cdot \frac{1}{3} + 0.005 \cdot\frac{2}{3} \\ &= 0.03 \end{align*} \]
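The same arithmetic, scripted as a short Python sketch (the variable names are just for these notes):

```python
# Law of total probability with the numbers from the example
p_m, p_f = 1/3, 2/3                   # P(M), P(F)
p_cvd_m, p_cvd_f = 0.08, 0.005        # P(CVD | M), P(CVD | F)

p_cvd = p_cvd_m * p_m + p_cvd_f * p_f
print(p_cvd)                          # 0.03 (up to floating-point rounding)

# Equivalent direct count: 16 of the 200 men and 2 of the 400 women have CVD
print((0.08 * 200 + 0.005 * 400) / 600)   # 18 / 600 = 0.03
```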
Independence
If two events are independent, then the occurrence of one event doesn’t affect the probability of the other. For example, obtaining heads on a coin toss doesn’t affect the chances of obtaining heads on a subsequent toss.
By contrast, if two events are dependent, then the occurrence of one changes the probability of the other. For example, the likelihood of a car accident is higher in heavy rain.
For independent events, conditioning on one event does not change the probability of the other. Thus, we say that any two events \(E\) and \(A\) are independent just in case: \[ P(E \cap A) = P(E)P(A) \]
If so, we write \(E\perp A\).
While it might be more intuitive to define independence by the criterion \(P(E\;| A) = P(E)\), recall that \(P(E\;| A)\) is undefined when \(P(A) = 0\). Using the criterion \(P(E \cap A) = P(E)P(A)\) instead (the two are equivalent whenever \(P(A) > 0\)) keeps the concept well-defined even for events with probability zero.
Note two facts:
- independent events are not disjoint unless at least one event has probability zero;
- events with probability zero are independent of all other events, including themselves.
For the first fact, observe that for any disjoint events \(A, B\), one has \(P(A \cap B) = P(\emptyset) = 0\), so \(P(A \cap B) = P(A)P(B)\) cannot hold unless either \(P(A) = 0\) or \(P(B) = 0\).
For the second fact, if \(P(A) = 0\), then for any event \(B\), by monotonicity of the probability measure \(B \cap A \subseteq A\) implies \(P(B\cap A) \leq P(A) = 0\), so \(P(B \cap A) = 0\). Therefore \(P(B\cap A) = P(A)P(B)\).
Consider rolling two six-sided dice. The sample space for this experiment is \(S = \{1, 2, 3, 4, 5, 6\}^2\), i.e., all ordered pairs of integers between 1 and 6 corresponding to the physical possibilities for the two dice. If all outcomes are equally likely then \(P((i, j)) = \frac{1}{|S|} = \frac{1}{36}\) for all \(i, j \in \{1, \dots, 6\}\).
In this scenario the value of the first die is independent of the value of the second die. While this is intuitively obvious, to see it probabilistically, let \(A_x = \{(i, j) \in S: i = x\}\) denote the event that the first die shows \(x\), and let \(B_y = \{(i, j)\in S: j = y\}\) denote the event that the second die shows \(y\). Now \(|A_x| = |B_y| = 6\) for each \(x\) and each \(y\), so: \[ P(A_x) = P(B_y) = \frac{6}{36} = \frac{1}{6} \] Then note that for any \(x, y\), \(A_x \cap B_y = \{(x, y)\}\), so: \[ P(A_x \cap B_y) = \frac{1}{36} = \frac{1}{6}\times\frac{1}{6} = P(A_x)P(B_y) \]
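Here is a small brute-force Python sketch of this check (the helper functions `A` and `B` are just stand-ins for the events \(A_x\) and \(B_y\)), enumerating all 36 outcomes:

```python
from fractions import Fraction
from itertools import product

# Equally likely outcomes for two six-sided dice
S = list(product(range(1, 7), repeat=2))
P = {s: Fraction(1, 36) for s in S}

def prob(event):
    return sum(P[s] for s in event)

def A(x):   # first die shows x
    return {(i, j) for (i, j) in S if i == x}

def B(y):   # second die shows y
    return {(i, j) for (i, j) in S if j == y}

# Check P(A_x ∩ B_y) = P(A_x) P(B_y) for every x, y
print(all(prob(A(x) & B(y)) == prob(A(x)) * prob(B(y))
          for x in range(1, 7) for y in range(1, 7)))   # True
```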
There are many probability measures on this space under which the outcomes are not equally likely but \(A_x\) and \(B_y\) remain independent. For example, the table below assigns unequal probabilities to the 36 possible rolls, and it is easy to verify that the product of the probabilities in the row and column headers equals the probability in the corresponding cell.
|  | \(P(B_1) = \frac{1}{3}\) | \(P(B_2) = \frac{1}{6}\) | \(P(B_3) = 0\) | \(P(B_4) = \frac{1}{6}\) | \(P(B_5) = \frac{1}{6}\) | \(P(B_6) = \frac{1}{6}\) |
| --- | --- | --- | --- | --- | --- | --- |
| \(P(A_1) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_2) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_3) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_4) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_5) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_6) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
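One way to verify such a table programmatically: the row and column sums give the marginals \(P(A_x)\) and \(P(B_y)\), and independence requires every cell to equal the product of its marginals. The Python sketch below (with the table hard-coded) carries this out for the table above; the same check can be applied to the second table that follows.

```python
from fractions import Fraction

F = Fraction
# cells[i][j] = P(A_{i+1} ∩ B_{j+1}), copied from the table above
cells = [[F(2, 36), F(1, 36), F(0, 36), F(1, 36), F(1, 36), F(1, 36)]
         for _ in range(6)]

row_marginals = [sum(row) for row in cells]        # P(A_x), summing across each row
col_marginals = [sum(col) for col in zip(*cells)]  # P(B_y), summing down each column

print(sum(row_marginals) == 1)                     # the entries sum to one
print(all(cells[i][j] == row_marginals[i] * col_marginals[j]
          for i in range(6) for j in range(6)))    # every cell factors: True
```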
Now consider the following probability measure. It is still a valid probability measure, since the entries in the table are nonnegative and sum to one.
|  | \(P(B_1) = \frac{11}{36}\) | \(P(B_2) = \frac{7}{36}\) | \(P(B_3) = 0\) | \(P(B_4) = \frac{1}{6}\) | \(P(B_5) = \frac{1}{6}\) | \(P(B_6) = \frac{1}{6}\) |
| --- | --- | --- | --- | --- | --- | --- |
| \(P(A_1) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_2) = \frac{1}{6}\) | \(\frac{1}{36}\) | \(\frac{2}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_3) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_4) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_5) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
| \(P(A_6) = \frac{1}{6}\) | \(\frac{2}{36}\) | \(\frac{1}{36}\) | \(\frac{0}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) | \(\frac{1}{36}\) |
Check your understanding:
- Are \(A_x, B_y\) independent under this probability measure? How do you check?
- Write a table of the conditional probabilities \(P(A_x \;| B_1)\) for \(x = 1, \dots, 6\).
A collection of events \(\{E_i\}\) is pairwise independent if for every pair of distinct events \(E_i, E_j\) one has: \[ P(E_i \cap E_j) = P(E_i)P(E_j) \]
A collection of events \(\{E_i\}\) is mutually independent if for every \(k\) and every subcollection of \(k\) distinct events \(E_{i_1}, \dots, E_{i_k}\): \[ P\left(\bigcap_{j = 1}^k E_{i_j}\right) = \prod_{j = 1}^k P(E_{i_j}) \]
Consider tossing two coins, so \(S = \{HH, HT, TH, TT\}\), and assume all outcomes are equally likely. Define the events:
\[ \begin{align*} E_1 &= \{HH, HT\} \quad(\text{heads on first toss}) \\ E_2 &= \{HH, TH\} \quad(\text{heads on second toss}) \\ E_3 &= \{HH, TT\} \quad(\text{tosses match}) \end{align*} \]
Now, \(P(E_i) = \frac{1}{2}\) for each \(E_i\), since each has two outcomes, so \(P(E_i)P(E_j) = \frac{1}{4}\). Moreover, \(P(E_i \cap E_j) = P(\{HH\}) = \frac{1}{4}\) for each pair \(i \neq j\), since \(HH\) is the only outcome shared by any two of the events. However: \[ P(E_1 \cap E_2 \cap E_3) = P(\{HH\}) = \frac{1}{4} \neq \frac{1}{8} = P(E_1)P(E_2)P(E_3) \] So the events are pairwise independent, but not mutually independent.
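A small Python sketch (illustrative only) makes the same point by checking every pair and then the full triple:

```python
from fractions import Fraction
from itertools import combinations

# Two fair coin tosses, all outcomes equally likely
P = {s: Fraction(1, 4) for s in ["HH", "HT", "TH", "TT"]}

def prob(event):
    return sum(P[s] for s in event)

E1 = {"HH", "HT"}   # heads on first toss
E2 = {"HH", "TH"}   # heads on second toss
E3 = {"HH", "TT"}   # tosses match
events = [E1, E2, E3]

# Pairwise independence: every pair factors
print(all(prob(A & B) == prob(A) * prob(B)
          for A, B in combinations(events, 2)))                # True

# Mutual independence fails: the triple intersection does not factor
print(prob(E1 & E2 & E3) == prob(E1) * prob(E2) * prob(E3))    # False: 1/4 vs 1/8
```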
Bayes’ theorem
In many circumstances one can compute \(P(A\;| B)\) but wishes to know \(P(B\;| A)\). For example, laboratory controls for a diagnostic test may give a good estimate of the probability that the test is positive given that a condition, illness, or pathogen is present, but what is far more useful to a patient is the probability that the condition is present given a positive test result. The following theorem provides a means of accomplishing this.
Bayes’ theorem. If \(\{E_i\}\) is a partition of \(S\) and \(A\) is an event with nonzero probability, then:
\[ P(E_i\;|A) = \frac{P(A\;| E_i) P(E_i)}{\sum_j P(A\;| E_j) P(E_j)} \]
The result follows from the multiplication rule and the law of total probability. Since \(\{E_i\}\) is a partition, \(P(A) = \sum_j P(A\;| E_j) P(E_j)\); further, \(P(E_i \cap A) = P(A\;| E_i) P(E_i)\). Then: \[ P(E_i\;|A) = \frac{P(E_i \cap A)}{P(A)} = \frac{P(A\;| E_i) P(E_i)}{\sum_j P(A\;| E_j) P(E_j)} \]
The probability \(P(E_i \;| A)\) is referred to as the posterior probability and the probability \(P(E_i)\) is referred to as the prior probability. The theorem provides a means of obtaining the posterior from the priors (assuming the conditional probabilities \(P(A\;| E_i)\) can be computed) and is sometimes interpreted as a rule for ‘updating’ the probabilities \(P(E_i)\).
As a special case, note that for any event \(E\), \(\{E, E^C\}\) partition the sample space. Therefore: \[ P(E\;| A) = \frac{P(A\;| E) P(E)}{P(A\;| E) P(E) + P(A\;| E^C) P(E^C)} \]
Consider a diagnostic test and let \(A\) denote the event one has the condition of interest, \(T_+\) denote the event one receives a positive test result, and \(T_-\) denote the event one receives a negative test result. Assume that laboratory controls indicate that:
\[ \begin{align*} P(T_+ \;| A) &= 0.99 \\ P(T_- \;| A) &= 0.01 \\ P\left(T_+ \;| A^C\right) &= 0.05 \\ P\left(T_- \;| A^C\right) &= 0.95 \end{align*} \] As an aside, \(P(T_+\;| A)\) is the true positive rate, also known as the sensitivity of the test; \(P(T_- \;| A^C)\) is the true negative rate, also known as the specificity. Note that \(P(T_+ \;| A) + P(T_- \;| A^C) \neq 1\), despite the fact that \(T_+ = (T_-)^C\); this is because conditioning on different events produces different probability measures.
Suppose first that the condition has a prevalence of \(0.5\%\) in the general population, so assume \(P(A) = 0.005\). Then the probability that one has the condition given a positive test result is: \[ \begin{align*} P(A \;| T_+) &= \frac{P(T_+ \;| A) P(A)}{P(T_+ \;| A) P(A) + P\left(T_+ \;| A^C\right) P\left(A^C\right)} \\ &= \frac{(0.99)(0.005)}{(0.99)(0.005) + (0.05)(0.995)}\\ &\approx 0.0905 \end{align*} \] Further, the probability that one has the condition given a negative test result is: \[ \begin{align*} P(A \;| T_-) &= \frac{P(T_- \;| A) P(A)}{P(T_- \;| A) P(A) + P\left(T_- \;| A^C\right) P\left(A^C\right)} \\ &= \frac{(0.01)(0.005)}{(0.01)(0.005) + (0.95)(0.995)} \\ &\approx 0.0000529 \end{align*} \]
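These calculations are easy to script. Below is a minimal Python sketch of the two-event form of Bayes’ theorem; the function name `posterior` is just for these notes, and the same function can be reused for the scenarios considered next.

```python
def posterior(prior, p_result_given_a, p_result_given_not_a):
    """P(A | result) via the two-event form of Bayes' theorem.
    (Helper name chosen for these notes, not from any library.)"""
    numerator = p_result_given_a * prior
    return numerator / (numerator + p_result_given_not_a * (1 - prior))

# Positive test, prevalence 0.5%: condition on T+, so pass P(T+ | A) and P(T+ | A^C)
print(posterior(0.005, 0.99, 0.05))   # ≈ 0.0905

# Negative test: condition on T- instead, so pass P(T- | A) and P(T- | A^C)
print(posterior(0.005, 0.01, 0.95))   # ≈ 0.0000529
```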
This usually comes as a surprise: the test appears quite accurate, yet fewer than one in ten people who test positive actually have the condition, because when the condition is this rare most positive results are false positives. What happens if the condition is more prevalent, say, present in \(10\%\) of the population? In this scenario:
\[ \begin{align*} P(A \;| T_+) &= \frac{(0.99)(0.1)}{(0.99)(0.1) + (0.05)(0.9)} \approx 0.688 \\ P(A \;| T_-) &= \frac{(0.01)(0.1)}{(0.01)(0.1) + (0.95)(0.9)} \approx 0.0012 \end{align*} \]
Try a few other scenarios on your own. First make a guess about how the posterior probability will change, and then check the calculation.
- What happens if the false positive rate \(P\left(T_+\;| A^C\right)\) is higher/lower, keeping everything else fixed?
- What happens if the false negative rate \(P\left(T_- \;| A\right)\) is higher/lower, keeping everything else fixed?
- What happens if the specificity increases to \(99.5\%\) but the false negative rate increases to \(10\%\)?
- What happens if the sensitivity decreases to \(91\%\) but the false positive rate drops to \(0.1\%\)?
The odds of an event \(E\) usually refers to the ratio \(\frac{P(E)}{P\left(E^C\right)}\), but to be more precise, this is really the odds of \(E\) relative to its complement \(E^C\). More generally, for any two events \(E_1, E_2\): \[ \text{odds}\left(E_1, E_2\right) = \frac{P(E_1)}{P(E_2)} \] The value of the odds gives the factor by which \(E_1\) is more likely than \(E_2\), and:
- if \(\text{odds}\left(E_1, E_2\right) = 1\), the events are equally likely;
- if \(\text{odds}\left(E_1, E_2\right) < 1\), \(E_2\) is more likely than \(E_1\);
- if \(\text{odds}\left(E_1, E_2\right) > 1\), \(E_1\) is more likely than \(E_2\).
The conditional odds of one event relative to another are defined identically but with a conditional probability measure: \[ \text{odds}\left(E_1, E_2 \;| A\right) = \frac{P(E_1 \;| A)}{P(E_2\;| A)} \] One way to understand the diagnostic test example is that, although the probability of having the condition given a positive result is low, conditioning on a positive result still increases the odds of having the condition substantially relative to the prior odds.
The prior odds of having the condition are: \[ \text{odds}\left(A, A^C\right) = \frac{P(A)}{P\left(A^C\right)} = \frac{0.005}{0.995} \approx 0.00503 \] But the posterior odds after conditioning on \(T_+\) are: \[ \text{odds}\left(A, A^C \;| T_+\right) = \frac{P(A\;| T_+)}{P\left(A^C \;| T_+\right)} = \frac{0.0905}{0.9095} \approx 0.0995 \] So the odds increase by a factor of \(\frac{0.0995}{0.00503} \approx 19.8\) by conditioning on a positive test result.
There is an odds form of Bayes’ theorem, which states that under the same conditions as the original result, the posterior odds are the prior odds multiplied by a ‘Bayes factor’: \[ \text{odds}\left(E_i, E_j\;| A\right) = \text{odds}(E_i, E_j) \cdot \frac{P(A\;| E_i)}{P(A\;| E_j)} \] The result follows from the definition of conditional odds and application of Bayes’ theorem to the conditional probabilities \(P(E_i\;| A)\), and the proof is left as an exercise.
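Although the proof is left as an exercise, the identity is easy to check numerically; the sketch below uses the diagnostic test numbers and recovers the factor of roughly \(19.8\) found above:

```python
# Odds form of Bayes' theorem, checked on the diagnostic test example
prior, sens, false_pos = 0.005, 0.99, 0.05

prior_odds = prior / (1 - prior)                   # ≈ 0.00503
bayes_factor = sens / false_pos                    # P(T+ | A) / P(T+ | A^C) = 19.8

p_post = (sens * prior) / (sens * prior + false_pos * (1 - prior))
posterior_odds = p_post / (1 - p_post)             # ≈ 0.0995

print(posterior_odds, prior_odds * bayes_factor)   # the two agree
```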
The Monty Hall problem is a classic probability puzzle named after the original host of the game show Let’s Make a Deal. The scenario is this: on a game show, you’re presented with three doors; behind one door is a car, and behind the other two are goats. If you pick the door concealing the car, you win the car (although no one ever explains whether, if you pick otherwise, you win a goat). You choose a door; then the host opens one of the remaining doors, revealing a goat, and asks whether you want to stay with your original choice or switch to the other unopened door. Are you more likely to win the car if you stay, or if you switch?
We’ll work out the solution in class.