Joint distributions

Course notes

STAT425, Fall 2023


December 15, 2023

This week we extend the concept of a probability distribution to multiple variables and introduce multivariate probability distributions. A multivariate distribution can be thought of in either of two ways: as the joint distribution of several random variables measured together, or as the distribution of a single vector-valued random variable (a random vector).

All of the same concepts for univariate distributions — distribution functions, transformations, and expectations — extend easily to the multivariate setting.

House hunting

The need for multivariate distributions can be motivated by a simple example: consider shopping for a home or apartment. We might record the number of bedrooms and bathrooms for every home as the vector: \[ x = (x_1, x_2) = (\#\text{ bedrooms}, \#\text{ bathrooms}) \]

Now imagine selecting a home at random from current listings in your area; then \(X = (X_1, X_2)\) will be a random vector for which the ordered pairs of possible values \((x_1, x_2)\) have some distribution that reflects the frequency of combinations of bedrooms and bathrooms across current listings. Write the probability of selecting a home with \(x_1\) bedrooms and \(x_2\) bathrooms as a conjunction of events, that is, as: \[ P(X_1 = x_1, X_2 = x_2) = P(\{X_1 = x_1\} \cap \{X_2 = x_2\}) \]

Suppose the joint distribution of \((X_1, X_2)\), that is, the frequencies of bed/bath pairs among listings, is given by the table below.

|               | \(x_1 = 0\) | \(x_1 = 1\) | \(x_1 = 2\) | \(x_1 = 3\) |
|---------------|-------------|-------------|-------------|-------------|
| \(x_2 = 1\)   | 0.1         | 0.1         | 0.2         | 0           |
| \(x_2 = 1.5\) | 0           | 0.1         | 0.2         | 0           |
| \(x_2 = 2\)   | 0           | 0           | 0           | 0.3         |
| \(x_2 = 2.5\) | 0           | 0           | 0           | 0           |

The table indicates, for instance, that \(P(X_1 = 2, X_2 = 1.5) = 0.2\), meaning the probability that a randomly selected listing has 2 bedrooms and 1.5 bathrooms is 0.2.

The marginal probability that a randomly selected home has 1.5 bathrooms (regardless of the number of bedrooms) can be obtained by summing the probabilities in the corresponding row: \[ P(X_2 = 1.5) = \sum_{x_1} P(X_1 = x_1, X_2 = 1.5) = 0 + 0.1 + 0.2 + 0 = 0.3 \]

Notice that these probabilities are not necessarily information you could obtain if you knew the frequencies of values of \(x_1\) and of \(x_2\) separately. For instance, computing the marginal probabilities indicates that the most common number of bathrooms is 1 and the most common number of bedrooms is 2, but that does not entail that the most frequent pair is 2 bed and 1 bath. Rather, 3 bed, 2 bath homes are most common. This is possible because the variables are measured together on each home rather than, say, on separate collections of homes.

The joint distribution therefore takes account of how variables interact across the outcomes of a random process. The example illustrates that when multiple variables are measured together, a joint distribution is needed to fully capture the probabilistic behavior of the variables.
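To make these computations concrete, here is a minimal sketch in Python (numpy), not part of the original notes, that stores the joint PMF from the table and recovers the marginal distributions along with the marginal and joint modes.

```python
import numpy as np

# Joint PMF of (X1, X2) from the table: rows index bathrooms, columns index bedrooms.
bedrooms = np.array([0, 1, 2, 3])        # possible values of X1
bathrooms = np.array([1, 1.5, 2, 2.5])   # possible values of X2
joint = np.array([
    [0.1, 0.1, 0.2, 0.0],   # x2 = 1
    [0.0, 0.1, 0.2, 0.0],   # x2 = 1.5
    [0.0, 0.0, 0.0, 0.3],   # x2 = 2
    [0.0, 0.0, 0.0, 0.0],   # x2 = 2.5
])

# Marginals: sum over the other variable.
p_x1 = joint.sum(axis=0)   # P(X1 = x1), summing over bathrooms
p_x2 = joint.sum(axis=1)   # P(X2 = x2), summing over bedrooms
print("P(X1):", p_x1)      # [0.1 0.2 0.4 0.3]
print("P(X2):", p_x2)      # [0.4 0.3 0.3 0. ]

# Marginal modes versus the joint mode.
print(bedrooms[p_x1.argmax()], bathrooms[p_x2.argmax()])   # 2 bedrooms, 1 bathroom
i, j = np.unravel_index(joint.argmax(), joint.shape)
print(bedrooms[j], bathrooms[i])                           # joint mode: 3 bedrooms, 2 bathrooms
```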

Random vectors

Formally, \(X = (X_1, X_2)\) is a random vector if for some probability space \((S, \mathcal{S}, P)\) \[ X = (X_1, X_2): S \longrightarrow \mathbb{R}^2 \] and preimages of the Borel sets in \(\mathbb{R}^2\) — sets that can be formed from countable collections of rectangles — have well-defined probabilities in the underlying space.

As with random variables, random vectors induce a probability measure \[ P_X (B) = P\left(X^{-1}(B)\right) \]

The induced measure \(P_X\) is both the joint distribution of the random variables \(X_1, X_2\) and the distribution of the random vector \(X\).

In the house hunting example, we might formalize things as follows. Suppose the sample space is a collection of \(N\) listings \(S = \{s_1, \dots, s_N\}\), and since the thought experiment involved selecting a listing at random, \(P(s_i) = \frac{1}{N}\) for each \(i\). Then the measure induced by \(X\) would be computed as the probability of selecting a house with the specified number of bedrooms and bathrooms, resulting, for instance, in: \[ P_X ((1, 1.5)) = \frac{\#\text{ 1br, 1.5ba homes}}{N} \]

There is really no fundamental difference between joint distributions and univariate distributions — the former are simply distributions of vector-valued functions rather than univariate functions.

The definition above extends directly to vectors in \(\mathbb{R}^n\) without modification. We will focus for now mostly on bivariate distributions, but where possible, concepts will be extended to collections of arbitrarily many random variables.

Characterizing multivariate distributions

Let \(X:S\rightarrow\mathbb{R}^n\) be a random vector comprising \(n\) random variables \(X_1, \dots, X_n\). The joint cumulative distribution function is defined as: \[ F(x_1, \dots, x_n) = P_X \left((-\infty, x_1] \times \cdots \times (-\infty, x_n]\right) = P(X_1 \leq x_1, \dots, X_n \leq x_n) \]

As with random variables, the joint CDF uniquely characterizes distributions, and is the basis for distinguishing discrete and continuous distributions.

The random vector \(X\) is discrete if its CDF \(F\) takes countably many values, and is continuous if \(F\) is continuous.

In the discrete case, the joint PMF is: \[ P(X_1 = x_1, \dots, X_n = x_n) = P_X \left(\{(x_1, \dots, x_n)\}\right) \] In the continuous case, the joint PDF is the function \(f\) satisfying: \[ F(x_1, \dots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(t_1, \dots, t_n)\,dt_n \cdots dt_1 \] Wherever \(F\) is sufficiently differentiable, one has: \[ f(x_1, \dots, x_n) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F(x_1, \dots, x_n) \] Joint PMFs/PDFs also uniquely characterize the distribution of \(X\). In fact, although the CDF is introduced here in the multivariate case in order to define discrete and continuous random vectors, it is rarely used in practice to compute probabilities, expectations, and the like. More often, distributions of random vectors are characterized by specifying the joint PMF/PDF.

Probabilities associated with the random vector are given in relation to the joint PMF/PDF by: \[ P_X (B) = P(X \in B) = \begin{cases} \sum_{(x_1, \dots, x_n) \in B} P(X_1 = x_1, \dots, X_n = x_n) \\ \int\cdots\int_B f(x_1, \dots, x_n)\,dx_1\cdots dx_n \end{cases} \]

An arbitrary function \(f\) is a joint PMF/PDF just in case it is nonnegative everywhere and sums/integrates to one.

Example: calculating probabilities using a joint PDF

Let \((X_1, X_2)\) have a uniform joint distribution on the unit circle:

\[ f(x_1, x_2) = \frac{1}{\pi}\;,\quad x_1^2 + x_2^2 \leq 1 \]

It is easy to check that this is a valid PDF since it is nonnegative everywhere and the area of the unit circle is \(\pi\), so \(f\) clearly integrates to one over the support. To verify analytically, note that for fixed \(x_1\), one has \(-\sqrt{1 - x_1^2} \leq x_2 \leq \sqrt{1 - x_1^2}\), and \(x_1\) itself ranges over \(-1\leq x_1 \leq 1\), so:

\[ \int_{-1}^1 \int_{-\sqrt{1 - x_1^2}}^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\,dx_2\, dx_1 = \int_{-1}^1 \frac{2}{\pi}\sqrt{1 - x_1^2}\,dx_1 = 1 \]

In this example it is a little easier to compute probabilities via areas, since for any region \(B\subseteq\mathbb{R}^2\), the probability of \(B\) is simply the area of its intersection with the unit circle, divided by \(\pi\). That is, denoting the support of the random vector by \(S = \{(x_1, x_2)\in\mathbb{R}^2: x_1^2 + x_2^2 \leq 1\}\), one has:

\[ P_X(B) = \frac{1}{\pi}\times\text{area}(B \cap S) \]

So for instance, if the event of interest is that the random vector \(X\) lies in the positive quadrant, the intersection of the unit circle with the positive quadrant comprises a quarter of the area of the unit circle, so the probability is \(\frac{1}{4}\).

More formally, if \(B = \{(x_1, x_2)\in\mathbb{R}^2: x_1 \geq 0, x_2 \geq 0\}\), then \(\text{area}(B\cap S) = \frac{\pi}{4}\), so:

\[ P_X(B) = \frac{1}{\pi}\times\frac{\pi}{4} = \frac{1}{4} \]

To compute the probability analytically using the PDF, we need to determine the integration bounds. Fixing \(x_1\), the values of \(x_2\) on \(B\cap S\) are given by \(0 \leq x_2 \leq \sqrt{1 - x_1^2}\), and \(x_1\) itself ranges over \(0 \leq x_1 \leq 1\). A change to polar coordinates then simplifies the integration:

\[ \begin{align*} P(X_1 \geq 0, X_2 \geq 0) &= \iint_{B\cap S} \frac{1}{\pi}\,dx_2\, dx_1 \\ &= \int_0^1\int_0^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\,dx_2\, dx_1\\ &= \int_0^{\frac{\pi}{2}}\int_0^1 \frac{r}{\pi}\,dr\,d\theta\\ &= \int_0^{\frac{\pi}{2}} \frac{1}{2\pi}\,d\theta\\ &= \frac{1}{4} \end{align*} \]
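As a numerical sanity check (a sketch added here, not part of the original notes), the same probability can be approximated by Monte Carlo: sample points uniformly on the disc by rejecting draws from the enclosing square, then count the fraction landing in the positive quadrant.

```python
import numpy as np

rng = np.random.default_rng(425)

# Sample uniformly on the unit disc by rejection from the square [-1, 1]^2.
n = 1_000_000
pts = rng.uniform(-1, 1, size=(n, 2))
pts = pts[(pts ** 2).sum(axis=1) <= 1]   # keep only points inside the disc

# Empirical estimate of P(X1 >= 0, X2 >= 0); the exact value is 1/4.
p_hat = np.mean((pts[:, 0] >= 0) & (pts[:, 1] >= 0))
print(round(p_hat, 3))   # close to 0.25
```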

Often the trickiest part of computing probabilities from joint PDFs is determining appropriate integration bounds. It helps considerably to sketch the support set and region of interest; we’ll review this technique in class.

Check your understanding

Find \(P(X_1 \geq X_2)\) both informally using areas, and analytically using integration.

The CDFs of individual components of a random vector can be obtained by integration or summation. Note that the event \(\{X_j \leq x_j\}\) is equivalent to \(\{X_j \leq x_j\}\cap\left[\bigcap_{i \neq j}\{-\infty < X_i < \infty\}\right]\). So: \[ P(X_j \leq x_j) = \lim_{x_{\neg j}\rightarrow\infty} F(x_1, \dots, x_n) \] For instance, in the bivariate case, if \(X = (X_1, X_2)\) has CDF \(F\), then the CDF of \(X_2\) alone is: \[ P(X_2 \leq x) = \lim_{x_1 \rightarrow \infty} F(x_1, x) \] This CDF can also be obtained from the PMF/PDF as: \[ P(X_2 \leq x) =\begin{cases} \int_{-\infty}^{x} \underbrace{\left[\int_{-\infty}^\infty f(x_1, x_2)dx_1\right]}_{\text{PDF of } X_2} dx_2 \\ \sum_{x_2 \leq x} \underbrace{\left[\sum_{x_1} P(X_1 = x_1, X_2 = x_2)\right]}_{\text{PMF of } X_2} \end{cases} \]

The expressions in square brackets must be the PDF/PMF of \(X_2\), since distribution functions are unique. Thus, the marginal distributions of individual vector components are given, in the continuous case, by ‘integrating out’ the other components: \[ \begin{align*} f_1(x_1) &= \int_\mathbb{R} f(x_1, x_2) dx_2 \\ f_2(x_2) &= \int_\mathbb{R} f(x_1, x_2) dx_1 \\ \end{align*} \]

In the discrete case, the marginal distributions are obtained by summing out the other components: \[ \begin{align*} P(X_1 = x_1) &= \sum_{x_2} P(X_1 = x_1, X_2 = x_2) \\ P(X_2 = x_2) &= \sum_{x_1} P(X_1 = x_1, X_2 = x_2) \\ \end{align*} \]

Example: finding marginal distributions

If the random vector \(X = (X_1, X_2)\) has a uniform joint distribution on the unit circle (continuing the previous example), then the marginal distribution of \(X_1\) is given by:

\[ \begin{align*} f_1 (x_1) &= \int_{-\sqrt{1 - x_1^2}}^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\,dx_2 = \frac{2}{\pi}\sqrt{1 - x_1^2} \;,\quad x_1 \in (-1, 1) \end{align*} \]

The bounds of integration are found by reasoning that for fixed \(x_1\), the possible values of \(x_2\) are given by \(-\sqrt{1 - x_1^2} \leq x_2 \leq \sqrt{1 - x_1^2}\). The marginal support of \(X_1\) is \(S_1 = (-1, 1)\).

It is perhaps somewhat surprising that \(X_1\) is not marginally uniform, given that the vector \((X_1, X_2)\) has a uniform distribution. One way to understand this fact is that if all points on the unit circle occur with equal frequency, then not all values of the \(X_1\) coordinate will occur with the same frequency; in particular, values of \(X_1\) near \(\pm 1\) are less likely because the corresponding vertical slices of the circle are shorter.
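A quick numerical check of this marginal density (a sketch using scipy, not part of the original notes): integrate the joint density over \(x_2\) at a few fixed values of \(x_1\) and compare against the closed form \(\frac{2}{\pi}\sqrt{1 - x_1^2}\).

```python
import numpy as np
from scipy.integrate import quad

def f_joint(x2, x1):
    """Uniform joint density on the unit disc."""
    return 1 / np.pi if x1 ** 2 + x2 ** 2 <= 1 else 0.0

def f1_closed_form(x1):
    return (2 / np.pi) * np.sqrt(1 - x1 ** 2)

# Integrate out x2 at a few fixed values of x1 and compare to the closed form.
for x1 in [0.0, 0.3, 0.6, 0.9]:
    bound = np.sqrt(1 - x1 ** 2)
    numeric, _ = quad(f_joint, -bound, bound, args=(x1,))
    print(x1, round(numeric, 4), round(f1_closed_form(x1), 4))
```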

Check your understanding

Verify that the marginal density above is in fact a valid PDF (hint: use the transformation \(x = \sin\theta\) to compute the integral).

Expectations

The expectation of a random vector \(X\) is defined as the vector of marginal expectations, assuming they exist: \[ \mathbb{E}X = \left[\begin{array}{c} \mathbb{E}X_1 \\ \vdots\\ \mathbb{E}X_n \end{array} \right] \]

However, the expected value of a function \(g(x_1, \dots, x_n)\) is defined, assuming the sums/integrals exist, as: \[ \mathbb{E}\left[g(X_1, \dots, X_n)\right] = \begin{cases} \int\cdots\int_{\mathbb{R}^n} g(x_1, \dots, x_n)f(x_1, \dots, x_n)dx_1\cdots dx_n \\ \sum_{x_1}\cdots\sum_{x_n} g(x_1, \dots, x_n) P(X_1 = x_1, \dots, X_n = x_n) \end{cases} \]

Example: house hunting

Consider again the house hunting example where \(X_1\) denotes the number of bedrooms and \(X_2\) denotes the number of bathrooms, and for a randomly selected listing the vector \((X_1, X_2)\) has joint distribution:

|               | \(x_1 = 0\) | \(x_1 = 1\) | \(x_1 = 2\) | \(x_1 = 3\) |
|---------------|-------------|-------------|-------------|-------------|
| \(x_2 = 1\)   | 0.1         | 0.1         | 0.2         | 0           |
| \(x_2 = 1.5\) | 0           | 0.1         | 0.2         | 0           |
| \(x_2 = 2\)   | 0           | 0           | 0           | 0.3         |
| \(x_2 = 2.5\) | 0           | 0           | 0           | 0           |

Suppose you want to know the expected ratio of bedrooms to bathrooms. The expectation is:

\[ \begin{align*} \mathbb{E}\left[\frac{X_1}{X_2}\right] &= \sum_{(x_1, x_2)} \frac{x_1}{x_2}P(X_1 = x_1, X_2 = x_2) \\ &= \frac{0}{1}\cdot 0.1 + \frac{1}{1}\cdot 0.1 + \frac{2}{1}\cdot 0.2 + \frac{1}{1.5}\cdot 0.1 + \frac{2}{1.5}\cdot 0.2 + \frac{3}{2}\cdot 0.3 \\ &\approx 1.283 \end{align*} \]

So on average, a randomly selected home has about 1.283 bedrooms for every bathroom.
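The same expectation can be computed programmatically (a sketch in Python/numpy, not part of the original notes) by summing \(g(x_1, x_2)\,P(X_1 = x_1, X_2 = x_2)\) over the table.

```python
import numpy as np

# Joint PMF from the table: rows are bathroom counts, columns are bedroom counts.
bedrooms = np.array([0, 1, 2, 3])
bathrooms = np.array([1, 1.5, 2, 2.5])
joint = np.array([
    [0.1, 0.1, 0.2, 0.0],
    [0.0, 0.1, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
])

# E[g(X1, X2)] = sum over all (x1, x2) of g(x1, x2) * P(X1 = x1, X2 = x2).
x1_grid, x2_grid = np.meshgrid(bedrooms, bathrooms)   # grids aligned with the table
ratio = x1_grid / x2_grid
print(round(np.sum(ratio * joint), 4))   # approximately 1.2833
```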

Based on this definition, it is easy to show that expectation is a linear operator. In the bivariate case: \[ \mathbb{E}\left[aX_1 + bX_2 + c\right] = a\mathbb{E}X_1 + b\mathbb{E}X_2 + c \] Slightly more generally: \[ \mathbb{E}\left[a_0 + \sum_{i = 1}^n a_i X_i\right] = a_0 + \sum_{i = 1}^n a_i \mathbb{E}X_i \]

The proofs are obtained from direct application of the definition of expectation given immediately above, and are left as exercises.
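As a concrete numerical check of the bivariate case (a worked example added here using the house-hunting table; not a proof), compare the direct computation of \(\mathbb{E}[X_1 + X_2]\) with the sum of the marginal expectations \(\mathbb{E}X_1 = 1.9\) and \(\mathbb{E}X_2 = 1.45\):

\[ \begin{align*} \mathbb{E}[X_1 + X_2] &= \sum_{(x_1, x_2)} (x_1 + x_2)\, P(X_1 = x_1, X_2 = x_2) \\ &= 1\cdot 0.1 + 2\cdot 0.1 + 3\cdot 0.2 + 2.5\cdot 0.1 + 3.5\cdot 0.2 + 5\cdot 0.3 = 3.35 \\ \mathbb{E}X_1 + \mathbb{E}X_2 &= 1.9 + 1.45 = 3.35 \end{align*} \]

Both routes give the same answer, as linearity requires.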

Example: flu season

Suppose the random vector \(X = (X_1, X_2)\) denotes the number of influenza A and influenza B cases per week in a given region, and suppose the joint distribution is given by the PMF:

\[ P(X_1 = x_1, X_2 = x_2) = \frac{\mu_1^{x_1}\mu_2^{x_2}\exp\{-(\mu_1 + \mu_2)\}}{x_1! x_2!} \quad \begin{cases} x_1 = 0, 1, 2, \dots \\ x_2 = 0, 1, 2, \dots \\ \mu_1 > 0 \\ \mu_2 > 0 \end{cases} \]

The marginal distributions are given by:

\[ \begin{align*} P(X_1 = x_1) &= \sum_{x_2 = 0}^\infty P(X_1 = x_1, X_2 = x_2) \\ &= \frac{\mu_1^{x_1}e^{-\mu_1}}{x_1!} \sum_{x_2 = 0}^\infty \frac{\mu_2^{x_2} e^{-\mu_2}}{x_2!} \\ &= \frac{\mu_1^{x_1}e^{-\mu_1}}{x_1!} \;,\quad x_1 = 0, 1, 2, \dots \\ P(X_2 = x_2) &= \sum_{x_1 = 0}^\infty P(X_1 = x_1, X_2 = x_2) \\ &= \frac{\mu_2^{x_2}e^{-\mu_2}}{x_2!} \sum_{x_1 = 0}^\infty \frac{\mu_1^{x_1} e^{-\mu_1}}{x_1!} \\ &= \frac{\mu_2^{x_2}e^{-\mu_2}}{x_2!} \;,\quad x_2 = 0, 1, 2, \dots \end{align*} \]

In other words, \(X_1 \sim \text{Poisson}(\mu_1)\) and \(X_2 \sim\text{Poisson}(\mu_2)\). Therefore \(\mathbb{E}X_1 = \mu_1\) and \(\mathbb{E}X_2 = \mu_2\), so by linearity of expectation the expected total number of flu cases is:

\[ \mathbb{E}\left[X_1 + X_2\right] = \mathbb{E}X_1 + \mathbb{E}X_2 = \mu_1 + \mu_2 \]
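The marginal calculation above can also be checked numerically (a sketch in Python/scipy, not part of the original notes; the values of \(\mu_1\) and \(\mu_2\) are arbitrary) by coding the joint PMF as given and summing out \(x_2\) over a wide range.

```python
from math import exp, factorial
from scipy.stats import poisson

mu1, mu2 = 3.0, 1.5   # arbitrary rates for illustration

def joint_pmf(x1, x2):
    """Joint PMF of weekly flu A and flu B counts, as given above."""
    return mu1**x1 * mu2**x2 * exp(-(mu1 + mu2)) / (factorial(x1) * factorial(x2))

# Sum out x2 over a range wide enough that the truncation error is negligible,
# and compare the result to the Poisson(mu1) PMF.
for x1 in range(6):
    marginal = sum(joint_pmf(x1, x2) for x2 in range(100))
    print(x1, round(marginal, 6), round(poisson.pmf(x1, mu1), 6))
```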

What if we want to know not just the expected total number of flu cases, but its probability distribution? The strategy here is to make a one-to-one transformation in which one of the transformed variables is the sum, and then compute the marginal distribution of that transformed variable. To that end, define:

\[ Y_1 = X_1 + X_2 \quad\text{and}\quad Y_2 = X_2 \]

This transformation is one-to-one, and the support set is given by \(S_Y = \{(y_1, y_2): y_1 = 0, 1, 2, \dots; y_2 = 0, 1, 2, \dots, y_1\}\). The inverse transformation is given by:

\[ X_1 = Y_1 - Y_2 \quad\text{and}\quad X_2 = Y_2 \]

So the joint distribution of \(Y = (Y_1, Y_2)\) is:

\[ P(Y_1 = y_1, Y_2 = y_2) = P(X_1 = y_1 - y_2, X_2 = y_2) = \frac{\mu_1^{y_1 - y_2}\mu_2^{y_2}\exp\{-(\mu_1 + \mu_2)\}}{(y_1 - y_2)! y_2!} \]

And therefore the marginal distribution of \(Y_1\) is obtained by summing out \(Y_2\). Note that \(y_2 \leq y_1\), so the sum should be computed up to \(y_1\):

\[ \begin{align*} P(Y_1 = y_1) &= \sum_{y_2 = 0}^{y_1} P(Y_1 = y_1, Y_2 = y_2) \\ &= \sum_{y_2 = 0}^{y_1} \frac{\mu_1^{y_1 - y_2}\mu_2^{y_2}\exp\{-(\mu_1 + \mu_2)\}}{(y_1 - y_2)! y_2!} \\ &= \frac{\exp\{-(\mu_1 + \mu_2)\}}{y_1!}\sum_{y_2 = 0}^{y_1} {y_1 \choose y_2} \mu_1^{y_1 - y_2} \mu_2^{y_2} \\ &= \frac{(\mu_1 + \mu_2)^{y_1}\exp\{-(\mu_1 + \mu_2)\}}{y_1!} \end{align*} \] The last step follows from the binomial theorem.

So \(Y_1 = X_1 + X_2 \sim\text{Poisson}(\mu_1 + \mu_2)\).
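A short simulation (a sketch in Python, not part of the original notes; the rate values are arbitrary) confirms the result: because the joint PMF factorizes into the two Poisson marginals, the coordinates can be simulated separately, and the empirical distribution of \(X_1 + X_2\) matches the \(\text{Poisson}(\mu_1 + \mu_2)\) PMF.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(425)
mu1, mu2 = 3.0, 1.5   # arbitrary rate parameters for illustration

# Simulate the weekly counts and their total.
x1 = rng.poisson(mu1, size=200_000)
x2 = rng.poisson(mu2, size=200_000)
total = x1 + x2

# Compare the empirical PMF of X1 + X2 to the Poisson(mu1 + mu2) PMF.
for y in range(8):
    empirical = np.mean(total == y)
    exact = poisson.pmf(y, mu1 + mu2)
    print(y, round(empirical, 4), round(exact, 4))
```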

The above example illustrates a bivariate transformation. This will be our next topic.