Joint distributions

Course notes

STAT425, Fall 2023


December 15, 2023

This week we extend the concept of probability distributions to multiple variables by introducing multivariate probability distributions. A multivariate distribution can be thought of in either of two ways, as either:

- the joint distribution of several random variables defined on a common probability space; or
- the distribution of a single vector-valued random variable (a random vector).

All of the same concepts for univariate distributions — distribution functions, transformations, and expectations — extend easily to the multivariate setting.

House hunting

The need for multivariate distributions can be motivated by a simple example: consider shopping for a home or apartment. We might record the number of bedrooms and bathrooms for every home as the vector:

$$x = (x_1, x_2) = (\#\text{ bedrooms}, \#\text{ bathrooms})$$

Now imagine selecting a home at random from current listings in your area; then $X = (X_1, X_2)$ will be a random vector for which the ordered pairs of possible values $(x_1, x_2)$ have some distribution that reflects the frequency of combinations of bedrooms and bathrooms across current listings. Write the probability of selecting a home with $x_1$ bedrooms and $x_2$ bathrooms as a conjunction of events, that is, as:

$$P(X_1 = x_1, X_2 = x_2) = P\big(\{X_1 = x_1\} \cap \{X_2 = x_2\}\big)$$

Suppose the joint distribution of $(X_1, X_2)$, that is, the frequencies of bed/bath pairs among listings, is given by the table below.

|             | $x_1 = 0$ | $x_1 = 1$ | $x_1 = 2$ | $x_1 = 3$ |
|-------------|-----------|-----------|-----------|-----------|
| $x_2 = 1$   | 0.1       | 0.1       | 0.2       | 0         |
| $x_2 = 1.5$ | 0         | 0.1       | 0.2       | 0         |
| $x_2 = 2$   | 0         | 0         | 0         | 0.3       |
| $x_2 = 2.5$ | 0         | 0         | 0         | 0         |

The table indicates, for instance, that $P(X_1 = 2, X_2 = 1.5) = 0.2$, meaning the probability that a randomly selected listing has 2 bedrooms and 1.5 bathrooms is 0.2.

The marginal probability that a randomly selected home has 1.5 bathrooms (regardless of the number of bedrooms) can be obtained by summing the probabilities in the corresponding row:

$$P(X_2 = 1.5) = \sum_{x_1} P(X_1 = x_1, X_2 = 1.5) = 0 + 0.1 + 0.2 + 0 = 0.3$$

Notice that the joint probabilities are not information you could recover if you only knew the frequencies of values of $x_1$ and of $x_2$ separately. For instance, computing the marginal probabilities indicates that the most common number of bathrooms is 1 and the most common number of bedrooms is 2, but that doesn't entail that the most frequent pair is 2 bedrooms and 1 bathroom. Rather, 3-bedroom, 2-bathroom homes are most common. This is possible because the variables are measured together on each home rather than, say, on separate collections of homes.

The joint distribution therefore takes account of how variables interact across the outcomes of a random process. The example illustrates that when multiple variables are measured together, a joint distribution is needed to fully capture the probabilistic behavior of the variables.
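As a quick numerical illustration, here is a minimal sketch in Python (the dictionary `joint` simply encodes the table above; the variable names are ours, not part of the example) that tabulates the marginal PMFs and compares the marginal modes with the joint mode:

```python
# Joint PMF of (bedrooms, bathrooms), encoding the table above
joint = {
    (0, 1): 0.1, (1, 1): 0.1, (2, 1): 0.2, (3, 1): 0.0,
    (0, 1.5): 0.0, (1, 1.5): 0.1, (2, 1.5): 0.2, (3, 1.5): 0.0,
    (0, 2): 0.0, (1, 2): 0.0, (2, 2): 0.0, (3, 2): 0.3,
    (0, 2.5): 0.0, (1, 2.5): 0.0, (2, 2.5): 0.0, (3, 2.5): 0.0,
}

# Marginal PMFs: sum the joint PMF over the other coordinate
marg_bed, marg_bath = {}, {}
for (x1, x2), p in joint.items():
    marg_bed[x1] = marg_bed.get(x1, 0) + p
    marg_bath[x2] = marg_bath.get(x2, 0) + p

print(max(marg_bed, key=marg_bed.get))    # 2: most common bedroom count
print(max(marg_bath, key=marg_bath.get))  # 1: most common bathroom count
print(max(joint, key=joint.get))          # (3, 2): most common (bed, bath) pair
```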

Random vectors

Formally, $X = (X_1, X_2)$ is a random vector if, for some probability space $(S, \mathcal{S}, P)$, $X = (X_1, X_2): S \to \mathbb{R}^2$ and preimages of the Borel sets in $\mathbb{R}^2$ (sets that can be formed from countable collections of rectangles) have well-defined probabilities in the underlying space.

As with random variables, random vectors induce a probability measure:

$$P_X(B) = P\big(X^{-1}(B)\big)$$

The induced measure $P_X$ is both the joint distribution of the random variables $X_1, X_2$ and the distribution of the random vector $X$.

In the house hunting example, we might formalize things as follows. Suppose the sample space is a collection of $N$ listings $S = \{s_1, \ldots, s_N\}$, and since the thought experiment involved selecting a listing at random, $P(s_i) = \frac{1}{N}$ for each $i$. Then the measure induced by $X$ would be computed as the probability of selecting a house with the specified number of bedrooms and bathrooms, resulting, for instance, in:

$$P_X\big(\{(1, 1.5)\}\big) = \frac{\#\{\text{1 br, 1.5 ba homes}\}}{N}$$

There is really no fundamental difference between joint distributions and univariate distributions: the former are simply distributions of vector-valued rather than scalar-valued functions.

The definition above extends directly to vectors in $\mathbb{R}^n$ without modification. We will focus for now mostly on bivariate distributions, but where possible, concepts will be extended to collections of arbitrarily many random variables.

Characterizing multivariate distributions

Let $X: S \to \mathbb{R}^n$ be a random vector comprising $n$ random variables $X_1, \ldots, X_n$. The joint cumulative distribution function (CDF) is defined as:

$$F(x_1, \ldots, x_n) = P_X\big((-\infty, x_1] \times \cdots \times (-\infty, x_n]\big) = P(X_1 \leq x_1, \ldots, X_n \leq x_n)$$

As with random variables, the joint CDF uniquely characterizes distributions, and is the basis for distinguishing discrete and continuous distributions.

The random vector $X$ is discrete if its CDF $F$ takes countably many values, and is continuous if $F$ is continuous.

In the discrete case, the joint PMF is:

$$P(X_1 = x_1, \ldots, X_n = x_n) = P_X\big(\{(x_1, \ldots, x_n)\}\big)$$

In the continuous case, the joint PDF is the function $f$ satisfying:

$$F(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(x_1, \ldots, x_n)\, dx_n \cdots dx_1$$

Typically, one has:

$$f(x_1, \ldots, x_n) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F(x_1, \ldots, x_n)$$

Joint PMFs/PDFs also uniquely characterize the distribution of $X$. In fact, although the CDF is introduced here in the multivariate case in order to define discrete and continuous random vectors, it is rarely used in practice to compute probabilities, expectations, and the like. More often, distributions of random vectors are characterized by specifying the joint PMF/PDF.

Probabilities associated with the random vector are given in relation to the joint PMF/PDF by:

$$P_X(B) = P(X \in B) = \begin{cases} \displaystyle\sum_{x \in B} P(X_1 = x_1, \ldots, X_n = x_n) & X \text{ discrete} \\[2ex] \displaystyle\int \cdots \int_B f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n & X \text{ continuous} \end{cases}$$

An arbitrary function f is a joint PMF/PDF just in case it is nonnegative everywhere and sums/integrates to one.

Example: calculating probabilities using a joint PDF

Let $(X_1, X_2)$ have a uniform joint distribution on the unit circle:

$$f(x_1, x_2) = \frac{1}{\pi}, \quad x_1^2 + x_2^2 \leq 1$$

It is easy to check that this is a valid PDF since it is nonnegative everywhere and the area of the unit circle is $\pi$, so $f$ clearly integrates to one over the support. To verify analytically, note that for fixed $x_1$, one has $-\sqrt{1 - x_1^2} \leq x_2 \leq \sqrt{1 - x_1^2}$, and across all values of $x_2$, one has $-1 \leq x_1 \leq 1$, so:

$$\int_{-1}^{1} \int_{-\sqrt{1 - x_1^2}}^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\, dx_2\, dx_1 = \int_{-1}^{1} \frac{2\sqrt{1 - x_1^2}}{\pi}\, dx_1 = 1$$
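As a numerical sanity check (a sketch only; it assumes `scipy` is available), the same double integral can be evaluated with `scipy.integrate.dblquad`:

```python
import numpy as np
from scipy.integrate import dblquad

# Integrate f(x1, x2) = 1/pi over the unit circle:
# x1 ranges over [-1, 1]; for fixed x1, x2 ranges over [-sqrt(1-x1^2), sqrt(1-x1^2)].
# dblquad integrates func(y, x) with the y-limits given as functions of x.
total, err = dblquad(
    lambda x2, x1: 1 / np.pi,
    -1, 1,
    lambda x1: -np.sqrt(1 - x1**2),
    lambda x1: np.sqrt(1 - x1**2),
)
print(total)  # approximately 1
```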

In this example it is a little easier to compute probabilities via areas, since for any region $B \subseteq \mathbb{R}^2$, the probability of $B$ is simply the area of its intersection with the unit circle, divided by $\pi$. That is, denoting the support of the random vector by $S = \{(x_1, x_2) \in \mathbb{R}^2 : x_1^2 + x_2^2 \leq 1\}$, one has:

$$P_X(B) = \frac{1}{\pi} \times \text{area}(B \cap S)$$

So for instance, if the event of interest is that the random vector $X$ lies in the positive quadrant, the intersection of the unit circle with the positive quadrant comprises a quarter of the area of the unit circle, so the probability is $\frac{1}{4}$.

More formally, if $B = \{(x_1, x_2) \in \mathbb{R}^2 : x_1 \geq 0, x_2 \geq 0\}$, then $\text{area}(B \cap S) = \frac{\pi}{4}$, so:

$$P_X(B) = \frac{1}{\pi} \times \frac{\pi}{4} = \frac{1}{4}$$

To compute the probability analytically using the PDF, we need to determine the integration bounds. Fixing $x_1$, one has that on $B \cap S$ the values of $x_2$ are given by $0 \leq x_2 \leq \sqrt{1 - x_1^2}$, and across all values of $x_2$ on $B \cap S$, the values of $x_1$ are given by $0 \leq x_1 \leq 1$. Polar coordinates simplify the integration:

$$P(X_1 \geq 0, X_2 \geq 0) = \iint_{B \cap S} \frac{1}{\pi}\, dx_2\, dx_1 = \int_0^1 \int_0^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\, dx_2\, dx_1 = \int_0^{\pi/2} \int_0^1 \frac{r}{\pi}\, dr\, d\theta = \int_0^{\pi/2} \frac{1}{2\pi}\, d\theta = \frac{1}{4}$$

Often the trickiest part of computing probabilities from joint PDFs is determining appropriate integration bounds. It helps considerably to sketch the support set and region of interest; we’ll review this technique in class.
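For intuition, here is a quick Monte Carlo check of the same probability (a sketch; the sample size and seed are arbitrary choices): sample points uniformly on the unit circle by rejection from the enclosing square, then estimate the probability of landing in the positive quadrant.

```python
import numpy as np

rng = np.random.default_rng(425)
n = 1_000_000

# Rejection sampling: draw uniformly on [-1, 1]^2 and keep points inside the unit circle
pts = rng.uniform(-1, 1, size=(n, 2))
inside = pts[pts[:, 0]**2 + pts[:, 1]**2 <= 1]

# Estimate P(X1 >= 0, X2 >= 0)
est = np.mean((inside[:, 0] >= 0) & (inside[:, 1] >= 0))
print(est)  # approximately 0.25
```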

Check your understanding

Find $P(X_1 \leq X_2)$ both informally using areas, and analytically using integration.

The CDFs of individual components of a random vector can be obtained by integration or summation. Note that the event $\{X_j \leq x_j\}$ is equivalent to $\{X_j \leq x_j\} \cap \left[\bigcap_{i \neq j} \{-\infty < X_i < \infty\}\right]$. So:

$$P(X_j \leq x_j) = \lim_{x_i \to \infty,\ i \neq j} F(x_1, \ldots, x_n)$$

For instance, in the bivariate case, if $X = (X_1, X_2)$ has CDF $F$, then the CDF of $X_2$ alone is:

$$P(X_2 \leq x) = \lim_{x_1 \to \infty} F(x_1, x)$$

This CDF can also be obtained from the PMF/PDF as:

$$P(X_2 \leq x) = \begin{cases} \displaystyle\int_{-\infty}^{x} \underbrace{\left[\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1\right]}_{\text{PDF of } X_2} dx_2 & X \text{ continuous} \\[3ex] \displaystyle\sum_{x_2 \leq x} \underbrace{\left[\sum_{x_1} P(X_1 = x_1, X_2 = x_2)\right]}_{\text{PMF of } X_2} & X \text{ discrete} \end{cases}$$

The expressions in square brackets must be the PDF/PMF of $X_2$, since distribution functions are unique. Thus, the marginal distributions of individual vector components are given, in the continuous case, by 'integrating out' the other components:

$$f_1(x_1) = \int_{\mathbb{R}} f(x_1, x_2)\, dx_2 \qquad f_2(x_2) = \int_{\mathbb{R}} f(x_1, x_2)\, dx_1$$

In the discrete case, the marginal distributions are obtained by summing out the other components:

$$P(X_1 = x_1) = \sum_{x_2} P(X_1 = x_1, X_2 = x_2) \qquad P(X_2 = x_2) = \sum_{x_1} P(X_1 = x_1, X_2 = x_2)$$

Example: finding marginal distributions

If the random vector $X = (X_1, X_2)$ has a uniform joint distribution on the unit circle (continuing the previous example), then the marginal distribution of $X_1$ is given by:

$$f_1(x_1) = \int_{-\sqrt{1 - x_1^2}}^{\sqrt{1 - x_1^2}} \frac{1}{\pi}\, dx_2 = \frac{2}{\pi}\sqrt{1 - x_1^2}, \quad x_1 \in (-1, 1)$$

The bounds of integration are found by reasoning that for fixed $x_1$, the possible values of $x_2$ are given by $-\sqrt{1 - x_1^2} \leq x_2 \leq \sqrt{1 - x_1^2}$. The marginal support of $X_1$ is $S_1 = (-1, 1)$.

It is perhaps somewhat surprising that $X_1$ is not marginally uniform, given that the vector $(X_1, X_2)$ has a uniform distribution. One way to understand this fact is that if all points in the unit circle occur with equal frequency, then not all values of the $X_1$ coordinate will occur with the same frequency; in particular, values of $X_1$ near $\pm 1$ are less likely since the corresponding vertical slices of the circle are shorter.
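The non-uniform marginal can also be seen empirically by reusing rejection samples from the unit circle (as in the earlier sketch; the bin and sample-size choices below are arbitrary) and comparing the empirical density of $X_1$ with $f_1(x_1) = \frac{2}{\pi}\sqrt{1 - x_1^2}$:

```python
import numpy as np

rng = np.random.default_rng(425)
pts = rng.uniform(-1, 1, size=(1_000_000, 2))
disc = pts[pts[:, 0]**2 + pts[:, 1]**2 <= 1]  # uniform on the unit circle

# Empirical density of X1 vs. the marginal f1(x1) = (2/pi) * sqrt(1 - x1^2)
bins = np.linspace(-1, 1, 21)
hist, edges = np.histogram(disc[:, 0], bins=bins, density=True)
mids = (edges[:-1] + edges[1:]) / 2
f1 = (2 / np.pi) * np.sqrt(1 - mids**2)

print(np.round(hist, 3))  # empirical marginal density of X1
print(np.round(f1, 3))    # theoretical marginal density; smallest near +/-1
```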

Check your understanding

Verify that the marginal density above is in fact a valid PDF (hint: use the transformation $x = \sin\theta$ to compute the integral).

Expectations

The expectation of a random vector $X$ is defined as the vector of marginal expectations, assuming they exist:

$$EX = \begin{bmatrix} EX_1 \\ \vdots \\ EX_n \end{bmatrix}$$

However, the expected value of a function $g(x_1, \ldots, x_n)$ is defined, assuming the sums/integrals exist, as:

$$E[g(X_1, \ldots, X_n)] = \begin{cases} \displaystyle\int_{\mathbb{R}^n} g(x_1, \ldots, x_n)\, f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n & X \text{ continuous} \\[2ex] \displaystyle\sum_{x_1} \cdots \sum_{x_n} g(x_1, \ldots, x_n)\, P(X_1 = x_1, \ldots, X_n = x_n) & X \text{ discrete} \end{cases}$$

Example: house hunting

Consider again the house hunting example where $X_1$ denotes the number of bedrooms and $X_2$ denotes the number of bathrooms, and for a randomly selected listing the vector $(X_1, X_2)$ has joint distribution:

|             | $x_1 = 0$ | $x_1 = 1$ | $x_1 = 2$ | $x_1 = 3$ |
|-------------|-----------|-----------|-----------|-----------|
| $x_2 = 1$   | 0.1       | 0.1       | 0.2       | 0         |
| $x_2 = 1.5$ | 0         | 0.1       | 0.2       | 0         |
| $x_2 = 2$   | 0         | 0         | 0         | 0.3       |
| $x_2 = 2.5$ | 0         | 0         | 0         | 0         |

Suppose you want to know the expected ratio of bedrooms to bathrooms. The expectation is:

$$E\left[\frac{X_1}{X_2}\right] = \sum_{(x_1, x_2)} \frac{x_1}{x_2}\, P(X_1 = x_1, X_2 = x_2) = \frac{0}{1}(0.1) + \frac{1}{1}(0.1) + \frac{2}{1}(0.2) + \frac{1}{1.5}(0.1) + \frac{2}{1.5}(0.2) + \frac{3}{2}(0.3) \approx 1.283$$

So on average, a randomly selected home will have 1.283 bedrooms to every bathroom.
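The same calculation in code, as a small sketch (the `joint` dictionary again encodes the table, with zero-probability pairs omitted):

```python
# Expected bedroom-to-bathroom ratio, E[X1 / X2], from the joint PMF table
joint = {
    (0, 1): 0.1, (1, 1): 0.1, (2, 1): 0.2,
    (1, 1.5): 0.1, (2, 1.5): 0.2,
    (3, 2): 0.3,
}  # pairs with probability zero omitted

expected_ratio = sum(x1 / x2 * p for (x1, x2), p in joint.items())
print(expected_ratio)  # approximately 1.283
```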

Based on this definition, it is easy to show that expectation is a linear operator. In the bivariate case:

$$E[aX_1 + bX_2 + c] = aEX_1 + bEX_2 + c$$

Slightly more generally:

$$E\left[a_0 + \sum_{i=1}^n a_i X_i\right] = a_0 + \sum_{i=1}^n a_i EX_i$$

The proofs are obtained from direct application of the definition of expectation given immediately above, and are left as exercises.

Example: flu season

Suppose the random vector $X = (X_1, X_2)$ denotes the number of influenza A and influenza B cases per week in a given region, and suppose the joint distribution is given by the PMF:

$$P(X_1 = x_1, X_2 = x_2) = \frac{\mu_1^{x_1} \mu_2^{x_2} \exp\{-(\mu_1 + \mu_2)\}}{x_1!\, x_2!}, \quad \begin{cases} x_1 = 0, 1, 2, \ldots \\ x_2 = 0, 1, 2, \ldots \\ \mu_1 > 0,\ \mu_2 > 0 \end{cases}$$

The marginal distributions are given by:

$$P(X_1 = x_1) = \sum_{x_2 = 0}^{\infty} P(X_1 = x_1, X_2 = x_2) = \frac{\mu_1^{x_1} e^{-\mu_1}}{x_1!} \sum_{x_2 = 0}^{\infty} \frac{\mu_2^{x_2} e^{-\mu_2}}{x_2!} = \frac{\mu_1^{x_1} e^{-\mu_1}}{x_1!}, \quad x_1 = 0, 1, 2, \ldots$$

$$P(X_2 = x_2) = \sum_{x_1 = 0}^{\infty} P(X_1 = x_1, X_2 = x_2) = \frac{\mu_2^{x_2} e^{-\mu_2}}{x_2!} \sum_{x_1 = 0}^{\infty} \frac{\mu_1^{x_1} e^{-\mu_1}}{x_1!} = \frac{\mu_2^{x_2} e^{-\mu_2}}{x_2!}, \quad x_2 = 0, 1, 2, \ldots$$

In other words, $X_1 \sim \text{Poisson}(\mu_1)$ and $X_2 \sim \text{Poisson}(\mu_2)$. Therefore $EX_1 = \mu_1$ and $EX_2 = \mu_2$, so by linearity of expectation the expected total number of flu cases is:

$$E[X_1 + X_2] = EX_1 + EX_2 = \mu_1 + \mu_2$$

What if we want to know not just the expected total number of flu cases, but its probability distribution? The strategy here is to make a one-to-one transformation in which one of the transformed variables is the sum, and then compute the marginal distribution of that transformed variable. To that end, define:

$$Y_1 = X_1 + X_2 \quad\text{and}\quad Y_2 = X_2$$

This transformation is one-to-one, and the support set is given by $S_Y = \{(y_1, y_2) : y_1 = 0, 1, 2, \ldots;\ y_2 = 0, 1, \ldots, y_1\}$. The inverse transformation is given by:

$$X_1 = Y_1 - Y_2 \quad\text{and}\quad X_2 = Y_2$$

So the joint distribution of $Y = (Y_1, Y_2)$ is:

$$P(Y_1 = y_1, Y_2 = y_2) = P(X_1 = y_1 - y_2, X_2 = y_2) = \frac{\mu_1^{y_1 - y_2} \mu_2^{y_2} \exp\{-(\mu_1 + \mu_2)\}}{(y_1 - y_2)!\, y_2!}$$

And therefore the marginal distribution of $Y_1$ is obtained by summing out $Y_2$. Note that $y_2 \leq y_1$, so the sum should be computed up to $y_1$.

$$P(Y_1 = y_1) = \sum_{y_2 = 0}^{y_1} P(Y_1 = y_1, Y_2 = y_2) = \sum_{y_2 = 0}^{y_1} \frac{\mu_1^{y_1 - y_2} \mu_2^{y_2} \exp\{-(\mu_1 + \mu_2)\}}{(y_1 - y_2)!\, y_2!} = \frac{\exp\{-(\mu_1 + \mu_2)\}}{y_1!} \sum_{y_2 = 0}^{y_1} \binom{y_1}{y_2} \mu_1^{y_1 - y_2} \mu_2^{y_2} = \frac{(\mu_1 + \mu_2)^{y_1} \exp\{-(\mu_1 + \mu_2)\}}{y_1!}$$

where the last equality follows from the binomial theorem.

So $Y_1 = X_1 + X_2 \sim \text{Poisson}(\mu_1 + \mu_2)$.
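Before moving on, this result can be checked numerically (a sketch using `scipy.stats.poisson`; the rates $\mu_1 = 2$ and $\mu_2 = 3$ are arbitrary illustrative choices): sum the joint PMF of $(Y_1, Y_2)$ over $y_2$ and compare with the Poisson($\mu_1 + \mu_2$) PMF.

```python
import numpy as np
from scipy.stats import poisson

mu1, mu2 = 2.0, 3.0  # arbitrary rates for illustration

# Marginal PMF of Y1 = X1 + X2 by summing the joint PMF of (Y1, Y2) over y2 = 0..y1;
# the joint PMF from the example factors as Poisson(mu1) x Poisson(mu2) terms.
def pmf_sum(y1):
    return sum(
        poisson.pmf(y1 - y2, mu1) * poisson.pmf(y2, mu2)
        for y2 in range(y1 + 1)
    )

for y1 in range(6):
    print(y1, round(pmf_sum(y1), 6), round(poisson.pmf(y1, mu1 + mu2), 6))
# The two columns agree: Y1 ~ Poisson(mu1 + mu2)
```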

The above example illustrates a bivariate transformation. This will be our next topic.