Covariance and correlation

Course notes

STAT425, Fall 2023


December 15, 2023

Covariance and correlation are measures of dependence between two random variables based on their joint distribution. They quantify the tendency of values of the random variables to vary together, or to “co-vary”. They are signed measures, with the sign indicating whether they tend to vary in opposite directions (negative sign) or the same direction (positive sign).

Covariance

If $X_1, X_2$ are random variables, then the covariance between them is defined as the expectation:
$$\operatorname{cov}(X_1, X_2) = E\left[(X_1 - EX_1)(X_2 - EX_2)\right]$$
The expectation is computed from the joint distribution of $(X_1, X_2)$, so for instance if the random vector is discrete:
$$\operatorname{cov}(X_1, X_2) = \sum_{x_1}\sum_{x_2} (x_1 - EX_1)(x_2 - EX_2)\, P(X_1 = x_1, X_2 = x_2)$$
And if the random vector is continuous:
$$\operatorname{cov}(X_1, X_2) = \int\!\!\int (x_1 - EX_1)(x_2 - EX_2)\, f(x_1, x_2)\, dx_1\, dx_2$$
It is immediate that covariance is a symmetric operator, i.e., $\operatorname{cov}(X_1, X_2) = \operatorname{cov}(X_2, X_1)$. Additionally, by expanding the product and applying linearity of expectation one obtains the covariance formula:
$$\operatorname{cov}(X_1, X_2) = E(X_1 X_2) - EX_1\, EX_2$$
This provides a convenient way to calculate covariances, much as the variance formula simplifies the calculation of variances.
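Both expressions can be checked numerically. Below is a minimal Python sketch using a small, hypothetical joint pmf (the probabilities are invented purely for illustration) that computes the covariance from the definition and from the shortcut formula; the two agree.

```python
import numpy as np

# A small, hypothetical joint pmf for illustration (values are made up):
# rows index x1 in {0, 1}, columns index x2 in {0, 1, 2}.
pmf = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.15, 0.20]])
x1_vals = np.array([0, 1])
x2_vals = np.array([0, 1, 2])

# Marginal expectations E[X1] and E[X2]
E_x1 = np.sum(x1_vals * pmf.sum(axis=1))
E_x2 = np.sum(x2_vals * pmf.sum(axis=0))

# Covariance from the definition: sum of (x1 - E[X1])(x2 - E[X2]) P(X1 = x1, X2 = x2)
cov_def = sum((x1 - E_x1) * (x2 - E_x2) * pmf[i, j]
              for i, x1 in enumerate(x1_vals)
              for j, x2 in enumerate(x2_vals))

# Covariance from the shortcut formula: E[X1 X2] - E[X1] E[X2]
E_x1x2 = sum(x1 * x2 * pmf[i, j]
             for i, x1 in enumerate(x1_vals)
             for j, x2 in enumerate(x2_vals))
cov_formula = E_x1x2 - E_x1 * E_x2

print(cov_def, cov_formula)  # the two should agree
```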

Linearity of expectation also entails that covariance is “bilinear”, meaning it is linear in each argument:
$$\operatorname{cov}(aX_1 + b, X_2) = a\operatorname{cov}(X_1, X_2) + \operatorname{cov}(b, X_2)$$
It is easy to show, however, that $\operatorname{cov}(b, X_2) = 0$:
$$\operatorname{cov}(b, X) = E\left[(b - Eb)(X - EX)\right] = E\big[\underbrace{(b - b)}_{0}(X - EX)\big] = 0$$
Intuitively, this makes sense, since constants don’t vary at all. Lastly, notice that $\operatorname{cov}(X, X) = \operatorname{var}(X)$.
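These properties are also easy to verify on simulated data. The brief sketch below (the simulated pair and the constants `a` and `b` are arbitrary choices, not part of the notes) checks that the constant shift drops out, the scale factor comes out front, and that $\operatorname{cov}(X, X) = \operatorname{var}(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated draws from an arbitrary dependent pair (purely illustrative).
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

def cov(u, v):
    """Sample covariance with population-style normalization."""
    return np.mean((u - u.mean()) * (v - v.mean()))

a, b = 3.0, 7.0
print(cov(a * x + b, y), a * cov(x, y))  # approximately equal: the shift drops out, the scale factors out
print(cov(x, x), np.var(x))              # cov(X, X) equals var(X)
```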

Exercise

Use bilinearity of covariance to show that:

  1. $\operatorname{var}(c) = 0$ for any constant $c$
  2. $\operatorname{var}(aX + b) = a^2 \operatorname{var}(X)$
Example: calculating a covariance

Let $(X_1, X_2)$ be a continuous random vector distributed on the unit square according to the density:
$$f(x_1, x_2) = x_1 + x_2, \qquad (x_1, x_2) \in (0, 1) \times (0, 1)$$

To find the covariance, one needs the expectations $E X_1 X_2$, $EX_1$, and $EX_2$. Marginally, $X_1$ and $X_2$ have the same distribution, so the calculation will be shown only for $X_1$:
$$f_1(x_1) = \int_0^1 (x_1 + x_2)\, dx_2 = x_1 + \tfrac{1}{2}, \qquad x_1 \in (0, 1)$$
$$EX_1 = \int_0^1 x_1\left(x_1 + \tfrac{1}{2}\right) dx_1 = \tfrac{7}{12}, \qquad EX_2 = EX_1 = \tfrac{7}{12}$$
Then (the two terms in the third line are equal after relabeling the variables of integration):
$$\begin{aligned}
E X_1 X_2 &= \int_0^1\!\!\int_0^1 x_1 x_2 (x_1 + x_2)\, dx_1\, dx_2 \\
&= \int_0^1\!\!\int_0^1 \left(x_1^2 x_2 + x_1 x_2^2\right) dx_1\, dx_2 \\
&= \int_0^1\!\!\int_0^1 x_1^2 x_2\, dx_1\, dx_2 + \int_0^1\!\!\int_0^1 x_1 x_2^2\, dx_1\, dx_2 \\
&= 2\int_0^1\!\!\int_0^1 x^2 y\, dx\, dy \\
&= 2\int_0^1 \tfrac{1}{2} x^2\, dx = \tfrac{1}{3}
\end{aligned}$$
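If you want to double-check these integrals, numerical integration works well. The following Python sketch uses `scipy.integrate.dblquad` to reproduce $EX_1 = \tfrac{7}{12}$ and $E X_1 X_2 = \tfrac{1}{3}$ for this density.

```python
from scipy.integrate import dblquad

# Joint density f(x1, x2) = x1 + x2 on the unit square.
# dblquad integrates func(y, x) with y as the inner variable, so pass x2 first.
def f(x2, x1):
    return x1 + x2

# E[X1] = integral of x1 * f(x1, x2) over the unit square
E_x1, _ = dblquad(lambda x2, x1: x1 * f(x2, x1), 0, 1, 0, 1)

# E[X1 X2] = integral of x1 * x2 * f(x1, x2) over the unit square
E_x1x2, _ = dblquad(lambda x2, x1: x1 * x2 * f(x2, x1), 0, 1, 0, 1)

print(E_x1, 7 / 12)    # both approximately 0.58333
print(E_x1x2, 1 / 3)   # both approximately 0.33333
```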

Combining these results:
$$\operatorname{cov}(X_1, X_2) = E X_1 X_2 - EX_1\, EX_2 = \tfrac{1}{3} - \left(\tfrac{7}{12}\right)^2 = -\tfrac{1}{144}$$

Check your understanding

  1. What is $\operatorname{cov}(X_1, X_2)$?
  2. What is $\operatorname{cov}(X_2, X_1)$?
  3. What is $\operatorname{cov}(3X_1 - 2,\, 5X_2 + 1)$?

Correlation

Observe that shifting a random vector by a constant will not change the covariance, but scaling will. For example, continuing the example immediately above, by bilinearity one has that $\operatorname{cov}(10X_1, 10X_2) = -\tfrac{100}{144}$. While this is a substantially larger number in magnitude, intuitively, the scale transformation shouldn’t alter the dependence between $X_1$ and $X_2$: if $X_1, X_2$ are only weakly dependent, then $10X_1, 10X_2$ should remain weakly dependent. Correlation is a standardized covariance measure that is scale-invariant.

The correlation between $X_1, X_2$ is the covariance scaled by the product of the standard deviations:
$$\operatorname{corr}(X_1, X_2) = \frac{\operatorname{cov}(X_1, X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)}}$$
This measure is scale-invariant since covariance is bilinear and $\operatorname{var}(aX_1) = a^2 \operatorname{var}(X_1)$, so for any $a > 0$:
$$\operatorname{corr}(aX_1, X_2) = \frac{a\operatorname{cov}(X_1, X_2)}{\sqrt{a^2\operatorname{var}(X_1)\operatorname{var}(X_2)}} = \frac{\operatorname{cov}(X_1, X_2)}{\sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)}} = \operatorname{corr}(X_1, X_2)$$
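A quick empirical check of scale invariance, using an arbitrary simulated pair and scale factor (both chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary simulated dependent pair (for illustration only).
x1 = rng.normal(size=50_000)
x2 = -0.3 * x1 + rng.normal(size=50_000)

a = 10.0  # any positive scale factor
print(np.corrcoef(x1, x2)[0, 1])       # correlation of the original pair
print(np.corrcoef(a * x1, x2)[0, 1])   # essentially unchanged after rescaling X1
```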

Example: computing correlation

Continuing the previous example, the marginal variances are obtained by the following calculation:
$$E X_1^2 = \int_0^1 x_1^2\left(x_1 + \tfrac{1}{2}\right) dx_1 = \tfrac{5}{12}$$
$$\operatorname{var}(X_1) = E X_1^2 - (EX_1)^2 = \tfrac{5}{12} - \left(\tfrac{7}{12}\right)^2 = \tfrac{11}{144}$$
By symmetry, $\operatorname{var}(X_2) = \operatorname{var}(X_1) = \tfrac{11}{144}$.

Then, the correlation is:
$$\operatorname{corr}(X_1, X_2) = \frac{-\tfrac{1}{144}}{\sqrt{\tfrac{11}{144}\cdot\tfrac{11}{144}}} = \frac{-\tfrac{1}{144}}{\tfrac{11}{144}} = -\tfrac{1}{11}$$
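As a sanity check, this correlation can also be estimated by simulation. The sketch below draws from the density by rejection sampling (the sample size and seed are arbitrary) and compares the sample correlation with $-\tfrac{1}{11} \approx -0.091$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Rejection sampling from f(x1, x2) = x1 + x2 on (0,1)^2; the density is bounded above by 2.
n_prop = 500_000
cand = rng.uniform(size=(n_prop, 2))                  # proposals from the uniform density on the square
accept = rng.uniform(0, 2, size=n_prop) < cand.sum(axis=1)
xy = cand[accept]                                     # roughly half of the proposals are kept

est = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
print(est, -1 / 11)   # Monte Carlo estimate vs. the exact value of about -0.0909
```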

In addition to being scale-invariant, correlation is easier to interpret since it must be a number between $-1$ and $1$.

Lemma. Let $X_1, X_2$ be random variables with finite second moments. Then $-1 \leq \operatorname{corr}(X_1, X_2) \leq 1$.

Proof

Denote the correlation by $\rho = \operatorname{corr}(X_1, X_2)$, the means by $\mu_1, \mu_2$, and the variances by $\sigma_1^2, \sigma_2^2$. Note that $\operatorname{cov}(X_1, X_2) = \sigma_1 \sigma_2 \rho$.

Then consider the expectation of $\left[(X_1 - \mu_1) + t(X_2 - \mu_2)\right]^2$ as a polynomial in $t$. Since the squared quantity is nonnegative everywhere, expanding the square and taking expectations gives:
$$0 \leq E\left\{\left[(X_1 - \mu_1) + t(X_2 - \mu_2)\right]^2\right\} = \sigma_2^2\, t^2 + (2\sigma_1\sigma_2\rho)\, t + \sigma_1^2$$
A quadratic in $t$ that is nonnegative for every $t$ can have at most one real root, so its discriminant must be nonpositive. Therefore:
$$(2\sigma_1\sigma_2\rho)^2 - 4\sigma_1^2\sigma_2^2 \leq 0 \quad\Longrightarrow\quad \rho^2 \leq 1$$

This result establishes that the most extreme values a correlation can take are $-1$ and $1$; the smallest value in absolute terms is $0$. Thus, absolute values nearer to $1$ indicate stronger dependence, and absolute values nearer to zero indicate weaker dependence.

Exercise: contingency table

Consider the random vector defined by the joint distribution given in the table below:

|           | $X_1 = 0$ | $X_1 = 1$ |
|-----------|-----------|-----------|
| $X_2 = 0$ | 0.1       | 0.5       |
| $X_2 = 1$ | 0.3       | 0.1       |

First, consider whether you expect outcomes to be dependent, and if so, whether you expect a positive or negative covariance/correlation. Then compute the covariance and correlation.

Lastly, it is important to note that covariance and correlation do not capture every type of dependence, but rather only linear or approximately linear dependence. We will return to this later, but the classical counterexample is given below.

Perfectly dependent but uncorrelated

Let $U \sim \operatorname{uniform}(-1, 1)$, and define $X = U^2$. Then $EU = 0$, so:
$$\operatorname{cov}(U, X) = E(UX) - EU\, EX = E(U^3) = \frac{1}{2}\int_{-1}^{1} u^3\, du = 0$$
However, $X$ and $U$ are clearly dependent, because $X$ is a deterministic function of $U$.
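A short simulation makes this concrete (the sample size and seed are arbitrary): the sample covariance and correlation of $(U, X)$ are near zero, even though $X$ is determined exactly by $U$.

```python
import numpy as np

rng = np.random.default_rng(3)

u = rng.uniform(-1, 1, size=200_000)
x = u ** 2                      # X is a deterministic function of U, so they are dependent

print(np.cov(u, x)[0, 1])       # sample covariance is approximately 0
print(np.corrcoef(u, x)[0, 1])  # sample correlation is approximately 0
```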