Covariance and correlation
Covariance and correlation are measures of dependence between two random variables based on their joint distribution. They quantify the tendency of values of the random variables to vary together, or to “co-vary”. They are signed measures, with the sign indicating whether they tend to vary in opposite directions (negative sign) or the same direction (positive sign).
Covariance
If \(X_1, X_2\) are random variables then the covariance between them is defined as the expectation: \[ \text{cov}(X_1, X_2) = \mathbb{E}\left[(X_1 - \mathbb{E}X_1)(X_2 - \mathbb{E}X_2)\right] \] The expectation is computed from the joint distribution of \((X_1, X_2)\), so for instance if the random vector is discrete: \[ \text{cov}(X_1, X_2) = \sum_{x_1}\sum_{x_2} (x_1 - \mathbb{E}X_1)(x_2 - \mathbb{E}X_2)P(X_1 = x_1, X_2 = x_2) \] And if the random vector is continuous: \[ \text{cov}(X_1, X_2) = \int\int (x_1 - \mathbb{E}X_1)(x_2 - \mathbb{E}X_2)f(x_1, x_2) dx_1 dx_2 \] It is immediate that covariance is a symmetric operator, i.e., \(\text{cov}(X_1, X_2) = \text{cov}(X_2, X_1)\). Additionally, by expanding the product and applying linearity of expectation one obtains the covariance formula: \[ \text{cov}(X_1, X_2) = \mathbb{E}(X_1 X_2) - \mathbb{E}X_1\mathbb{E}X_2 \] This provides a convenient way to calculate covariances, much in the same way that the variance formula simplifies calculation of variances.
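As a quick numeric check, here is a minimal sketch confirming that the definition and the covariance formula agree on a small discrete example; the pmf values are made up purely for illustration.

```python
# A minimal sketch (hypothetical pmf): verify that the definition of covariance
# agrees with the shortcut formula E(X1 X2) - E(X1) E(X2).
import numpy as np

x1_vals = np.array([0.0, 1.0])
x2_vals = np.array([0.0, 1.0])
# pmf[i, j] = P(X1 = x1_vals[i], X2 = x2_vals[j]); values chosen arbitrarily
pmf = np.array([[0.2, 0.3],
                [0.4, 0.1]])

E1 = np.sum(x1_vals[:, None] * pmf)             # E[X1]
E2 = np.sum(x2_vals[None, :] * pmf)             # E[X2]
E12 = np.sum(np.outer(x1_vals, x2_vals) * pmf)  # E[X1 X2]

cov_def = np.sum(np.outer(x1_vals - E1, x2_vals - E2) * pmf)  # definition
cov_formula = E12 - E1 * E2                                   # shortcut formula

print(cov_def, cov_formula)  # both print the same number (-0.1 here)
```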
Linearity of expectation also entails that covariance is “bi-linear”, meaning it is linear in each argument: \[ \text{cov}(a X_1 + b, X_2) = a\text{cov}(X_1, X_2) + \text{cov}(b, X_2) \] It is easy to show, however, that \(\text{cov}(b, X_2) = 0\): \[ \text{cov}(b, X) = \mathbb{E}\left[(b - \mathbb{E}b)(X - \mathbb{E}X)\right] = \mathbb{E}[\underbrace{(b - b)}_{0}(X - \mathbb{E}X)] = 0 \] Intuitively, this makes sense, since constants don’t vary at all. Lastly, notice that \(\text{cov}(X, X) = \text{var}(X)\).
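These properties can also be illustrated empirically. The sketch below uses an arbitrarily chosen pair of dependent normal variables (an assumption made only for illustration) to show that shifting has no effect on covariance while scaling multiplies it by the constant.

```python
# A Monte Carlo sketch (arbitrary illustrative distributions) of
# cov(aX + b, Y) = a * cov(X, Y): the shift b drops out, the scale a carries through.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # Y depends on X, so cov(X, Y) != 0
a, b = 3.0, -2.0

cov_xy = np.cov(x, y)[0, 1]
cov_axb_y = np.cov(a * x + b, y)[0, 1]
print(cov_axb_y, a * cov_xy)       # approximately equal
```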
Use bilinearity of covariance to show that:
- \(\text{var}(c) = 0\) for any constant \(c\)
- \(\text{var}(aX + b) = a^2 \text{var}X\)
Let \((X_1, X_2)\) be a continuous random vector distributed on the unit square according to the density: \[ f(x_1, x_2) = x_1 + x_2 \;,\quad (x_1, x_2) \in (0, 1)\times (0, 1) \]
To find the covariance, one needs the expectations \(\mathbb{E}X_1X_2\), \(\mathbb{E}X_1\), \(\mathbb{E}X_2\). Marginally, \(X_1\) and \(X_2\) have the same distribution, so the calculation will be shown only for \(X_1\): \[ \begin{align*} f_1(x_1) &= \int_0^1 (x_1 + x_2)dx_2 = x_1 + \frac{1}{2}\;,\quad x_1 \in (0, 1) \\ \mathbb{E}X_1 &= \int_0^1 x_1\left(x_1 + \frac{1}{2}\right)dx_1 = \frac{7}{12} \\ \mathbb{E}X_2 &= \mathbb{E}X_1 = \frac{7}{12} \end{align*} \] Then, noting that the two integrals below are equal after relabeling the variables: \[ \begin{align*} \mathbb{E}X_1 X_2 &= \int_0^1\int_0^1 x_1 x_2 (x_1 + x_2)\, dx_1 dx_2 \\ &= \int_0^1\int_0^1 x_1^2 x_2\, dx_1 dx_2 + \int_0^1\int_0^1 x_1 x_2^2\, dx_1 dx_2 \\ &= 2\int_0^1\int_0^1 x_1^2 x_2\, dx_1 dx_2 \\ &= 2\int_0^1 \frac{1}{2}x_1^2\, dx_1 \\ &= \frac{1}{3} \end{align*} \]
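These integrals can be double-checked symbolically; the following sketch assumes sympy is available.

```python
# A symbolic check (sketch) of the integrals above for f(x1, x2) = x1 + x2
# on the unit square.
import sympy as sp

x1, x2 = sp.symbols("x1 x2", positive=True)
f = x1 + x2

E1 = sp.integrate(x1 * f, (x1, 0, 1), (x2, 0, 1))        # E[X1] = 7/12
E12 = sp.integrate(x1 * x2 * f, (x1, 0, 1), (x2, 0, 1))  # E[X1 X2] = 1/3
print(E1, E12)
```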
So: \[ \text{cov}(X_1, X_2) = \mathbb{E}X_1X_2 - \mathbb{E}X_1\mathbb{E}X_2 = \frac{1}{3} - \left(\frac{7}{12}\right)^2 = -\frac{1}{144} \]
Check your understanding
- What is \(\text{cov}(-X_1, X_2)\)?
- What is \(\text{cov}(X_2, X_1)\)?
- What is \(\text{cov}(3X_1 - 2, 5X_2 + 1)\)?
Correlation
Observe that shifting a random vector by a constant will not change the covariance, but scaling will. For example, continuing the example immediately above, by bilinearity one has that \(\text{cov}(10X_1, 10X_2) = -\frac{100}{144}\). While this is a substantially larger number, intuitively, the scale transformation shouldn’t alter the dependence between \(X_1, X_2\) — if \(X_1, X_2\) are only weakly dependent, then \(10X_1, 10X_2\) should remain weakly dependent. Correlation is a standardized covariance measure that is scale-invariant.
The correlation between \(X_1, X_2\) is the covariance scaled by the product of the standard deviations: \[ \text{corr}(X_1, X_2) = \frac{\text{cov}(X_1, X_2)}{\sqrt{\text{var}(X_1)\text{var}(X_2)}} \] This measure is scale invariant since covariance is bilinear and \(\text{var}(a X_1) = a^2\text{var}(X_1)\), so for \(a > 0\): \[ \text{corr}(aX_1, X_2) = \frac{a\text{cov}(X_1, X_2)}{\sqrt{a^2\text{var}(X_1)\text{var}(X_2)}} = \frac{\text{cov}(X_1, X_2)}{\sqrt{\text{var}(X_1)\text{var}(X_2)}} = \text{corr}(X_1, X_2) \] (For \(a < 0\), the sign of the correlation flips but its magnitude is unchanged.)
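A short simulation (with arbitrarily chosen distributions, purely for illustration) makes the contrast concrete: the covariance scales with \(a\) while the correlation does not.

```python
# A numeric sketch: rescaling X1 by a positive constant leaves the correlation
# unchanged, while the covariance scales with the constant.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x1 = rng.normal(size=n)
x2 = -0.3 * x1 + rng.normal(size=n)
a = 10.0

print(np.cov(x1, x2)[0, 1], np.cov(a * x1, x2)[0, 1])            # covariance scales by a
print(np.corrcoef(x1, x2)[0, 1], np.corrcoef(a * x1, x2)[0, 1])  # correlation unchanged
```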
Continuing the previous example, the marginal variances are obtained by the following calculation: \[ \begin{align*} \mathbb{E}X_1^2 = \int_0^1 x_1^2\left(x_1 + \frac{1}{2}\right)dx_1 &= \frac{5}{12} \\ \text{var}(X_1) = \mathbb{E}X_1^2 - \left(\mathbb{E}X_1\right)^2 &= \frac{11}{144} \end{align*} \] By symmetry, \(\text{var}(X_2) = \text{var}(X_1) = \frac{11}{144}\).
Then, the correlation is: \[ \text{corr}(X_1, X_2) = \frac{-\frac{1}{144}}{\sqrt{\frac{11}{144}}\sqrt{\frac{11}{144}}} = -\frac{1}{11} \]
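As a sanity check, one can sample from this density by rejection sampling (the density is bounded above by 2 on the unit square) and compare the sample correlation to \(-\frac{1}{11} \approx -0.09\); a minimal sketch:

```python
# A Monte Carlo sketch: draw from f(x1, x2) = x1 + x2 on the unit square by
# rejection sampling and check that the sample correlation is near -1/11.
import numpy as np

rng = np.random.default_rng(2)
n_prop = 2_000_000
x1 = rng.uniform(size=n_prop)
x2 = rng.uniform(size=n_prop)
u = rng.uniform(size=n_prop)
keep = u < (x1 + x2) / 2.0          # accept with probability f(x1, x2) / 2

print(np.corrcoef(x1[keep], x2[keep])[0, 1])  # approximately -0.0909
```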
In addition to being scale-invariant, correlation is easier to interpret since it must be a number between \(-1\) and \(1\).
Lemma. Let \(X_1, X_2\) be random variables with finite second moments. Then \(-1 \leq \text{corr}(X_1, X_2) \leq 1\).
Denote the correlation by \(\rho = \text{corr}(X_1, X_2)\), the means by \(\mu_1, \mu_2\), and the variances by \(\sigma_1^2, \sigma_2^2\). Note that \(\text{cov}(X_1, X_2) = \sigma_1\sigma_2\rho\).
Then consider the expression \(\left[(X_1 - \mu_1) + t(X_2 - \mu_2)\right]^2\) as a polynomial in \(t\). Since this expression is nonnegative everywhere, so is its expectation; expanding the square gives: \[ 0 \leq \mathbb{E}\left\{\left[(X_1 - \mu_1) + t(X_2 - \mu_2)\right]^2\right\} = (\sigma_2^2)t^2 + (2\sigma_1\sigma_2\rho) t + \sigma_1^2 \] A quadratic in \(t\) that is nonnegative everywhere can have at most one real root, so its discriminant is nonpositive. Therefore: \[ (2\sigma_1\sigma_2\rho)^2 - 4\sigma_1^2\sigma_2^2 \leq 0 \quad\Longleftrightarrow\quad \rho^2 \leq 1 \]
This result establishes that correlation always lies between \(-1\) and \(1\): its absolute value is at most \(1\), and the smallest possible absolute value is \(0\). Thus, absolute values nearer to \(1\) indicate stronger dependence, and absolute values nearer to zero indicate weaker dependence.
Consider the random vector defined by the joint distribution given in the table below:
|  | \(X_1 = 0\) | \(X_1 = 1\) |
|---|---|---|
| \(X_2 = 0\) | 0.1 | 0.5 |
| \(X_2 = 1\) | 0.3 | 0.1 |
First, consider whether you expect outcomes to be dependent, and if so, whether you expect a positive or negative covariance/correlation. Then compute the covariance and correlation.
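If you want to verify your answer numerically, here is a minimal sketch that computes the covariance and correlation directly from the joint pmf table (the array layout matches the table above, with rows indexed by \(X_2\) and columns by \(X_1\)).

```python
# A minimal sketch for checking the hand computation: covariance and
# correlation from a joint pmf table over {0, 1} x {0, 1}.
import numpy as np

x1_vals = np.array([0.0, 1.0])
x2_vals = np.array([0.0, 1.0])
# pmf[i, j] = P(X2 = x2_vals[i], X1 = x1_vals[j]), matching the table's layout
pmf = np.array([[0.1, 0.5],
                [0.3, 0.1]])

E1 = np.sum(x1_vals[None, :] * pmf)             # E[X1]
E2 = np.sum(x2_vals[:, None] * pmf)             # E[X2]
E12 = np.sum(np.outer(x2_vals, x1_vals) * pmf)  # E[X1 X2]
var1 = np.sum(x1_vals[None, :] ** 2 * pmf) - E1 ** 2
var2 = np.sum(x2_vals[:, None] ** 2 * pmf) - E2 ** 2

cov = E12 - E1 * E2
corr = cov / np.sqrt(var1 * var2)
print(cov, corr)
```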
Lastly, it is important to note that covariance and correlation do not capture every type of dependence, but rather only linear or approximately linear dependence. We will return to this later, but the classical counterexample is given below.
Let \(U \sim \text{uniform}(-1, 1)\), and define \(X = U^2\). Then \(\mathbb{E}U = 0\), so: \[ \text{cov}(U, X) = \mathbb{E}(UX) = \mathbb{E}U^3 = \frac{1}{2}\int_{-1}^1 u^3\, du = 0 \] However, obviously \(X, U\) are dependent because \(X\) is a deterministic function of \(U\).
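A quick simulation makes the point concrete: the sample correlation between \(U\) and \(U^2\) is essentially zero even though the dependence is perfect.

```python
# A numeric sketch of the counterexample: U uniform on (-1, 1) and X = U^2
# are clearly dependent, yet their sample correlation is near zero.
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(-1.0, 1.0, size=1_000_000)
x = u ** 2

print(np.corrcoef(u, x)[0, 1])  # approximately 0: correlation misses this nonlinear dependence
```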