Common probability distributions
Distributions based on Bernoulli trials
Several distributions arise from considering a sequence of independent so-called “Bernoulli trials”: random experiments with a binary outcome. While the coin toss is the prototypical example, a wide range of situations can be described as Bernoulli trials: clinical outcomes, device function, manufacturing defects, and so on. For the purposes of probability calculations, outcomes are encoded as 1, called a ‘success’, and 0, called a ‘failure’.
Bernoulli distribution. A random variable \(X\) has a Bernoulli distribution with parameter \(p\), written \(X \sim \text{Bernoulli}(p)\) if it has the PMF: \[ P(X = x) = p^x (1 - p)^{1 - x} \;,\qquad x = 0, 1 \;,\; p \in (0, 1) \] Note that this PMF simply assigns probability \(p\) to the outcome \(X = 1\) and probability \(1 - p\) to the outcome \(X = 0\).
Geometric distribution. Imagine now a sequence of independent Bernoulli trials with success probability \(p\), and define \(X\) to be the number of failures before the first success. Then the event \(X = k\) is equivalent to observing \(k\) failures followed by 1 success, in just that order. By independence, the associated probability is: \[ P(X = k) = (1 - p)^k p \;,\qquad k = 0, 1, 2, \dots \] This is a valid PMF. Any random variable with this PMF is said to have a geometric distribution with parameter \(p\), written \(X \sim \text{geometric}(p)\). The CDF can be written in closed form as \(F(x) = 1 - (1 - p)^{\lfloor x \rfloor + 1}\) for \(x \geq 0\), where \(\lfloor x\rfloor\) is the “floor” function: the largest integer less than or equal to \(x\). As an exercise, verify that the above is a PMF and derive the CDF.
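As a quick numerical check of this exercise, one can sum the PMF over a long range and compare the partial sums against the closed-form CDF. Below is a minimal sketch in Python, assuming numpy is available.

```python
import numpy as np

p = 0.3                 # any p in (0, 1) works for the check
k = np.arange(0, 200)   # truncate the infinite support; the tail mass is negligible
pmf = (1 - p) ** k * p  # geometric PMF: P(X = k) = (1 - p)^k p

print(pmf.sum())                         # ~1, so the mass function sums to one
print(np.cumsum(pmf)[:5])                # partial sums P(X <= k) for k = 0, ..., 4
print(1 - (1 - p) ** (np.arange(1, 6)))  # closed-form CDF F(k) = 1 - (1 - p)^(k + 1)
```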
Binomial distribution. Let \(X\) now record the number of successes in \(n\) independent Bernoulli trials. The set of all possible outcomes for \(n\) trials is \(S = \{0, 1\}^n\), but since not all outcomes are equally likely unless \(p = \frac{1}{2}\), the probability of each outcome \(s \in S\) in general can be obtained from independence as: \[ P(s) = p^{\sum_{i = 1}^n s_i} (1 - p)^{n - \sum_{i = 1}^n s_i} \] Then, the probability of \(k\) successes is the sum of the probabilities of outcomes with \(k\) 1’s: \[ P(X = k) = \sum_{s \in S: \sum_i s_i = k} P(s) = \sum_{s \in S: \sum_i s_i = k} p^k (1 - p)^{n - k} = {n \choose k} p^k (1 - p)^{n - k} \;,\qquad k = 0, 1, 2, \dots, n \] There are \({n \choose k}\) terms in the sum since that is the number of ways to allocate the \(k\) successes to the \(n\) positions. A random variable with this PMF is said to have a binomial distribution with parameters \(n, p\), written \(X \sim \text{binomial}(n, p)\), or simply \(X \sim b(n, p)\). There is no closed-form CDF.
The binomial PMF is clearly non-negative for any \(k\) in the support set \(\{0, 1, \dots, n\}\), so it suffices to check whether the PMF sums to one. For this we use the binomial theorem: \[ \sum_{k = 0}^n P(X = k) = \sum_{k = 0}^n {n \choose k} p^k (1 - p)^{n - k} = (p + 1 - p)^n = 1 \]
Negative binomial. If now \(X\) records the number of trials until \(r\) successes are observed, then, by reasoning analogous to the binomial case, the probability of \(n\) trials is obtained by summing, over all ways to allocate the successes other than the last among the first \(n - 1\) trials, the probability of \(r\) successes and \(n - r\) failures: \[ P(X = n) = {n - 1 \choose r - 1}p^r (1 - p)^{n - r} \;,\qquad n = r, r + 1, r + 2, \dots \]
There is an alternate form of the negative binomial that arises from considering the number of failures, akin to the geometric distribution, rather than the number of trials. If \(Y\) records the number of failures before \(r\) successes, then \(Y = X - r\), so with \(k = n - r\), the PMF above becomes \[ P(Y = k) = P(X - r = k) = P(X = k + r) = {k + r - 1 \choose k} p^r (1 - p)^k \;,\qquad k = 0, 1, 2, \dots \] To show that this is a valid PMF, use the second form and write: \[ \begin{align*} {k + r - 1 \choose k} &= \frac{(r + k - 1)(r + k - 2) \cdots (r + 1)(r)}{k!} \\ &= (-1)^k \frac{(-r)(-r - 1)\cdots (-r - k + 2) (-r - k + 1)}{k!} \\ &= (-1)^k {-r \choose k} \end{align*} \] Technically, \({-r\choose k}\) is a generalized binomial coefficient. While the above should make this notation plausible, we won’t give a full treatment here; however, the binomial series with generalized coefficients, \(\sum_{k = 0}^\infty {\alpha \choose k} t^k = (1 + t)^\alpha\), converges for \(|t| < 1\), which holds here with \(t = p - 1\) and \(\alpha = -r\). Applying this result, one obtains: \[ \sum_{k = 0}^\infty {k + r - 1\choose k} (1 - p)^k = \sum_{k = 0}^\infty {-r \choose k} (p - 1)^k = (1 + p - 1)^{-r} = p^{-r} \] From which it follows that: \[ \sum_{k = 0}^\infty P(Y = k) = p^r \sum_{k = 0}^\infty {k + r - 1\choose k} (1 - p)^k = p^r p^{-r} = 1 \]
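The normalization argument can also be checked numerically. Below is a minimal sketch (assuming numpy and scipy are available) that sums a truncated version of the failure-count PMF and compares it against scipy's `nbinom`, which uses the same failure-count parametrization.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import nbinom

r, p = 4, 0.3
k = np.arange(0, 500)                         # truncate; the tail mass is negligible here
pmf = comb(k + r - 1, k) * p**r * (1 - p)**k  # P(Y = k): failures before the r-th success

print(pmf.sum())                              # ~1
print(np.allclose(pmf, nbinom.pmf(k, r, p)))  # scipy's nbinom also counts failures
```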
About 36% of people in the US have blood type A positive. Consider a blood drive in which donors participate independently of blood type and are representative of the general population. If \(X\) records whether an arbitrary donor is of blood type \(A^+\), a reasonable model is \(X \sim \text{Bernoulli}(p = 0.36)\).
The following questions can be answered using distributions based on Bernoulli trials.
- What is the probability that exactly 5 donors are seen before the first \(A^+\) donor (i.e., the sixth donor is the first with type \(A^+\))?
- What is the probability that more than 5 donors have blood drawn before an \(A^+\) donor has blood drawn?
- What is the probability that of the first 20 donors, 10 are \(A^+\)?
- What is the probability that it takes 30 donors to obtain 10 \(A^+\) samples?
The answers are as follows:
(i) Consider \(X\) to be the number of donors before the first \(A^+\) donor; then \(X \sim \text{geometric}(p = 0.36)\). So \(P(X = 5) = (1 - p)^5 p = 0.64^5 \cdot 0.36\approx 0.0387\).
(ii) Let \(X\) remain as in (i). Then \(P(X > 5) = 1 - P(X \leq 5) = (1 - p)^{5 + 1} = 0.64^6 \approx 0.0687\).
(iii) Now let \(X\) record the number of \(A^+\) donors out of the first 20. Then \(X \sim b(n = 20, p = 0.36)\), so \(P(X = 10) = {20\choose 10} \cdot 0.36^{10} \cdot 0.64^{10} \approx 0.0779\).
(iv) Now let \(X\) record the number of donors needed to obtain 10 \(A^+\) samples, using the first (trial-count) parametrization of the negative binomial. Then \(X \sim nb(r = 10, p = 0.36)\) and \(P(X = 30) = {30-1 \choose 10-1} \cdot 0.36^{10} \cdot 0.64^{20} \approx 0.0487\).
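These four answers can be reproduced with `scipy.stats` (a sketch, not part of the derivations; note that scipy's `geom` counts trials rather than failures, while its `nbinom` counts failures rather than trials, so the arguments are shifted accordingly).

```python
from scipy.stats import geom, binom, nbinom

p = 0.36

# (i) exactly 5 non-A+ donors before the first A+ donor
#     scipy's geom counts trials, so 5 failures <-> the 6th trial is the first success
print(geom.pmf(6, p))         # ~0.0387

# (ii) more than 5 failures before the first success <-> first success after trial 6
print(geom.sf(6, p))          # ~0.0687

# (iii) 10 A+ donors among the first 20
print(binom.pmf(10, 20, p))   # ~0.0779

# (iv) 30 trials to reach 10 successes <-> 20 failures before the 10th success
print(nbinom.pmf(20, 10, p))  # ~0.0487
```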
Multinomial distribution. The multinomial generalizes the binomial to trials with \(k > 2\) outcomes. If \(p_1, \dots, p_k\) denote the probabilities of each of \(k\) outcomes for a single trial, and \(X_1, \dots, X_k\) count the number of each outcome observed in \(n\) independent trials, then: \[ P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k} \;,\qquad \sum_i x_i = n\;,\; \sum_i p_i = 1 \] Note that only \(k - 1\) of the counts are free, since \(X_k = n - \sum_{i = 1}^{k - 1} X_i\) and \(p_k = 1 - \sum_{i = 1}^{k - 1} p_i\); the distribution is effectively \((k - 1)\)-dimensional.
The percentages of the US population with each of the 8 blood types are given below.
Type | Frequency |
---|---|
\(O^+\) | 37.4% |
\(O^-\) | 6.6% |
\(A^+\) | 35.7% |
\(A^-\) | 6.3% |
\(B^+\) | 8.5% |
\(B^-\) | 1.5% |
\(AB^+\) | 3.4% |
\(AB^-\) | 0.6% |
What is the probability of observing 4 \(O^+\), 4 \(A^+\), 1 \(B^+\), and 1 \(AB^+\) donors among 10 total donors? This can be found using the multinomial PMF as: \[ \frac{10!}{4!\,0!\,4!\,0!\,1!\,0!\,1!\,0!} \cdot 0.374^4 \cdot 0.066^0 \cdot 0.357^4 \cdot 0.063^0 \cdot 0.085^1 \cdot 0.015^0 \cdot 0.034^1 \cdot 0.006^0 \approx 0.00579 \]
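The same value can be obtained from `scipy.stats.multinomial` (a sketch, assuming scipy is available; the category order matches the table, and the probabilities sum to 1).

```python
from scipy.stats import multinomial

probs  = [0.374, 0.066, 0.357, 0.063, 0.085, 0.015, 0.034, 0.006]  # O+, O-, A+, A-, B+, B-, AB+, AB-
counts = [4, 0, 4, 0, 1, 0, 1, 0]                                  # must sum to n = 10

print(multinomial.pmf(counts, n=10, p=probs))  # ~0.00579
```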
The above PMFs convey the distribution of probabilities across outcomes, but what are “typical” values that one is likely to observe for these random variables? There are several so-called measures of center: the mode, the value with the largest mass or density; the median, the ‘middle’ value with equal probability above and below; and the mean, the average of the values in the support weighted by mass or density.
The mean is known, formally, as the expected value or simply the ‘expectation’ of a random variable, and defined to be: \[ \mathbb{E}X = \begin{cases} \sum_{x \in \mathbb{R}} x P(X = x), &X \text{ is discrete} \\ \int_\mathbb{R} x f(x) dx, &X \text{ is continuous} \end{cases} \] The expectation exists when \(X\) is absolutely summable/integrable, i.e., when \(\sum_{x \in \mathbb{R}} |x|P(X = x) < \infty\) in the discrete case or \(\int_\mathbb{R} |x| f(x) dx < \infty\) in the continuous case.
Show that:
- If \(X \sim \text{Bernoulli}(p)\) then \(\mathbb{E}X = p\)
- If \(X \sim \text{binomial}(n, p)\) then \(\mathbb{E}X = np\)
- If \(X \sim \text{geometric}(p)\) then \(\mathbb{E}X = \frac{1 - p}{p}\)
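A numerical sanity check of these three expectations, as a sketch in Python (the geometric sum is truncated, which introduces only a negligible error):

```python
import numpy as np
from scipy.special import comb

p, n = 0.3, 12

# Bernoulli: E X = 0*(1 - p) + 1*p
print(0 * (1 - p) + 1 * p, p)

# Binomial: E X = sum_k k * C(n, k) p^k (1 - p)^(n - k)
k = np.arange(0, n + 1)
print(np.sum(k * comb(n, k) * p**k * (1 - p)**(n - k)), n * p)

# Geometric (failures before the first success): E X = sum_k k (1 - p)^k p
k = np.arange(0, 1000)   # truncated sum
print(np.sum(k * (1 - p)**k * p), (1 - p) / p)
```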
Poisson distribution. Consider a binomial probability \({n \choose x}p^x (1 - p)^{n - x}\). If the expectation \(np\) is held constant at \(\lambda\) while \(n \rightarrow \infty\), the probability tends to: \[ \lim_{n \rightarrow\infty} {n \choose x}p^x (1 - p)^{n - x} = \lim_{n \rightarrow \infty} \frac{n!}{(n - x)!n^x} \cdot \frac{\lambda^x}{x!} \cdot \left(1 - \frac{\lambda}{n}\right)^n \cdot \left(1 - \frac{\lambda}{n}\right)^{-x} = \frac{\lambda^x e^{-\lambda}}{x!} \] This is a PMF for \(x = 0, 1, 2, \dots\) and \(\lambda > 0\), since it is obviously nonnegative and: \[ \sum_{x = 0}^\infty \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x = 0}^\infty \frac{\lambda^x}{x!} = e^{-\lambda}e^\lambda = 1 \] If a random variable \(X\) has this PMF, then \(X \sim \text{Poisson}(\lambda)\). Think of this distribution as pertaining to the outcome of infinitely many Bernoulli trials with a finite expected total number of successes. For a Poisson random variable, \(\mathbb{E}X = \lambda\): \[ \mathbb{E}X = \sum_{x = 0}^\infty x \frac{\lambda^x e^{-\lambda}}{x!} = \sum_{x = 1}^\infty \frac{\lambda^x e^{-\lambda}}{(x - 1)!} = \lambda\sum_{x = 1}^\infty \frac{\lambda^{x - 1} e^{-\lambda}}{(x - 1)!} = \lambda\sum_{x = 0}^\infty \frac{\lambda^{x} e^{-\lambda}}{x!} = \lambda \]
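The limiting argument can be seen numerically: holding \(np = \lambda\) fixed, the binomial PMF approaches the Poisson PMF as \(n\) grows. A brief sketch using scipy:

```python
import numpy as np
from scipy.stats import binom, poisson

lam, x = 3.0, np.arange(0, 15)

for n in (10, 100, 1000, 10000):
    # maximum absolute difference between binomial(n, lam/n) and Poisson(lam) PMFs
    err = np.max(np.abs(binom.pmf(x, n, lam / n) - poisson.pmf(x, lam)))
    print(n, err)   # shrinks toward 0 as n grows
```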
The Poisson distribution is often used to model count data. Suppose you’re recording the number of visits to your website each day for a year, and the average number of daily visits is 32.7, so you decide to model the random variable \(X\), which records the number of daily visits, as Poisson with parameter \(\lambda = 32.7\).
- Find the following probabilities according to your model: \(P(X = 0), P(X \leq 20), P(X > 40), P(X > 100)\).
- If you assume visits on each day are independent, how many days would you expect to observe no visits in a year? Under 20 visits? Over 40 visits? Over 100 visits?
- If your year of data shows 30 days with over 100 visits, do you think the Poisson is a good model?
We’ll work through the solution in class.
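For checking your work afterwards, the model probabilities (and the corresponding expected day counts) can be computed with scipy; a sketch, where `sf(k)` returns \(P(X > k)\):

```python
from scipy.stats import poisson

lam = 32.7

probs = {
    "P(X = 0)":   poisson.pmf(0, lam),
    "P(X <= 20)": poisson.cdf(20, lam),
    "P(X > 40)":  poisson.sf(40, lam),
    "P(X > 100)": poisson.sf(100, lam),
}

for label, pr in probs.items():
    # expected number of days per year on which the event occurs, assuming independent days
    print(label, pr, 365 * pr)
```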
Basic continuous distributions
Here we’ll look at a few elementary continuous distributions.
Uniform distribution. The continuous uniform distribution corresponds to drawing a real number at random from an interval \((a, b)\): the density is constant on the specified interval and zero elsewhere. \(X \sim \text{uniform}(a, b)\) if \(X\) has the PDF: \[ f(x) = \frac{1}{b - a} \;,\qquad x \in (a, b) \;,\; -\infty < a < b < \infty \] Including one or both endpoints in the interval does not change the probabilities of any events; even though this will produce different densities, the CDF will be the same. The CDF in either case is: \[ F(x) = \int_{-\infty}^x f(z)\,dz = \begin{cases} 0 &,\; x \leq a \\ \frac{x - a}{b - a} &,\; x \in (a, b) \\ 1 &,\; x \geq b \end{cases} \]
Another way of looking at the matter of endpoints is that since \(P(X = x) = 0\) for every \(x \in \mathbb{R}\), countably many points may be ‘removed’ without fundamentally altering the probability distribution. At any rate, for some problems, it makes more sense contextually to specify a uniform distribution on a closed interval, so you may occasionally see \(X \sim \text{uniform}[a, b]\), though this is less common.
The expectation of a uniform random variable is the midpoint of the interval: \[ \int_\mathbb{R} x f(x) dx = \frac{1}{b - a}\int_a^b x dx = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{(b + a)(b - a)}{2(b - a)} = \frac{b + a}{2} \]
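A quick Monte Carlo check of the midpoint formula (a sketch; with many draws the sample mean should be close to \((a + b)/2\)):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 7.0

draws = rng.uniform(a, b, size=1_000_000)   # simulate uniform(a, b) draws
print(draws.mean(), (a + b) / 2)            # the two values should nearly agree
```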
Exponential distribution. The exponential distribution is given by the PDF: \[ f(x) = \alpha e^{-\alpha x} \;,\qquad x > 0 \;,\; \alpha > 0 \]
The CDF is \(F(x) = 1 - e^{-\alpha x}\), and the mean is: \[ \begin{align*} \mathbb{E}X &= \int_\mathbb{R} x f(x) dx \\ &= \int_0^\infty x \alpha e^{-\alpha x}dx \\ &= \alpha \int_0^\infty \left(-\frac{d}{d\alpha} e^{-\alpha x}\right)dx \\ &= \alpha \frac{d}{d\alpha}\left[\frac{e^{-\alpha x}}{\alpha} \right]_0^\infty \\ &= \alpha \frac{d}{d\alpha}\left(0 - \frac{1}{\alpha}\right) \\ &= \alpha \cdot \frac{1}{\alpha^2} \\ &= \frac{1}{\alpha} \end{align*} \]
The exponential distribution is often used to model waiting times and failure times.
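In scipy the exponential is parametrized by a scale equal to \(1/\alpha\) rather than by the rate \(\alpha\); a minimal sketch checking the CDF and mean under that convention:

```python
import numpy as np
from scipy.stats import expon

alpha = 2.5
X = expon(scale=1 / alpha)   # scipy uses scale = 1/alpha rather than the rate alpha

x = np.linspace(0.1, 5, 5)
print(np.allclose(X.cdf(x), 1 - np.exp(-alpha * x)))  # CDF: F(x) = 1 - e^(-alpha x)
print(X.mean(), 1 / alpha)                            # mean 1/alpha
```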
The Gaussian distribution
The Gaussian or normal distribution is of central importance in statistical inference, and arises in relation to averages. Here we’ll develop the density in the standard case (no parameters) and then introduce the center and scale parameters through a simple linear transformation. We’ll start with an important calculus result.
Theorem (Gaussian integral). \(\int_\mathbb{R} e^{-z^2}dz = \sqrt\pi\).
Let \(I\) denote the integral of interest. Then: \[ I^2 = \left(\int_\mathbb{R} e^{-z^2}dz\right)\left(\int_\mathbb{R} e^{-w^2}dw\right) = \int_\mathbb{R}\int_\mathbb{R} e^{-(z^2 + w^2)}dzdw \] Now converting to polar coordinates — that is, applying the transformation \(z = r\cos\theta\) and \(w = r\sin\theta\) and using the result from calculus that \(dzdw = rdrd\theta\) — one obtains: \[ I^2 = \int_0^{2\pi}\int_0^\infty e^{-r^2}rdrd\theta = \int_0^{2\pi} \left(\left[-\frac{1}{2}e^{-r^2}\right]^\infty_0\right)d\theta = \int_0^{2\pi} \frac{1}{2}d\theta = \pi \] Taking square roots yields \(I = \sqrt\pi\) as required.
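The Gaussian integral can also be confirmed by numerical quadrature (a sketch using `scipy.integrate.quad` over the whole real line):

```python
import numpy as np
from scipy.integrate import quad

# integrate e^{-z^2} over (-inf, inf); quad returns (value, estimated absolute error)
value, abserr = quad(lambda z: np.exp(-z**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))   # both ~1.7724538509
```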
Gaussian distribution. The standard Gaussian or normal distribution is given by the PDF \(\varphi (x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\) for \(x \in \mathbb{R}\). There is no closed form for the CDF, but we write the CDF as \(\Phi(x) = \int_{-\infty}^x \varphi(z)dz\). If a random variable \(Z\) has this PDF/CDF, write \(Z \sim N(0, 1)\).
As a remark, special notation is used for the PDF and CDF of the standard normal/Gaussian because it is used so frequently. \(\varphi\) and \(\Phi\) are typical. \(\varphi\) is a valid PDF since it is evidently non-negative and: \[ \begin{align*} \int_\mathbb{R} \varphi(x)dx &= \frac{1}{\sqrt{2\pi}} \int_\mathbb{R} e^{-\frac{x^2}{2}}dx \\ &= \frac{1}{\sqrt{2\pi}} \int_\mathbb{R} e^{-z^2}\sqrt{2}dz \qquad \left(z^2 = \frac{x^2}{2} \Rightarrow dz = \frac{dx}{\sqrt{2}}\right) \\ &= \sqrt{\frac{2}{2\pi}} \int_\mathbb{R} e^{-z^2}dz \\ &= \sqrt{\frac{2}{2\pi}} \cdot\sqrt{\pi} \\ &= 1 \end{align*} \]
If \(Z \sim N(0, 1)\) then the expectation of \(Z\) is: \[ \begin{align*} \mathbb{E}Z &= \int_\mathbb{R} x\varphi(x)dx \\ &= \int_{-\infty}^0 x\varphi(x)dx + \int_0^\infty x\varphi(x)dx \\ &= -\int_0^\infty x\varphi(-x)dx + \int_0^\infty x\varphi(x)dx \qquad (\text{substituting } x \mapsto -x \text{ in the first integral}) \\ &= -\int_0^\infty x\varphi(x)dx + \int_0^\infty x\varphi(x)dx \\ &= 0 \end{align*} \]
This argument leverages the fact, immediate from the functional form of \(\varphi\), that \(\varphi(x) = \varphi(-x)\), i.e., \(\varphi\) is an even function.
Now let \(Z \sim N(0, 1)\) and define \(X = \sigma Z + \mu\) for arbitrary numbers \(\sigma > 0\) and \(\mu \in \mathbb{R}\). \(\sigma\) is referred to as a scale parameter and \(\mu\) is referred to as a location parameter. The CDF of \(X\) is: \[ F_X (x) = P(X \leq x) = P(\sigma Z + \mu \leq x) = P\left(Z \leq \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right) \] It follows that the PDF is: \[ f_X (x) = \frac{d}{dx} \Phi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma}\varphi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(x - \mu\right)^2\right\} \] We write \(X \sim N(\mu, \sigma^2)\) to indicate that \(X\) has a Gaussian distribution with parameters \(\mu\) and \(\sigma\). Furthermore, \(X\) has expectation: \[ \begin{align*} \mathbb{E}X &= \int_\mathbb{R} x\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}\left(x - \mu\right)^2\right\}dx \\ &= \int_\mathbb{R} (\sigma z + \mu)\cdot\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}z^2\right\}dz \qquad \left(z = \frac{x - \mu}{\sigma} \Rightarrow dz = \frac{1}{\sigma}dx\right) \\ &= \sigma\int_\mathbb{R} z \cdot\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}z^2\right\}dz + \mu\int_\mathbb{R} \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}z^2\right\}dz \\ &= \sigma\mathbb{E}Z + \mu \int_\mathbb{R}\varphi(z)dz \\ &= \mu \end{align*} \]
Notice that \(\mathbb{E}X = \mathbb{E}\left(\sigma Z + \mu\right) = \sigma\mathbb{E}Z + \mu\).
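This location-scale relationship is also how scipy parametrizes the normal distribution: `norm(loc=mu, scale=sigma)` corresponds to evaluating \(\Phi\left(\frac{x - \mu}{\sigma}\right)\). A brief sketch:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x = np.linspace(-5, 5, 7)

# CDF via standardization: F_X(x) = Phi((x - mu)/sigma)
print(np.allclose(norm.cdf(x, loc=mu, scale=sigma), norm.cdf((x - mu) / sigma)))
# PDF picks up the 1/sigma factor: f_X(x) = phi((x - mu)/sigma)/sigma
print(np.allclose(norm.pdf(x, loc=mu, scale=sigma), norm.pdf((x - mu) / sigma) / sigma))
```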
More expectations
Theorem. If \(X\) is a random variable and \(g\) is a function, then the expectation of \(Y = g(X)\), provided it exists, is: \[ \mathbb{E}Y = \mathbb{E}g(X) = \begin{cases} \sum_{x \in \mathbb{R}} g(x) P(X = x) &,\;X \text{ is discrete}\\ \int_\mathbb{R} g(x)f(x)dx &,\;X \text{ is continuous}\\ \end{cases} \]
The proof requires some advanced techniques, so we’ll skip it. Hogg & Craig provide a sketch in the discrete case and otherwise defer to references. Some texts simply state this as a definition.
Corollaries. If \(X\) is a random variable, \(a, b, c\) are constants, and \(g_1, g_2\) are functions whose expectations exist, then the following statements are true.
- \(\mathbb{E}\left(ag_1(X) + bg_2(X) + c\right) = a\mathbb{E}g_1(X) + b\mathbb{E}g_2(X) + c\).
- If \(g_1(x) \geq 0\) for all \(x\) in the support of \(X\), then \(\mathbb{E}g_1(X) \geq 0\).
- If \(g_1(x) \geq g_2(x)\) for all \(x\) in the support of \(X\), then \(\mathbb{E}g_1(X) \geq \mathbb{E}g_2(X)\).
- If \(a \leq g_1(x) \leq b\) for all \(x\) in the support of \(X\), then \(a \leq \mathbb{E}g_1(X) \leq b\).
Notice that (ii) and (iv) are special cases of (iii), so it suffices to prove (i) and (iii). Essentially, these properties follow directly from properties of integration/summation.
For (i), in the discrete case:
\[ \begin{align*} \mathbb{E}\left(ag_1(X) + bg_2(X) + c\right) &= \sum_x \left(ag_1(x) + bg_2(x) + c\right) P(X = x) \\ &= a\sum_x g_1(x)P(X = x) + b\sum_x g_2(x)P(X = x) + c\sum_x P(X = x) \\ &= a\mathbb{E}g_1(X) + b\mathbb{E}g_2(X) + c \end{align*} \]
In the continuous case, the argument is essentially the same:
\[ \begin{align*} \mathbb{E}\left(ag_1(X) + bg_2(X) + c\right) &= \int_\mathbb{R} \left(ag_1(x) + bg_2(x) + c\right) f(x)dx \\ &= a\int_\mathbb{R} g_1(x)f(x)dx + b\int_\mathbb{R} g_2(x)f(x)dx + c\int_\mathbb{R}f(x)dx \\ &= a\mathbb{E}g_1(X) + b\mathbb{E}g_2(X) + c \end{align*} \]
For (iii), let \(g_2(x) \leq g_1(x)\) for every \(x \in \mathbb{R}\). Then, in the discrete case:
\[ \mathbb{E}g_2(X) = \sum_x g_2(x) P(X = x) \leq \sum_x g_1(x) P(X = x) = \mathbb{E}g_1(X) \]
In the continuous case:
\[ \mathbb{E}g_2(X) = \int_\mathbb{R} g_2(x) f(x) dx \leq \int_\mathbb{R} g_1(x) f(x) dx = \mathbb{E}g_1(X) \]
To obtain (ii), set \(g_2(x) \equiv 0\). Then if \(0 \leq g_1(x)\) for every \(x\in\mathbb{R}\), one has \(g_2 (x) \leq g_1(x)\) and so \(\mathbb{E}g_2(X) \leq \mathbb{E}g_1(X)\). But \(\mathbb{E}g_2(X) = 0\), so \(0 \leq \mathbb{E}g_1(X)\).
To obtain (iv), set \(g_2(x) \equiv a\) and consider a third function \(g_3(x) \equiv b\). Then the expectations exist, with \(\mathbb{E}g_2(X) = a\) and \(\mathbb{E}g_3(X) = b\). Applying (iii) twice, if \(g_2(x) \leq g_1(x) \leq g_3(x)\) for all \(x \in \mathbb{R}\), then the expectations are similarly ordered, and thus \(a \leq \mathbb{E}g_1(X) \leq b\).
These corollaries can often ease calculations. For example, it is immediate that for any random variable whose expectation exists, \(\mathbb{E}X^2 \geq 0\). Similarly, scaling and shifting a random variable scales and shifts the mean: \(\mathbb{E}(aX + b) = a\mathbb{E}X + b\).
The variance of a random variable is defined as the expectation: \[ \text{var}X = \mathbb{E}\left(X - \mathbb{E}X\right)^2 \]
If \(X \sim \text{Bernoulli}(p)\) then, noting that from above \(\mathbb{E}X = p\), the variance of \(X\) is: \[ \begin{align*} \text{var}X &= \mathbb{E}(X - \mathbb{E}X)^2 \\ &= \mathbb{E}(X - p)^2 \\ &= (1 - p)^2 P(X = 1) + (0 - p)^2 P(X = 0) \\ &= (1 - p)^2 p + p^2 (1 - p) \\ &= (1 - p)\left((1 - p)p + p^2\right) \\ &= (1 - p)\left(p - p^2 + p^2\right) \\ &= (1 - p)p \end{align*} \]
While one could calculate the variance directly, as in the example above, this is often cumbersome in more complex cases. Instead, expand \((X - \mathbb{E}X)^2 = X^2 - 2X\mathbb{E}X + (\mathbb{E}X)^2\) and apply the corollaries, noting that \(\mathbb{E}X\) is a constant, to obtain the variance shortcut formula: \[ \text{var}X = \mathbb{E}\left(X^2 - 2X\mathbb{E}X + (\mathbb{E}X)^2\right) = \mathbb{E}X^2 - 2\left(\mathbb{E}X\right)^2 + \left(\mathbb{E}X\right)^2 = \mathbb{E}X^2 - \left(\mathbb{E}X\right)^2 \] Thus, to calculate the variance of a random variable, one only needs \(\mathbb{E}X\) and \(\mathbb{E}X^2\).
If \(X \sim \text{Poisson}(\lambda)\), then from before \(\mathbb{E}X = \lambda\), and: \[ \begin{align*} \mathbb{E}X^2 &= \sum_{x = 0}^\infty x^2 \frac{\lambda^x e^{-\lambda}}{x!} \\ &= \lambda \sum_{x = 1}^\infty x \frac{\lambda^{x-1} e^{-\lambda}}{(x - 1)!} \\ &= \lambda \sum_{x = 0}^\infty (x + 1) \frac{\lambda^{x} e^{-\lambda}}{x!} \\ &= \lambda \left[\sum_{x = 0}^\infty x \frac{\lambda^{x} e^{-\lambda}}{x!} + \sum_{x = 0}^\infty \frac{\lambda^{x} e^{-\lambda}}{x!} \right] \\ &= \lambda(\lambda + 1) \end{align*} \]
Then, by the variance formula: \[ \text{var}X = \mathbb{E}X^2 - \left(\mathbb{E}X\right)^2 = \lambda(\lambda + 1) - \lambda^2 = \lambda^2 + \lambda - \lambda^2 = \lambda \]
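Both moments can be checked numerically by truncating the Poisson sums; the last line applies the shortcut formula \(\text{var}X = \mathbb{E}X^2 - (\mathbb{E}X)^2\). A sketch:

```python
import numpy as np
from scipy.stats import poisson

lam = 4.2
x = np.arange(0, 200)        # truncate; the tail mass is negligible
pmf = poisson.pmf(x, lam)

EX  = np.sum(x * pmf)        # ~lambda
EX2 = np.sum(x**2 * pmf)     # ~lambda*(lambda + 1)
print(EX, EX2, EX2 - EX**2)  # variance ~lambda
```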