Module 5


Matias Salibian Barrera

Last modified — 24 Oct 2025

Joint distribution of several random variables

  • Suppose \(X_1\) and \(X_2\) are two random variables (defined on the same sample space \(\Omega\)).

  • We can study them separately (e.g. their CDF’s \(F_{X_1}(a)\) and \(F_{X_2}(t)\), their expected values \(\mathbb{E}[X_1]\), etc.)

  • But if we study them jointly we may be able to explore their relationships, if there is any (e.g. we may be able to use one of them to predict the other, etc.)

  • The joint CDF of the random vector \((X_1, X_2)\) is \[F_{(X_1, X_2)} ( a, b ) = \mathbb{P}( X_1 \le a, \, \ X_2 \le b)\]

    The set \(\left\{ X_1 \le x_1, \ X_2 \le x_2 \right\}\) is \[\left\{ X_1 \le x_1 \, , \ X_2 \le x_2 \right\} = \left\{ X_1 \le x_1 \right\} \cap \left\{ X_2 \le x_2 \right\}\]

Properties of the joint distributions

  • \(F_{(X_1, X_2)} ( s, t)\) is a non-decreasing function of \(s\) and \(t\).

  • Moreover \[ \lim_{a \to -\infty} F_{(X_1, X_2)}(a, x_2) = 0 \quad \forall x_2 \in \mathbb{R}\] and \[ \lim_{b \to -\infty} F_{(X_1, X_2)}(x_1, b) = 0 \quad \forall x_1 \in \mathbb{R}\]

  • Also \[ \lim_{a \to \infty, \ b \to \infty} F_{(X_1, X_2)}(a, b) = 1 \]

Joint PMF (discrete random vectors)

  • Suppose \(X_1\) and \(X_2\) are both discrete with ranges being \({\mathcal R}_1\) and \({\mathcal R}_2\).

  • The joint range of \({\mathbf X} = (X_1, X_2)^\mathsf{T}\) is then (a subset of) \({\mathcal R}_1 \times {\mathcal R}_2\) (their Cartesian product), which will also be finite or countable.

  • The joint PMF of \(X_1\) and \(X_2\) is \[f_{(X_1, X_2)} ( k_1, k_2) = \mathbb{P}( X_1 = k_1 \,, \ X_2 = k_2)\]

Marginal CDFs

  • Recall that the CDF of \(X_1\) is \[F_{X_1}(x) = \mathbb{P}(X_1 \leq x)\] Since \(\mathbb{P}( X_2 < \infty) = 1\) we have \(\mathbb{P}(X_1 \leq x) = \mathbb{P}(X_1\leq x, X_2 < \infty)\) (prove it!) and thus \[F_{X_1}(x) = \mathbb{P}(X_1\leq x, X_2 < \infty) = \lim_{a \to \infty} F_{(X_1, X_2)}(x, a)\] Similarly, \(F_{X_2}(x) = \lim_{a \to \infty} F_{(X_1, X_2)}(a, x)\).

  • \(F_{X_1}\) and \(F_{X_2}\) are called the marginal CDFs of \(X_1\) and \(X_2\) respectively.

  • The word marginal refers to the presence of other random variables.

Marginal PMFs when \({\mathbf X}\) is discrete

  • The PMF of \(X_1\) (or of \(X_2\)) can be derived from the joint PMF.

  • Note that \(\Omega = \{ X_2 \in {\cal R}_2 \}\) and thus for any \(k_1 \in {\cal R}_1\) we have

\[ f_{X_1}(k_1) = \mathbb{P}( X_1 = k_1 ) = \mathbb{P}\left( (X_1 = k_1) \cap (X_2 \in {\cal R}_2) \right) \] also

\[ \{ X_2 \in {\cal R}_2 \} \, = \, \bigcup_{b \in {\cal R}_2} \{ X_2 = b \} \] hence

\[ \Bigl\{ X_1 = k_1 \Bigr\} \cap \Bigl\{ X_2 \in {\cal R}_2 \Bigr\} = \biggl\{ X_1 = k_1 \biggr\} \cap \left( \bigcup_{b \in {\cal R}_2} \{ X_2 = b \} \right) = \bigcup_{b \in {\cal R}_2} \Bigl( \{ X_1 = k_1 \} \cap \{ X_2 = b \} \Bigr) \]

Marginal PMFs when \({\mathbf X}\) is discrete

  • Putting it all together we get

\[ \begin{aligned} f_{X_1}(k_1) &= \mathbb{P}( X_1 = k_1 ) = \mathbb{P}\left( \bigcup_{b \in {\cal R}_2} \left( \{ X_1 = k_1 \} \cap \{ X_2 = b \} \right) \right) \\ & \\ &= \sum_{b \in {\cal R}_2} \mathbb{P}( X_1 = k_1, \, X_2 = b ) = \sum_{b \in \mathcal{R}_{2}} f_{(X_1, X_2)} ( k_1, b ) \end{aligned} \]

  • Also, with the same reasoning we obtain

\[f_{X_2}(k_2) = \sum_{a \in \mathcal{R}_{1}} f_{(X_1, X_2)} ( a, k_2 )\]

Properties of PMFs

  • Let \(X\) be a discrete random variable with PMF \(f_X(a)\) and range \({\cal R}_X\)

  • For any subset \(B \subset {\cal R}_X\) we have

\[ \mathbb{P}\left( X \in B \right) \, = \, \sum_{b \in B} \, f_X(b) \]

  • Similarly, if \((X, Y)\) is a discrete random vector with PMF \(f_{(X,Y)}(a, b)\) and range \({\cal R}_{(X, Y)}\), then for any subset \(A \subset {\cal R}_{(X, Y)}\) we have

\[ \mathbb{P}\left( (X, Y) \in A \right) \, = \, \sum_{(a, b) \in A} \, f_{(X,Y)}(a, b) \]

Example

  • Consider the experiment of rolling two fair dice.
  • Let \(X\) be the lowest of the two rolls, \(Y\) be the highest.
  • The marginal PMFs of \(X\) and \(Y\) are
\(k\) \(f_X(k)\) \(f_Y(k)\)
1 11/36 1/36
2 9/36 3/36
3 7/36 5/36
4 5/36 7/36
5 3/36 9/36
6 1/36 11/36

Example

  • The joint PMF of \(X\) (rows) and \(Y\) (columns) is

    \(x\ \backslash\ y\) 1 2 3 4 5 6
    1 1/36 2/36 2/36 2/36 2/36 2/36
    2 0 1/36 2/36 2/36 2/36 2/36
    3 0 0 1/36 2/36 2/36 2/36
    4 0 0 0 1/36 2/36 2/36
    5 0 0 0 0 1/36 2/36
    6 0 0 0 0 0 1/36
  • Calculate \(f_X(3)\) and \(f_Y(5)\) using the joint PMF above and check that they coincide with the values in the table on the previous slide

  • What is \(\mathbb{P}( 2 X > Y)\)?
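
A quick way to verify both tables (and to answer the last question) is to enumerate the 36 equally likely outcomes directly; the short Python sketch below is not part of the original notes, only a numerical check.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Enumerate the 36 equally likely outcomes of two fair dice and build the
# joint PMF of X = min(roll1, roll2) and Y = max(roll1, roll2).
joint = defaultdict(Fraction)
for r1, r2 in product(range(1, 7), repeat=2):
    joint[(min(r1, r2), max(r1, r2))] += Fraction(1, 36)

# Marginals: sum the joint PMF over the other variable.
f_X = {k: sum(p for (x, y), p in joint.items() if x == k) for k in range(1, 7)}
f_Y = {k: sum(p for (x, y), p in joint.items() if y == k) for k in range(1, 7)}

print(f_X[3], f_Y[5])                                      # 7/36 and 9/36
print(sum(p for (x, y), p in joint.items() if 2 * x > y))  # P(2X > Y) = 1/2
```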

More than 2 random variables

  • The joint PMF of \((X_1, X_2, ..., X_n)\) is

\[f ( k_1, k_2, \ldots, k_n ) = \mathbb{P}( X_1 = k_1, X_2 = k_2, \ldots, X_n = k_n) \, , \quad ( k_1, k_2, \ldots, k_n ) \in {\cal R}_{(X_1, \ldots, X_n)}\]

  • The joint CDF of \(X_1\), \(X_2\), …, \(X_n\)

\[F ( k_1, k_2, \ldots, k_n) = \mathbb{P}( X_1 \le k_1, X_2 \le k_2, \ldots, X_n \le k_n) \, , \quad ( k_1, k_2, \ldots, k_n ) \in \mathbb{R}^n\]

Independent random variables

Definition
\(X_1, X_2,\dots, X_n\) are independent if the events \(\{ X_1 \in A_1 \}\), \(\{ X_2 \in A_2 \}\), …, \(\{ X_n \in A_n \}\), are independent, for any \(A_j \subset {\cal R}_j\).

In other words: for any \(2 \le k \le n\) and \(1 \le i_1 < i_2 < \ldots < i_k \le n\) we have

\[\mathbb{P}( X_{i_1} \in A_{i_1}, X_{i_2} \in A_{i_2}, \ldots, X_{i_k} \in A_{i_k}) = \mathbb{P}( X_{i_1} \in A_{i_1}) \mathbb{P}(X_{i_2} \in A_{i_2}) \ldots \mathbb{P}( X_{i_k} \in A_{i_k}) \]

Independent random variables

  • A necessary and sufficient condition for random variables \(X_1\), \(X_2\), …, \(X_n\) to be independent is that for ALL \(k_1, k_2, \ldots, k_n\), where each \(k_j \in \mathbb{R}\), their joint and marginal CDFs satisfy

    \[F_{(X_1, \ldots, X_n)} ( k_1, k_2, \ldots, k_n ) = F_{X_1}(k_1) \times F_{X_2}(k_2) \times \ldots \times F_{X_n} (k_n)\]

  • When \(X_1, X_2, \dots, X_n\) are discrete, a necessary and sufficient condition is that for ALL \(k_1, k_2, \ldots, k_n\), where the vector \((k_1, \ldots, k_n) \in {\cal R}_{(X_1, \ldots, X_n)}\), their joint and marginal PMFs satisfy

\[f_{(X_1, \ldots, X_n)}( k_1, k_2, \ldots, k_n ) = f_{X_1}(k_1) \times f_{X_2}(k_2) \times \ldots \times f_{X_n} (k_n)\]

Independent dice

  • We roll a fair die twice.
  • Let \(X_1\) be the number in the 1st roll, and \(X_2\) the number in the 2nd roll.
  • Are \(X_1\) and \(X_2\) independent?
  • The joint PMF of \((X_1, X_2)\) is

\[f( k_1, k_2) = \begin{cases} \frac{1}{36} & k_1, k_2 \in \{1, 2, \dots, 6\}\\ 0 & \text{else.} \end{cases}\]
  • Note that

\[{\cal R}_{(X_1, X_2)} = \left\{ 1, 2, \ldots, 6 \right\} \times \left\{ 1, 2, \ldots, 6 \right\} = {\cal R}_{X_1} \times {\cal R}_{X_2}\]

Independent dice

  • Now compute the marginal PMFs \[\begin{aligned} f_1 ( k_1 ) & = \sum_{k_2=1}^6 f ( k_1, k_2 )\\ & = \sum_{k_2=1}^6 1/36 \\ & = \begin{cases} 1 / 6 & k_1\in\{1,\dots, 6\}\\ 0 & \text{else}.\end{cases} \end{aligned}\]

  • Similarly \[f_2 ( k_2 ) = \begin{cases} 1 / 6 & k_2\in\{1,\dots, 6\}\\ 0 & \text{else}.\end{cases}\]

  • Thus for any \(k_1, k_2 \in \{1,\dots,6\}\): \[f ( k_1, k_2 ) = \frac{1}{36} = \frac{1}{6} \times \frac{1}{6} = f_1 ( k_1 ) f_2 ( k_2 )\]

  • If \(k_1 \notin \{1,\ldots,6\}\) or \(k_2 \notin \{1,\ldots,6\}\), we have \[f ( k_1, k_2 ) = 0 = f_1 ( k_1 ) f_2 ( k_2 )\]

  • Hence \(X_1\) and \(X_2\) are independent.
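
The factorization can also be confirmed mechanically; here is a minimal Python check (not in the original slides) that compares the joint PMF with the product of its marginals at every point of the joint range.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of two fair-die rolls: uniform over {1, ..., 6}^2.
joint = {(k1, k2): Fraction(1, 36) for k1, k2 in product(range(1, 7), repeat=2)}

# Marginal PMFs obtained by summing over the other coordinate.
f1 = {k: sum(joint[(k, b)] for b in range(1, 7)) for k in range(1, 7)}
f2 = {k: sum(joint[(a, k)] for a in range(1, 7)) for k in range(1, 7)}

# X1 and X2 are independent iff the joint PMF factorizes everywhere.
print(all(joint[(a, b)] == f1[a] * f2[b] for a, b in joint))   # True
```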

Defective transistors

  • In a bag of 5 transistors, 2 are defective

  • Transistors will be tested until the 2 defective ones are identified

  • Let \(X_{1}\) be the number of tests made until the first defective is identified

  • Let \(X_{2}\) be the number of tests until the second defective is identified.

  • Find the joint PMF of \(X_{1}\) and \(X_{2}\)

Find the joint PMF of defectives

  • We have \(1 \le X_1 < X_2 \le 5\)

  • Let \(D\) denote “defective”, and \(N\) mean “not defective”. Then the events corresponding to each combination \((X_1, X_2)\) are

    \(x_1 \backslash x_2\) 2 3 4 5
    1 DD DND DNND DNNND
    2 - NDD NDND NDNND
    3 - - NNDD NNDND
    4 - - - NNNDD

Find the joint PMF of defectives

  • Thus, their joint PMF is given by \[f (1, 2 ) = \mathbb{P}( \{DD\} ) = \mathbb{P}( D_2 \ \vert\ D_1 ) \mathbb{P}( D_1 ) = \frac{1}{4} \times \frac{2}{5} = 0.10\] where \(D_1\) is the event that the first tested item was “D”, etc.

  • Similarly: \[\begin{aligned} f (2, 4) &= \mathbb{P}( NDND )\\ &= \mathbb{P}( D_4 \ \vert\ N_1 D_2 N_3 ) \mathbb{P}( N_3 \ \vert\ N_1 D_2 ) \mathbb{P}( D_2 \ \vert\ N_1 ) \mathbb{P}( N_1 ) \\ &= \frac12 \times \frac23 \times \frac24 \times \frac35 = 0.10 \end{aligned}\]

  • etc.

Find the joint PMF of defectives

\(\mathbb{P}(X_1 = x_1, X_2 = x_2)\) 2 3 4 5
1 0.1 0.1 0.1 0.1
2 0 0.1 0.1 0.1
3 0 0 0.1 0.1
4 0 0 0 0.1
  • Another way to present the joint PMF is \[f( k_{1}, k_{2}) = \begin{cases} 0.10 & 1 \le k_1 < k_2 \le 5,\ k_i\in{\mathbb{N}}\\ 0 & \text{else.} \end{cases}\]
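
As a sanity check (the sketch below is an addition, not part of the notes), one can enumerate all \(5!\) equally likely testing orders of the bag and recover the 0.10 entries of the table above.

```python
from fractions import Fraction
from itertools import permutations
from collections import Counter

# Two defective ('D') and three good ('N') transistors; every testing order
# (a permutation of the five positions in the bag) is equally likely.
bag = ['D', 'D', 'N', 'N', 'N']
counts = Counter()
orders = list(permutations(range(5)))
for order in orders:
    # 1-based test numbers at which the two defectives are found
    pos = [i + 1 for i, idx in enumerate(order) if bag[idx] == 'D']
    counts[(pos[0], pos[1])] += 1

joint = {xy: Fraction(c, len(orders)) for xy, c in counts.items()}
print(sorted(joint.items()))   # every pair 1 <= x1 < x2 <= 5 has probability 1/10
```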

Marginal PMFs of defectives

The marginal PMFs:

\(k\) \(f_{1}(k)\) \(f_{2}(k)\)
1 0.4 0
2 0.3 0.1
3 0.2 0.2
4 0.1 0.3
5 0 0.4

Are \(X_1\) and \(X_2\) independent?

Functions of random variables

  • Let \(g : \mathbb{R}^2 \to \mathbb{R}\) be a bivariate function and \(X_1\) and \(X_2\) be two RVs with joint PMF \(f(k_1, k_2)\).

  • For example: \[\begin{aligned} g ( k_1, k_2 ) &= k_1 + k_2 \\ g ( k_1, k_2 ) &= k_1 \times k_2 \\ g ( k_1, k_2 ) &= \exp\{ 2 \, (k_1 + k_2) \} = e^{2 \, (k_1 + k_2)} \end{aligned}\]

  • Then, we can calculate the expectation of this function using the natural formula: \[\mathbb{E}[ g( X_1, X_2 ) ] = \sum_{k_{1}\in \mathcal{R}_{1}} \sum_{k_{2}\in \mathcal{R}_{2}} g( k_{1},k_{2}) f( k_{1}, k_{2}),\] where \(\mathcal{R}_{1}\) and \(\mathcal{R}_{2}\) are ranges of \(X_1\) and \(X_2\).

Expectation of a bivariate function

  • Suppose the PMF of \((X_1, X_2)\) is given by


\(X_2 = x_2 \backslash X_1 = x_1\) 1 2 3 4
1 0.05 0.10 0.15 0.20
2 0.05 0.15 0.20 0.10
  • Let \(g(a, b) = a \, b\)

  • Compute \(\mathbb{E}[g ( X_1, X_2 )]\)

Solution

\[\begin{aligned} \mathbb{E}[ X_1 X_2 ] &= 1 \times 1 \times 0.05 + 1 \times 2 \times 0.05 \\ & \quad + 2 \times 1 \times 0.10 + 2 \times 2 \times 0.15 \\ & \quad + 3 \times 1 \times 0.15 + 3 \times 2 \times 0.20 \\ & \quad + 4 \times 1 \times 0.20 + 4 \times 2 \times 0.10 \\ & = 4.2. \end{aligned}\]
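
The same number can be obtained by summing over the joint PMF in code; the following Python fragment (added here only as a check) stores the table with keys \((x_1, x_2)\).

```python
# Joint PMF from the table above, keyed as (x1, x2),
# with x1 in {1, 2, 3, 4} and x2 in {1, 2}.
pmf = {(1, 1): 0.05, (2, 1): 0.10, (3, 1): 0.15, (4, 1): 0.20,
       (1, 2): 0.05, (2, 2): 0.15, (3, 2): 0.20, (4, 2): 0.10}

# E[g(X1, X2)] with g(a, b) = a * b
print(sum(x1 * x2 * p for (x1, x2), p in pmf.items()))   # 4.2 (up to rounding)
```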

Expectations of sums

Proposition
\[\mathbb{E}[ X_{1}+X_{2}] = \mathbb{E}[ X_{1}] + \mathbb{E}[ X_{2}].\]

Proof
\[\begin{aligned} \mathbb{E}[X_{1}+X_{2}] &= \sum_{k_{1}}\sum_{k_{2}}(k_{1}+k_{2}) f( k_{1}, k_{2}) \\ &= \sum_{k_{1}}\sum_{k_{2}}[ k_{1} f( k_{1}, k_{2}) +k_{2}f( k_{1}, k_{2}) ] \\ &= \sum_{k_{1}} k_{1}\overset{f_{1}( k_{1}) }{\overbrace{% \sum_{k_{2}} f( k_{1}, k_{2}) }}+\sum_{k_{2}} k_{2}\overset{% f_{2}( k_{2}) }{\overbrace{\sum_{k_{1}}f( k_{1}, k_{2}) }} \\ &=\sum_{k_{1}} k_{1} f_{1}( k_{1}) +\sum_{k_{2}} k_{2} f_{2}( k_{2}) \\ &= \mathbb{E}[ X_{1}] +\mathbb{E}[X_{2}]. \end{aligned}\]

Expectations of products

Proposition
If \(X_1\) and \(X_2\) are independent, then \[\mathbb{E}[X_{1}X_{2}] = \mathbb{E}[X_{1}] \mathbb{E}[X_{2}]\]

Proof
\[\begin{aligned} \mathbb{E}[X_{1}X_{2}] &= \sum_{x_{1}}\sum_{x_{2}}x_{1}x_{2}f( x_{1},x_{2}) \\ &=\sum_{x_{1}}\sum_{x_{2}}x_{1}x_{2}f_{1}( x_{1}) f_{2}( x_{2}) & \mbox{(by independence)} \\ &=\sum_{x_{1}}x_{1}f_{1}( x_{1})\sum_{x_{2}}x_{2}f_{2} ( x_{2}) \\ &= \mathbb{E}[X_{1}] \mathbb{E}[X_{2}]. \end{aligned}\]

Linearity of the expected value

Proposition
\[\mathbb{E}[a + b \, g(X_1, X_2) ] = a + b \, \mathbb{E}[ g (X_1, X_2) ] \quad \forall \ a, b \in {\mathbb{R}},\] \[\mathbb{E}[g(X_1, X_2) + h(X_1, X_2) ] = \mathbb{E}[g(X_1, X_2) ] + \mathbb{E}[h (X_1, X_2) ]\]

Proof
Let \(X = g ( X_1, X_2)\) and \(Y = h ( X_1, X_2)\).

Applying \(\mathbb{E}[a +b \, X] = a + b \, \mathbb{E}[X]\) and \(\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]\), we get the above results.

Example revisited

Recall the joint PMF of \((X_1, X_2)\) given by

\(X_2 = x_2 \backslash X_1 = x_1\) 1 2 3 4
1 0.05 0.10 0.15 0.20
2 0.05 0.15 0.20 0.10

Calculate \[\mathbb{E}\left[ \left( 1 + 2 \, X_{1} + 3 \, X_{2} \right)^{2} \right]\]

There are many ways to calculate this value.

One approach

  • Note that \[(1+2X_{1}+3X_{2})^2= 1+4X_{1}+6X_{2}+4X_{1}^{2}+9X_{2}^{2}+12X_{1}X_{2}.\]

  • Making use of linearity, we may compute the expectation of each of these terms and add the results.

  • We already have \(\mathbb{E}[X_{1}X_{2}] =4.2\).

Other terms

\(k_{1}\) \(f_{1}(k_{1})\) \(k_{1}f_{1}(k_{1})\) \(k_{1}^{2}f_{1}(k_{1})\)
1 0.10 0.10 0.10
2 0.25 0.50 1.00
3 0.35 1.05 3.15
4 0.30 1.20 4.80
Total 1.00 2.85 9.05

Therefore, \[\begin{aligned} \mathbb{E}[X_{1}]&= 2.85 \\ \mathbb{E}[X_{1}^{2}] &= 9.05 \end{aligned}\]

Other terms

\(k_{2}\) \(f_{2}(k_{2})\) \(k_{2}f_{2}(k_{2})\) \(k_{2}^{2}f_{2}(k_{2})\)
1 0.50 0.50 0.50
2 0.50 1.00 2.00
Total 1.00 1.50 2.50

Therefore, \[\begin{aligned}\mathbb{E}[X_{2}]&=1.5 \\ \mathbb{E}[X_{2}^{2}] &= 2.5\end{aligned}\]

Assembling

So far, we have obtained \[\begin{aligned} \mathbb{E}[X_{1}X_{2}] &=4.2\\ \mathbb{E}[X_{1}] &=2.85 & \mathbb{E}[X_{2}] &=1.5 \\ \mathbb{E}[X_{1}^{2}] &=9.05 & \mathbb{E}[X_{2}^{2}] &= 2.5. \end{aligned}\]

Therefore, \[\begin{aligned} \mathbb{E}[(1+2X_{1}+3X_{2})^{2}] &= 1 + 4\mathbb{E}[X_{1}] + 6\mathbb{E}[X_{2}] + 4\mathbb{E}[X_{1}^{2}] + 9 \mathbb{E}[X_{2}^{2}] + 12\mathbb{E}[X_{1}X_{2}]\\ &= 1+4 \times 2.85 + 6\times 1.5 + 4\times 9.05 + 9\times 2.5 + 12\times 4.2 \\ &=130.5. \end{aligned}\]

  • Check this result by computing \(\mathbb{E}[(1+2X_{1}+3X_{2})^{2}]\) in a different way on your own.
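
One such alternative check is to apply the expectation formula for \(g(X_1, X_2) = (1 + 2X_1 + 3X_2)^2\) directly to the joint PMF, without expanding the square; a small Python sketch (not part of the notes) follows.

```python
# Direct computation of E[(1 + 2*X1 + 3*X2)^2] from the joint PMF,
# without expanding the square.
pmf = {(1, 1): 0.05, (2, 1): 0.10, (3, 1): 0.15, (4, 1): 0.20,
       (1, 2): 0.05, (2, 2): 0.15, (3, 2): 0.20, (4, 2): 0.10}

print(sum((1 + 2 * x1 + 3 * x2) ** 2 * p for (x1, x2), p in pmf.items()))   # 130.5
```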

Covariance: measuring linear relationships

  • Given two random variables, we might ask how they are related: for instance, a tall person tends to have a heavier body.

  • Let \(X_1\) and \(X_2\) be two random variables with \[\mathbb{E}[ X_1] = \mu_1 \qquad \text{ and } \qquad \mathbb{E}[ X_2 ] = \mu_2\]

Definition
We define the covariance between \(X_1\) and \(X_2\) to be \[\operatorname{Cov}(X_1, X_2) = \mathbb{E}[ ( X_1 - \mu_1) ( X_2 - \mu_2) ]\]

If \(X_1\) and \(X_2\) are discrete, then \[\operatorname{Cov}(X_1, X_2) = \sum_{k_1 \in {\cal R}_{X_1}} \sum_{k_2 \in {\cal R}_{X_2}} (k_1 - \mu_1)(k_2 - \mu_2)f( k_1, k_2)\]

Covariance

  • A useful result:

\[\operatorname{Cov}(X_1, X_2) = \mathbb{E}[X_1 X_2] - \mathbb{E}[X_1]\mathbb{E}[X_2].\]

Proof
\[\begin{aligned} \operatorname{Cov}(X_1, X_2) & = \mathbb{E}[( X_1 - \mu_1)(X_2 - \mu_2) ] \\ & = \mathbb{E}[ X_1 X_2 - \mu_1 X_2 - \mu_2 X_1 + \mu_1 \mu_2 ] \\ & = \mathbb{E}[ X_1 X_2 ] - \mathbb{E}[ \mu_1 X_2 ] - \mathbb{E}[ \mu_2 X_1 ] + \mu_1 \mu_2 \\ & = \mathbb{E}[ X_1 X_2 ] - \mu_1 \, \mathbb{E}[ X_2 ] - \mu_2 \, \mathbb{E}[ X_1 ] + \mu_1 \mu_2 \\ & = \mathbb{E}[ X_1 X_2 ] - \mu_1 \mu_2 - \mu_2 \mu_1 + \mu_1 \mu_2 \\ & = \mathbb{E}[ X_1 X_2 ] - \mu_1 \mu_2. \end{aligned}\]

Numerical example

Suppose that \[\begin{aligned} X_{1} &=\text{Time to complete task 1 (in days)} \\ X_{2} &=\text{Time to complete task 2 (in days)} \end{aligned}\] have their joint PMF given by

\(X_1\backslash X_2\) 1 2 3 4 Total
1 0.20 0.05 0.05 0 0.30
2 0.05 0.15 0.10 0.05 0.35
3 0 0.05 0.10 0.20 0.35
Total 0.25 0.25 0.25 0.25 1

It is easy to calculate \[\begin{aligned} \mathbb{E}[X_1] &= 1\times 0.30+2\times 0.35+3\times 0.35=2.05, \\ \mathbb{E}[X_2] &= 1\times 0.25+2\times 0.25+3\times 0.25+4\times 0.25 = 2.50. \end{aligned}\]

Example (continued)

\(X_1\backslash X_2\) 1 2 3 4 Total
1 0.20 0.05 0.05 0 0.30
2 0.05 0.15 0.10 0.05 0.35
3 0 0.05 0.10 0.20 0.35
Total 0.25 0.25 0.25 0.25 1

We further have \[\begin{aligned} \mathbb{E}[X_{1}X_{2}] &= 1\times 1\times 0.20+1\times 2\times 0.05 +1\times 3\times 0.05\\ &\quad +1\times 4\times 0 + 2\times 1\times 0.05+2\times 2\times 0.15\\ &\quad +2\times 3\times 0.10+2\times 4\times 0.05 +3\times 1\times 0 \\ &\quad +3\times 2\times 0.05+3\times 3\times 0.10+3\times 4\times 0.20 \\ &= 5.75. \end{aligned}\]

Example (continued)

So far, we have obtained \[\begin{aligned} \mathbb{E}[X_1] &= 2.05, \\ \mathbb{E}[X_2] &= 2.50, \\ \mathbb{E}[X_{1}X_{2}] &= 5.75. \end{aligned}\]

Therefore, \[\begin{aligned} \operatorname{Cov}(X_{1},X_{2}) &= \mathbb{E}[X_{1}X_{2}] - \mathbb{E}[X_{1}] \mathbb{E}[X_{2}] \\ &=5.75 - 2.05\times 2.50 \\ &= 0.625. \end{aligned}\]
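
These three expectations, and hence the covariance, can be reproduced numerically; the snippet below is only a verification of the arithmetic above.

```python
# Joint PMF of the task-completion times (in days), keyed as (x1, x2).
pmf = {(1, 1): 0.20, (1, 2): 0.05, (1, 3): 0.05, (1, 4): 0.00,
       (2, 1): 0.05, (2, 2): 0.15, (2, 3): 0.10, (2, 4): 0.05,
       (3, 1): 0.00, (3, 2): 0.05, (3, 3): 0.10, (3, 4): 0.20}

E1  = sum(x1 * p for (x1, x2), p in pmf.items())        # 2.05
E2  = sum(x2 * p for (x1, x2), p in pmf.items())        # 2.50
E12 = sum(x1 * x2 * p for (x1, x2), p in pmf.items())   # 5.75
print(E12 - E1 * E2)                                    # Cov = 0.625 (up to rounding)
```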

Interpretation

When \(\operatorname{Cov}( X_{1},X_{2})\) is large and positive, the variables tend to be both above or both below their respective means simultaneously.

  • In other words, the two variables tend to move in the same direction (when one increases, so does the other, and vice versa).

  • The covariance is a measure of the “linear” association between the variables.

When \(\operatorname{Cov}(X_{1},X_{2})\) is large and negative one of the variables tends to be above its mean when the other is below its mean.

  • In other words, the two variables tend to move in opposite directions (when one increases, the other tends to decrease, and vice versa).

Covariance and independence

Proposition
If \(X_1\) and \(X_2\) are independent, then \(\operatorname{Cov}(X_1, X_2 ) = 0\) .

Proof
\[\begin{aligned} \operatorname{Cov}(X_1, X_2) & = \mathbb{E}[ X_1 X_2 ] - \mathbb{E}[ X_1 ] \mathbb{E}[ X_2 ] \\ & = \mathbb{E}[ X_1 ] \mathbb{E}[ X_2 ] - \mathbb{E}[ X_1 ] \mathbb{E}[ X_2 ] \\ & = \ 0. \end{aligned}\]

Bi-linearity

Proposition
Given constants \(a, b, c, d \in{\mathbb{R}}\), \[\operatorname{Cov}(a + b \, X_1,\ c + d \, X_2) \ = \ b\, d \, \operatorname{Cov}( X_1, X_2 ).\]

Proof
\[\begin{aligned} \operatorname{Cov}(a + b X_1,\ c + d X_2 )&= \mathbb{E}\bigg[ (a + b X_1 - \mathbb{E}[ a + b X_1 ] ) \, ( c + d X_2 - \mathbb{E}[c + d X_2 ] ) \bigg] \\ &= \mathbb{E}\bigg[ ( a + b X_1 - a - b \mathbb{E}[ X_1 ] ) \, ( c + d X_2 - c - d \mathbb{E}[X_2] ) \bigg] \\ &= \mathbb{E}[ \, b \, ( X_1 - \mathbb{E}[ X_1] ) \, d \, ( X_2 - \mathbb{E}[X_2] ) \, ] \\ &= b \, d \, \mathbb{E}[ ( X_1 - \mathbb{E}[ X_1]) \, ( X_2 - \mathbb{E}[X_2] ) ] \\ &= b \, d \, \operatorname{Cov}( X_1, X_2 ). \end{aligned}\]

Lack of scale invariance

  • The variables in the last example were given in days: \[ \operatorname{Cov}( X_{1},X_{2}) = 0.625.\]

  • If the variables were given in hours instead: \[ \operatorname{Cov}( 24 X_{1}, 24 X_{2}) = 24^2 \times 0.625 = 360.\]

  • The covariance can be artificially increased or decreased by changing the units of the variables.

  • Not an ideal metric for the strength of the relationship.

Correlation coefficient

  • The correlation between \(X_1\) and \(X_2\) is

\[\operatorname{Corr}(X_1, X_2) \, = \, \frac{ \operatorname{Cov}(X_1, X_2) }{ \sqrt{ \operatorname{Var}[ X_1 ] } \, \sqrt{ \operatorname{Var}[ X_2 ] } }\] provided that \(\operatorname{Var}[ X_1 ] \times \operatorname{Var}[ X_2 ] > 0\)

 

  • Common notation: “\(\operatorname{Corr}\)” or “\(\rho\)”.

Correlation coefficient

Proposition
  • Let \(a, b, c, d \in {\mathbb{R}}\) with \(b \, d \ne 0\); then \[\operatorname{Corr}(a + b X_1,\ c + d X_2) = \text{sign}(b \times d) \, \operatorname{Corr}(X_1, X_2)\]

Proof
Recall: \(\sqrt{ \operatorname{Var}[ a + b X_1 ] } = |b| \, \sqrt{ \operatorname{Var}[ X_1] }\) and \(\operatorname{Cov}(a + b X_1, c + d X_2) = bd \operatorname{Cov}(X_1, X_2 )\). Hence, \[\begin{aligned} \operatorname{Corr}(a + b \, X_1, c + d \, X_2) &= \frac{ b d \operatorname{Cov}(X_1, X_2 ) }{ |b| |d| \sqrt{ \operatorname{Var}[ X_1] } \sqrt{ \operatorname{Var}[X_2]}} \\ & = \frac{ b d }{ |b| |d| } \, \operatorname{Corr}(X_1, X_2). \end{aligned}\]

Scale invariance

  • The linear correlation coefficient is scale invariant

  • In other words: for any nonzero \(b, d \in \mathbb{R}\):

    • \(|\operatorname{Corr}(b \, X_1, X_2)| = |\operatorname{Corr}(X_1, X_2)|\)
    • \(|\operatorname{Corr}(X_1, d \, X_2)| = |\operatorname{Corr}(X_1, X_2)|\)
    • \(|\operatorname{Corr}(b \, X_1, d \, X_2)| = |\operatorname{Corr}(X_1, X_2)|\)
  • The strength (magnitude) of the relationship doesn’t change

  • The direction (sign) might

The range of the correlation coefficient

  • In addition to being scale invariant, we have \[-1 \le \operatorname{Corr}( X_1, X_2 ) \le 1.\]

  • Hence, we have a unique “reference scale” to consider a correlation value to be “high” or “low”

  • Specific cutoffs / guidelines / thresholds often depend on the subject area
    e.g. \(\rho=0.9\) may be low to a physicist, but extremely high to a sociologist

  • To prove this, we first show:

\[\operatorname{Var}( X_{1} \pm X_{2}) = \operatorname{Var}( X_{1})+ \operatorname{Var}( X_{2}) \pm 2 \operatorname{Cov}( X_{1},X_{2}).\]

The variance of a sum

\[\begin{aligned} \operatorname{Var}[ X_{1}+X_{2} ] &= \mathbb{E}\left[ \left( \left( X_{1}+X_{2} \right)- \left(\mu _{1} + \mu _{2} \right) \right)^{2} \right] \\ & \\ &= \mathbb{E}\left[\left( \left( X_{1}-\mu _{1}\right) + \left( X_{2}-\mu_{2}\right) \right)^{2} \right] \\ & \\ &= \mathbb{E}\left[ \left( X_{1}-\mu _{1} \right)^{2} + \left( X_{2}-\mu _{2} \right)^{2} +2 \left( X_{1}-\mu _{1} \right) \left( X_{2}-\mu _{2} \right) \right] \\ & \\ &= \operatorname{Var}[ X_{1}] + \operatorname{Var}[X_{2}] + 2 \operatorname{Cov}( X_{1},X_{2} ) \end{aligned}\]

Similarly, you should prove that

\[ \operatorname{Var}[ X_{1} - X_{2} ] = \operatorname{Var}[ X_{1}] + \operatorname{Var}[X_{2}] - 2 \operatorname{Cov}( X_{1},X_{2} ) \]
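
The identity can also be checked numerically on the task-times example from the covariance slides; the Python sketch below is an added illustration, not part of the original material.

```python
# Check Var[X1 + X2] = Var[X1] + Var[X2] + 2 Cov(X1, X2) for the task-times PMF.
pmf = {(1, 1): 0.20, (1, 2): 0.05, (1, 3): 0.05, (1, 4): 0.00,
       (2, 1): 0.05, (2, 2): 0.15, (2, 3): 0.10, (2, 4): 0.05,
       (3, 1): 0.00, (3, 2): 0.05, (3, 3): 0.10, (3, 4): 0.20}

def E(g):
    """Expectation of g(X1, X2) under the joint PMF."""
    return sum(g(x1, x2) * p for (x1, x2), p in pmf.items())

var1 = E(lambda a, b: a ** 2) - E(lambda a, b: a) ** 2
var2 = E(lambda a, b: b ** 2) - E(lambda a, b: b) ** 2
cov  = E(lambda a, b: a * b) - E(lambda a, b: a) * E(lambda a, b: b)
var_sum = E(lambda a, b: (a + b) ** 2) - E(lambda a, b: a + b) ** 2

print(var_sum, var1 + var2 + 2 * cov)   # the two values agree (up to rounding)
```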

Proof that \(\operatorname{Corr}(X_1, X_2) \in [-1, 1]\)

Let \(Y = X_2 - \beta \, X_1\), with \(\beta = \operatorname{Cov}(X_1, X_2) / \operatorname{Var}[X_1]\), and write \(\sigma_1^2 = \operatorname{Var}[X_1] > 0\), \(\sigma_2^2 = \operatorname{Var}[X_2] > 0\).

\[\begin{aligned} 0 \le \operatorname{Var}[Y] &= \sigma_2^2 + \beta^2 \sigma_1^2 - 2 \, \beta \, \operatorname{Cov}(X_1, X_2) \\ &= \sigma_2^2 + \frac{ \operatorname{Cov}(X_1, X_2)^2 }{ \sigma_1^2} - 2 \frac{ \operatorname{Cov}(X_1, X_2)^2}{ \sigma_1^2 } \\ &= \sigma_2^2 - \frac{ \operatorname{Cov}(X_1, X_2)^2}{ \sigma_1^2 }\\ &\Rightarrow \frac{\operatorname{Cov}(X_1, X_2)^2}{ \sigma_1^2 } \leq \sigma_2^2\\ &\Rightarrow \frac{\operatorname{Cov}(X_1, X_2)^2}{ \sigma_1^2\sigma_2^2 } \leq 1\\ &\Rightarrow \operatorname{Corr}(X_1, X_2)^2 \leq 1\\ & \\ &\Rightarrow |\operatorname{Corr}(X_1, X_2)| \leq 1 \end{aligned}\]

Uncorrelated and not independent

  • We roll two fair dice
  • Let \(X\) and \(Y\) be the results
  • Note that \(X\) and \(Y\) are independent random variables
  • Let

\[V = X + Y \qquad \text{ and } \qquad U = X - Y\]

  • Find \(\operatorname{Corr}(V, U)\).

Uncorrelated and not independent

  • We need \(\operatorname{Var}[V]\), \(\operatorname{Var}[U]\) and \(\operatorname{Cov}(U, V)\)

  • For \(\operatorname{Var}[V]\):

\[ \operatorname{Var}[V] = \operatorname{Var}[X + Y] = \operatorname{Var}[X] + \operatorname{Var}[Y] = 2 \times 35 / 12 = 35 / 6 \]

  • Similarly

\[ \operatorname{Var}[U] = \operatorname{Var}[X - Y] = \operatorname{Var}[X] + \operatorname{Var}[Y] = 35 / 6 \]

  • \(\operatorname{Cov}(V, U)\) is hard to find directly. Instead, recall that

\[\operatorname{Var}[V + U] = \operatorname{Var}[V] + \operatorname{Var}[U] + 2 \, \operatorname{Cov}(V, U) \]

Uncorrelated and not independent

  • Also \(V + U = X + Y + X - Y = 2 \, X\), so

\[\operatorname{Var}[V + U] = \operatorname{Var}[ 2 \, X ] = 4 \, \operatorname{Var}[X] = 35/3 \]

  • Finally

\[ \begin{aligned} \operatorname{Cov}(V, U) &= \frac{1}{2} \left( \operatorname{Var}[V + U] - \operatorname{Var}[V] - \operatorname{Var}[U] \right) \\ & \\ &= \frac{1}{2} \left( \frac{35}{3} - \frac{35}{6} - \frac{35}{6} \right) = 0 \end{aligned} \]

  • Then, \(\operatorname{Corr}(V, U) = 0\)

\(U\) and \(V\) are not independent

  • We will find a point \((u, v)\) where

\[ \mathbb{P}\left( U = u \, , V = v \right) \ne \mathbb{P}\left( U = u \right) \, \mathbb{P}\left( V = v \right) \]

  • Note that \(\mathbb{P}( V = 2 ) = 1/36\), because

\[ \left\{ V = 2 \right\} \, = \, \left\{ X + Y = 2 \right\} \, = \, \left\{ X = 1, \, Y = 1 \right\} \]

  • And also

\[ \left\{ V = 2 \right\} \, = \, \left\{ X = 1, \, Y = 1 \right\} \ \subset \ \left\{ U = 0 \right\} \]

\(U\) and \(V\) are not independent

  • Hence

\[ \left\{ V = 2, \, U = 0 \right\} = \left\{ V = 2 \right\} \cap \left\{ U = 0 \right\} = \left\{ V = 2 \right\} \] and \[ \mathbb{P}(V = 2, U = 0) \, = \, \mathbb{P}( V = 2 ) \, = \, \frac{1}{36} \]

  • Since \(\mathbb{P}( U = 0) = 1/6\), we have

\[\mathbb{P}(V = 2) \, \mathbb{P}(U = 0) = \frac{1}{36}\times\frac{1}{6} \, \ne \, \mathbb{P}(V = 2, U = 0)\]
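
Both facts about \((U, V)\), zero covariance and lack of independence, can be confirmed by enumeration; the sketch below (an addition to the notes) builds the joint PMF of \((U, V)\) from the 36 dice outcomes.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint PMF of (U, V) = (X - Y, X + Y) for two independent fair dice.
joint = defaultdict(Fraction)
for x, y in product(range(1, 7), repeat=2):
    joint[(x - y, x + y)] += Fraction(1, 36)

def E(g):
    return sum(g(u, v) * p for (u, v), p in joint.items())

cov = E(lambda u, v: u * v) - E(lambda u, v: u) * E(lambda u, v: v)
print(cov)                                               # 0: uncorrelated

pV2 = sum(p for (u, v), p in joint.items() if v == 2)    # P(V = 2) = 1/36
pU0 = sum(p for (u, v), p in joint.items() if u == 0)    # P(U = 0) = 1/6
print(joint[(0, 2)], pV2 * pU0)                          # 1/36 vs 1/216: not independent
```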

Discrete conditional distributions

  • Let \(f_{(X_1, X_2)}(k_1, k_2)\) be the PMF of the random vector \((X_1, X_2)\).

  • For points \(k_1\) with \(f_{X_1}(k_1) > 0\), we define the conditional PMF of \(X_2\) given \(X_1 = k_1\) as

\[\begin{aligned} f_{2|1} ( k_{2}\ \vert\ k_1 ) &=\mathbb{P}(X_{2}=k_{2} \ \vert\ X_{1}= k_1 ) \\ & \\ &=\frac{\mathbb{P}( X_1 = k_1, \, X_{2}=k_{2}) }{\mathbb{P}( X_{1}= k_1 ) } \\ & \\ &=\frac{f_{(X_1, X_2)}( k_1 , k_{2}) }{f_{X_1}( k_1 ) }. \end{aligned}\]

Discrete conditional distributions

  • For each fixed \(k_1\) with \(f_{X_1}(k_1) > 0\), the function \(f_{2 | 1}( k_{2} \ \vert\ k_{1})\) is a PMF in \(k_2 \in {\cal R}_{X_2}\):

    • You can check that \(f_{2|1}(k_{2} \ \vert\ k_{1}) \ge 0\)

    • and also: \[\begin{aligned} \sum_{k_{2} \in \mathcal{R}_{X_2} } f_{2 | 1}( k_{2} \ \vert\ k_{1}) &= \sum_{k_{2} \in \mathcal{R}_2 }\frac{f_{(X_1, X_2)}( k_{1}, k_{2}) }{f_{X_1}( k_{1}) } \\ & \\ &= \frac{1}{f_{X_1}( k_1 ) } \sum_{k_{2} \in \mathcal{R}_2 }f_{(X_1, X_2)}( k_{1}, k_{2}) \\ & \\ &= \frac{1}{f_{X_1}( k_1 ) } \, f_{X_1}( k_1 ) = 1 \end{aligned}\]

The other way

  • Similarly: when \(f_2(k_2) > 0\) we define the PMF of \(X_1\) given \(X_2 = k_2\) to be: \[f_{1 | 2} ( k_{1} \ \vert\ k_{2}) =\frac{f( k_{1}, k_{2}) }{f_{2}( k_{2}) }.\]

  • For each \(k_2\), \(f_{1 | 2}( k_{1} \ \vert\ k_{2})\) is a PMF in \(k_1\).

More dice examples

  • We roll two fair dice
  • Let \(X\) and \(Y\) be the results
  • Let \(V = X + Y\) and \(W = \max\{X, Y\}\).

More dice examples (cont’d)

  • The PMF of \((W, V)\) where

\[V = X + Y \quad\quad \text{and} \quad\quad W = \max\{X, Y\}\] is:

\(W\ \backslash\ V\) 2 3 4 5 6 7 8 9 10 11 12
1 1/36 0 0 0 0 0 0 0 0 0 0
2 0 2/36 1/36 0 0 0 0 0 0 0 0
3 0 0 2/36 2/36 1/36 0 0 0 0 0 0
4 0 0 0 2/36 2/36 2/36 1/36 0 0 0 0
5 0 0 0 0 2/36 2/36 2/36 2/36 1/36 0 0
6 0 0 0 0 0 2/36 2/36 2/36 2/36 2/36 1/36

Conditional PMFs of \(V\) and \(W\)

  • The conditional PMF of \(W\) given \(V = v\) is the column of the table for \(V = v\), divided by that column’s sum (which is \(\mathbb{P}(V = v)\)); dividing a row by its sum gives instead the conditional PMF of \(V\) given \(W\).

  • For example

\[ f_{W|V}(3|6) = \frac{ \mathbb{P}(W=3, V=6) }{ \mathbb{P}( V=6 )} = \frac{ 1/36 }{ 1/36 + 2/36 + 2/36} = 1/5 \]

  • We have

\[f_{W|V}(w|6) = \begin{cases} 1/5 & w = 3 \\ & \\ 2/5 & w = 4, \, 5 \\ & \\ 0 & \text{else.}\end{cases}\]
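
The same conditional PMF can be read off programmatically by normalizing the \(V = 6\) column of the joint PMF; the following Python lines are an added check.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint PMF of (W, V) = (max, sum) for two fair dice.
joint = defaultdict(Fraction)
for x, y in product(range(1, 7), repeat=2):
    joint[(max(x, y), x + y)] += Fraction(1, 36)

v = 6
pV = sum(p for (w, vv), p in joint.items() if vv == v)               # P(V = 6) = 5/36
cond = {w: joint[(w, v)] / pV for w in range(1, 7) if joint[(w, v)] > 0}
print(cond)                                                          # {3: 1/5, 4: 2/5, 5: 2/5}
```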

Conditional Expectation

  • The conditional expectation of \(W\) given \(V = v\) is:

\[\mathbb{E}[ W\ \vert\ V =v] = \sum_{w \in \mathcal{R}_W} w \, f_{W|V} ( w \ \vert\ v).\]

  • For example:

\[\mathbb{E}[W \ \vert\ V = 4] = 2\times \frac{1}{3}+3\times \frac{2}{3}=8/3\]

\[\mathbb{E}[W \ \vert\ V=7] =4\times \frac{1}{3}+5\times \frac{1}{3}+6\times \frac{1}{3}=5\]

\[\mathbb{E}[V \ \vert\ W = 1] = 2 \times 1 = 2\]

Conditional variance

  • Naturally, the conditional variance is the variance of the conditional distribution:

\[\begin{aligned} \operatorname{Var}[ W \ \vert\ V=v] &= \mathbb{E}\Big[\big( W - \mathbb{E}[ W \ \vert\ V=v ]\big)^2 \ \vert\ V = v \Big]\\ &= \sum_{w \in \mathcal{R}_W} \big(w - \mathbb{E}[W\ \vert\ V = v]\big)^2 f(w \ \vert\ v)\\ &= \mathbb{E}[W^2 \ \vert\ V = v] - \left( \mathbb{E}[W \ \vert\ V = v] \right)^2 \end{aligned}\] (Prove the last equality!)

  • Example: \[\begin{aligned} \mathbb{E}[W \ \vert\ V = 4] &= 2\times \frac{1}{3}+3\times \frac{2}{3}=\frac{8}{3}\approx 2.667\\ \mathbb{E}[ W^{2}\ \vert\ V = 4] &=4\times \frac{1}{3}+9\times \frac{2}{3}=\frac{22}{3}\approx 7.333\\ \Rightarrow \operatorname{Var}[W\ \vert\ V=4] &=\frac{22}{3}-\left(\frac{8}{3}\right)^{2}=\frac{2}{9}\approx 0.222. \end{aligned}\]
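
Computed with exact fractions (the code below is an added check, reusing the joint PMF of \((W, V)\)), the conditional mean and variance given \(V = 4\) come out as \(8/3\) and \(2/9\).

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Conditional mean and variance of W = max given V = sum = 4, in exact arithmetic.
joint = defaultdict(Fraction)
for x, y in product(range(1, 7), repeat=2):
    joint[(max(x, y), x + y)] += Fraction(1, 36)

v = 4
pV = sum(p for (w, vv), p in joint.items() if vv == v)
cond = {w: p / pV for (w, vv), p in joint.items() if vv == v}

m1 = sum(w * p for w, p in cond.items())        # E[W | V = 4]   = 8/3
m2 = sum(w ** 2 * p for w, p in cond.items())   # E[W^2 | V = 4] = 22/3
print(m1, m2 - m1 ** 2)                         # 8/3 and 2/9
```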

Conditional expectation as a function

  • We defined

\[ \mathbb{E}\left( W | V = v \right) \]


  • We can look at the function

\[ h( {\color{red} a} ) \, = \, \mathbb{E}\left( W | V = {\color{red} a} \right) \]

  • Note that

\[ h \, : \, \mathcal{R}_V \ \longrightarrow \ \mathbb{R} \]

Conditional expectation as a function

\(v\) \(h(v) = \mathbb{E}\left( W | V = v \right)\)
2 1.00
3 2.00
4 2.67
5 3.50
6 4.20
7 5.00
8 5.20
9 5.50
10 5.67
11 6.00
12 6.00

Useful properties

  • If we apply the function \(h\) to the random variable \(V\), we will get a new random variable: \(h(V)\)

  • Note that \[ h \left( V \right) \, = \, \mathbb{E}\bigl( \left. W \right| V \bigr) \]

  • In other words: \(\mathbb{E}\bigl( \left. W \right| V \bigr)\) is a random variable

  • which will have its own expected value and variance, for example

Useful properties

  • Iterated expectations (“Tower property”): For any two random variables \(X\) and \(Y\) (for which all the expectations below exist), we have:

\[ \mathbb{E}\Bigl[ \, \mathbb{E}\bigl( \left. Y \right| X \bigr) \, \Bigr] \ = \ \sum_{k_1 \in \mathcal{R}_X} \, \mathbb{E}\bigl( \left. Y \right| X = k_1 \bigr) \, f_X(k_1) \, = \, \mathbb{E}\bigl[ Y \bigr] \]

  • And similarly (but not identically!) for the variance:

\[ \operatorname{Var}\Bigl( \mathbb{E}\bigl( \left. Y \right| X \bigr) \Bigr) \, + \, \mathbb{E}\Bigl( \operatorname{Var}\bigl( Y | X \bigr) \Bigr) \, = \, \operatorname{Var}\left( Y \right) \]

Iterated expectations (Tower property) - Proof

  • Let \(h(X) = \mathbb{E}[Y\ \vert\ X]\)

\[\begin{aligned} \mathbb{E}[ \, \mathbb{E}[Y \ \vert\ X] ] &= \mathbb{E}[h(X)] = \sum_{x \in \mathcal{R}_X} h(x) \mathbb{P}(X = x) \\ &= \sum_{x \in \mathcal{R}_X} \mathbb{E}[Y \ \vert\ X = x] f_X(x) \\ &= \sum_{x \in \mathcal{R}_X} \left(\sum_{y \in \mathcal{R}_Y} y f_{Y|X} (y\ \vert\ x) \right) f_X (x) \\ &= \sum_{x \in \mathcal{R}_X} \sum_{y \in \mathcal{R}_Y} \left(y f_{Y|X} (y\ \vert\ x) f_X (x)\right) = \sum_{x \in \mathcal{R}_X} \sum_{y \in \mathcal{R}_Y} y f_{X,Y}(x, y)\\ & = \sum_{y \in \mathcal{R}_Y} y \, \sum_{x \in \mathcal{R}_X} f_{X,Y}(x, y) = \sum_{y \in \mathcal{R}_Y} y \, f_{Y}(y) = \mathbb{E}[Y]. \end{aligned}\]

Proof of the identity for the variance of \(\mathbb{E}[ Y | X ]\)

  • Recall that

\[\operatorname{Var}[\mathbb{E}[Y \ \vert\ X ]] = \operatorname{Var}[h(X)]= \mathbb{E}[h(X)^2] - \mathbb{E}[h(X)]^2 = \mathbb{E}[h(X)^2] - \mathbb{E}[Y]^2\]

  • and also

\[\mathbb{E}[\operatorname{Var}[ Y \ \vert\ X]] = \mathbb{E}\big[ \mathbb{E}[Y^2 \ \vert\ X] - \mathbb{E}[Y\ \vert\ X]^2\big]= \mathbb{E}[Y^2] - \mathbb{E}[h(X)^2]\]

  • Thus

\[\begin{aligned} \operatorname{Var}[\mathbb{E}[Y \ \vert\ X ]] + \mathbb{E}[\operatorname{Var}[ Y \ \vert\ X]] &= \mathbb{E}[h(X)^2] - \mathbb{E}[Y]^2 + \mathbb{E}[Y^2] - \mathbb{E}[h(X)^2] \\ & \\ &= \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = \operatorname{Var}[Y] \end{aligned}\]

Example (continued)

\(v\) \(\mathbb{E}\left( W | V = v \right)\) \(f_V(v)\)
2 1.00 1/36
3 2.00 2/36
4 2.67 3/36
5 3.50 4/36
6 4.20 5/36
7 5.00 6/36
8 5.20 5/36
9 5.50 4/36
10 5.67 3/36
11 6.00 2/36
12 6.00 1/36
  • Then

\[ \begin{aligned} \mathbb{E}[ \, \mathbb{E}[ W | V] ] &= 1 \times 1/36 + 2 \times 2/36 \\ & \qquad + \tfrac{8}{3} \times 3 / 36 + \ldots + \\ & \qquad + 6 \times 2/36 + 6 \times 1/36 = \tfrac{161}{36} \approx 4.4722 \end{aligned} \]

  • Check that indeed

\[ \begin{aligned} \mathbb{E}[ W ] &= 1 \times 1/36 + 2 \times 3/36 + \\ & \qquad 3 \times 5/36 + \ldots = \tfrac{161}{36} \approx 4.4722 \end{aligned} \]
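
Exact arithmetic removes the rounding noise entirely; the sketch below (an addition, not from the slides) verifies both the tower property and the variance decomposition for \((W, V)\).

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint PMF of (W, V) = (max, sum) for two fair dice, and the marginal of V.
joint = defaultdict(Fraction)
for x, y in product(range(1, 7), repeat=2):
    joint[(max(x, y), x + y)] += Fraction(1, 36)
fV = defaultdict(Fraction)
for (w, v), p in joint.items():
    fV[v] += p

def cond_moment(v, k):
    """E[W^k | V = v], computed from the joint PMF."""
    return sum(w ** k * p for (w, vv), p in joint.items() if vv == v) / fV[v]

E_W = sum(w * p for (w, v), p in joint.items())
E_of_condE = sum(cond_moment(v, 1) * fV[v] for v in fV)
print(E_W, E_of_condE)                         # both equal 161/36 (about 4.4722)

var_W = sum(w ** 2 * p for (w, v), p in joint.items()) - E_W ** 2
var_of_condE = sum(cond_moment(v, 1) ** 2 * fV[v] for v in fV) - E_of_condE ** 2
E_of_condVar = sum((cond_moment(v, 2) - cond_moment(v, 1) ** 2) * fV[v] for v in fV)
print(var_W == var_of_condE + E_of_condVar)    # True
```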

Best predictors

  • Suppose we observe \(X\) and want to predict \(Y\)

  • Does it sound familiar?

  • The prediction should be based on \(X\)

  • In other words, it should be a function of \(X\): \(g(X)\)

  • The question is: what is the best function \(g\) we can use?

  • Well… first: what do we mean by “best”?

Best predictors

  • More precisely, the question is: what is the solution to

\[ \arg \min_{g \in {\cal G}} \, \mathbb{E}\left[ \left( Y - g \left( X \right) \right)^2 \right] \]

over a class of functions \({\cal G} = \left\{ g: \mathbb{R} \to \mathbb{R} \right\}\)

  • Solution: for any (measurable) function \(g\)

\[ \mathbb{E}\left[ \, \left( Y - \mathbb{E}\left( Y | X \right) \right)^{2} \, \right] \, \le \, \mathbb{E}\left[ \, \left( Y - g\left( X \right) \right) ^{2} \, \right] \]

Best predictors

  • Take any (measurable) function \(g\), then:

\[ \begin{aligned} \mathbb{E}\left[ \left( Y-g\left( X \right) \right) ^{2}\right] &= \mathbb{E}\left\{ \mathbb{E}\left. \left[ \left( Y-g\left( X \right) \right) ^{2} \right| X \right] \right\} \\ & \\ & = \mathbb{E}\left\{ \mathbb{E}\left. \Bigl[ \Bigl( Y - \mathbb{E}\left( Y | X \right) + \mathbb{E}\left( Y | X \right) - g\left( X \right) \Bigr) ^{2} \right| X \Bigr] \right\} \\ & \\ &= \mathbb{E}\left\{ \mathbb{E}\left[ \left. \Bigl( Y - \mathbb{E}\left( Y | X \right) \Bigr)^2 \right| X \right] \right\} \\ & \qquad + \mathbb{E}\left\{ \mathbb{E}\left[ \left. \Bigl( \mathbb{E}\left( Y | X \right) - g(X) \Bigr)^2 \right| X \right] \right\} \\ & \qquad + 2 \, \mathbb{E}\Bigl\{ \mathbb{E}\Bigl[ \Bigl. \bigl\{ Y - \mathbb{E}\left( Y | X \right) \bigr\} \, \bigl\{ \mathbb{E}\left( Y | X \right) - g(X) \bigr\} \Bigr| X \Bigr] \Bigr\} \end{aligned} \]

Best predictors

  • Since

\[ \mathbb{E}\left\{ \Bigl( \mathbb{E}\left( Y | X \right) - g(X) \Bigr)^2 \right\} \ge 0 \]

the expression on the previous slide is

\[ \ge \quad \mathbb{E}\left[ \Bigl( Y - \mathbb{E}\left( Y | X \right) \Bigr)^2 \right] + 2 \, \mathbb{E}\Bigl\{ \bigl\{ \mathbb{E}\left( Y | X \right) - g(X) \bigr\} \ \mathbb{E}\Bigl[ \bigl\{ Y - \mathbb{E}\left( Y | X \right) \bigr\} \, \Bigr| X \Bigr] \Bigr\} \] where the factor \(\mathbb{E}\left( Y | X \right) - g(X)\), being a function of \(X\), was pulled out of the inner conditional expectation.

Best predictors

  • Now note that

\[ \mathbb{E}\Bigl[ \bigl\{ Y - \mathbb{E}\left( Y | X \right) \bigr\} \, \Bigr| X \Bigr] = \mathbb{E}\left( Y | X \right) - \mathbb{E}\left( Y | X \right) \, = 0 \]

so

\[ \mathbb{E}\Bigl\{ \bigl\{ \mathbb{E}\left( Y | X \right) - g(X) \bigr\} \, \mathbb{E}\Bigl[ \bigl\{ Y - \mathbb{E}\left( Y | X \right) \bigr\} \, \Bigr| X \Bigr] \Bigr\} = 0 \]

  • Putting it all together we get that for any (measurable) function \(g\):

\[ \mathbb{E}\left[ \left( Y -g\left( X \right) \right)^{2}\right] \ \ge \ \mathbb{E}\left[ \left( Y - \mathbb{E}\left( Y | X \right) \right)^2 \right] \]
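
To see the inequality in action, one can compare the mean squared prediction error of \(h(V) = \mathbb{E}(W \mid V)\) with that of some other predictor; in the sketch below (an added illustration) the competitor \(g(v) = v/2\) is an arbitrary choice, not something from the notes.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Predict W = max from V = sum for two fair dice: compare the MSE of the
# conditional mean h(V) = E[W | V] with an arbitrary competitor g(V) = V / 2.
joint = defaultdict(Fraction)
for x, y in product(range(1, 7), repeat=2):
    joint[(max(x, y), x + y)] += Fraction(1, 36)
fV = defaultdict(Fraction)
for (w, v), p in joint.items():
    fV[v] += p

h = {v: sum(w * p for (w, vv), p in joint.items() if vv == v) / fV[v] for v in fV}
g = lambda v: Fraction(v, 2)

mse_h = sum((w - h[v]) ** 2 * p for (w, v), p in joint.items())
mse_g = sum((w - g(v)) ** 2 * p for (w, v), p in joint.items())
print(mse_h, mse_g, mse_h <= mse_g)   # the conditional mean has the smaller MSE
```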

Hierarchical models

  • Example:

\[\begin{aligned} X &=\text{\# of daily visits to a website},\\ Y &=\text{\# daily sales made on the website}. \end{aligned}\]

  • Suppose that

\[\begin{aligned} X &\sim {\mathrm{Poiss}}( \lambda ) & \lambda &> 0,\\ Y \ \vert\ X &\sim {\mathrm{Binom}}( X,\ p) &p &\in [0, 1]. \end{aligned}\]


  • Find \(\mathbb{E}[Y]\) and \(\operatorname{Var}[Y]\)

Hierarchical models

  • Because \(X\) is Poisson:

\[\mathbb{E}[X] = \lambda, \qquad \operatorname{Var}[X] = \lambda\]

  • Because \(Y\ \vert\ X\) is \({\mathrm{Binom}}(X,p)\)

\[\mathbb{E}\left[ Y\ \vert\ X \right] = X \, p, \qquad \operatorname{Var}\left[ Y\ \vert\ X \right] = X \, p \, (1-p)\]

  • Therefore:

\[\mathbb{E}[Y] = \mathbb{E}[ \mathbb{E}[Y \ \vert\ X]] = \mathbb{E}[ X \, p] \, = \, p \, \mathbb{E}[X] = p \, \lambda\]

Hierarchical models

  • Finally:

\[\begin{aligned} \operatorname{Var}[Y] &=\mathbb{E}\left[ \operatorname{Var}[Y\ \vert\ X] \right] + \operatorname{Var}\left[ \mathbb{E}[ Y \ \vert\ X] \right] \\ & \\ &=\mathbb{E}\left[ X \, p \, ( 1-p) \right] + \operatorname{Var}\left[ X\, p \right] \\ & \\ &=p(1-p) \mathbb{E}[X] +p^{2}\operatorname{Var}[X] \\ & \\ &=p(1-p) \lambda +p^{2}\lambda \\ & \\ &=\lambda \, \left[ p \, (1-p) + p^{2} \right] \\ & \\ &=\lambda \, p \end{aligned}\]
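
A Monte Carlo simulation is a simple way to corroborate these two formulas; the sketch below is an added illustration, with \(\lambda = 30\) and \(p = 0.1\) chosen arbitrarily, and a hand-rolled Poisson sampler because the Python standard library does not provide one.

```python
import math
import random

# Simulate X ~ Poisson(lam), Y | X ~ Binomial(X, p) and compare the sample
# mean and variance of Y with lam * p. Parameter values are illustrative only.
random.seed(1)
lam, p, n_rep = 30.0, 0.1, 100_000

def draw_poisson(mu):
    """Poisson draw via Knuth's product-of-uniforms method (fine for moderate mu)."""
    threshold, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

ys = []
for _ in range(n_rep):
    x = draw_poisson(lam)                                  # daily visits
    ys.append(sum(random.random() < p for _ in range(x)))  # sales among those visits

m = sum(ys) / n_rep
v = sum((y - m) ** 2 for y in ys) / (n_rep - 1)
print(m, v)   # both should be close to lam * p = 3.0
```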