Module 09

MGFs and conditional expectation


TC and DJM

Last modified — 05 Apr 2026

1 Moment generating functions

The moment generating function

Important

For now, we will skip generating functions and characteristic functions.

  • (Beginning of 3.4) \(r_X(t) = \mathbb{E}[t^X]\) is the probability generating function of a random variable \(X\).
  • (Section 3.4.1) \(c_X(t) = \mathbb{E}[e^{itX}]\) is the characteristic function of \(X\).

Definition
The moment generating function (MGF) of a random variable \(X\) is defined by \[m_X(t) = \mathbb{E}[e^{tX}].\]

  • The MGF is a scalar-valued function of \(t\).

Example of an MGF

We saw in an Exercise earlier that if \(X \sim {\mathrm{Gam}}(\alpha, \lambda)\), then

\[\begin{aligned} m_X(t) &= \mathbb{E}[e^{tX}] = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}, \quad t < \lambda. \end{aligned}\]

  • \(m_X(t)\) depends on the parameters \(\alpha\) and \(\lambda\) of the distribution of \(X\).
  • It also depends on \(t\), which is a free variable that we can choose.
  • This seems like a weird object to care about, but it turns out to be very useful.
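The definition can be checked directly by simulation. The sketch below estimates \(\mathbb{E}[e^{tX}]\) by Monte Carlo for a Gamma random variable and compares it to the closed form above; the values \(\alpha = 2\), \(\lambda = 3\), \(t = 1\) are illustrative choices made here, not taken from the notes.

```python
import numpy as np

# Monte Carlo check of the Gamma MGF. The parameter values alpha = 2, lam = 3
# and t = 1 are illustrative choices for this sketch, not from the notes.
rng = np.random.default_rng(0)
alpha, lam = 2.0, 3.0
x = rng.gamma(shape=alpha, scale=1.0 / lam, size=1_000_000)  # X ~ Gam(alpha, lam)

t = 1.0  # any t < lam keeps the MGF finite
mc_mgf = np.exp(t * x).mean()             # E[e^{tX}] estimated by simulation
closed_form = (1 - t / lam) ** (-alpha)   # (1 - t/lam)^{-alpha}

print(mc_mgf, closed_form)
```

With a million draws the two numbers agree to about two decimal places, which is the kind of quick sanity check MGF formulas lend themselves to.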

Using the MGF to compute moments

Theorem
If \(X\) is a random variable with MGF \(m_X(t)\), and there exists \(s>0\) such that \(m_X(t)<\infty\) for all \(t \in (-s, s)\), then for any integer \(k \ge 1\), \[\mathbb{E}[X^k] = m_X^{(k)}(0) = \left.\frac{\mathsf{d}^k}{\mathsf{d}t^k} m_X(t)\right|_{t=0}.\]

  • \(\mathbb{E}[X^k]\) is called the \(k\)-th moment of \(X\).
  • The MGF is called the “moment generating function” because we can use it to compute the moments of \(X\).
  • Specifically, the first moment is \(\mathbb{E}[X] = m_X'(0)\), and the second moment is \(\mathbb{E}[X^2] = m_X''(0)\).

The variance of a Gamma using the MGF

Let \(X \sim {\mathrm{Gam}}(\alpha, \lambda)\) with MGF \(m_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}\).

\[\begin{aligned} \mathbb{E}[X] &= m_X'(0) = \frac{\alpha}{\lambda}\left(1-\frac{t}{\lambda}\right)^{-\alpha-1}\bigg|_{t=0} = \frac{\alpha}{\lambda}.\\ \mathbb{E}[X^2] &= m_X''(0) = \frac{\alpha(\alpha+1)}{\lambda^2}\left(1-\frac{t}{\lambda}\right)^{-\alpha-2}\bigg|_{t=0} = \frac{\alpha(\alpha+1)}{\lambda^2}.\\ \Longrightarrow \operatorname{Var}(X) &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \left(\frac{\alpha}{\lambda}\right)^2 = \frac{\alpha}{\lambda^2}. \end{aligned}\]
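The same differentiation can be carried out symbolically; the sketch below uses sympy (a tool choice made here, not something the notes prescribe) to reproduce all three quantities.

```python
import sympy as sp

# Symbolic differentiation of the Gamma MGF m_X(t) = (1 - t/lambda)^(-alpha).
t, alpha, lam = sp.symbols('t alpha lambda', positive=True)
m = (1 - t / lam) ** (-alpha)

m1 = sp.diff(m, t).subs(t, 0)       # first moment  E[X]   = alpha/lambda
m2 = sp.diff(m, t, 2).subs(t, 0)    # second moment E[X^2] = alpha*(alpha+1)/lambda^2
var = sp.simplify(m2 - m1 ** 2)     # Var(X)               = alpha/lambda^2

print(m1, m2, var)
```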

Mean and variance of a Normal

Exercise 1
Let \(X \sim \mathcal{N}(\mu, \sigma^2)\). Then,

\[m_X(t) = \mathbb{E}[e^{tX}] = \exp\left(\mu t + \frac{1}{2} \sigma^2 t^2\right).\]

Find \(\mathbb{E}[X]\) and \(\operatorname{Var}(X)\) using the MGF of \(X\).

Sums of independent random variables

Theorem
Let \(X\) and \(Y\) be independent random variables with MGFs \(m_X(t)\) and \(m_Y(t)\), respectively.

Then the MGF of \(X + Y\) is given by \[\begin{aligned} m_{X+Y}(t) &= \mathbb{E}[e^{t(X + Y)}] = \mathbb{E}[e^{tX} e^{tY}] = \mathbb{E}[e^{tX}] \mathbb{E}[e^{tY}] && \text{$X$ and $Y$ are independent} \\ &= m_X(t) m_Y(t). \end{aligned}\]

  • We saw before how to find the PMF/PDF of \(X + Y\) using convolution.
  • The MGF gives us an alternative way to find the distribution of \(X + Y\).
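The factorization is easy to probe numerically (this is a sanity check, not a proof); \(X \sim \mathrm{Exp}(2)\) and \(Y \sim \mathcal{N}(0,1)\) below are illustrative choices, not from the notes.

```python
import numpy as np

# Numerical sanity check that the MGF of a sum of independent random
# variables factors: m_{X+Y}(t) = m_X(t) m_Y(t).
rng = np.random.default_rng(1)
n = 1_000_000
x = rng.exponential(scale=0.5, size=n)   # Exp with rate 2 (mean 1/2)
y = rng.standard_normal(n)               # independent of x

t = 0.5
lhs = np.exp(t * (x + y)).mean()                   # m_{X+Y}(t), estimated
rhs = np.exp(t * x).mean() * np.exp(t * y).mean()  # m_X(t) * m_Y(t), estimated

print(lhs, rhs)  # both near (1 - t/2)^(-1) * exp(t^2/2)
```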

Properties of the MGF

  • \(m_X(0) = \mathbb{E}[e^{0 \, X}] = \mathbb{E}[1] = 1\) for any random variable \(X\).
  • If \(X_1, \ldots, X_n\) are independent random variables with common MGF \(m_{X}(t)\), then the MGF of \(S_n = \sum_{i=1}^n X_i\) is given by \[m_{S_n}(t) = (m_X(t))^n.\]
  • If \(X\) has MGF \(m_X(t)\), then \(Y = aX + b\) for any \(a, b \in {\mathbb{R}}\) has MGF \[m_Y(t) = e^{bt} m_X(at).\]
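As a check of the last property, taking \(X \sim \mathcal{N}(0,1)\) with \(a = \sigma\) and \(b = \mu\) should recover the \(\mathcal{N}(\mu, \sigma^2)\) MGF quoted in Exercise 1. A symbolic sketch (sympy is a tool choice made here):

```python
import sympy as sp

# Check m_{aX+b}(t) = e^{bt} m_X(at) with X ~ N(0, 1), a = sigma, b = mu;
# the result should match the N(mu, sigma^2) MGF.
t, mu, sigma = sp.symbols('t mu sigma', positive=True)

m_X = sp.exp(t ** 2 / 2)                       # MGF of N(0, 1)
m_Y = sp.exp(mu * t) * m_X.subs(t, sigma * t)  # e^{bt} m_X(at)
target = sp.exp(mu * t + sigma ** 2 * t ** 2 / 2)

print(sp.simplify(m_Y - target))  # 0
```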

Theorem
If \(X\) and \(Y\) are random variables with MGFs \(m_X(t)\) and \(m_Y(t)\), respectively, and there exists \(s > 0\) such that for all \(t \in (-s, s)\), \(m_X(t) = m_Y(t) < \infty\), then \(X\) and \(Y\) have the same distribution.

This is a very important result, as it allows us to identify the distribution of a random variable by finding its MGF.

Using these theorems together

Let \(X_1, \dots, X_n\) be independent identically distributed (i.i.d.) random variables with \(X_i \sim \mathcal{N}(\mu, \sigma^2)\) for all \(i\).

Note that \(m_{X_i}(t) = \exp\left(\mu t + \frac{1}{2} \sigma^2 t^2\right)\) for \(i = 1, \ldots, n\).

Exercise 2
Find the distribution of \(\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i\) using MGFs.

2 Conditional expectation

Conditional expectation and variance

Definition
If \(X\) and \(Y\) are two random variables, then the conditional expectation of \(X\) given \(Y = y\) is \[\begin{aligned} \mathbb{E}\left[ X | Y= y \right] &= \int_{-\infty }^{\infty } x f_{X|Y}\left( x | y \right) \mathsf{d}x, & \mathbb{E}\left[ X | Y= y \right] &= \sum_{x} x p_{X|Y}(x|y). \end{aligned}\]

Definition
If \(X\) and \(Y\) are two random variables, then the conditional variance of \(X\) given \(Y = y\) is \[\begin{aligned} \operatorname{Var}(X | Y = y ) &= \int_{-\infty }^{\infty } ( x - \mathbb{E}[X|Y=y] ) ^{2}f_{X|Y}\left( x|y \right) \mathsf{d}x,\\ \operatorname{Var}(X | Y = y ) &= \sum_{x} ( x - \mathbb{E}[X|Y=y] ) ^{2}p_{X|Y}(x|y). \end{aligned}\]

Conditionally Binomial

  • Let \(\Theta \sim {\mathrm{Unif}}(0, 1)\)
  • Let \(Y | \Theta = \theta \sim {\mathrm{Binom}}(n, \theta)\)

Claim:

  • \(\mathbb{E}[Y|\Theta=\theta] = n\theta\) and
  • \(\operatorname{Var}(Y|\Theta=\theta) = n\theta(1-\theta)\).

This is immediate: \(Y|\Theta=\theta\) is a Binomial random variable with parameters \(n\) and \(\theta\), whose mean and variance we already know.

  • We are just using the definitions of expectation and variance, but instead of using the PMF/PDF of \(Y\), we are using the PMF/PDF of \(Y|\Theta=\theta\).

Understanding conditional expectation and variance

  • The properties of expectation and variance that we have seen before also hold for conditional expectation and variance.

But there are some additional properties as well.

Important

The key is that \(\mathbb{E}[X|\Theta]\) and \(\operatorname{Var}(X|\Theta)\) are themselves random variables, because they depend on \(\Theta\).

\(\mathbb{E}[X|\Theta]\) and \(\operatorname{Var}(X|\Theta)\) have their own distributions.

  • \(W = \mathbb{E}[Y|\Theta] = n\Theta\) is a random variable that depends on \(\Theta\).
  • But \(\Theta \sim {\mathrm{Unif}}(0, 1)\), so \(W \sim {\mathrm{Unif}}(0, n)\)!
  • Using the Jacobian method, we can show that the PDF of \(V = \operatorname{Var}(Y|\Theta) = n\Theta(1-\Theta)\) is given by \[f_V(v) = \frac{2}{n\sqrt{1 - 4v/n}}, \quad 0 < v < n/4.\]
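These claims are easy to probe by simulation: draw many values of \(\Theta\), form \(W = n\Theta\) and \(V = n\Theta(1-\Theta)\), and compare moments against the stated distributions. The sketch below uses \(n = 10\) (an illustrative choice) and the fact that \(\mathbb{E}[\Theta(1-\Theta)] = 1/6\).

```python
import numpy as np

# Simulation of W = E[Y|Theta] = n*Theta and V = Var(Y|Theta) = n*Theta*(1-Theta)
# with Theta ~ Unif(0, 1); n = 10 is an illustrative choice.
rng = np.random.default_rng(2)
n = 10
theta = rng.uniform(size=1_000_000)

w = n * theta                # should behave like Unif(0, n)
v = n * theta * (1 - theta)

print(w.mean(), w.var())     # near n/2 = 5 and n^2/12 ~ 8.33
print(v.mean())              # near n/6 ~ 1.67, since E[Theta*(1-Theta)] = 1/6
```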

Hierarchical models

We refer to this general setup as a hierarchical model.

  1. We first draw \(\Theta\) from some distribution.
  2. Then we draw \(Y\) from a distribution that depends on \(\Theta\).
  3. We can find the distribution of \(Y | \Theta\) as well as those of its expectation.
Exercise 3
  • Let \(\Lambda \sim {\mathrm{Gam}}(2, 1)\).
  • Let \(X | \Lambda \sim {\mathrm{Exp}}(1/\Lambda)\).

Find the distribution of \(W = \mathbb{E}[X|\Lambda]\) and \(\mathbb{E}[W]\).

Law of total expectation

Using the definition of the joint distribution of \(X\) and \(\Lambda\), we can show that \[\begin{aligned} f_{X,\Lambda}(x, \lambda) &= f_{X|\Lambda}(x|\lambda) f_{\Lambda}(\lambda) \\ &= \frac{1}{\lambda} e^{-x/\lambda} \cdot \frac{1}{\Gamma(2)} \lambda e^{-\lambda} I_{[0,\infty)}(x)I_{[0,\infty)}(\lambda) \\ &= e^{-x/\lambda}e^{-\lambda} I_{[0,\infty)}(x)I_{[0,\infty)}(\lambda). \end{aligned}\]

Using the definition of expectation, we can find \(\mathbb{E}[X]\): \[\begin{aligned} \mathbb{E}[X] &= \int_0^\infty \int_0^\infty x e^{-x/\lambda}e^{-\lambda} I_{[0,\infty)}(x)I_{[0,\infty)}(\lambda) \mathsf{d}\lambda \mathsf{d}x = \cdots. \end{aligned}\]

That is: \[\mathbb{E}[X] = \mathbb{E}[W] = \mathbb{E}[\mathbb{E}[X|\Lambda]].\]
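The identity can be checked by Monte Carlo, drawing \(\Lambda\) from the density \(f_\Lambda(\lambda) = \lambda e^{-\lambda}\) used in the joint-density computation above (a Gamma with shape \(2\) and rate \(1\), so \(\mathbb{E}[\Lambda] = 2\)):

```python
import numpy as np

# Monte Carlo check of E[X] = E[E[X|Lambda]] = E[Lambda], drawing Lambda from
# the density lambda * e^{-lambda} (Gamma with shape 2 and rate 1, mean 2).
rng = np.random.default_rng(3)
lam = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)  # Lambda
x = rng.exponential(scale=lam)                         # X | Lambda = lam has mean lam

print(x.mean(), lam.mean())  # both near 2
```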

Law of total expectation and variance (tower property)

Theorem
Let \(X\) and \(Y\) be two random variables. Then, \[\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]\] and \[\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X|Y)] + \operatorname{Var}(\mathbb{E}[X|Y]).\]

  • The first equation generalizes what we just saw; it holds for any \(X\) and \(Y\).
  • The second equation, the law of total variance, is a bit more complicated, but it is also very useful.
  • It decomposes the variance of \(X\) into two parts: the expected value of the conditional variance of \(X\) given \(Y\), plus the variance of the conditional expectation of \(X\) given \(Y\).
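Both identities can be probed on the conditionally Binomial example from earlier: \(\Theta \sim {\mathrm{Unif}}(0,1)\) and \(Y | \Theta \sim {\mathrm{Binom}}(n, \Theta)\), with \(n = 10\) as an illustrative choice, so the decomposition predicts \(\operatorname{Var}(Y) = n/6 + n^2/12 = 10\).

```python
import numpy as np

# Monte Carlo check of the law of total variance on the conditionally
# Binomial example: Theta ~ Unif(0, 1), Y | Theta ~ Binom(n, Theta).
rng = np.random.default_rng(4)
n = 10
theta = rng.uniform(size=1_000_000)
y = rng.binomial(n, theta)

e_cond_var = (n * theta * (1 - theta)).mean()  # E[Var(Y|Theta)] -> n/6
var_cond_e = (n * theta).var()                 # Var(E[Y|Theta]) -> n^2/12

print(y.var(), e_cond_var + var_cond_e)        # both near n/6 + n^2/12 = 10
```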