Lecture 1: Probability Review

Author

Geoff Pleiss

Published

September 2, 2025

Learning Objectives

By the end of this lecture, you will be able to:

  1. Define random variables and understand when to use them to model quantities
  2. Derive other probability rules from the product and sum rules
  3. Apply linearity of expectation and the law of total expectation to simplify calculations

Random Variables

Motivation

  • Example: Let’s say you want to grab a coffee at Loafe and you want to know how long you’ll have to wait in line.
  • Denote this time by the variable \(A\)
  • \(A\) depends on a multitude of factors:
    • How hot it is outside
    • What day of the week it is
    • How late Josh and his friends were up playing video games, thus leading them to take up spots in line
  • While we could try to model all of these factors, it would be infeasible to do so.
  • Instead, we can treat \(A\) as a random variable: a variable whose value is randomly sampled from some distribution.
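The sketch below shows what "treating \(A\) as a random variable" looks like in code: rather than modeling the heat, the weekday, or Josh's sleep schedule, we simply draw values of \(A\) from a distribution. The exponential distribution and its 5-minute mean are illustrative assumptions, not anything specified in this lecture.

```python
import numpy as np

# A minimal sketch: model the Loafe wait time A as a random variable by
# sampling it from an assumed distribution (exponential with a 5-minute
# mean -- both choices are purely illustrative).
rng = np.random.default_rng(seed=0)
a_samples = rng.exponential(scale=5.0, size=10_000)  # realizations of A, in minutes

print(f"Average wait across samples: {a_samples.mean():.2f} minutes")
print(f"Fraction of waits longer than 10 minutes: {(a_samples > 10).mean():.3f}")
```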

Notation

  • We (almost) always denote random variables with uppercase letters (e.g., \(A\))
  • We (almost) always denote their realizations (i.e., specific values they can take) with lowercase letters (e.g., \(a\)).

Joint Random Variables

  • Throughout this class, most of the probability we will encounter will be concerned with relationships between two or more random variables.
  • Example: maybe we want to understand how the temperature outside, denoted by \(B\), affects the Loafe line length.
  • Again, \(B\) depends on many factors:
    • What time of year it is
    • Whether or not it’s sunny outside
    • How many flights Sarah took last year, thus leading to an increase in greenhouse gases
  • We can treat \(B\) as a random variable as well.
  • \(A\) and \(B\) are related to one another, potentially in a causal manner. If we treat them as joint random variables, we can derive many useful probabilistic representations of their relationship.

Distributions and The Two Rules of Probability

  • Given two random variables \(A\) and \(B\), we can describe their relationship through a joint probability distribution:

    \[ \begin{cases} P(A=a, B=b) & \text{for discrete random variables} \\ f_{A,B}(a, b) & \text{for continuous random variables} \end{cases} \]

  • Without loss of generality, we will use the discrete-style notation \(P(A=a, B=b)\) throughout the rest of this lecture (and throughout most of the course), even when we write integrals that treat the variables as continuous; in that case, read \(P\) as a density.

  • We can also describe \(A\) and \(B\) through:

    • Conditional distributions, i.e., \(P(A=a | B=b)\) or \(P(B=b | A=a)\)
    • Marginal distributions, i.e., \(P(A=a)\) or \(P(B=b)\)
  • While there are many fundamental rules of probability for manipulating these distributions, most of them can be derived from two basic rules: the product rule and the sum rule.

The Product Rule

The product rule allows us to decompose a joint distribution into the product of a conditional and marginal probability:

\[\begin{align*} P(A=a, B=b) &= P(A=a|B=b)P(B=b) \\ &= P(B=b|A=a)P(A=a) \end{align*}\]

  • This rule can be applied recursively in the case of more than two random variables; for example, \(P(A=a, B=b, C=c) = P(A=a | B=b, C=c) \, P(B=b | C=c) \, P(C=c)\).
  • This rule gives rise to lots of useful facts from probability theory.
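As a sanity check, the sketch below verifies both factorizations on a small made-up joint distribution; the specific numbers are arbitrary, and the marginals are computed by summing the table (the sum rule, introduced shortly).

```python
import numpy as np

# A toy check of the product rule on an invented discrete joint distribution
# over A (rows) and B (columns). Only the identities matter, not the numbers.
P_AB = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.10, 0.20, 0.10],
])
assert np.isclose(P_AB.sum(), 1.0)

P_A = P_AB.sum(axis=1)             # P(A=a): sum over b (the sum rule, below)
P_B = P_AB.sum(axis=0)             # P(B=b): sum over a
P_A_given_B = P_AB / P_B           # P(A=a | B=b); each column sums to 1
P_B_given_A = P_AB / P_A[:, None]  # P(B=b | A=a); each row sums to 1

# Product rule: both factorizations recover the joint distribution.
assert np.allclose(P_A_given_B * P_B, P_AB)
assert np.allclose(P_B_given_A * P_A[:, None], P_AB)
```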

Independence

  • We say that \(A\) and \(B\) are independent if \(P(A=a, B=b) = P(A=a) P(B=b)\); that is, their joint probability (density) is the product of their marginal probability (densities).

  • By the product rule, for independent random variables we have that

    \[ P(A=a) P(B=b) = P(A=a, B=b) = P(A=a | B=b) P(B=b), \]

    and, through some algebra (dividing both sides by \(P(B=b)\), assuming it is nonzero), that \(P(A=a) = P(A=a | B=b)\). (Similarly, \(P(B=b) = P(B=b | A=a)\).)

  • In other words, when \(A\) and \(B\) are independent, the occurrence of \(B\) does not affect the probability of \(A\), and vice versa.
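To make this concrete, the sketch below builds an independent joint distribution as the outer product of two made-up marginals and checks that conditioning on \(B\) leaves the distribution of \(A\) unchanged.

```python
import numpy as np

# Independence in action: if P(A=a, B=b) = P(A=a) P(B=b), then conditioning
# on B does not change the distribution of A. The marginals are made up.
P_A = np.array([0.2, 0.5, 0.3])
P_B = np.array([0.6, 0.4])

P_AB = np.outer(P_A, P_B)              # joint = product of marginals
P_A_given_B = P_AB / P_AB.sum(axis=0)  # P(A=a | B=b) for each column b

for b in range(len(P_B)):
    assert np.allclose(P_A_given_B[:, b], P_A)  # P(A=a | B=b) = P(A=a)
```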

Bayes’ Rule

We can derive Bayes’ formula

\[ P(B=b | A=a) = \frac{P(A=a | B=b) P(B=b)}{P(A=a)} \]

using the product rule by starting from the identity \(P(B=b, A=a) = P(A=a, B=b)\) and simplifying.
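Written out, the steps are: apply the product rule to both sides of the identity, then divide by \(P(A=a)\) (assuming it is nonzero):

\[\begin{align*} P(B=b | A=a) P(A=a) &= P(B=b, A=a) = P(A=a, B=b) = P(A=a | B=b) P(B=b) \\ \implies P(B=b | A=a) &= \frac{P(A=a | B=b) P(B=b)}{P(A=a)}. \end{align*}\]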

The Sum Rule

The sum rule allows us to obtain a marginal probability for \(A\) (or \(B\)) from a joint probability over \(A\) and \(B\):

\[ P(A=a) = \int_{b} P(A=a, B=b) \: \mathrm{d}b. \]

  • Here, we are again assuming \(A\) and \(B\) are continuous and \(P\) represents a density. The integral becomes a summation in the case of discrete random variables.
  • Again, this rule can be extended to three or more variables recursively.
  • This rule is instrumental in establishing properties of expectations, which we will see in the next two sections. A small sampling-based illustration of the sum rule appears below.
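Here is the sampling-based illustration referenced above: if we draw \(B\) from its marginal and then \(A\) from its conditional given \(B\), the \(A\) samples alone follow the marginal given by the sum rule. The distributions below are invented purely for illustration.

```python
import numpy as np

# Sum rule via sampling: draw (A, B) jointly, ignore B, and the A samples
# follow the marginal P(A=a) = sum_b P(A=a | B=b) P(B=b).
rng = np.random.default_rng(seed=1)

P_B = np.array([0.6, 0.4])            # marginal over two values of B
P_A_given_B = np.array([[0.7, 0.2],   # column b holds P(A=a | B=b)
                        [0.2, 0.5],
                        [0.1, 0.3]])

n = 50_000
b = rng.choice(2, size=n, p=P_B)                                 # B ~ P(B)
a = np.array([rng.choice(3, p=P_A_given_B[:, bi]) for bi in b])  # A | B=b

empirical = np.bincount(a, minlength=3) / n   # empirical marginal of A
analytic = P_A_given_B @ P_B                  # sum rule (as a matrix product)
print(empirical, analytic)                    # agree up to sampling noise
```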

Linearity of Expectation

  • We define the expected value of \(A\) as:

    \[ \mathbb{E}[A] := \int_{a} a \: P(A=a) \: \mathrm{d}a. \]

  • Similarly, the expected value of some function of \(A\) and \(B\) is defined as:

    \[ \mathbb{E}[f(A, B)] := \int_a \int_b f(a, b) P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a. \]

  • We can use the sum rule in conjunction with Fubini’s theorem to derive one of the most important formulas in all of probability:

    \[\begin{align*} \mathbb{E}[A + B] &= \int_a \int_b (a + b) P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a \int_b a P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &\quad + \int_a \int_b b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a a \int_b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &\quad + \int_b b \int_a P(A=a, B=b) \: \mathrm{d}a \: \mathrm{d}b \\ &= \int_a a P(A=a) \: \mathrm{d}a + \int_b b P(B=b) \: \mathrm{d}b \\ &= \mathbb{E}[A] + \mathbb{E}[B] \end{align*}\]

  • This formula, known as linearity of expectation, holds even when \(A\) and \(B\) are not independent! We will use this fact constantly throughout the course. (A short simulation illustrating this appears after the challenge problem below.)

  • As a fun challenge problem, try using this formula to derive the famous inclusion-exclusion principle (here \(A\) and \(B\) denote events rather than random variables):

    \[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]

    Hint: note that \(P(A) = \mathbb{E}[\mathbf{1}_A]\), where \(\mathbf{1}_A\) is the indicator random variable for event \(A\). Similarly, \(P(A \cap B) = \mathbb{E}[\mathbf{1}_A \mathbf{1}_B]\) and \(P(A \cup B) = 1 - P(\overline{A} \cap \overline{B})\).
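Here is the simulation promised above: \(A\) is constructed to depend strongly on \(B\), yet \(\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]\) still holds (up to Monte Carlo error). The particular distributions are arbitrary.

```python
import numpy as np

# Linearity of expectation for *dependent* random variables: A is built
# directly from B, yet E[A + B] = E[A] + E[B]. Distributions are arbitrary.
rng = np.random.default_rng(seed=2)

b = rng.normal(loc=3.0, scale=1.0, size=100_000)
a = 2.0 * b + rng.exponential(scale=1.0, size=100_000)  # A depends on B

print(f"E[A + B] ≈ {np.mean(a + b):.3f}")
print(f"E[A] + E[B] ≈ {np.mean(a) + np.mean(b):.3f}")   # matches up to noise
```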

The Tower Rule

  • Recall the definition of the expected value of \(A\):

    \[ \mathbb{E}[A] := \int_{a} a \: P(A=a) \: \mathrm{d}a. \]

  • Using the sum rule and product rule gives us:

    \[\begin{align*} \mathbb{E}[A] &= \int_a a \int_b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a a \int_b P(A=a \mid B=b) P(B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_b \left( \int_a a P(A=a \mid B=b) \: \mathrm{d}a \right) P(B=b) \: \mathrm{d}b \\ &= \int_b \mathbb{E}[A \mid B=b] P(B=b) \: \mathrm{d}b \\ &= \mathbb{E} \left[ \mathbb{E}[ A \mid B ] \right] \end{align*}\]

  • This rule is known as the Tower Rule (also called the law of total expectation). It allows us to express marginal expectations as recursive applications of conditional expectations; a small numerical check appears after this list.

  • This notation is often confusing and scary the first few times you encounter it. Try translating it back into probabilities via the sum and product rules, and you'll be fluent in no time!
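Here is the numerical check referenced above, reusing a small made-up joint distribution over \(A\) (taking the values 0, 1, 2) and \(B\): computing \(\mathbb{E}[A]\) directly and via \(\mathbb{E}[\mathbb{E}[A \mid B]]\) gives the same answer.

```python
import numpy as np

# Tower Rule check on an invented discrete joint distribution:
# E[A] computed directly equals E[ E[A | B] ].
a_vals = np.array([0.0, 1.0, 2.0])   # the values A can take
P_AB = np.array([
    [0.10, 0.05, 0.05],              # rows index a, columns index b
    [0.20, 0.15, 0.05],
    [0.10, 0.20, 0.10],
])

P_B = P_AB.sum(axis=0)               # P(B=b), by the sum rule
P_A = P_AB.sum(axis=1)               # P(A=a), by the sum rule
P_A_given_B = P_AB / P_B             # P(A=a | B=b), by the product rule

E_A_direct = a_vals @ P_A            # E[A]
E_A_given_B = a_vals @ P_A_given_B   # E[A | B=b] for each b
E_A_tower = E_A_given_B @ P_B        # E[ E[A | B] ]

assert np.isclose(E_A_direct, E_A_tower)
print(E_A_direct, E_A_tower)
```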

Conclusion

  • We will use (jointly-distributed) random variables to model data, models that depend on data, and predictions that depend on models that depend on data.
  • You will need to manipulate marginal, joint, and conditional probabilities and expectations of these random variables throughout the course.
  • This review has covered most of the probability rules that we’ll use, but just remember that you can always derive any of them through the product and sum rules!