Lecture 1: Probability Review

Author

Geoff Pleiss

Published

September 2, 2025

Learning Objectives

By the end of this lecture, you will be able to:

  1. Define random variables and understand when to use them to model quantities
  2. Derive other probability rules from the product and sum rules
  3. Apply linearity of expectation and the law of total expectation to simplify calculations

Random Variables

Motivation

  • Example: Let’s say you want to grab a coffee at Loafe and you want to know how long you’ll have to wait in line.
  • Denote this time by the variable \(A\)
  • \(A\) depends on a multitude of factors:
    • How hot it is outside
    • What day of the week it is
    • How late Josh and his friends were up playing video games, thus leading them to take up spots in line
  • While we could try to model all of these factors, it would be infeasible to do so.
  • Instead, we can treat \(A\) as a random variable: a variable whose value is randomly sampled from some distribution.
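The sketch below shows what "treating \(A\) as a random variable" looks like in code: rather than modeling the heat, the weekday, or Josh's sleep schedule, we simply draw values of \(A\) from a distribution. The exponential distribution and its 5-minute mean are illustrative assumptions, not anything specified in this lecture.

```python
import numpy as np

# A minimal sketch: model the Loafe wait time A as a random variable by
# sampling it from an assumed distribution (exponential with a 5-minute
# mean -- both choices are purely illustrative).
rng = np.random.default_rng(seed=0)
a_samples = rng.exponential(scale=5.0, size=10_000)  # realizations of A, in minutes

print(f"Average wait across samples: {a_samples.mean():.2f} minutes")
print(f"Fraction of waits longer than 10 minutes: {(a_samples > 10).mean():.3f}")
```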

Notation

  • We (almost) always denote random variables with uppercase letters (e.g., \(A\))
  • We (almost) always denote their realizations (i.e., specific values they can take) with lowercase letters (e.g., \(a\)).

Joint Random Variables

  • Throughout this class, most of the probability we will encounter will be concerned with relationships between two or more random variables.
  • Example: maybe we want to understand how the temperature outside, denoted by \(B\), affects the Loafe line length.
  • Again, \(B\) depends on many factors:
    • What time of year it is
    • Whether or not it’s sunny outside
    • How many flights Sarah took last year, thus leading to an increase in greenhouse gases
  • We can treat \(B\) as a random variable as well.
  • \(A\) and \(B\) are related to one another, potentially in a causal manner. If we treat them as joint random variables, we can derive many useful probabilistic representations of their relationship.

Distributions and The Two Rules of Probability

  • Given two random variables \(A\) and \(B\), we can describe their relationship through a joint probability distribution:

    \[ \begin{cases} P(A=a, B=b) & \text{for discrete random variables} \\ f_{A,B}(a, b) & \text{for continuous random variables} \end{cases} \]

  • Without loss of generality, we will use the discrete-style notation \(P(A=a, B=b)\) throughout the rest of this lecture (and throughout most of the course), even when we write integrals that treat the variables as continuous; in that case, read \(P\) as a density.

  • We can also describe \(A\) and \(B\) through:

    • Conditional distributions, i.e., \(P(A=a | B=b)\) or \(P(B=b | A=a)\)
    • Marginal distributions, i.e., \(P(A=a)\) or \(P(B=b)\)
  • While there are many fundamental rules of probability for manipulating these distributions, most of them can be derived from two basic rules: the product rule and the sum rule.

The Product Rule

The product rule allows us to decompose a joint distribution into the product of a conditional and marginal probability:

\[\begin{align*} P(A=a, B=b) &= P(A=a|B=b)P(B=b) \\ &= P(B=b|A=a)P(A=a) \end{align*}\]

  • This rule can be applied recursively in the case of more than two random variables; for example, \(P(A=a, B=b, C=c) = P(A=a | B=b, C=c) \, P(B=b | C=c) \, P(C=c)\).
  • This rule gives rise to lots of useful facts from probability theory.
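As a sanity check, the sketch below verifies both factorizations on a small made-up joint distribution; the specific numbers are arbitrary, and the marginals are computed by summing the table (the sum rule, introduced shortly).

```python
import numpy as np

# A toy check of the product rule on an invented discrete joint distribution
# over A (rows) and B (columns). Only the identities matter, not the numbers.
P_AB = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.10, 0.20, 0.10],
])
assert np.isclose(P_AB.sum(), 1.0)

P_A = P_AB.sum(axis=1)             # P(A=a): sum over b (the sum rule, below)
P_B = P_AB.sum(axis=0)             # P(B=b): sum over a
P_A_given_B = P_AB / P_B           # P(A=a | B=b); each column sums to 1
P_B_given_A = P_AB / P_A[:, None]  # P(B=b | A=a); each row sums to 1

# Product rule: both factorizations recover the joint distribution.
assert np.allclose(P_A_given_B * P_B, P_AB)
assert np.allclose(P_B_given_A * P_A[:, None], P_AB)
```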

Independence

  • We say that \(A\) and \(B\) are independent if \(P(A=a, B=b) = P(A=a) P(B=b)\); that is, their joint probability (density) is the product of their marginal probability (densities).

  • By the product rule, for independent random variables we have that

    \[ P(A=a) P(B=b) = P(A=a, B=b) = P(A=a | B=b) P(B=b), \]

    and, through some algebra (dividing both sides by \(P(B=b)\), assuming it is nonzero), that \(P(A=a) = P(A=a | B=b)\). (Similarly, \(P(B=b) = P(B=b | A=a)\).)

  • In other words, when \(A\) and \(B\) are independent, the occurrence of \(B\) does not affect the probability of \(A\), and vice versa.
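To make this concrete, the sketch below builds an independent joint distribution as the outer product of two made-up marginals and checks that conditioning on \(B\) leaves the distribution of \(A\) unchanged.

```python
import numpy as np

# Independence in action: if P(A=a, B=b) = P(A=a) P(B=b), then conditioning
# on B does not change the distribution of A. The marginals are made up.
P_A = np.array([0.2, 0.5, 0.3])
P_B = np.array([0.6, 0.4])

P_AB = np.outer(P_A, P_B)              # joint = product of marginals
P_A_given_B = P_AB / P_AB.sum(axis=0)  # P(A=a | B=b) for each column b

for b in range(len(P_B)):
    assert np.allclose(P_A_given_B[:, b], P_A)  # P(A=a | B=b) = P(A=a)
```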

Bayes’ Rule

We can derive Bayes’ formula

\[ P(B=b | A=a) = \frac{P(A=a | B=b) P(B=b)}{P(A=a)} \]

using the product rule by starting from the identity \(P(B=b, A=a) = P(A=a, B=b)\) and simplifying.
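Written out, the steps are: apply the product rule to both sides of the identity, then divide by \(P(A=a)\) (assuming it is nonzero):

\[\begin{align*} P(B=b | A=a) P(A=a) &= P(B=b, A=a) = P(A=a, B=b) = P(A=a | B=b) P(B=b) \\ \implies P(B=b | A=a) &= \frac{P(A=a | B=b) P(B=b)}{P(A=a)}. \end{align*}\]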

The Sum Rule

The sum rule allows us to obtain a marginal probability for \(A\) (or \(B\)) from a joint probability over \(A\) and \(B\):

\[ P(A=a) = \int_{b} P(A=a, B=b) \: \mathrm{d}b. \]

  • Here, we are again assuming \(A\) and \(B\) are continuous and \(P\) represents a density. The integral becomes a summation in the case of discrete random variables.
  • Again, this rule can be extended to three or more variables recursively.
  • This rule is instrumental in establishing properties of expectations, which we will see in the next two sections. A small sampling-based illustration of the sum rule appears below.
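Here is the sampling-based illustration referenced above: if we draw \(B\) from its marginal and then \(A\) from its conditional given \(B\), the \(A\) samples alone follow the marginal given by the sum rule. The distributions below are invented purely for illustration.

```python
import numpy as np

# Sum rule via sampling: draw (A, B) jointly, ignore B, and the A samples
# follow the marginal P(A=a) = sum_b P(A=a | B=b) P(B=b).
rng = np.random.default_rng(seed=1)

P_B = np.array([0.6, 0.4])            # marginal over two values of B
P_A_given_B = np.array([[0.7, 0.2],   # column b holds P(A=a | B=b)
                        [0.2, 0.5],
                        [0.1, 0.3]])

n = 50_000
b = rng.choice(2, size=n, p=P_B)                                 # B ~ P(B)
a = np.array([rng.choice(3, p=P_A_given_B[:, bi]) for bi in b])  # A | B=b

empirical = np.bincount(a, minlength=3) / n   # empirical marginal of A
analytic = P_A_given_B @ P_B                  # sum rule (as a matrix product)
print(empirical, analytic)                    # agree up to sampling noise
```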

Linearity of Expectation

  • We define the expected value of \(A\) as:

    \[ \mathbb{E}[A] := \int_{a} a \: P(A=a) \: \mathrm{d}a. \]

  • Similarly, the expected value of some function of \(A\) and \(B\) is defined as:

    \[ \mathbb{E}[f(A, B)] := \int_a \int_b f(a, b) P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a. \]

  • We can use the sum rule in conjunction with Fubini’s theorem to derive one of the most important formulas in all of probability:

    \[\begin{align*} \mathbb{E}[A + B] &= \int_a \int_b (a + b) P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a \int_b a P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &\quad + \int_a \int_b b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a a \int_b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &\quad + \int_b b \int_a P(A=a, B=b) \: \mathrm{d}a \: \mathrm{d}b \\ &= \int_a a P(A=a) \: \mathrm{d}a + \int_b b P(B=b) \: \mathrm{d}b \\ &= \mathbb{E}[A] + \mathbb{E}[B] \end{align*}\]

  • This formula, known as linearity of expectation, holds even when \(A\) and \(B\) are not independent! We will use this fact constantly throughout the course. (A short simulation illustrating this appears after the challenge problem below.)

  • As a fun challenge problem, try using this formula to derive the famous inclusion-exclusion principle (here \(A\) and \(B\) denote events rather than random variables):

    \[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]

    Hint: note that \(P(A) = \mathbb{E}[\mathbf{1}_A]\), where \(\mathbf{1}_A\) is the indicator random variable for event \(A\). Similarly, \(P(A \cap B) = \mathbb{E}[\mathbf{1}_A \mathbf{1}_B]\) and \(P(A \cup B) = 1 - P(\overline{A} \cap \overline{B})\).
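Here is the simulation promised above: \(A\) is constructed to depend strongly on \(B\), yet \(\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]\) still holds (up to Monte Carlo error). The particular distributions are arbitrary.

```python
import numpy as np

# Linearity of expectation for *dependent* random variables: A is built
# directly from B, yet E[A + B] = E[A] + E[B]. Distributions are arbitrary.
rng = np.random.default_rng(seed=2)

b = rng.normal(loc=3.0, scale=1.0, size=100_000)
a = 2.0 * b + rng.exponential(scale=1.0, size=100_000)  # A depends on B

print(f"E[A + B] ≈ {np.mean(a + b):.3f}")
print(f"E[A] + E[B] ≈ {np.mean(a) + np.mean(b):.3f}")   # matches up to noise
```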

The Tower Rule

  • Recall the definition of the expected value of \(A\):

    \[ \mathbb{E}[A] := \int_{a} a \: P(A=a) \: \mathrm{d}a. \]

  • Using the sum rule and product rule gives us:

    \[\begin{align*} \mathbb{E}[A] &= \int_a a \int_b P(A=a, B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_a a \int_b P(A=a \mid B=b) P(B=b) \: \mathrm{d}b \: \mathrm{d}a \\ &= \int_b \left( \int_a a P(A=a \mid B=b) \: \mathrm{d}a \right) P(B=b) \: \mathrm{d}b \\ &= \int_b \mathbb{E}[A \mid B=b] P(B=b) \: \mathrm{d}b \\ &= \mathbb{E} \left[ \mathbb{E}[ A \mid B ] \right] \end{align*}\]

  • This rule is known as the Tower Rule (also called the law of total expectation). It allows us to express marginal expectations as recursive applications of conditional expectations; a small numerical check appears after this list.

  • This notation is often confusing and scary the first few times you encounter it. Try translating it back into probabilities via the sum and product rules, and you'll be fluent in no time!
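Here is the numerical check referenced above, reusing a small made-up joint distribution over \(A\) (taking the values 0, 1, 2) and \(B\): computing \(\mathbb{E}[A]\) directly and via \(\mathbb{E}[\mathbb{E}[A \mid B]]\) gives the same answer.

```python
import numpy as np

# Tower Rule check on an invented discrete joint distribution:
# E[A] computed directly equals E[ E[A | B] ].
a_vals = np.array([0.0, 1.0, 2.0])   # the values A can take
P_AB = np.array([
    [0.10, 0.05, 0.05],              # rows index a, columns index b
    [0.20, 0.15, 0.05],
    [0.10, 0.20, 0.10],
])

P_B = P_AB.sum(axis=0)               # P(B=b), by the sum rule
P_A = P_AB.sum(axis=1)               # P(A=a), by the sum rule
P_A_given_B = P_AB / P_B             # P(A=a | B=b), by the product rule

E_A_direct = a_vals @ P_A            # E[A]
E_A_given_B = a_vals @ P_A_given_B   # E[A | B=b] for each b
E_A_tower = E_A_given_B @ P_B        # E[ E[A | B] ]

assert np.isclose(E_A_direct, E_A_tower)
print(E_A_direct, E_A_tower)
```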

Conclusion

  • We will use (jointly-distributed) random variables to model data, models that depend on data, and predictions that depend on models that depend on data.
  • You will need to manipulate marginal, joint, and conditional probabilities and expectations of these random variables throughout the course.
  • This review has covered most of the probability rules that we’ll use, but just remember that you can always derive any of them through the product and sum rules!