Inequalities and asymptotics
Last modified — 26 Nov 2025
How far can \(X\) be from its expected value \(\mathbb{E}[ X ]\)?
Better: how likely is it that \(X\) is \(\varepsilon\) far from \(\mathbb{E}[ X ]\)?
\[ \mathbb{P}\Bigl( \bigl| X - \mathbb{E}[X] \bigr| > \varepsilon \Bigr) \quad ? \]
\[ \mathbb{E}\Bigl( X - \mathbb{E}[X] \Bigr)^2 \ = \ V(X) \]
Let \(X\) be a continuous or discrete random variable with pdf/pmf \(f_X\).
If there is a set \(A\) such that \(\mathbb{P}\left( X \in A \right) = 1\), and if \(h\) is a function such that \(\mathbb{E}[ h(X) ]\) exists, then
\[ \int_{A^c} h(t) \, f_X(t) \, dt = 0 \] or
\[ \sum_{k \in A^c} h(k) \, f_X(k) = 0 \]
and therefore
\[ \mathbb{E}[ h(X) ] \, = \, \int_{A} h(t) \, f_X(t) \, dt \]
or
\[ \mathbb{E}[ h(X) ] = \sum_{k \in A} h(k) \, f_X(k) \]
Proof (continuous case; the discrete case is analogous): \[ \mathbb{E}[ h(X) ] = \int h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt + \int_{A^c} h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt \]
If \(\mathbb{E}[ g(X) ]\) exists and \[ \mathbb{P}\Bigl( g(X) \ge 0 \Bigr) = 1 \] then \[ \mathbb{E}[ g(X) ] = \int g(t) \, f_X(t) \, dt \ \ge \ 0 \]
This implies that, if \(\mathbb{E}[ h(X) ]\) exists, then \(\int_{h(t) \ge 0} h(t) \, f_X(t) \, dt \, \ge \, 0\)
Proof: \[ \int_{h(t) \ge 0} h(t) \, f_X(t) \, dt = \int \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \, f_X(t) \, dt \ge 0 \] because \(g(t) = \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \ge 0\) for all \(t \in \mathbb{R}\)
If \(\mathbb{E}[ g(X) ]\) and \(\mathbb{E}[ h(X) ]\) exist and \(\mathbb{P}\bigl( g(X) \ge h(X) \bigr) = 1\), then \(\mathbb{E}[ g(X) ] \ge \mathbb{E}[ h(X) ]\).
Proof: Let \(b(t) = g(t) - h(t)\) and \[ A = \bigl\{ \, t \in \mathbb{R} : b(t) \ge 0 \, \bigr\} \] then \(\mathbb{P}( X \in A ) = 1\) and \[ \mathbb{E}[ g(X) ] - \mathbb{E}[ h(X)] = \mathbb{E}[ b(X) ] = \int_{A} b(t) \, f_X(t) \, dt \ge 0 \]
Markov's inequality: Let \(X\) be a random variable such that \[ \mathbb{P}\Bigl( \, X \ge 0 \, \Bigr) = 1 \]
Then, for any \(a > 0\) we have
\[ \mathbb{P}\Bigl( \, X \ge a \, \Bigr) \le \frac{ \mathbb{E}[ X ] }{a} \]
Proof:
\[ \begin{aligned} \mathbb{E}[ X ] = \int t \, f_X(t) \, dt &= \int_{0 \le t} t \, f_X(t) \, dt \\ & \\ & = \int_{0 \le t < a} t \, f_X(t) \, dt \ + \ \int_{a \le t} t \, f_X(t) \, dt \\ & \\ & \ge \int_{a \le t} t \, f_X(t) \, dt \ge a \, \int_{a \le t} f_X(t) \, dt = a \, \mathbb{P}( X \ge a) \end{aligned} \]
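As a quick numerical check, the following Python sketch (assuming SciPy is available; the Exponential(1) choice is purely illustrative) compares the exact tail with the Markov bound \(\mathbb{E}[X]/a\):

```python
# Numerical sanity check of Markov's inequality for X ~ Exponential(1),
# where E[X] = 1 and P(X >= a) = exp(-a).
from scipy import stats

X = stats.expon()      # Exponential distribution with mean 1
mean = X.mean()        # E[X] = 1

for a in [1, 2, 5, 10]:
    exact = X.sf(a)    # exact tail P(X >= a) = exp(-a)
    bound = mean / a   # Markov bound E[X] / a
    print(f"a = {a:2d}   exact = {exact:.5f}   Markov bound = {bound:.5f}")
```

The bound always holds, but it can be quite loose far out in the tail.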
Chebyshev's inequality: Let \(X\) be a random variable with \(\mathbb{E}[X] = \mu\) and \(V(X) = \sigma^2\). Then, for any \(k > 0\) we have
\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) \ \le \ \frac{\sigma^2}{k^2} \]
Proof:
Note that \(\mathbb{P}\left( (X - \mu)^2 \ge 0 \right) = 1\), hence, by Markov’s inequality:
\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) = \mathbb{P}\Bigl( \, \left( X - \mu \right)^2 \ge k^2 \, \Bigr) \le \frac{\mathbb{E}[ ( X - \mu)^2 ]}{k^2} = \frac{\sigma^2}{k^2} \]
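The same kind of check can be done for Chebyshev's inequality (a sketch assuming SciPy; the standard normal is an illustrative choice):

```python
# Numerical sanity check of Chebyshev's inequality for X ~ N(0, 1):
# P(|X - mu| >= k) should never exceed sigma^2 / k^2.
from scipy import stats

X = stats.norm(loc=0, scale=1)
sigma2 = X.var()

for k in [1, 2, 3, 4]:
    exact = 2 * X.sf(k)      # P(|X| >= k), by symmetry of the normal
    bound = sigma2 / k**2    # Chebyshev bound
    print(f"k = {k}   exact = {exact:.5f}   Chebyshev bound = {bound:.5f}")
```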
Example: Suppose the number of customers who come to a station on a given day is a random variable \(X\) with mean \(60\) and variance \(16\).
What can you say about the probability that at least 70 customers come on any given day?
What can you say about the probability that the number of customers coming to the station in a given day is between 50 and 70?
Using Markov’s inequality, since \(\mathbb{P}( X \ge 0 ) = 1\): \[ \mathbb{P}\Bigl( X \ge 70 \Bigr) \le \frac{60}{70} \approx 0.86 \]
Using Chebyshev’s inequality \[ \begin{aligned} \mathbb{P}\Bigl( 50 < X < 70 \Bigr) &= \mathbb{P}\Bigl( | X - 60 | < 10 \Bigr) \\ & \\ &= 1 - \mathbb{P}\Bigl( | X - 60 | \ge 10 \Bigr) \\ & \\ & \ge 1 - \frac{16}{10^2} = 0.84 \end{aligned} \]
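Both bounds reduce to one-line computations; a minimal Python sketch (variable names are mine):

```python
# Station example: mean = 60 customers, variance = 16 (values from the notes).
mean, var = 60, 16

markov_bound = mean / 70             # P(X >= 70) <= 60/70 ~ 0.857
chebyshev_bound = 1 - var / 10**2    # P(50 < X < 70) >= 1 - 16/100 = 0.84

print(f"Markov:    P(X >= 70)     <= {markov_bound:.3f}")
print(f"Chebyshev: P(50 < X < 70) >= {chebyshev_bound:.2f}")
```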
Example: A measurement of the distance to a distant star is a random variable with mean \(\mu\) (the unknown true distance) and variance \(4\) (light years)\(^2\).
An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,...,n\) of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.
How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n-\mu \bigr\vert <0.5\text{ light years} \right) \, \ge \, 0.95 \ ? \]
Recall that \[ \mathbb{E}\left[ \overline{X}_n \right] \, = \, \mu \] and \[ V \left[ \overline{X}_n \right] \, = \, \frac{4}{n} \]
Thus, using Chebyshev’s inequality applied to \(\overline{X}_n\): \[ \mathbb{P}\biggl( \bigl| \overline{X}_n - \mu \bigr| < 1/2 \biggr) \, \ge \, 1 - \frac{4}{n \, (1/2)^2} = 1 - \frac{16}{n} \] Hence, to have \(1 - 16/n \ge 0.95\) she needs \(n \ge 16/0.05 = 320\) independent measurements.
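A minimal Python sketch of this sample-size computation (the variable names are mine):

```python
import math

# Chebyshev-based sample size: require 1 - sigma^2 / (n * eps^2) >= 0.95,
# with sigma^2 = 4 and eps = 0.5 (values from the example).
sigma2, eps, alpha = 4, 0.5, 0.05    # alpha = 1 - 0.95

n = math.ceil(sigma2 / (eps**2 * alpha))   # 4 / (0.25 * 0.05) = 320
print(n)                                   # 320
```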
Weak Law of Large Numbers: Let \(X_{1}, X_{2}, \ldots, X_{n}, \ldots\) be independent and identically distributed (i.i.d.) random variables with finite mean \(\mu\). Then, for all \(\epsilon >0\): \[ \lim_{n\rightarrow \infty } \mathbb{P}\Bigl( \bigl| \overline{X}_n-\mu \bigr| \ge \epsilon \Bigr) \, = \, 0 \] where \[ \overline{X}_n \, = \, \frac{1}{n} \, \sum_{i=1}^n X_i \]
Interpretation: the probability that the average of a large number of independent measurements of a random variable deviates from its expected value by any fixed amount tends to zero as the number of measurements grows.
In Statistics this is recast as follows: the sample average is a consistent estimator for the population mean.
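A small simulation illustrating the statement (a sketch assuming NumPy; the Exponential(1) model and the value \(\epsilon = 0.1\) are illustrative choices):

```python
# Monte Carlo illustration of the WLLN: the probability that the sample mean
# deviates from mu by at least eps shrinks as n grows.
# Illustrative choice: i.i.d. Exponential(1) observations, so mu = 1.
import numpy as np

rng = np.random.default_rng(0)
mu, eps, reps = 1.0, 0.1, 1_000

for n in [10, 100, 1_000, 10_000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) >= eps)
    print(f"n = {n:6d}   estimated P(|Xbar_n - mu| >= {eps}) = {prob:.3f}")
```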
Central Limit Theorem: Let \(X_{1}, X_{2}, \ldots, X_{n}, \ldots\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2 > 0\). Then, for any \(z \in \mathbb{R}\):
\[ \lim_{n\rightarrow \infty } \mathbb{P}\left( \frac{\overline{X}_n-\mu }{\sigma /\sqrt{n}} \ \le \ z \, \right) = \Phi \left( z\right) \] where \(\Phi( z )\) is the CDF of a \(\mathcal{N}(0,1)\) distribution
The CLT states that the distribution of the standardized sample average (which is a random variable) \[ \frac{\overline{X}_n - \mu }{\sigma /\sqrt{n}} \] is approximately that of a \(\mathcal{N}(0, 1)\) r.v.
As a rule of thumb, the approximation is good for \(n \ge 30\) (see the Berry-Esseen Theorem)
In other words: \[ \overline{X}_n \text{ \ is \ approximately \ } \mathcal{N}\left( \mu ,\frac{\sigma ^{2}}{n} \right) \text{ \ for large }n \]
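A small simulation illustrating the CLT (a sketch assuming NumPy/SciPy; the Exponential(1) model and \(n = 50\) are illustrative choices):

```python
# CLT illustration: standardized sample means of n = 50 i.i.d. Exponential(1)
# observations (mu = 1, sigma = 1) compared with the N(0, 1) CDF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, mu, sigma = 50, 20_000, 1.0, 1.0

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))      # standardized sample means

for q in [-1.0, 0.0, 1.0, 1.96]:
    print(f"z = {q:5.2f}   empirical = {np.mean(z <= q):.4f}   Phi(z) = {stats.norm.cdf(q):.4f}")
```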
Back to the astronomer: a measurement of the distance to a distant star is a random variable with mean \(\mu\) (the unknown true distance) and variance \(4\) (light years)\(^2\).
An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,\ldots,n\), of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.
How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5 \text{ light years} \right) \, \ge \, 0.95 \ ? \]
\[ \begin{aligned} \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5\right) &= \mathbb{P}\left( -0.5 < \overline{X}_n - \mu <0.5\right) \\ & \\ &= \mathbb{P}\left( \frac{-0.5}{2/\sqrt{n}}<\frac{\overline{X}_n - \mu }{2/\sqrt{n}} < \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ & \thickapprox \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -\Phi \left( - \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ &= 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 \end{aligned} \]
\[ \begin{aligned} 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 & \ge 0.95 \\ & \\ \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) & \ge 0.975 \\ & \\ \sqrt{n}\frac{0.5}{2} & \ge \Phi ^{-1}\left( 0.975\right) =1.96 \\ & \\ n & \ge \left( \frac{1.96\times 2}{0.5}\right)^{2} \approx 61.47 \end{aligned} \]
Hence she needs \(n \ge 62\) measurements.
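A minimal Python sketch of this computation (assuming SciPy for \(\Phi^{-1}\)):

```python
# CLT-based sample size: smallest n with 2 * Phi(0.5 * sqrt(n) / 2) - 1 >= 0.95.
import math
from scipy import stats

sigma, eps = 2.0, 0.5
z = stats.norm.ppf(0.975)              # Phi^{-1}(0.975) ~ 1.96

n = math.ceil((z * sigma / eps) ** 2)  # (z * sigma / eps)^2 ~ 61.5, so n = 62
print(n)                               # 62
```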
Using Chebyshev’s inequality we conclude we need \[ n \ge 320 \] independent observations.
Using the CLT we conclude we need \[ n \ge 62 \] independent observations
Which one is right? Is either wrong? Why the difference?
Chernoff's bound: Let \(X\) be a r.v. whose MGF \(M_X(t)\) exists for \(t \in (-\epsilon, \epsilon)\), for some \(\epsilon > 0\).
Then, for any \(a \in \mathbb{R}\) and \(t \in (0, \epsilon)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \] (using Markov’s inequality)
Similarly, for any \(t \in (-\epsilon, 0)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \le a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \]
To find the sharpest bound, choose the \(t\) that minimizes \[ H(t) = M_X(t) \, e^{-t \, a} \] over \(t \in (0, \epsilon)\) (for the upper tail) or \(t \in (-\epsilon, 0)\) (for the lower tail)
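This minimization can also be done numerically. A sketch assuming SciPy, where the standard normal MGF and the value \(a = 3\) are illustrative choices (the Poisson case is worked out analytically below):

```python
# Numerical minimization of H(t) = M_X(t) * exp(-t * a) over t > 0.
# Illustrative choice: standard normal X, where M_X(t) = exp(t^2 / 2) and
# the closed-form optimum (attained at t = a) is exp(-a^2 / 2).
import numpy as np
from scipy.optimize import minimize_scalar

a = 3.0

def H(t):
    return np.exp(t**2 / 2 - t * a)

res = minimize_scalar(H, bounds=(1e-6, 10.0), method="bounded")
print(f"numerical bound   = {res.fun:.6f} at t = {res.x:.3f}")
print(f"closed-form bound = {np.exp(-a**2 / 2):.6f}")
```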
Example: let \(X \thicksim \mathcal{P}(\lambda)\), so that \(M_X(t) = e^{\lambda \, \left( e^t - 1 \right)}\); thus, for any \(a > 0\) and \(t > 0\), \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le \, e^{\lambda \, \left( e^t - 1 \right) } \, e^{-a \, t} \\ & = e^{\lambda \, \left( e^t - 1 \right) - a \, t } \end{aligned} \]
The minimum occurs when \[ t \, = \, \log \left( a / \lambda \right) \] (check that this is indeed a minimum)
We also need \(t > 0\), so the minimum occurs on \(\mathbb{R}_+\) if \(a > \lambda\).
In that case (\(a > \lambda\)): \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le e^{\lambda \, \left( a / \lambda - 1 \right) - a \, \log( a / \lambda) } \\ & = e^{ a \, - \lambda} \, \left( \lambda / a \right)^a \\ & = e^{ - \lambda} \, \left( e \, \lambda / a \right)^a \end{aligned} \]
Compare the bound with the exact calculation:
Let \(X \thicksim \mathcal{P}(10)\). Chernoff’s bound gives \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) \le 0.021 \]
and the exact calculation is: \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) = 1 - \mathbb{P}\Bigl( X \le 19 \Bigr) = 0.00345 \]
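A minimal Python sketch reproducing both numbers (assuming SciPy):

```python
# Poisson(10) tail at a = 20: Chernoff bound e^{-lambda} (e * lambda / a)^a
# versus the exact probability P(X >= 20) = 1 - F(19).
import numpy as np
from scipy import stats

lam, a = 10, 20

chernoff = np.exp(-lam) * (np.e * lam / a) ** a   # ~ 0.021
exact = stats.poisson.sf(a - 1, lam)              # ~ 0.00345

print(f"Chernoff bound: {chernoff:.5f}")
print(f"Exact tail:     {exact:.5f}")
```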
A function \(h : \mathcal{I} \to \mathbb{R}\), where \(\mathcal{I} \subseteq \mathbb{R}\) is an interval, is called convex if for any \(x_1, x_2 \in \mathcal{I}\): \[ h \Bigl( a \, x_1 + (1 - a) \, x_2 \Bigr) \, \le \, a \, h( x_1) + (1-a) \, h(x_2) \quad \forall a \in [0, 1] \]
If \(h\) is twice differentiable, then this is equivalent to \(h''(x) \ge 0\) for \(x \in \mathcal{I}\)
Examples: \(h(x) = x^2\), \(h(x) = e^x\), and \(h(x) = |x|\) are convex on \(\mathbb{R}\); \(h(x) = 1/x\) and \(h(x) = -\log(x)\) are convex on \((0, +\infty)\)
Jensen's inequality: Let \(X\) be a random variable with range \(\mathcal{R}_X\)
If \(h\) is a convex function over \(\mathcal{R}_X\), and \(\mathbb{E}\left[ X \right]\) and \(\mathbb{E}\left[ h(X) \right]\) exist and are finite, then \[ \mathbb{E}\Bigl[ h \left( X \right) \Bigr] \, \ge \, h \Bigl( \mathbb{E}\left[ X \right] \Bigr) \]
Examples: \(\mathbb{E}\left[ X^2 \right] \ge \bigl( \mathbb{E}[X] \bigr)^2\), \(\mathbb{E}\left[ e^{X} \right] \ge e^{\mathbb{E}[X]}\), and, if \(\mathbb{P}(X > 0) = 1\), \(\mathbb{E}\left[ 1/X \right] \ge 1 / \mathbb{E}[X]\) (whenever these expectations exist)
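A quick Monte Carlo check of Jensen's inequality (a sketch assuming NumPy; the Exponential(1) model and the convex functions \(x^2\) and \(e^{x/2}\) are illustrative choices):

```python
# Monte Carlo check of Jensen's inequality with X ~ Exponential(1)
# for two convex functions: x^2 and e^{x/2}.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

print(f"E[X^2]     = {np.mean(x**2):.3f}  >=  (E[X])^2   = {np.mean(x)**2:.3f}")
print(f"E[e^(X/2)] = {np.mean(np.exp(x / 2)):.3f}  >=  e^(E[X]/2) = {np.exp(np.mean(x) / 2):.3f}")
```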
Let \(f\) and \(g\) be two densities or pmf’s. The Kullback-Leibler divergence \(D_{KL}(f, g)\) measures how different \(f\) is from \(g\). Formally, in the continuous and discrete cases respectively: \[ \begin{aligned} D_{KL} \Bigl( f, g \Bigr) \, & = \, \int_{-\infty}^{+\infty} \, f(t) \, \log \left( \frac{f(t)}{g(t)} \right) \, dt \, \\ & \\ D_{KL} \Bigl( f, g \Bigr) \, & = \, \sum_{k \in \mathcal{R}} \, f(k) \, \log \left( \frac{f(k)}{g(k)} \right) \end{aligned} \] (Assume that \(g(a) = 0\) implies \(f(a) = 0\).)
Prove that \(D_{KL} ( f, g ) \ge 0\)
Proof: Define \(h:\mathbb{R} \to \mathbb{R}\) by \[ h(t) \, = \, \frac{g(t)}{f(t)} \] then \[ D_{KL} \Bigl( f, g \Bigr) \, = \, \int _{-\infty}^{+\infty} \, -\log(h(t)) \, f(t) \, dt \ = \ \mathbb{E}_f \left[ -\log(h(X)) \right] \] Since \(-\log\) is convex, Jensen’s inequality gives \[ \mathbb{E}_f \left[ -\log(h(X)) \right] \ \ge \ -\log \Bigl( \mathbb{E}_f \left[ h(X) \right] \Bigr) \, = \, -\log \left( \int_{-\infty}^{+\infty} \frac{g(t)}{f(t)} \, f(t) \, dt \right) \, = \, -\log(1) \, = \, 0 \]
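A quick numerical check of this result for two discrete distributions (a sketch assuming NumPy/SciPy; the two binomial pmf's are illustrative choices):

```python
# Numerical check that D_KL(f, g) >= 0 for two discrete pmfs.
# Illustrative choice: f = Binomial(20, 0.5), g = Binomial(20, 0.3),
# both strictly positive on {0, ..., 20}.
import numpy as np
from scipy import stats

k = np.arange(21)
f = stats.binom.pmf(k, 20, 0.5)
g = stats.binom.pmf(k, 20, 0.3)

d_fg = np.sum(f * np.log(f / g))
d_gf = np.sum(g * np.log(g / f))
print(d_fg, d_gf)   # both nonnegative; note D_KL(f, g) != D_KL(g, f) in general
```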
Stat 302 - Winter 2025/26