Module 9

Inequalities and asymptotics


Matias Salibian Barrera

Last modified — 26 Nov 2025

Inequalities

  • How far can \(X\) be from its expected value \(\mathbb{E}[ X ]\)?

  • Better: how likely is it that \(X\) is more than \(\varepsilon\) away from \(\mathbb{E}[ X ]\)?

\[ \mathbb{P}\Bigl( \bigl| X - \mathbb{E}[X] \bigr| > \varepsilon \Bigr) \quad ? \]

  • Intuitively, this should depend on the size of \(\Bigl| X - \mathbb{E}[X] \Bigr|\), or of \(\Bigl( X - \mathbb{E}[X] \Bigr)^2\), which are random, so perhaps it will depend on

\[ \mathbb{E}\Bigl( X - \mathbb{E}[X] \Bigr)^2 \ = \ V(X) \]

Auxiliary results

  • Let \(X\) be a continuous or discrete random variable with pdf/pmf \(f_X\).

  • If there is a set \(A\) such that \(\mathbb{P}\left( X \in A \right) = 1\), and if \(h\) is a function such that \(\mathbb{E}[ h(X) ]\) exists, then

\[ \int_{A^c} h(t) \, f_X(t) \, dt = 0 \] or

\[ \sum_{k \in A^c} h(k) \, f_X(k) = 0 \]

Auxiliary results

  • Under the same conditions as before, we have

\[ \mathbb{E}[ h(X) ] \, = \, \int_{A} h(t) \, f_X(t) \, dt \]

or

\[ \mathbb{E}[ h(X) ] = \sum_{k \in A} h(k) \, f_X(k) \]

Proof: \[ \mathbb{E}[ h(X) ] = \int h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt + \int_{A^c} h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt \]

Auxiliary results

  • If \(\mathbb{E}[ g(X) ]\) exists and \[ \mathbb{P}\Bigl( g(X) \ge 0 \Bigr) = 1 \] then \[ \mathbb{E}[ g(X) ] = \int g(t) \, f_X(t) \, dt \ \ge \ 0 \]

  • This implies that, if \(\mathbb{E}[ h(X) ]\) exists, then \(\int_{h(t) \ge 0} h(t) \, f_X(t) \, dt \, \ge \, 0\)

Proof: \[ \int_{h(t) \ge 0} h(t) \, f_X(t) \, dt = \int \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \, f_X(t) \, dt \ge 0 \] because \(g(t) = \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \ge 0\) for all \(t \in \mathbb{R}\)

Auxiliary results

  • Suppose that \(\mathbb{E}[g(X)]\) and \(\mathbb{E}[h(X)]\) exist and \[ \mathbb{P}\Bigl( g(X) \ge h(X) \Bigr) \, = \, 1 \] then \[ \mathbb{E}[ g(X) ] \, \ge \, \mathbb{E}[ h(X) ] \]

Proof: Let \(b(t) = g(t) - h(t)\) and \[ A = \bigl\{ \, t \in \mathbb{R} : b(t) \ge 0 \, \bigr\} \] then \(\mathbb{P}( X \in A ) = 1\) and \[ \mathbb{E}[ g(X) ] - \mathbb{E}[ h(X)] = \mathbb{E}[ b(X) ] = \int_{A} b(t) \, f_X(t) \, dt \ge 0 \]

Markov’s inequality

  • Let \(X\) be a non-negative random variable:

\[ \mathbb{P}\Bigl( \, X \ge 0 \, \Bigr) = 1 \]

Then, for any \(a > 0\) we have

\[ \mathbb{P}\Bigl( \, X \ge a \, \Bigr) \le \frac{ \mathbb{E}[ X ] }{a} \]

Markov’s inequality

Proof:

\[ \begin{aligned} \mathbb{E}[ X ] = \int t \, f_X(t) \, dt &= \int_{0 \le t} t \, f_X(t) \, dt \\ & \\ & = \int_{0 \le t < a} t \, f_X(t) \, dt \ + \ \int_{a \le t} t \, f_X(t) \, dt \\ & \\ & \ge \int_{a \le t} t \, f_X(t) \, dt \ge a \, \int_{a \le t} f_X(t) \, dt = a \, \mathbb{P}( X \ge a) \end{aligned} \]
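
A quick numerical sanity check (a Python sketch, not part of the original derivation): for the illustrative choice \(X \thicksim\) Exponential(1), compare a Monte Carlo estimate of \(\mathbb{P}(X \ge a)\) with the Markov bound \(\mathbb{E}[X]/a\).

```python
# Sketch: Monte Carlo check of Markov's inequality for X ~ Exponential(1),
# so E[X] = 1 and the true tail is P(X >= a) = exp(-a). Illustrative choice only.
import numpy as np

rng = np.random.default_rng(123)
x = rng.exponential(scale=1.0, size=100_000)

for a in (1, 2, 5):
    estimate = np.mean(x >= a)   # Monte Carlo estimate of P(X >= a)
    bound = x.mean() / a         # Markov bound E[X]/a (E[X] estimated from the sample)
    print(f"a = {a}: P(X >= a) ~ {estimate:.4f} <= Markov bound {bound:.4f}")
```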

Chebyshev’s Inequality

Let \(X\) be a random variable with \(\mathbb{E}[X] = \mu\) and \(V(X) = \sigma^2\). Then, for any \(k > 0\) we have

\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) \ \le \ \frac{\sigma^2}{k^2} \]

Proof:

Note that \(\mathbb{P}\left( (X - \mu)^2 \ge 0 \right) = 1\), hence, by Markov’s inequality:

\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) = \mathbb{P}\Bigl( \, \left( X - \mu \right)^2 \ge k^2 \, \Bigr) \le \frac{\mathbb{E}[ ( X - \mu)^2 ]}{k^2} = \frac{\sigma^2}{k^2} \]
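
As a sketch (assuming, for illustration, \(X \thicksim \mathcal{N}(0, 4)\), so \(\mu = 0\) and \(\sigma = 2\)), the exact two-sided tail probability can be compared with the Chebyshev bound \(\sigma^2 / k^2\):

```python
# Sketch: compare P(|X - mu| >= k) with the Chebyshev bound sigma^2 / k^2
# for X ~ N(0, 4); an illustrative choice of distribution, not from the slides.
from scipy.stats import norm

mu, sigma = 0.0, 2.0
for k in (2, 4, 6):
    exact = 2 * norm.sf(k, loc=mu, scale=sigma)  # P(|X - mu| >= k), by symmetry
    bound = sigma**2 / k**2
    print(f"k = {k}: exact {exact:.4f} <= Chebyshev bound {bound:.4f}")
```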

Example 1

  • The number of customers coming to a service station each day is a random variable \(X\) with \(\mathbb{E}[X] = 60\) and \(V(X) = 16\) (and \(\mathbb{P}( X \ge 0) = 1\), of course).
  1. What can you say about the probability that at least 70 customers come on any given day?

  2. What can you say about the probability that the number of customers coming to the station on a given day is between 50 and 70?

Example 1

  1. Using Markov’s inequality, since \(\mathbb{P}( X \ge 0 ) = 1\): \[ \mathbb{P}\Bigl( X \ge 70 \Bigr) \le \frac{60}{70} \approx 0.86 \]

  2. Using Chebyshev’s inequality \[ \begin{aligned} \mathbb{P}\Bigl( 50 < X < 70 \Bigr) &= \mathbb{P}\Bigl( | X - 60 | < 10 \Bigr) \\ & \\ &= 1 - \mathbb{P}\Bigl( | X - 60 | \ge 10 \Bigr) \\ & \\ & \ge 1 - \frac{16}{10^2} = 0.84 \end{aligned} \]

Example 2

  • A measurement of the distance to a distant star is a random variable with mean \(\mu\) (the unknown true distance) and variance \(4\) light years\(^2\).

  • An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,...,n\) of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.

  • How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n-\mu \bigr\vert <0.5\text{ light years} \right) \, \ge \, 0.95 \ ? \]

Example 2

  • Recall that \[ \mathbb{E}\left[ \overline{X}_n \right] \, = \, \mu \] and \[ V \left[ \overline{X}_n \right] \, = \, \frac{4}{n} \]

  • Thus, using Chebyshev’s inequality applied to \(\overline{X}_n\): \[ \mathbb{P}\biggl( \bigl| \overline{X}_n - \mu \bigr| < 1/2 \biggr) \, \ge \, 1 - \frac{4}{n \, (1/2)^2} = 1 - \frac{16}{n} \] Hence, to have \(1 - 16/n \ge 0.95\) she needs \(n \ge 16/0.05 = 320\) independent measurements.

Weak Law of Large Numbers

  • Let \(X_{1},X_{2}, \ldots, X_{n} \, \ldots\) be independent and identically distributed (i.i.d) random variables with finite mean \(\mu\). Then, for all \(\epsilon >0\): \[ \lim_{n\rightarrow \infty } \mathbb{P}\Bigl( \bigl| \overline{X}_n-\mu \bigr| \ge \epsilon \Bigr) \, = \, 0 \] where \[ \overline{X}_n \, = \, \frac{1}{n} \, \sum_{i=1}^n X_i \]

  • Interpretation: the probability that the average of independent, identically distributed measurements deviates from their common expected value by any fixed amount \(\epsilon > 0\) goes to zero as the average involves more and more observations.

  • In Statistics this is recast as follows: the sample average is a consistent estimator for the population mean.
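
A small simulation sketch of the WLLN (in Python, with the illustrative choice \(X_i \thicksim\) Exponential(1), so \(\mu = 1\)): the running average settles near \(\mu\) as \(n\) grows.

```python
# Sketch: running averages of i.i.d. Exponential(1) draws approach mu = 1.
# The distribution is an illustrative choice, not part of the slides.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
running_avg = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(f"n = {n:>6}: X_bar_n = {running_avg[n - 1]:.4f}   (mu = 1)")
```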

The Central Limit Theorem (CLT)

  • Let \(X_{1},X_{2}, \ldots, X_{n} \, \ldots\) be independent and identically distributed (i.i.d) random variables with finite mean \(\mu\) and finite variance \(\sigma^2 > 0\).

Then, for any \(z \in \mathbb{R}\):

\[ \lim_{n\rightarrow \infty } \mathbb{P}\left( \frac{\overline{X}_n-\mu }{\sigma /\sqrt{n}} \ \le \ z \, \right) = \Phi \left( z\right) \] where \(\Phi( z )\) is the CDF of a \(\mathcal{N}(0,1)\) distribution

Approximate Distribution of \(\overline{X}_n\)

  • The CLT states that the distribution of the standardized sample average (which is a random variable) \[ \frac{\overline{X}_n - \mu }{\sigma /\sqrt{n}} \] is approximately that of a \(\mathcal{N}(0, 1)\) r.v.

  • As a rule of thumb, the approximation is good for \(n \ge 30\) (see the Berry-Esseen Theorem)

  • In other words: \[ \overline{X}_n \text{ \ is \ approximately \ } \mathcal{N}\left( \mu ,\frac{\sigma ^{2}}{n} \right) \text{ \ for large }n \]
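
A simulation sketch of this approximation (with the illustrative, rather skewed choice \(X_i \thicksim\) Exponential(1), so \(\mu = \sigma = 1\), and \(n = 30\)): the simulated CDF of the standardized sample mean is close to \(\Phi\).

```python
# Sketch: standardized sample means of Exponential(1) data (n = 30) vs. the N(0,1) CDF.
# The data-generating distribution is an illustrative choice, not from the slides.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 30, 50_000
samples = rng.exponential(scale=1.0, size=(reps, n))
z_vals = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))  # (X_bar - mu) / (sigma / sqrt(n))

for z in (-1.0, 0.0, 1.0, 1.96):
    print(f"P(Z <= {z:5.2f}): simulated {np.mean(z_vals <= z):.4f}   Phi(z) = {norm.cdf(z):.4f}")
```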

Example 2 - Reloaded

  • A measurement of the distance to a distant star is a random variable with mean \(\mu\) (the true distance) and variance \(4\) light years\(^2\).

  • An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,\ldots,n\) of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.

  • How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5 \text{ light years} \right) \, \ge \, 0.95 \ ? \]

Example 2 - Reloaded

\[ \begin{aligned} \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5\right) &= \mathbb{P}\left( -0.5 < \overline{X}_n - \mu <0.5\right) \\ & \\ &= \mathbb{P}\left( \frac{-0.5}{2/\sqrt{n}}<\frac{\overline{X}_n - \mu }{2/\sqrt{n}} < \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ & \thickapprox \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -\Phi \left( - \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ &= 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 \end{aligned} \]

Example 2 - Reloaded

\[ \begin{aligned} 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 & \ge 0.95 \\ & \\ \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) & \ge 0.975 \\ & \\ \sqrt{n}\frac{0.5}{2} & \ge \Phi ^{-1}\left( 0.975\right) =1.96 \\ & \\ n & \ge \left( \frac{1.96\times 2}{0.5}\right)^{2}=61.47 \end{aligned} \]

Hence she needs \(n \ge 62\) measurements.
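
Both sample-size calculations in this example (CLT here, Chebyshev earlier) can be reproduced in a few lines of Python; a sketch, using scipy.stats.norm.ppf for the 1.96 quantile:

```python
# Sketch: sample sizes required in Example 2 via the CLT and via Chebyshev's inequality.
import math
from scipy.stats import norm

sigma, eps, conf = 2.0, 0.5, 0.95
z = norm.ppf(1 - (1 - conf) / 2)                       # ~ 1.96
n_clt = math.ceil((z * sigma / eps) ** 2)              # (1.96 * 2 / 0.5)^2 = 61.47 -> 62
n_cheb = math.ceil(sigma**2 / (eps**2 * (1 - conf)))   # 4 / (0.25 * 0.05) = 320
print(f"CLT: n >= {n_clt},  Chebyshev: n >= {n_cheb}")
```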

Example 2 - Reloaded - Discussion

  • Using Chebyshev’s inequality we conclude we need \[ n \ge 320 \] independent observations.

  • Using the CLT we conclude we need \[ n \ge 62 \] independent observations

  • Which one is right? Is either wrong? Why the difference?
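
One way to inform the discussion is a Monte Carlo sketch of the actual coverage \(\mathbb{P}\bigl( |\overline{X}_n - \mu| < 0.5 \bigr)\) for \(n = 62\) and \(n = 320\), under the extra assumption (not made on the slides) that the measurements are \(\mathcal{N}(\mu, 4)\); the value of \(\mu\) below is arbitrary.

```python
# Sketch: estimate P(|X_bar_n - mu| < 0.5) for n = 62 and n = 320,
# assuming (illustratively) N(mu, 4) measurements; mu = 100 is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, reps = 100.0, 2.0, 20_000

for n in (62, 320):
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    coverage = np.mean(np.abs(xbar - mu) < 0.5)
    print(f"n = {n:>3}: estimated P(|X_bar - mu| < 0.5) = {coverage:.3f}")
```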

Chernoff bounds

  • Let \(X\) be a r.v. whose MGF \(M_X(t)\) exists for \(t \in (-\epsilon, \epsilon)\), for some \(\epsilon > 0\).

  • Then, for any \(a \in \mathbb{R}\) and \(t \in (0, \epsilon)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \] (using Markov’s inequality)

Chernoff bounds

  • Similarly, for any \(t \in (-\epsilon, 0)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \le a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \] (here \(X \le a\) is equivalent to \(e^{t \, X} \ge e^{t \, a}\) because \(x \mapsto e^{t \, x}\) is decreasing when \(t < 0\))

  • Find the sharpest bound by choosing the \(t\) that minimizes \[ H(t) = M_X(t) \, e^{-t \, a} \] over \(t \in (0, \epsilon)\) or \((-\epsilon, 0)\), respectively
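
This minimization can also be done numerically. A sketch, assuming for illustration \(X \thicksim \mathcal{N}(0,1)\), for which \(M_X(t) = e^{t^2/2}\), the optimal \(t\) is \(t = a\) and the resulting bound is \(e^{-a^2/2}\):

```python
# Sketch: minimize H(t) = M_X(t) * exp(-t a) numerically for X ~ N(0, 1),
# where M_X(t) = exp(t^2 / 2); the minimizer is t = a, giving the bound exp(-a^2 / 2).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

a = 2.0
H = lambda t: np.exp(t**2 / 2 - t * a)                # M_X(t) e^{-t a}
res = minimize_scalar(H, bounds=(1e-6, 10.0), method="bounded")

print(f"optimal t ~ {res.x:.4f}, Chernoff bound ~ {res.fun:.4f}")  # t ~ 2, bound ~ exp(-2) ~ 0.1353
print(f"exact P(X >= {a}) = {norm.sf(a):.4f}")                     # ~ 0.0228
```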

Chernoff bounds - Example: Poisson distributions

  • Let \(X \thicksim \mathcal{P}(\lambda)\), then \[ M_X(t) = e^{\lambda \, \left( e^t - 1 \right) } \]

thus, for any \(a > 0\) and \(t > 0\), \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le \, e^{\lambda \, \left( e^t - 1 \right) } \, e^{-a \, t} \\ & = e^{\lambda \, \left( e^t - 1 \right) - a \, t } \end{aligned} \]

  • Now find the minimum of the right-hand side over \(t > 0\). Since the exponential function is increasing, the minimizer of \[ e^{\lambda \, \left( e^t - 1 \right) - a \, t } \] over \(t > 0\) is the minimizer of the exponent \[ \lambda \left( e^t - 1 \right) - a \, t \]

Chernoff bounds - Example: Poisson distributions

  • The minimum occurs when \[ t \, = \, \log \left( a / \lambda \right) \] (check that this is indeed a minimum)

  • We also need \(t > 0\), so the minimum occurs on \(\mathbb{R}_+\) if \(a > \lambda\).

  • In that case (\(a > \lambda\)): \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le e^{\lambda \, \left( a / \lambda - 1 \right) - a \, \log( a / \lambda) } \\ & = e^{ a \, - \lambda} \, \left( \lambda / a \right)^a \\ & = e^{ - \lambda} \, \left( e \, \lambda / a \right)^a \end{aligned} \]

Chernoff bounds - Example: Poisson distributions

  • Compare the bound with the exact calculation

  • Let \(X \thicksim \mathcal{P}(10)\). Chernoff’s bound gives \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) \le 0.021 \]

and the exact calculation is: \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) = 1 - \mathbb{P}\Bigl( X \le 19 \Bigr) = 0.00345 \]

  • Discuss
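
For the discussion, the two numbers above can be reproduced with a short Python sketch (using scipy.stats.poisson for the exact tail probability):

```python
# Sketch: Chernoff bound e^{-lambda} (e lambda / a)^a vs. the exact Poisson tail,
# for lambda = 10 and a = 20 (the values used above).
import numpy as np
from scipy.stats import poisson

lam, a = 10, 20
chernoff = np.exp(-lam) * (np.e * lam / a) ** a  # ~ 0.021
exact = poisson.sf(a - 1, lam)                   # P(X >= 20) = 1 - P(X <= 19) ~ 0.00345
print(f"Chernoff bound: {chernoff:.4f}")
print(f"Exact:          {exact:.5f}")
```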

Jensen’s inequality - Convexity

  • A function \(h : \mathcal{I} \to \mathbb{R}\) is called convex if for any \(x_1, x_2 \in \mathcal{I}\): \[ h \Bigl( a \, x_1 + (1 - a) \, x_2 \Bigr) \, \le \, a \, h( x_1) + (1-a) \, h(x_2) \quad \forall a \in [0, 1] \]

  • If \(h\) is twice differentiable, then this is equivalent to \(h''(x) \ge 0\) for all \(x \in \mathcal{I}\)

  • Examples:

    • \(h(x) = x^2\), \(h(x) = e^x\), \(h(x) = e^{-x}\), \(x \in \mathbb{R}\);
    • \(h(x) = - \log(x)\), \(x > 0\);
    • \(h(x) = |x|\), \(x \in \mathbb{R}\).

Jensen’s inequality

  • Let \(X\) be a random variable with range \(\mathcal{R}_X\)

  • If \(h\) is a convex function over \(\mathcal{R}_X\), and \(\mathbb{E}\left[ X \right]\) and \(\mathbb{E}\left[ h(X) \right]\) exist and are finite, then \[ \mathbb{E}\Bigl[ h \left( X \right) \Bigr] \, \ge \, h \Bigl( \mathbb{E}\left[ X \right] \Bigr) \]

  • Examples:

    • \(\mathbb{E}\left[ X^2 \right] \ge \left( \mathbb{E}\left[ X \right] \right)^2\)
    • \(\mathbb{E}\left[ \log( X ) \right] \le \log \left( \mathbb{E}\left[ X \right] \right)\)
    • \(\Bigl| \mathbb{E}\left[ X \right] \Bigr| \le \mathbb{E}\Bigl[ \left| X \right| \Bigr]\) (Hölder or Cauchy–Schwarz inequalities)
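
A quick Monte Carlo illustration of these three examples (a sketch with the illustrative choice \(X \thicksim\) Exponential(1)):

```python
# Sketch: check the three Jensen-type inequalities above by simulation,
# with X ~ Exponential(1) (an illustrative choice, not from the slides).
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=200_000)

print(np.mean(x**2), np.mean(x)**2)                  # E[X^2]   >= (E[X])^2   (about 2 vs 1)
print(np.mean(np.log(x)), np.log(np.mean(x)))        # E[log X] <= log(E[X])  (about -0.58 vs 0)
print(abs(np.mean(x - 1)), np.mean(np.abs(x - 1)))   # |E[Y]| <= E[|Y|] with Y = X - 1 (centered so the check is not trivial)
```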

Kullback-Leibler divergence

  • Let \(f\) and \(g\) be two densities or pmf’s. The Kullback-Leibler divergence between them measures how different one of them is from the other. Formally: \[ \begin{aligned} D_{KL} \Bigl( f, g \Bigr) \, & = \, \int_{-\infty}^{+\infty} \, f(t) \, \log \left( \frac{f(t)}{g(t)} \right) \, dt && \text{(continuous case)} \\ & \\ D_{KL} \Bigl( f, g \Bigr) \, & = \, \sum_{k \in \mathcal{R}} \, f(k) \, \log \left( \frac{f(k)}{g(k)} \right) && \text{(discrete case)} \end{aligned} \] where \(\mathcal{R}\) is the support of \(f\) (assume that \(g(a) = 0\) implies \(f(a) = 0\))

  • Prove that \(D_{KL} ( f, g ) \ge 0\)

Kullback-Leibler divergence

Proof: Define \[ h(t) \, = \, \frac{g(t)}{f(t)} \] for those \(t\) with \(f(t) > 0\); then \[ D_{KL} \Bigl( f, g \Bigr) \, = \, \int _{-\infty}^{+\infty} \, -\log(h(t)) \, f(t) \, dt \ = \ \mathbb{E}_f \left[ -\log(h(X)) \right] \] where \(\mathbb{E}_f\) denotes the expectation when \(X\) has density (or pmf) \(f\)

  • Note that \(-\log\) is convex and decreasing, and \(\mathbb{E}_f[ h(X) ] = \int_{\{f > 0\}} g(t) \, dt \le \int_{-\infty}^{+\infty} g(t) \, dt = 1\), thus: \[ D_{KL} \Bigl( f, g \Bigr) = \, \mathbb{E}_f \left[ -\log(h(X)) \right] \ge -\log \left( \mathbb{E}_f[ h(X) ] \right) \ge \, -\log \left( \int _{-\infty}^{+\infty} \, g(t) \, dt \right) = 0 \]
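
A numerical sketch (for the illustrative choice \(f = \) Poisson(10) and \(g = \) Poisson(12) pmf’s, truncating the sums where the remaining mass is negligible):

```python
# Sketch: D_KL(f, g) and D_KL(g, f) for f = Poisson(10), g = Poisson(12) pmf's,
# truncating the sums at k = 99 (the remaining mass is negligible). Both are >= 0.
import numpy as np
from scipy.stats import poisson

k = np.arange(0, 100)
f = poisson.pmf(k, 10)
g = poisson.pmf(k, 12)

print(f"D_KL(f, g) = {np.sum(f * np.log(f / g)):.4f}")  # ~ 0.177
print(f"D_KL(g, f) = {np.sum(g * np.log(g / f)):.4f}")  # ~ 0.188; note the divergence is not symmetric
```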