Inequalities and asymptotics
Last modified — 26 Nov 2025
How far can \(X\) be from its expected value \(\mathbb{E}[ X ]\)?
Better: how likely is it that \(X\) is \(\varepsilon\) far from \(\mathbb{E}[ X ]\)?
\[ \mathbb{P}\Bigl( \bigl| X - \mathbb{E}[X] \bigr| > \varepsilon \Bigr) \quad ? \]
\[ \mathbb{E}\Bigl( X - \mathbb{E}[X] \Bigr)^2 \ = \ V(X) \]
Let \(X\) be a continuous or discrete random variable with pdf/pmf \(f_X\).
If there is a set \(A\) such that \(\mathbb{P}\left( X \in A \right) = 1\), and if \(h\) is a function such that \(\mathbb{E}[ h(X) ]\) exists, then
\[ \int_{A^c} h(t) \, f_X(t) \, dt = 0 \] or
\[ \sum_{k \in A^c} h(k) \, f_X(k) = 0 \]
and therefore
\[ \mathbb{E}[ h(X) ] \, = \, \int_{A} h(t) \, f_X(t) \, dt \]
or
\[ \mathbb{E}[ h(X) ] = \sum_{k \in A} h(k) \, f_X(k) \]
Proof (continuous case; the discrete case is analogous): \[ \mathbb{E}[ h(X) ] = \int h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt + \int_{A^c} h(t) \, f_X(t) \, dt = \int_{A} h(t) \, f_X(t) \, dt \]
If \(\mathbb{E}[ g(X) ]\) exists and \[ \mathbb{P}\Bigl( g(X) \ge 0 \Bigr) = 1 \] then \[ \mathbb{E}[ g(X) ] = \int g(t) \, f_X(t) \, dt \ \ge \ 0 \]
This implies that, if \(\mathbb{E}[ h(X) ]\) exists, then \(\int_{h(t) \ge 0} h(t) \, f_X(t) \, dt \, \ge \, 0\)
Proof: \[ \int_{h(t) \ge 0} h(t) \, f_X(t) \, dt = \int \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \, f_X(t) \, dt \ge 0 \] because \(g(t) = \left[ I_{[0, +\infty)}(h(t)) \, h(t) \right] \ge 0\) for all \(t \in \mathbb{R}\)
If \(\mathbb{E}[ g(X) ]\) and \(\mathbb{E}[ h(X) ]\) exist and \(\mathbb{P}\bigl( g(X) \ge h(X) \bigr) = 1\), then \(\mathbb{E}[ g(X) ] \ge \mathbb{E}[ h(X) ]\).
Proof: Let \(b(t) = g(t) - h(t)\) and \[ A = \bigl\{ \, t \in \mathbb{R} : b(t) \ge 0 \, \bigr\} \] then \(\mathbb{P}( X \in A ) = 1\) and \[ \mathbb{E}[ g(X) ] - \mathbb{E}[ h(X)] = \mathbb{E}[ b(X) ] = \int_{A} b(t) \, f_X(t) \, dt \ge 0 \]
Markov's inequality: Let \(X\) be a random variable such that \[ \mathbb{P}\Bigl( \, X \ge 0 \, \Bigr) = 1 \]
Then, for any \(a > 0\) we have
\[ \mathbb{P}\Bigl( \, X \ge a \, \Bigr) \le \frac{ \mathbb{E}[ X ] }{a} \]
Proof:
\[ \begin{aligned} \mathbb{E}[ X ] = \int t \, f_X(t) \, dt &= \int_{0 \le t} t \, f_X(t) \, dt \\ & \\ & = \int_{0 \le t < a} t \, f_X(t) \, dt \ + \ \int_{a \le t} t \, f_X(t) \, dt \\ & \\ & \ge \int_{a \le t} t \, f_X(t) \, dt \ge a \, \int_{a \le t} f_X(t) \, dt = a \, \mathbb{P}( X \ge a) \end{aligned} \]
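As a quick numerical check, the following Python sketch (assuming SciPy is available; the Exponential(1) choice is purely illustrative) compares the exact tail with the Markov bound \(\mathbb{E}[X]/a\):

```python
# Numerical sanity check of Markov's inequality for X ~ Exponential(1),
# where E[X] = 1 and P(X >= a) = exp(-a).
from scipy import stats

X = stats.expon()      # Exponential distribution with mean 1
mean = X.mean()        # E[X] = 1

for a in [1, 2, 5, 10]:
    exact = X.sf(a)    # exact tail P(X >= a) = exp(-a)
    bound = mean / a   # Markov bound E[X] / a
    print(f"a = {a:2d}   exact = {exact:.5f}   Markov bound = {bound:.5f}")
```

The bound always holds, but it can be quite loose far out in the tail.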
Chebyshev's inequality: Let \(X\) be a random variable with \(\mathbb{E}[X] = \mu\) and \(V(X) = \sigma^2\). Then, for any \(k > 0\) we have
\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) \ \le \ \frac{\sigma^2}{k^2} \]
Proof:
Note that \(\mathbb{P}\left( (X - \mu)^2 \ge 0 \right) = 1\), hence, by Markov’s inequality:
\[ \mathbb{P}\Bigl( \, \left| X - \mu \right| \ge k \, \Bigr) = \mathbb{P}\Bigl( \, \left( X - \mu \right)^2 \ge k^2 \, \Bigr) \le \frac{\mathbb{E}[ ( X - \mu)^2 ]}{k^2} = \frac{\sigma^2}{k^2} \]
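The same kind of check can be done for Chebyshev's inequality (a sketch assuming SciPy; the standard normal is an illustrative choice):

```python
# Numerical sanity check of Chebyshev's inequality for X ~ N(0, 1):
# P(|X - mu| >= k) should never exceed sigma^2 / k^2.
from scipy import stats

X = stats.norm(loc=0, scale=1)
sigma2 = X.var()

for k in [1, 2, 3, 4]:
    exact = 2 * X.sf(k)      # P(|X| >= k), by symmetry of the normal
    bound = sigma2 / k**2    # Chebyshev bound
    print(f"k = {k}   exact = {exact:.5f}   Chebyshev bound = {bound:.5f}")
```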
Example: Suppose the number of customers who come to a station on a given day is a random variable \(X\) with mean \(60\) and variance \(16\).
What can you say about the probability that at least 70 customers come on any given day?
What can you say about the probability that the number of customers coming to the station in a given day is between 50 and 70?
Using Markov’s inequality, since \(\mathbb{P}( X \ge 0 ) = 1\): \[ \mathbb{P}\Bigl( X \ge 70 \Bigr) \le \frac{60}{70} \approx 0.86 \]
Using Chebyshev’s inequality \[ \begin{aligned} \mathbb{P}\Bigl( 50 < X < 70 \Bigr) &= \mathbb{P}\Bigl( | X - 60 | < 10 \Bigr) \\ & \\ &= 1 - \mathbb{P}\Bigl( | X - 60 | \ge 10 \Bigr) \\ & \\ & \ge 1 - \frac{16}{10^2} = 0.84 \end{aligned} \]
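Both bounds reduce to one-line computations; a minimal Python sketch (variable names are mine):

```python
# Station example: mean = 60 customers, variance = 16 (values from the notes).
mean, var = 60, 16

markov_bound = mean / 70             # P(X >= 70) <= 60/70 ~ 0.857
chebyshev_bound = 1 - var / 10**2    # P(50 < X < 70) >= 1 - 16/100 = 0.84

print(f"Markov:    P(X >= 70)     <= {markov_bound:.3f}")
print(f"Chebyshev: P(50 < X < 70) >= {chebyshev_bound:.2f}")
```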
Example: A measurement of the distance to a distant star is a random variable with mean \(\mu\) (the unknown true distance) and variance \(4\) (light years)\(^2\).
An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,...,n\) of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.
How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n-\mu \bigr\vert <0.5\text{ light years} \right) \, \ge \, 0.95 \ ? \]
Recall that \[ \mathbb{E}\left[ \overline{X}_n \right] \, = \, \mu \] and \[ V \left[ \overline{X}_n \right] \, = \, \frac{4}{n} \]
Thus, using Chebyshev’s inequality applied to \(\overline{X}_n\): \[ \mathbb{P}\biggl( \bigl| \overline{X}_n - \mu \bigr| < 1/2 \biggr) \, \ge \, 1 - \frac{4}{n \, (1/2)^2} = 1 - \frac{16}{n} \] Hence, to have \(1 - 16/n \ge 0.95\) she needs \(n \ge 16/0.05 = 320\) independent measurements.
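A minimal Python sketch of this sample-size computation (the variable names are mine):

```python
import math

# Chebyshev-based sample size: require 1 - sigma^2 / (n * eps^2) >= 0.95,
# with sigma^2 = 4 and eps = 0.5 (values from the example).
sigma2, eps, alpha = 4, 0.5, 0.05    # alpha = 1 - 0.95

n = math.ceil(sigma2 / (eps**2 * alpha))   # 4 / (0.25 * 0.05) = 320
print(n)                                   # 320
```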
Weak Law of Large Numbers: Let \(X_{1}, X_{2}, \ldots, X_{n}, \ldots\) be independent and identically distributed (i.i.d.) random variables with finite mean \(\mu\). Then, for all \(\epsilon >0\): \[ \lim_{n\rightarrow \infty } \mathbb{P}\Bigl( \bigl| \overline{X}_n-\mu \bigr| \ge \epsilon \Bigr) \, = \, 0 \] where \[ \overline{X}_n \, = \, \frac{1}{n} \, \sum_{i=1}^n X_i \]
Interpretation: the probability that the average of a large number of independent measurements of a random variable deviates from its expected value by any fixed amount tends to zero as the number of measurements grows.
In Statistics this is recast as follows: the sample average is a consistent estimator for the population mean.
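A small simulation illustrating the statement (a sketch assuming NumPy; the Exponential(1) model and the value \(\epsilon = 0.1\) are illustrative choices):

```python
# Monte Carlo illustration of the WLLN: the probability that the sample mean
# deviates from mu by at least eps shrinks as n grows.
# Illustrative choice: i.i.d. Exponential(1) observations, so mu = 1.
import numpy as np

rng = np.random.default_rng(0)
mu, eps, reps = 1.0, 0.1, 1_000

for n in [10, 100, 1_000, 10_000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) >= eps)
    print(f"n = {n:6d}   estimated P(|Xbar_n - mu| >= {eps}) = {prob:.3f}")
```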
Central Limit Theorem: Let \(X_{1}, X_{2}, \ldots, X_{n}, \ldots\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2 > 0\). Then, for any \(z \in \mathbb{R}\):
\[ \lim_{n\rightarrow \infty } \mathbb{P}\left( \frac{\overline{X}_n-\mu }{\sigma /\sqrt{n}} \ \le \ z \, \right) = \Phi \left( z\right) \] where \(\Phi( z )\) is the CDF of a \(\mathcal{N}(0,1)\) distribution
The CLT states that the distribution of the standardized sample average (which is a random variable) \[ \frac{\overline{X}_n - \mu }{\sigma /\sqrt{n}} \] is approximately that of a \(\mathcal{N}(0, 1)\) r.v.
As a rule of thumb, the approximation is good for \(n \ge 30\) (see the Berry-Esseen Theorem)
In other words: \[ \overline{X}_n \text{ \ is \ approximately \ } \mathcal{N}\left( \mu ,\frac{\sigma ^{2}}{n} \right) \text{ \ for large }n \]
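A small simulation illustrating the CLT (a sketch assuming NumPy/SciPy; the Exponential(1) model and \(n = 50\) are illustrative choices):

```python
# CLT illustration: standardized sample means of n = 50 i.i.d. Exponential(1)
# observations (mu = 1, sigma = 1) compared with the N(0, 1) CDF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, mu, sigma = 50, 20_000, 1.0, 1.0

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))      # standardized sample means

for q in [-1.0, 0.0, 1.0, 1.96]:
    print(f"z = {q:5.2f}   empirical = {np.mean(z <= q):.4f}   Phi(z) = {stats.norm.cdf(q):.4f}")
```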
Back to the astronomer: a measurement of the distance to a distant star is a random variable with mean \(\mu\) (the unknown true distance) and variance \(4\) (light years)\(^2\).
An astronomer will perform several independent measurements, \(X_{i}\), \(i=1,2,\ldots,n\), of the distance and use their average \(\overline{X}_n\) as an estimate for the true distance.
How many measurements must she perform so that \[ \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5 \text{ light years} \right) \, \ge \, 0.95 \ ? \]
\[ \begin{aligned} \mathbb{P}\left( \bigl\vert \overline{X}_n - \mu \bigr\vert <0.5\right) &= \mathbb{P}\left( -0.5 < \overline{X}_n - \mu <0.5\right) \\ & \\ &= \mathbb{P}\left( \frac{-0.5}{2/\sqrt{n}}<\frac{\overline{X}_n - \mu }{2/\sqrt{n}} < \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ & \thickapprox \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -\Phi \left( - \frac{0.5}{2/\sqrt{n}}\right) \\ & \\ &= 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 \end{aligned} \]
\[ \begin{aligned} 2\Phi \left( \frac{0.5}{2/\sqrt{n}}\right) -1 & \ge 0.95 \\ & \\ \Phi \left( \frac{0.5}{2/\sqrt{n}}\right) & \ge 0.975 \\ & \\ \sqrt{n}\frac{0.5}{2} & \ge \Phi ^{-1}\left( 0.975\right) =1.96 \\ & \\ n & \ge \left( \frac{1.96\times 2}{0.5}\right)^{2} \approx 61.47 \end{aligned} \]
Hence she needs \(n \ge 62\) measurements.
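A minimal Python sketch of this computation (assuming SciPy for \(\Phi^{-1}\)):

```python
# CLT-based sample size: smallest n with 2 * Phi(0.5 * sqrt(n) / 2) - 1 >= 0.95.
import math
from scipy import stats

sigma, eps = 2.0, 0.5
z = stats.norm.ppf(0.975)              # Phi^{-1}(0.975) ~ 1.96

n = math.ceil((z * sigma / eps) ** 2)  # (z * sigma / eps)^2 ~ 61.5, so n = 62
print(n)                               # 62
```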
Using Chebyshev’s inequality we conclude we need \[ n \ge 320 \] independent observations.
Using the CLT we conclude we need \[ n \ge 62 \] independent observations
Which one is right? Is either wrong? Why the difference?
Chernoff's bound: Let \(X\) be a r.v. whose MGF \(M_X(t)\) exists for \(t \in (-\epsilon, \epsilon)\), for some \(\epsilon > 0\).
Then, for any \(a \in \mathbb{R}\) and \(t \in (0, \epsilon)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \] (using Markov’s inequality)
Similarly, for any \(t \in (-\epsilon, 0)\) \[ \begin{aligned} \mathbb{P}\Bigl( X \le a \Bigr) & = \mathbb{P}\Bigl( e^{t \, X} \ge e^{t \, a} \Bigr) \\ & \\ & \le \frac{ \mathbb{E}\left[ e^{t \, X} \right] }{ e^{t \, a} } \\ & \\ & = M_X(t) \, e^{-t \, a} \end{aligned} \]
To find the sharpest bound, choose the \(t\) that minimizes \[ H(t) = M_X(t) \, e^{-t \, a} \] over \(t \in (0, \epsilon)\) (for the upper tail) or \(t \in (-\epsilon, 0)\) (for the lower tail)
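This minimization can also be done numerically. A sketch assuming SciPy, where the standard normal MGF and the value \(a = 3\) are illustrative choices (the Poisson case is worked out analytically below):

```python
# Numerical minimization of H(t) = M_X(t) * exp(-t * a) over t > 0.
# Illustrative choice: standard normal X, where M_X(t) = exp(t^2 / 2) and
# the closed-form optimum (attained at t = a) is exp(-a^2 / 2).
import numpy as np
from scipy.optimize import minimize_scalar

a = 3.0

def H(t):
    return np.exp(t**2 / 2 - t * a)

res = minimize_scalar(H, bounds=(1e-6, 10.0), method="bounded")
print(f"numerical bound   = {res.fun:.6f} at t = {res.x:.3f}")
print(f"closed-form bound = {np.exp(-a**2 / 2):.6f}")
```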
Example: let \(X \thicksim \mathcal{P}(\lambda)\), so that \(M_X(t) = e^{\lambda \, \left( e^t - 1 \right)}\); thus, for any \(a > 0\) and \(t > 0\), \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le \, e^{\lambda \, \left( e^t - 1 \right) } \, e^{-a \, t} \\ & = e^{\lambda \, \left( e^t - 1 \right) - a \, t } \end{aligned} \]
The minimum occurs when \[ t \, = \, \log \left( a / \lambda \right) \] (check that this is indeed a minimum)
We also need \(t > 0\), so the minimum occurs on \(\mathbb{R}_+\) if \(a > \lambda\).
In that case (\(a > \lambda\)): \[ \begin{aligned} \mathbb{P}\Bigl( X \ge a \Bigr) \, & \le e^{\lambda \, \left( a / \lambda - 1 \right) - a \, \log( a / \lambda) } \\ & = e^{ a \, - \lambda} \, \left( \lambda / a \right)^a \\ & = e^{ - \lambda} \, \left( e \, \lambda / a \right)^a \end{aligned} \]
Compare the bound with the exact calculation:
Let \(X \thicksim \mathcal{P}(10)\). Chernoff’s bound gives \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) \le 0.021 \]
and the exact calculation is: \[ \mathbb{P}\Bigl( X \ge 20 \Bigr) = 1 - \mathbb{P}\Bigl( X \le 19 \Bigr) = 0.00345 \]
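A minimal Python sketch reproducing both numbers (assuming SciPy):

```python
# Poisson(10) tail at a = 20: Chernoff bound e^{-lambda} (e * lambda / a)^a
# versus the exact probability P(X >= 20) = 1 - F(19).
import numpy as np
from scipy import stats

lam, a = 10, 20

chernoff = np.exp(-lam) * (np.e * lam / a) ** a   # ~ 0.021
exact = stats.poisson.sf(a - 1, lam)              # ~ 0.00345

print(f"Chernoff bound: {chernoff:.5f}")
print(f"Exact tail:     {exact:.5f}")
```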
A function \(h : \mathcal{I} \to \mathbb{R}\), where \(\mathcal{I} \subseteq \mathbb{R}\) is an interval, is called convex if for any \(x_1, x_2 \in \mathcal{I}\): \[ h \Bigl( a \, x_1 + (1 - a) \, x_2 \Bigr) \, \le \, a \, h( x_1) + (1-a) \, h(x_2) \quad \forall a \in [0, 1] \]
If \(h\) is twice differentiable, then this is equivalent to \(h''(x) \ge 0\) for \(x \in \mathcal{I}\)
Examples: \(h(x) = x^2\), \(h(x) = e^x\), and \(h(x) = |x|\) are convex on \(\mathbb{R}\); \(h(x) = 1/x\) and \(h(x) = -\log(x)\) are convex on \((0, +\infty)\)
Jensen's inequality: Let \(X\) be a random variable with range \(\mathcal{R}_X\)
If \(h\) is a convex function over \(\mathcal{R}_X\), and \(\mathbb{E}\left[ X \right]\) and \(\mathbb{E}\left[ h(X) \right]\) exist and are finite, then \[ \mathbb{E}\Bigl[ h \left( X \right) \Bigr] \, \ge \, h \Bigl( \mathbb{E}\left[ X \right] \Bigr) \]
Examples: \(\mathbb{E}\left[ X^2 \right] \ge \bigl( \mathbb{E}[X] \bigr)^2\), \(\mathbb{E}\left[ e^{X} \right] \ge e^{\mathbb{E}[X]}\), and, if \(\mathbb{P}(X > 0) = 1\), \(\mathbb{E}\left[ 1/X \right] \ge 1 / \mathbb{E}[X]\) (whenever these expectations exist)
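A quick Monte Carlo check of Jensen's inequality (a sketch assuming NumPy; the Exponential(1) model and the convex functions \(x^2\) and \(e^{x/2}\) are illustrative choices):

```python
# Monte Carlo check of Jensen's inequality with X ~ Exponential(1)
# for two convex functions: x^2 and e^{x/2}.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

print(f"E[X^2]     = {np.mean(x**2):.3f}  >=  (E[X])^2   = {np.mean(x)**2:.3f}")
print(f"E[e^(X/2)] = {np.mean(np.exp(x / 2)):.3f}  >=  e^(E[X]/2) = {np.exp(np.mean(x) / 2):.3f}")
```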
Let \(f\) and \(g\) be two densities or pmf’s. The Kullback-Leibler divergence \(D_{KL}(f, g)\) measures how different \(f\) is from \(g\). Formally, in the continuous and discrete cases respectively: \[ \begin{aligned} D_{KL} \Bigl( f, g \Bigr) \, & = \, \int_{-\infty}^{+\infty} \, f(t) \, \log \left( \frac{f(t)}{g(t)} \right) \, dt \, \\ & \\ D_{KL} \Bigl( f, g \Bigr) \, & = \, \sum_{k \in \mathcal{R}} \, f(k) \, \log \left( \frac{f(k)}{g(k)} \right) \end{aligned} \] (Assume that \(g(a) = 0\) implies \(f(a) = 0\).)
Prove that \(D_{KL} ( f, g ) \ge 0\)
Proof: Define \(h:\mathbb{R} \to \mathbb{R}\) by \[ h(t) \, = \, \frac{g(t)}{f(t)} \] then \[ D_{KL} \Bigl( f, g \Bigr) \, = \, \int _{-\infty}^{+\infty} \, -\log(h(t)) \, f(t) \, dt \ = \ \mathbb{E}_f \left[ -\log(h(X)) \right] \] Since \(-\log\) is convex, Jensen’s inequality gives \[ \mathbb{E}_f \left[ -\log(h(X)) \right] \ \ge \ -\log \Bigl( \mathbb{E}_f \left[ h(X) \right] \Bigr) \, = \, -\log \left( \int_{-\infty}^{+\infty} \frac{g(t)}{f(t)} \, f(t) \, dt \right) \, = \, -\log(1) \, = \, 0 \]
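A quick numerical check of this result for two discrete distributions (a sketch assuming NumPy/SciPy; the two binomial pmf's are illustrative choices):

```python
# Numerical check that D_KL(f, g) >= 0 for two discrete pmfs.
# Illustrative choice: f = Binomial(20, 0.5), g = Binomial(20, 0.3),
# both strictly positive on {0, ..., 20}.
import numpy as np
from scipy import stats

k = np.arange(21)
f = stats.binom.pmf(k, 20, 0.5)
g = stats.binom.pmf(k, 20, 0.3)

d_fg = np.sum(f * np.log(f / g))
d_gf = np.sum(g * np.log(g / f))
print(d_fg, d_gf)   # both nonnegative; note D_KL(f, g) != D_KL(g, f) in general
```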
Stat 302 - Winter 2025/26