Stat 406
Daniel J. McDonald
Last modified – 18 September 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]
Variance of an estimator.
Irreducible error when making predictions.
These are two of the three components of the “Prediction Risk” \(R_n\). The third component is the bias.
We need to be specific about what we mean when we say bias.
Bias is neither good nor bad in and of itself.
A very simple example: let \(Z_1,\ \ldots,\ Z_n \sim N(\mu, 1)\).
We don’t know \(\mu\), so we try to use the data (the \(Z_i\)’s) to estimate it.
Consider three estimators:
\(\widehat{\mu}_1 = 12\),
\(\widehat{\mu}_2=Z_6\),
\(\widehat{\mu}_3=\overline{Z}\).
Calculate the bias and variance of each estimator.
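One way to check the hand calculation is by simulation. The sketch below assumes the \(Z_i\) are independent, and it picks \(\mu = 3\) and \(n = 10\) purely for illustration (neither value comes from the slides); \(n\) must be at least 6 so that \(Z_6\) exists.

```r
set.seed(406)
mu <- 3      # assumed true mean (illustrative only)
n <- 10      # assumed sample size; needs n >= 6 so that Z_6 exists
B <- 1e5     # number of simulated data sets

# Each row is one simulated data set Z_1, ..., Z_n
Z <- matrix(rnorm(B * n, mean = mu, sd = 1), nrow = B, ncol = n)

est <- cbind(
  mu1 = rep(12, B),     # constant guess, ignores the data
  mu2 = Z[, 6],         # a single observation
  mu3 = rowMeans(Z)     # the sample mean
)

sim_bias <- colMeans(est) - mu
sim_var <- apply(est, 2, var)

# Theoretical values under the assumptions above
theory_bias <- c(mu1 = 12 - mu, mu2 = 0, mu3 = 0)
theory_var <- c(mu1 = 0, mu2 = 1, mu3 = 1 / n)

round(rbind(sim_bias, theory_bias, sim_var, theory_var), 3)
```

The constant estimator has zero variance but a bias that depends on how far 12 is from \(\mu\); the other two are unbiased, and averaging cuts the variance by a factor of \(n\).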
If I want to predict \(Y\) from \(X\), it is almost always the case that
\[ \mu(x) = \Expect{Y\given X=x} \neq x^{\top}\beta \]
So the bias of using a linear model is not zero.
Why? Because
\[ \Expect{Y\given X=x}-x^\top\beta \neq \Expect{Y\given X=x} - \mu(x) = 0. \]
We can include as many predictors as we like,
but this doesn’t change the fact that the world is non-linear.
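A small simulation can make the nonzero bias concrete. The truth \(\mu(x) = \sin(x)\) with \(X\) uniform on \([-3, 3]\) is a made-up example, not something from the slides. With a very large sample the variance of the fitted line is negligible, so the remaining gap between the fit and \(\mu(x)\) is essentially bias.

```r
set.seed(406)
n <- 1e5
x <- runif(n, -3, 3)
mu <- function(x) sin(x)          # hypothetical nonlinear truth
y <- mu(x) + rnorm(n)             # noise with variance 1

fit <- lm(y ~ x)

# Compare the fitted line with the truth on a coarse grid
grid <- data.frame(x = seq(-3, 3, by = 1))
gap <- predict(fit, newdata = grid) - mu(grid$x)
round(cbind(grid, truth = mu(grid$x), gap = gap), 2)
```

The gap does not shrink as \(n\) grows, because it comes from the shape of the model rather than from noise in the data.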
Suppose we want to predict \(Y\), and
we know \(E[Y]= \mu \in \mathbb{R}\) and \(\textrm{Var}[Y] = 1\).
Our data is \(\{y_1,\ldots,y_n\}\).
We have considered estimating \(\mu\) in various ways, and using \(\widehat{Y} = \widehat{\mu}\).
Let’s try one more: \(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\).
\[ R_n(\widehat Y_a) = \Expect{(\widehat Y_a-Y)^2} = (1 - a)^2\mu^2 + \frac{a^2}{n} +1 \]
We can minimize this in \(a\) to get the best possible prediction risk for an estimator of the form \(\widehat Y_a\):
\[ \argmin_{a} R_n(\widehat Y_a) = \left(\frac{\mu^2}{\mu^2 + 1/n} \right) \]
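For readers who want the intermediate step, the minimizer comes from setting the derivative of the risk with respect to \(a\) to zero. The risk is a convex quadratic in \(a\) (its second derivative is \(2\mu^2 + 2/n > 0\)), so the stationary point is the minimum:

\[ \frac{\partial}{\partial a} R_n(\widehat Y_a) = -2(1-a)\mu^2 + \frac{2a}{n} = 0 \quad\Longrightarrow\quad a\left(\mu^2 + \frac{1}{n}\right) = \mu^2 \quad\Longrightarrow\quad a = \frac{\mu^2}{\mu^2 + 1/n}. \]

For \(\mu \neq 0\) this value is strictly between 0 and 1, so it always shrinks \(\overline{Y}_n\) toward zero.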
What happens if \(\mu \ll 1\)?
Important
Wait a minute! I’m saying there is a better estimator than \(\overline{Y}_n\)!
\[ R_n(\widehat Y_a) = (a - 1)^2\mu^2 + \frac{a^2}{n} + \sigma^2 \]
with \(\sigma^2 = \textrm{Var}[Y] = 1\) in this example.
If \(\mu = 1\) and \(n = 5\),
then it is better to predict with \(0.83\,\overline{Y}_5\)
than with \(\overline{Y}_5\) itself.
For this \(a = 0.83\) and \(n = 5\):
\(R_5(\widehat{Y}_a) = 1.17\)
\(R_5(\overline{Y}_5) = 1.2\)
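These numbers can be reproduced directly from the risk formula. The helper name `risk` is introduced here for illustration only; the call to `optimize()` is a numerical sanity check on the closed-form minimizer.

```r
# Prediction risk of a * Ybar_n when Var[Y] = 1
risk <- function(a, mu, n) (1 - a)^2 * mu^2 + a^2 / n + 1

mu <- 1
n <- 5
a_star <- mu^2 / (mu^2 + 1 / n)   # closed-form minimizer

round(a_star, 2)                  # 0.83
round(risk(a_star, mu, n), 2)     # 1.17
round(risk(1, mu, n), 2)          # 1.2

# Numerical check: minimize the risk over a in (0, 1)
optimize(risk, interval = c(0, 1), mu = mu, n = n)$minimum
```

The gain is small here, but it is strictly positive: a little bias buys a larger reduction in variance.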
(Now using generic prediction function \(f\))
\[ R_n(f) = \Expect{(Y - f(X))^2} \]
Why should we care about \(R_n(f)\)?
👍 Measures predictive accuracy on average.
👍 How much confidence should you have in \(f\)’s predictions.
👍 Compare with other predictors: \(R_n(f)\) vs \(R_n(g)\)
🤮 This is hard: we don’t know the distribution of the data (if I knew the truth, this would be easy).
\[R_n(\widehat{Y}_a)=(a - 1)^2\mu^2 + \frac{a^2}{n} + 1\]
prediction risk = \(\textrm{bias}^2\) + variance + irreducible error
estimation risk = \(\textrm{bias}^2\) + variance
What is \(R_n(\widehat{Y}_a)\) for our estimator \(\widehat{Y}_a=a\overline{Y}_n\)?
\[\begin{aligned} \textrm{bias}(\widehat{Y}_a) &= \Expect{a\overline{Y}_n} - \mu=(a-1)\mu\\ \textrm{var}(\widehat{Y}_a) &= \Expect{ \left(a\overline{Y}_n - \Expect{a\overline{Y}_n}\right)^2} =a^2\Expect{\left(\overline{Y}_n-\mu\right)^2}=\frac{a^2}{n} \\ \sigma^2 &= \Expect{(Y-\mu)^2}=1 \end{aligned}\]
Important
Implication: prediction risk equals estimation risk plus the irreducible error, which does not depend on the estimator, so the same estimator minimizes both. However, defining estimation risk requires stronger assumptions.
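The decomposition can be checked by Monte Carlo. This is a sketch under an extra assumption: the slides only fix the mean and variance of \(Y\), whereas the code draws \(Y\) from a normal distribution so that there is something to simulate. It reuses \(\mu = 1\), \(n = 5\), and \(a = 0.83\) from the worked example.

```r
set.seed(406)
mu <- 1
n <- 5
a <- 0.83
B <- 2e5                               # number of simulated train/test pairs

# Training data: B independent samples of size n, reduced to their means
ybar <- rowMeans(matrix(rnorm(B * n, mean = mu), nrow = B))
y_hat <- a * ybar
y_new <- rnorm(B, mean = mu)           # independent test observation

mc_risk <- mean((y_hat - y_new)^2)     # Monte Carlo prediction risk

bias2 <- (a - 1)^2 * mu^2
variance <- a^2 / n
irreducible <- 1

c(monte_carlo = mc_risk, formula = bias2 + variance + irreducible)
```

The two numbers should agree up to simulation error, which shrinks as `B` grows.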
Tip
In order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance.
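library(mvtnorm)
# The colour names below are defined elsewhere in the course materials;
# these values are stand-ins so the chunk runs on its own.
blue <- "steelblue"
red <- "firebrick"
green <- "forestgreen"
orange <- "darkorange"
cols <- c(blue, red, green, orange)
# pty can only be set through par(), so it is set here rather than in plot()
par(mfrow = c(2, 2), bty = "n", ann = FALSE, xaxt = "n", yaxt = "n",
    family = "serif", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0), pty = "s")
# Row i holds the mean (mv) and the variances (va) of the point cloud in panel i
mv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)
va <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)
for (i in 1:4) {
  # Draw the target as concentric filled circles
  plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42,
       col = blue, ann = FALSE)
  points(0, 0, pch = 19, cex = 30, col = "white")
  points(0, 0, pch = 19, cex = 18, col = green)
  points(0, 0, pch = 19, cex = 6, col = orange)
  # Scatter 20 "shots" around this panel's mean
  points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)
  # Label the top row and the left column only; panel 4 gets no label
  switch(as.character(i),
    "1" = {
      mtext("low variance", 3, cex = 2)
      mtext("low bias", 2, cex = 2)
    },
    "2" = mtext("high variance", 3, cex = 2),
    "3" = mtext("high bias", 2, cex = 2)
  )
}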
bias: how well does \(\widehat{f}(x)\) approximate the truth \(\Expect{Y\given X=x}\)?
If we allow more complicated possible \(\widehat{f}\), we get lower bias. Flexibility \(\Rightarrow\) Expressivity
But more flexibility \(\Rightarrow\) larger variance
Complicated models are hard to estimate precisely for fixed \(n\)
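The trade-off can be seen in a simulation that refits polynomials of increasing degree to fresh data and looks at the fitted value at a single point. Everything specific here is an assumption made for illustration: the truth \(\sin(2x)\), the fixed design on \([-2, 2]\), \(n = 30\), the evaluation point \(x_0 = 1\), and the helper name `fit_at_x0`.

```r
set.seed(406)
mu <- function(x) sin(2 * x)       # hypothetical truth
n <- 30
x <- seq(-2, 2, length.out = n)    # fixed design points
x0 <- 1                            # where we evaluate the fitted function
B <- 1000                          # number of simulated training sets

# Fitted value at x0 across B fresh training sets, for a given polynomial degree
fit_at_x0 <- function(degree) {
  replicate(B, {
    y <- mu(x) + rnorm(n)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = x0))
  })
}

res <- sapply(c(1, 3, 9), function(d) {
  preds <- fit_at_x0(d)
  c(degree = d, bias2 = (mean(preds) - mu(x0))^2, variance = var(preds))
})
round(t(res), 3)
```

In this setup the squared bias at \(x_0\) falls as the degree increases while the variance rises, which is the pattern the slide describes.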
Irreducible error
Sadly, that whole exercise depends on knowing the truth to evaluate \(E\ldots\)
Estimating risk
UBC Stat 406 - 2023