04 Bias and variance

Stat 406

Daniel J. McDonald

Last modified – 18 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]

We just talked about

  • Variance of an estimator.

  • Irreducible error when making predictions.

  • These are 2 of the 3 components of the “Prediction Risk” \(R_n\)

Component 3, the Bias

We need to be specific about what we mean when we say bias.

Bias is neither good nor bad in and of itself.

A very simple example: let \(Z_1,\ \ldots,\ Z_n \sim N(\mu, 1)\).

  • We don’t know \(\mu\), so we try to use the data (the \(Z_i\)’s) to estimate it.

  • I propose 3 estimators:
    1. \(\widehat{\mu}_1 = 12\),

    2. \(\widehat{\mu}_2=Z_6\),

    3. \(\widehat{\mu}_3=\overline{Z}\).

  • The bias (by definition) of estimator \(\widehat{\mu}_i\) is \(E[\widehat{\mu}_i]-\mu\).

Calculate the bias and variance of each estimator.
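One way to check an answer to this exercise is by simulation. The sketch below assumes the \(Z_i\) are independent and picks illustrative values \(\mu = 3\) and \(n = 10\) (neither is from the text; any \(n \geq 6\) works, since \(\widehat{\mu}_2\) needs \(Z_6\)).

```r
set.seed(406)
mu <- 3         # assumed true mean (illustrative only)
n <- 10         # assumed sample size; must be at least 6 so that Z_6 exists
n_sims <- 10000 # number of simulated data sets

# Each row is one simulated data set Z_1, ..., Z_n (independence is assumed)
Z <- matrix(rnorm(n_sims * n, mean = mu, sd = 1), nrow = n_sims)

# The three estimators, computed on every simulated data set
estimates <- cbind(
  mu_hat_1 = rep(12, n_sims), # ignores the data
  mu_hat_2 = Z[, 6],          # a single observation
  mu_hat_3 = rowMeans(Z)      # the sample mean
)

bias <- colMeans(estimates) - mu
variance <- apply(estimates, 2, var)
round(rbind(bias, variance), 2)
```

Under these assumptions the simulated values should be close to the exact ones: bias \(12 - \mu\), \(0\), \(0\) and variance \(0\), \(1\), \(1/n\).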

Regression in general

If I want to predict \(Y\) from \(X\), it is almost always the case that

\[ \mu(x) = \Expect{Y\given X=x} \neq x^{\top}\beta \]

So the bias of using a linear model is not zero.


Why? Because

\[ \Expect{Y\given X=x}-x^\top\beta \neq \Expect{Y\given X=x} - \mu(x) = 0. \]

We can include as many predictors as we like,

but this doesn’t change the fact that the world is non-linear.
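A small simulation can illustrate the point. Everything specific here is an assumption made for illustration: the truth \(\mu(x) = \sin(2\pi x)\), uniform \(X\), noise standard deviation 0.5, and the evaluation point \(x_0 = 0.25\).

```r
set.seed(406)
mu_fun <- function(x) sin(2 * pi * x) # hypothetical non-linear truth
x0 <- 0.25                            # point at which we measure the bias
n <- 200                              # training-set size
n_sims <- 500                         # number of simulated training sets

# For each simulated training set, fit a line and predict at x0
preds <- replicate(n_sims, {
  dat <- data.frame(x = runif(n))
  dat$y <- mu_fun(dat$x) + rnorm(n, sd = 0.5)
  fit <- lm(y ~ x, data = dat)
  unname(predict(fit, newdata = data.frame(x = x0)))
})

# Average prediction minus the truth: the bias of the linear fit at x0
bias_at_x0 <- mean(preds) - mu_fun(x0)
bias_at_x0
```

In this setup the bias at \(x_0\) stays well away from zero however large \(n\) is, because no straight line matches the sine curve.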

(Continuation) Predicting new Y’s

Suppose we want to predict \(Y\), and we know that \(E[Y]= \mu \in \mathbb{R}\) (with \(\mu\) unknown) and \(\textrm{Var}[Y] = 1\).

Our data is \(\{y_1,\ldots,y_n\}\)

We have considered estimating \(\mu\) in various ways, and using \(\widehat{Y} = \widehat{\mu}\)



Let’s try one more: \(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\).

One can show… (wait for the proof)


\[ R_n(\widehat Y_a) = \Expect{(\widehat Y_a-Y)^2} = (1 - a)^2\mu^2 + \frac{a^2}{n} +1 \]

We can minimize this in \(a\) to get the best possible prediction risk for an estimator of the form \(\widehat Y_a\):

\[ \argmin_{a} R_n(\widehat Y_a) = \left(\frac{\mu^2}{\mu^2 + 1/n} \right) \]
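A short derivation sketch of this minimizer, using only the risk formula above: differentiate in \(a\) and set the derivative to zero.

\[
\begin{aligned}
\frac{\partial}{\partial a} R_n(\widehat Y_a) &= -2(1-a)\mu^2 + \frac{2a}{n} = 0 \\
\Longrightarrow\quad a\left(\mu^2 + \frac{1}{n}\right) &= \mu^2 \\
\Longrightarrow\quad a &= \frac{\mu^2}{\mu^2 + 1/n}.
\end{aligned}
\]

The second derivative is \(2\mu^2 + 2/n > 0\), so this stationary point is a minimum. It is strictly less than 1 for every finite \(n\).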

What happens if \(\mu \ll 1\)?

Important

Wait a minute! I’m saying there is a better estimator than \(\overline{Y}_n\)!

Bias-variance tradeoff: Estimating the mean

\[ R_n(\widehat Y_a) = (a - 1)^2\mu^2 + \frac{a^2}{n} + \sigma^2 \]

mu = 1; n = 5; sig = 1

To restate

If \(\mu = 1\) and \(n = 5\), then it is better to predict with \(0.83\,\overline{Y}_5\) than with \(\overline{Y}_5\) itself.

For this \(a = 0.83\) and \(n = 5\):

  1. \(R_5(\widehat{Y}_a) =\) 1.17

  2. \(R_5(\overline{Y}_5)=\) 1.2
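These numbers can be reproduced directly from the risk formula. The sketch below assumes \(\textrm{Var}[Y] = 1\) as in the text; the function name `risk` and the use of `optimize()` as a numerical cross-check are illustrative choices.

```r
mu <- 1
n <- 5

# Prediction risk of a * Ybar_n, assuming Var[Y] = 1 as in the text
risk <- function(a, mu, n) (a - 1)^2 * mu^2 + a^2 / n + 1

# Closed-form minimizer and a numerical cross-check over a in [0, 1]
a_star <- mu^2 / (mu^2 + 1 / n)
a_num <- optimize(risk, interval = c(0, 1), mu = mu, n = n)$minimum

round(c(a_star = a_star, a_numeric = a_num), 2)
round(c(risk_at_a_star = risk(a_star, mu, n), risk_at_1 = risk(1, mu, n)), 2)
```

The rounded values match the ones quoted above: \(a \approx 0.83\), with risks 1.17 and 1.2.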

Prediction risk

(Now using generic prediction function \(f\))

\[ R_n(f) = \Expect{(Y - f(X))^2} \]

Why should we care about \(R_n(f)\)?

👍 Measures predictive accuracy on average.

👍 How much confidence should you have in \(f\)’s predictions.

👍 Compare with other predictors: \(R_n(f)\) vs \(R_n(g)\)

🤮 This is hard: Don’t know the distribution of the data (if I knew the truth, this would be easy)

Bias-variance decomposition

\[R_n(\widehat{Y}_a)=(a - 1)^2\mu^2 + \frac{a^2}{n} + 1\]

  1. prediction risk = \(\textrm{bias}^2\) + variance + irreducible error

  2. estimation risk = \(\textrm{bias}^2\) + variance

What is \(R_n(\widehat{Y}_a)\) for our estimator \(\widehat{Y}_a=a\overline{Y}_n\)?

\[\begin{aligned} \textrm{bias}(\widehat{Y}_a) &= \Expect{a\overline{Y}_n} - \mu=(a-1)\mu\\ \textrm{var}(\widehat{Y}_a) &= \Expect{ \left(a\overline{Y}_n - \Expect{a\overline{Y}_n}\right)^2} =a^2\Expect{\left(\overline{Y}_n-\mu\right)^2}=\frac{a^2}{n} \\ \sigma^2 &= \Expect{(Y-\mu)^2}=1 \end{aligned}\]
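A Monte Carlo check of these three pieces is sketched below. The text only fixes the mean and variance of \(Y\), so the normal distribution here is an assumption, as are the values \(\mu = 1\), \(n = 5\), \(a = 0.83\) (borrowed from the earlier example).

```r
set.seed(406)
mu <- 1
n <- 5
a <- 0.83
n_sims <- 1e5 # number of simulated (training set, new Y) pairs

# Assumption: Y is normal with mean mu and variance 1
ybar <- rowMeans(matrix(rnorm(n_sims * n, mean = mu), nrow = n_sims))
y_hat <- a * ybar                 # the predictor a * Ybar_n
y_new <- rnorm(n_sims, mean = mu) # new Y, independent of the training data

pred_risk <- mean((y_new - y_hat)^2)
sq_bias <- (mean(y_hat) - mu)^2
variance <- var(y_hat)
irr_error <- mean((y_new - mu)^2)

c(prediction_risk = pred_risk, sum_of_parts = sq_bias + variance + irr_error)
```

Up to simulation noise, the two numbers should agree with each other and with \((a-1)^2\mu^2 + a^2/n + 1\).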

This decomposition holds generally

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \Expect{(Y-\mu + \mu - \hat{Y})^2} \\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 2\Expect{(Y-\mu)(\mu-\hat{Y})}\\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{estimation risk}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}] + E[\hat{Y}] - \hat{Y})^2}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 2\Expect{(\mu-E[\hat{Y}])(E[\hat{Y}] - \hat{Y})}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]
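The two cross terms vanish for different reasons. A sketch, assuming the new \(Y\) is independent of the data used to build \(\hat{Y}\) and that \(\mu = E[Y]\):

\[
\begin{aligned}
\Expect{(Y-\mu)(\mu-\hat{Y})} &= \Expect{Y-\mu}\,\Expect{\mu-\hat{Y}} = 0 \cdot \Expect{\mu-\hat{Y}} = 0,\\
\Expect{(\mu-E[\hat{Y}])(E[\hat{Y}]-\hat{Y})} &= (\mu-E[\hat{Y}])\,\Expect{E[\hat{Y}]-\hat{Y}} = (\mu-E[\hat{Y}])\cdot 0 = 0.
\end{aligned}
\]

The first line uses independence; the second uses only that \(\mu - E[\hat{Y}]\) is a constant.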

Bias-variance decomposition

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \text{irr. error} + \text{estimation risk}\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]

Important

Implication: prediction risk is estimation risk plus the irreducible error, which does not depend on the predictor, so the same predictor minimizes both. However, defining estimation risk requires stronger assumptions.

Tip

In order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance.

Code
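library(mvtnorm) # provides rmvnorm()

# Assumption: plain colour names stand in for the colour variables
# (blue, red, green, orange), which are not defined in this snippet.
blue <- "blue"
red <- "red"
green <- "green"
orange <- "orange"
cols <- c(blue, red, green, orange)

# 2 x 2 grid of square panels with no axes or annotation
par(mfrow = c(2, 2), bty = "n", ann = FALSE, xaxt = "n", yaxt = "n",
    family = "serif", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0), pty = "s")

# One row per panel: centre of the shots (bias) and their variances
mv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)
va <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)

for (i in 1:4) {
  # Draw the target as concentric filled circles
  plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42,
       col = blue, ann = FALSE)
  points(0, 0, pch = 19, cex = 30, col = "white")
  points(0, 0, pch = 19, cex = 18, col = green)
  points(0, 0, pch = 19, cex = 6, col = orange)
  # 20 shots centred at mv[i, ] with covariance diag(va[i, ])
  points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)
  # Label the outer margins: columns by variance, rows by bias
  switch(as.character(i),
    "1" = {
      mtext("low variance", 3, cex = 2)
      mtext("low bias", 2, cex = 2)
    },
    "2" = mtext("high variance", 3, cex = 2),
    "3" = mtext("high bias", 2, cex = 2)
  )
}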

Bias-variance tradeoff: Overview

bias: how well does \(\widehat{f}(x)\) approximate the truth \(\Expect{Y\given X=x}\)?

  • If we allow more complicated candidates for \(\widehat{f}\), we get lower bias. Flexibility \(\Rightarrow\) Expressivity

  • But, more flexibility \(\Rightarrow\) larger variance

  • Complicated models are hard to estimate precisely for fixed \(n\)

  • Irreducible error
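The tradeoff in this list can be seen in a small simulation. This is a sketch under assumptions chosen purely for illustration: the truth \(\sin(2\pi x)\), uniform \(X\), noise standard deviation 0.5, \(n = 50\), polynomial fits of degree 1, 3, and 10, and evaluation at \(x_0 = 0.25\).

```r
set.seed(406)
mu_fun <- function(x) sin(2 * pi * x) # hypothetical truth E[Y | X = x]
x0 <- 0.25                            # point at which we evaluate the fits
n <- 50                               # fixed training-set size
n_sims <- 500                         # number of simulated training sets
degrees <- c(1, 3, 10)                # increasing flexibility

results <- sapply(degrees, function(d) {
  # Predictions at x0 from a degree-d polynomial fit, one per training set
  preds <- replicate(n_sims, {
    dat <- data.frame(x = runif(n))
    dat$y <- mu_fun(dat$x) + rnorm(n, sd = 0.5)
    fit <- lm(y ~ poly(x, d), data = dat)
    unname(predict(fit, newdata = data.frame(x = x0)))
  })
  c(sq_bias = (mean(preds) - mu_fun(x0))^2, variance = var(preds))
})
colnames(results) <- paste0("degree_", degrees)
round(results, 3)
```

In this setup the squared bias falls as the degree grows while the variance rises. Note that computing either quantity relies on `mu_fun`, the truth, which is only available because the data are simulated.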

Sadly, that whole exercise depends on knowing the truth to evaluate \(E\ldots\)

Next time…

Estimating risk