04 Bias and variance

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 16 September 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

We just talked about

  • Variance of an estimator.

  • Irreducible error when making predictions.

  • These are 2 of the 3 components of the “Prediction Risk” \(R_n\)

Component 3, the Bias

We need to be specific about what we mean when we say bias.

Bias is neither good nor bad in and of itself.

A very simple example: let \(Y_1,\ \ldots,\ Y_n \sim N(\mu, 1)\).

  • We don’t know \(\mu\), so we try to use the data (the \(Y_i\)’s) to estimate it.

  • I propose 3 estimators:
    1. \(\widehat{\mu}_1 = 12\),

    2. \(\widehat{\mu}_2=Y_6\),

    3. \(\widehat{\mu}_3=\overline{Y}\).

  • The bias (by definition) of my estimator is \(E[\widehat{\mu}_i]-\mu\).

Calculate the bias and variance of each estimator.
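A quick Monte Carlo check (a minimal sketch; the true \(\mu\), \(n\), and number of replications below are arbitrary choices for illustration):

Code
set.seed(406)
mu <- 2        # true mean, chosen arbitrarily for illustration
n <- 20
reps <- 10000

# simulate many datasets; compute all 3 estimators on each
est <- replicate(reps, {
  y <- rnorm(n, mean = mu, sd = 1)
  c(mu_hat1 = 12, mu_hat2 = y[6], mu_hat3 = mean(y))
})

rowMeans(est) - mu    # biases: approximately (12 - mu, 0, 0)
apply(est, 1, var)    # variances: approximately (0, 1, 1 / n)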

Regression in general

If I want to predict \(Y\) from \(X\), it is almost always the case that

\[ \mu(x) = \Expect{Y\given X=x} \neq x^{\top}\beta \]

So the bias of using a linear model is not zero.


Why? Because

\[ \Expect{Y\given X=x}-x^\top\beta \neq \Expect{Y\given X=x} - \mu(x) = 0. \]

We can include as many predictors as we like,

but this doesn’t change the fact that the world is non-linear.
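A small illustration of that bias (a sketch with a made-up nonlinear truth): even with a lot of data, the best linear fit is systematically off wherever \(\mu(x)\) curves.

Code
set.seed(406)
n <- 5000
x <- runif(n, -2, 2)
mu_x <- sin(2 * x)                  # a made-up nonlinear truth
y <- mu_x + rnorm(n, sd = 0.1)

fit <- lm(y ~ x)                    # the best linear approximation
xg <- c(-2, -1, 0, 1, 2)
cbind(truth = sin(2 * xg),
      linear = unname(predict(fit, data.frame(x = xg))))   # the gaps are the bias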

Predicting new Y’s (continued)

Suppose we want to predict \(Y\),

we know \(E[Y]= \mu \in \mathbb{R}\) and \(\textrm{Var}[Y] = 1\).

Our data is \(\{y_1,\ldots,y_n\}\)

We have considered estimating \(\mu\) in various ways, and using \(\widehat{Y} = \widehat{\mu}\)



Let’s try one more: \(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\).

One can show… (wait for the proof)

\(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\)

\[ R_n(\widehat Y_a) = \Expect{(\widehat Y_a-Y)^2} = (1 - a)^2\mu^2 + \frac{a^2}{n} +1 \]

We can minimize this to get the best possible prediction risk for an estimator of the form \(\widehat Y_a\):

\[ \argmin_{a} R_n(\widehat Y_a) = \left(\frac{\mu^2}{\mu^2 + 1/n} \right)\qquad \min_{a} R_n(\widehat Y_a) = 1+\left(\frac{\mu^2}{n\mu^2 + 1} \right). \]
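Filling in the minimization step: differentiate \(R_n(\widehat Y_a)\) with respect to \(a\) and set the derivative to zero,

\[ \frac{\partial}{\partial a} R_n(\widehat Y_a) = -2(1-a)\mu^2 + \frac{2a}{n} = 0 \quad\Longrightarrow\quad a\left(\mu^2 + \frac{1}{n}\right) = \mu^2 \quad\Longrightarrow\quad a^* = \frac{\mu^2}{\mu^2 + 1/n}, \]

and substituting \(a^*\) back into \(R_n\) gives the minimum value above.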

Is this less than or greater than the risk we saw for \(\bar Y\)?

Am I cheating here?

Important

Wait a minute! I’m saying there is a better estimator than \(\overline{Y}_n\)!

Bias-variance tradeoff: Estimating the mean

\[ R_n(\widehat Y_a) = (a - 1)^2\mu^2 + \frac{a^2}{n} + \sigma^2 \]

Code
mu <- 1; n <- 5; sig <- 1

To restate

If \(\mu=\) 1 and \(n=\) 5

then it is better to predict with 0.83 \(\overline{Y}_5\)

than with \(\overline{Y}_5\) itself.

For this \(a =\) 0.83 and \(n=5\)

  1. \(R_5(\widehat{Y}_a) =\) 1.17

  2. \(R_5(\overline{Y}_5)=\) 1.2
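These numbers are easy to reproduce from the risk formula; a minimal check in R:

Code
mu <- 1; n <- 5
a_star <- mu^2 / (mu^2 + 1 / n)                        # 0.833...
risk_a <- (1 - a_star)^2 * mu^2 + a_star^2 / n + 1     # about 1.17
risk_ybar <- 1 / n + 1                                  # 1.2
c(a_star = a_star, risk_a = risk_a, risk_ybar = risk_ybar)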

Bias-variance decomposition

\[R_n(\widehat{Y}_a)=(a - 1)^2\mu^2 + \frac{a^2}{n} + 1\]

  1. prediction risk = \(\textrm{bias}^2\) + variance + irreducible error

  2. estimation risk = \(\textrm{bias}^2\) + variance

What is \(R_n(\widehat{Y}_a)\) for our estimator \(\widehat{Y}_a=a\overline{Y}_n\)?

\[\begin{aligned} \textrm{bias}(\widehat{Y}_a) &= \Expect{a\overline{Y}_n} - \mu=(a-1)\mu\\ \textrm{var}(\widehat{Y}_a) &= \Expect{ \left(a\overline{Y}_n - \Expect{a\overline{Y}_n}\right)^2} =a^2\Expect{\left(\overline{Y}_n-\mu\right)^2}=\frac{a^2}{n} \\ \sigma^2 &= \Expect{(Y-\mu)^2}=1 \end{aligned}\]

This decomposition holds generally

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \Expect{(Y-\mu + \mu - \hat{Y})^2} \\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 2\Expect{(Y-\mu)(\mu-\hat{Y})}\\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{estimation risk}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}] + E[\hat{Y}] - \hat{Y})^2}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 2\Expect{(\mu-E[\hat{Y}])(E[\hat{Y}] - \hat{Y})}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]

Bias-variance decomposition

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \text{irr. error} + \text{estimation risk}\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]

Important

Implication: prediction risk is the estimation risk plus something you can’t control. Defining the estimation risk, however, requires stronger assumptions: we are not always just estimating a parameter.

Tip

In order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance.

Code
# Classic bias/variance "dartboard" figure: a 2x2 grid of targets, where the
# scatter of "shots" shows low/high bias (rows) and low/high variance (columns).
# The colour hex codes below are placeholders for the course theme palette.
blue <- "#2c3e50"; green <- "#2ecc71"; orange <- "#e67e22"
library(mvtnorm)  # for rmvnorm()

par(mfrow = c(2, 2), bty = "n", ann = FALSE, xaxt = "n", yaxt = "n",
    pty = "s", family = "serif", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0))
# row i of mv / va gives the centre / spread of the 20 shots in panel i
mv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)
va <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)

for (i in 1:4) {
  # draw the target as concentric circles, then overlay the shots
  plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42,
       col = blue, ann = FALSE)
  points(0, 0, pch = 19, cex = 30, col = "white")
  points(0, 0, pch = 19, cex = 18, col = green)
  points(0, 0, pch = 19, cex = 6, col = orange)
  points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)
  switch(i,
    {
      mtext("low variance", 3, cex = 2)
      mtext("low bias", 2, cex = 2)
    },
    mtext("high variance", 3, cex = 2),
    mtext("high bias", 2, cex = 2)
  )
}

Bias-variance tradeoff: Overview

bias: how well does \(\widehat{f}(x)\) approximate the truth \(\Expect{Y\given X=x}\)

  • If we allow more complicated possible \(\widehat{f}\), lower bias. Flexibility \(\Rightarrow\) Expressivity

  • But, more flexibility \(\Rightarrow\) larger variance

  • Complicated models are hard to estimate precisely for fixed \(n\)

  • Irreducible error: this component is unaffected by our choice of \(\widehat{f}\)

Sadly, that whole exercise depends on knowing the truth to evaluate the expectations \(\Expect{\cdot}\).
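In a simulation, though, we do know the truth, so we can approximate those expectations by Monte Carlo. A minimal sketch (with a made-up nonlinear \(\mu(x)\) and polynomial fits of increasing flexibility):

Code
set.seed(406)
truth <- function(x) sin(2 * pi * x)    # made-up truth
n <- 30; sigma <- 0.5
xg <- seq(0, 1, length.out = 100)       # grid on which we measure error

avg_risk <- sapply(1:10, function(deg) {
  mean(replicate(200, {
    x <- runif(n); y <- truth(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, deg))
    ystar <- truth(xg) + rnorm(length(xg), sd = sigma)     # new Y's
    mean((ystar - predict(fit, data.frame(x = xg)))^2)     # prediction error
  }))
})
round(avg_risk, 2)   # typically dips (bias falls) then rises (variance grows)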

Next time…

Estimating risk