04 Bias and variance

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 16 September 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

We just talked about

Variance of an estimator.
Irreducible error when making predictions.
These are 2 of the 3 components of the “Prediction Risk” \(R_n\)

Component 3, the Bias

We need to be specific about what we mean when we say bias.

Bias is neither good nor bad in and of itself.

A very simple example: let \(Y_1,\ \ldots,\ Y_n \sim N(\mu, 1)\).

We don’t know \(\mu\), so we try to use the data (the \(Y_i\)’s) to estimate it.

I propose 3 estimators:
1. \(\widehat{\mu}_1 = 12\),
2. \(\widehat{\mu}_2=Y_6\),
3. \(\widehat{\mu}_3=\overline{Y}\).
The bias (by definition) of estimator \(\widehat{\mu}_i\) is \(E[\widehat{\mu}_i]-\mu\).

Calculate the bias and variance of each estimator.
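One way to check a hand calculation is by simulation. The sketch below assumes illustrative values \(\mu = 3\) and \(n = 10\) (neither is given above; \(n \geq 6\) is needed so that \(Y_6\) exists) and approximates each estimator's bias and variance over many simulated datasets.

```r
set.seed(406)
mu <- 3      # assumed true mean, for illustration only
n <- 10      # assumed sample size (at least 6 so that Y_6 exists)
reps <- 1e5  # number of simulated datasets

# Each row is one dataset Y_1, ..., Y_n drawn i.i.d. from N(mu, 1)
Y <- matrix(rnorm(reps * n, mean = mu, sd = 1), nrow = reps)

estimates <- cbind(
  mu_hat_1 = rep(12, reps),  # ignores the data entirely
  mu_hat_2 = Y[, 6],         # a single observation
  mu_hat_3 = rowMeans(Y)     # the sample mean
)

# Monte Carlo approximations of E[mu_hat] - mu and Var[mu_hat]
bias_sim <- colMeans(estimates) - mu
var_sim <- apply(estimates, 2, var)
round(rbind(bias = bias_sim, variance = var_sim), 3)
```

The simulated values should sit close to the exact ones: biases \(12-\mu\), \(0\), \(0\) and variances \(0\), \(1\), \(1/n\).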

Regression in general

If I want to predict \(Y\) from \(X\), it is almost always the case that

\[ \mu(x) = \Expect{Y\given X=x} \neq x^{\top}\beta \]

So the bias of using a linear model is not zero.

Why? Because

\[ \Expect{Y\given X=x}-x^\top\beta \neq \Expect{Y\given X=x} - \mu(x) = 0. \]

We can include as many predictors as we like,

but this doesn’t change the fact that the world is non-linear.

(Continuation) Predicting new Y’s

Suppose we want to predict \(Y\),

we know \(E[Y]= \mu \in \mathbb{R}\) and \(\textrm{Var}[Y] = 1\).

Our data is \(\{y_1,\ldots,y_n\}\)

We have considered estimating \(\mu\) in various ways, and using \(\widehat{Y} = \widehat{\mu}\)

Let’s try one more: \(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\).

One can show… (wait for the proof)

\(\widehat Y_a = a\overline{Y}_n\) for some \(a \in (0,1]\)

\[ R_n(\widehat Y_a) = \Expect{(\widehat Y_a-Y)^2} = (1 - a)^2\mu^2 + \frac{a^2}{n} +1 \]

We can minimize this to get the best possible prediction risk for an estimator of the form \(\widehat Y_a\):

\[ \argmin_{a} R_n(\widehat Y_a) = \left(\frac{\mu^2}{\mu^2 + 1/n} \right)\qquad \min_{a} R_n(\widehat Y_a) = 1+\left(\frac{\mu^2}{n\mu^2 + 1} \right). \]
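A sketch of where these two expressions come from: the risk is a convex quadratic in \(a\), so setting its derivative to zero gives the minimizer (written \(a^*\) here, a label not used elsewhere in these slides).

\[ \begin{aligned} \frac{d}{da} R_n(\widehat Y_a) &= -2(1-a)\mu^2 + \frac{2a}{n} = 0 \\ \Longrightarrow\quad a\left(\mu^2 + \frac{1}{n}\right) &= \mu^2 \\ \Longrightarrow\quad a^* &= \frac{\mu^2}{\mu^2 + 1/n} = \frac{n\mu^2}{n\mu^2+1}. \end{aligned} \]

Substituting back, with \(1-a^* = \frac{1}{n\mu^2+1}\):

\[ (1-a^*)^2\mu^2 + \frac{(a^*)^2}{n} = \frac{\mu^2}{(n\mu^2+1)^2} + \frac{n\mu^4}{(n\mu^2+1)^2} = \frac{\mu^2(1+n\mu^2)}{(n\mu^2+1)^2} = \frac{\mu^2}{n\mu^2+1}. \]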

Is this less than or greater than the risk we saw for \(\bar Y\)?

Am I cheating here?

Important

Wait a minute! I’m saying there is a better estimator than \(\overline{Y}_n\)!

Bias-variance tradeoff: Estimating the mean

\[ R_n(\widehat Y_a) = (a - 1)^2\mu^2 + \frac{a^2\sigma^2}{n} + \sigma^2 \]

mu = 1; n = 5; sig = 1

To restate

If \(\mu=\) 1 and \(n=\) 5

then it is better to predict with 0.83 \(\overline{Y}_5\)

than with \(\overline{Y}_5\) itself.

For this \(a =\) 0.83 and \(n=5\)

\(R_5(\widehat{Y}_a) =\) 1.17
\(R_5(\overline{Y}_5)=\) 1.2
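These numbers can be reproduced directly from the risk formula. The sketch below assumes \(\textrm{Var}[Y] = 1\) as above; the helper name `risk` is hypothetical, and `optimize()` is used only as a numerical cross-check of the closed-form minimizer.

```r
# Prediction risk of a * Ybar_n when Var[Y] = 1 (formula from the slides)
risk <- function(a, mu, n) (1 - a)^2 * mu^2 + a^2 / n + 1

mu <- 1
n <- 5

a_star <- mu^2 / (mu^2 + 1 / n)  # closed-form minimizer
# Numerical minimizer over [0, 1]; mu and n are passed through to risk()
a_num <- optimize(risk, interval = c(0, 1), mu = mu, n = n)$minimum

round(c(a_star = a_star, a_num = a_num), 2)
round(c(shrunk = risk(a_star, mu, n), sample_mean = risk(1, mu, n)), 2)
```

Both minimizers should round to the 0.83 quoted above, and the two risks to 1.17 and 1.2.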

Bias-variance decomposition

\[R_n(\widehat{Y}_a)=(a - 1)^2\mu^2 + \frac{a^2}{n} + 1\]

prediction risk = \(\textrm{bias}^2\) + variance + irreducible error
estimation risk = \(\textrm{bias}^2\) + variance

What is \(R_n(\widehat{Y}_a)\) for our estimator \(\widehat{Y}_a=a\overline{Y}_n\)?

\[\begin{aligned} \textrm{bias}(\widehat{Y}_a) &= \Expect{a\overline{Y}_n} - \mu=(a-1)\mu\\ \textrm{var}(\widehat{Y}_a) &= \Expect{ \left(a\overline{Y}_n - \Expect{a\overline{Y}_n}\right)^2} =a^2\Expect{\left(\overline{Y}_n-\mu\right)^2}=\frac{a^2}{n} \\ \sigma^2 &= \Expect{(Y-\mu)^2}=1 \end{aligned}\]

This decomposition holds generally

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \Expect{(Y-\mu + \mu - \hat{Y})^2} \\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 2\Expect{(Y-\mu)(\mu-\hat{Y})}\\ &= \Expect{(Y-\mu)^2} + \Expect{(\mu - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{estimation risk}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}] + E[\hat{Y}] - \hat{Y})^2}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 2\Expect{(\mu-E[\hat{Y}])(E[\hat{Y}] - \hat{Y})}\\ &= \sigma^2 + \Expect{(\mu - E[\hat{Y}])^2} + \Expect{(E[\hat{Y}] - \hat{Y})^2} + 0\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]
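The two cross terms vanish for different reasons. A sketch, assuming (as is standard, though not stated explicitly above) that the new observation \(Y\) is independent of the data used to build \(\hat{Y}\):

\[ \begin{aligned} \Expect{(Y-\mu)(\mu-\hat{Y})} &= \Expect{Y-\mu}\,\Expect{\mu-\hat{Y}} = 0 \cdot \Expect{\mu-\hat{Y}} = 0 && \text{(independence and } \Expect{Y}=\mu\text{)}\\ \Expect{(\mu-E[\hat{Y}])(E[\hat{Y}]-\hat{Y})} &= (\mu-E[\hat{Y}])\,\Expect{E[\hat{Y}]-\hat{Y}} = 0 && \text{(} \mu-E[\hat{Y}] \text{ is a constant)} \end{aligned} \]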

Bias-variance decomposition

\[\begin{aligned} R_n(\hat{Y}) &= \Expect{(Y-\hat{Y})^2} \\ &= \text{irr. error} + \text{estimation risk}\\ &= \text{irr. error} + \text{squared bias} + \text{variance} \end{aligned}\]

Important

Implication: prediction risk is estimation risk plus something you can’t control. However, defining estimation risk requires stronger assumptions (not always just estimating a parameter).

Tip

In order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance.

Code

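library(mvtnorm)  # provides rmvnorm()

# Target colours. These are not defined in this snippet's original form;
# plain named colours are assumed here so that the code runs on its own.
blue <- "blue"; red <- "red"; green <- "green"; orange <- "orange"
cols <- c(blue, red, green, orange)

# 2 x 2 grid of square panels without axes or annotations.
# pty = "s" can only be set through par(), so it lives here rather than in plot().
par(mfrow = c(2, 2), bty = "n", ann = FALSE, xaxt = "n", yaxt = "n", pty = "s",
    family = "serif", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0))

# Row i holds the mean (the bias) and the diagonal variances for panel i
mv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)
va <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)

for (i in 1:4) {
  # Draw the target as concentric discs, largest first
  plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42,
       col = blue, ann = FALSE)
  points(0, 0, pch = 19, cex = 30, col = "white")
  points(0, 0, pch = 19, cex = 18, col = green)
  points(0, 0, pch = 19, cex = 6, col = orange)
  # 20 "shots" centred at mv[i, ] with covariance diag(va[i, ])
  points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)
  # A numeric switch() picks the i-th alternative by position; panel 4 gets no label
  switch(i,
    "1" = {
      mtext("low variance", 3, cex = 2)
      mtext("low bias", 2, cex = 2)
    },
    "2" = mtext("high variance", 3, cex = 2),
    "3" = mtext("high bias", 2, cex = 2)
  )
}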

Bias-variance tradeoff: Overview

bias: how well does \(\widehat{f}(x)\) approximate the truth \(\Expect{Y\given X=x}\)

If we allow more complicated candidates for \(\widehat{f}\), the bias is lower. Flexibility \(\Rightarrow\) Expressivity
But, more flexibility \(\Rightarrow\) larger variance
Complicated models are hard to estimate precisely for fixed \(n\)
Irreducible error
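The points above can be illustrated with a small simulation. This is only a sketch under assumed settings that do not come from the slides: the "truth" is taken to be \(\sin(2\pi x)\), with \(n = 30\) fixed design points, noise standard deviation 0.3, and polynomial fits of degree 1, 3 and 9 standing in for increasingly flexible \(\widehat{f}\). The helper `bias_var` is hypothetical.

```r
set.seed(406)
f <- function(x) sin(2 * pi * x)  # assumed truth E[Y | X = x]
n <- 30
sigma <- 0.3
reps <- 500
x <- seq(0, 1, length.out = n)                    # fixed design
x0 <- data.frame(x = seq(0.05, 0.95, by = 0.05))  # evaluation points

bias_var <- function(degree) {
  # Column r holds the fitted curve at x0 from the r-th simulated dataset
  preds <- replicate(reps, {
    d <- data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
    predict(lm(y ~ poly(x, degree), data = d), newdata = x0)
  })
  # Average squared bias and average variance over the evaluation points
  c(bias2 = mean((rowMeans(preds) - f(x0$x))^2),
    variance = mean(apply(preds, 1, var)))
}

results <- sapply(c(linear = 1, cubic = 3, degree9 = 9), bias_var)
round(results, 4)
```

Under these assumptions the squared bias should shrink as the degree grows while the variance grows, which is the tradeoff described above. Note that the simulation can only compute the bias because it knows `f`.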

Sadly, that whole exercise depends on knowing the truth to evaluate \(E\ldots\)

Next time…

Estimating risk