06 Information criteria

Stat 406

Daniel J. McDonald

Last modified – 26 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Generalized CV

Last time we saw a nice trick that works some of the time (OLS, Ridge regression, …):

\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]

This shortcut requires:

  1. \(\widehat{\y} = \widehat{f}(\mathbf{X}) = \mathbf{H}\mathbf{y}\) for some matrix \(\mathbf{H}\).
  2. A technical condition (roughly: refitting with \(y_i\) replaced by its leave-one-out prediction reproduces that prediction at observation \(i\)).


This is another nice trick.

Idea: replace \(h_{ii}\) with \(\frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{1}{n}\textrm{tr}(\mathbf{H})\)

Let’s call \(\textrm{tr}(\mathbf{H})\) the degrees-of-freedom (or just df)

\[\textrm{GCV} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-\textrm{df}/n)^2} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\]
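In R, a minimal sketch of this formula for any fitted model with residuals() and hatvalues() methods (gcv_nice is a hypothetical name, mirroring the cv_nice function below):

# GCV: training MSE inflated by (1 - df/n)^2, where df/n is the average leverage
gcv_nice <- function(mdl) mean(residuals(mdl)^2) / (1 - mean(hatvalues(mdl)))^2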

Where does this stuff come from?

What are hatvalues?

# LOO-CV shortcut: average the squared residuals, each scaled by (1 - leverage)
cv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)

In OLS, \(\widehat{\y} = \X\widehat{\beta} = \X(\X^\top \X)^{-1}\X^\top \y\)

We often call \(\mathbf{H} = \X(\X^\top \X)^{-1}\X^\top\) the Hat matrix, because it puts the hat on \(\y\)

GCV uses \(\textrm{tr}(\mathbf{H})\).

For lm(), this is just p, the number of predictors (Why?)

This is one way of understanding the name degrees-of-freedom
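A quick check with R’s built-in mtcars data (note that tr(H) counts the intercept column as well):

fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)
sum(hatvalues(fit))  # tr(H) = 4
length(coef(fit))    # also 4: intercept + 3 predictors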

Alternative interpretation:

Suppose \(Y_1, \ldots, Y_n\) are independent, with each \(Y_i\) drawn from some distribution with mean \(\mu_i\) and variance \(\sigma^2\)

(remember: in the linear model \(\Expect{Y_i} = x_i^\top \beta = \mu_i\) )

Let \(\widehat{\mathbf{Y}}\) be an estimator of the vector \(\mu = (\mu_1, \ldots, \mu_n)^\top\).

\[\begin{aligned}
& \Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} \\
&= \Expect{\frac{1}{n}\sum (\widehat Y_i-Y_i + Y_i -\mu_i)^2}\\
&= \frac{1}{n}\Expect{\sum (\widehat Y_i-Y_i)^2} + \frac{1}{n}\Expect{\sum (Y_i-\mu_i)^2} + \frac{2}{n}\Expect{\sum (\widehat Y_i-Y_i)(Y_i-\mu_i)}\\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} + \sigma^2 + \frac{2}{n}\sum \left(\Cov{Y_i}{\widehat Y_i} - \sigma^2\right)\\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i},
\end{aligned}\]

using \(\Expect{(\widehat Y_i-Y_i)(Y_i-\mu_i)} = \Cov{Y_i}{\widehat Y_i} - \Var{Y_i} = \Cov{Y_i}{\widehat Y_i} - \sigma^2\) for the cross term.

Alternative interpretation:

\[\Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} = \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}\]

Now, if \(\widehat{\mathbf{Y}} = \mathbf{H} \mathbf{Y}\) for some matrix \(\mathbf{H}\),

\(\sum\Cov{Y_i}{\widehat Y_i} = \textrm{tr}\left(\Cov{\mathbf{Y}}{\mathbf{H}\mathbf{Y}}\right) = \sigma^2 \textrm{tr}(\mathbf{H})\)
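A tiny simulation sketch (with made-up \(\X\), \(\mu\), and \(\sigma\)) to check this identity numerically:

# Verify sum_i Cov(Y_i, Yhat_i) = sigma^2 * tr(H) by Monte Carlo
set.seed(406)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
H <- X %*% solve(crossprod(X), t(X))  # hat matrix
mu <- drop(X %*% c(1, 2, -1, 0.5))
sigma <- 2
sims <- replicate(5000, { Y <- mu + sigma * rnorm(n); c(Y, H %*% Y) })
Ys <- sims[1:n, ]
Yhats <- sims[(n + 1):(2 * n), ]
sum(sapply(1:n, function(i) cov(Ys[i, ], Yhats[i, ])))  # roughly ...
sigma^2 * sum(diag(H))                                  # ... sigma^2 * tr(H) = 16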

This gives Mallows’ \(C_p\), a.k.a. Stein’s Unbiased Risk Estimator (SURE):

\(MSE + 2\hat{\sigma}^2\textrm{df}/n\)
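A minimal sketch in the same style as cv_nice above, assuming an estimate sigma2_hat of \(\sigma^2\) is supplied (cp_nice is a hypothetical name):

# Mallows' Cp / SURE: training MSE plus the optimism penalty 2 * sigma^2 * df / n
cp_nice <- function(mdl, sigma2_hat) {
  df <- sum(hatvalues(mdl))  # tr(H)
  mean(residuals(mdl)^2) + 2 * sigma2_hat * df / nobs(mdl)
}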

Important

Unfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course.

AIC and BIC

These have a very similar flavor to \(C_p\), but their genesis is different.

Without going into too much detail, they look like

\(\textrm{AIC}/n = -2\textrm{loglikelihood}/n + 2\textrm{df}/n\)

\(\textrm{BIC}/n = -2\textrm{loglikelihood}/n + \log(n)\textrm{df}/n\)

In the case of a linear model with Gaussian errors and \(p\) predictors

\[\begin{aligned} \textrm{AIC}/n &= \log(2\pi) + \log(RSS/n) + 2(p+1)/n \\ &\propto \log(RSS) + 2(p+1)/n \end{aligned}\]

(\(p+1\) because of the unknown variance; whether the intercept is counted in \(p\) is a matter of convention)

Important

Unfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.

Forms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.
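Given that caveat, here is a minimal sketch that reproduces the convention used by R’s stats::AIC() and stats::BIC() for a Gaussian lm() fit (aic_by_hand is a hypothetical helper name):

# -2 loglikelihood + penalty * (number of parameters), counting the variance
aic_by_hand <- function(mdl, penalty = 2) {
  n <- nobs(mdl)
  rss <- sum(residuals(mdl)^2)
  k <- length(coef(mdl)) + 1  # coefficients plus the unknown variance
  n * (log(2 * pi) + log(rss / n) + 1) + penalty * k
}
fit <- lm(mpg ~ wt + hp, data = mtcars)
aic_by_hand(fit)                            # should match AIC(fit)
aic_by_hand(fit, penalty = log(nobs(fit)))  # should match BIC(fit)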

Over-fitting vs. Under-fitting

Over-fitting means estimating a really complicated function when you don’t have enough data.

This is likely a low-bias / high-variance situation.

Under-fitting means estimating a really simple function when you have lots of data.

This is likely a high-bias / low-variance situation.

Both of these outcomes are bad (they have high risk \(=\) big \(R_n\) ).

The best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be.

Recommendations

When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.

CV is usually easiest to make sense of and doesn’t depend on other unknown parameters.

But, it requires refitting the model.

Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.

High-level intuition of these:

  • GCV tends to choose “dense” models.

  • Theory says AIC chooses the “best predicting model” asymptotically.

  • Theory says BIC should choose the “true model” asymptotically; it tends to select fewer predictors.

  • In some special cases, AIC = Cp = SURE \(\approx\) LOO-CV

  • As a technical point, CV (or a validation set) estimates the error on new data, unseen \((X_0, Y_0)\), while AIC / \(C_p\) estimate the error on new \(Y\) at the observed \(x_1,\ldots,x_n\). This is subtle.
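As a small illustration of these criteria side by side (reusing cv_nice from above and the built-in mtcars data):

# Compare a small and a larger linear model under several criteria
f_small <- lm(mpg ~ wt, data = mtcars)
f_big   <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
sapply(list(small = f_small, big = f_big), cv_nice)  # LOO-CV via the shortcut
sapply(list(small = f_small, big = f_big), AIC)
sapply(list(small = f_small, big = f_big), BIC)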

My recommendation:

Use CV

A few more caveats

It is often tempting to “just compare” risk estimates from vastly different models.

For example,

  • different transformations of the predictors,

  • different transformations of the response,

  • Poisson likelihood vs. Gaussian likelihood in glm()

This is not always justified.

  1. The “high-level intuition” is for “nested” models.

  2. Different likelihoods aren’t comparable.

  3. Residuals / response variables on different scales aren’t directly comparable.

“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks.

Next time …

Greedy selection