06 Information criteria

Stat 406

Daniel J. McDonald

Last modified – 26 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Generalized CV

Last time we saw a nice trick that works some of the time (OLS, Ridge regression, …):

\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]

This shortcut requires:

  1. \(\widehat{\y} = \widehat{f}(\mathbf{X}) = \mathbf{H}\mathbf{y}\) for some matrix \(\mathbf{H}\).
  2. A technical condition (roughly: refitting with \(y_i\) replaced by its leave-one-out prediction reproduces that prediction at observation \(i\)).


This is another nice trick.

Idea: replace \(h_{ii}\) with \(\frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{1}{n}\textrm{tr}(\mathbf{H})\)

Let’s call \(\textrm{tr}(\mathbf{H})\) the degrees-of-freedom (or just df)

\[\textrm{GCV} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-\textrm{df}/n)^2} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\]
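In R, a minimal sketch of this formula for any fitted model with residuals() and hatvalues() methods (gcv_nice is a hypothetical name, mirroring the cv_nice function below):

# GCV: training MSE inflated by (1 - df/n)^2, where df/n is the average leverage
gcv_nice <- function(mdl) mean(residuals(mdl)^2) / (1 - mean(hatvalues(mdl)))^2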

Where does this stuff come from?

What are hatvalues?

# LOO-CV shortcut: average the squared residuals, each scaled by (1 - leverage)
cv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)

In OLS, \(\widehat{\y} = \X\widehat{\beta} = \X(\X^\top \X)^{-1}\X^\top \y\)

We often call \(\mathbf{H} = \X(\X^\top \X)^{-1}\X^\top\) the Hat matrix, because it puts the hat on \(\y\)

GCV uses \(\textrm{tr}(\mathbf{H})\).

For lm(), this is just p, the number of predictors (Why?)

This is one way of understanding the name degrees-of-freedom
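A quick check with R’s built-in mtcars data (note that tr(H) counts the intercept column as well):

fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)
sum(hatvalues(fit))  # tr(H) = 4
length(coef(fit))    # also 4: intercept + 3 predictors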

Alternative interpretation:

Suppose \(Y_1, \ldots, Y_n\) are independent, with each \(Y_i\) drawn from some distribution with mean \(\mu_i\) and variance \(\sigma^2\)

(remember: in the linear model \(\Expect{Y_i} = x_i^\top \beta = \mu_i\) )

Let \(\widehat{\mathbf{Y}}\) be an estimator of the vector \(\mu = (\mu_1, \ldots, \mu_n)^\top\).

\[\begin{aligned}
& \Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} \\
&= \Expect{\frac{1}{n}\sum (\widehat Y_i-Y_i + Y_i -\mu_i)^2}\\
&= \frac{1}{n}\Expect{\sum (\widehat Y_i-Y_i)^2} + \frac{1}{n}\Expect{\sum (Y_i-\mu_i)^2} + \frac{2}{n}\Expect{\sum (\widehat Y_i-Y_i)(Y_i-\mu_i)}\\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} + \sigma^2 + \frac{2}{n}\sum \left(\Cov{Y_i}{\widehat Y_i} - \sigma^2\right)\\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i},
\end{aligned}\]

using \(\Expect{(\widehat Y_i-Y_i)(Y_i-\mu_i)} = \Cov{Y_i}{\widehat Y_i} - \Var{Y_i} = \Cov{Y_i}{\widehat Y_i} - \sigma^2\) for the cross term.

Alternative interpretation:

\[\Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} = \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}\]

Now, if \(\widehat{\mathbf{Y}} = \mathbf{H} \mathbf{Y}\) for some matrix \(\mathbf{H}\),

\(\sum\Cov{Y_i}{\widehat Y_i} = \textrm{tr}\left(\Cov{\mathbf{Y}}{\mathbf{H}\mathbf{Y}}\right) = \sigma^2 \textrm{tr}(\mathbf{H})\)
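A tiny simulation sketch (with made-up \(\X\), \(\mu\), and \(\sigma\)) to check this identity numerically:

# Verify sum_i Cov(Y_i, Yhat_i) = sigma^2 * tr(H) by Monte Carlo
set.seed(406)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
H <- X %*% solve(crossprod(X), t(X))  # hat matrix
mu <- drop(X %*% c(1, 2, -1, 0.5))
sigma <- 2
sims <- replicate(5000, { Y <- mu + sigma * rnorm(n); c(Y, H %*% Y) })
Ys <- sims[1:n, ]
Yhats <- sims[(n + 1):(2 * n), ]
sum(sapply(1:n, function(i) cov(Ys[i, ], Yhats[i, ])))  # roughly ...
sigma^2 * sum(diag(H))                                  # ... sigma^2 * tr(H) = 16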

This gives Mallows’ \(C_p\), a.k.a. Stein’s Unbiased Risk Estimator (SURE):

\(MSE + 2\hat{\sigma}^2\textrm{df}/n\)
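A minimal sketch in the same style as cv_nice above, assuming an estimate sigma2_hat of \(\sigma^2\) is supplied (cp_nice is a hypothetical name):

# Mallows' Cp / SURE: training MSE plus the optimism penalty 2 * sigma^2 * df / n
cp_nice <- function(mdl, sigma2_hat) {
  df <- sum(hatvalues(mdl))  # tr(H)
  mean(residuals(mdl)^2) + 2 * sigma2_hat * df / nobs(mdl)
}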

Important

Unfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course.

AIC and BIC

These have a very similar flavor to \(C_p\), but their genesis is different.

Without going into too much detail, they look like

\(\textrm{AIC}/n = -2\textrm{loglikelihood}/n + 2\textrm{df}/n\)

\(\textrm{BIC}/n = -2\textrm{loglikelihood}/n + \log(n)\textrm{df}/n\)

In the case of a linear model with Gaussian errors and \(p\) predictors

\[\begin{aligned} \textrm{AIC}/n &= \log(2\pi) + \log(RSS/n) + 2(p+1)/n \\ &\propto \log(RSS) + 2(p+1)/n \end{aligned}\]

(\(p+1\) because of the unknown variance; whether the intercept is counted in \(p\) is a matter of convention)

Important

Unfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.

Forms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.
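Given that caveat, here is a minimal sketch that reproduces the convention used by R’s stats::AIC() and stats::BIC() for a Gaussian lm() fit (aic_by_hand is a hypothetical helper name):

# -2 loglikelihood + penalty * (number of parameters), counting the variance
aic_by_hand <- function(mdl, penalty = 2) {
  n <- nobs(mdl)
  rss <- sum(residuals(mdl)^2)
  k <- length(coef(mdl)) + 1  # coefficients plus the unknown variance
  n * (log(2 * pi) + log(rss / n) + 1) + penalty * k
}
fit <- lm(mpg ~ wt + hp, data = mtcars)
aic_by_hand(fit)                            # should match AIC(fit)
aic_by_hand(fit, penalty = log(nobs(fit)))  # should match BIC(fit)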

Over-fitting vs. Under-fitting

Over-fitting means estimating a really complicated function when you don’t have enough data.

This is likely a low-bias / high-variance situation.

Under-fitting means estimating a really simple function when you have lots of data.

This is likely a high-bias / low-variance situation.

Both of these outcomes are bad (they have high risk \(=\) big \(R_n\) ).

The best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be.

Recommendations

When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.

CV is usually easiest to make sense of and doesn’t depend on other unknown parameters.

But, it requires refitting the model.

Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.

High-level intuition of these:

  • GCV tends to choose “dense” models.

  • Theory says AIC chooses the “best predicting model” asymptotically.

  • Theory says BIC should choose the “true model” asymptotically; it tends to select fewer predictors.

  • In some special cases, AIC = Cp = SURE \(\approx\) LOO-CV

  • As a technical point, CV (or a validation set) estimates the error on new data, unseen \((X_0, Y_0)\), while AIC / \(C_p\) estimate the error on new \(Y\) at the observed \(x_1,\ldots,x_n\). This is subtle.
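As a small illustration of these criteria side by side (reusing cv_nice from above and the built-in mtcars data):

# Compare a small and a larger linear model under several criteria
f_small <- lm(mpg ~ wt, data = mtcars)
f_big   <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
sapply(list(small = f_small, big = f_big), cv_nice)  # LOO-CV via the shortcut
sapply(list(small = f_small, big = f_big), AIC)
sapply(list(small = f_small, big = f_big), BIC)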

My recommendation:

Use CV

A few more caveats

It is often tempting to “just compare” risk estimates from vastly different models.

For example,

  • different transformations of the predictors,

  • different transformations of the response,

  • Poisson likelihood vs. Gaussian likelihood in glm()

This is not always justified.

  1. The “high-level intuition” is for “nested” models.

  2. Different likelihoods aren’t comparable.

  3. Residuals / response variables on different scales aren’t directly comparable.

“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks.

Next time …

Greedy selection