Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 25 September 2024
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]
LOO-CV has low bias 🎉 and (probably) low variance 🎉
But it's often not practical to use: it requires training \(n\) models 🤮
Consider models where predictions are a linear function of the training responses, i.e.,
\[ \hat{\mathbf y} = {\mathbf H} {\mathbf y} \]
where the \(i\)th row of \(\mathbf H\) collects the weights \(\boldsymbol h_i(\mathbf{X})^\top\) applied to \(\mathbf y\) (the \(\boldsymbol h_i\) can be any functions of the inputs).
For example, OLS:
\[ \hat{\mathbf y} = {\mathbf X} \hat \beta, \qquad \hat\beta = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y \]
By inspection \(\mathbf H = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top\)
For models where predictions are a linear function of the training responses*,
LOO-CV has a closed-form expression! Just need to fit once:
\[\mbox{LOO-CV} \,\, \hat R_n = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i -\widehat{Y}_i)^2}{(1-{\boldsymbol H}_{ii})^2}.\]
*: plus some technicalities
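A quick numerical check of the formula (a sketch on simulated data; the variable names are illustrative, and hatvalues() returns the \(\mathbf H_{ii}\) for an lm() fit):

set.seed(406)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
h <- hatvalues(fit)                          # the H_ii (leverages) from the single fit
loocv_shortcut <- mean(residuals(fit)^2 / (1 - h)^2)
# Brute force for comparison: refit n times, leaving out one observation each time
loocv_brute <- mean(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, subset = -i)
  (y[i] - predict(fit_i, newdata = data.frame(x = x[i])))^2
}))
c(shortcut = loocv_shortcut, brute_force = loocv_brute)  # agree up to rounding error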
Tip
Deriving this sucks. I wouldn’t recommend doing it yourself.
(And can we get a better intuition about what’s going on?)
\[ \hat{\mathbf y} = \mathbf H \mathbf y, \qquad \mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i -\widehat{Y}_i)^2}{(1-{\boldsymbol H}_{ii})^2} \]
Let’s look at OLS again… \[ \hat{\mathbf y} = \mathbf X \hat \beta, \qquad \hat\beta = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y \]
This implies that \(\mathbf H = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top\)
Let’s call \(\tr{\mathbf H} = p\) the degrees-of-freedom (or just df) of our OLS estimator.
(Intuition: we have \(p\) parameters to fit, or \(p\) “degrees of freedom”)
Idea: in our LOO-CV formula, approximate each \({\mathbf H}_{ii}\) with the average \(\frac 1 n \sum_{i=1}^n {\mathbf H}_{ii}\).
Then…
\[ \mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i -\widehat{Y}_i)^2}{(1-{\mathbf H}_{ii})^2} \approx \frac{\text{MSE}}{(1-\text{df}/n)^2} \triangleq \text{GCV} \]
GCV stands for Generalized CV estimate
\[\textrm{GCV} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\]
We can use this formula for models that aren’t of the form \(\widehat{y}_i = \boldsymbol h_i(\mathbf{X})^\top \mathbf{y}\).
(Assuming we have some model-specific formula for estimating \(\textrm{df}\).)
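For a linear model, all the pieces come straight from lm(); a minimal sketch on simulated data (note that summing the leverages recovers \(\textrm{df} = \tr{\mathbf H} = p\)):

set.seed(406)
n <- 100; x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
mse <- mean(residuals(fit)^2)   # training MSE
df  <- sum(hatvalues(fit))      # tr(H); here 2 (intercept + slope)
gcv <- mse / (1 - df / n)^2
gcv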
Let’s see if we can generalize risk estimators from OLS (\(Y \sim \mathcal{N}(X^\top\beta, \sigma^2)\)) in other ways.
Consider the estimation risk of estimating \(\mu_i = X_i^\top\beta\) with \(\hat Y_i = X_i^\top\hat\beta\):
\[R_n = E\left[\frac{1}{n}\sum_{i=1}^n (\hat Y_i - \mu_i)^2\right]\]
Using the usual decomposition tricks:
\[ R_n= \Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} = \underbrace{\frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2}}_{\text{train MSE}} - \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}}_{\text{???}} \]
Recall that \(\widehat{\mathbf{Y}} = \mathbf H \mathbf{Y}\) for some matrix \(\mathbf H\),
so \(\sum_i \Cov{Y_i}{\widehat Y_i} = \tr{\Cov{\widehat{\mathbf{Y}}}{\mathbf{Y}}} = \tr{\mathbf H \, \Var{\mathbf{Y}}} = \sigma^2 \tr{\mathbf H}\).
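A small simulation can sanity-check this identity (a sketch with an arbitrary fixed design and \(\sigma = 2\); the Monte Carlo sum of covariances should land near \(\sigma^2 \tr{\mathbf H} = 4 \times 3 = 12\)):

set.seed(406)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # fixed design matrix
beta <- c(1, -2, 0.5)
H <- X %*% solve(crossprod(X), t(X))                 # hat matrix
reps <- 5000
Ys    <- replicate(reps, as.vector(X %*% beta) + sigma * rnorm(n))  # n x reps draws of Y
Yhats <- H %*% Ys                                                   # corresponding fitted values
cov_sum <- sum(sapply(seq_len(n), function(i) cov(Ys[i, ], Yhats[i, ])))
c(monte_carlo = cov_sum, theory = sigma^2 * sum(diag(H)))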
Plugging this in (and dropping the \(-\sigma^2\) term, which is the same for every model) gives Mallows’ \(C_p\), a.k.a. Stein’s Unbiased Risk Estimator:
\[ C_p = \text{MSE} + 2\hat{\sigma}^2 \: \textrm{df}/n \]
(We derived it for the OLS model, but again it can be generalized to other models.)
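In R, for a linear model, a minimal sketch (estimating \(\hat\sigma^2\) with the usual residual variance from summary()):

set.seed(406)
n <- 100; x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
mse  <- mean(residuals(fit)^2)
df   <- sum(hatvalues(fit))        # tr(H)
sig2 <- summary(fit)$sigma^2       # estimate of sigma^2
cp   <- mse + 2 * sig2 * df / n
cp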
Important
Unfortunately, \(\text{df}\) may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course.
(Risk, CV, Information Criteria, Model/Variable Selection)
# Schematic of K-fold CV: each row is one split of the same data. The red hatched
# block is the held-out validation fold (labelled 1, 2, 3, ..., K); the rest of the
# row (blue hatching) is the training data. The dots stand for the omitted splits.
# `blue` and `red` are colour constants defined in the course slide theme; substitute
# e.g. blue <- "blue"; red <- "red" to run this standalone.
par(mar = c(0, 0, 0, 0))
plot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = "n", yaxt = "n", xaxt = "n")
rect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)
rect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10),
     rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)
points(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)
text(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c("1", "2", "3", "K"), cex = 3,
     col = red)
text(6, 4.5, "Training data", cex = 3, col = blue)
text(2, 1.5, "Validation data", cex = 3, col = red)
\[\textrm{GCV} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\] (a hacked version of the magic LOO-CV formula)
\[C_p = \mathrm{MSE} + 2 \hat{\sigma}^2 \textrm{df}/n\] (closed-form derivation of risk for OLS model with fixed \(X\)s)
AIC and BIC have a very similar flavor to \(C_p\), but their genesis is different.
Without going into too much detail, they look like
\(\textrm{AIC}/n = -2\textrm{log-likelihood}/n + 2\textrm{df}/n\)
\(\textrm{BIC}/n = -2\textrm{log-likelihood}/n + \log(n)\,\textrm{df}/n\)
What is log-likelihood?
In the case of \(Y = X \hat \beta + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \hat \sigma^2)\):
\[ \text{log-likelihood} = -\frac{1}{2} \left( n\log(2\pi) + n\log(\hat \sigma^2) + \left(\textstyle \sum_{i=1}^n (Y_i - X_i^\top\hat\beta)^2\right)/ \hat \sigma^2 \right) \]
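Plugging in the maximum-likelihood estimate \(\hat\sigma^2 = \textrm{MSE} = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^\top\hat\beta)^2\) makes the last term equal to \(n\), so
\[ \text{log-likelihood} = -\frac{n}{2}\left( \log(2\pi) + \log(\textrm{MSE}) + 1 \right), \qquad -2\,\text{log-likelihood}/n = \log(\textrm{MSE}) + \mathrm{Const.} \]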
So, after simplifying…
\[\textrm{AIC}/n = \log(\textrm{MSE}) + 2 \textrm{df}/n + \mathrm{Const.}\]
\[\textrm{BIC}/n = \log(\textrm{MSE}) + \log(n)\, \textrm{df}/n + \mathrm{Const.}\]
Important
Unfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.
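For example, base R alone uses two conventions for lm() fits; a sketch on simulated data (the numbers differ by constants and by whether \(\hat\sigma^2\) is counted as a parameter, but both versions rank a set of candidate models the same way):

set.seed(406)
n <- 100; x <- rnorm(n); y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
AIC(fit)         # -2 * logLik + 2 * (p + 1): counts sigma^2 as a parameter, keeps all constants
extractAIC(fit)  # c(edf, n * log(RSS / n) + 2 * edf): drops the constants, edf = p
BIC(fit)         # like AIC(fit), but with a log(n) penalty per parameter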
We have seen 5 risk estimators: CV, GCV, \(C_p\), AIC, and BIC.
Use CV. (LOO-CV if you have a magic formula, k-fold otherwise.)
(So why did we just suffer through learning all those boring information criteria????)
For regression models (with Gaussian noise), our information criteria are… \[ \begin{align} \textrm{GCV} &= \textrm{MSE} / (1-\textrm{df}/n)^2 \\ C_p &= \textrm{MSE} + 2 \hat \sigma^2 \textrm{df}/n \\ \textrm{AIC}/n &= \log(\textrm{MSE}) + 2 \textrm{df}/n + \mathrm{Const.} \\ \textrm{BIC}/n &= \log(\textrm{MSE}) + \log(n)\, \textrm{df}/n + \mathrm{Const.} \end{align} \]
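To see them all at work, here is a sketch that fits polynomials of increasing degree to simulated (non-polynomial) data and evaluates each criterion with the formulas above, dropping the additive constants in AIC/BIC since they don't affect the comparison:

set.seed(406)
n <- 100
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
criteria <- sapply(1:10, function(deg) {
  fit  <- lm(y ~ poly(x, deg))
  mse  <- mean(residuals(fit)^2)
  df   <- sum(hatvalues(fit))      # tr(H) = deg + 1
  sig2 <- summary(fit)$sigma^2
  c(GCV = mse / (1 - df / n)^2,
    Cp  = mse + 2 * sig2 * df / n,
    AIC = log(mse) + 2 * df / n,
    BIC = log(mse) + log(n) * df / n)
})
apply(criteria, 1, which.min)      # polynomial degree chosen by each criterion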
All 4 information criteria become large (i.e. we have high risk) when either the training MSE is large or \(\textrm{df}/n\) is large.
For an overfit model, is MSE large or is \(\textrm{df}/n\) large? What about for an underfit model?
Are overfit models high-bias or high-variance? What about underfit models?
Tip
We have a diagnostic for determining whether we’re high-bias or high-variance!
Is training MSE large, or is \(\textrm{df}/n\)?
The best way to avoid overfitting/underfitting is to use a reasonable estimate of prediction risk to choose how complicated your model should be.
Tempting to “just compare” risk estimates from vastly different models.
For example,
different transformations of the predictors,
different transformations of the response,
Poisson likelihood vs. Gaussian likelihood in glm()
This is not always justified.
We can compare “nested” models (i.e. where one model is a special case of another)
Different likelihoods aren’t comparable.
Residuals / response variables on different scales aren’t directly comparable.
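As a quick illustration of that last point, a sketch on simulated data: the residuals of a model for \(y\) and a model for \(\log(y)\) live on different scales, so their MSEs (and anything built from them) can't be compared directly.

set.seed(406)
n <- 100
x <- runif(n, 1, 5)
y <- exp(0.5 * x + rnorm(n, sd = 0.2))   # a positive, right-skewed response
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)
c(mse_raw = mean(residuals(fit_raw)^2),  # units of y^2
  mse_log = mean(residuals(fit_log)^2))  # units of (log y)^2 -- smaller, but not "better"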
Using risk estimates for model selection.
UBC Stat 406 - 2024