Stat 406
Daniel J. McDonald
Last modified – 26 September 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]
Last time we saw a nice trick, that works some of the time (OLS, Ridge regression,…)
\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]
\[\newcommand{\H}{\mathbf{H}}\]
Idea: replace \(h_{ii}\) with \(\frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{1}{n}\textrm{tr}(\mathbf{H})\)
Let’s call \(\textrm{tr}(\mathbf{H})\) the degrees-of-freedom (or just df)
\[\textrm{GCV} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-\textrm{df}/n)^2} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\]
Where does this stuff come from?
hatvalues
?In OLS, \(\widehat{\y} = \X\widehat{\beta} = \X(\X^\top \X)^{-1}\X^\top \y\)
We often call \(\mathbf{H} = \X(\X^\top \X)^{-1}\X^\top\) the Hat matrix, because it puts the hat on \(\y\)
GCV uses \(\textrm{tr}(\mathbf{H})\).
For lm()
, this is just p
, the number of predictors (Why?)
This is one way of understanding the name degrees-of-freedom
Suppose, \(Y_i\) is independent from some distribution with mean \(\mu_i\) and variance \(\sigma^2\)
(remember: in the linear model \(\Expect{Y_i} = x_i^\top \beta = \mu_i\) )
Let \(\widehat{\mathbf{Y}}\) be an estimator of \(\mu\) (all \(i=1,\ldots,n\) elements of the vector).
\[\Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} = \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}\]
Now, if \(\widehat{\mathbf{Y}} = \H \mathbf{Y}\) for some matrix \(\H\),
\(\sum\Cov{Y_i}{\widehat Y_i} = \Expect{\mathbf{Y}^\top \H \mathbf{Y}} = \sigma^2 \textrm{tr}(\H)\)
This gives Mallow’s \(C_p\) aka Stein’s Unbiased Risk Estimator:
\(MSE + 2\hat{\sigma}^2\textrm{df}/n\)
Important
Unfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course.
These have a very similar flavor to \(C_p\), but their genesis is different.
Without going into too much detail, they look like
\(\textrm{AIC}/n = -2\textrm{loglikelihood}/n + 2\textrm{df}/n\)
\(\textrm{BIC}/n = -2\textrm{loglikelihood}/n + 2\log(n)\textrm{df}/n\)
In the case of a linear model with Gaussian errors and \(p\) predictors
\[\begin{aligned} \textrm{AIC}/n &= \log(2\pi) + \log(RSS/n) + 2(p+1)/n \\ &\propto \log(RSS) + 2(p+1)/n \end{aligned}\]( \(p+1\) because of the unknown variance, intercept included in \(p\) or not)
Important
Unfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.
Forms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.
Over-fitting means estimating a really complicated function when you don’t have enough data.
This is likely a low-bias / high-variance situation.
Under-fitting means estimating a really simple function when you have lots of data.
This is likely a high-bias / low-variance situation.
Both of these outcomes are bad (they have high risk \(=\) big \(R_n\) ).
The best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be.
When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.
CV is usually easiest to make sense of and doesn’t depend on other unknown parameters.
But, it requires refitting the model.
Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.
GCV tends to choose “dense” models.
Theory says AIC chooses the “best predicting model” asymptotically.
Theory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.
In some special cases, AIC = Cp = SURE \(\approx\) LOO-CV
As a technical point, CV (or validation set) is estimating error on new data, unseen \((X_0, Y_0)\), while AIC / CP are estimating error on new Y at the observed \(x_1,\ldots,x_n\). This is subtle.
Use CV
It is often tempting to “just compare” risk estimates from vastly different models.
For example,
different transformations of the predictors,
different transformations of the response,
Poisson likelihood vs. Gaussian likelihood in glm()
This is not always justified.
The “high-level intuition” is for “nested” models.
Different likelihoods aren’t comparable.
Residuals / response variables on different scales aren’t directly comparable.
“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks.
Greedy selection
UBC Stat 406 - 2023