Stat 406

Daniel J. McDonald

Last modified – 26 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Last time we saw a nice trick that works some of the time (OLS, ridge regression, …):

\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]

- \(\widehat{\y} = \widehat{f}(\mathbf{X}) = \mathbf{H}\mathbf{y}\) for some matrix \(\mathbf{H}\).
- A technical thing.

\[\newcommand{\H}{\mathbf{H}}\]

Idea: replace \(h_{ii}\) with \(\frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{1}{n}\textrm{tr}(\mathbf{H})\)

Let’s call \(\textrm{tr}(\mathbf{H})\) the *degrees-of-freedom* (or just *df*)

\[\textrm{GCV} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-\textrm{df}/n)^2} = \frac{\textrm{MSE}}{(1-\textrm{df}/n)^2}\]
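A small numerical check of these two formulas (the course uses R; this is an equivalent plain-Python sketch with made-up data). For simple linear regression the leverages have the closed form \(h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}\), so we can verify that the LOO-CV shortcut matches brute-force refitting, and then compute GCV by replacing each \(h_{ii}\) with \(\textrm{df}/n = 2/n\).

```python
# Verify: brute-force LOO-CV == shortcut sum of e_i^2 / (1 - h_ii)^2 for OLS,
# then compute GCV, which swaps each h_ii for the average df/n.
x = [1.0, 2.0, 3.0, 4.0, 5.0]          # toy data, invented for illustration
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

def fit(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    xb = sum(xs) / len(xs); yb = sum(ys) / len(ys)
    sxx = sum((xi - xb) ** 2 for xi in xs)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
    return yb - b1 * xb, b1

b0, b1 = fit(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverages h_ii

# Shortcut formula from the slide
loo_short = sum(e ** 2 / (1 - hi) ** 2 for e, hi in zip(resid, h)) / n

# Brute force: actually refit n times, each time holding out observation i
loo_brute = 0.0
for i in range(n):
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_brute += (y[i] - (a0 + a1 * x[i])) ** 2
loo_brute /= n

# GCV: df = tr(H) = 2 here (intercept + slope)
mse = sum(e ** 2 for e in resid) / n
gcv = mse / (1 - 2 / n) ** 2

print(loo_short, loo_brute, gcv)
```

The two LOO numbers agree to machine precision; GCV is close but not identical, since it only uses the average leverage.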

Where does this stuff come from?

In R, the leverage values \(h_{ii}\) are returned by `hatvalues()`.

In OLS, \(\widehat{\y} = \X\widehat{\beta} = \X(\X^\top \X)^{-1}\X^\top \y\)

We often call \(\mathbf{H} = \X(\X^\top \X)^{-1}\X^\top\) the Hat matrix, because it puts the hat on \(\y\)

GCV uses \(\textrm{tr}(\mathbf{H})\).

For `lm()`, this is just `p`, the number of predictors (why?).
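Here is a quick numerical answer to the "why?" (a plain-Python sketch with an invented two-column design matrix; the 2×2 inverse is done by hand). Since \(\textrm{tr}(\H) = \textrm{tr}(\X(\X^\top\X)^{-1}\X^\top) = \textrm{tr}((\X^\top\X)^{-1}\X^\top\X) = \textrm{tr}(\mathbf{I}_p) = p\), the leverages must sum to the number of columns of \(\X\).

```python
# Check tr(H) = p for a design matrix with an intercept and one predictor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]   # n = 4, p = 2
n, p = len(X), 2

# X^T X and its inverse (2x2, inverted by the adjugate formula)
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
det = a * d - b * b
inv = [[d / det, -b / det], [-b / det, a / det]]

# Leverages h_ii = x_i^T (X^T X)^{-1} x_i; their sum is tr(H)
h = []
for r in X:
    v0 = inv[0][0] * r[0] + inv[0][1] * r[1]
    v1 = inv[1][0] * r[0] + inv[1][1] * r[1]
    h.append(r[0] * v0 + r[1] * v1)

print(sum(h))   # equals p = 2, the number of columns of X
```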

This is one way of understanding the name *degrees-of-freedom*

Suppose the \(Y_i\) are independent, each from some distribution with mean \(\mu_i\) and variance \(\sigma^2\)

(remember: in the linear model \(\Expect{Y_i} = x_i^\top \beta = \mu_i\) )

Let \(\widehat{\mathbf{Y}}\) be an estimator of \(\mu\) (all \(i=1,\ldots,n\) elements of the vector).

\[\begin{aligned}
& \Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} \\
&= \Expect{\frac{1}{n}\sum (\widehat Y_i-Y_i + Y_i -\mu_i)^2}\\
&= \frac{1}{n}\Expect{\sum (\widehat Y_i-Y_i)^2} + \frac{1}{n}\Expect{\sum (Y_i-\mu_i)^2} + \frac{2}{n}\Expect{\sum (\widehat Y_i-Y_i)(Y_i-\mu_i)}\\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} + \sigma^2 + \frac{2}{n}\Expect{\sum (\widehat Y_i-Y_i)(Y_i-\mu_i)} = \cdots \\
&= \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}
\end{aligned}\]

\[\Expect{\frac{1}{n}\sum (\widehat Y_i-\mu_i)^2} = \frac{1}{n}\sum \Expect{(\widehat Y_i-Y_i)^2} - \sigma^2 + \frac{2}{n}\sum\Cov{Y_i}{\widehat Y_i}\]

Now, if \(\widehat{\mathbf{Y}} = \H \mathbf{Y}\) for some matrix \(\H\),

\(\sum\Cov{Y_i}{\widehat Y_i} = \textrm{tr}(\Cov{\mathbf{Y}}{\H\mathbf{Y}}) = \textrm{tr}(\sigma^2 \H) = \sigma^2 \textrm{tr}(\H)\)
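This covariance identity can be checked by simulation (a sketch, not from the slides; plain Python with an invented mean vector). Take the intercept-only fit, so \(\H = \frac{1}{n}\mathbf{1}\mathbf{1}^\top\), \(\widehat Y_i = \bar{Y}\), and \(\textrm{tr}(\H) = 1\); the summed covariance should come out near \(\sigma^2 \cdot 1\).

```python
import random

# Monte Carlo check: for Y_i independent with variance sigma^2 and
# Yhat = H Y, we expect sum_i Cov(Y_i, Yhat_i) = sigma^2 * tr(H).
# Intercept-only fit: Yhat_i = Ybar for every i, so tr(H) = 1.
random.seed(406)
n, sigma = 3, 2.0
mu = [1.0, -2.0, 0.5]          # arbitrary means (invented)
reps = 200_000

sy = [0.0] * n; syh = [0.0] * n; syyh = [0.0] * n
for _ in range(reps):
    y = [m + random.gauss(0, sigma) for m in mu]
    ybar = sum(y) / n
    for i in range(n):
        sy[i] += y[i]; syh[i] += ybar; syyh[i] += y[i] * ybar

# Empirical sum of covariances Cov(Y_i, Ybar)
cov_sum = sum(syyh[i] / reps - (sy[i] / reps) * (syh[i] / reps)
              for i in range(n))
print(cov_sum)   # should be close to sigma^2 * tr(H) = 4.0
```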

This gives *Mallows’ \(C_p\)*, aka *Stein’s Unbiased Risk Estimator* (SURE):

\(C_p = \textrm{MSE} + 2\hat{\sigma}^2\textrm{df}/n\)

**Important**

Unfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course.

AIC and BIC have a very similar flavor to \(C_p\), but their genesis is different.

Without going into too much detail, they look like

\(\textrm{AIC}/n = -2\textrm{loglikelihood}/n + 2\textrm{df}/n\)

\(\textrm{BIC}/n = -2\textrm{loglikelihood}/n + \log(n)\textrm{df}/n\)

In the case of a linear model with Gaussian errors and \(p\) predictors

\[\begin{aligned} \textrm{AIC}/n &= \log(2\pi) + \log(RSS/n) + 2(p+1)/n \\ &\propto \log(RSS) + 2(p+1)/n \end{aligned}\]

(\(p+1\) because of the unknown variance; intercept included in \(p\) or not)
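As a concrete illustration (plain Python with invented, roughly linear data; in the course you would use R's `AIC()` on `lm()` fits), we can use \(\textrm{AIC}/n \propto \log(RSS) + 2(p+1)/n\) to compare two nested Gaussian linear models: intercept-only versus intercept plus slope.

```python
import math

# Compare AIC/n for intercept-only (p = 1) vs. simple linear regression
# (p = 2), using the proportional form log(RSS) + 2(p+1)/n.  The data are
# roughly linear, so the slope model should win (lower AIC).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.3, 2.8, 4.1, 5.2, 5.9]
n = len(x)

def rss_linear(xs, ys):
    """RSS from a closed-form simple linear regression fit."""
    xb = sum(xs) / len(xs); yb = sum(ys) / len(ys)
    sxx = sum((xi - xb) ** 2 for xi in xs)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / sxx
    b0 = yb - b1 * xb
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))

ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)   # intercept only
rss1 = rss_linear(x, y)                    # intercept + slope

aic0 = math.log(rss0) + 2 * (1 + 1) / n
aic1 = math.log(rss1) + 2 * (2 + 1) / n
print(aic0, aic1)   # lower is better; aic1 should win here
```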

**Important**

Unfortunately, different books/software/notes define these differently. Even different R packages. This is **super annoying**.

Forms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.

Over-fitting means estimating a really complicated function when you don’t have enough data.

This is likely a low-bias / high-variance situation.

Under-fitting means estimating a really simple function when you have lots of data.

This is likely a high-bias / low-variance situation.

Both of these outcomes are bad (they have high risk \(=\) big \(R_n\) ).

The best way to avoid them is to use a reasonable estimate of *prediction risk* to choose how complicated your model should be.

When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.

CV is usually easiest to make sense of and doesn’t depend on other unknown parameters.

But, it requires refitting the model.

Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.

GCV tends to choose “dense” models.

Theory says AIC chooses the “best predicting model” asymptotically.

Theory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.

In some special cases, AIC = Cp = SURE \(\approx\) LOO-CV

As a technical point, CV (or a validation set) estimates the error on new data, unseen \((X_0, Y_0)\), while AIC / \(C_p\) estimate the error in new \(Y\) at the observed \(x_1,\ldots,x_n\). This is subtle.

Use CV

It is often tempting to “just compare” risk estimates from vastly different models.

For example:

- different transformations of the predictors,
- different transformations of the response,
- Poisson likelihood vs. Gaussian likelihood in `glm()`.

This is not always justified.

The “high-level intuition” applies only to “nested” models.

Different likelihoods aren’t comparable.

Residuals / response variables on different scales aren’t directly comparable.

“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks.

Greedy selection

UBC Stat 406 - 2023