Stat 406

Daniel J. McDonald

Last modified – 18 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]

Last time, we saw

\(R_n(\widehat{f}) = E[(Y-\widehat{f}(X))^2]\)

prediction risk = \(\textrm{bias}^2\) + variance + irreducible error

We argued that we want procedures that produce \(\widehat{f}\) with small \(R_n\).

How do we estimate \(R_n\)?

The training error in regression is

\[\widehat{R}_n(\widehat{f}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\]

Here, the \(n\) is doubly used (annoying, but simple): \(n\) observations to create \(\widehat{f}\) and \(n\) terms in the sum.

**Important**

Training error is a bad estimator for \(R_n(\widehat{f})\).

So we should **never** use it.

You should not use `anova`

or the \(p\)-values from the `lm`

output for this purpose.

These things are to determine whether those

parametersare different from zero if you were to repeat the experiment many times, if the model were true, etc. etc.

Not the same as *“are they useful for prediction = do they help me get smaller \(R_n\)?”*

While it’s crummy, Training Error is an estimator of \(R_n(\hat{f})\)

Recall, \(R_n(\hat{f})\) is a *parameter* (a property of the data distribution)

So we can ask “is \(\widehat{R}(\hat{f})\) a good estimator for \(R_n(\hat{f})\)?”

Both are just numbers, so perhaps a good way to measure is

\[ E[(R_n - \widehat{R})^2] = \cdots = (R_n - E[\widehat{R}])^2 + \Var{\widehat{R}} \]

Choices you make determine how good this is.

We can try to balance it’s bias and variance…

One option is to have a separate “held out” or “validation set”.

👍 Estimates the test error

👍 Fast computationally

🤮 Estimate is random

🤮 Estimate has high variance (depends on 1 choice of split)

🤮 Estimate has some bias because we only used some of the data

In my experience, CS has particular definitions of “training”, “validation”, and “test” data.

I think these are not quite the same as in Statistics.

- Test data - Hypothetical data you don’t get to see, ever. Infinite amounts drawn from the population.
*Expected test error*or*Risk*is an expected value over this distribution. It’s*not*a sum over some data kept aside.

- Sometimes I’ll give you “test data”. You pretend that this is a good representation of the expectation and use it to see how well you did on the training data.
- Training data - This is data that you get to touch.
- Validation set - Often, we need to
*choose models*. One way to do this is to split off some of your training data and pretend that it’s like a “Test Set”.

When and how you split your training data can be very important.

One reason that \(\widehat{R}_n(\widehat{f})\) is bad is that we are using the same data to pick \(\widehat{f}\) **AND** to estimate \(R_n\).

“Validation set” fixes this, but holds out a particular, fixed block of data we pretend mimics the “test data”

What if we set aside one observation, say the first one \((y_1, x_1)\).

We estimate \(\widehat{f}^{(1)}\) without using the first observation.

Then we test our prediction:

\[\widetilde{R}_1(\widehat{f}^{(1)}) = (y_1 -\widehat{f}^{(1)}(x_1))^2.\]

(why the notation \(\widetilde{R}_1\)? Because we’re estimating the risk with 1 observation. )

But that was only one data point \((y_1, x_1)\). Why stop there?

Do the same with \((y_2, x_2)\)! Get an estimate \(\widehat{f}^{(2)}\) without using it, then

\[\widetilde{R}_1(\widehat{f}^{(2)}) = (y_2 -\widehat{f}^{(2)}(x_2))^2.\]

We can keep doing this until we try it for every data point.

And then average them! (Averages are good)

\[\mbox{LOO-CV} = \frac{1}{n}\sum_{i=1}^n \widetilde{R}_1(\widehat{f}^{(i)}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{f}^{(i)}(x_i))^2\]

This is **leave-one-out cross validation**

🤮 Each held out set is small \((n=1)\). Therefore, the variance of the Squared Error of each prediction is high.

🤮 The training sets overlap. This is bad.

- Usually, averaging reduces variance: \(\Var{\overline{X}} = \frac{1}{n^2}\sum_{i=1}^n \Var{X_i} = \frac{1}{n}\Var{X_1}.\)
- But only if the variables are independent. If not, then \(\Var{\overline{X}} = \frac{1}{n^2}\Var{ \sum_{i=1}^n X_i} = \frac{1}{n}\Var{X_1} + \frac{1}{n^2}\sum_{i\neq j} \Cov{X_i}{X_j}.\)
- Since the training sets overlap a lot, that covariance can be pretty big.

🤮 We have to estimate this model \(n\) times.

🎉 Bias is low because we used almost all the data to fit the model: \(E[\mbox{LOO-CV}] = R_{n-1}\)

To alleviate some of these problems, people usually use \(K\)-fold cross validation.

The idea of \(K\)-fold is

- Divide the data into \(K\) groups.
- Leave a group out and estimate with the rest.
- Test on the held-out group. Calculate an average risk over these \(\sim n/K\) data.
- Repeat for all \(K\) groups.
- Average the average risks.

🎉 Less overlap, smaller covariance.

🎉 Larger hold-out sets, smaller variance.

🎉 Less computations (only need to estimate \(K\) times)

🤮 LOO-CV is (nearly) unbiased for \(R_n\)

🤮 K-fold CV is unbiased for \(R_{n(1-1/K)}\)

The risk depends on how much data you use to estimate the model. \(R_n\) depends on \(n\).

```
par(mar = c(0, 0, 0, 0))
plot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = "n", yaxt = "n", xaxt = "n")
rect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)
rect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10),
rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)
points(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)
text(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c("1", "2", "3", "K"), cex = 3,
col = red)
text(6, 4.5, "Training data", cex = 3, col = blue)
text(2, 1.5, "Validation data", cex = 3, col = red)
```

```
#' @param data The full data set
#' @param estimator Function. Has 1 argument (some data) and fits a model.
#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions
#' @param error_fun Function. Has one arg: the test data, with fits added.
#' @param kfolds Integer. The number of folds.
kfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {
n <- nrow(data)
fold_labels <- sample(rep(1:kfolds, length.out = n))
errors <- double(kfolds)
for (fold in seq_len(kfolds)) {
test_rows <- fold_labels == fold
train <- data[!test_rows, ]
test <- data[test_rows, ]
current_model <- estimator(train)
test$.preds <- predictor(current_model, test)
errors[fold] <- error_fun(test)
}
mean(errors)
}
```

```
somedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
est <- function(dataset) lm(z ~ ., data = dataset)
pred <- function(mod, dataset) predict(mod, newdata = dataset)
error_fun <- function(testdata) mutate(testdata, errs = (z - .preds)^2) |> pull(errs) |> mean()
kfold_cv(somedata, est, pred, error_fun, 5)
```

`[1] 0.9532271`

**For a certain “nice” models**, one can show

(after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage assign as homework to belligerent students)

\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]

This trick means that you only have to fit the model once rather than \(n\) times!

You still have to calculate this for each model!

“Nice” requires:

- \(\widehat{y}_i = h_i(\mathbf{X})^\top \mathbf{y}\) for some vector \(h_i\)
- \(e^{(i)} = \frac{\widehat{e}_i}{(1-h_{ii})}\)

More tricks and what’s up with the name “hatvalues”

UBC Stat 406 - 2023