05 Estimating test MSE

Stat 406

Daniel J. McDonald

Last modified – 18 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]

Estimating prediction risk

Last time, we saw

\(R_n(\widehat{f}) = E[(Y-\widehat{f}(X))^2]\)

prediction risk = \(\textrm{bias}^2\) + variance + irreducible error

We argued that we want procedures that produce \(\widehat{f}\) with small \(R_n\).

How do we estimate \(R_n\)?

Don’t use training error

The training error in regression is

\[\widehat{R}_n(\widehat{f}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\]

Here, the \(n\) is doubly used (annoying, but simple): \(n\) observations to create \(\widehat{f}\) and \(n\) terms in the sum.

Important

Training error is a bad estimator for \(R_n(\widehat{f})\).

So we should never use it.

These all have the same \(R^2\) and Training Error

Code
ans <- anscombe |>
  pivot_longer(everything(), names_to = c(".value", "set"), 
               names_pattern = "(.)(.)")
ggplot(ans, aes(x, y)) + 
  geom_point(colour = orange, size = 3) + 
  geom_smooth(method = "lm", se = FALSE, color = blue, linewidth = 2) +
  facet_wrap(~set, labeller = label_both)

ans |>
  group_by(set) |> 
  summarise(
    R2 = summary(lm(y ~ x))$r.sq, 
    train_error = mean((y - predict(lm(y ~ x)))^2)
  ) |>
  kableExtra::kable(digits = 2)
set   R2    train_error
1     0.67  1.25
2     0.67  1.25
3     0.67  1.25
4     0.67  1.25

Adding “junk” predictors increases \(R^2\) and decreases Training Error

n <- 100
p <- 10
q <- 0:30
x <- matrix(rnorm(n * (p + max(q))), nrow = n)
y <- x[, 1:p] %*% c(5:1, 1:5) + rnorm(n, 0, 10)

regress_on_junk <- function(q) {
  x <- x[, 1:(p + q)]
  mod <- lm(y ~ x)
  tibble(R2 = summary(mod)$r.sq,  train_error = mean((y - predict(mod))^2))
}
Code
map(q, regress_on_junk) |> 
  list_rbind() |>
  mutate(q = q) |>
  pivot_longer(-q) |>
  ggplot(aes(q, value, colour = name)) +
  geom_line(linewidth = 2) + xlab("q") +
  scale_colour_manual(values = c(blue, orange), guide = "none") +
  facet_wrap(~ name, scales = "free_y")

Other things you can’t use

You should not use anova

or the \(p\)-values from the lm output for this purpose.

These tell you whether those parameters would be different from zero if you were to repeat the experiment many times, if the model were true, etc. etc.

Not the same as “are they useful for prediction = do they help me get smaller \(R_n\)?”

Risk of Risk

While it’s crummy, Training Error is an estimator of \(R_n(\hat{f})\)

Recall, \(R_n(\hat{f})\) is a parameter (a property of the data distribution)

So we can ask “is \(\widehat{R}(\hat{f})\) a good estimator for \(R_n(\hat{f})\)?”

Both are just numbers, so perhaps a good way to measure is

\[ E[(R_n - \widehat{R})^2] = \cdots = (R_n - E[\widehat{R}])^2 + \Var{\widehat{R}} \]
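The \(\cdots\) hides a standard calculation: add and subtract \(E[\widehat{R}]\), and the cross term vanishes because \(R_n - E[\widehat{R}]\) is a constant.

\[
\begin{aligned}
E[(R_n - \widehat{R})^2]
  &= E\left[\left((R_n - E[\widehat{R}]) - (\widehat{R} - E[\widehat{R}])\right)^2\right] \\
  &= (R_n - E[\widehat{R}])^2 - 2(R_n - E[\widehat{R}])\, E\left[\widehat{R} - E[\widehat{R}]\right] + E\left[(\widehat{R} - E[\widehat{R}])^2\right] \\
  &= (R_n - E[\widehat{R}])^2 + \Var{\widehat{R}}.
\end{aligned}
\]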

Choices you make determine how good this is.

We can try to balance its bias and variance…

Things you can use.

Held out sets

One option is to set aside a separate “held-out” or “validation” set (a minimal sketch appears after the list below).

👍 Estimates the test error

👍 Fast computationally

🤮 Estimate is random

🤮 Estimate has high variance (depends on 1 choice of split)

🤮 Estimate has some bias because we only used some of the data
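Here is a minimal sketch of the held-out estimate. The toy data frame somedata with response z mirrors the example used later in these slides; the 75/25 split and the seed are arbitrary illustrative choices.

set.seed(406)  # arbitrary seed so the split is reproducible
somedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
heldout <- sample(nrow(somedata), size = nrow(somedata) / 4)  # hold out 25%
fit <- lm(z ~ ., data = somedata[-heldout, ])                 # fit on the rest
preds <- predict(fit, newdata = somedata[heldout, ])          # predict held-out rows
mean((somedata$z[heldout] - preds)^2)                         # held-out estimate of MSE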

Aside

In my experience, CS has particular definitions of “training”, “validation”, and “test” data.

I think these are not quite the same as in Statistics.

  • Test data - Hypothetical data you don’t get to see, ever. Infinite amounts drawn from the population.
    • Expected test error or Risk is an expected value over this distribution. It’s not a sum over some data kept aside.
    • Sometimes I’ll give you “test data”. You pretend that this is a good representation of the expectation and use it to see how well you did on the training data.
  • Training data - This is data that you get to touch.
  • Validation set - Often, we need to choose models. One way to do this is to split off some of your training data and pretend that it’s like a “Test Set”.

When and how you split your training data can be very important.

Intuition for CV

One reason that \(\widehat{R}_n(\widehat{f})\) is bad is that we are using the same data to pick \(\widehat{f}\) AND to estimate \(R_n\).

“Validation set” fixes this, but holds out a particular, fixed block of data that we pretend mimics the “test data”.

What if we set aside one observation, say the first one \((y_1, x_1)\)?

We estimate \(\widehat{f}^{(1)}\) without using the first observation.

Then we test our prediction:

\[\widetilde{R}_1(\widehat{f}^{(1)}) = (y_1 -\widehat{f}^{(1)}(x_1))^2.\]

(Why the notation \(\widetilde{R}_1\)? Because we’re estimating the risk with 1 observation.)

Keep going

But that was only one data point \((y_1, x_1)\). Why stop there?

Do the same with \((y_2, x_2)\)! Get an estimate \(\widehat{f}^{(2)}\) without using it, then

\[\widetilde{R}_1(\widehat{f}^{(2)}) = (y_2 -\widehat{f}^{(2)}(x_2))^2.\]

We can keep doing this until we try it for every data point.

And then average them! (Averages are good)

\[\mbox{LOO-CV} = \frac{1}{n}\sum_{i=1}^n \widetilde{R}_1(\widehat{f}^{(i)}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{f}^{(i)}(x_i))^2\]

This is leave-one-out cross validation
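A brute-force sketch, reusing the toy somedata / lm setup from the held-out example above: refit the model \(n\) times, each time predicting the single left-out point.

loo_cv_lm <- function(data) {
  n <- nrow(data)
  errs <- double(n)
  for (i in seq_len(n)) {
    fit_i <- lm(z ~ ., data = data[-i, ])           # fit without observation i
    pred_i <- predict(fit_i, newdata = data[i, ])   # predict the left-out point
    errs[i] <- (data$z[i] - pred_i)^2               # its squared error
  }
  mean(errs)                                        # average the n squared errors
}
loo_cv_lm(somedata)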

Problems with LOO-CV

🤮 Each held out set is small \((n=1)\). Therefore, the variance of the Squared Error of each prediction is high.

🤮 The training sets overlap. This is bad.

  • Usually, averaging reduces variance: \(\Var{\overline{X}} = \frac{1}{n^2}\sum_{i=1}^n \Var{X_i} = \frac{1}{n}\Var{X_1}.\)
  • But only if the variables are independent. If not, then \(\Var{\overline{X}} = \frac{1}{n^2}\Var{ \sum_{i=1}^n X_i} = \frac{1}{n}\Var{X_1} + \frac{1}{n^2}\sum_{i\neq j} \Cov{X_i}{X_j}.\)
  • Since the training sets overlap a lot, that covariance can be pretty big.

🤮 We have to estimate this model \(n\) times.

🎉 Bias is low because we used almost all the data to fit the model: \(E[\mbox{LOO-CV}] = R_{n-1}\)

K-fold CV

To alleviate some of these problems, people usually use \(K\)-fold cross validation.

The idea of \(K\)-fold is

  1. Divide the data into \(K\) groups.
  2. Leave a group out and estimate with the rest.
  3. Test on the held-out group. Calculate an average risk over these \(\sim n/K\) data.
  4. Repeat for all \(K\) groups.
  5. Average the average risks.

🎉 Less overlap, smaller covariance.

🎉 Larger hold-out sets, smaller variance.

🎉 Fewer computations (only need to fit the model \(K\) times)

🤮 LOO-CV is (nearly) unbiased for \(R_n\)

🤮 K-fold CV is unbiased for \(R_{n(1-1/K)}\)

The risk depends on how much data you use to estimate the model. \(R_n\) depends on \(n\).

A picture

Code
par(mar = c(0, 0, 0, 0))
plot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = "n", yaxt = "n", xaxt = "n")
rect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)
rect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10), 
     rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)
points(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)
text(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c("1", "2", "3", "K"), cex = 3, 
     col = red)
text(6, 4.5, "Training data", cex = 3, col = blue)
text(2, 1.5, "Validation data", cex = 3, col = red)

Code

#' @param data The full data set
#' @param estimator Function. Has 1 argument (some data) and fits a model. 
#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions
#' @param error_fun Function. Has one arg: the test data, with fits added.
#' @param kfolds Integer. The number of folds.
kfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {
  n <- nrow(data)
  fold_labels <- sample(rep(1:kfolds, length.out = n))
  errors <- double(kfolds)
  for (fold in seq_len(kfolds)) {
    test_rows <- fold_labels == fold
    train <- data[!test_rows, ]
    test <- data[test_rows, ]
    current_model <- estimator(train)
    test$.preds <- predictor(current_model, test)
    errors[fold] <- error_fun(test)
  }
  mean(errors)
}
somedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
est <- function(dataset) lm(z ~ ., data = dataset)
pred <- function(mod, dataset) predict(mod, newdata = dataset)
error_fun <- function(testdata) {
  mutate(testdata, errs = (z - .preds)^2) |> pull(errs) |> mean()
}
kfold_cv(somedata, est, pred, error_fun, 5)
[1] 0.9532271

Trick

For certain “nice” models, one can show

(after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage, assign as homework to belligerent students)

\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]

  • This trick means that you only have to fit the model once rather than \(n\) times!

  • You still have to calculate this for each model!

cv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )

Trick

cv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )

“Nice” requires:

  1. \(\widehat{y}_i = h_i(\mathbf{X})^\top \mathbf{y}\) for some vector \(h_i\)
  2. \(e^{(i)} = \frac{\widehat{e}_i}{(1-h_{ii})}\)
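As a quick sanity check (a sketch reusing the toy somedata and the brute-force loo_cv_lm from earlier), the shortcut matches the refit-\(n\)-times answer for lm:

fit <- lm(z ~ ., data = somedata)
cv_nice(fit)         # one fit: uses residuals and hat values
loo_cv_lm(somedata)  # n fits: same value, up to numerical error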

Next time…

More tricks and what’s up with the name “hatvalues”