Stat 406
Daniel J. McDonald
Last modified – 18 September 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]
Last time, we saw
\(R_n(\widehat{f}) = E[(Y-\widehat{f}(X))^2]\)
prediction risk = \(\textrm{bias}^2\) + variance + irreducible error
We argued that we want procedures that produce \(\widehat{f}\) with small \(R_n\).
How do we estimate \(R_n\)?
The training error in regression is
\[\widehat{R}_n(\widehat{f}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\]
Here, the \(n\) is doubly used (annoying, but simple): \(n\) observations to create \(\widehat{f}\) and \(n\) terms in the sum.
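As a quick sketch of this quantity in code (the simulated data and coefficients here are illustrative, not from the lecture), the training error of an `lm` fit is just the mean of the squared residuals:

```r
# Minimal sketch: training error of a linear model on simulated data.
# The seed, sample size, and true coefficient are arbitrary choices.
set.seed(406)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
fhat <- lm(y ~ x)
train_err <- mean(residuals(fhat)^2) # (1/n) * sum_i (y_i - fhat(x_i))^2
```

The same `n` observations produced `fhat` and appear in the average, exactly the double use noted above.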
Important
Training error is a bad estimator for \(R_n(\widehat{f})\).
So we should never use it.
You should not use anova or the \(p\)-values from the lm output for this purpose.
These things are to determine whether those parameters are different from zero if you were to repeat the experiment many times, if the model were true, etc. etc.
Not the same as “are they useful for prediction = do they help me get smaller \(R_n\)?”
While it’s crummy, Training Error is an estimator of \(R_n(\hat{f})\)
Recall, \(R_n(\hat{f})\) is a parameter (a property of the data distribution)
So we can ask “is \(\widehat{R}(\hat{f})\) a good estimator for \(R_n(\hat{f})\)?”
Both are just numbers, so perhaps a good way to measure is
\[ E[(R_n - \widehat{R})^2] = \cdots = (R_n - E[\widehat{R}])^2 + \Var{\widehat{R}} \]
Choices you make determine how good this is.
We can try to balance its bias and variance…
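The elided steps in the display above are the usual bias-variance expansion: add and subtract \(E[\widehat{R}]\), and note the cross term vanishes because \(R_n - E[\widehat{R}]\) is a constant while \(E[E[\widehat{R}] - \widehat{R}] = 0\):

\[
\begin{aligned}
E[(R_n - \widehat{R})^2]
  &= E\left[(R_n - E[\widehat{R}] + E[\widehat{R}] - \widehat{R})^2\right] \\
  &= (R_n - E[\widehat{R}])^2
     + 2(R_n - E[\widehat{R}])\, E\left[E[\widehat{R}] - \widehat{R}\right]
     + E\left[(E[\widehat{R}] - \widehat{R})^2\right] \\
  &= (R_n - E[\widehat{R}])^2 + \Var{\widehat{R}}.
\end{aligned}
\]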
One option is to have a separate “held out” or “validation set”.
👍 Estimates the test error
👍 Fast computationally
🤮 Estimate is random
🤮 Estimate has high variance (depends on 1 choice of split)
🤮 Estimate has some bias because we only used some of the data
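A single held-out split can be sketched as follows (the 75/25 split and the simulated data are arbitrary illustrative choices, not prescribed by the lecture):

```r
# Sketch of one train/validation split on simulated data.
set.seed(406)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)
val_idx <- sample(n, size = n / 4)          # hold out 25% of the rows, once
fit <- lm(y ~ x, data = dat[-val_idx, ])    # fit using only the training rows
preds <- predict(fit, newdata = dat[val_idx, ])
val_err <- mean((dat$y[val_idx] - preds)^2) # risk estimate from the held-out rows
```

Rerunning with a different seed gives a different `val_idx` and a different `val_err`, which is the "depends on 1 choice of split" problem above.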
In my experience, CS has particular definitions of “training”, “validation”, and “test” data.
I think these are not quite the same as in Statistics.
When and how you split your training data can be very important.
One reason that \(\widehat{R}_n(\widehat{f})\) is bad is that we are using the same data to pick \(\widehat{f}\) AND to estimate \(R_n\).
“Validation set” fixes this, but holds out a particular, fixed block of data we pretend mimics the “test data”
What if we set aside one observation, say the first one \((y_1, x_1)\).
We estimate \(\widehat{f}^{(1)}\) without using the first observation.
Then we test our prediction:
\[\widetilde{R}_1(\widehat{f}^{(1)}) = (y_1 -\widehat{f}^{(1)}(x_1))^2.\]
(why the notation \(\widetilde{R}_1\)? Because we’re estimating the risk with 1 observation.)
But that was only one data point \((y_1, x_1)\). Why stop there?
Do the same with \((y_2, x_2)\)! Get an estimate \(\widehat{f}^{(2)}\) without using it, then
\[\widetilde{R}_1(\widehat{f}^{(2)}) = (y_2 -\widehat{f}^{(2)}(x_2))^2.\]
We can keep doing this until we try it for every data point.
And then average them! (Averages are good)
\[\mbox{LOO-CV} = \frac{1}{n}\sum_{i=1}^n \widetilde{R}_1(\widehat{f}^{(i)}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{f}^{(i)}(x_i))^2\]
This is leave-one-out cross validation
🤮 Each held-out set is small \((n=1)\). Therefore, the variance of each squared-error term is high.
🤮 The training sets overlap almost entirely, so the \(n\) risk estimates are highly correlated. This is bad.
🤮 We have to estimate this model \(n\) times.
🎉 Bias is low because we used almost all the data to fit the model: \(E[\mbox{LOO-CV}] = R_{n-1}\)
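The brute-force LOO-CV loop looks like this as a sketch (the simulated data and model are illustrative only):

```r
# Brute-force leave-one-out cross validation: refit the model n times,
# each time leaving out a single observation and predicting it.
set.seed(406)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)
loo <- double(n)
for (i in seq_len(n)) {
  fit_i <- lm(y ~ x, data = dat[-i, ])                   # f-hat^(i): fit without obs i
  loo[i] <- (dat$y[i] - predict(fit_i, newdata = dat[i, ]))^2 # R-tilde_1 for obs i
}
loo_cv <- mean(loo)  # the LOO-CV risk estimate
```

The `n` refits in the loop are exactly the computational cost complained about above.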
To alleviate some of these problems, people usually use \(K\)-fold cross validation.
The idea of \(K\)-fold is to split the data into \(K\) disjoint folds, hold out each fold in turn, fit on the remaining \(K-1\) folds, and average the \(K\) held-out error estimates.
🎉 Less overlap, smaller covariance.
🎉 Larger hold-out sets, smaller variance.
🎉 Less computation (only need to estimate the model \(K\) times)
🤮 LOO-CV is (nearly) unbiased for \(R_n\)
🤮 K-fold CV is unbiased for \(R_{n(1-1/K)}\)
The risk depends on how much data you use to estimate the model. \(R_n\) depends on \(n\).
# `blue` and `red` are course-theme colours not defined in this snippet;
# the hex values below are placeholders so the figure code runs as-is.
blue <- "#00508c"
red <- "#e60000"
par(mar = c(0, 0, 0, 0))
plot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = "n", yaxt = "n", xaxt = "n")
rect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)
rect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10),
     rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)
points(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)
text(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c("1", "2", "3", "K"), cex = 3,
     col = red)
text(6, 4.5, "Training data", cex = 3, col = blue)
text(2, 1.5, "Validation data", cex = 3, col = red)
#' @param data The full data set
#' @param estimator Function. Has 1 argument (some data) and fits a model.
#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions
#' @param error_fun Function. Has one arg: the test data, with fits added.
#' @param kfolds Integer. The number of folds.
kfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {
  n <- nrow(data)
  fold_labels <- sample(rep(1:kfolds, length.out = n))
  errors <- double(kfolds)
  for (fold in seq_len(kfolds)) {
    test_rows <- fold_labels == fold
    train <- data[!test_rows, ]
    test <- data[test_rows, ]
    current_model <- estimator(train)
    test$.preds <- predictor(current_model, test)
    errors[fold] <- error_fun(test)
  }
  mean(errors)
}
library(dplyr) # error_fun below uses mutate() and pull()
somedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
est <- function(dataset) lm(z ~ ., data = dataset)
pred <- function(mod, dataset) predict(mod, newdata = dataset)
error_fun <- function(testdata) {
  mutate(testdata, errs = (z - .preds)^2) |> pull(errs) |> mean()
}
kfold_cv(somedata, est, pred, error_fun, 5)
[1] 0.9532271
For certain “nice” models, one can show
(after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage, assign as homework to belligerent students)
\[\mbox{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i -\widehat{y}_i)^2}{(1-h_{ii})^2} = \frac{1}{n} \sum_{i=1}^n \frac{\widehat{e}_i^2}{(1-h_{ii})^2}.\]
This trick means that you only have to fit the model once rather than \(n\) times!
You still have to calculate this for each model!
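For lm specifically, this identity can be checked numerically. A sketch on simulated data (all specifics illustrative), using `hatvalues()` to extract the \(h_{ii}\):

```r
# Shortcut LOO-CV via hatvalues(), checked against the brute-force loop.
set.seed(406)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)
fit <- lm(y ~ x, data = dat)
# One fit: residuals deflated by (1 - h_ii), squared, averaged.
shortcut <- mean(residuals(fit)^2 / (1 - hatvalues(fit))^2)
# n fits: leave each observation out, predict it, square the error.
brute <- mean(sapply(seq_len(n), function(i) {
  (dat$y[i] - predict(lm(y ~ x, data = dat[-i, ]), newdata = dat[i, ]))^2
}))
all.equal(shortcut, brute)
```

For linear regression the two agree (up to floating point), which is why the one-fit version is such a useful trick.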
“Nice” requires: \(\widehat{\y} = \mathbf{H}\y\) for some matrix \(\mathbf{H}\) that does not depend on \(\y\) (a linear smoother), with \(h_{ii}\) the \(i\)th diagonal entry of \(\mathbf{H}\).
More tricks and what’s up with the name “hatvalues”
UBC Stat 406 - 2023