or the \(p\)-values from the lm output for this purpose.

These quantities are meant to assess whether those parameters are different from zero if you were to repeat the experiment many times, if the model were true, etc.

In other words, they are useful for inference problems.

This is not the same as being useful for prediction problems (i.e. how to get small \(R_n\)).

Don’t use training error: the formal argument

Our training error \(\hat R_n(\hat f)\) is an estimator of \(R_n\).

So we can ask “is \(\widehat{R}_n(\hat{f})\) a good estimator for \(R_n\)?”

The error of our risk estimator

Let’s measure the error of our empirical risk estimator:

\[E[(R_n - \hat R_n(\hat f))^2]\]

(What is the expectation with respect to?)

\(R_n\) is deterministic (we average over test data and training data)

\(\hat R_n(\hat f)\) depends only on the training data

So the expectation is with respect to our training dataset

As before, we can decompose the error of our risk estimator into bias and variance
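Written out for the training-error estimator, the decomposition would look as follows. This is a sketch in the same notation as above: \(R_n\) is treated as a fixed number and every expectation is over the training data.

\[
E[(R_n - \hat R_n(\hat f))^2] = \underbrace{(R_n - E[\hat R_n(\hat f)])^2}_{\text{bias}^2} + \underbrace{E[(\hat R_n(\hat f) - E[\hat R_n(\hat f)])^2]}_{\text{variance}}
\]

The cross term vanishes because \(R_n - E[\hat R_n(\hat f)]\) is a constant and \(\hat R_n(\hat f) - E[\hat R_n(\hat f)]\) has mean zero.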

Formalizing why \(\hat R_n(\hat f)\) is a bad estimator of \(R_n\)

Consider an alternative estimator built from \(\{ (X_j, Y_j) \}_{j=1}^m\) that was not part of the training set. \[\tilde R_m(\hat f) = {\textstyle \frac{1}{m} \sum_{j=1}^m} \ell(Y_j, \hat f(X_j)),
\] The error of this estimator can also be decomposed into (squared) bias and variance: \[
E[(R_n - \tilde R_m(\hat f))^2] = \underbrace{( R_n - E_{\hat f,X_j,Y_j}[\tilde R_m(\hat f)])^2}_{\text{bias}^2} + \underbrace{E_{\hat f,X_j,Y_j}[( \tilde R_m(\hat f) - E_{\hat f,X_j,Y_j}[\tilde R_m(\hat f)])^2]}_{\text{variance}}
\]

Is the bias of \(\tilde R_m(\hat f)\) small or large? Why?

One option is to have a separate “holdout” or “validation” dataset.

Tip

This option follows the logic on the previous slide.
If we randomly “hold out” \(\{ (X_j, Y_j) \}_{j=1}^m\) from the training set, we can use this data to get a (nearly) unbiased estimator of \(R_n\). \[
R_n \approx \tilde R_m(\hat f) \triangleq {\textstyle{\frac 1 m \sum_{j=1}^m \ell ( Y_j, \hat f(X_j))}}
\]

👍 Estimates the test error

👍 Fast computationally

🤮 Estimate is random

🤮 Estimate has high variance (depends on 1 choice of split)

🤮 Estimate has a little bias (because we aren’t estimating \(\hat f\) from all of the training data)
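A minimal sketch of a holdout estimate in R, under assumptions that are not from the slides: the data are simulated from a hypothetical linear model, the predictor is `lm()`, and the loss is squared error. The names `dat`, `holdout_rows`, and `holdout_error` are made up for illustration.

```r
set.seed(406) # arbitrary seed, only so the split is reproducible

# Hypothetical simulated data: y is linear in x plus N(0, 1) noise
n_total <- 200
dat <- data.frame(x = rnorm(n_total))
dat$y <- 1 + 2 * dat$x + rnorm(n_total)

# Randomly hold out m rows; the remaining rows are used for training
m <- 50
holdout_rows <- sample(n_total, m)
train <- dat[-holdout_rows, ]
holdout <- dat[holdout_rows, ]

# Fit f-hat using the training rows only
fit <- lm(y ~ x, data = train)

# Average squared-error loss over the held-out rows: R-tilde_m(f-hat)
holdout_error <- mean((holdout$y - predict(fit, newdata = holdout))^2)

# Training error R-hat_n(f-hat), for comparison
train_error <- mean(residuals(fit)^2)

c(train = train_error, holdout = holdout_error)
```

Rerunning with a different seed changes the split and therefore the estimate, which is the “depends on 1 choice of split” issue listed above.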

Aside

In my experience, CS has particular definitions of “training”, “validation”, and “test” data.

I think these are not quite the same as in Statistics.

Test data - Hypothetical data you don’t get to see, ever. Infinite amounts drawn from the population.

Expected test error or Risk is an expected value over this distribution. It’s not a sum over some data kept aside.

Sometimes I’ll give you “test data”. You pretend that this is a good representation of the expectation and use it to see how well you did on the training data.

Training data - This is data that you get to touch.

Validation set - Often, we need to choose models. One way to do this is to split off some of your training data and pretend that it’s like a “Test Set”.

When and how you split your training data can be very important.

Announcements Sept 24

Lab Section 03 on Monday next week: it’s a holiday. Your lab will be due Friday instead.

Everyone else’s lab: same time as usual.

Review of Risk (Estimation)

We fixed a bunch of subtle issues in the Sept 19 Risk Estimation slides.

And I’ve noticed some related confusion in my office hours, so it’s review time!

Tip

After this lecture, make sure to review the whole Risk Estimation slide deck to see all the fixed terminology/definitions/examples.

Risk vs. Test Error

Risk (\(R_n\)): expected error when the training data have not yet been observed (random!)

depends on true data distribution, predictor

Test Error (\(T_n\)): expected error when the training data have been observed (fixed!)

depends on true data distribution, predictor, training data

\(R_n = \E[T_n]\)

we mostly care about risk when designing a new predictor / estimator

we want to know how well it will work on future training and test data

we mostly care about test error when we’ve fit a predictor / estimator

we want to know how well it will work on future test data

An important clarification

Important

In previous lectures, we’ve used \(R_n(\hat f)\) to denote risk.

Confusing: the \(\hat f\) argument looks like a trained predictor, so \(R_n(\hat f)\) looks like a function of training data. Not true!

We will just avoid this confusion from now on and use \(R_n\) for risk.

Risk vs. Test Error: Warmup in 1D

Model: \(\mathcal{P} = \{ P: \quad Y \sim \mathcal N(\mu, 1), \quad \mu\in\R\}\), loss \(\ell(y,\hat y) = (y-\hat y)^2\)

Training Data: \(Y_i\in\R\) from some unknown “true” \(P_0 \in \mathcal{P}\) (i.e., “true” \(\mu_0\))

Predictor: function of training data \(\hat\mu : \R^n \to \R\)

e.g., scaled empirical average \(\hat\mu(Y_{1:n}) = \alpha\frac{1}{n}\sum_{i=1}^n Y_i\) for \(\alpha > 0\)

Risk \(R_n\): expected loss over both training and test data \(R_n = E[\ell(Y,\hat\mu(Y_{1:n}))]\)

function of true dist \(P_0\) and predictor function \(\hat\mu(\cdot)\)

averages over randomness in training data

Test Error \(T_n\): expected loss over only test data \(T_n(\hat\mu) = E[\ell(Y,\hat\mu(Y_{1:n})) | Y_{1:n}]\)

function of true dist \(P_0\), predictor function \(\hat\mu(\cdot)\), and training data

training data (and trained predictor) are known/fixed
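For the scaled empirical average \(\hat\mu(Y_{1:n}) = \alpha \bar Y_n\), both quantities can be worked out in closed form. This is a sketch that assumes the test point \(Y\) is drawn from \(P_0\) independently of the training data; \(\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i\) is shorthand introduced here.

\[
T_n = E[(Y - \alpha \bar Y_n)^2 \mid Y_{1:n}] = 1 + (\mu_0 - \alpha \bar Y_n)^2
\]

\[
R_n = E[T_n] = 1 + (1-\alpha)^2 \mu_0^2 + \frac{\alpha^2}{n}
\]

\(T_n\) still changes with the training sample through \(\bar Y_n\), while \(R_n\) is a single fixed number once \(\mu_0\), \(\alpha\), and \(n\) are given.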

Predictor: scaled empirical average \(\hat\mu(Y_{1:n}) = \alpha \frac{1}{n}\sum_{i=1}^n Y_i\) for a fixed \(\alpha > 0\)
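A small simulation can make the distinction concrete. This is a sketch, not part of the original slides: the values of `mu0`, `n`, and `alpha` are arbitrary choices, and it assumes the test point is independent of the training data, so that the test error for a given training set is \(1 + (\mu_0 - \hat\mu)^2\) and the risk is \(1 + (1-\alpha)^2\mu_0^2 + \alpha^2/n\).

```r
set.seed(406) # arbitrary seed

mu0 <- 2     # hypothetical "true" mean (unknown in practice)
n <- 10      # training set size
alpha <- 0.8 # fixed scaling of the empirical average

# Predictor: scaled empirical average of the training data
mu_hat <- function(y_train, alpha) alpha * mean(y_train)

# Test error for ONE observed training set, using the closed form
# E[(Y - mu_hat)^2 | training data] = 1 + (mu0 - mu_hat)^2 for Y ~ N(mu0, 1)
test_error <- function(y_train) 1 + (mu0 - mu_hat(y_train, alpha))^2

# Risk: average the test error over many independent training sets
n_reps <- 20000
test_errors <- replicate(n_reps, test_error(rnorm(n, mean = mu0, sd = 1)))
risk_mc <- mean(test_errors)

# Closed-form risk under the same assumptions
risk_exact <- 1 + (1 - alpha)^2 * mu0^2 + alpha^2 / n

c(monte_carlo = risk_mc, closed_form = risk_exact)
range(test_errors) # test error varies from one training set to the next
```

Each entry of `test_errors` is the test error for one particular training set; the risk is the average of those values.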

🎉 Less costly than LOO CV (i.e., actually possible; only need to train \(K\) times)

💩 K-fold CV has higher sq. bias \((R_n - R_{n(1-1/K)})^2\) than LOO CV \((R_n - R_{n-1})^2\)

a bit painful to compare variance…

I hereby invoke the sacred incantation: “this exercise is left to the reader”

The overall risk \(R_n\) depends on \(n\).

Tip

In practice, most people just default to using 5-fold or 10-fold. This is probably fine in most cases.

K-fold CV: Code

#' @param data The full data set
#' @param estimator Function. Has 1 argument (some data) and fits a model.
#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions
#' @param error_fun Function. Has one arg: the test data, with fits added.
#' @param kfolds Integer. The number of folds.
kfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {
  n <- nrow(data)
  fold_labels <- sample(rep(1:kfolds, length.out = n))
  errors <- double(kfolds)
  for (fold in seq_len(kfolds)) {
    test_rows <- fold_labels == fold
    train <- data[!test_rows, ]
    test <- data[test_rows, ]
    current_model <- estimator(train)
    test$.preds <- predictor(current_model, test)
    errors[fold] <- error_fun(test)
  }
  mean(errors)
}
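One way `kfold_cv()` might be called is sketched below. The simulated data and the helpers `est`, `pred`, and `mse` are hypothetical, chosen only to match the argument descriptions above; the function itself is repeated so that the block runs on its own.

```r
# kfold_cv() repeated from the slide so this example is self-contained
kfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {
  n <- nrow(data)
  fold_labels <- sample(rep(1:kfolds, length.out = n))
  errors <- double(kfolds)
  for (fold in seq_len(kfolds)) {
    test_rows <- fold_labels == fold
    train <- data[!test_rows, ]
    test <- data[test_rows, ]
    current_model <- estimator(train)
    test$.preds <- predictor(current_model, test)
    errors[fold] <- error_fun(test)
  }
  mean(errors)
}

set.seed(406) # arbitrary seed; fold assignment is random

# Hypothetical simulated data: y is linear in x plus N(0, 1) noise
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)

# estimator: takes some data and fits a model
est <- function(dataset) lm(y ~ x, data = dataset)
# predictor: takes the fitted model and new data, returns predictions
pred <- function(mod, dataset) predict(mod, newdata = dataset)
# error_fun: takes the test data with the .preds column added
mse <- function(testdata) mean((testdata$y - testdata$.preds)^2)

cv_err <- kfold_cv(dat, est, pred, mse, kfolds = 5)
cv_err
```

Passing the fitting, prediction, and error steps as functions lets the same loop be reused for other models by swapping out `est`, `pred`, and `mse`.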