Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 18 September 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]
Model Selection means selecting a family of distributions for your data.
Ideally, we’d do this by comparing the \(R_n\) for one family with that for another.
We’d use whichever has smaller \(R_n\).
But \(R_n\) depends on the truth, so we estimate it with \(\widehat{R}\).
Then we use whichever has smaller \(\widehat{R}\).
The truth: the data dat were generated from some fixed model. We compare three candidate models:
Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + x1*x2
Model 3: y ~ x2 + sin(x1 * x2)
(What are the families for each of these?)
library(tidyverse) # for map(), tibble(), list_rbind(); likely loaded earlier in the deck

forms <- list("y ~ x1 + x2", "y ~ x1 * x2", "y ~ x2 + sin(x1 * x2)") |>
  map(as.formula)
fits <- map(forms, ~ lm(.x, data = dat)) # dat holds x1, x2, y from the true model
map(fits, ~ tibble(
  R2 = summary(.x)$r.sq,
  training_error = mean(residuals(.x)^2),
  loocv = mean((residuals(.x) / (1 - hatvalues(.x)))^2), # LOO-CV shortcut for lm
  AIC = AIC(.x),
  BIC = BIC(.x)
)) |> list_rbind()
# A tibble: 3 × 5
R2 training_error loocv AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.589 21.3 22.9 598. 608.
2 0.595 21.0 23.4 598. 611.
3 0.586 21.4 23.0 598. 609.
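The loocv column uses the leave-one-out shortcut for least squares (the hatvalues() trick in the code above): the leave-one-out residuals can be obtained from the ordinary residuals and the leverages without refitting,
\[ \text{LOO-CV} = \frac{1}{n} \sum_{i=1}^n \left( \frac{e_i}{1 - h_{ii}} \right)^2, \]
where \(e_i\) is the \(i\)th residual and \(h_{ii}\) is the \(i\)th diagonal entry of the hat matrix.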
Model selection is very comprehensive.
You choose a full statistical model (a probability distribution) that is hypothesized to have generated the data.
Variable selection is a subset of this. It means choosing which predictors to include in a predictive model.
Eliminating a predictor means removing it from the model.
Some procedures automatically search over predictors and eliminate some of them.
We call this variable selection. But the procedure is implicitly selecting a model as well.
Making this all the more complicated: with lots of effort, we can map these procedures/algorithms onto larger classes of probability models and analyze them.
Suppose we have a pile of predictors.
We estimate models with different subsets of predictors and use CV / Cp / AIC / BIC to decide which is preferred.
Sometimes you might have a few plausible subsets. It's easy enough to choose among them with our criterion.
Sometimes you might just have a bunch of predictors; then what do you do? (With p predictors, there are 2^p possible subsets.)
Within each procedure, we’re comparing nested models.
set.seed(123)
n <- 406
df <- tibble( # like data.frame, but columns can be functions of preceding
x1 = rnorm(n),
x2 = rnorm(n, mean = 2, sd = 1),
x3 = rexp(n, rate = 1),
x4 = x2 + rnorm(n, sd = .1), # correlated with x2
x5 = x1 + rnorm(n, sd = .1), # correlated with x1
x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)
x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)
y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only
)
\(\mathbf{x}_1\) and \(\mathbf{x}_2\) are the true predictors
But the rest are correlated with them
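The two summaries below compare the full model (all seven predictors) with the true two-predictor model. The fitting code isn't shown in the slides; a minimal sketch matching the Call lines below is:

summary(lm(y ~ ., data = df))       # full model: all seven predictors
summary(lm(y ~ x1 + x2, data = df)) # the true predictors only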
Call:
lm(formula = y ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-6.7739 -1.4283 -0.0929 1.4257 7.5869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03383 0.27700 0.122 0.90287
x1 6.70481 2.06743 3.243 0.00128 **
x2 -0.43945 1.71650 -0.256 0.79807
x3 1.37293 1.11524 1.231 0.21903
x4 -1.19911 1.17850 -1.017 0.30954
x5 -0.53918 1.07089 -0.503 0.61490
x6 -1.88547 1.21652 -1.550 0.12196
x7 -1.25245 1.10743 -1.131 0.25876
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.231 on 398 degrees of freedom
Multiple R-squared: 0.6411, Adjusted R-squared: 0.6347
F-statistic: 101.5 on 7 and 398 DF, p-value: < 2.2e-16
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
Min 1Q Median 3Q Max
-6.4519 -1.3873 -0.1941 1.3498 7.5533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1676 0.2492 0.673 0.5015
x1 3.0316 0.1146 26.447 <2e-16 ***
x2 0.2447 0.1109 2.207 0.0279 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.233 on 403 degrees of freedom
Multiple R-squared: 0.6357, Adjusted R-squared: 0.6339
F-statistic: 351.6 on 2 and 403 DF, p-value: < 2.2e-16
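The subset-selection output that follows comes from regsubsets() in the leaps package. The calls aren't shown; a sketch matching the Call lines below (exhaustive search is the default method):

library(leaps)
summary(regsubsets(y ~ ., data = df))                      # all subsets (exhaustive)
summary(regsubsets(y ~ ., data = df, method = "forward"))  # forward stepwise
summary(regsubsets(y ~ ., data = df, method = "backward")) # backward stepwise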
Subset selection object
Call: regsubsets.formula(y ~ ., data = df)
7 Variables (and intercept)
Forced in Forced out
x1 FALSE FALSE
x2 FALSE FALSE
x3 FALSE FALSE
x4 FALSE FALSE
x5 FALSE FALSE
x6 FALSE FALSE
x7 FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
x1 x2 x3 x4 x5 x6 x7
1 ( 1 ) "*" " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " "*" " "
3 ( 1 ) "*" " " " " "*" " " "*" " "
4 ( 1 ) "*" " " "*" "*" " " "*" " "
5 ( 1 ) "*" " " "*" "*" " " "*" "*"
6 ( 1 ) "*" " " "*" "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Subset selection object
Call: regsubsets.formula(y ~ ., data = df, method = "forward")
7 Variables (and intercept)
Forced in Forced out
x1 FALSE FALSE
x2 FALSE FALSE
x3 FALSE FALSE
x4 FALSE FALSE
x5 FALSE FALSE
x6 FALSE FALSE
x7 FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: forward
x1 x2 x3 x4 x5 x6 x7
1 ( 1 ) "*" " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " "*" " "
3 ( 1 ) "*" " " " " "*" " " "*" " "
4 ( 1 ) "*" " " "*" "*" " " "*" " "
5 ( 1 ) "*" " " "*" "*" " " "*" "*"
6 ( 1 ) "*" " " "*" "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Subset selection object
Call: regsubsets.formula(y ~ ., data = df, method = "backward")
7 Variables (and intercept)
Forced in Forced out
x1 FALSE FALSE
x2 FALSE FALSE
x3 FALSE FALSE
x4 FALSE FALSE
x5 FALSE FALSE
x6 FALSE FALSE
x7 FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: backward
x1 x2 x3 x4 x5 x6 x7
1 ( 1 ) "*" " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " "*" " "
3 ( 1 ) "*" " " " " "*" " " "*" " "
4 ( 1 ) "*" " " "*" "*" " " "*" " "
5 ( 1 ) "*" " " "*" "*" " " "*" "*"
6 ( 1 ) "*" " " "*" "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Somehow, for this seed, all three procedures select exactly the same sequence of models.
All of that was for one data set.
Doesn’t say which procedure is better generally.
If we want to know how they compare generally, we should repeat many times
I'm not going to do all subsets, just the true model (the oracle), forward selection, backward selection, and the full model.
For forward/backward selection, I'll use Cp (Mallows' Cp, recalled below) to choose the final size.
You cannot compare the Cp scores between forward selection and backward selection to decide which to use.
Why not?
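For reference, the Cp minimized in the next code chunk is (a version of) Mallows' Cp,
\[ C_p = \frac{\text{RSS}_d}{\hat\sigma^2} - n + 2d, \]
where \(\text{RSS}_d\) is the residual sum of squares of a candidate model with \(d\) coefficients and \(\hat\sigma^2\) is the residual variance estimate from the full model. (Other scalings of Cp exist, but they select the same model.)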
Annoyingly, there is no predict method for regsubsets, so we make one.
predict.regsubsets <- function(object, newdata, risk_estimate = c("cp", "bic"), ...) {
  risk_estimate <- match.arg(risk_estimate)
  # coefficients of the model that minimizes the chosen risk estimate
  chosen <- coef(object, which.min(summary(object)[[risk_estimate]]))
  predictors <- names(chosen)
  if (object$intercept) predictors <- predictors[-1] # drop the "(Intercept)" name
  X <- newdata[, predictors]
  if (object$intercept) X <- cbind2(1, X) # add back a column of 1s
  drop(as.matrix(X) %*% chosen)
}
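With this method defined, predictions from a regsubsets fit work like any other model. A hypothetical one-off example (the object name is made up; the real use is inside the simulation below):

stepfit <- regsubsets(y ~ ., data = df, method = "forward")
head(predict(stepfit, newdata = df, risk_estimate = "cp")) # fitted values at the Cp-chosen size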
simulate_and_estimate_them_all <- function(n = 406) {
N <- 2 * n # generate 2x the amount of data (half train, half test)
df <- tibble( # generate data
x1 = rnorm(N),
x2 = rnorm(N, mean = 2),
x3 = rexp(N),
x4 = x2 + rnorm(N, sd = .1),
x5 = x1 + rnorm(N, sd = .1),
x6 = x1 - x2 + rnorm(N, sd = .1),
x7 = x1 + x3 + rnorm(N, sd = .1),
y = x1 * 3 + x2 / 3 + rnorm(N, sd = 2.2)
)
train <- df[1:n, ] # half the data for training
test <- df[(n + 1):N, ] # half the data for evaluation
oracle <- lm(y ~ x1 + x2 - 1, data = train) # knowing the right model, not the coefs
full <- lm(y ~ ., data = train)
stepup <- regsubsets(y ~ ., data = train, method = "forward")
stepdown <- regsubsets(y ~ ., data = train, method = "backward")
tibble(
y = test$y,
oracle = predict(oracle, newdata = test),
full = predict(full, newdata = test),
stepup = predict(stepup, newdata = test),
stepdown = predict(stepdown, newdata = test),
truth = drop(as.matrix(test[, c("x1", "x2")]) %*% c(3, 1/3))
)
}
set.seed(12345)
our_sim <- map(1:50, ~ simulate_and_estimate_them_all(406)) |>
list_rbind(names_to = "sim")
our_sim |>
  group_by(sim) |>
  summarise(
    across(oracle:truth, ~ mean((y - .x)^2)),
    .groups = "drop"
  ) |>
  transmute(across(oracle:stepdown, ~ .x / truth - 1)) |>
  pivot_longer(everything(), names_to = "method", values_to = "mse") |>
  ggplot(aes(method, mse, fill = method)) +
  geom_boxplot(notch = TRUE) +
  geom_hline(yintercept = 0, linewidth = 2) +
  scale_fill_viridis_d() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = scales::label_percent()) +
  ylab("% increase in mse relative\n to the truth")
Module 2
regularization, constraints, and nonparametrics
UBC Stat 406 - 2024