07 Greedy selection

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 18 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \]

Recap

Model Selection means select a family of distributions for your data.

Ideally, we’d do this by comparing the \(R_n\) for one family with that for another.

We’d use whichever has smaller \(R_n\).

But \(R_n\) depends on the truth, so we estimate it with \(\widehat{R}\).

Then we use whichever has smaller \(\widehat{R}\).

Example

The truth:

dat <- tibble(
  x1 = rnorm(100), 
  x2 = rnorm(100),
  y = 3 + x1 - 5 * x2 + sin(x1 * x2 / (2 * pi)) + rnorm(100, sd = 5)
)

Model 1: y ~ x1 + x2

Model 2: y ~ x1 + x2 + x1*x2

Model 3: y ~ x2 + sin(x1 * x2)

(What are the families for each of these?)

Fit each model and estimate \(R_n\)

forms <- list("y ~ x1 + x2", "y ~ x1 * x2", "y ~ x2 + sin(x1*x2)") |> 
  map(as.formula)
fits <- map(forms, ~ lm(.x, data = dat))
map(fits, ~ tibble(
  R2 = summary(.x)$r.sq,
  training_error = mean(residuals(.x)^2),
  loocv = mean( (residuals(.x) / (1 - hatvalues(.x)))^2 ),
  AIC = AIC(.x),
  BIC = BIC(.x)
)) |> list_rbind()
# A tibble: 3 × 5
     R2 training_error loocv   AIC   BIC
  <dbl>          <dbl> <dbl> <dbl> <dbl>
1 0.589           21.3  22.9  598.  608.
2 0.595           21.0  23.4  598.  611.
3 0.586           21.4  23.0  598.  609.

Model Selection vs. Variable Selection

Model selection is very comprehensive

You choose a full statistical model (probability distribution) that will be hypothesized to have generated the data.

Variable selection is a subset of this. It means

choosing which predictors to include in a predictive model

Eliminating a predictor, means removing it from the model.

Some procedures automatically search predictors, and eliminate some.

We call this variable selection. But the procedure is implicitly selecting a model as well.

Making this all the more complicated, with lots of effort, we can map procedures/algorithms to larger classes of probability models, and analyze them.

Selecting variables / predictors with linear methods

Suppose we have a pile of predictors.

We estimate models with different subsets of predictors and use CV / Cp / AIC / BIC to decide which is preferred.

Sometimes you might have a few plausible subsets. Easy enough to choose with our criterion.

Sometimes you might just have a bunch of predictors, then what do you do?

All subsets
estimate model based on every possible subset of size \(|\mathcal{S}| \leq \min\{n, p\}\), use one with lowest risk estimate
Forward selection
start with \(\mathcal{S}=\varnothing\), add predictors greedily
Backward selection
start with \(\mathcal{S}=\{1,\ldots,p\}\), remove greedily
Hybrid
combine forward and backward smartly

Note:

Within each procedure, we’re comparing nested models.

Costs and benefits

All subsets
👍 estimates each subset
💣 takes \(2^p\) model fits when \(p<n\). If \(p=50\), this is about \(10^{15}\) models.
Forward selection
👍 computationally feasible
💣 ignores some models, correlated predictors means bad performance
Backward selection
👍 computationally feasible
💣 ignores some models, correlated predictors means bad performance
💣 doesn’t work if \(p>n\)
Hybrid
👍 visits more models than forward/backward
💣 slower

Synthetic example

set.seed(123)
n <- 406
df <- tibble( # like data.frame, but columns can be functions of preceding
  x1 = rnorm(n),
  x2 = rnorm(n, mean = 2, sd = 1),
  x3 = rexp(n, rate = 1),
  x4 = x2 + rnorm(n, sd = .1), # correlated with x2
  x5 = x1 + rnorm(n, sd = .1), # correlated with x1
  x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)
  x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)
  y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only
)

\(\mathbf{x}_1\) and \(\mathbf{x}_2\) are the true predictors

But the rest are correlated with them

Full model

full <- lm(y ~ ., data = df)
summary(full)

Call:
lm(formula = y ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7739 -1.4283 -0.0929  1.4257  7.5869 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.03383    0.27700   0.122  0.90287   
x1           6.70481    2.06743   3.243  0.00128 **
x2          -0.43945    1.71650  -0.256  0.79807   
x3           1.37293    1.11524   1.231  0.21903   
x4          -1.19911    1.17850  -1.017  0.30954   
x5          -0.53918    1.07089  -0.503  0.61490   
x6          -1.88547    1.21652  -1.550  0.12196   
x7          -1.25245    1.10743  -1.131  0.25876   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.231 on 398 degrees of freedom
Multiple R-squared:  0.6411,    Adjusted R-squared:  0.6347 
F-statistic: 101.5 on 7 and 398 DF,  p-value: < 2.2e-16

True model

truth <- lm(y ~ x1 + x2, data = df)
summary(truth)

Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4519 -1.3873 -0.1941  1.3498  7.5533 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1676     0.2492   0.673   0.5015    
x1            3.0316     0.1146  26.447   <2e-16 ***
x2            0.2447     0.1109   2.207   0.0279 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.233 on 403 degrees of freedom
Multiple R-squared:  0.6357,    Adjusted R-squared:  0.6339 
F-statistic: 351.6 on 2 and 403 DF,  p-value: < 2.2e-16

All subsets

library(leaps)
trythemall <- regsubsets(y ~ ., data = df)
summary(trythemall)
Subset selection object
Call: regsubsets.formula(y ~ ., data = df)
7 Variables  (and intercept)
   Forced in Forced out
x1     FALSE      FALSE
x2     FALSE      FALSE
x3     FALSE      FALSE
x4     FALSE      FALSE
x5     FALSE      FALSE
x6     FALSE      FALSE
x7     FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
         x1  x2  x3  x4  x5  x6  x7 
1  ( 1 ) "*" " " " " " " " " " " " "
2  ( 1 ) "*" " " " " " " " " "*" " "
3  ( 1 ) "*" " " " " "*" " " "*" " "
4  ( 1 ) "*" " " "*" "*" " " "*" " "
5  ( 1 ) "*" " " "*" "*" " " "*" "*"
6  ( 1 ) "*" " " "*" "*" "*" "*" "*"
7  ( 1 ) "*" "*" "*" "*" "*" "*" "*"

BIC and Cp

tibble(
  BIC = summary(trythemall)$bic, 
  Cp = summary(trythemall)$cp,
  size = 1:7
) |>
  pivot_longer(-size) |>
  ggplot(aes(size, value, colour = name)) + 
  geom_point() + 
  geom_line() + 
  facet_wrap(~name, scales = "free_y") + 
  ylab("") +
  scale_colour_manual(
    values = c(blue, orange), 
    guide = "none"
  )

Forward stepwise

stepup <- regsubsets(y ~ ., data = df, method = "forward")
summary(stepup)
Subset selection object
Call: regsubsets.formula(y ~ ., data = df, method = "forward")
7 Variables  (and intercept)
   Forced in Forced out
x1     FALSE      FALSE
x2     FALSE      FALSE
x3     FALSE      FALSE
x4     FALSE      FALSE
x5     FALSE      FALSE
x6     FALSE      FALSE
x7     FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: forward
         x1  x2  x3  x4  x5  x6  x7 
1  ( 1 ) "*" " " " " " " " " " " " "
2  ( 1 ) "*" " " " " " " " " "*" " "
3  ( 1 ) "*" " " " " "*" " " "*" " "
4  ( 1 ) "*" " " "*" "*" " " "*" " "
5  ( 1 ) "*" " " "*" "*" " " "*" "*"
6  ( 1 ) "*" " " "*" "*" "*" "*" "*"
7  ( 1 ) "*" "*" "*" "*" "*" "*" "*"

BIC and Cp

tibble(
  BIC = summary(stepup)$bic,
  Cp = summary(stepup)$cp,
  size = 1:7
) |>
  pivot_longer(-size) |>
  ggplot(aes(size, value, colour = name)) +
  geom_point() +
  geom_line() +
  facet_wrap(~name, scales = "free_y") +
  ylab("") +
  scale_colour_manual(
    values = c(blue, orange),
    guide = "none"
  )

Backward selection

stepdown <- regsubsets(y ~ ., data = df, method = "backward")
summary(stepdown)
Subset selection object
Call: regsubsets.formula(y ~ ., data = df, method = "backward")
7 Variables  (and intercept)
   Forced in Forced out
x1     FALSE      FALSE
x2     FALSE      FALSE
x3     FALSE      FALSE
x4     FALSE      FALSE
x5     FALSE      FALSE
x6     FALSE      FALSE
x7     FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: backward
         x1  x2  x3  x4  x5  x6  x7 
1  ( 1 ) "*" " " " " " " " " " " " "
2  ( 1 ) "*" " " " " " " " " "*" " "
3  ( 1 ) "*" " " " " "*" " " "*" " "
4  ( 1 ) "*" " " "*" "*" " " "*" " "
5  ( 1 ) "*" " " "*" "*" " " "*" "*"
6  ( 1 ) "*" " " "*" "*" "*" "*" "*"
7  ( 1 ) "*" "*" "*" "*" "*" "*" "*"

BIC and Cp

tibble(
  BIC = summary(stepdown)$bic,
  Cp = summary(stepdown)$cp,
  size = 1:7
) |>
  pivot_longer(-size) |>
  ggplot(aes(size, value, colour = name)) +
  geom_point() +
  geom_line() +
  facet_wrap(~name, scales = "free_y") +
  ylab("") +
  scale_colour_manual(
    values = c(blue, orange), 
    guide = "none"
  )





somehow, for this seed, everything is the same

Randomness and prediction error

All of that was for one data set.

Doesn’t say which procedure is better generally.

If we want to know how they compare generally, we should repeat many times

  1. Generate training data
  2. Estimate with different algorithms
  3. Predict held-out set data
  4. Examine prediction MSE (on held-out set)

I’m not going to do all subsets, just the truth, forward selection, backward, and the full model

For forward/backward selection, I’ll use Cp to choose the final size

☠️☠️ Danger ☠️☠️

You cannot compare the Cp scores between forward selection and backward selection to decide which to use.

Why not?

Code for simulation

… Annoyingly, no predict method for regsubsets, so we make one.

predict.regsubsets <- function(object, newdata, risk_estimate = c("cp", "bic"), ...) {
  risk_estimate <- match.arg(risk_estimate)
  chosen <- coef(object, which.min(summary(object)[[risk_estimate]]))
  predictors <- names(chosen)
  if (object$intercept) predictors <- predictors[-1]
  X <- newdata[, predictors]
  if (object$intercept) X <- cbind2(1, X)
  drop(as.matrix(X) %*% chosen)
}

simulate_and_estimate_them_all <- function(n = 406) {
  N <- 2 * n # generate 2x the amount of data (half train, half test)
  df <- tibble( # generate data
    x1 = rnorm(N), 
    x2 = rnorm(N, mean = 2), 
    x3 = rexp(N),
    x4 = x2 + rnorm(N, sd = .1), 
    x5 = x1 + rnorm(N, sd = .1),
    x6 = x1 - x2 + rnorm(N, sd = .1), 
    x7 = x1 + x3 + rnorm(N, sd = .1),
    y = x1 * 3 + x2 / 3 + rnorm(N, sd = 2.2)
  )
  train <- df[1:n, ] # half the data for training
  test <- df[(n + 1):N, ] # half the data for evaluation
  
  oracle <- lm(y ~ x1 + x2 - 1, data = train) # knowing the right model, not the coefs
  full <- lm(y ~ ., data = train)
  stepup <- regsubsets(y ~ ., data = train, method = "forward")
  stepdown <- regsubsets(y ~ ., data = train, method = "backward")
  
  tibble(
    y = test$y,
    oracle = predict(oracle, newdata = test),
    full = predict(full, newdata = test),
    stepup = predict(stepup, newdata = test),
    stepdown = predict(stepdown, newdata = test),
    truth = drop(as.matrix(test[, c("x1", "x2")]) %*% c(3, 1/3))
  )
}

set.seed(12345)
our_sim <- map(1:50, ~ simulate_and_estimate_them_all(406)) |>
  list_rbind(names_to = "sim")

What is “Oracle”

Delfi Apollons tempel

Results

our_sim |> 
  group_by(sim) %>%
  summarise(
    across(oracle:truth, ~ mean((y - .)^2)), 
    .groups = "drop"
  ) %>%
  transmute(across(oracle:stepdown, ~ . / truth - 1)) |> 
  pivot_longer(
    everything(), 
    names_to = "method", 
    values_to = "mse"
  ) |> 
  ggplot(aes(method, mse, fill = method)) +
  geom_boxplot(notch = TRUE) +
  geom_hline(yintercept = 0, linewidth = 2) +
  scale_fill_viridis_d() +
  theme(legend.position = "none") +
  scale_y_continuous(
    labels = scales::label_percent()
  ) +
  ylab("% increase in mse relative\n to the truth")

Next time…

Module 2

regularization, constraints, and nonparametrics