20 Boosting

Stat 406

Daniel J. McDonald

Last modified – 02 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Last time

We learned about bagging, for averaging low-bias / high-variance estimators.

Today, we examine its opposite: Boosting.

Boosting also combines estimators, but it combines high-bias / low-variance estimators.

Boosting has a number of flavours. And if you Google descriptions, most are wrong.

For a deep (and accurate) treatment, see [ESL] Chapter 10

We’ll discuss 2 flavours: AdaBoost and Gradient Boosting

Neither requires a tree, but that’s the typical usage.

Boosting needs a “weak learner”, so small trees (stumps) are natural.

AdaBoost intuition (for classification)

At each iteration, we weight the observations.

Observations that are currently misclassified get higher weights.

So on the next iteration, we’ll try harder to correctly classify our mistakes.

The number of iterations must be chosen.

AdaBoost (Freund and Schapire, generic)

Let \(G(x, \theta)\) be any weak learner

⛭ imagine a tree with one split: then \(\theta=\) (feature, split point)

Algorithm (AdaBoost) 🛠️

  • Set observation weights \(w_i=1/n\).
  • Until we quit ( \(m<M\) iterations )
    1. Estimate the classifier \(G(x,\theta_m)\) using weights \(w_i\)
    2. Calculate its weighted error \(\textrm{err}_m = \sum_{i=1}^n w_i I(y_i \neq G(x_i, \theta_m)) / \sum_{i=1}^n w_i\)
    3. Set \(\alpha_m = \log((1-\textrm{err}_m)/\text{err}_m)\)
    4. Update \(w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G(x_i,\theta_m)))\)
  • Final classifier is \(G(x) = \textrm{sign}\left( \sum_{m=1}^M \alpha_m G(x, \theta_m)\right)\)
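A minimal R sketch of this loop, assuming rpart stumps as the weak learner and labels coded in \(\{-1, 1\}\); adaboost_sketch and predict_adaboost are illustrative names, not part of the course code.

library(rpart)

adaboost_sketch <- function(x, y, M = 50) {
  # x: data frame of features; y: labels coded in {-1, 1}
  n <- length(y)
  w <- rep(1 / n, n)                          # equal observation weights to start
  trees <- vector("list", M)
  alpha <- numeric(M)
  dat <- data.frame(x, y = factor(y))
  for (m in 1:M) {
    # 1. fit a stump (one split) using the current weights
    fit <- rpart(y ~ ., data = dat, weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(fit, dat, type = "class")))
    err <- sum(w * (pred != y)) / sum(w)      # 2. weighted misclassification error
    alpha[m] <- log((1 - err) / err)          # 3. weight for this weak learner
    w <- w * exp(alpha[m] * (pred != y))      # 4. up-weight the mistakes
    trees[[m]] <- fit
  }
  list(trees = trees, alpha = alpha)
}

predict_adaboost <- function(obj, newx) {
  # final classifier: sign of the alpha-weighted vote
  f <- rep(0, nrow(newx))
  for (m in seq_along(obj$trees)) {
    g <- as.numeric(as.character(predict(obj$trees[[m]], newx, type = "class")))
    f <- f + obj$alpha[m] * g
  }
  sign(f)
}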

Using mobility data again

Code
library(tidyverse) # dplyr/tidyr/ggplot2 etc.; loaded in the course setup chunk
library(kableExtra)
library(randomForest)
library(gbm)

# binary response: is Mobility greater than 0.1?
mob <- Stat406::mobility |>
  mutate(mobile = as.factor(Mobility > .1)) |>
  select(-ID, -Name, -Mobility, -State) |>
  drop_na()

# 75% / 25% train / test split
n <- nrow(mob)
trainidx <- sample.int(n, floor(n * .75))
testidx <- setdiff(1:n, trainidx)
train <- mob[trainidx, ]
test <- mob[testidx, ]

# random forest vs. bagging (bagging = random forest with mtry = all predictors)
rf <- randomForest(mobile ~ ., data = train)
bag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)
preds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))

# gbm needs {0, 1} responses
train_boost <- train |>
  mutate(mobile = as.integer(mobile) - 1)
test_boost <- test |>
  mutate(mobile = as.integer(mobile) - 1)

# AdaBoost via gbm's exponential-loss option
adab <- gbm(
  mobile ~ .,
  data = train_boost,
  n.trees = 500,
  distribution = "adaboost"
)
# classify by the sign of the predicted margin
preds$adab <- as.numeric(
  predict(adab, test_boost) > 0
)

# variable importance plot
par(mar = c(5, 11, 0, 1))
s <- summary(adab, las = 1)
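As a quick check (not in the original slides), one can compare test misclassification rates directly; the numbers depend on the random split, so none are reported here.

preds |>
  summarise(
    rf = mean(rf != truth),
    bagging = mean(bag != truth),
    adaboost = mean(adab != as.integer(truth == "TRUE"))
  )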

Forward stagewise additive modeling (FSAM, completely generic)

Algorithm 🛠️

  • Set initial predictor \(f_0(x)=0\)
  • Until we quit ( \(m<M\) iterations )
    1. Compute \((\beta_m, \theta_m) = \argmin_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i,\ \theta)\right)\)
    2. Set \(f_m(x) = f_{m-1}(x) + \beta_m G(x,\ \theta_m)\)
  • Final classifier is \(G(x) = \textrm{sign}\left( f_M(x) \right)\)

Here, \(L\) is a loss function that measures prediction accuracy

  • If (1) \(L(y,\ f(x))= \exp(-y f(x))\), (2) \(G\) is a classifier, and WLOG \(y \in \{-1, 1\}\)

FSAM is equivalent to AdaBoost. This equivalence was proven five years later (Friedman, Hastie, and Tibshirani 2000).
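To see the connection (this is the argument in [ESL], Chapter 10): with exponential loss, the objective in step 1 of FSAM factors into per-observation weights,

\[ \sum_{i=1}^n \exp\left(-y_i\left[f_{m-1}(x_i) + \beta G(x_i, \theta)\right]\right) = \sum_{i=1}^n w_i^{(m)} \exp\left(-\beta\, y_i\, G(x_i, \theta)\right), \qquad w_i^{(m)} = \exp\left(-y_i f_{m-1}(x_i)\right). \]

For fixed \(\beta > 0\), the best \(G\) minimizes the weighted misclassification error; solving for \(\beta\) gives \(\beta_m = \tfrac{1}{2}\log\left((1-\textrm{err}_m)/\textrm{err}_m\right)\), and the weights then update exactly as in AdaBoost (with \(\alpha_m = 2\beta_m\)).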

So what?

It turns out that “exponential loss” \(L(y,\ f(x))= \exp(-y f(x))\) is not very robust.

Here are some other loss functions for 2-class classification

We want losses that penalize negative margins but not positive margins.

Robust means we don’t over-penalize large negative margins.
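For example, writing each loss as a function of the margin \(y f(x)\) (for \(y \in \{-1, 1\}\), following [ESL] Chapter 10):

\[ \begin{aligned} \text{exponential:} &\quad \exp(-y f(x)) \\ \text{binomial deviance:} &\quad \log\left(1 + e^{-2 y f(x)}\right) \\ \text{hinge (SVM):} &\quad \left(1 - y f(x)\right)_+ \\ \text{squared error:} &\quad (y - f(x))^2 = \left(1 - y f(x)\right)^2 \end{aligned} \]

The deviance and hinge losses grow only linearly for large negative margins, while exponential loss grows exponentially, and squared error also (needlessly) penalizes large positive margins.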

Gradient boosting

In the forward stagewise algorithm, we solved a minimization and then made an update:

\[f_m(x) = f_{m-1}(x) + \beta_m G(x, \theta_m)\]

For most loss functions \(L\) / procedures \(G\) this optimization is difficult: \[\argmin_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i, \theta)\right)\]

💡 Just take one gradient step toward the minimum 💡

\[f_m(x) = f_{m-1}(x) -\gamma_m \nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\gamma_m \left(-\nabla L(y,f_{m-1}(x))\right)\]

This is called Gradient boosting

Notice how similar the update steps look.
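As a concrete check, take squared-error loss for a single observation:

\[ L(y, f(x)) = (y - f(x))^2 \quad\Longrightarrow\quad -\nabla L\left(y, f_{m-1}(x)\right) = 2\left(y - f_{m-1}(x)\right), \]

so the gradient step nudges each prediction toward its current residual. This is the “pseudo-residual” that reappears in the algorithm below.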

Gradient boosting

\[f_m(x) = f_{m-1}(x) -\gamma_m \nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\gamma_m \left(-\nabla L(y,f_{m-1}(x))\right)\]

Gradient boosting goes only part of the way toward the minimum at each \(m\).

This has two advantages:

  1. Since we’re not fitting \(\beta, \theta\) to the data as “hard”, the learner is weaker.

  2. This procedure is computationally much simpler.

Simpler because we only require the gradient at one value; we don’t have to fully optimize.

Gradient boosting – Algorithm 🛠️

  • Set initial predictor \(f_0(x)=\overline{\y}\)
  • Until we quit ( \(m<M\) iterations )
    1. Compute pseudo-residuals (what is the gradient of \(L(y,f)=(y-f(x))^2\)?) \[r_i = -\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\bigg|_{f(x_i)=f_{m-1}(x_i)}\]
    2. Estimate weak learner, \(G(x, \theta_m)\), with the training set \(\{r_i, x_i\}\).
    3. Find the step size \(\gamma_m = \argmin_\gamma \sum_{i=1}^n L(y_i, f_{m-1}(x_i) + \gamma G(x_i, \theta_m))\)
    4. Set \(f_m(x) = f_{m-1}(x) + \gamma_m G(x, \theta_m)\)
  • Final predictor is \(f_M(x)\).
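A minimal R sketch of this loop for squared-error loss, assuming rpart trees as the weak learner; grad_boost_sketch is an illustrative name, not part of the course code. For squared error, the pseudo-residuals are (up to a factor of 2) the ordinary residuals, and the line search in step 3 gives \(\gamma_m = 1\) because the regression tree is itself fit to the residuals by least squares.

library(rpart)

grad_boost_sketch <- function(x, y, M = 100) {
  f <- rep(mean(y), length(y))                        # f_0(x) = mean of y
  trees <- vector("list", M)
  for (m in 1:M) {
    r <- y - f                                        # 1. pseudo-residuals
    fit <- rpart(r ~ ., data = data.frame(x, r = r),  # 2. small tree (weak learner)
                 control = rpart.control(maxdepth = 2))
    trees[[m]] <- fit
    f <- f + predict(fit, data.frame(x))              # 3.-4. update (gamma_m = 1 here)
  }
  list(f0 = mean(y), trees = trees)
}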

Gradient boosting modifications

grad_boost <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = "bernoulli")

  • Typically done with “small” trees, not stumps, because of the gradient. You can specify the size. Usually 4-8 terminal nodes are recommended (more terminal nodes allow more interactions between predictors).

  • Usually modify the gradient step to \(f_m(x) = f_{m-1}(x) + \gamma_m \alpha G(x,\theta_m)\) with \(0<\alpha<1\). Helps to keep from fitting too hard.

  • Often combined with Bagging so that each step is fit using a bootstrap resample of the data. Gives us out-of-bag options.

  • There are many other extensions, notably XGBoost.
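In gbm, each of these modifications corresponds to an argument: interaction.depth controls the tree size, shrinkage is the damping factor \(\alpha\), and bag.fraction subsamples the training data at each step. The values below are illustrative, not tuned.

grad_boost_small <- gbm(
  mobile ~ .,
  data = train_boost,
  distribution = "bernoulli",
  n.trees = 500,
  interaction.depth = 3, # small trees rather than stumps
  shrinkage = 0.1,       # the damping factor alpha above
  bag.fraction = 0.5     # fit each tree to a random half of the data
)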

Results for mobility

Code
library(tidyverse) # ggplot2, dplyr, tidyr
library(cowplot)
# colour objects (red, orange, blue) are defined in the course setup chunk;
# plain colour names stand in here so this chunk runs on its own
red <- "red"; orange <- "orange"; blue <- "blue"

# predicted margins on the test set (sign > 0 classifies as "mobile")
boost_preds <- tibble(
  adaboost = predict(adab, test_boost),
  gbm = predict(grad_boost, test_boost),
  truth = test$mobile
)

# left panel: the two margins against each other, labelled/coloured by the truth
g1 <- ggplot(boost_preds, aes(adaboost, gbm, color = as.factor(truth))) +
  geom_text(aes(label = as.integer(truth) - 1)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  xlab("adaboost margin") +
  ylab("gbm margin") +
  theme(legend.position = "none") +
  scale_color_manual(values = c("orange", "blue")) +
  annotate("text",
    x = -4, y = 5, color = red,
    label = paste(
      "gbm error\n",
      round(with(boost_preds, mean((gbm > 0) != truth)), 2)
    )
  ) +
  annotate("text",
    x = 4, y = -5, color = red,
    label = paste(
      "adaboost error\n",
      round(with(boost_preds, mean((adaboost > 0) != truth)), 2)
    )
  )

# right panel: out-of-bag improvement as trees are added
boost_oob <- tibble(
  adaboost = adab$oobag.improve, gbm = grad_boost$oobag.improve,
  ntrees = 1:500
)
g2 <- boost_oob %>%
  pivot_longer(-ntrees, values_to = "OOB_Error") %>%
  ggplot(aes(x = ntrees, y = OOB_Error, color = name)) +
  geom_line() +
  scale_color_manual(values = c(orange, blue)) +
  theme(legend.title = element_blank())

plot_grid(g1, g2, rel_widths = c(.4, .6))

Major takeaways

  • Two flavours of Boosting

    1. AdaBoost (the original) and
    2. gradient boosting (easier and more computationally friendly)
  • The connection is “Forward stagewise additive modelling” (AdaBoost is a special case)

  • The connection reveals that AdaBoost “isn’t robust because it uses exponential loss” (squared error is even worse)

  • Gradient boosting is a computationally easier version of FSAM

  • All use weak learners (compare to Bagging)

  • Think about the Bias-Variance implications

  • You can use these for regression or classification

  • You can do this with other weak learners besides trees.

Next time…

Neural networks and deep learning, the beginning