class: center, middle, inverse, title-slide

.title[
# 20 Boosting
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-11-02
]

---

## Last time

We learned about bagging, for averaging .secondary[low-bias] / .primary[high-variance] estimators.

Today, we examine its opposite: __Boosting__.

__Boosting__ also combines estimators, but it combines __high-bias__ / low-variance estimators.

Boosting has a number of flavours. And if you Google descriptions, most are wrong.

For a deep (and accurate) treatment, see [ESL] Chapter 10

--

We'll discuss 2 flavours: AdaBoost and Gradient Boosting

Neither requires a tree, but that's the typical usage.

Boosting needs a "weak learner", so small trees (called stumps) are natural.

`$$\newcommand{\Expect}[1]{E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\argmin}{\arg\min} \newcommand{\argmax}{\arg\max} \newcommand{\R}{\mathbb{R}} \newcommand{\P}{\mathbb{P}} \renewcommand{\hat}{\widehat} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}}$$`

---

## AdaBoost intuition

At each iteration, we weight the __observations__.

Observations that are currently misclassified get __higher__ weights.

So on the next iteration, we'll try harder to correctly classify our mistakes.

The number of iterations must be chosen.

---

## AdaBoost (Freund and Schapire)

Let `\(G(x, \theta)\)` be a weak learner (say a tree with one split)

.emphasis[
__Algorithm (AdaBoost):__

1. Set observation weights `\(w_i=1/n\)`.
2. Until we quit ( `\(m<M\)` iterations )

    a. Estimate the classifier `\(G(x,\theta_m)\)` using weights `\(w_i\)`

    b. Calculate its weighted error `\(\textrm{err}_m = \sum_{i=1}^n w_i I(y_i \neq G(x_i, \theta_m)) / \sum w_i\)`

    c. Set `\(\alpha_m = \log((1-\textrm{err}_m)/\textrm{err}_m)\)`

    d. Update `\(w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G(x_i,\theta_m)))\)`

3. Final classifier is `\(G(x) = \textrm{sign}\left( \sum_{m=1}^M \alpha_m G(x, \theta_m)\right)\)`
]

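---

## AdaBoost by hand (a sketch)

A minimal sketch of the algorithm above, using depth-1 `rpart` trees (stumps) as the weak learner. The data frame `dat`, the `\(\pm 1\)` response `y`, and the function names are hypothetical; this is for illustration only, not the `gbm` implementation used on the next slide.

```r
library(rpart)

# Hypothetical sketch of the AdaBoost loop with rpart stumps
adaboost_sketch <- function(dat, y, M = 100) {
  n <- nrow(dat)
  w <- rep(1 / n, n)                      # 1. equal observation weights
  trees <- vector("list", M)
  alpha <- numeric(M)
  for (m in seq_len(M)) {
    # a. weak learner: one split, fit with the current weights
    trees[[m]] <- rpart(factor(y) ~ ., data = dat, weights = w,
                        control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(trees[[m]], dat, type = "class") == "1", 1, -1)
    miss <- as.numeric(pred != y)
    err <- sum(w * miss) / sum(w)         # b. weighted error
    alpha[m] <- log((1 - err) / err)      # c. classifier weight
    w <- w * exp(alpha[m] * miss)         # d. upweight the mistakes
  }
  list(trees = trees, alpha = alpha)
}

# Final classifier: sign of the alpha-weighted vote
predict_adaboost <- function(fit, newdata) {
  score <- 0
  for (m in seq_along(fit$trees)) {
    gm <- ifelse(predict(fit$trees[[m]], newdata, type = "class") == "1", 1, -1)
    score <- score + fit$alpha[m] * gm
  }
  sign(score)
}
```

In practice you would use `gbm` or `xgboost` rather than rolling your own, but the loop makes the reweighting explicit.
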
---

## Using mobility data again

```r
library(gbm)
train_boost <- train %>%
  mutate(mobile = as.integer(mobile) - 1) # needs {0, 1} responses
test_boost <- test %>%
  mutate(mobile = as.integer(mobile) - 1)
adab <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = "adaboost")
preds$adab <- as.numeric(predict(adab, test_boost) > 0)
par(mar = c(5, 15, 0, 1))
summary(adab, las = 1)
```

<img src="rmd_gfx/20-boosting/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

```
##                                                 var    rel.inf
## Single_mothers                       Single_mothers 14.0404924
## Local_tax_rate                       Local_tax_rate  6.7989919
## Religious                                 Religious  6.6286695
## Commute                                     Commute  6.4179702
## Test_scores                             Test_scores  5.9974412
## Black                                         Black  4.8716771
## Manufacturing                         Manufacturing  4.7307967
## Gini_99                                     Gini_99  4.2246181
## Chinese_imports                     Chinese_imports  3.9555660
## Latitude                                   Latitude  3.8442520
## Foreign_born                           Foreign_born  2.8028885
## Middle_class                           Middle_class  2.7339875
## Progressivity                         Progressivity  2.6475539
## Graduation                               Graduation  2.4408142
## Share01                                     Share01  2.4262649
## Tuition                                     Tuition  2.3563082
## Longitude                                 Longitude  2.0708840
## Student_teacher_ratio         Student_teacher_ratio  2.0645286
## School_spending                     School_spending  1.9700753
## Migration_out                         Migration_out  1.8248691
## Married                                     Married  1.8081761
## HS_dropout                               HS_dropout  1.6722032
## Colleges                                   Colleges  1.5849022
## Local_gov_spending               Local_gov_spending  1.2733602
## Migration_in                           Migration_in  1.2450674
## Social_capital                       Social_capital  1.2442027
## Seg_racial                               Seg_racial  0.9630583
## Teenage_labor                         Teenage_labor  0.7700715
## Labor_force_participation Labor_force_participation  0.6406051
## Income                                       Income  0.6310008
## Divorced                                   Divorced  0.5880420
## Seg_poverty                             Seg_poverty  0.5662602
## Violent_crime                         Violent_crime  0.5654377
## Population                               Population  0.4550882
## EITC                                           EITC  0.4385737
## Seg_affluence                         Seg_affluence  0.3075372
## Seg_income                               Seg_income  0.2414435
## Gini                                           Gini  0.1563206
## Urban                                         Urban  0.0000000
```

---

## Forward stagewise additive modeling

Generic for regression or classification, any weak learner `\(G(x,\ \theta)\)`

.emphasis[
__Algorithm:__

1. Set initial predictor `\(f_0(x)=0\)`
2. Until we quit ( `\(m<M\)` iterations )

    a. Compute
    `$$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i,\ \theta)\right)$$`

    b. Set `\(f_m(x) = f_{m-1}(x) + \beta_m G(x,\ \theta_m)\)`

3. Final classifier is `\(G(x, \theta_M) = \textrm{sign}\left( f_M(x) \right)\)`
]

Here, `\(L\)` is a loss function that measures prediction accuracy.

If `\(L(y,\ f(x))= \exp(-y f(x))\)`, `\(G\)` is a classifier, and `\(y \in \{-1, 1\}\)`, then this is equivalent to AdaBoost. Proven 5 years later (Friedman, Hastie, and Tibshirani 2000).

---

## So what?

It turns out that "exponential loss" `\(L(y,\ f(x))= \exp(-y f(x))\)` is not very robust.

Here are some other loss functions for 2-class classification

<img src="rmd_gfx/20-boosting/loss-funs-1.svg" style="display: block; margin: auto;" />

--

We want losses that penalize negative margins, but not positive margins.

Robust means .hand[don't over-penalize large negatives]

---

## Gradient boosting

In the forward stagewise algorithm, we solved a minimization and then made an update `\(f_m(x) = f_{m-1}(x) + \beta_m G(x, \theta_m)\)`.

For most loss functions, `\(\arg\min_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i, \theta)\right)\)` cannot be solved in closed form.

Instead, if we take one gradient step toward the minimum, we get

`\(f_m(x) = f_{m-1}(x) -\gamma_m \nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\gamma_m \left(-\nabla L(y,f_{m-1}(x))\right)\)`

This is called __Gradient boosting__.

Notice how similar the update steps look.

Gradient boosting goes only part of the way toward the minimum at each `\(m\)`. This has two implications:

1. Since we're not fitting `\(\beta, \theta\)` to the data as "hard", the learner is weaker.
2. This procedure is computationally much simpler.

---

## Gradient boosting

.emphasis[
__Algorithm:__

1. Set initial predictor `\(f_0(x)=\overline{\y}\)`
2. Until we quit ( `\(m<M\)` iterations )

    a. Compute pseudo-residuals (what is the gradient of `\(L(y,f)=(y-f(x))^2\)`?)
    `$$r_i = -\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\bigg|_{f(x_i)=f_{m-1}(x_i)}$$`

    b. Estimate weak learner, `\(G(x, \theta_m)\)`, with the training set `\(\{r_i, x_i\}\)`.

    c. Find the step size `\(\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, f_{m-1}(x_i) + \gamma G(x_i, \theta_m))\)`

    d. Set `\(f_m(x) = f_{m-1}(x) + \gamma_m G(x, \theta_m)\)`

3. Final predictor is `\(f_M(x)\)`.
]

```r
grad_boost <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = "bernoulli")
```

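---

## Gradient boosting by hand (a sketch)

A minimal sketch of the loop above for squared-error loss, where the pseudo-residuals are just `\(r_i = y_i - f_{m-1}(x_i)\)`. The names (`gb_sketch`, `dat`, `y`, `M`, `shrink`) are hypothetical, and this is not what `gbm()` does internally.

```r
library(rpart)

# Hypothetical sketch of gradient boosting with squared-error loss
gb_sketch <- function(dat, y, M = 100, shrink = 0.1, depth = 2) {
  f <- rep(mean(y), length(y))            # 1. start from the mean
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    r <- y - f                            # a. pseudo-residuals
    # b. fit a small tree to the residuals
    trees[[m]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = depth))
    # c./d. damped update; `shrink` is the alpha on the next slide
    f <- f + shrink * predict(trees[[m]], dat)
  }
  list(f0 = mean(y), trees = trees, shrink = shrink)
}

# Final predictor f_M(x): add up the shrunken trees
predict_gb <- function(fit, newdata) {
  f <- fit$f0
  for (tree in fit$trees) f <- f + fit$shrink * predict(tree, newdata)
  f
}
```

For squared error, the line search in step (c) gives `\(\gamma_m = 1\)` (the tree's leaf means already minimize it), so the sketch only applies a shrinkage factor.
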
---

## Gradient boosting modifications

* Typically done with "small" trees, not stumps, because of the gradient. You can specify the size. Usually 4-8 terminal nodes are recommended (more gives more interactions between predictors)

* Usually modify the gradient step to `\(f_m(x) = f_{m-1}(x) + \gamma_m \alpha G(x,\theta_m)\)` with `\(0<\alpha<1\)`. Helps to keep from fitting too hard.

* Often combined with Bagging so that each step is fit using a bootstrap resample of the data. Gives us out-of-bag options.

* There are many other extensions, notably XGBoost.

<img src="rmd_gfx/20-boosting/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

## Major takeaways

* Two flavours of Boosting: (1) AdaBoost (the original) and (2) gradient boosting (easier and more computationally friendly)

* The connection is "Forward stagewise additive modelling" (AdaBoost is a special case)

* But that special case "isn't robust because it uses exponential loss" (squared error is even worse)

* Gradient boosting is a computationally easier version of FSAM

* All use **weak learners** (compare to Bagging)

* Think about the Bias-Variance implications

---
class: middle, inverse, center

# Next time...

Neural networks and deep learning, the beginning