class: center, middle, inverse, title-slide

.title[
# 20 Boosting
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-11-02
]

---

## Last time

We learned about bagging, for averaging .secondary[low-bias] / .primary[high-variance] estimators.

Today, we examine its opposite: __Boosting__.

__Boosting__ also combines estimators, but it combines __high-bias__ / low-variance estimators.

Boosting has a number of flavours. And if you Google descriptions, most are wrong.

For a deep (and accurate) treatment, see [ESL] Chapter 10

--

We'll discuss 2 flavours: AdaBoost and Gradient Boosting

Neither requires a tree, but that's the typical usage.

Boosting needs a "weak learner", so small trees (called stumps) are natural.

`$$\newcommand{\Expect}[1]{E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\argmin}{\arg\min} \newcommand{\argmax}{\arg\max} \newcommand{\R}{\mathbb{R}} \newcommand{\P}{\mathbb{P}} \renewcommand{\hat}{\widehat} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}}$$`

---

## AdaBoost intuition

At each iteration, we weight the __observations__.

Observations that are currently misclassified get __higher__ weights.

So on the next iteration, we'll try harder to correctly classify our mistakes.

The number of iterations must be chosen.

---

## AdaBoost (Freund and Schapire)

Let `\(G(x, \theta)\)` be a weak learner (say a tree with one split)

.emphasis[
__Algorithm (AdaBoost):__

1. Set observation weights `\(w_i=1/n\)`.
2. Until we quit ( `\(m<M\)` iterations )

    a. Estimate the classifier `\(G(x,\theta_m)\)` using weights `\(w_i\)`

    b. Calculate its weighted error `\(\textrm{err}_m = \sum_{i=1}^n w_i I(y_i \neq G(x_i, \theta_m)) / \sum w_i\)`

    c. Set `\(\alpha_m = \log((1-\textrm{err}_m)/\textrm{err}_m)\)`

    d. Update `\(w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G(x_i,\theta_m)))\)`

3. Final classifier is `\(G(x) = \textrm{sign}\left( \sum_{m=1}^M \alpha_m G(x, \theta_m)\right)\)`
]

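---

## AdaBoost by hand (a sketch)

A minimal sketch of the algorithm above, using depth-1 `rpart` trees (stumps) as the weak learner. The data frame `dat`, the `\(\pm 1\)` response `y`, and the function names are hypothetical; this is for illustration only, not the `gbm` implementation used on the next slide.

```r
library(rpart)

# Hypothetical sketch of the AdaBoost loop with rpart stumps
adaboost_sketch <- function(dat, y, M = 100) {
  n <- nrow(dat)
  w <- rep(1 / n, n)                      # 1. equal observation weights
  trees <- vector("list", M)
  alpha <- numeric(M)
  for (m in seq_len(M)) {
    # a. weak learner: one split, fit with the current weights
    trees[[m]] <- rpart(factor(y) ~ ., data = dat, weights = w,
                        control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(trees[[m]], dat, type = "class") == "1", 1, -1)
    miss <- as.numeric(pred != y)
    err <- sum(w * miss) / sum(w)         # b. weighted error
    alpha[m] <- log((1 - err) / err)      # c. classifier weight
    w <- w * exp(alpha[m] * miss)         # d. upweight the mistakes
  }
  list(trees = trees, alpha = alpha)
}

# Final classifier: sign of the alpha-weighted vote
predict_adaboost <- function(fit, newdata) {
  score <- 0
  for (m in seq_along(fit$trees)) {
    gm <- ifelse(predict(fit$trees[[m]], newdata, type = "class") == "1", 1, -1)
    score <- score + fit$alpha[m] * gm
  }
  sign(score)
}
```

In practice you would use `gbm` or `xgboost` rather than rolling your own, but the loop makes the reweighting explicit.
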
---

## Using mobility data again

```r
library(gbm)
train_boost <- train %>%
  mutate(mobile = as.integer(mobile) - 1) # needs {0, 1} responses
test_boost <- test %>%
  mutate(mobile = as.integer(mobile) - 1)
adab <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = "adaboost")
preds$adab <- as.numeric(predict(adab, test_boost) > 0)
par(mar = c(5, 15, 0, 1))
summary(adab, las = 1)
```

<img src="rmd_gfx/20-boosting/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

```
##                                                 var    rel.inf
## Single_mothers                       Single_mothers 14.0404924
## Local_tax_rate                       Local_tax_rate  6.7989919
## Religious                                 Religious  6.6286695
## Commute                                     Commute  6.4179702
## Test_scores                             Test_scores  5.9974412
## Black                                         Black  4.8716771
## Manufacturing                         Manufacturing  4.7307967
## Gini_99                                     Gini_99  4.2246181
## Chinese_imports                     Chinese_imports  3.9555660
## Latitude                                   Latitude  3.8442520
## Foreign_born                           Foreign_born  2.8028885
## Middle_class                           Middle_class  2.7339875
## Progressivity                         Progressivity  2.6475539
## Graduation                               Graduation  2.4408142
## Share01                                     Share01  2.4262649
## Tuition                                     Tuition  2.3563082
## Longitude                                 Longitude  2.0708840
## Student_teacher_ratio         Student_teacher_ratio  2.0645286
## School_spending                     School_spending  1.9700753
## Migration_out                         Migration_out  1.8248691
## Married                                     Married  1.8081761
## HS_dropout                               HS_dropout  1.6722032
## Colleges                                   Colleges  1.5849022
## Local_gov_spending               Local_gov_spending  1.2733602
## Migration_in                           Migration_in  1.2450674
## Social_capital                       Social_capital  1.2442027
## Seg_racial                               Seg_racial  0.9630583
## Teenage_labor                         Teenage_labor  0.7700715
## Labor_force_participation Labor_force_participation  0.6406051
## Income                                       Income  0.6310008
## Divorced                                   Divorced  0.5880420
## Seg_poverty                             Seg_poverty  0.5662602
## Violent_crime                         Violent_crime  0.5654377
## Population                               Population  0.4550882
## EITC                                           EITC  0.4385737
## Seg_affluence                         Seg_affluence  0.3075372
## Seg_income                               Seg_income  0.2414435
## Gini                                           Gini  0.1563206
## Urban                                         Urban  0.0000000
```

---

## Forward stagewise additive modeling

Generic for regression or classification, any weak learner `\(G(x,\ \theta)\)`

.emphasis[
__Algorithm:__

1. Set initial predictor `\(f_0(x)=0\)`
2. Until we quit ( `\(m<M\)` iterations )

    a. Compute
    `$$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i,\ \theta)\right)$$`

    b. Set `\(f_m(x) = f_{m-1}(x) + \beta_m G(x,\ \theta_m)\)`

3. Final classifier is `\(G(x, \theta_M) = \textrm{sign}\left( f_M(x) \right)\)`
]

Here, `\(L\)` is a loss function that measures prediction accuracy.

If `\(L(y,\ f(x))= \exp(-y f(x))\)`, `\(G\)` is a classifier, and `\(y \in \{-1, 1\}\)`, then this is equivalent to AdaBoost. Proven 5 years later (Friedman, Hastie, and Tibshirani 2000).

---

## So what?

It turns out that "exponential loss" `\(L(y,\ f(x))= \exp(-y f(x))\)` is not very robust.

Here are some other loss functions for 2-class classification

<img src="rmd_gfx/20-boosting/loss-funs-1.svg" style="display: block; margin: auto;" />

--

We want losses that penalize negative margins, but not positive margins.

Robust means .hand[don't over-penalize large negatives]

---

## Gradient boosting

In the forward stagewise algorithm, we solved a minimization and then made an update `\(f_m(x) = f_{m-1}(x) + \beta_m G(x, \theta_m)\)`.

For most loss functions, `\(\arg\min_{\beta, \theta} \sum_{i=1}^n L\left(y_i,\ f_{m-1}(x_i) + \beta G(x_i, \theta)\right)\)` cannot be solved in closed form.

Instead, if we take one gradient step toward the minimum, we get

`\(f_m(x) = f_{m-1}(x) -\gamma_m \nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\gamma_m \left(-\nabla L(y,f_{m-1}(x))\right)\)`

This is called __Gradient boosting__.

Notice how similar the update steps look.

Gradient boosting goes only part of the way toward the minimum at each `\(m\)`. This has two implications:

1. Since we're not fitting `\(\beta, \theta\)` to the data as "hard", the learner is weaker.
2. This procedure is computationally much simpler.

---

## Gradient boosting

.emphasis[
__Algorithm:__

1. Set initial predictor `\(f_0(x)=\overline{\y}\)`
2. Until we quit ( `\(m<M\)` iterations )

    a. Compute pseudo-residuals (what is the gradient of `\(L(y,f)=(y-f(x))^2\)`?)
    `$$r_i = -\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\bigg|_{f(x_i)=f_{m-1}(x_i)}$$`

    b. Estimate weak learner, `\(G(x, \theta_m)\)`, with the training set `\(\{r_i, x_i\}\)`.

    c. Find the step size `\(\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, f_{m-1}(x_i) + \gamma G(x_i, \theta_m))\)`

    d. Set `\(f_m(x) = f_{m-1}(x) + \gamma_m G(x, \theta_m)\)`

3. Final predictor is `\(f_M(x)\)`.
]

```r
grad_boost <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = "bernoulli")
```

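---

## Gradient boosting by hand (a sketch)

A minimal sketch of the loop above for squared-error loss, where the pseudo-residuals are just `\(r_i = y_i - f_{m-1}(x_i)\)`. The names (`gb_sketch`, `dat`, `y`, `M`, `shrink`) are hypothetical, and this is not what `gbm()` does internally.

```r
library(rpart)

# Hypothetical sketch of gradient boosting with squared-error loss
gb_sketch <- function(dat, y, M = 100, shrink = 0.1, depth = 2) {
  f <- rep(mean(y), length(y))            # 1. start from the mean
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    r <- y - f                            # a. pseudo-residuals
    # b. fit a small tree to the residuals
    trees[[m]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = depth))
    # c./d. damped update; `shrink` is the alpha on the next slide
    f <- f + shrink * predict(trees[[m]], dat)
  }
  list(f0 = mean(y), trees = trees, shrink = shrink)
}

# Final predictor f_M(x): add up the shrunken trees
predict_gb <- function(fit, newdata) {
  f <- fit$f0
  for (tree in fit$trees) f <- f + fit$shrink * predict(tree, newdata)
  f
}
```

For squared error, the line search in step (c) gives `\(\gamma_m = 1\)` (the tree's leaf means already minimize it), so the sketch only applies a shrinkage factor.
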
---

## Gradient boosting modifications

* Typically done with "small" trees, not stumps, because of the gradient. You can specify the size. Usually 4-8 terminal nodes are recommended (more gives more interactions between predictors)

* Usually modify the gradient step to `\(f_m(x) = f_{m-1}(x) + \gamma_m \alpha G(x,\theta_m)\)` with `\(0<\alpha<1\)`. Helps to keep from fitting too hard.

* Often combined with Bagging so that each step is fit using a bootstrap resample of the data. Gives us out-of-bag options.

* There are many other extensions, notably XGBoost.

<img src="rmd_gfx/20-boosting/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

## Major takeaways

* Two flavours of Boosting: (1) AdaBoost (the original) and (2) gradient boosting (easier and more computationally friendly)

* The connection is "Forward stagewise additive modelling" (AdaBoost is a special case)

* But that special case "isn't robust because it uses exponential loss" (squared error is even worse)

* Gradient boosting is a computationally easier version of FSAM

* All use **weak learners** (compare to Bagging)

* Think about the Bias-Variance implications

---
class: middle, inverse, center

# Next time...

Neural networks and deep learning, the beginning