19 Bagging and random forests
Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 11 October 2023
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\]
Bagging
Many methods (trees, nonparametric smoothers) tend to have low bias but high variance.
Especially fully grown trees (that's why we prune them).
High variance: if we split the training data into two parts at random and fit a decision tree to each part, the results will be quite different.
In contrast, a low-variance estimator would yield similar results if applied to the two parts (consider \(\widehat{f} = 0\)).
Bagging, short for bootstrap aggregation, is a general-purpose procedure for reducing variance.
We'll use it specifically in the context of trees, but it can be applied much more broadly.
Bagging: The heuristic motivation
Suppose we have \(n\) uncorrelated observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\).
What is the variance of
\[\overline{Z} = \frac{1}{n} \sum_{i=1}^n Z_i\ \ \ ?\]
(The answer is worked out below.)
Suppose we had \(B\) separate (uncorrelated) training sets, \(1, \ldots, B\).
We can form \(B\) separate model fits, \(\widehat{f}^1(x), \ldots, \widehat{f}^B(x)\), and then average them:
\[\widehat{f}_{B}(x) = \frac{1}{B} \sum_{b=1}^B \widehat{f}^b(x)\]
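Answering the question above, using only the assumptions already stated: since the \(Z_i\) are uncorrelated with common variance \(\sigma^2\),
\[\Var{\overline{Z}} = \frac{1}{n^2} \sum_{i=1}^n \Var{Z_i} = \frac{\sigma^2}{n}.\]
By the same reasoning, if the \(B\) fits were uncorrelated and had a common variance, averaging them would cut that variance by a factor of \(B\):
\[\Var{\widehat{f}_{B}(x)} = \frac{1}{B} \Var{\widehat{f}^1(x)}.\]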
Bagging: The bootstrap part
This isn't practical: we don't have many training sets.
We therefore turn to the bootstrap to simulate having many training sets.
Suppose we have data \(Z_1, \ldots, Z_n\).
Choose some large number of samples, \(B\).
For each \(b = 1,\ldots,B\), resample with replacement from \(Z_1, \ldots, Z_n\); call the result \(\widetilde{Z}_1, \ldots, \widetilde{Z}_n\).
Compute \(\widehat{f}^b = \widehat{f}(\widetilde{Z}_1, \ldots, \widetilde{Z}_n)\).
\[\widehat{f}_{\textrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \widehat{f}^b(x)\]
This process is known as bagging.
Bagging trees
The procedure for trees is the following (a minimal R sketch follows the list):
Choose a large number \(B\).
For each \(b = 1,\ldots, B\), grow an unpruned tree on the \(b^{th}\) bootstrap draw from the data.
Average all these trees together.
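Concretely, here is a minimal R sketch of this procedure for a regression tree. This is not the code used in these slides: it assumes the rpart package, a placeholder data frame df with a numeric response column y, and hypothetical function names.
library(rpart)
# Fit B unpruned trees, each on a bootstrap draw from the data
bag_trees <- function(df, B = 100) {
  n <- nrow(df)
  lapply(seq_len(B), function(b) {
    boot_idx <- sample.int(n, n, replace = TRUE) # bootstrap draw from the rows of df
    rpart(
      y ~ ., data = df[boot_idx, ],
      control = rpart.control(cp = 0, minsplit = 2) # grow an (essentially) unpruned tree
    )
  })
}
# Average the B trees' predictions at new points
predict_bagged <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}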
Bagging trees
Each tree, since it is unpruned, will have high variance and low bias.
Therefore averaging many trees results in an estimator that has lower variance while keeping the bias low.
Bagging trees: Variable importance measures
Bagging can dramatically improve predictive performance of trees
But we sacrificed some interpretability.
We no longer have that nice diagram that shows the segmentation of the predictor space
(more accurately, we have \(B\) of them).
To recover some information, we can do the following:
For each of the \(B\) trees and each of the \(p\) variables, we record the amount by which the Gini index is reduced by splits on that variable.
Report the average reduction over all \(B\) trees.
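In R, the randomForest package reports exactly this measure for a fitted forest. A small usage sketch, where rf_fit is a placeholder for any fitted randomForest object (such as the forests fit in the example below):
library(randomForest)
importance(rf_fit) # mean decrease in the Gini index for each predictor (classification forests)
varImpPlot(rf_fit) # dot chart of the same importance measures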
Random Forest
Random Forest is an extension of Bagging, in which the bootstrap trees are decorrelated.
Remember: \(\Var{\overline{Z}} = \frac{1}{n}\Var{Z_1}\) unless the \(Z_i\)'s are correlated.
So Bagging may not reduce the variance that much because the training sets are correlated across trees.
How do we decorrelate?
Draw a bootstrap sample and start to build a tree.
But before we split, we randomly pick \(m\) of the possible \(p\) predictors as candidates for the split.
Decorrelating
A new sample of size \(m\) of the predictors is taken at each split.
Usually, we use about \(m = \sqrt{p}\).
In other words, at each split, we aren’t even allowed to consider the majority of possible predictors!
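With the randomForest package, \(m\) is controlled by the mtry argument; its default is roughly \(\sqrt{p}\) for classification (and \(p/3\) for regression). A hedged sketch, where df is a placeholder data frame with response column y:
library(randomForest)
p <- ncol(df) - 1 # number of predictors (df and y are placeholders)
forest <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p))) # m = sqrt(p), the classification default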
What is going on here?
Suppose there is 1 really strong predictor and many mediocre ones.
Then each tree will have this one predictor in it, and therefore each tree will look very similar (i.e., highly correlated).
Averaging highly correlated things leads to much less variance reduction than if they were uncorrelated (quantified below).
If we don't allow some trees/splits to use this important variable, each of the trees will be much less similar and hence much less correlated.
Bagging Trees is Random Forest when \(m = p\), that is, when we can consider all the variables at each split.
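The effect of correlation can be quantified with a standard identity (not derived in these slides): if each \(\widehat{f}^b(x)\) has variance \(\sigma^2\) and the \(B\) fits have pairwise correlation \(\rho\), then
\[\Var{\widehat{f}_{\textrm{bag}}(x)} = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.\]
As \(B \to \infty\), the second term vanishes but the first does not, so reducing \(\rho\) (decorrelating the trees) is what buys additional variance reduction.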
Example with Mobility data
library(randomForest)
library(kableExtra)
library(tidyverse) # assumed loaded in the course setup; provides mutate(), select(), drop_na(), tibble()
set.seed(406406)
mob <- Stat406::mobility |>
  mutate(mobile = as.factor(Mobility > .1)) |> # binary response: Mobility > 0.1
  select(-ID, -Name, -Mobility, -State) |>
  drop_na()
n <- nrow(mob)
trainidx <- sample.int(n, floor(n * .75)) # 75% / 25% train / test split
testidx <- setdiff(1:n, trainidx)
train <- mob[trainidx, ]
test <- mob[testidx, ]
rf <- randomForest(mobile ~ ., data = train) # random forest: default mtry is about sqrt(p)
bag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1) # bagging: all p predictors at each split
preds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))
kbl(cbind(table(preds$truth, preds$rf), table(preds$truth, preds$bag))) |>
  add_header_above(c("Truth" = 1, "RF" = 2, "Bagging" = 2))
          RF             Bagging
Truth     FALSE   TRUE   FALSE   TRUE
FALSE        61     10      60     11
TRUE         12     22      10     24
Example with Mobility data
varImpPlot(rf, pch = 16, col = orange) # `orange` is a colour object assumed to be defined in the course setup
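The resulting dot chart ranks the predictors by their mean decrease in the Gini index.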
One last thing…
On average, drawing \(n\) samples from \(n\) observations with replacement (bootstrapping) results in ~2/3 of the observations being selected. (Can you show this? A sketch is given below.)
The remaining ~1/3 of the observations are not used on that tree.
These are referred to as out-of-bag (OOB).
We can think of it as a for-free cross-validation.
Each time a tree is grown, we get its prediction error on the unused observations.
We average this over all bootstrap samples.
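A sketch of the calculation hinted at above (a standard bootstrap fact): the probability that a particular observation is never selected in \(n\) draws with replacement is
\[\left(1 - \frac{1}{n}\right)^n \longrightarrow e^{-1} \approx 0.37 \quad \text{as } n \to \infty,\]
so roughly \(1 - e^{-1} \approx 0.63\) (about 2/3) of the observations appear in each bootstrap sample, and about 1/3 do not.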
Out-of-bag error estimation for bagging / RF
For randomForest(), calling predict() without passing newdata = gives the OOB predictions,
unlike lm(), where it gives the fitted values.
tab <- table(predict(bag), train$mobile)
kbl(tab) |> add_header_above(c("Truth" = 1, "Bagging" = 2))
Truth     Bagging
          FALSE   TRUE
FALSE       182     28
TRUE         21     82
1 - sum(diag(tab)) / sum(tab) ## OOB misclassification error, no need for CV
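For the OOB confusion matrix shown above, this evaluates to the off-diagonal counts over the total: \((28 + 21) / 313 \approx 0.16\).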