19 Bagging and random forests
Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 11 October 2023
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\]
Bagging
Many methods (trees, nonparametric smoothers) tend to have low bias but high variance.
Especially fully grown trees (that's why we prune them).
High variance: if we split the training data into two parts at random and fit a decision tree to each part, the results will be quite different.
In contrast, a low-variance estimator would yield similar results if applied to the two parts (consider \(\widehat{f} = 0\)).
Bagging, short for bootstrap aggregation, is a general-purpose procedure for reducing variance.
We'll use it specifically in the context of trees, but it can be applied much more broadly.
Bagging: The heuristic motivation
Suppose we have \(n\) uncorrelated observations \(Z_1, \ldots, Z_n\), each with variance \(\sigma^2\).
What is the variance of
\[\overline{Z} = \frac{1}{n} \sum_{i=1}^n Z_i\ \ \ ?\]
(The answer is worked out below.)
Suppose we had \(B\) separate (uncorrelated) training sets, \(1, \ldots, B\).
We can form \(B\) separate model fits, \(\widehat{f}^1(x), \ldots, \widehat{f}^B(x)\), and then average them:
\[\widehat{f}_{B}(x) = \frac{1}{B} \sum_{b=1}^B \widehat{f}^b(x)\]
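Answering the question above, using only the assumptions already stated: since the \(Z_i\) are uncorrelated with common variance \(\sigma^2\),
\[\Var{\overline{Z}} = \frac{1}{n^2} \sum_{i=1}^n \Var{Z_i} = \frac{\sigma^2}{n}.\]
By the same reasoning, if the \(B\) fits were uncorrelated and had a common variance, averaging them would cut that variance by a factor of \(B\):
\[\Var{\widehat{f}_{B}(x)} = \frac{1}{B} \Var{\widehat{f}^1(x)}.\]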
Bagging: The bootstrap part
This isn't practical: we don't have many training sets.
We therefore turn to the bootstrap to simulate having many training sets.
Suppose we have data \(Z_1, \ldots, Z_n\).
Choose some large number of samples, \(B\).
For each \(b = 1,\ldots,B\), resample with replacement from \(Z_1, \ldots, Z_n\); call the result \(\widetilde{Z}_1, \ldots, \widetilde{Z}_n\).
Compute \(\widehat{f}^b = \widehat{f}(\widetilde{Z}_1, \ldots, \widetilde{Z}_n)\).
\[\widehat{f}_{\textrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \widehat{f}^b(x)\]
This process is known as bagging.
Bagging trees
The procedure for trees is the following (a minimal R sketch follows the list):
Choose a large number \(B\).
For each \(b = 1,\ldots, B\), grow an unpruned tree on the \(b^{th}\) bootstrap draw from the data.
Average all these trees together.
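Concretely, here is a minimal R sketch of this procedure for a regression tree. This is not the code used in these slides: it assumes the rpart package, a placeholder data frame df with a numeric response column y, and hypothetical function names.
library(rpart)
# Fit B unpruned trees, each on a bootstrap draw from the data
bag_trees <- function(df, B = 100) {
  n <- nrow(df)
  lapply(seq_len(B), function(b) {
    boot_idx <- sample.int(n, n, replace = TRUE) # bootstrap draw from the rows of df
    rpart(
      y ~ ., data = df[boot_idx, ],
      control = rpart.control(cp = 0, minsplit = 2) # grow an (essentially) unpruned tree
    )
  })
}
# Average the B trees' predictions at new points
predict_bagged <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))
}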
Bagging trees
Each tree, since it is unpruned, will have high variance and low bias.
Therefore averaging many trees results in an estimator that has lower variance while keeping the bias low.
Bagging trees: Variable importance measures
Bagging can dramatically improve predictive performance of trees
But we sacrificed some interpretability.
We no longer have that nice diagram that shows the segmentation of the predictor space
(more accurately, we have \(B\) of them).
To recover some information, we can do the following:
For each of the \(B\) trees and each of the \(p\) variables, we record the amount by which the Gini index is reduced by splits on that variable.
Report the average reduction over all \(B\) trees.
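In R, the randomForest package reports exactly this measure for a fitted forest. A small usage sketch, where rf_fit is a placeholder for any fitted randomForest object (such as the forests fit in the example below):
library(randomForest)
importance(rf_fit) # mean decrease in the Gini index for each predictor (classification forests)
varImpPlot(rf_fit) # dot chart of the same importance measures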
Random Forest
Random Forest is an extension of Bagging, in which the bootstrap trees are decorrelated.
Remember: \(\Var{\overline{Z}} = \frac{1}{n}\Var{Z_1}\) unless the \(Z_i\)'s are correlated.
So Bagging may not reduce the variance that much because the training sets are correlated across trees.
How do we decorrelate?
Draw a bootstrap sample and start to build a tree.
But before we split, we randomly pick \(m\) of the possible \(p\) predictors as candidates for the split.
Decorrelating
A new sample of size \(m\) of the predictors is taken at each split.
Usually, we use about \(m = \sqrt{p}\).
In other words, at each split, we aren’t even allowed to consider the majority of possible predictors!
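With the randomForest package, \(m\) is controlled by the mtry argument; its default is roughly \(\sqrt{p}\) for classification (and \(p/3\) for regression). A hedged sketch, where df is a placeholder data frame with response column y:
library(randomForest)
p <- ncol(df) - 1 # number of predictors (df and y are placeholders)
forest <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p))) # m = sqrt(p), the classification default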
What is going on here?
Suppose there is 1 really strong predictor and many mediocre ones.
Then each tree will have this one predictor in it, and therefore each tree will look very similar (i.e., highly correlated).
Averaging highly correlated things leads to much less variance reduction than if they were uncorrelated (quantified below).
If we don't allow some trees/splits to use this important variable, each of the trees will be much less similar and hence much less correlated.
Bagging Trees is Random Forest when \(m = p\), that is, when we can consider all the variables at each split.
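The effect of correlation can be quantified with a standard identity (not derived in these slides): if each \(\widehat{f}^b(x)\) has variance \(\sigma^2\) and the \(B\) fits have pairwise correlation \(\rho\), then
\[\Var{\widehat{f}_{\textrm{bag}}(x)} = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.\]
As \(B \to \infty\), the second term vanishes but the first does not, so reducing \(\rho\) (decorrelating the trees) is what buys additional variance reduction.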
Example with Mobility data
library(randomForest)
library(kableExtra)
library(tidyverse) # assumed loaded in the course setup; provides mutate(), select(), drop_na(), tibble()
set.seed(406406)
mob <- Stat406::mobility |>
  mutate(mobile = as.factor(Mobility > .1)) |> # binary response: Mobility > 0.1
  select(-ID, -Name, -Mobility, -State) |>
  drop_na()
n <- nrow(mob)
trainidx <- sample.int(n, floor(n * .75)) # 75% / 25% train / test split
testidx <- setdiff(1:n, trainidx)
train <- mob[trainidx, ]
test <- mob[testidx, ]
rf <- randomForest(mobile ~ ., data = train) # random forest: default mtry is about sqrt(p)
bag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1) # bagging: all p predictors at each split
preds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))
kbl(cbind(table(preds$truth, preds$rf), table(preds$truth, preds$bag))) |>
  add_header_above(c("Truth" = 1, "RF" = 2, "Bagging" = 2))
          RF             Bagging
Truth     FALSE   TRUE   FALSE   TRUE
FALSE        61     10      60     11
TRUE         12     22      10     24
Example with Mobility data
varImpPlot(rf, pch = 16, col = orange) # `orange` is a colour object assumed to be defined in the course setup
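The resulting dot chart ranks the predictors by their mean decrease in the Gini index.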
One last thing…
On average, drawing \(n\) samples from \(n\) observations with replacement (bootstrapping) results in ~2/3 of the observations being selected. (Can you show this? A sketch is given below.)
The remaining ~1/3 of the observations are not used on that tree.
These are referred to as out-of-bag (OOB).
We can think of it as a for-free cross-validation.
Each time a tree is grown, we get its prediction error on the unused observations.
We average this over all bootstrap samples.
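A sketch of the calculation hinted at above (a standard bootstrap fact): the probability that a particular observation is never selected in \(n\) draws with replacement is
\[\left(1 - \frac{1}{n}\right)^n \longrightarrow e^{-1} \approx 0.37 \quad \text{as } n \to \infty,\]
so roughly \(1 - e^{-1} \approx 0.63\) (about 2/3) of the observations appear in each bootstrap sample, and about 1/3 do not.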
Out-of-bag error estimation for bagging / RF
For randomForest(), calling predict() without passing newdata = gives the OOB predictions,
unlike lm(), where it gives the fitted values.
tab <- table(predict(bag), train$mobile)
kbl(tab) |> add_header_above(c("Truth" = 1, "Bagging" = 2))
Truth     Bagging
          FALSE   TRUE
FALSE       182     28
TRUE         21     82
1 - sum(diag(tab)) / sum(tab) ## OOB misclassification error, no need for CV
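For the OOB confusion matrix shown above, this evaluates to the off-diagonal counts over the total: \((28 + 21) / 313 \approx 0.16\).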