14 Bagging for regression
One strategy to obtain more stable predictors is called Bootstrap AGGregatING (bagging). It can be applied to many predictors (not only trees), and it generally results in larger improvements in prediction quality when it is used with predictors that are flexible (low bias) but highly variable.
The justification and motivation were discussed in class. Intuitively, by averaging the predictions obtained over bootstrap samples we compute an estimate of the “average prediction” we would have obtained had we had access to several (many?) independent training sets (samples).
There are several (many?) R packages implementing bagging for different predictors, with varying degrees of flexibility (in their implementations) and user-friendliness. However, for pedagogical and illustrative purposes, in these notes I will bag by hand.
14.0.1 Bagging by hand
Again, to simplify the discussion and presentation, in order to evaluate prediction quality I will split the data (Boston) into a training and a test set. We do this now:
library(rpart)
data(Boston, package = "MASS")
set.seed(123456)
n <- nrow(Boston)
ii <- sample(n, floor(n / 4))
dat.te <- Boston[ii, ]
dat.tr <- Boston[-ii, ]
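In symbols (this compact formula is just our summary of the procedure described above, not part of the original code): if \(\hat{f}^{*1}, \dots, \hat{f}^{*N}\) denote the predictors trained on \(N\) bootstrap samples drawn from the training set, then the bagged prediction at a point \(x\) is simply their average,
\[ \hat{f}_{\text{bag}}(x) = \frac{1}{N} \sum_{j=1}^{N} \hat{f}^{*j}(x) . \]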
I will now train \(N = 5\) trees and average their predictions. Note that, in order to illustrate the process more clearly, I will compute and store the \(n_e \times N\) predictions in a two-dimensional array (aka a matrix), where \(n_e\) denotes the number of observations in the test set. This is not the best (i.e. most efficient) way of implementing bagging, but the main purpose here is to understand exactly what we are doing. Also note that an alternative (better in terms of reusability of the ensemble, but maybe still not the most efficient option) is to store the \(N\) trees directly. This also allows for more elegant and easy-to-read code, and it is illustrated below, but first we will use the former (and hopefully clearer) strategy.
First create an array where we will store all the predictions:
N <- 5
myps <- array(NA, dim = c(nrow(dat.te), N))
con <- rpart.control(minsplit = 3, cp = 1e-3, xval = 1)
The last object (con) contains my options to train large (potentially overfitting) trees. As discussed in class, we now generate N bootstrap samples by generating vectors of randomly sampled indices (with replacement); the relevant lines of code are:
ii <- sample(n.tr, replace = TRUE)
tmp <- rpart(..., data = dat.tr[ii, ], ...)
where we train the trees on the data set dat.tr[ii, ], which is the bootstrap sample. Then, for each of these trees we compute the corresponding (vector of) predictions on the test set (predict(tmp, newdata = dat.te, type = 'vector')) and store them. Putting all the pieces together we get:
n.tr <- nrow(dat.tr)
set.seed(123)
for (j in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  tmp <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
  myps[, j] <- predict(tmp, newdata = dat.te, type = "vector")
}
The bagged predictions are the average of the predictions obtained with each tree in the ensemble. In other words, for each point in dat.te we need to compute the average of the predictions obtained with the N different trees. Because of the way I stored the results in the matrix myps, the bagged prediction of each point in dat.te is the average of the corresponding row of myps. We can compute all these (\(n_e\)) averages at once using rowMeans() as follows:
pr.bagg <- rowMeans(myps)
Finally, the estimated MSPE of the bagged ensemble of trees obtained with our specific test set is:
with(dat.te, mean((medv - pr.bagg)^2))
#> [1] 14.54751
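In the notation used earlier (with \(n_e\) test points), the quantity computed in the last line above is
\[ \widehat{\text{MSPE}} = \frac{1}{n_e} \sum_{i \in \text{test set}} \left( y_i - \hat{f}_{\text{bag}}(x_i) \right)^2 , \]
where \(y_i\) is the observed response (medv) and \(\hat{f}_{\text{bag}}(x_i)\) the bagged prediction for the \(i\)-th observation in dat.te (this formula is our own summary, not part of the original notes).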
We can now compare this with the similarly estimated MSPE of the pruned tree:
myc <- rpart.control(minsplit = 2, cp = 1e-5, xval = 10)
set.seed(123456)
bos.to <- rpart(medv ~ .,
  data = dat.tr, method = "anova",
  control = myc
)
b <- bos.to$cptable[which.min(bos.to$cptable[, "xerror"]), "CP"]
bos.t3 <- prune(bos.to, cp = b)
pr.t3 <- predict(bos.t3, newdata = dat.te, type = "vector")
with(dat.te, mean((medv - pr.t3)^2))
#> [1] 16.59113
Would the quality of the bagged predictions improve if we use a larger ensemble? For example, what happens if we bag \(N = 10\) trees?
N <- 10
myps <- array(NA, dim = c(nrow(dat.te), N))
n.tr <- nrow(dat.tr)
set.seed(123)
for (j in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  tmp <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
  myps[, j] <- predict(tmp, newdata = dat.te, type = "vector")
}
pr.bagg <- rowMeans(myps)
with(dat.te, mean((medv - pr.bagg)^2))
#> [1] 13.97641
or \(N = 100\) trees?
N <- 100
myps <- array(NA, dim = c(nrow(dat.te), N))
n.tr <- nrow(dat.tr)
set.seed(123)
for (j in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  tmp <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
  myps[, j] <- predict(tmp, newdata = dat.te, type = "vector")
}
pr.bagg <- rowMeans(myps)
with(dat.te, mean((medv - pr.bagg)^2))
#> [1] 12.10982
or \(N = 1000\) trees?
N <- 1000
myps <- array(NA, dim = c(nrow(dat.te), N))
n.tr <- nrow(dat.tr)
set.seed(123)
for (j in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  tmp <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
  myps[, j] <- predict(tmp, newdata = dat.te, type = "vector")
}
pr.bagg <- rowMeans(myps)
with(dat.te, mean((medv - pr.bagg)^2))
#> [1] 11.48381
Note that, at least for this test set, increasing the number of bagged trees seems to improve the MSPE. However, the gain appears to decrease, so it may not be worth the computational effort to use a larger bag / ensemble. Furthermore, one may also want to investigate whether this is an artifact of this specific training / test partition, or if similar patterns of MSPE are observed for other random training / test splits. Below we try a different test/training split and repeat the bagging experiment above:
set.seed(1234)
n <- nrow(Boston)
ii <- sample(n, floor(n / 4))
dat.te <- Boston[ii, ]
dat.tr <- Boston[-ii, ]
Ns <- c(5, 10, 100, 1000)
all.results <- matrix(NA, length(Ns), 2)
colnames(all.results) <- c("N", "MSPE")
n.tr <- nrow(dat.tr)
for (hh in 1:length(Ns)) {
  N <- Ns[hh]
  myps <- array(NA, dim = c(nrow(dat.te), N))
  set.seed(123)
  for (j in 1:N) {
    ii <- sample(n.tr, replace = TRUE)
    tmp <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
    myps[, j] <- predict(tmp, newdata = dat.te, type = "vector")
  }
  pr.bagg <- rowMeans(myps)
  all.results[hh, ] <- c(N, with(dat.te, mean((medv - pr.bagg)^2)))
}
print(all.results)
#>         N     MSPE
#> [1,]    5 18.27651
#> [2,]   10 17.50426
#> [3,]  100 14.52966
#> [4,] 1000 14.27511
The pattern is in fact similar to the one we observed before: increasing the size of the ensemble \(N\) helps, but the improvements become smaller after a certain value of \(N\). You are strongly encouraged to verify this by repeating the above experiment with different train/test splits. Furthermore, a very good exercise is to explore what happens with the MSPE of the bagged ensembles (for different values of \(N\)) when the MSPE is estimated using cross-validation (instead of using a specific test set). Do it!
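As a possible starting point for that last exercise, here is a minimal sketch (our own addition, not part of the original notes) of how one could estimate the MSPE of a bagged ensemble of \(N = 100\) trees with 5-fold cross-validation; the number of folds, the seed and the object names (fold, mspe.cv, etc.) are all arbitrary choices:
library(rpart)
data(Boston, package = "MASS")
N <- 100
con <- rpart.control(minsplit = 3, cp = 1e-3, xval = 1) # same control options as above
k <- 5
set.seed(123)
fold <- sample(rep(1:k, length.out = nrow(Boston))) # random fold labels
mspe.cv <- rep(NA, k)
for (f in 1:k) {
  tr <- Boston[fold != f, ] # training portion for this fold
  te <- Boston[fold == f, ] # held-out portion for this fold
  preds <- matrix(NA, nrow(te), N)
  for (j in 1:N) {
    ii <- sample(nrow(tr), replace = TRUE)
    tmp <- rpart(medv ~ ., data = tr[ii, ], method = "anova", control = con)
    preds[, j] <- predict(tmp, newdata = te, type = "vector")
  }
  mspe.cv[f] <- mean((te$medv - rowMeans(preds))^2)
}
mean(mspe.cv) # cross-validated estimate of the MSPE of the bagged ensemble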
14.1 More efficient, useful and elegant implementation
I will now illustrate a possibly more efficient way to implement bagging, namely storing the \(N\) trees (rather than their predictions on a given data set). In this way one can re-use the ensemble (on any future data set) without having to re-train the elements of the bag. Since the idea is the same, I will just do it for an ensemble of \(N = 100\) trees. To simplify the comparison between this implementation of bagging and the one used above, we first re-create the original training / test split:
set.seed(123456)
n <- nrow(Boston)
ii <- sample(n, floor(n / 4))
dat.te <- Boston[ii, ]
dat.tr <- Boston[-ii, ]
Now, let’s create a list of 100 (empty) elements; each element of this list will store a regression tree:
N <- 100
mybag <- vector("list", N)
Now, we train the \(N\) trees as before, but store them in the list (without computing any predictions):
set.seed(123)
for (j in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  mybag[[j]] <- rpart(medv ~ ., data = dat.tr[ii, ], method = "anova", control = con)
}
Given a new data set, in order to obtain the corresponding predictions for each tree in the ensemble, one could either:
- loop over the \(N\) trees, averaging the corresponding \(N\) vectors of predictions; or
- use sapply (check the help page if you are not familiar with the apply functions in R).
The latter option results in code that is much more elegant, efficient (allowing for future uses of the ensemble), and compact. Of course both give exactly the same results. Below we illustrate both strategies. If we use the first approach (loop) we obtain the following estimated MSPE using the test set:
pr.bagg2 <- rep(0, nrow(dat.te))
for (j in 1:N) {
  pr.bagg2 <- pr.bagg2 + predict(mybag[[j]], newdata = dat.te) / N
}
with(dat.te, mean((medv - pr.bagg2)^2))
#> [1] 12.10982
(compare it with the results we obtained before). Using the second approach (sapply):
pr.bagg3 <- rowMeans(sapply(mybag, predict, newdata = dat.te))
with(dat.te, mean((medv - pr.bagg3)^2))
#> [1] 12.10982
Both results are of course identical.
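Since the whole point of storing the \(N\) trees is to be able to re-use the ensemble later, one may want to wrap the sapply call in a small helper function. The sketch below is our own addition (the name predict_bag is not defined anywhere else in these notes):
predict_bag <- function(bag, newdata) {
  # average, for each row of 'newdata', the predictions of all trees in 'bag'
  rowMeans(sapply(bag, predict, newdata = newdata))
}
# the ensemble can now be applied to any data set with the same predictors,
# without re-training a single tree; for example, on the training set:
head(predict_bag(mybag, dat.tr))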
14.2 Bagging a regression spline
Bagging does not provide much of an advantage when applied to linear predictors (can you explain why?). Nevertheless, let us try it on the lidar data, which, as we did before, we randomly split into a training and a test set:
data(lidar, package = "SemiPar")
set.seed(123456)
n <- nrow(lidar)
ii <- sample(n, floor(n / 5))
lid.te <- lidar[ii, ]
lid.tr <- lidar[-ii, ]
Now fit a cubic spline, and estimate the MSPE using the test set:
library(splines)
a <- lm(logratio ~ bs(x = range, df = 10, degree = 3), data = lid.tr)
oo <- order(lid.tr$range)
pr.of <- predict(a, newdata = lid.te)
mean((lid.te$logratio - pr.of)^2)
#> [1] 0.007427559
We build an ensemble of 10 fits and estimate the corresponding MSPE using the test set:
N <- 10
myps <- matrix(NA, nrow(lid.te), N)
set.seed(123)
n.tr <- nrow(lid.tr)
for (i in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  a.b <- lm(logratio ~ bs(x = range, df = 10, degree = 3), data = lid.tr[ii, ])
  myps[, i] <- predict(a.b, newdata = lid.te)
}
pr.ba <- rowMeans(myps) # , na.rm=TRUE)
mean((lid.te$logratio - pr.ba)^2)
#> [1] 0.007552562
Note that the estimated MSPE is almost the same as the one of the original single spline. Furthermore, adding more elements to the ensemble does not seem to improve the estimated MSPEs:
N <- 100
myps <- matrix(NA, nrow(lid.te), N)
set.seed(123)
n.tr <- nrow(lid.tr)
for (i in 1:N) {
  ii <- sample(n.tr, replace = TRUE)
  a.b <- lm(logratio ~ bs(x = range, df = 10, degree = 3), data = lid.tr[ii, ])
  myps[, i] <- predict(a.b, newdata = lid.te)
}
pr.ba <- rowMeans(myps) # , na.rm=TRUE)
mean((lid.te$logratio - pr.ba)^2)
#> [1] 0.0075887
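To see more directly how little the ensemble changes things here, one can compare the bagged predictions with those of the single spline fit computed earlier; these lines are our own check, not part of the original code:
# how different are the bagged predictions from the single-spline ones?
summary(abs(pr.of - pr.ba))
cor(pr.of, pr.ba)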