20  Bagging

20.1 Relationship to Bootstrap

Bagging is a general technique, like cross-validation, that can be applied to any learning method. It is particularly effective for base learners with high variability and low bias. The name is short for bootstrap aggregating and describes its relationship to the bootstrap: draw \(B\) bootstrap samples from the data set with \(n\) observations, fit a base learner to each of the \(B\) samples, and combine the individual predictions or classifications.

In the regression case, the \(B\) predictions are averaged; in the classification case, the majority vote across the \(B\) classifications determines the predicted category.
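The aggregation step itself amounts to one line of code per case. The sketch below uses small hypothetical prediction matrices (pred_reg and pred_class are made-up names, not objects used elsewhere in this chapter) to show the averaging and the majority vote.

# Hypothetical illustration: rows = observations, columns = B = 3 base learners
set.seed(1)
pred_reg   <- matrix(rnorm(5*3, mean=10), nrow=5, ncol=3)
pred_class <- matrix(sample(c("Good","Bad"), 5*3, replace=TRUE), nrow=5, ncol=3)

# Regression: average the B predictions for each observation
rowMeans(pred_reg)

# Classification: majority vote across the B predicted categories
apply(pred_class, 1, function(v) names(which.max(table(v))))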

Not surprisingly, bagging is popular with decision trees, low-bias regression or classification methods that tend to have high variability. A single decision tree might have a large classification or prediction error; an ensemble of 500 trees can be a very precise predictor. Random forests (Section 20.4) are a special version of bagged trees.

Bagging can be useful with other estimators.

Example: Bagged Estimator Based on the Average of Three Observations

Suppose we have 100 observations from a Gaussian distribution G(\(\mu\), 2) with mean \(\mu\) and variance \(\sigma^2 = 2\).

We want to estimate the mean of the normal, so the sample mean \[ \overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i \] is the best estimator.

We can get close to that estimator by bagging a highly variable estimator, say the average of the first three observations. If \(Y_{(i)}\) denotes the \(i\)th observation in the data set, then \[ Y^* = \frac{1}{3} \sum_{i=1}^3 Y_{(i)} \] is also an unbiased estimator for \(\mu\), but it has standard deviation \[ \sqrt{\frac{\sigma^2}{3}} = \sqrt{\frac{2}{3}} = 0.816 \] compared to the standard deviation of the sample mean, \[ \sqrt{\frac{\sigma^2}{100}} = \sqrt{\frac{2}{100}} = 0.1414 \]
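A quick numeric check of these two standard deviations (with \(\sigma^2 = 2\)):

# Standard deviations of the three-observation average and of the sample mean
sqrt(2/3)     # 0.8164966
sqrt(2/100)   # 0.1414214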

\(Y^*\) is more variable than \(\overline{Y}\) and both are unbiased estimators. The precision of estimating \(\mu\) based on \(Y^*\) can be improved by bagging the estimator: rather than compute \(Y^*\) once from the data, we compute it \(B\) times based on bootstrap samples from the data and take the average of the bootstrapped estimates as our estimate of \(\mu\).

set.seed(542)
n <- 100
x <- rnorm(n, mean=0, sd=sqrt(2))

bagger <- function(x,b=1000) {
    n <- length(x)
    sum_bs <- 0
    sum_bs2 <- 0
    for (i in 1:b) {
        # Draw a bootstrap sample
        bsample <- sample(n,n,replace=TRUE)
        # Compute the estimator: the average of the first three
        # observations in the bootstrap sample
        est <- mean(x[bsample][1:3])
        # Accumulate sum and sum of squared value of estimator
        # so we can calculate mean and standard deviation later
        sum_bs <- sum_bs + est
        sum_bs2 <- sum_bs2 + est*est
    }
    # Compute mean and standard deviation of the estimator
    mn <- sum_bs / b
    sdev <- sqrt((sum_bs2 - sum_bs*sum_bs/b)/(b-1))
    return(list(B=b, mean=mn, sd=sdev))
}

cat("The best estimate of mu is ", mean(x),"\n\n")
The best estimate of mu is  -0.004722246 
l <- list()
l[[length(l)+1]] <- bagger(x,b=10)
l[[length(l)+1]] <- bagger(x,b=100)
l[[length(l)+1]] <- bagger(x,b=1000)
l[[length(l)+1]] <- bagger(x,b=10000)
l[[length(l)+1]] <- bagger(x,b=25000)
l[[length(l)+1]] <- bagger(x,b=50000)

do.call(rbind.data.frame, l)
      B          mean        sd
1    10  0.2239820300 0.6364328
2   100 -0.1450445628 0.8497776
3  1000 -0.0173021748 0.8271210
4 10000  0.0006381345 0.8534333
5 25000 -0.0028520425 0.8394222
6 50000 -0.0073707649 0.8386747

The bagger function returns the mean and the standard deviation of the \(B\) bootstrap estimates. Notice that by increasing \(B\), the bagged estimate (the mean of the bootstrap estimates) gets closer to the optimal estimator, \(\overline{Y}\). The standard deviation of the bootstrap estimates stabilizes around 0.84, close to the theoretically expected value of \(\sqrt{2/3} = 0.816\). Both quantities returned from bagger become more accurate as \(B\) increases. The standard deviation of the bootstrap estimates does not go to zero as \(B\) increases; it stabilizes at the standard deviation of the estimator \(Y^*\).

20.2 Why and When Does Bagging Work?

It is intuitive that the average in a random sample varies less than the individual random variables being averaged. If \(Y_1, \cdots, Y_n\) is a random sample from a distribution with mean \(\mu\) and variance \(\sigma^2\), then \(\text{Var}[Y_i] = \sigma^2\), while \(\text{Var}[\overline{Y}] = \sigma^2/n\).

We can express this intuition more formally and gain some insight into why bagging works and when it is most effective. To motivate the idea, we follow Sutton (2005) and suppose we have a continuous response \(Y\) and that \(\phi(\textbf{x})\) is the prediction that results from applying some particular method. The method could be a regression tree, in which case \(\phi^{T}(\textbf{x})\) is the average of the values in the terminal node for \(\textbf{x}\); or it could be a multiple linear regression, in which case \(\phi^{R}(\textbf{x})\) is the predicted value \(\widehat{y} = \textbf{x}^\prime\widehat{\boldsymbol{\beta}}\).

The point is that \(\phi(\textbf{x})\) is a random variable because it depends on the particular sample (the training data) and its statistical properties derive from the prediction method itself and the properties of \(Y\). \(\text{Var}[\phi^{R}(\textbf{x})]\) and \(\text{Var}[\phi^{T}(\textbf{x})]\) are not the same.

Let \(\mu_\phi\) denote the expected value of the predictor \(\phi(\textbf{x})\) over the distribution of the training data. The decomposition of the mean-squared error into a variance contribution and a squared-bias contribution is then \[ \text{E}[(Y-\phi(\textbf{x}))^2] = \text{Var}[\phi(\textbf{x})] + \text{E}[(Y - \mu_\phi)^2] \]

If we had the opportunity to use as the predictor the mean \(\mu_\phi\) rather than the random variable \(\phi(\textbf{x})\), then we would have a smaller mean square error than using \(\phi(\textbf{x})\) because the variance term drops out (\(\text{Var}[\mu_\phi]=0\) because \(\mu_\phi\) is a constant). This is true regardless of the prediction method. The replacement would be most effective in situations where \(\text{Var}[\phi(\textbf{x})]\) is large. In other words, the larger the variance \(\text{Var}[\phi(\textbf{x})]\) is relative to the MSE of the predictor, the greater the percentage reduction if the predictor is replaced by its mean.
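A small Monte Carlo sketch using the Gaussian setup from the earlier example (\(\mu = 0\), \(\sigma^2 = 2\), and \(\phi\) the average of three observations, so that \(\mu_\phi = \mu\)) illustrates the decomposition and the gain from replacing \(\phi\) with its mean.

# Monte Carlo check of E[(Y - phi)^2] = Var[phi] + E[(Y - mu_phi)^2]
set.seed(87)
nsim <- 50000
mu   <- 0
Y    <- rnorm(nsim, mean=mu, sd=sqrt(2))                      # new observations to predict
phi  <- replicate(nsim, mean(rnorm(3, mean=mu, sd=sqrt(2))))  # unstable predictor, Var = 2/3

mean((Y - phi)^2)              # MSE of the unstable predictor, approx. 2/3 + 2
var(phi) + mean((Y - mu)^2)    # variance term plus MSE of the mean predictor
mean((Y - mu)^2)               # MSE if phi is replaced by mu_phi, approx. 2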

In practice we do not know \(\mu_\phi\), but we can estimate it by averaging the individual predictors \(\phi_1(\textbf{x}), \cdots, \phi_B(\textbf{x})\) computed from \(B\) bootstrap samples. Let’s denote this bootstrap aggregate of our base method by \(\phi_{B}(\textbf{x})\). To make notation even worse, the bootstrap-aggregated version of the regression predictor \(\phi^{R}(\textbf{x})\) would be denoted \(\phi_{B}^{R}(\textbf{x})\).

The effectiveness of bagging is greatest when the prediction method is highly variable, possibly even unstable. Effectiveness is measured here by the relative reduction in variability that occurs as \(\text{Var}[\phi_{B}(\textbf{x})]\) approaches zero with an increasing number of bootstrap samples \(B\).

In order for bagging to work well, we need to take enough bootstrap samples to reduce the variance, and the training data set has to be large enough for the bootstrap estimate of \(\mu_\phi\) to be good.


A regression model that uses the correct inputs and functional form is a low-variance estimation method; it is very stable. Bagging such an estimator does not yield much. A decision tree, in contrast, is inherently unstable: it can change substantially under small perturbations of the data. That inherent variability makes it a great candidate for improvement through bootstrap aggregation, that is, through bagging.
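The contrast between a stable and an unstable base estimator can be seen with the bagger function from the earlier example. The sketch below is a hypothetical variant, bagger_mean, that bags the already stable sample mean instead of the three-observation average; the bootstrap standard deviation it reports is close to \(\sqrt{2/100} = 0.14\), the precision of \(\overline{Y}\) itself, so there is nothing for bagging to gain.

# Hypothetical variant of bagger() that bags the stable sample mean;
# assumes x from the earlier example is still in the workspace
bagger_mean <- function(x, b=1000) {
    n    <- length(x)
    ests <- numeric(b)
    for (i in 1:b) {
        bsample <- sample(n, n, replace=TRUE)   # bootstrap sample
        ests[i] <- mean(x[bsample])             # stable base estimator: the sample mean
    }
    list(B=b, mean=mean(ests), sd=sd(ests))
}

set.seed(542)
bagger_mean(x, b=10000)
# The bagged estimate is essentially mean(x); bagging a stable estimator gains little.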

To measure the effect of bagging on a classifier, mean squared prediction error is not the appropriate metric; the performance of a classifier is measured through misclassification rate, cross-entropy, or other metrics. An improvement through bagging is achieved when the misclassification rate of the bagged classifier is closer to the Bayes rate than that of the un-bagged classifier.

Bayes classifier

Recall that the Bayes classifier is an ideal classifier that always predicts the category that is most likely to occur for an input vector \(\textbf{x}\). The Bayes classifier will not always make the correct prediction, but it will have the smallest misclassification rate.
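A hypothetical two-class simulation makes the definition concrete: with a known class probability \(p(\textbf{x}) = \Pr(Y=\text{“Good”}\mid\textbf{x})\), the Bayes classifier predicts “Good” whenever \(p(\textbf{x}) > 0.5\), and its misclassification rate, the Bayes rate, is \(\text{E}[\min\{p(\textbf{x}), 1-p(\textbf{x})\}]\). The probability model below is made up purely for illustration.

# Hypothetical example: Bayes classifier when the class probability is known
set.seed(20)
u <- runif(50000)                    # a single input variable
p <- plogis(4*u - 2)                 # assumed true Pr(Y = "Good" | u)
y <- rbinom(length(u), 1, p)         # 1 = "Good", 0 = "Bad"

bayes_pred <- as.numeric(p > 0.5)    # predict the more likely class
mean(bayes_pred != y)                # realized error of the Bayes classifier
mean(pmin(p, 1 - p))                 # theoretical Bayes rate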

Bagging tends to work well for classifiers that have low bias, in the sense that they perform similarly to the Bayes classifier for most \(\textbf{x}\). Bagging does not help for \(\textbf{x}\) values where the classifier \(\phi(\textbf{x})\) consistently predicts a class different from the best possible one.

20.3 Bagged Trees

Breiman (1996) introduced bootstrap aggregating (bagging) as a means to increase the accuracy of a predictor by applying the predictor to a sequence of learning sets, data sets of size \(n\), and combining the results from the sets. In the case of regression, the obvious choice is to average the predictions across the learning sets. In the case of classification, Breiman recommends majority voting: classify an observation in each learning set and assign the most frequently chosen category as the overall classification.

The learning sets are \(B\) bootstrap samples drawn from the original data. Breiman notes

The vital element is the instability of the prediction method.

If the different learning sets do not lead to changes in the predictors, the aggregation will not be effective. This is where base learners with high variability (and low bias) are important. There is a second situation where bagging will not be effective: if the method being bagged is already near the limits of accuracy achievable for that data set, no amount of bootstrapping and bagging will help.

The most frequent application of bagging is with decision trees for either regression or classification. In a non-ensemble situation one would build a single tree, first growing it deep and then pruning it back. During bagging, trees are not pruned; the volatility of deep trees is desirable.
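To make the procedure concrete before turning to a packaged implementation, the sketch below bags classification trees by hand with the tree package. The data frame names train and test and the response y are placeholders, not objects from this chapter; each tree is grown deep (mindev=0) and left unpruned, and the test observations are classified by majority vote.

library(tree)

# Sketch: bagging classification trees by hand.
# `train` and `test` are placeholder data frames with a factor response `y`.
bag_trees <- function(train, test, B=100) {
    n     <- nrow(train)
    votes <- matrix(NA_character_, nrow=nrow(test), ncol=B)
    for (b in 1:B) {
        bsample <- sample(n, n, replace=TRUE)                    # bootstrap sample
        fit <- tree(y ~ ., data=train[bsample, ],
                    control=tree.control(nobs=n, mindev=0))      # deep tree, no pruning
        votes[, b] <- as.character(predict(fit, newdata=test, type="class"))
    }
    # Majority vote across the B trees
    factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}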

Example: Banana Quality

The data for this example can be found on Kaggle and comprise observations on the quality of bananas (“Good”, “Bad”) and seven attributes (Figure 20.1). The attributes are normalized, although it is not clear how this was done. There are 4,000 training observations and 4,000 test observations in separate data sets.

library("duckdb")
con <- dbConnect(duckdb(),dbdir = "ads.ddb",read_only=TRUE)
ban_train <- dbGetQuery(con, "SELECT * FROM banana_train")
ban_test <- dbGetQuery(con, "SELECT * FROM banana_test")

ban_train$Quality <- as.factor(ban_train$Quality)
ban_test$Quality <- as.factor(ban_test$Quality)

dbDisconnect(con)
lattice::histogram(~ Size + Weight + Sweetness + Softness + HarvestTime
                   + Ripeness + Acidity , 
                   cex     =0.5,
                   as.table=TRUE,
                   par.strip.text=list(cex=0.75),
                   xlab    ="",
                   data    =ban_train)
Figure 20.1: Histograms of banana quality attributes (training data)

We first build a regular classification tree for the training data and compute its confusion matrix for the test data. The tree is grown with the tree::tree function and its complexity is cross-validated (10-fold) based on the misclassification rate.

library(tree)
tree.ban <- tree(Quality ~ ., data=ban_train)
summary(tree.ban)

Classification tree:
tree(formula = Quality ~ ., data = ban_train)
Variables actually used in tree construction:
[1] "Sweetness"   "HarvestTime" "Ripeness"    "Weight"      "Size"       
[6] "Softness"   
Number of terminal nodes:  15 
Residual mean deviance:  0.5658 = 2255 / 3985 
Misclassification error rate: 0.09525 = 381 / 4000 
set.seed(123)
cv.ban <- cv.tree(tree.ban, FUN=prune.misclass)

cat("Optimial tree size after pruning: ",cv.ban$size[which.min(cv.ban$dev)],"\n")
Optimial tree size after pruning:  15 

It turns out that the deep tree with 15 terminal nodes is also the best size per cross-validation, so no further pruning is necessary and the tree can be used to predict the test data.

library(caret)

tree.ban.pred <- predict(tree.ban, newdata=ban_test, type="class")

tree.ban.cm <- confusionMatrix(tree.ban.pred,ban_test$Quality)
tree.ban.cm
Confusion Matrix and Statistics

          Reference
Prediction  Bad Good
      Bad  1844  270
      Good  150 1736
                                          
               Accuracy : 0.895           
                 95% CI : (0.8851, 0.9043)
    No Information Rate : 0.5015          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.79            
                                          
 Mcnemar's Test P-Value : 6.375e-09       
                                          
            Sensitivity : 0.9248          
            Specificity : 0.8654          
         Pos Pred Value : 0.8723          
         Neg Pred Value : 0.9205          
             Prevalence : 0.4985          
         Detection Rate : 0.4610          
   Detection Prevalence : 0.5285          
      Balanced Accuracy : 0.8951          
                                          
       'Positive' Class : Bad             
                                          

The decision tree achieves an accuracy of 89.5% on the test data.

To apply bagging we use the randomForest function in the package of the same name. As we will see shortly, bagging is a special case of a random forest in which the mtry parameter is set to the number of input variables, here 7. By default, randomForest draws \(B=500\) bootstrap samples, that is, it builds 500 trees.

library(randomForest)
set.seed(6543)
bag.ban <- randomForest(Quality ~ . , 
                        data      =ban_train, 
                        xtest     =ban_test[,-8],
                        ytest     =ban_test[, 8],
                        mtry      =7,
                        importance=TRUE)

bag.ban

Call:
 randomForest(formula = Quality ~ ., data = ban_train, xtest = ban_test[,      -8], ytest = ban_test[, 8], mtry = 7, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 3.88%
Confusion matrix:
      Bad Good class.error
Bad  1933   67      0.0335
Good   88 1912      0.0440
                Test set error rate: 3.35%
Confusion matrix:
      Bad Good class.error
Bad  1929   65  0.03259779
Good   69 1937  0.03439681

When a test set is specified, randomForest returns two types of predicted values. bag.ban$predicted contains the out-of-bag predictions: each observation is out-of-bag in about one third of the trees built, and the majority vote across those roughly \(B/3\) classifications is the out-of-bag prediction for that observation. The predicted values for the test data set are stored in bag.ban$test$predicted. Each observation in the test data set is passed through all 500 trees; the majority vote of the predicted quality is the overall prediction for that observation.
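As a quick check, both error rates reported by randomForest above can be recomputed from these stored predictions:

# Recompute the OOB and test set error rates from the stored predictions
mean(bag.ban$predicted != ban_train$Quality)         # OOB error rate
mean(bag.ban$test$predicted != ban_test$Quality)     # test set error rate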

The confusion matrices shown by randomForest can be reconstructed as follows:

confusionMatrix(bag.ban$predicted,ban_train$Quality)
Confusion Matrix and Statistics

          Reference
Prediction  Bad Good
      Bad  1933   88
      Good   67 1912
                                         
               Accuracy : 0.9612         
                 95% CI : (0.9548, 0.967)
    No Information Rate : 0.5            
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.9225         
                                         
 Mcnemar's Test P-Value : 0.1082         
                                         
            Sensitivity : 0.9665         
            Specificity : 0.9560         
         Pos Pred Value : 0.9565         
         Neg Pred Value : 0.9661         
             Prevalence : 0.5000         
         Detection Rate : 0.4833         
   Detection Prevalence : 0.5052         
      Balanced Accuracy : 0.9612         
                                         
       'Positive' Class : Bad            
                                         
bag.ban.cm <- confusionMatrix(bag.ban$test$predicted,ban_test$Quality)
bag.ban.cm
Confusion Matrix and Statistics

          Reference
Prediction  Bad Good
      Bad  1929   69
      Good   65 1937
                                          
               Accuracy : 0.9665          
                 95% CI : (0.9604, 0.9719)
    No Information Rate : 0.5015          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.933           
                                          
 Mcnemar's Test P-Value : 0.7955          
                                          
            Sensitivity : 0.9674          
            Specificity : 0.9656          
         Pos Pred Value : 0.9655          
         Neg Pred Value : 0.9675          
             Prevalence : 0.4985          
         Detection Rate : 0.4823          
   Detection Prevalence : 0.4995          
      Balanced Accuracy : 0.9665          
                                          
       'Positive' Class : Bad             
                                          

The accuracy of the decision tree has increased from 89.5% for the single tree to 96.65% through bagging.

As with all ensemble methods, interpretability of the results suffers when methods are combined. Decision trees are intrinsically interpretable and among the most easily understood algorithms. The visual of a tree describes how data flows through the algorithm and how it arrives at a prediction or classification (Figure 20.2).

par(mar = c(2, 2, 1, 2))
plot(tree.ban)
text(tree.ban,cex=0.7)
Figure 20.2: Decision tree for banana quality

With bagged trees we lose this interpretability; instead of a single tree, the decision now depends on many trees and cannot be visualized in the same way. To help with the interpretation of bagged trees, the importance of the features can be calculated in various ways. One technique uses the Gini index, a measure of node purity; another looks at the decrease in accuracy when the values of a variable are permuted in the out-of-bag samples. Permutation breaks the relationship between the variable and the target. A variable whose permutation changes the accuracy greatly is an important variable for the ensemble.

Example: Banana Quality (Cont’d)

The variable importance measures for the banana quality data, based on bagging 500 trees, are shown below and plotted in Figure 20.3.

importance(bag.ban)
                  Bad      Good MeanDecreaseAccuracy MeanDecreaseGini
Size         89.29303  89.07396            127.62487         318.5278
Weight      100.00928 111.36404            139.91621         245.3550
Sweetness    89.51721 112.26886            158.96167         392.7300
Softness    108.32549 133.31219            168.30776         287.1453
HarvestTime  96.58981 108.22739            143.83798         337.7313
Ripeness    101.76511  93.76190            133.84679         294.0230
Acidity      57.39471  57.35883             77.66721         123.9741
par(mar = c(2, 2, 0, 2))
varImpPlot(bag.ban,main="")
Figure 20.3: Variable importance plots for bagged trees.
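The permutation idea behind MeanDecreaseAccuracy can also be sketched by hand on the test data: permute one column, re-predict, and look at the drop in accuracy. Because bag.ban was fit with xtest/ytest, its forest is not retained for later prediction, so the sketch below refits the bagged trees with keep.forest=TRUE; it is a rough illustration of the principle, not a replication of randomForest's out-of-bag, standardized computation.

# Rough sketch of permutation importance for a single variable (Sweetness)
set.seed(99)
bag.keep <- randomForest(Quality ~ ., data=ban_train, mtry=7, keep.forest=TRUE)

acc_orig <- mean(predict(bag.keep, newdata=ban_test) == ban_test$Quality)

ban_perm <- ban_test
ban_perm$Sweetness <- sample(ban_perm$Sweetness)   # break the link between Sweetness and Quality
acc_perm <- mean(predict(bag.keep, newdata=ban_perm) == ban_test$Quality)

acc_orig - acc_perm                                # drop in accuracy attributable to Sweetness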

Bagging is a powerful method to improve accuracy. In Breiman’s words:

Bagging goes a ways toward making a silk purse out of a sow’s ear, especially if the sow’s ear is twitchy. […] What one loses, with the trees, is a simple and interpretable structure.
What one gains is increased accuracy.

20.4 Random Forests

Are there any other downsides to bagging, besides the loss of interpretability and the increased computational burden? Not really, but another issue deserves consideration: the bootstrap samples are highly correlated. On average, about two thirds of the observations appear in each bootstrap sample, making the data sets similar to each other. An input variable that dominates the model will tend to be the first variable on which each tree splits, leading to similar trees. This reduces the variability among the trees and reduces the effectiveness of the ensemble.

How can we make the trees more volatile? Breiman (2001) advanced the concept of bagged trees through the introduction of random forests. The key innovation that leads from bagged trees to random forests is that not all input variables are considered at each split of the tree. Instead, only a randomly selected subset of predictors is considered, and the subset changes from split to split. The random selection introduces more variability and allows non-dominant input variables to participate.

In other aspects, random forests are constructed in the same way as bagged trees introduced in Section 20.3: \(B\) trees are built based on \(B\) bootstrap samples, the trees are built deep and not pruned.

Example: Banana Quality (Cont’d)

Fitting a random forest instead of bagged trees requires only a small change to the previous R code: specify a value for the mtry parameter that is smaller than the number of input variables. By default, randomForest chooses \(p/3\) candidate inputs at each split in regression trees and \(\sqrt{p}\) candidate inputs in classification trees.

set.seed(6543)
rf.ban <- randomForest(Quality ~ . , 
                       data =ban_train, 
                       xtest=ban_test[,-8],
                       ytest=ban_test[, 8])
rf.ban

Call:
 randomForest(formula = Quality ~ ., data = ban_train, xtest = ban_test[,      -8], ytest = ban_test[, 8]) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.3%
Confusion matrix:
      Bad Good class.error
Bad  1944   56       0.028
Good   76 1924       0.038
                Test set error rate: 3.15%
Confusion matrix:
      Bad Good class.error
Bad  1928   66  0.03309930
Good   60 1946  0.02991027

Only two of the seven inputs are considered at each split in the random forest, compared to all seven in the bagged trees. The accuracy improves slightly, from 96.65% to 96.85%.

It might seem counterintuitive that one can achieve greater accuracy by “ignoring” five of the seven input variables at each split. It turns out that the results are fairly insensitive to the number of inputs considered at each split; the important point is that not all inputs are considered. In that case, one might as well choose a small number to speed up the computations.
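That insensitivity is easy to check by looping over mtry and comparing the out-of-bag error rates (a sketch; with 500 trees per fit this takes a little while to run):

# OOB error rate as a function of mtry, the number of inputs tried at each split
set.seed(6543)
oob_err <- sapply(1:7, function(m) {
    fit <- randomForest(Quality ~ ., data=ban_train, mtry=m, ntree=500)
    fit$err.rate[500, "OOB"]               # OOB error after all 500 trees
})
data.frame(mtry=1:7, oob_error=oob_err)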

Bagged trees and random forests are easily parallelized; the \(B\) trees can be trained independently. This is different from other ensemble techniques, for example boosting, which builds the ensemble sequentially. Another advantage of bootstrap-based methods is that they do not overfit as \(B\) increases: adding bootstrap samples simply adds trees but does not change the model complexity.
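A minimal sketch of parallel training, assuming a Unix-alike system where parallel::mclapply can fork processes: grow several smaller forests independently and merge them with randomForest::combine.

# Sketch: train four forests of 125 trees each in parallel, then combine them
library(parallel)
library(randomForest)

forests <- mclapply(1:4, function(i) {
    set.seed(1000 + i)
    randomForest(Quality ~ ., data=ban_train, mtry=7, ntree=125)
}, mc.cores=4)

bag.par <- do.call(randomForest::combine, forests)
bag.par$ntree        # 500 trees in total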

Cross-validation is not necessary with bootstrap-based methods because the out-of-bag error can be computed from the observations not in a particular bootstrap sample.
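The OOB error is tracked as trees are added, so one can check whether enough trees were grown by plotting it, for example for the rf.ban fit above:

# OOB error rate as a function of the number of trees (rf.ban from above)
plot(rf.ban$err.rate[, "OOB"], type="l",
     xlab="Number of trees", ylab="OOB error rate")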

Using a random selection of inputs at each split is one manifestation of “random” in random forests. Other variations of random forests take random linear combinations of inputs, randomize the outputs in the training data, or use random sets of weights. The important idea is to inject randomness into the system to create diversity.

Random forests excel at reducing variability and there is some evidence that they also reduce bias. Breiman (2001) mentions that the mechanism by which forests reduce bias is not obvious. The bootstrap mechanism is inherently focused on reducing variability. A different class of ensemble methods was developed to attack both bias and variance simultaneously. In contrast to the random forest, these boosting techniques are based on sequentially changing the training data set or the model (Chapter 21).

A completely different approach is taken by Bayesian Model Averaging (BMA). In many situations there is no single clear winning model but a neighborhood of models that deserve consideration. Picking a single winner then seems unjustified, and interpreting a large number of models individually is not advisable either. BMA combines many models, each contributing to the overall prediction according to its proximity to the best-performing model (Chapter 22).