On the Theory Behind Ensemble Learning Techniques
Bagging, Boosting, Stacking and why they work
Introduction
I recently attended a talk at PyCon ’24 on ensemble learning methods. Although the techniques themselves are fairly simple, the talk made me curious whether there is any simple theoretical justification for the benefits such methods offer.
To briefly recap the discussed methods:
- Bagging — This has two steps. First, multiple datasets are created by sampling the training data with replacement, and a model of the same type is trained on each. Second, the results are combined, typically by averaging the individual predictions.
- Boosting — The combined result is obtained by a weighted average of the models, with the weights adjusted iteratively based on the errors made in the previous step.
- Stacking — Instead of averaging the base models' results directly as in bagging, a second model (a meta-learner) is trained to combine the predictions of the first-level models. Note that the underlying models are trained on the same dataset but can be of different types.
Bias-Variance Decomposition
Supervised learning methods usually minimize the mean-square error between the prediction and ground truth. This error can be decomposed as follows:
$$\mathrm{MSE} = \mathbb{E}\big[(y - \hat{f}(x))^2\big]$$

where y represents the target value and f̂(x) stands for the prediction at x. This notation is chosen because we will use f to stand for the true underlying function, devoid of the irreducible noise. The target is then written as

$$y = f(x) + \varepsilon$$

with 𝔼[ε] = 0. Using the above two equations,

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f(x) - \hat{f}(x))\big] + \mathbb{E}\big[\varepsilon^2\big]$$

and as ε is independent of f(x) − f̂(x), the second term vanishes because 𝔼[ε] = 0. Also, let 𝔼[ε²] = σ², which represents the variance of the noise. The first term can be expanded as

$$\mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] = \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big]$$

Again, on expanding the square, the middle (cross) term vanishes as

$$\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] = \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\,\mathbb{E}\big[\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big] = 0$$

The final expression for the mean-square error then becomes

$$\mathrm{MSE} = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{Irreducible error}}$$
High bias results from overly simplistic models that make strong assumptions about the data, while high variance arises from overly complex models that overfit to noise in the training data.
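To make the decomposition concrete, here is a minimal Monte Carlo sketch (not from the original talk) that estimates the squared bias and the variance of a model at a single query point by retraining it on many freshly sampled training sets. The function `true_f`, the noise level, and the depth-4 decision tree are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function f(x); any smooth function works here.
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # std. dev. of the irreducible noise epsilon
n_train, n_runs = 50, 500
x_test = np.array([[0.35]])      # a single query point x

preds = []
for _ in range(n_runs):
    # Fresh training set: y = f(x) + eps, with E[eps] = 0
    x_tr = rng.uniform(0, 1, size=(n_train, 1))
    y_tr = true_f(x_tr).ravel() + rng.normal(0, sigma, size=n_train)
    model = DecisionTreeRegressor(max_depth=4).fit(x_tr, y_tr)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (true_f(x_test)[0, 0] - preds.mean()) ** 2   # (f(x) - E[f_hat(x)])^2
variance = preds.var()                                 # E[(f_hat(x) - E[f_hat(x)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, noise ~ {sigma**2:.4f}")
```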
Ensemble learning primarily addresses the bias-variance tradeoff in supervised learning. These methods combine the strengths of existing models so that the resulting ensemble has lower bias, lower variance, or both.
The remainder of the article will discuss how the ensemble learning methods affect various terms in the bias-variance decomposition.
Bagging
To briefly recap, bagging involves training multiple versions of a base model on different bootstrapped datasets and averaging their predictions.
Assuming there are M base models, the bagged prediction is the average

$$\hat{f}_{\text{bag}}(x) = \frac{1}{M}\sum_{m=1}^{M}\hat{f}_m(x)$$

Then, the bias for this combined model is given by

$$\mathbb{E}\big[\hat{f}_{\text{bag}}(x)\big] - f(x) = \frac{1}{M}\sum_{m=1}^{M}\Big(\mathbb{E}\big[\hat{f}_m(x)\big] - f(x)\Big) = \mathbb{E}\big[\hat{f}_1(x)\big] - f(x)$$

since the base models are identically distributed, so the ensemble has the same bias as a single base model. Similarly, the variance can be written as

$$\mathrm{Var}\big[\hat{f}_{\text{bag}}(x)\big] = \frac{1}{M^2}\sum_{m=1}^{M}\mathrm{Var}\big[\hat{f}_m(x)\big] + \frac{1}{M^2}\sum_{i \ne j}\mathrm{Cov}\big[\hat{f}_i(x), \hat{f}_j(x)\big]$$

and assuming the base models are only weakly correlated or, even better, uncorrelated, we have

$$\mathrm{Var}\big[\hat{f}_{\text{bag}}(x)\big] \approx \frac{1}{M^2}\sum_{m=1}^{M}\mathrm{Var}\big[\hat{f}_m(x)\big] = \frac{\mathrm{Var}\big[\hat{f}_1(x)\big]}{M}$$
Thus, bagging with averaging
- Maintains similar bias
- But reduces variance, by a factor of up to M when the base models are uncorrelated
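As a concrete illustration, below is a minimal, hand-rolled sketch of bagging, assuming a `DecisionTreeRegressor` base model and the same kind of synthetic data as above; in practice a library implementation such as scikit-learn's `BaggingRegressor` performs the same bootstrap-and-average procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic training data: y = f(x) + noise (assumed for illustration)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

M = 25                     # number of bootstrapped base models
models = []
for _ in range(M):
    # Sample row indices with replacement (a bootstrap sample)
    idx = rng.integers(0, len(x_tr), size=len(x_tr))
    models.append(DecisionTreeRegressor(max_depth=6).fit(x_tr[idx], y_tr[idx]))

def bagged_predict(x):
    # Average the M base-model predictions
    return np.mean([m.predict(x) for m in models], axis=0)

x_test = np.linspace(0, 1, 5).reshape(-1, 1)
print(bagged_predict(x_test))
```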
Boosting
In this approach, models are built sequentially, where each new model is trained to correct the residual errors of the previous models. Boosting can be written as an additive model

$$\hat{F}_M(x) = \sum_{m=1}^{M}\alpha_m\,\hat{h}_m(x)$$

where the weights α_m are obtained from the training errors in the previous iteration. Depending on the variant used for deciding the weights, the bias and variance are modified differently.

In each step of the boosting algorithm, assuming the model becomes better at approximating the underlying function by a factor of `β`, where 0 < β < 1, we have

$$\big|\mathbb{E}\big[\hat{F}_m(x)\big] - f(x)\big| \le \beta\,\big|\mathbb{E}\big[\hat{F}_{m-1}(x)\big] - f(x)\big| \;\;\Longrightarrow\;\; \big|\mathbb{E}\big[\hat{F}_M(x)\big] - f(x)\big| \le \beta^{M}\,\big|\mathbb{E}\big[\hat{F}_0(x)\big] - f(x)\big|$$

so the bias shrinks geometrically with the number of iterations. The variance can be written as

$$\mathrm{Var}\big[\hat{F}_M(x)\big] = \sum_{m=1}^{M}\alpha_m^2\,\mathrm{Var}\big[\hat{h}_m(x)\big] + \sum_{i \ne j}\alpha_i\alpha_j\,\mathrm{Cov}\big[\hat{h}_i(x), \hat{h}_j(x)\big]$$

Assuming a variance of `τ` for each base model, a correlation between sequential models of at most `ρ`, and weights normalized so that the α_m sum to 1, we have

$$\mathrm{Var}\big[\hat{F}_M(x)\big] \le \tau\Big(\sum_{m=1}^{M}\alpha_m^2 + \rho\sum_{i \ne j}\alpha_i\alpha_j\Big) = \tau\Big((1-\rho)\sum_{m=1}^{M}\alpha_m^2 + \rho\Big)$$

The expression in parentheses on the right-hand side is less than 1: `ρ` is the ratio of the covariance to the variance, so 0 < ρ < 1 because we assumed sequential models are positively correlated and the covariance is always bounded in magnitude by the variance, and the sum of the squared normalized weights is less than 1 whenever more than one model carries weight. Thus, boosting can also reduce variance in many circumstances. In summary, boosting
- Reduces bias
- Reduces variance
In particular, variants such as AdaBoost can be rigorously shown to drive the training error down exponentially across iterations, provided each weak learner performs better than random guessing.
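To illustrate the additive, residual-correcting structure, here is a minimal L2-boosting-style sketch with decision stumps. The learning rate, number of rounds, and synthetic data are illustrative assumptions, and the weighting scheme is deliberately simpler than AdaBoost's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

M, lr = 100, 0.1           # boosting rounds and learning rate (shrinkage)
stumps, residual = [], y_tr.copy()
for _ in range(M):
    # Each new weak learner is fit to the current residual errors
    stump = DecisionTreeRegressor(max_depth=1).fit(x_tr, residual)
    stumps.append(stump)
    residual -= lr * stump.predict(x_tr)   # r <- r - lr * h_m(x)

def boosted_predict(x):
    # Additive model: F_M(x) = sum_m lr * h_m(x)
    return lr * np.sum([s.predict(x) for s in stumps], axis=0)

x_test = np.linspace(0, 1, 5).reshape(-1, 1)
print(boosted_predict(x_test))
```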
Stacking
In this technique, multiple different types of models (e.g., decision trees, SVMs, neural networks) are trained, and their predictions are combined by a meta-model g (itself another ML model):

$$\hat{f}_{\text{stack}}(x) = g\big(\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_M(x)\big)$$

Since the combined prediction can now be transformed to an arbitrary extent, as g is in general non-linear, both the bias and the variance terms in the MSE expression can be reduced. In fact, the additional training step introduced by the meta-learner directly minimizes the MSE of the combined prediction.
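As an illustration, the sketch below builds a small stack with scikit-learn's `StackingRegressor`; the particular base learners, the tree meta-learner, and the synthetic data are assumptions chosen only to mirror the structure described above.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

# First-level models of different types, combined by a meta-learner g
stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=4)),
        ("svr", SVR(kernel="rbf")),
        ("ridge", Ridge()),
    ],
    # A non-linear meta-learner g, fit on out-of-fold base predictions
    final_estimator=DecisionTreeRegressor(max_depth=3),
    cv=5,
)
stack.fit(x_tr, y_tr)
print(stack.predict(np.linspace(0, 1, 5).reshape(-1, 1)))
```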
Summary
This article explores ensemble learning methods, particularly their theoretical justifications for improving predictive performance. Ensemble methods, such as bagging, boosting, and stacking, combine multiple models to balance bias and variance in supervised learning.
- Bagging reduces variance by averaging predictions from multiple bootstrapped datasets while maintaining similar bias, leading to improved stability in the predictions.
- Boosting sequentially builds models to correct errors from previous iterations, resulting in reduced bias and variance, as each model incrementally improves the overall prediction quality.
- Stacking involves training diverse models and combining their predictions using a meta-learner. This approach allows for significant reductions in both bias and variance but would involve an expensive training step.
By effectively addressing the bias-variance tradeoff, ensemble learning methods leverage the strengths of various models to achieve better predictive accuracy.