On the Theory Behind Ensemble Learning Techniques
Bagging, Boosting, Stacking and why they work
Introduction
I recently attended a talk at PyCon ’24 on ensemble learning methods. Although the techniques themselves are fairly simple, the talk made me curious whether there is any simple theoretical justification for the benefits such methods offer.
To briefly recap the discussed methods:
- Bagging — This has two steps. First, multiple datasets are created by sampling the training data with replacement, and a model of the same type is trained on each. Second, the results are combined, typically by averaging the individual predictions.
- Boosting — The combined result is obtained by a weighted average of the models, with the weights adjusted iteratively based on the errors made in the previous step.
- Stacking — Instead of averaging the base models' results directly as in bagging, a second model (a meta-learner) is trained to combine the predictions of the first-level models. Note that the underlying models are trained on the same dataset but can be of different types.
Bias-Variance Decomposition
Supervised learning methods usually minimize the mean-square error between the prediction and ground truth. This error can be decomposed as follows:
$$\mathrm{MSE} = \mathbb{E}\big[(y - \hat{f}(x))^2\big]$$

where y represents the target value and f̂(x) stands for the prediction at x. This notation is chosen because we will use f to stand for the true underlying function, devoid of the irreducible noise. The target is then written as

$$y = f(x) + \varepsilon$$

with 𝔼[ε] = 0. Using the above two equations,

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f(x) - \hat{f}(x))\big] + \mathbb{E}\big[\varepsilon^2\big]$$

and as ε is independent of f(x) − f̂(x), the second term vanishes because 𝔼[ε] = 0. Also, let 𝔼[ε²] = σ², which represents the variance of the noise. The first term can be expanded as

$$\mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] = \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big]$$

Again, on expanding the square, the middle (cross) term vanishes as

$$\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] = \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\,\mathbb{E}\big[\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big] = 0$$

The final expression for the mean-square error then becomes

$$\mathrm{MSE} = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{Irreducible error}}$$
High bias results from overly simplistic models that make strong assumptions about the data, while high variance arises from overly complex models that overfit to noise in the training data.
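To make the decomposition concrete, here is a minimal Monte Carlo sketch (not from the original talk) that estimates the squared bias and the variance of a model at a single query point by retraining it on many freshly sampled training sets. The function `true_f`, the noise level, and the depth-4 decision tree are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function f(x); any smooth function works here.
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # std. dev. of the irreducible noise epsilon
n_train, n_runs = 50, 500
x_test = np.array([[0.35]])      # a single query point x

preds = []
for _ in range(n_runs):
    # Fresh training set: y = f(x) + eps, with E[eps] = 0
    x_tr = rng.uniform(0, 1, size=(n_train, 1))
    y_tr = true_f(x_tr).ravel() + rng.normal(0, sigma, size=n_train)
    model = DecisionTreeRegressor(max_depth=4).fit(x_tr, y_tr)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (true_f(x_test)[0, 0] - preds.mean()) ** 2   # (f(x) - E[f_hat(x)])^2
variance = preds.var()                                 # E[(f_hat(x) - E[f_hat(x)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, noise ~ {sigma**2:.4f}")
```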
Ensemble learning primarily addresses the bias-variance tradeoff in supervised learning. These methods combine the strengths of existing models so that the resulting ensemble has lower bias, lower variance, or both.
The remainder of the article will discuss how the ensemble learning methods affect various terms in the bias-variance decomposition.
Bagging
To briefly recap, bagging involves training multiple versions of a base model on different bootstrapped datasets and averaging their predictions.
Assuming there are M base models, the bagged prediction is the average

$$\hat{f}_{\text{bag}}(x) = \frac{1}{M}\sum_{m=1}^{M}\hat{f}_m(x)$$

Then, the bias for this combined model is given by

$$\mathbb{E}\big[\hat{f}_{\text{bag}}(x)\big] - f(x) = \frac{1}{M}\sum_{m=1}^{M}\Big(\mathbb{E}\big[\hat{f}_m(x)\big] - f(x)\Big) = \mathbb{E}\big[\hat{f}_1(x)\big] - f(x)$$

since the base models are identically distributed, so the ensemble has the same bias as a single base model. Similarly, the variance can be written as

$$\mathrm{Var}\big[\hat{f}_{\text{bag}}(x)\big] = \frac{1}{M^2}\sum_{m=1}^{M}\mathrm{Var}\big[\hat{f}_m(x)\big] + \frac{1}{M^2}\sum_{i \ne j}\mathrm{Cov}\big[\hat{f}_i(x), \hat{f}_j(x)\big]$$

and assuming the base models are only weakly correlated or, even better, uncorrelated, we have

$$\mathrm{Var}\big[\hat{f}_{\text{bag}}(x)\big] \approx \frac{1}{M^2}\sum_{m=1}^{M}\mathrm{Var}\big[\hat{f}_m(x)\big] = \frac{\mathrm{Var}\big[\hat{f}_1(x)\big]}{M}$$
Thus, bagging with averaging
- Maintains similar bias
- But reduces variance, by a factor of up to M when the base models are uncorrelated
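As a concrete illustration, below is a minimal, hand-rolled sketch of bagging, assuming a `DecisionTreeRegressor` base model and the same kind of synthetic data as above; in practice a library implementation such as scikit-learn's `BaggingRegressor` performs the same bootstrap-and-average procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic training data: y = f(x) + noise (assumed for illustration)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

M = 25                     # number of bootstrapped base models
models = []
for _ in range(M):
    # Sample row indices with replacement (a bootstrap sample)
    idx = rng.integers(0, len(x_tr), size=len(x_tr))
    models.append(DecisionTreeRegressor(max_depth=6).fit(x_tr[idx], y_tr[idx]))

def bagged_predict(x):
    # Average the M base-model predictions
    return np.mean([m.predict(x) for m in models], axis=0)

x_test = np.linspace(0, 1, 5).reshape(-1, 1)
print(bagged_predict(x_test))
```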
Boosting
In this approach, models are built sequentially, where each new model is trained to correct the residual errors of the previous models. Boosting can be written as an additive model

$$\hat{F}_M(x) = \sum_{m=1}^{M}\alpha_m\,\hat{h}_m(x)$$

where the weights α_m are obtained from the training errors in the previous iteration. Depending on the variant used for deciding the weights, the bias and variance are modified differently.

In each step of the boosting algorithm, assuming the model becomes better at approximating the underlying function by a factor of `β`, where 0 < β < 1, we have

$$\big|\mathbb{E}\big[\hat{F}_m(x)\big] - f(x)\big| \le \beta\,\big|\mathbb{E}\big[\hat{F}_{m-1}(x)\big] - f(x)\big| \;\;\Longrightarrow\;\; \big|\mathbb{E}\big[\hat{F}_M(x)\big] - f(x)\big| \le \beta^{M}\,\big|\mathbb{E}\big[\hat{F}_0(x)\big] - f(x)\big|$$

so the bias shrinks geometrically with the number of iterations. The variance can be written as

$$\mathrm{Var}\big[\hat{F}_M(x)\big] = \sum_{m=1}^{M}\alpha_m^2\,\mathrm{Var}\big[\hat{h}_m(x)\big] + \sum_{i \ne j}\alpha_i\alpha_j\,\mathrm{Cov}\big[\hat{h}_i(x), \hat{h}_j(x)\big]$$

Assuming a variance of `τ` for each base model, a correlation between sequential models of at most `ρ`, and weights normalized so that the α_m sum to 1, we have

$$\mathrm{Var}\big[\hat{F}_M(x)\big] \le \tau\Big(\sum_{m=1}^{M}\alpha_m^2 + \rho\sum_{i \ne j}\alpha_i\alpha_j\Big) = \tau\Big((1-\rho)\sum_{m=1}^{M}\alpha_m^2 + \rho\Big)$$

The expression in parentheses on the right-hand side is less than 1: `ρ` is the ratio of the covariance to the variance, so 0 < ρ < 1 because we assumed sequential models are positively correlated and the covariance is always bounded in magnitude by the variance, and the sum of the squared normalized weights is less than 1 whenever more than one model carries weight. Thus, boosting can also reduce variance in many circumstances. In summary, boosting
- Reduces bias
- Reduces variance
In particular, variants such as AdaBoost can be rigorously shown to drive the training error down exponentially across iterations, provided each weak learner performs better than random guessing.
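To illustrate the additive, residual-correcting structure, here is a minimal L2-boosting-style sketch with decision stumps. The learning rate, number of rounds, and synthetic data are illustrative assumptions, and the weighting scheme is deliberately simpler than AdaBoost's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

M, lr = 100, 0.1           # boosting rounds and learning rate (shrinkage)
stumps, residual = [], y_tr.copy()
for _ in range(M):
    # Each new weak learner is fit to the current residual errors
    stump = DecisionTreeRegressor(max_depth=1).fit(x_tr, residual)
    stumps.append(stump)
    residual -= lr * stump.predict(x_tr)   # r <- r - lr * h_m(x)

def boosted_predict(x):
    # Additive model: F_M(x) = sum_m lr * h_m(x)
    return lr * np.sum([s.predict(x) for s in stumps], axis=0)

x_test = np.linspace(0, 1, 5).reshape(-1, 1)
print(boosted_predict(x_test))
```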
Stacking
In this technique, multiple different types of models (e.g., decision trees, SVMs, neural networks) are trained, and their predictions are combined by a meta-model g (itself another ML model):

$$\hat{f}_{\text{stack}}(x) = g\big(\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_M(x)\big)$$

Since the combined prediction can now be transformed to an arbitrary extent, as g is in general non-linear, both the bias and the variance terms in the MSE expression can be reduced. In fact, the additional training step introduced by the meta-learner directly minimizes the MSE of the combined prediction.
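As an illustration, the sketch below builds a small stack with scikit-learn's `StackingRegressor`; the particular base learners, the tree meta-learner, and the synthetic data are assumptions chosen only to mirror the structure described above.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
x_tr = rng.uniform(0, 1, size=(200, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, size=200)

# First-level models of different types, combined by a meta-learner g
stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=4)),
        ("svr", SVR(kernel="rbf")),
        ("ridge", Ridge()),
    ],
    # A non-linear meta-learner g, fit on out-of-fold base predictions
    final_estimator=DecisionTreeRegressor(max_depth=3),
    cv=5,
)
stack.fit(x_tr, y_tr)
print(stack.predict(np.linspace(0, 1, 5).reshape(-1, 1)))
```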
Summary
This article explores ensemble learning methods, particularly their theoretical justifications for improving predictive performance. Ensemble methods, such as bagging, boosting, and stacking, combine multiple models to balance bias and variance in supervised learning.
- Bagging reduces variance by averaging predictions from multiple bootstrapped datasets while maintaining similar bias, leading to improved stability in the predictions.
- Boosting sequentially builds models to correct errors from previous iterations, resulting in reduced bias and variance, as each model incrementally improves the overall prediction quality.
- Stacking involves training diverse models and combining their predictions using a meta-learner. This approach allows for significant reductions in both bias and variance but would involve an expensive training step.
By effectively addressing the bias-variance tradeoff, ensemble learning methods leverage the strengths of various models to achieve better predictive accuracy.