Ensemble Learning
Ensemble learning is a machine learning approach that combines the predictions of multiple
models to improve predictive performance. Ensemble methods are a set of techniques for
building many models and then combining them to improve results. Compared with a single
model, an ensemble often yields more accurate predictions.
Wisdom of the Crowd
The idea behind ensemble learning is often described as the wisdom of the crowd: the
combined judgement of many different models tends to be more accurate than the judgement of
any single model, in the same way that the aggregated opinion of a group often beats that of
an individual.
Why Ensemble Learning
● Statistical Problem
The Statistical Problem occurs when the hypothesis space is too large for the amount of
available data. As a consequence, many hypotheses have the same accuracy on the training
data, but the learning algorithm chooses only one of them. There is a risk that the chosen
hypothesis will not be accurate on unseen data.
● Computational Problem
The Computational Problem occurs when the learning method cannot guarantee that
the best hypothesis will be identified.
● Representational Problem
The Representational Problem occurs when the hypothesis space does not contain a good
approximation of the target class(es).
History
Ensemble techniques are a foundation of many machine learning strategies. However, the
subject can be intimidating to those unfamiliar with it, so we asked Mike Bowles, a
machine learning specialist and serial entrepreneur, to offer some context.
Ensemble Methods are among the most effective and simple-to-use predictive analytics
algorithms, and the R programming language contains a large number of them, including the
best performers – Random Forest, Gradient Boosting, and Bagging – as well as big data
versions through Revolution Analytics.
The term "Ensemble Methods" refers to the practice of building a large number of slightly
different prediction models and then combining them through voting or averaging to attain
very high performance. Ensemble methods have been described as crowdsourcing for
machines. Bagging, Boosting, and Random Forest are all algorithms that try to improve
performance beyond that of a single binary decision tree, each in a different way.
To address the high variance and instability of binary decision trees, Bagging and Random
Forests were developed. Berkeley professor Leo Breiman is credited with coining the term
"bagging." Professor Breiman was a forerunner in the development of decision trees for
statistical learning, recognising that training and averaging a large number of trees on
different random subsets of data would minimise variance and improve stability. The term
comes from a contraction of "Bootstrap Aggregating," and the link to bootstrap sampling is
obvious.
Tin Kam Ho of Bell Labs devised Random Decision Forests, which are an example of a
random subspace technique. The purpose of Random Decision Forests was to train binary
decision trees on random subsets of features (random subsets of columns of the training data).
Breiman and Cutler's Random Forests technique combined random subsampling of rows
(Bagging) with random subsampling of columns. The R package randomForest was designed
by Professor Breiman and Adele Cutler.
Boosting techniques arose from the development of computational learning theory. The first
algorithm of this type was named "AdaBoost" by Freund and Schapire. In the introduction to
their paper, they give the example of friends who go to the race track regularly and
bet on the horses. One of the friends devises a scheme of betting a fraction of his
money with each of his friends and then adjusting the fractions based on the results, so that
his overall performance approaches that of his most successful friend.
For a long time, AdaBoost was the best example of a black box algorithm. A practitioner
could use it without much parameter tuning and get good results with
almost no overfitting. It was a little mysterious. In several of his Random Forests papers,
Professor Breiman compares performance to AdaBoost. Professor Jerome Friedman and his
Stanford colleagues Professors Hastie and Tibshirani published a paper in 2000
seeking to explain why AdaBoost was so successful. The paper caused a flurry of discussion.
The comments published with it were even longer than the article itself. The majority of the
comments were on whether boosting was just another way of minimising variance or whether it
was doing something different by focusing on error reduction.
Because of their understanding of AdaBoost, Professor Friedman and his colleagues were
able to formulate the boosting strategy more clearly. This led to a number of important
extensions and improvements beyond AdaBoost, including the ability to handle regression and
multiclass problems, as well as performance measures other than squared error. All of
these features (as well as others still in progress) are available in Greg Ridgeway's
excellent R package gbm.
Many data science applications now use ensemble methods as their base. According to Google
Trends, Random Forests has surged in popularity among modellers competing in Kaggle events
and has overtaken AdaBoost.
Types of Ensemble Learning
1. Sequential Ensemble Learning
In this method, each base learner depends on the results of the previous base learners. Each
succeeding base model corrects the errors in the previous model's predictions. As a
consequence, overall performance can be improved by giving more weight to examples
that earlier models labelled incorrectly.
2. Parallel Ensemble Learning
There is no dependency between the base learners in this approach. They all run at
the same time, and the outputs of all base models are combined at the end (using
averaging for regression and voting for classification problems).
1. Homogeneous parallel ensemble methods: a single machine learning algorithm is
used for all base learners.
2. Heterogeneous parallel ensemble methods: several different machine learning
algorithms are used as base learners.
Types of Ensemble Classifier
Bagging:
Bagging (Bootstrap Aggregation) reduces the variance of a decision tree. Given a set D of d
tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D
(i.e., a bootstrap sample). A classifier model Mi is then learned for each training set Di.
To classify an unknown sample X, each classifier Mi returns its class prediction, which
counts as one vote. The bagged classifier M* counts the votes and assigns X to the class
with the most votes.
Algorithm
1. Multiple subsets with the same number of tuples are created from the original data set
by sampling observations with replacement.
2. A base model is built on each of these subsets.
3. Each model is trained on its own training set, independently of the others and in parallel.
4. The predictions of all the models are combined to obtain the final prediction.
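The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the base model (`train_stump`, a one-feature threshold rule), the toy data, and all function names are assumptions made for the example, and only the standard library is used.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Step 1: draw len(data) tuples from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Step 2: hypothetical base model, a one-feature threshold rule.

    Every sampled x is tried as a threshold and the one with the fewest
    training errors is kept. The model predicts 1 when x > threshold.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum(int(x > threshold) != y for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda x: int(x > best_threshold)


def bagging_fit(data, n_models, seed=0):
    """Step 3: train each model on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Step 4: combine the individual predictions by voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data (assumed for illustration): the label is 1 when x > 5.
data = [(x, int(x > 5)) for x in range(11)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 9), bagging_predict(models, 1))
```

The models are trained in a plain loop here for brevity; because they do not depend on one another, the loop could be run in parallel.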
Random Forest:
Random Forest is an extension of bagging. Each classifier in the ensemble is a
decision tree classifier, and the split at each node is chosen from a random selection of
features. During classification, each tree casts a vote, and the most popular class is
returned.
Algorithm
1. Many subsets are created from the original data set by sampling observations with
replacement.
2. At each node, a subset of features is selected at random, and the feature that gives the
best split is used to divide the node. This is repeated recursively.
3. Each tree is grown to its maximum size.
4. The steps above are repeated for n trees, and the final prediction is obtained by
aggregating the predictions of the n trees.
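A compact sketch of these steps follows, assuming numeric features, integer class labels, and Gini impurity as the split criterion (the text does not name a criterion). The function names and the toy data are hypothetical. One simplification to note: a node becomes a leaf when its random feature subset offers no improving split.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among feature_ids, or None if no split helps."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def grow_tree(rows, labels, max_features, rng):
    """Steps 2 and 3: grow an unpruned tree with a random feature subset per node."""
    feature_ids = rng.sample(range(len(rows[0])), max_features)
    split = best_split(rows, labels, feature_ids)
    if split is None:  # pure node, or no improving split: make a leaf
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            grow_tree([r for r, _ in left], [y for _, y in left], max_features, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], max_features, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def forest_fit(rows, labels, n_trees, max_features, seed=0):
    """Steps 1 and 4: one bootstrap sample of the rows per tree."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], max_features, rng))
    return forest


def forest_predict(forest, row):
    """Each tree votes; the most popular class is returned."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Toy data (assumed for illustration): two well-separated groups of points.
rows = [(0, 1, 2), (1, 0, 1), (2, 2, 0), (1, 1, 1),
        (8, 9, 10), (9, 8, 9), (10, 10, 8), (9, 9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = forest_fit(rows, labels, n_trees=15, max_features=2)
print(forest_predict(forest, (0, 0, 0)), forest_predict(forest, (10, 10, 10)))
```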
Boosting
Boosting is a powerful ensemble method that reduces bias and variance by converting weak
learners into strong ones. Classification tree models are trained in sequence, each focusing
on the records that earlier models misclassified, and all classifiers are then combined by a
weighted majority vote. After each round, the method raises the weight of records that were
misclassified and lowers the weight of records that were classified correctly, forcing later
models to pay more attention to the misclassified records. Finally, the algorithm computes
the weighted vote for each class and assigns the record to the class with the highest total.
Boosting often produces more accurate models than bagging.
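The reweighting loop described above can be illustrated with a small AdaBoost-style sketch. It assumes labels of +1 and -1, one numeric feature, and threshold "stumps" as the weak learners. The helper names and the toy data are hypothetical, and the formula for `alpha` is the standard AdaBoost choice rather than something stated in this text.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: predicts polarity when x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Pick the (threshold, polarity) pair with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with equal weight on every record
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this model
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified records, lower the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted majority vote of all weak learners."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data (assumed): the positive class is an interval, so no single
# threshold rule can classify every record correctly.
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least three records, while the weighted vote of three stumps can represent the interval.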
Stacking
Stacking builds several classifiers on the same dataset, which consists of pairs of feature
vectors and class labels, using several different learning algorithms. In the first stage, a
set of base-level classifiers is built. In the second stage, a meta-level classifier is
learned that combines the outputs of the base-level classifiers.
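A minimal sketch of the two stages is shown below. Several choices here are assumptions made for illustration and do not come from the text: the base-level classifiers are single-feature threshold rules, the meta-level classifier is a simple lookup table over patterns of base outputs, and the meta-level classifier is trained on a separate split of the data (a common precaution against overfitting).

```python
from collections import Counter, defaultdict


def fit_threshold(rows, labels, feature):
    """Hypothetical base-level classifier: best threshold rule on one feature."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        errors = sum(int(r[feature] > threshold) != y for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, threshold)
    threshold = best[1]
    return lambda row: int(row[feature] > threshold)


def fit_meta(base_outputs, labels):
    """Hypothetical meta-level classifier: majority label per output pattern."""
    table = defaultdict(Counter)
    for pattern, y in zip(base_outputs, labels):
        table[pattern][y] += 1
    default = Counter(labels).most_common(1)[0][0]  # used for unseen patterns
    lookup = {p: c.most_common(1)[0][0] for p, c in table.items()}
    return lambda pattern: lookup.get(pattern, default)


def stacking_fit(base_rows, base_labels, meta_rows, meta_labels):
    # Stage 1: one base-level classifier per feature.
    base_models = [fit_threshold(base_rows, base_labels, f)
                   for f in range(len(base_rows[0]))]
    # Stage 2: the meta-level classifier learns from the base outputs.
    base_outputs = [tuple(m(r) for m in base_models) for r in meta_rows]
    meta_model = fit_meta(base_outputs, meta_labels)
    return lambda row: meta_model(tuple(m(row) for m in base_models))


# Toy data (assumed): the label is 1 only when both features are large,
# which no single-feature rule can express on its own.
base_rows, base_labels = [(2, 2), (2, 8), (8, 2), (8, 8)], [0, 0, 0, 1]
meta_rows, meta_labels = [(1, 1), (1, 9), (9, 1), (9, 9)], [0, 0, 0, 1]
predict = stacking_fit(base_rows, base_labels, meta_rows, meta_labels)
print(predict((9, 9)), predict((9, 0)))
```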
Bootstrapping
Bagging (also known as Bootstrap Aggregation) is a simple yet effective ensemble strategy.
In bagging, the bootstrap procedure is applied to a high-variance machine learning
algorithm, such as decision trees. Consider the following scenario: there are N observations
and M features.
To build each model, a sample of observations and a subset of features are drawn at random.
From that subset, the feature that gives the best split on the training data is chosen.
This is repeated to create a large number of models, which can be trained in parallel.
To make a prediction, the predictions of all the models are combined.
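As a short worked calculation, assuming each bootstrap sample consists of N draws with replacement from the N observations, the chance that a given observation is left out of a sample is:

```latex
\[
P(\text{observation } i \text{ is never drawn})
  = \left(1 - \frac{1}{N}\right)^{N}
  \;\xrightarrow{\,N \to \infty\,}\; e^{-1} \approx 0.368
\]
```

So each bootstrap sample contains roughly 63% of the distinct observations, and the models differ from one another because each one sees a different subset.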
Gradient Boosting
Gradient boosting algorithms are powerful prediction techniques. XGBoost, LightGBM, and
CatBoost are popular gradient boosting libraries for regression and classification
problems. Their popularity surged after they were used to win a number of Kaggle
competitions.
Ensemble Combination Rules
1. Majority Voting
For each test instance, each model makes a prediction (vote), and the final output is
the prediction that receives more than half of the votes. If no prediction receives more
than half of the votes, we may say that the ensemble was unable to produce
a stable prediction for that instance. Although majority voting is widely used, you may
instead take the prediction with the most votes as the final output (even if it received
fewer than half of the votes). Some publications call this variant "plurality voting".
2. Weighted Voting
Unlike majority voting, where every model has the same weight, weighted voting lets us
increase the importance of one or more models. The predictions of the better models
are counted multiple times. It is up to you to find a reasonable set of
weights.
3. Simple Averaging
In simple averaging, the predictions of all models are averaged for each
instance of the test dataset. This method is widely used because it often reduces
overfitting and produces a smoother regression model.
4. Weighted Averaging
Weighted averaging is a slightly modified variant of simple averaging in which each
model's predictions are multiplied by that model's weight before the average is calculated.
5. Probability Averaging
In this ensemble approach, the probability scores of the individual models are computed
first. The scores are then averaged over all the models for each
class.
A probability score expresses the confidence of a model in its prediction. So, to get
a final probability score for the ensemble, we are pooling the confidences of the various
models. The predicted class is the class with the highest probability
after averaging.
6. Max Rule
The "Max Rule" ensemble strategy is based on the probability distributions produced by
each classifier. For multi-class classification problems, this approach can work better
than majority voting because it uses the classifiers' "confidence in prediction".
For each classifier, the confidence score of its predicted class is checked.
The ensemble then outputs the class predicted by the classifier with the
highest confidence score.
7. Mixture of Experts
The "Mixture of Experts" family of ensembles trains a number of classifiers, whose
outputs are then combined using a generalised linear rule.
The weights assigned to these combinations are determined by a "Gating Network",
which is itself a trainable model, usually a neural network.
This approach is typically used when the classifiers are trained on different
parts of the feature space.
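Most of the combination rules above take only a few lines each. The sketch below covers rules 1 to 6 (the mixture of experts is omitted because it needs a trained gating network). The function names and example inputs are illustrative assumptions. Class probabilities are given as one list per model, indexed by class.

```python
from collections import Counter


def majority_vote(votes):
    """Label with more than half of the votes, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def plurality_vote(votes):
    """Label with the most votes, even if that is under half."""
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    """Each model's vote counts as many times as its weight."""
    totals = Counter()
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def weighted_average(predictions, weights=None):
    """Simple averaging when weights is None, weighted averaging otherwise."""
    if weights is None:
        weights = [1.0] * len(predictions)
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)


def probability_average(prob_rows):
    """Average each class probability over the models, return the best class."""
    n_classes = len(prob_rows[0])
    mean = [sum(row[c] for row in prob_rows) / len(prob_rows)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def max_rule(prob_rows):
    """Class predicted by the single most confident model."""
    most_confident = max(prob_rows, key=max)
    return most_confident.index(max(most_confident))


votes = ["cat", "dog", "cat", "bird"]
print(majority_vote(votes), plurality_vote(votes))
print(weighted_vote(votes, [1, 3, 1, 1]))

probs = [[0.65, 0.35], [0.65, 0.35], [0.3, 0.7]]
print(probability_average(probs), max_rule(probs))
```

In the last example the two probability-based rules disagree: averaging favours class 0 because two of the three models lean that way, while the max rule follows the single most confident model and returns class 1.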
Careful Considerations
1. Noise, Bias, and Variance: Combining the decisions of many models can help
improve overall performance. One of the main reasons to use ensemble models is
therefore to overcome noise, bias, and variance. If the ensemble does not improve
accuracy over its individual models, it is worth examining whether using an ensemble
is necessary at all.
2. Simplicity and Explainability: Machine learning models, especially those deployed in
production, should be simple to understand. When you use an ensemble
model, you are far less likely to be able to explain the final model
result.
3. Generalisation: Ensemble models are often claimed to generalise better, but some
reported use cases have shown higher generalisation errors. Ensemble models built
without a careful training procedure can therefore quickly lead to heavily
overfitted models.
4. Inference Time: While we may be willing to accept a longer model training time,
inference time is still critical. When ensemble models are used in production, the time
needed to run an input through several models increases, which can reduce prediction
throughput.
Advantages/Benefits of ensemble methods
1. Ensemble techniques usually achieve higher predictive accuracy than the individual
models they are built from.
2. Ensemble techniques can reduce bias and variance, so the combined model is less likely
to be underfitted or overfitted.
3. Individual models are usually noisier and less stable than model ensembles.
Disadvantages of Ensemble learning
1. The output of an ensemble model is difficult to predict and explain, and ensembling
is harder to understand than a single model. As a result, it is difficult to sell the
ensemble concept and to gain useful business insights from it.
2. Ensembling is a difficult technique to master, and a wrong choice can result in a model
that is less accurate than a single model.
3. Ensembling is costly in terms of both time and space. As a result, it can reduce
your return on investment.
Applications Of Ensemble Methods
1. Ensemble techniques can be used as a diagnostic tool for more conventional model
building. The larger the gap in fit quality between one of the stronger
ensemble techniques and a conventional statistical model, the more information the
conventional model is likely to be missing.
2. In traditional statistical models, ensemble approaches may be used to investigate the
correlations between explanatory factors and response. An ensemble technique may
reveal predictors or basis functions that would have been missed by a traditional
model.
3. With an ensemble technique, the selection process may be captured better and the
probability of membership in each treatment group estimated with less bias.
4. Ensemble techniques may be used to obtain the covariance adjustments inherent in
multiple regression and related approaches. To "residualize" the response and the
predictors of interest, ensemble techniques would be utilised.