Lecture 5.
Ensemble Learning
Crowd intelligence
Joaquin Vanschoren
Ensemble learning
If different models make different mistakes, can we simply average the predictions?
Voting Classifier: gives every model a vote on the class label
Hard vote: majority class wins (class order breaks ties)
Soft vote: sum the class probabilities over all models and predict the class with the highest total:
ŷ = argmax_c ∑_{m=1}^{M} w_c p_{m,c}
Classes can get different weights (default: w_c = 1); see the sketch below
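A minimal scikit-learn sketch of both voting schemes (the base models and the synthetic dataset are illustrative choices, not prescribed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [("lr", LogisticRegression(max_iter=1000)),
          ("tree", DecisionTreeClassifier(max_depth=5)),
          ("knn", KNeighborsClassifier())]

hard = VotingClassifier(models, voting="hard")   # majority class wins
soft = VotingClassifier(models, voting="soft",   # sum the predicted class probabilities
                        weights=[2, 1, 1])       # optional per-model weights
for clf in (hard, soft):
    clf.fit(X_train, y_train)
    print(clf.voting, clf.score(X_test, y_test))
```

Note that scikit-learn's weights parameter weighs the individual models rather than the classes.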
Why does this work?
Different models may be good at different 'parts' of data (even if they underfit)
Individual mistakes can be 'averaged out' (especially if models overfit)
Which models should be combined?
Bias-variance analysis teaches us that we have two options:
If model underfits (high bias, low variance): combine with other low-variance models
Need to be different: 'experts' on different parts of the data
Bias reduction. Can be done with Boosting
If model overfits (low bias, high variance): combine with other low-bias models
Need to be different: individual mistakes must be different
Variance reduction. Can be done with Bagging
Models must be uncorrelated but good enough (otherwise the ensemble is worse)
We can also learn how to combine the predictions of different models: Stacking
Decision trees (recap)
Representation: Tree that splits data points into leaves based on tests
Evaluation (loss): Heuristic for purity of leaves (Gini index, entropy,...)
Optimization: Recursive, heuristic greedy search (Hunt's algorithm)
Consider all splits (thresholds) between adjacent data points, for every feature
Choose the one that yields the purest leaves, repeat
Evaluation (loss function for classification)
Every leaf predicts a class probability p̂_c: the relative frequency of class c in that leaf
Leaf impurity measures (splitting criteria) for L leaves, where leaf l has data X_l:
Gini index: Gini(X_l) = ∑_{c≠c′} p̂_c p̂_{c′}
Entropy (more expensive): E(X_l) = − ∑_c p̂_c log₂(p̂_c)
The best split maximizes the information gain (idem for the Gini index):
Gain(X, X_i) = E(X) − ∑_{l=1}^{L} (|X_{i=l}| / |X_i|) E(X_{i=l})
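A small NumPy sketch of these impurity measures and the information gain of a binary split (the function names are my own, for illustration):

```python
import numpy as np

def entropy(y):
    """E(X_l) = -sum_c p_c * log2(p_c) over the class frequencies in y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                              # avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini(X_l) = sum_{c != c'} p_c * p_c' = 1 - sum_c p_c^2."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y, split_mask):
    """Entropy gain of splitting labels y into the two groups defined by a boolean mask."""
    n = len(y)
    left, right = y[split_mask], y[~split_mask]
    return entropy(y) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

y = np.array([0, 0, 1, 1, 1, 0])
mask = np.array([True, True, True, False, False, False])
print(gini(y), entropy(y), information_gain(y, mask))
```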
Regression trees
Every leaf predicts the mean target value μ of all the points in that leaf
Choose the split that minimizes the squared error of the leaves: ∑_{x_i ∈ l} (y_i − μ)²
Yields non-smooth step-wise predictions, cannot extrapolate
Impurity/Entropy-based feature importance
We can measure the importance of features (to the model) based on
Which features we split on
How high up in the tree we split on them (first splits are more important)
Under- and overfitting
We can easily control the (maximum) depth of the trees as a hyperparameter
Bias-variance analysis:
Shallow trees have high bias but very low variance (underfitting)
Deep trees have high variance but low bias (overfitting)
Because we can easily control their complexity, they are ideal for ensembling
Deep trees: keep low bias, reduce variance with Bagging
Shallow trees: keep low variance, reduce bias with Boosting
Bagging (Bootstrap Aggregating)
Obtain different models by training the same model on different training samples
Reduce overfitting by averaging out individual predictions (variance reduction)
In practice: take I bootstrap samples of your data, train a model on each bootstrap sample
Higher I: more models, more smoothing (but slower training and prediction)
Base models should be unstable: different training samples yield different models
E.g. very deep decision trees, or even randomized decision trees
Deep Neural Networks can also benefit from bagging (deep ensembles)
Prediction by averaging predictions of base models
Soft voting for classification (possibly weighted)
Mean value for regression
Can produce uncertainty estimates as well
By combining class probabilities of individual models (or variances for regression)
Random Forests
Uses randomized trees to make models even less correlated (more unstable)
At every split, only consider max_features features, randomly selected
Extremely randomized trees: consider 1 random threshold for a random set of features (faster)
Effect on bias and variance
Increasing the number of models (trees) decreases variance (less overfitting)
Bias is mostly unaffected, but will increase if the forest becomes too large (oversmoothing)
In practice
Different implementations can be used. E.g. in scikit-learn:
BaggingClassifier : Choose your own base model and sampling procedure
RandomForestClassifier : Default implementation, many options
ExtraTreesClassifier : Uses extremely randomized trees
Most important parameters:
n_estimators (>100, higher is better, but diminishing returns)
Will start to underfit (bias error component increases slightly)
max_features
Defaults:
sqrt(p) for classification,
log2(p) for regression
Set smaller to reduce space/time requirements
parameters of trees, e.g. max_depth , min_samples_split ,...
Prepruning useful to reduce model size, but don't overdo it
Easy to parallelize (set n_jobs to -1)
Fix random_state (bootstrap samples) for reproducibility
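A short sketch putting these parameters together (the dataset and the specific values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # higher is better, with diminishing returns
    max_features="sqrt",   # number of features considered at each split
    max_depth=None,        # prepruning: set a limit to reduce model size
    n_jobs=-1,             # parallelize over all cores
    random_state=0,        # fix the bootstrap samples for reproducibility
)
print(cross_val_score(forest, X, y, cv=5).mean())
```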
Out-of-bag error
RandomForests don't need cross-validation: you can use the out-of-bag (OOB) error
For each tree grown, about 33% of samples are out-of-bag (OOB)
Remember which are OOB samples for every model, do voting over these
OOB error estimates are great to speed up model selection
As good as CV estimates, although slightly pessimistic
In scikit-learn: oob_error = 1 - clf.oob_score_
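A minimal sketch of using the OOB score instead of cross-validation (values chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = RandomForestClassifier(n_estimators=300, oob_score=True,
                             n_jobs=-1, random_state=0).fit(X, y)
oob_error = 1 - clf.oob_score_   # generalization error estimate without cross-validation
print(oob_error)
```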
Feature importance
RandomForests provide more reliable feature importances, based on many alternative hypotheses (trees)
Other tips
Model calibration
RandomForests are poorly calibrated.
Calibrate afterwards (e.g. isotonic regression) if you aim to use probabilities
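A sketch of post-hoc calibration with isotonic regression via scikit-learn's CalibratedClassifierCV (the exact setup is an assumption; any held-out or cross-validated scheme works):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the forest; internal cross-validation is used to fit the isotonic calibrator
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
print(calibrated.predict_proba(X_test)[:3])   # calibrated class probabilities
```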
Warm starting
Given an ensemble trained for I iterations, you can simply add more models later
You warm start from the existing model instead of re-starting from scratch
Can be useful to train models on new, closely related data
Not ideal if the data batches change over time (concept drift)
Boosting is more robust against this (see later)
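A sketch of warm starting a forest by increasing n_estimators (the parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

forest = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X, y)                  # trains the first 100 trees

forest.set_params(n_estimators=200)
forest.fit(X, y)                  # adds 100 more trees, keeping the existing ones
print(len(forest.estimators_))    # 200
```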
Strength and weaknesses
RandomForests are among the most widely used algorithms:
Don't require a lot of tuning
Typically very accurate
Handles heterogeneous features well (trees)
Implicitly selects the most relevant features
Downsides:
Less interpretable, slower to train (but parallelizable)
Don't work well on high-dimensional sparse data (e.g. text)
Adaptive Boosting (AdaBoost)
Obtain different models by reweighting the training data every iteration
Reduce underfitting by focusing on the 'hard' training examples
Increase weights of instances misclassified by the ensemble, and vice versa
Base models should be simple so that different instance weights lead to different models
Underfitting models: decision stumps (or very shallow trees)
Each is an 'expert' on some parts of the data
Additive model: predictions at iteration I are the sum of the base model predictions
In AdaBoost, each model g_i also gets its own weight w_i:
f_I(x) = ∑_{i=1}^{I} w_i g_i(x)
AdaBoost minimizes the exponential loss. For the instance-weighted error ε:
L_{Exp} = ∑_{n=1}^{N} e^{ε(f_I(x))}
By deriving ∂L/∂w_i you can find that the optimal w_i = ½ log((1 − ε)/ε)
AdaBoost algorithm
Initialize the sample weights: s_{n,0} = 1/N
Build a model (e.g. decision stumps) using these sample weights
Give the model a weight w_i related to its weighted error rate ε:
w_i = λ log((1 − ε) / ε)
Good trees get more weight than bad trees
The logit function maps the error ε from [0,1] to a weight in [−∞, ∞] (use a small minimum error)
The learning rate (shrinkage) λ decreases the impact of individual classifiers
Small updates are often better but require more iterations
Update the sample weights
Increase the weight of incorrectly predicted samples: s_{n,i+1} = s_{n,i} e^{w_i}
Decrease the weight of correctly predicted samples: s_{n,i+1} = s_{n,i} e^{−w_i}
Normalize the weights to add up to 1
Repeat for I iterations (see the sketch below)
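A from-scratch sketch of these steps for (discrete) AdaBoost with decision stumps, assuming labels in {−1, +1} and using the optimal weight ½ log((1 − ε)/ε) from above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_iterations=50):
    """Discrete AdaBoost sketch; y must contain labels in {-1, +1}."""
    n = len(y)
    s = np.full(n, 1 / n)                                     # s_{n,0} = 1/N
    stumps, model_weights = [], []
    for _ in range(n_iterations):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=s)
        pred = stump.predict(X)
        eps = np.clip(s[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error, kept away from 0 and 1
        w = 0.5 * np.log((1 - eps) / eps)                     # model weight: good stumps weigh more
        s = s * np.exp(-w * y * pred)                         # e^{+w} for mistakes, e^{-w} for correct
        s = s / s.sum()                                       # normalize weights to sum to 1
        stumps.append(stump)
        model_weights.append(w)
    return stumps, model_weights

def adaboost_predict(stumps, model_weights, X):
    scores = sum(w * stump.predict(X) for stump, w in zip(stumps, model_weights))
    return np.sign(scores)

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                                                 # map {0, 1} labels to {-1, +1}
stumps, model_weights = adaboost_fit(X, y)
print((adaboost_predict(stumps, model_weights, X) == y).mean())
```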
AdaBoost variants
Discrete AdaBoost: ε is simply the error rate (1 − Accuracy)
Real AdaBoost: ε is based on the predicted class probabilities p̂_c (better)
AdaBoost for regression: ε is either the linear (|y_i − ŷ_i|), squared ((y_i − ŷ_i)²), or exponential loss
GentleBoost: adds a bound on the model weights w_i
LogitBoost: minimizes the logistic loss instead of the exponential loss:
L_{Logistic} = ∑_{n=1}^{N} log(1 + e^{ε(f_I(x))})
AdaBoost in action
Size of the samples represents sample weight
Background shows the latest tree's predictions
Examples
Bias-Variance analysis
AdaBoost reduces bias (and a little variance)
Boosting is a bias reduction technique
Boosting too much will eventually increase variance
Gradient Boosting
Ensemble of models, each fixing the remaining mistakes of the previous ones
Each iteration, the task is to predict the residual error of the ensemble
Additive model: predictions at iteration I are the sum of the base model predictions
Learning rate (or shrinkage) η: small updates work better (reduces variance)
f_I(x) = g_0(x) + ∑_{i=1}^{I} η ⋅ g_i(x) = f_{I−1}(x) + η ⋅ g_I(x)
The pseudo-residuals r_i are computed according to a differentiable loss function
E.g. least squares loss for regression and log loss for classification
Gradient descent: the predictions get updated step by step until convergence
g_i(x) ≈ r_i = − ∂L(y_i, f_{i−1}(x_i)) / ∂f_{i−1}(x_i)
Base models g_i should be low variance, but flexible enough to predict the residuals accurately
E.g. decision trees of depth 2-5
Gradient Boosting Trees (Regression)
Base models are regression trees, and the loss function is the square loss: L = ½ (y_i − ŷ_i)²
The pseudo-residuals are simply the prediction errors for every sample:
r_i = − ∂L/∂ŷ_i = −2 ⋅ ½ (y_i − ŷ_i) ⋅ (−1) = y_i − ŷ_i
The initial model g_0 simply predicts the mean of y
For iteration m = 1..M (sketched in code below):
For all samples i = 1..n, compute the pseudo-residuals r_i = y_i − ŷ_i
Fit a new regression tree model g_m(x) to r_i
In g_m(x), each leaf predicts the mean of all its values
Update the ensemble predictions: ŷ = g_0(x) + ∑_{m=1}^{M} η ⋅ g_m(x)
Early stopping (optional): stop when the performance on a validation set does not improve for a number of iterations
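A from-scratch sketch of the algorithm above with square loss; the tree depth, learning rate, and number of iterations are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def gb_regression_fit(X, y, n_iterations=100, eta=0.1, max_depth=3):
    f = np.full(len(y), y.mean())            # g_0: predict the mean of y
    trees = []
    for _ in range(n_iterations):
        residuals = y - f                    # pseudo-residuals for square loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + eta * tree.predict(X)        # shrunk update of the ensemble prediction
        trees.append(tree)
    return y.mean(), trees

def gb_regression_predict(g0, trees, X, eta=0.1):
    # use the same eta as during fitting
    return g0 + eta * sum(tree.predict(X) for tree in trees)

X, y = make_regression(n_samples=500, noise=10, random_state=0)
g0, trees = gb_regression_fit(X, y)
print(np.mean((gb_regression_predict(g0, trees, X) - y) ** 2))
```

With early stopping, one would additionally monitor the loss on a validation set each iteration and stop when it no longer improves.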
Gradient Boosting Regression in action
Residuals quickly drop to (near) zero
GradientBoosting Algorithm (Classification)
Base models are regression trees that predict the probability p of the positive class
For multi-class problems, train one tree per class
Use the (binary) log loss, with true class y_i ∈ {0, 1}:
L_{log} = − ∑_{i=1}^{N} [y_i log(p_i) + (1 − y_i) log(1 − p_i)]
The pseudo-residuals are simply the difference between the true class and the predicted probability:
r_i = − ∂L/∂ŷ = − ∂L/∂log(p_i) = y_i − p_i
The initial model g_0 predicts the log-odds: log(#positives / #negatives)
For iteration m = 1..M (a simplified sketch follows below):
For all samples i = 1..n, compute the pseudo-residuals r_i = y_i − p_i
Fit a new regression tree model g_m(x) to r_i
In g_m(x), each leaf predicts ∑_i r_i / ∑_i p_i(1 − p_i)
Update the ensemble predictions: ŷ = g_0(x) + ∑_{m=1}^{M} η ⋅ g_m(x)
Early stopping (optional): stop when the performance on a validation set does not improve for a number of iterations
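A simplified from-scratch sketch of the classification variant: it fits regression trees to the residuals y_i − p_i and updates the log-odds, but deliberately skips the per-leaf correction ∑ r_i / ∑ p_i(1 − p_i) to keep the code short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gb_classification_fit(X, y, n_iterations=100, eta=0.1, max_depth=3):
    """Binary gradient boosting on log loss; y in {0, 1}. Simplified: no leaf-value correction."""
    pos, neg = y.sum(), len(y) - y.sum()
    g0 = np.log(pos / neg)                   # initial log-odds
    f = np.full(len(y), g0, dtype=float)
    trees = []
    for _ in range(n_iterations):
        p = sigmoid(f)                       # current predicted probabilities
        residuals = y - p                    # pseudo-residuals for log loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + eta * tree.predict(X)        # gradient step in log-odds space
        trees.append(tree)
    return g0, trees

def gb_classification_predict_proba(g0, trees, X, eta=0.1):
    return sigmoid(g0 + eta * sum(tree.predict(X) for tree in trees))

X, y = make_classification(n_samples=500, random_state=0)
g0, trees = gb_classification_fit(X, y)
print(((gb_classification_predict_proba(g0, trees, X) > 0.5) == y).mean())
```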
Gradient Boosting Classification in action
Size of the samples represents the residual weights: most quickly drop to (near) zero
Examples
Bias-variance analysis
Gradient Boosting is very effective at reducing bias error
Boosting too much will eventually increase variance
Feature importance
Gradient Boosting also provides feature importances, based on many trees
Compared to RandomForests, the trees are smaller, hence more features have zero importance
Gradient Boosting: strengths and weaknesses
Among the most powerful and widely used models
Work well on heterogeneous features and different scales
Typically better than random forests, but requires more tuning, longer training
Does not work well on high-dimensional sparse data
Main hyperparameters:
n_estimators : Higher is better, but will start to overfit
learning_rate : Lower rates mean more trees are needed to get more complex models
Set n_estimators as high as possible, then tune learning_rate
Or, choose a learning_rate and use early stopping to avoid overfitting
max_depth : typically kept low (<5), reduce when overfitting
max_features : can also be tuned, similar to random forests
n_iter_no_change : early stopping: the algorithm stops if the improvement is less than a certain tolerance tol for more than n_iter_no_change iterations
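A sketch of these hyperparameters with early stopping in scikit-learn (the specific values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=1000,         # upper bound; early stopping decides when to stop
    learning_rate=0.05,        # lower rate -> more trees needed
    max_depth=3,               # keep the trees shallow
    n_iter_no_change=10,       # stop if no improvement for 10 iterations...
    validation_fraction=0.1,   # ...on an internal 10% validation split
    tol=1e-4,
    random_state=0,
)
gbc.fit(X, y)
print(gbc.n_estimators_)       # number of trees actually fitted
```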
Extreme Gradient Boosting (XGBoost)
Faster version of gradient boosting: allows more iterations on larger datasets
Normal regression trees: split to minimize squared loss of leaf predictions
XGBoost trees only fit residuals: split so that residuals in leaf are more similar
Don't evaluate every split point, only q quantiles per feature (binning)
q is a hyperparameter ( sketch_eps , default 0.03)
For large datasets, XGBoost uses approximate quantiles
Can be parallelized (multicore) by chunking the data and combining histograms of the data
For classification, the quantiles are weighted by p(1 − p)
Gradient descent sped up by using the second derivative of the loss function
Strong regularization by pre-pruning the trees
Columns and rows are randomly subsampled when computing splits
Support for out-of-core computation (data compression in RAM, sharding,...)
XGBoost in practice
Not part of scikit-learn, but HistGradientBoostingClassifier is similar
binning, multicore,...
The xgboost python package is sklearn-compatible
Install separately, conda install -c conda-forge xgboost
Allows learning curve plotting and warm-starting
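A sketch of the sklearn-compatible xgboost API with early stopping; passing early_stopping_rounds to the constructor assumes a reasonably recent xgboost version (older versions take it in fit() instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    early_stopping_rounds=20,   # stop if the validation score stalls for 20 rounds
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)
```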
Further reading:
XGBoost Documentation
Paper
Video
LightGBM
Another fast boosting technique
Uses gradient-based sampling
use all instances with large gradients/residuals (e.g. 10% largest)
randomly sample instances with small gradients, ignore the rest
intuition: samples with small gradients are already well-trained.
requires adapted information gain criterion
Does smarter encoding of categorical features
CatBoost
Another fast boosting technique
Optimized for categorical variables
Uses bagged and smoothed version of target encoding
Uses symmetric (aka oblivious) trees: the same split is used for all nodes on a given level
Can be much faster
Allows monotonicity constraints for numeric features
The model must be a non-decreasing function of these features
Lots of tooling (e.g. GPU training)
Stacking
Choose M different base-models, generate predictions
Stacker (meta-model) learns mapping between predictions and correct label
Can also be repeated: multi-level stacking
Popular stackers: linear models (fast) and gradient boosting (accurate)
Cascade stacking: adds base-model predictions as extra features
Models need to be sufficiently different, be experts at different parts of the data
Can be very accurate, but also very slow to predict
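A scikit-learn sketch of stacking with a fast linear meta-model (the base models are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),  # fast linear stacker
    passthrough=False,                     # True also feeds the original features (cascade stacking)
    cv=5,                                  # base-model predictions are generated out-of-fold
)
print(cross_val_score(stack, X, y, cv=3).mean())
```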
Other ensembling techniques
Hyper-ensembles: same basic model but with different hyperparameter settings
Can combine overfitted and underfitted models
Deep ensembles: ensembles of deep learning models
Bayes optimal classifier: ensemble of all possible models (largely theoretic)
Bayesian model averaging: weighted average of probabilistic models, weighted by their posterior
probabilities
Cross-validation selection: does internal cross-validation to select the best of M models
Any combination of different ensembling techniques
Algorithm overview
| Name | Representation | Loss function | Optimization | Regularization |
|---|---|---|---|---|
| Classification trees | Decision tree | Entropy / Gini index | Hunt's algorithm | Tree depth,... |
| Regression trees | Decision tree | Square loss | Hunt's algorithm | Tree depth,... |
| RandomForest | Ensemble of randomized trees | Entropy / Gini / Square | (Bagging) | Number/depth of trees,... |
| AdaBoost | Ensemble of stumps | Exponential loss | Greedy search | Number/depth of trees,... |
| GradientBoostingRegression | Ensemble of regression trees | Square loss | Gradient descent | Number/depth of trees,... |
| GradientBoostingClassification | Ensemble of regression trees | Log loss | Gradient descent | Number/depth of trees,... |
| XGBoost, LightGBM, CatBoost | Ensemble of XGBoost trees | Square/log loss | 2nd order gradients | Number/depth of trees,... |
| Stacking | Ensemble of heterogeneous models | / | / | Number of models,... |
Summary
Ensembles of voting classifiers improve performance
Which models to choose? Consider bias-variance tradeoffs!
Bagging / RandomForest is a variance-reduction technique
Build many high-variance (overfitting) models on random data samples
The more different the models, the better
Aggregation (soft voting) over many models reduces variance
Diminishing returns, over-smoothing may increase bias error
Parallelizes easily, doesn't require much tuning
Boosting is a bias-reduction technique
Build low-variance models that correct each other's mistakes
By reweighting misclassified samples: AdaBoost
By predicting the residual error: Gradient Boosting
Additive models: predictions are sum of base-model predictions
Can drive the error to zero, but risk overfitting
Doesn't parallelize easily. Slower to train, much faster to predict.
XGBoost, LightGBM,... are fast and offer some parallelization
Stacking: learn how to combine base-model predictions
Base-models still have to be sufficiently different