Lecture 5.
Ensemble Learning
Crowd intelligence
Joaquin Vanschoren
Ensemble learning
If different models make different mistakes, can we simply average the predictions?
Voting Classifier: gives every model a vote on the class label
Hard vote: majority class wins (class order breaks ties)
Soft vote: sum the class probabilities over all models and predict the class with the highest total:
ŷ = argmax_c ∑_{m=1}^{M} w_c p_{m,c}
Classes can get different weights (default: w_c = 1); see the sketch below
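A minimal scikit-learn sketch of both voting schemes (the base models and the synthetic dataset are illustrative choices, not prescribed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [("lr", LogisticRegression(max_iter=1000)),
          ("tree", DecisionTreeClassifier(max_depth=5)),
          ("knn", KNeighborsClassifier())]

hard = VotingClassifier(models, voting="hard")   # majority class wins
soft = VotingClassifier(models, voting="soft",   # sum the predicted class probabilities
                        weights=[2, 1, 1])       # optional per-model weights
for clf in (hard, soft):
    clf.fit(X_train, y_train)
    print(clf.voting, clf.score(X_test, y_test))
```

Note that scikit-learn's weights parameter weighs the individual models rather than the classes.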
Why does this work?
Different models may be good at different 'parts' of data (even if they underfit)
Individual mistakes can be 'averaged out' (especially if models overfit)
Which models should be combined?
Bias-variance analysis teaches us that we have two options:
If model underfits (high bias, low variance): combine with other low-variance models
Need to be different: 'experts' on different parts of the data
Bias reduction. Can be done with Boosting
If model overfits (low bias, high variance): combine with other low-bias models
Need to be different: individual mistakes must be different
Variance reduction. Can be done with Bagging
Models must be uncorrelated but good enough (otherwise the ensemble is worse)
We can also learn how to combine the predictions of different models: Stacking
Decision trees (recap)
Representation: Tree that splits data points into leaves based on tests
Evaluation (loss): Heuristic for purity of leaves (Gini index, entropy,...)
Optimization: Recursive, heuristic greedy search (Hunt's algorithm)
Consider all splits (thresholds) between adjacent data points, for every feature
Choose the one that yields the purest leaves, repeat
Evaluation (loss function for classification)
Every leaf predicts a class probability p̂_c: the relative frequency of class c in that leaf
Leaf impurity measures (splitting criteria) for L leaves, where leaf l has data X_l:
Gini index: Gini(X_l) = ∑_{c≠c′} p̂_c p̂_{c′}
Entropy (more expensive): E(X_l) = − ∑_c p̂_c log₂(p̂_c)
The best split maximizes the information gain (idem for the Gini index):
Gain(X, X_i) = E(X) − ∑_{l=1}^{L} (|X_{i=l}| / |X_i|) E(X_{i=l})
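A small NumPy sketch of these impurity measures and the information gain of a binary split (the function names are my own, for illustration):

```python
import numpy as np

def entropy(y):
    """E(X_l) = -sum_c p_c * log2(p_c) over the class frequencies in y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                              # avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini(X_l) = sum_{c != c'} p_c * p_c' = 1 - sum_c p_c^2."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y, split_mask):
    """Entropy gain of splitting labels y into the two groups defined by a boolean mask."""
    n = len(y)
    left, right = y[split_mask], y[~split_mask]
    return entropy(y) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

y = np.array([0, 0, 1, 1, 1, 0])
mask = np.array([True, True, True, False, False, False])
print(gini(y), entropy(y), information_gain(y, mask))
```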
Regression trees
Every leaf predicts the mean target value μ of all the points in that leaf
Choose the split that minimizes the squared error of the leaves: ∑_{x_i ∈ l} (y_i − μ)²
Yields non-smooth step-wise predictions, cannot extrapolate
Impurity/Entropy-based feature importance
We can measure the importance of features (to the model) based on
Which features we split on
How high up in the tree we split on them (first splits are more important)
Under- and overfitting
We can easily control the (maximum) depth of the trees as a hyperparameter
Bias-variance analysis:
Shallow trees have high bias but very low variance (underfitting)
Deep trees have high variance but low bias (overfitting)
Because we can easily control their complexity, they are ideal for ensembling
Deep trees: keep low bias, reduce variance with Bagging
Shallow trees: keep low variance, reduce bias with Boosting
Bagging (Bootstrap Aggregating)
Obtain different models by training the same model on different training samples
Reduce overfitting by averaging out individual predictions (variance reduction)
In practice: take I bootstrap samples of your data, train a model on each bootstrap sample
Higher I: more models, more smoothing (but slower training and prediction)
Base models should be unstable: different training samples yield different models
E.g. very deep decision trees, or even randomized decision trees
Deep Neural Networks can also benefit from bagging (deep ensembles)
Prediction by averaging predictions of base models
Soft voting for classification (possibly weighted)
Mean value for regression
Can produce uncertainty estimates as well
By combining class probabilities of individual models (or variances for regression)
Random Forests
Uses randomized trees to make models even less correlated (more unstable)
At every split, only consider max_features features, randomly selected
Extremely randomized trees: consider 1 random threshold for a random set of features (faster)
Effect on bias and variance
Increasing the number of models (trees) decreases variance (less overfitting)
Bias is mostly unaffected, but will increase if the forest becomes too large (oversmoothing)
In practice
Different implementations can be used. E.g. in scikit-learn:
BaggingClassifier : Choose your own base model and sampling procedure
RandomForestClassifier : Default implementation, many options
ExtraTreesClassifier : Uses extremely randomized trees
Most important parameters:
n_estimators (>100, higher is better, but diminishing returns)
Will start to underfit (bias error component increases slightly)
max_features
Defaults:
sqrt(p) for classification,
log2(p) for regression
Set smaller to reduce space/time requirements
parameters of trees, e.g. max_depth , min_samples_split ,...
Prepruning useful to reduce model size, but don't overdo it
Easy to parallelize (set n_jobs to -1)
Fix random_state (bootstrap samples) for reproducibility
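A short sketch putting these parameters together (the dataset and the specific values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # higher is better, with diminishing returns
    max_features="sqrt",   # number of features considered at each split
    max_depth=None,        # prepruning: set a limit to reduce model size
    n_jobs=-1,             # parallelize over all cores
    random_state=0,        # fix the bootstrap samples for reproducibility
)
print(cross_val_score(forest, X, y, cv=5).mean())
```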
Out-of-bag error
RandomForests don't need cross-validation: you can use the out-of-bag (OOB) error
For each tree grown, about 33% of samples are out-of-bag (OOB)
Remember which are OOB samples for every model, do voting over these
OOB error estimates are great to speed up model selection
As good as CV estimates, although slightly pessimistic
In scikit-learn: oob_error = 1 - clf.oob_score_
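A minimal sketch of using the OOB score instead of cross-validation (values chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = RandomForestClassifier(n_estimators=300, oob_score=True,
                             n_jobs=-1, random_state=0).fit(X, y)
oob_error = 1 - clf.oob_score_   # generalization error estimate without cross-validation
print(oob_error)
```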
Feature importance
RandomForests provide more reliable feature importances, based on many alternative hypotheses (trees)
Other tips
Model calibration
RandomForests are poorly calibrated.
Calibrate afterwards (e.g. isotonic regression) if you aim to use probabilities
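A sketch of post-hoc calibration with isotonic regression via scikit-learn's CalibratedClassifierCV (the exact setup is an assumption; any held-out or cross-validated scheme works):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the forest; internal cross-validation is used to fit the isotonic calibrator
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
print(calibrated.predict_proba(X_test)[:3])   # calibrated class probabilities
```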
Warm starting
Given an ensemble trained for I iterations, you can simply add more models later
You warm start from the existing model instead of re-starting from scratch
Can be useful to train models on new, closely related data
Not ideal if the data batches change over time (concept drift)
Boosting is more robust against this (see later)
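A sketch of warm starting a forest by increasing n_estimators (the parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

forest = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X, y)                  # trains the first 100 trees

forest.set_params(n_estimators=200)
forest.fit(X, y)                  # adds 100 more trees, keeping the existing ones
print(len(forest.estimators_))    # 200
```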
Strength and weaknesses
RandomForests are among the most widely used algorithms:
Don't require a lot of tuning
Typically very accurate
Handles heterogeneous features well (trees)
Implicitly selects the most relevant features
Downsides:
Less interpretable, slower to train (but parallelizable)
Don't work well on high-dimensional sparse data (e.g. text)
Adaptive Boosting (AdaBoost)
Obtain different models by reweighting the training data every iteration
Reduce underfitting by focusing on the 'hard' training examples
Increase weights of instances misclassified by the ensemble, and vice versa
Base models should be simple so that different instance weights lead to different models
Underfitting models: decision stumps (or very shallow trees)
Each is an 'expert' on some parts of the data
Additive model: predictions at iteration I are the sum of the base model predictions
In AdaBoost, each model g_i also gets its own weight w_i:
f_I(x) = ∑_{i=1}^{I} w_i g_i(x)
AdaBoost minimizes the exponential loss. For the instance-weighted error ε:
L_{Exp} = ∑_{n=1}^{N} e^{ε(f_I(x))}
By deriving ∂L/∂w_i you can find that the optimal w_i = ½ log((1 − ε)/ε)
AdaBoost algorithm
Initialize the sample weights: s_{n,0} = 1/N
Build a model (e.g. decision stumps) using these sample weights
Give the model a weight w_i related to its weighted error rate ε:
w_i = λ log((1 − ε) / ε)
Good trees get more weight than bad trees
The logit function maps the error ε from [0,1] to a weight in [−∞, ∞] (use a small minimum error)
The learning rate (shrinkage) λ decreases the impact of individual classifiers
Small updates are often better but require more iterations
Update the sample weights
Increase the weight of incorrectly predicted samples: s_{n,i+1} = s_{n,i} e^{w_i}
Decrease the weight of correctly predicted samples: s_{n,i+1} = s_{n,i} e^{−w_i}
Normalize the weights to add up to 1
Repeat for I iterations (see the sketch below)
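A from-scratch sketch of these steps for (discrete) AdaBoost with decision stumps, assuming labels in {−1, +1} and using the optimal weight ½ log((1 − ε)/ε) from above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_iterations=50):
    """Discrete AdaBoost sketch; y must contain labels in {-1, +1}."""
    n = len(y)
    s = np.full(n, 1 / n)                                     # s_{n,0} = 1/N
    stumps, model_weights = [], []
    for _ in range(n_iterations):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=s)
        pred = stump.predict(X)
        eps = np.clip(s[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error, kept away from 0 and 1
        w = 0.5 * np.log((1 - eps) / eps)                     # model weight: good stumps weigh more
        s = s * np.exp(-w * y * pred)                         # e^{+w} for mistakes, e^{-w} for correct
        s = s / s.sum()                                       # normalize weights to sum to 1
        stumps.append(stump)
        model_weights.append(w)
    return stumps, model_weights

def adaboost_predict(stumps, model_weights, X):
    scores = sum(w * stump.predict(X) for stump, w in zip(stumps, model_weights))
    return np.sign(scores)

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                                                 # map {0, 1} labels to {-1, +1}
stumps, model_weights = adaboost_fit(X, y)
print((adaboost_predict(stumps, model_weights, X) == y).mean())
```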
AdaBoost variants
Discrete AdaBoost: ε is simply the error rate (1 − Accuracy)
Real AdaBoost: ε is based on the predicted class probabilities p̂_c (better)
AdaBoost for regression: ε is either the linear (|y_i − ŷ_i|), squared ((y_i − ŷ_i)²), or exponential loss
GentleBoost: adds a bound on the model weights w_i
LogitBoost: minimizes the logistic loss instead of the exponential loss:
L_{Logistic} = ∑_{n=1}^{N} log(1 + e^{ε(f_I(x))})
AdaBoost in action
Size of the samples represents sample weight
Background shows the latest tree's predictions
Examples
Bias-Variance analysis
AdaBoost reduces bias (and a little variance)
Boosting is a bias reduction technique
Boosting too much will eventually increase variance
Gradient Boosting
Ensemble of models, each fixing the remaining mistakes of the previous ones
Each iteration, the task is to predict the residual error of the ensemble
Additive model: predictions at iteration I are the sum of the base model predictions
Learning rate (or shrinkage) η: small updates work better (reduces variance)
f_I(x) = g_0(x) + ∑_{i=1}^{I} η ⋅ g_i(x) = f_{I−1}(x) + η ⋅ g_I(x)
The pseudo-residuals r_i are computed according to a differentiable loss function
E.g. least squares loss for regression and log loss for classification
Gradient descent: the predictions get updated step by step until convergence
g_i(x) ≈ r_i = − ∂L(y_i, f_{i−1}(x_i)) / ∂f_{i−1}(x_i)
Base models g_i should be low variance, but flexible enough to predict the residuals accurately
E.g. decision trees of depth 2-5
Gradient Boosting Trees (Regression)
Base models are regression trees, and the loss function is the square loss: L = ½ (y_i − ŷ_i)²
The pseudo-residuals are simply the prediction errors for every sample:
r_i = − ∂L/∂ŷ_i = −2 ⋅ ½ (y_i − ŷ_i) ⋅ (−1) = y_i − ŷ_i
The initial model g_0 simply predicts the mean of y
For iteration m = 1..M (sketched in code below):
For all samples i = 1..n, compute the pseudo-residuals r_i = y_i − ŷ_i
Fit a new regression tree model g_m(x) to r_i
In g_m(x), each leaf predicts the mean of all its values
Update the ensemble predictions: ŷ = g_0(x) + ∑_{m=1}^{M} η ⋅ g_m(x)
Early stopping (optional): stop when the performance on a validation set does not improve for a number of iterations
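A from-scratch sketch of the algorithm above with square loss; the tree depth, learning rate, and number of iterations are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def gb_regression_fit(X, y, n_iterations=100, eta=0.1, max_depth=3):
    f = np.full(len(y), y.mean())            # g_0: predict the mean of y
    trees = []
    for _ in range(n_iterations):
        residuals = y - f                    # pseudo-residuals for square loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + eta * tree.predict(X)        # shrunk update of the ensemble prediction
        trees.append(tree)
    return y.mean(), trees

def gb_regression_predict(g0, trees, X, eta=0.1):
    # use the same eta as during fitting
    return g0 + eta * sum(tree.predict(X) for tree in trees)

X, y = make_regression(n_samples=500, noise=10, random_state=0)
g0, trees = gb_regression_fit(X, y)
print(np.mean((gb_regression_predict(g0, trees, X) - y) ** 2))
```

With early stopping, one would additionally monitor the loss on a validation set each iteration and stop when it no longer improves.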
Gradient Boosting Regression in action
Residuals quickly drop to (near) zero
GradientBoosting Algorithm (Classification)
Base models are regression trees that predict the probability p of the positive class
For multi-class problems, train one tree per class
Use the (binary) log loss, with true class y_i ∈ {0, 1}:
L_{log} = − ∑_{i=1}^{N} [y_i log(p_i) + (1 − y_i) log(1 − p_i)]
The pseudo-residuals are simply the difference between the true class and the predicted probability:
r_i = − ∂L/∂ŷ = − ∂L/∂log(p_i) = y_i − p_i
The initial model g_0 predicts the log-odds: log(#positives / #negatives)
For iteration m = 1..M (a simplified sketch follows below):
For all samples i = 1..n, compute the pseudo-residuals r_i = y_i − p_i
Fit a new regression tree model g_m(x) to r_i
In g_m(x), each leaf predicts ∑_i r_i / ∑_i p_i(1 − p_i)
Update the ensemble predictions: ŷ = g_0(x) + ∑_{m=1}^{M} η ⋅ g_m(x)
Early stopping (optional): stop when the performance on a validation set does not improve for a number of iterations
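A simplified from-scratch sketch of the classification variant: it fits regression trees to the residuals y_i − p_i and updates the log-odds, but deliberately skips the per-leaf correction ∑ r_i / ∑ p_i(1 − p_i) to keep the code short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gb_classification_fit(X, y, n_iterations=100, eta=0.1, max_depth=3):
    """Binary gradient boosting on log loss; y in {0, 1}. Simplified: no leaf-value correction."""
    pos, neg = y.sum(), len(y) - y.sum()
    g0 = np.log(pos / neg)                   # initial log-odds
    f = np.full(len(y), g0, dtype=float)
    trees = []
    for _ in range(n_iterations):
        p = sigmoid(f)                       # current predicted probabilities
        residuals = y - p                    # pseudo-residuals for log loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + eta * tree.predict(X)        # gradient step in log-odds space
        trees.append(tree)
    return g0, trees

def gb_classification_predict_proba(g0, trees, X, eta=0.1):
    return sigmoid(g0 + eta * sum(tree.predict(X) for tree in trees))

X, y = make_classification(n_samples=500, random_state=0)
g0, trees = gb_classification_fit(X, y)
print(((gb_classification_predict_proba(g0, trees, X) > 0.5) == y).mean())
```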
Gradient Boosting Classification in action
Size of the samples represents the residual weights: most quickly drop to (near) zero
Examples
Bias-variance analysis
Gradient Boosting is very effective at reducing bias error
Boosting too much will eventually increase variance
Feature importance
Gradient Boosting also provides feature importances, based on many trees
Compared to RandomForests, the trees are smaller, hence more features have zero importance
Gradient Boosting: strengths and weaknesses
Among the most powerful and widely used models
Work well on heterogeneous features and different scales
Typically better than random forests, but requires more tuning, longer training
Does not work well on high-dimensional sparse data
Main hyperparameters:
n_estimators : Higher is better, but will start to overfit
learning_rate : Lower rates mean more trees are needed to get more complex models
Set n_estimators as high as possible, then tune learning_rate
Or, choose a learning_rate and use early stopping to avoid overfitting
max_depth : typically kept low (<5), reduce when overfitting
max_features : can also be tuned, similar to random forests
n_iter_no_change : early stopping: the algorithm stops if the improvement is less than a certain tolerance tol for more than n_iter_no_change iterations
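A sketch of these hyperparameters with early stopping in scikit-learn (the specific values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=1000,         # upper bound; early stopping decides when to stop
    learning_rate=0.05,        # lower rate -> more trees needed
    max_depth=3,               # keep the trees shallow
    n_iter_no_change=10,       # stop if no improvement for 10 iterations...
    validation_fraction=0.1,   # ...on an internal 10% validation split
    tol=1e-4,
    random_state=0,
)
gbc.fit(X, y)
print(gbc.n_estimators_)       # number of trees actually fitted
```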
Extreme Gradient Boosting (XGBoost)
Faster version of gradient boosting: allows more iterations on larger datasets
Normal regression trees: split to minimize squared loss of leaf predictions
XGBoost trees only fit residuals: split so that residuals in leaf are more similar
Don't evaluate every split point, only q quantiles per feature (binning)
q is a hyperparameter ( sketch_eps , default 0.03)
For large datasets, XGBoost uses approximate quantiles
Can be parallelized (multicore) by chunking the data and combining histograms of the data
For classification, the quantiles are weighted by p(1 − p)
Gradient descent sped up by using the second derivative of the loss function
Strong regularization by pre-pruning the trees
Columns and rows are randomly subsampled when computing splits
Support for out-of-core computation (data compression in RAM, sharding,...)
XGBoost in practice
Not part of scikit-learn, but HistGradientBoostingClassifier is similar
binning, multicore,...
The xgboost python package is sklearn-compatible
Install separately, conda install -c conda-forge xgboost
Allows learning curve plotting and warm-starting
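A sketch of the sklearn-compatible xgboost API with early stopping; passing early_stopping_rounds to the constructor assumes a reasonably recent xgboost version (older versions take it in fit() instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    early_stopping_rounds=20,   # stop if the validation score stalls for 20 rounds
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)
```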
Further reading:
XGBoost Documentation
Paper
Video
LightGBM
Another fast boosting technique
Uses gradient-based sampling
use all instances with large gradients/residuals (e.g. 10% largest)
randomly sample instances with small gradients, ignore the rest
intuition: samples with small gradients are already well-trained.
requires adapted information gain criterion
Does smarter encoding of categorical features
CatBoost
Another fast boosting technique
Optimized for categorical variables
Uses bagged and smoothed version of target encoding
Uses symmetric (aka oblivious) trees: the same split is used for all nodes on a given level
Can be much faster
Allows monotonicity constraints for numeric features
The model must be a non-decreasing function of these features
Lots of tooling (e.g. GPU training)
Stacking
Choose M different base-models, generate predictions
Stacker (meta-model) learns mapping between predictions and correct label
Can also be repeated: multi-level stacking
Popular stackers: linear models (fast) and gradient boosting (accurate)
Cascade stacking: adds base-model predictions as extra features
Models need to be sufficiently different, be experts at different parts of the data
Can be very accurate, but also very slow to predict
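A scikit-learn sketch of stacking with a fast linear meta-model (the base models are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),  # fast linear stacker
    passthrough=False,                     # True also feeds the original features (cascade stacking)
    cv=5,                                  # base-model predictions are generated out-of-fold
)
print(cross_val_score(stack, X, y, cv=3).mean())
```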
Other ensembling techniques
Hyper-ensembles: same basic model but with different hyperparameter settings
Can combine overfitted and underfitted models
Deep ensembles: ensembles of deep learning models
Bayes optimal classifier: ensemble of all possible models (largely theoretic)
Bayesian model averaging: weighted average of probabilistic models, weighted by their posterior
probabilities
Cross-validation selection: does internal cross-validation to select the best of M models
Any combination of different ensembling techniques
Algorithm overview
| Name | Representation | Loss function | Optimization | Regularization |
|---|---|---|---|---|
| Classification trees | Decision tree | Entropy / Gini index | Hunt's algorithm | Tree depth,... |
| Regression trees | Decision tree | Square loss | Hunt's algorithm | Tree depth,... |
| RandomForest | Ensemble of randomized trees | Entropy / Gini / Square | (Bagging) | Number/depth of trees,... |
| AdaBoost | Ensemble of stumps | Exponential loss | Greedy search | Number/depth of trees,... |
| GradientBoostingRegression | Ensemble of regression trees | Square loss | Gradient descent | Number/depth of trees,... |
| GradientBoostingClassification | Ensemble of regression trees | Log loss | Gradient descent | Number/depth of trees,... |
| XGBoost, LightGBM, CatBoost | Ensemble of XGBoost trees | Square/log loss | 2nd order gradients | Number/depth of trees,... |
| Stacking | Ensemble of heterogeneous models | / | / | Number of models,... |
Summary
Ensembles of voting classifiers improve performance
Which models to choose? Consider bias-variance tradeoffs!
Bagging / RandomForest is a variance-reduction technique
Build many high-variance (overfitting) models on random data samples
The more different the models, the better
Aggregation (soft voting) over many models reduces variance
Diminishing returns, over-smoothing may increase bias error
Parallelizes easily, doesn't require much tuning
Boosting is a bias-reduction technique
Build low-variance models that correct each other's mistakes
By reweighting misclassified samples: AdaBoost
By predicting the residual error: Gradient Boosting
Additive models: predictions are sum of base-model predictions
Can drive the error to zero, but risk overfitting
Doesn't parallelize easily. Slower to train, much faster to predict.
XGBoost, LightGBM,... are fast and offer some parallelization
Stacking: learn how to combine base-model predictions
Base-models still have to be sufficiently different