Chapter 9
Ensemble Learning
Supplementary slides to Machine Learning Fundamentals
© Hui Jiang 2020, published by Cambridge University Press
August 2020
Outline
1 Formulation of Ensemble Learning
2 Bagging
3 Boosting
Ensemble Learning
ensemble learning: combine multiple base models that are learned separately for the same task
how to choose base models?
◦ neural networks, linear models, decision trees, etc.
how to learn base models so as to ensure diversity?
◦ re-sampling the training set, re-weighting training samples, etc.
how to combine base models optimally?
◦ bagging, boosting, stacking
Decision Trees (I)
a popular non-parametric model for regression or classification tasks
a tree-structured model:
◦ each non-terminal node is associated with a binary question regarding an input feature element x_i and a threshold t_j, e.g. x_i ≤ t_j
◦ each leaf node represents a homogeneous region R_l in the input space
each decision tree represents a particular partition of the input space
decision trees are a highly interpretable machine learning method
Decision Trees (II)
fit a simple model to all y values in each region R_l
◦ regression: use a constant c_l for each R_l
◦ classification: assign all x in each R_l to one particular class
[figure: a machine learning model mapping an input x to an output y = f̄(x)]
approximate the unknown target function by a piecewise constant function
$$ y = f(x) = \sum_{l} c_l \, \mathbb{I}(x \in R_l) $$
where
$$ \mathbb{I}(x \in R_l) = \begin{cases} 1 & \text{if } x \in R_l \\ 0 & \text{otherwise} \end{cases} $$
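To make the piecewise-constant view concrete, here is a minimal Python sketch (illustrative, not from the book) of a tree as nested binary questions; the Node fields and the predict helper are assumed names.

```python
# A minimal sketch of a decision tree as a piecewise-constant function:
# internal nodes ask "x[i] <= t?", and each leaf stores the constant c_l
# for its region R_l.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index i of the feature tested at this node
    threshold: Optional[float] = None  # threshold t_j for the question x_i <= t_j
    left: Optional["Node"] = None      # subtree for x_i <= t_j
    right: Optional["Node"] = None     # subtree for x_i > t_j
    value: Optional[float] = None      # constant c_l if this node is a leaf

def predict(node: Node, x) -> float:
    """Route x down the tree; the leaf reached determines the region R_l."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Example: a depth-1 tree splitting on x_0 <= 2.5
tree = Node(feature=0, threshold=2.5,
            left=Node(value=1.0), right=Node(value=3.0))
print(predict(tree, [1.7]))  # 1.0  (x falls in the region x_0 <= 2.5)
print(predict(tree, [4.2]))  # 3.0
```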
Decision Trees for Regression
a training set: D = {(x^(n), y^(n)) | n = 1, 2, ..., N}
construct the loss functional using a loss function l(·):
$$ L(f; D) = \frac{1}{N} \sum_{n=1}^{N} l\big(y^{(n)}, f(x^{(n)})\big) = \frac{1}{N} \sum_{n=1}^{N} \big(y^{(n)} - f(x^{(n)})\big)^2 $$
computationally infeasible to find the best partition that minimizes the above loss
use a greedy algorithm to recursively find one optimal split x_i^* ≤ t_j^* at a time:
$$ \{x_i^*, t_j^*\} = \arg\min_{x_i,\, t_j} \Big[ \sum_{x^{(n)} \in D_l} \big(y^{(n)} - c_l^*\big)^2 + \sum_{x^{(n)} \in D_r} \big(y^{(n)} - c_r^*\big)^2 \Big] $$
where
$$ D_l = \big\{(x^{(n)}, y^{(n)}) \,\big|\, x_i^{(n)} \le t_j \big\}, \qquad D_r = \big\{(x^{(n)}, y^{(n)}) \,\big|\, x_i^{(n)} > t_j \big\} $$
and c_l^* and c_r^* are the centroids (i.e. the mean y values) of D_l and D_r
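The greedy split search above can be sketched in a few lines of numpy; the best_split helper, the brute-force threshold scan, and the toy data below are illustrative assumptions, not the book's implementation.

```python
# A minimal numpy sketch of the greedy split search: for every feature i and
# candidate threshold t, score the split by the sum of squared errors around
# the left/right means (the centroids c_l*, c_r*), and keep the best one.
import numpy as np

def best_split(X, y):
    """Return (i*, t*, score) minimizing SSE_left + SSE_right."""
    best = (None, None, np.inf)
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[:-1]:          # candidate thresholds
            left, right = y[X[:, i] <= t], y[X[:, i] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (i, t, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = np.where(X[:, 0] > 5, 2.0, -1.0) + 0.1 * rng.standard_normal(50)
print(best_split(X, y))   # should pick feature 0 with a threshold near 5
```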
Decision Trees for Classification
a classification problem involving K classes, i.e. ω_1, ω_2, ..., ω_K
p_lk (k = 1, 2, ..., K): the proportion of class k among all training samples assigned to leaf node l representing R_l
$$ p_{lk} = \frac{1}{N_l} \sum_{x^{(n)} \in R_l} \mathbb{I}\big(y^{(n)} = \omega_k\big) $$
all inputs x in each region R_l are assigned to the majority class
$$ k_l^* = \arg\max_k \; p_{lk} $$
the criteria for the best split {x_i^*, t_j^*}:
◦ misclassification error: $\frac{1}{N_l} \sum_{x^{(n)} \in R_l} \mathbb{I}\big(y^{(n)} \ne \omega_{k_l^*}\big) = 1 - p_{l k_l^*}$
◦ Gini index: $1 - \sum_{k=1}^{K} p_{lk}^2$
◦ entropy: $- \sum_{k=1}^{K} p_{lk} \log(p_{lk})$
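For concreteness, here is a short Python sketch of the three impurity criteria, computed from the class proportions p_lk of the samples reaching a node; the impurities helper is an assumed name, and the natural log is used for the entropy.

```python
# Compute the three node-impurity criteria listed above from the labels of
# the training samples assigned to one leaf node.
import numpy as np

def impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # p_lk for k = 1..K
    misclass = 1.0 - p.max()                  # 1 - p_{l k_l*}
    gini = 1.0 - np.sum(p ** 2)               # Gini index
    entropy = -np.sum(p * np.log(p))          # entropy (natural log)
    return misclass, gini, entropy

print(impurities(["a", "a", "a", "b"]))   # fairly pure node
print(impurities(["a", "b", "a", "b"]))   # maximally impure for K = 2
```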
Bagging and Random Forests
bagging stands for bootstrap aggregating
bootstrap (i.e. sample with replacement) the training set into M subsets
use the M bootstrap subsets to independently learn M models
combine the M models by averaging or majority voting
random forests: use decision trees as the base models in bagging
◦ row sampling (each tree is trained on a bootstrap sample of the training data)
◦ column sampling (each split considers only a random subset of the features)
◦ sub-optimal splitting
random forests are typically far more accurate and robust than a single decision tree
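A minimal sketch of bagging with decision trees (roughly a hand-rolled random forest): it assumes scikit-learn is available, and the toy dataset, the number of base models M, and max_features="sqrt" for column sampling are illustrative choices.

```python
# Bootstrap the training set, fit one decision tree per bootstrap sample,
# and combine the trees by majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

M = 25                                    # number of base models
trees = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# majority vote over the M trees
votes = np.stack([t.predict(X) for t in trees])          # shape (M, N)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)   # binary labels 0/1
print("training accuracy:", (ensemble_pred == y).mean())
```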
Boosting: Outline
1 Gradient Boosting
2 AdaBoost
3 Gradient Tree Boosting
Boosting
consider an additive model for ensemble learning:
$$ F_m(x) = w_1 f_1(x) + w_2 f_2(x) + \cdots + w_m f_m(x) $$
each base model f_m(x) ∈ H, hence F_m(x) ∈ lin(H) ⊇ H
ensemble learning ⟺ functional minimization:
$$ F_m(x) = \arg\min_{f \in \mathrm{lin}(H)} \sum_{n=1}^{N} l\big(f(x_n), y_n\big) $$
boosting: a sequential learning strategy that adds a new base model to improve the current ensemble:
$$ F_m(x) = F_{m-1}(x) + w_m f_m(x) $$
Gradient Boosting
gradient boosting: estimate the new base model along the direction of the negative gradient of the loss at the current ensemble F_{m-1}:
$$ \nabla l\big(F_{m-1}(x)\big) \triangleq \frac{\partial\, l\big(f(x), y\big)}{\partial f} \bigg|_{f = F_{m-1}} $$
project the negative gradient into H, using the inner product $\langle f, g \rangle \triangleq \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, g(x_i)$:
$$ f_m = \arg\max_{f \in H} \big\langle f, -\nabla l\big(F_{m-1}(x)\big) \big\rangle $$
estimate the optimal weight:
$$ w_m = \arg\min_{w} \sum_{n=1}^{N} l\big(F_{m-1}(x_n) + w\, f_m(x_n), \, y_n\big) $$
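One round of this recipe might look as follows under squared-error loss; the shallow regression tree used as the base model, the toy data, and the 1-D line search via scipy are illustrative assumptions, not the book's implementation.

```python
# One gradient-boosting round: fit the base model to the negative functional
# gradient evaluated at F_{m-1}, then pick the weight w_m by a line search.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

F_prev = np.zeros(len(y))                 # current ensemble F_{m-1}(x_n)

# squared-error loss l(f, y) = (f - y)^2 / 2, so the gradient is F_{m-1} - y
grad = F_prev - y

# "project into H": fit a shallow regression tree to the negative gradient
f_m = DecisionTreeRegressor(max_depth=2).fit(X, -grad)
pred_m = f_m.predict(X)

# line search for the ensemble weight w_m
loss = lambda w: 0.5 * np.sum((F_prev + w * pred_m - y) ** 2)
w_m = minimize_scalar(loss).x
F_new = F_prev + w_m * pred_m
print(f"w_m = {w_m:.3f}, loss {loss(0):.2f} -> {loss(w_m):.2f}")
```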
AdaBoost (I)
apply gradient boosting to binary classification problems
H: all binary functions, i.e. ∀f ∈ H, f(x) ∈ {−1, +1}
the exponential loss function: $l\big(F(x), y\big) = e^{-y F(x)}$
given a training set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_n ∈ R^d and y_n ∈ {−1, +1}
the functional gradient:
$$ \nabla l\big(F_{m-1}(x)\big) \triangleq \frac{\partial\, l\big(f(x), y\big)}{\partial f} \bigg|_{f = F_{m-1}} = -y\, e^{-y F_{m-1}(x)} $$
project into H:
$$ f_m = \arg\max_{f \in H} \big\langle f, -\nabla l\big(F_{m-1}(x)\big) \big\rangle = \arg\max_{f \in H} \frac{1}{N} \sum_{n=1}^{N} y_n f(x_n)\, e^{-y_n F_{m-1}(x_n)} $$
AdaBoost (II)
denote $\alpha_n^{(m)} \triangleq \exp\big(-y_n F_{m-1}(x_n)\big)$:
$$ f_m = \arg\max_{f \in H} \Big[ \sum_{y_n = f(x_n)} \alpha_n^{(m)} - \sum_{y_n \ne f(x_n)} \alpha_n^{(m)} \Big] = \arg\max_{f \in H} \Big[ \sum_{n=1}^{N} \alpha_n^{(m)} - 2 \sum_{y_n \ne f(x_n)} \alpha_n^{(m)} \Big] = \arg\min_{f \in H} \sum_{y_n \ne f(x_n)} \alpha_n^{(m)} $$
normalizing all weights as $\bar{\alpha}_n^{(m)} \triangleq \alpha_n^{(m)} \big/ \sum_{n'=1}^{N} \alpha_{n'}^{(m)}$, we have
$$ f_m = \arg\min_{f \in H} \sum_{y_n \ne f(x_n)} \bar{\alpha}_n^{(m)} $$
AdaBoost (III)
estimate f_m to minimize the weighted classification error:
$$ \epsilon_m = \sum_{y_n \ne f_m(x_n)} \bar{\alpha}_n^{(m)} \qquad (0 \le \epsilon_m \le 1) $$
replace the 0-1 loss function with a weighted loss function, where $\bar{\alpha}_n^{(m)}$ is treated as the loss if (x_n, y_n) is misclassified
estimate the optimal weight:
$$ w_m = \arg\min_{w} \sum_{n=1}^{N} e^{-y_n \big(F_{m-1}(x_n) + w\, f_m(x_n)\big)} \;\Longrightarrow\; w_m = \frac{1}{2} \ln \frac{\sum_{y_n = f_m(x_n)} \bar{\alpha}_n^{(m)}}{\sum_{y_n \ne f_m(x_n)} \bar{\alpha}_n^{(m)}} = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m} $$
AdaBoost (IV)
AdaBoost algorithm
input: {(x_1, y_1), ..., (x_N, y_N)}, where x_n ∈ R^d and y_n ∈ {−1, +1}
output: an ensemble model F_m(x)
m = 1 and F_0(x) = 0
initialize $\bar{\alpha}_n^{(1)} = \frac{1}{N}$ for all n = 1, 2, ..., N
while not converged do
    learn a binary classifier f_m(x) to minimize $\epsilon_m = \sum_{y_n \ne f_m(x_n)} \bar{\alpha}_n^{(m)}$
    estimate the ensemble weight: $w_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$
    add to the ensemble: $F_m(x) = F_{m-1}(x) + w_m f_m(x)$
    update $\bar{\alpha}_n^{(m+1)} = \bar{\alpha}_n^{(m)} e^{-y_n w_m f_m(x_n)} \Big/ \sum_{n'=1}^{N} \bar{\alpha}_{n'}^{(m)} e^{-y_{n'} w_m f_m(x_{n'})}$ for all n = 1, 2, ..., N
    m = m + 1
end while
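A compact Python sketch of this loop, using depth-1 scikit-learn trees (decision stumps) as the binary base classifiers; the toy dataset, the number of rounds M, and the early-stopping test are illustrative choices, with sample_weight playing the role of ᾱ_n^(m).

```python
# AdaBoost with decision stumps; labels are in {-1, +1} as in the slides.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=1)
y = 2 * y01 - 1                              # map {0,1} -> {-1,+1}

N, M = len(y), 20
alpha = np.full(N, 1.0 / N)                  # alpha_bar^(1) = 1/N
stumps, weights = [], []

for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
    pred = stump.predict(X)
    eps = alpha[pred != y].sum()             # weighted error epsilon_m
    if eps <= 0 or eps >= 0.5:               # stop if perfect or no better than chance
        break
    w = 0.5 * np.log((1 - eps) / eps)        # ensemble weight w_m
    stumps.append(stump); weights.append(w)
    alpha *= np.exp(-y * w * pred)           # re-weight the training samples
    alpha /= alpha.sum()                     # normalize

F = sum(w * s.predict(X) for w, s in zip(weights, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```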
AdaBoost (V)
Theorem
If AdaBoost generates m base models with errors ε_1, ε_2, ..., ε_m, the error ε of the ensemble model F_m(x) is bounded as:
$$ \epsilon \le 2^m \prod_{t=1}^{m} \sqrt{\epsilon_t (1 - \epsilon_t)} $$
AdaBoost combines many weak classifiers into a strong classifier, i.e. ε → 0 as m → ∞ if all ε_t ≠ 1/2 (i.e. every base classifier is better than random guessing)
AdaBoost tends to generalize well to unseen samples since it improves the margin distribution of the training samples
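A quick numeric check of how the bound behaves: with a constant base error ε_t = 0.3, each factor 2√(ε_t(1−ε_t)) ≈ 0.92, so the bound decays geometrically in m (the value 0.3 is only an illustration).

```python
# Each round multiplies the bound by 2*sqrt(eps_t*(1 - eps_t)) < 1 whenever
# eps_t != 1/2, so the bound on the ensemble error shrinks geometrically.
import numpy as np

eps = 0.3
factor = 2 * np.sqrt(eps * (1 - eps))
for m in (5, 20, 50):
    print(m, factor ** m)   # 5: ~0.65, 20: ~0.17, 50: ~0.013
```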
Gradient Tree Boosting (I)
apply gradient boosting to regression problems
H: all decision trees
use the squared error as the loss function: $l\big(f(x), y\big) = \frac{1}{2}\big(f(x) - y\big)^2$
the functional gradient: $\nabla l\big(F_{m-1}(x)\big) = F_{m-1}(x) - y$
given a training set: D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
project the negative gradient into H to minimize:
$$ f_m = \arg\min_{f \in H} \big\| f + \nabla l\big(F_{m-1}(x)\big) \big\|^2 = \arg\min_{f \in H} \sum_{n=1}^{N} \Big( f(x_n) - \underbrace{\big(y_n - F_{m-1}(x_n)\big)}_{\text{residual}} \Big)^2 $$
Gradient Tree Boosting (II)
gradient tree boosting: build a decision tree f_m to approximate the residuals, i.e. y_n − F_{m-1}(x_n), for all n:
$$ y = f_m(x) = \sum_{l} c_{ml} \, \mathbb{I}(x \in R_{ml}) $$
where c_{ml} is the mean of all residuals in the region R_{ml}
a.k.a. gradient boosting machine (GBM) or gradient boosted regression trees (GBRT)
use a pre-set "shrinkage" parameter ν as the weight:
$$ F_m(x) = F_{m-1}(x) + \nu\, f_m(x) $$
also applicable to multi-class classification problems
Gradient Tree Boosting (III)
Gradient Tree Boosting
input: {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
output: an ensemble model F_m(x)
fit a regression tree f_0(x) to {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
F_0(x) = ν f_0(x)
m = 1
while not converged do
    compute the negative gradients as pseudo outputs: ỹ_n = −∇l(F_{m-1}(x_n)) for all n = 1, 2, ..., N
    fit a regression tree f_m(x) to {(x_1, ỹ_1), ..., (x_N, ỹ_N)}
    F_m(x) = F_{m-1}(x) + ν f_m(x)
    m = m + 1
end while
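A minimal Python sketch of this loop for squared-error regression, where the pseudo outputs are simply the residuals; the depth-2 scikit-learn trees, the shrinkage ν = 0.1, and M = 100 rounds are illustrative choices (roughly what sklearn.ensemble.GradientBoostingRegressor does, written out by hand).

```python
# Gradient tree boosting for squared-error regression with fixed shrinkage nu.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)

nu, M = 0.1, 100                        # shrinkage and number of rounds
trees = [DecisionTreeRegressor(max_depth=2).fit(X, y)]   # f_0(x)
F = nu * trees[0].predict(X)            # F_0(x) = nu * f_0(x)

for m in range(1, M):
    residual = y - F                    # pseudo outputs: negative gradients
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    F += nu * tree.predict(X)           # F_m = F_{m-1} + nu * f_m

print("training MSE:", np.mean((F - y) ** 2))
```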