Ho Chi Minh University of Banking
Department of Economic Mathematics
Machine Learning
Decision Tree
Vuong Trong Nhân (nhanvt@hub.edu.vn)
Outline
Decision tree representation
ID3 learning algorithm
Which attribute is best?
C4.5: real valued attributes
Which hypothesis is best?
Noise
From Trees to Rules
Miscellaneous
2
Decision Tree Representation
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Outlook, Temperature, etc.: attributes
PlayTennis: class
Shall I play tennis today?
3
Decision Tree for PlayTennis
Outlook?
  Sunny    → Humidity?  (High → No,   Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No, Weak → Yes)
4
Alternative Decision Tree for PlayTennis
Splitting first on Temperature:

Temperature?
  hot  → {D1, D2, D3, D13} → Humidity?
           Normal → {D13} → YES
           High   → {D1, D2, D3} → Wind?
                      Strong → {D2} → NO
                      Weak   → {D1, D3} → Outlook?
                                 Sunny    → {D1} → NO
                                 Overcast → {D3} → YES
  mild → {D4, D8, D10, D11, D12, D14} → ...
  cool → {D5, D6, D7, D9} → ...

What is different? The sequence of attributes influences the size and shape of the tree.
5
Occam’s Principle
Occam’s Principle:
“If two theories explain the
facts equally well, then the
simpler theory is preferred”
Prefer the smallest tree that correctly classifies all
training examples.
6
Decision Trees
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
How would we represent:
• ∧, ∨, XOR

Example XOR(A, B):

A?
  yes → B?  (yes → NO,  no → YES)
  no  → B?  (yes → YES, no → NO)
7
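As a small illustration (not from the slides), the XOR tree above can be written as nested attribute tests; the function name xor_tree is purely illustrative:

```python
# The XOR decision tree above, written as nested tests (illustrative sketch).
def xor_tree(a: bool, b: bool) -> bool:
    if a:                 # branch A = yes: test B
        return not b      # B = yes -> NO, B = no -> YES
    else:                 # branch A = no: test B
        return b          # B = yes -> YES, B = no -> NO

# check against the XOR truth table
assert [xor_tree(a, b) for a in (True, False) for b in (True, False)] == [False, True, True, False]
```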
When to Consider Decision Trees
Instances describable by attribute–value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Interpretable result of learning is required
Examples:
Medical diagnosis
Text classification
Credit risk analysis
8
Top-Down Induction of Decision Trees, ID3
ID3 (Quinlan, 1986) operates on the whole training set S
Algorithm (a Python sketch follows after this slide):
1. create a new node
2. If current training set is sufficiently pure:
• Label node with respective class
• We’re done
3. Else:
• x → the “best” decision attribute for current training set
• Assign x as decision attribute for node
• For each value of x, create new descendant of node
• Sort training examples to leaf nodes
• Iterate over new leaf nodes and apply algorithm recursively
9
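The steps above can be sketched in Python roughly as follows. This is a simplified, illustrative implementation (function and variable names are not from the slides): it uses "all examples share one class" as the purity test and information gain (defined later in these slides) as the "best attribute" criterion.

```python
# Simplified ID3 sketch: examples are dicts of attribute -> value, labels are class values.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    # 2. If the current training set is sufficiently pure, label the node with that class.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:                                  # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # 3. Pick the "best" attribute, create one descendant per value, and recurse.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        sub_ex  = [ex for ex in examples if ex[best] == value]
        sub_lab = [lab for ex, lab in zip(examples, labels) if ex[best] == value]
        node[best][value] = id3(sub_ex, sub_lab, [a for a in attributes if a != best])
    return node
```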
Example ID3
• Look at current training set S
• Determine best attribute
• Split training set according to different values
10
Example ID3
• Tree
• Apply algorithm recursively
11
Example – Resulting Tree
Outlook?
  Sunny    → Humidity?  (High → No,   Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No, Weak → Yes)
12
ID3 – Intermediate Summary
• Recursive splitting of the training set
• Stop, if current training set is sufficiently pure
• ... What does "pure" mean? Can we allow for errors?
• What is the best attribute?
• How can we tell that the tree is really good?
• How shall we deal with continuous values?
13
Which attribute is best?
• Assume a training set {+, +, −, −, +, −, +, +, −, −}
(only classes)
• Assume binary attributes x1, x2, and x3
• Produced splits:

         Value 1              Value 2
  x1     {+, +, −, −, +}      {−, +, +, −, −}
  x2     {+}                  {+, −, −, +, −, +, +, −, −}
  x3     {+, +, +, +, −}      {−, −, −, −, +}
• No attribute is perfect
• Which one to choose?
14
Entropy
• p⊕ is the proportion of positive examples
• p⊖ is the proportion of negative examples
• Entropy measures the impurity of S:

  Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

[Figure: Entropy(S) plotted against p⊕; it is 0 at p⊕ = 0 and p⊕ = 1 and reaches its maximum of 1 at p⊕ = 0.5.]

• Information can be seen as the negative of entropy
15
Entropy
S = {+ + + + + + + + +, − − − − −} = {9+, 5−}. Entropy(S) = ?
Entropy(S) = −9/14 · log2(9/14) − 5/14 · log2(5/14) ≈ 0.94
S = {+ + + + + + + +, − − − − − −} = {8+, 6−}. Entropy(S) = ?
Entropy(S) = −8/14 · log2(8/14) − 6/14 · log2(6/14) ≈ 0.985
S = {+ + + + + + + + + + + + + +} = {14+}. Entropy(S) = ?
Entropy(S) = 0
S = {+ + + + + + + − − − − − − −} = {7+, 7−}. Entropy(S) = ?
Entropy(S) = 1
16
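A quick Python check of these numbers (an illustrative helper, not part of the slides; it uses log base 2, as above):

```python
# Entropy of a set with 'pos' positive and 'neg' negative examples (sketch).
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for p in (pos / total, neg / total):
        if p > 0:                       # 0 * log2(0) is treated as 0
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # ≈ 0.940
print(entropy(8, 6))    # ≈ 0.985
print(entropy(14, 0))   # 0.0
print(entropy(7, 7))    # 1.0
```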
Entropy
• If all members of S belong to the same class: Entropy(S) = 0 (the purest set)
• If the numbers of positive and negative examples are equal (p⊕ = p⊖ = 0.5): Entropy(S) = 1 (maximum impurity)
• If the numbers of positive and negative examples are unequal: entropy is between 0 and 1

[Figure: the entropy curve Entropy(S) as a function of p⊕.]
17
Information Gain
• Measuring attribute x creates subsets S1 and S2 with
different entropies
• Taking the weighted mean of Entropy(S1) and Entropy(S2)
gives the conditional entropy Entropy(S|x), i.e. in general:

  Entropy(S|x) = Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

• → Choose the attribute that maximizes the difference:

  Gain(S, x) = Entropy(S) − Entropy(S|x)

• Gain(S, x) = expected reduction in entropy due to partitioning
on x:

  Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)
18
Information Gain: Definition
Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

Values(x): the set of all possible values for the attribute x
S_v: the subset of S for which x has value v
Information Gain is a measure of the effectiveness of an
attribute in classifying data.
It is the expected reduction in entropy caused by
partitioning the objects according to this attribute.
19
Example - Training Set
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
20
Example
Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

For the top node: S = {9+, 5−}, Entropy(S) = 0.94

Attribute Wind:
  S_weak   = {6+, 2−}, |S_weak| = 8    (Weak:   D3, D4, D5, D9, D10, D13 → Yes;  D1, D8 → No)
  S_strong = {3+, 3−}, |S_strong| = 6  (Strong: D7, D11, D12 → Yes;  D2, D6, D14 → No)

  Entropy(S_weak)   = −6/8 · log2(6/8) − 2/8 · log2(2/8) ≈ 0.81
  Entropy(S_strong) = 1

Expected entropy when splitting on attribute 'Wind':
  Entropy(S | Wind) = 8/14 · Entropy(S_weak) + 6/14 · Entropy(S_strong) ≈ 0.89

Gain(S, Wind) = 0.94 − 0.89 ≈ 0.05
21
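The same computation can be reproduced with a short Python sketch (illustrative names; only the Wind and PlayTennis columns of the table are encoded):

```python
# Reproducing Gain(S, Wind) for the 14 PlayTennis examples (sketch).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(values, labels):
    base = entropy(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return base - remainder

# Wind and PlayTennis columns for D1..D14
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(gain(wind, play), 3))   # ≈ 0.048
```

Applying the same gain function to the Outlook, Humidity, and Temperature columns reproduces the values on the next slide (0.246, 0.151, and 0.029).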
Selecting the Next Attribute
• For the whole training set:

  Gain(S, Outlook)     = 0.246
  Gain(S, Humidity)    = 0.151
  Gain(S, Wind)        = 0.048
  Gain(S, Temperature) = 0.029

  → Outlook should be used to split the training set!

• Further down in the tree, Entropy(S) is computed locally
• Usually, the tree does not have to be minimized
• This is a reason for the good performance of ID3!
22
Next step in growing the decision tree
23
The Resulting Decision Tree & Its Rules
24
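Since each root-to-leaf path of the resulting tree is one IF-THEN rule, the PlayTennis tree can be encoded as a handful of rules; a minimal Python sketch (the function name is illustrative):

```python
# The PlayTennis tree as IF-THEN rules, one rule per root-to-leaf path (sketch).
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny" and humidity == "High":    return "No"
    if outlook == "Sunny" and humidity == "Normal":  return "Yes"
    if outlook == "Overcast":                        return "Yes"
    if outlook == "Rain" and wind == "Strong":       return "No"
    if outlook == "Rain" and wind == "Weak":         return "Yes"

print(play_tennis("Sunny", "Normal", "Weak"))   # Yes (matches example D9)
```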
Some issues: Real-Valued Attributes
Temperature = 82.5
Create discrete attributes to test continuous:
(Temperature > 54) = true or = false
Sort attribute values that occur in training set:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
Determine points where the class changes
Candidates are (48 + 60) / 2 and (80 + 90) / 2
Select the best one using information gain (a sketch follows below)
Implemented in the system C4.5 (successor of ID3)
25
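A rough Python sketch of this threshold selection (illustrative code, not C4.5 itself), using the Temperature example above:

```python
# Choosing a split threshold for a continuous attribute by information gain (sketch).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

temperature = [40, 48, 60, 72, 80, 90]          # already sorted
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]

# candidate thresholds: midpoints where the class label changes
pairs = list(zip(zip(temperature, play), zip(temperature[1:], play[1:])))
candidates = [(a + b) / 2 for (a, la), (b, lb) in pairs if la != lb]
print(candidates)                               # [54.0, 85.0]

def gain_for_threshold(t):
    left  = [lab for x, lab in zip(temperature, play) if x <= t]
    right = [lab for x, lab in zip(temperature, play) if x > t]
    remainder = (len(left) / len(play)) * entropy(left) + (len(right) / len(play)) * entropy(right)
    return entropy(play) - remainder

best = max(candidates, key=gain_for_threshold)
print(best, round(gain_for_threshold(best), 3))  # 54.0 gives the larger gain (≈ 0.459)
```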
Some issues: Noise
Consider adding noisy (=wrongly labeled) training example #15:
Sunny, Mild, Normal, Weak, PlayTennis = No
(i.e. Outlook = Sunny, Humidity = Normal)

What effect on the earlier tree?

Outlook?
  Sunny    → Humidity?  (High → No,   Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No, Weak → Yes)
26
Some issues: Overfitting

Outlook?
  Sunny    → Humidity?  (High → No,   Normal → Yes)
  Overcast → Yes
  Rain     → Wind?      (Strong → No, Weak → Yes)
• The algorithm will introduce a new test
• This is unnecessary, because the new example was erroneous due to the
presence of noise
→ Overfitting corresponds to learning coincidental regularities
• Unfortunately, we generally do not know which examples are noisy
• ... nor the amount, e.g. the percentage, of noisy examples
27
Some issues: Overfitting
An example: continuing to grow the tree can improve the accuracy
on the training data, but hurt the accuracy on the test data.
28
[Mitchell, 1997]
Overfitting: solutions
Some solutions:
Stop learning early: stop growing the tree before it fits the
training data perfectly.
Prune the full tree: grow the tree to its full size, and then
post-prune it.
It is hard to decide when to stop learning.
Post-pruning the tree empirically results in better
performance. But:
How do we decide the right size of the tree?
When do we stop pruning?
We can use a validation set for pruning, e.g.
reduced-error pruning or rule post-pruning (a sketch follows below)
29
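As one concrete way to use a validation set, here is a sketch based on scikit-learn's cost-complexity pruning (assuming scikit-learn and its built-in breast-cancer dataset are available). This is not reduced-error or rule post-pruning, but it follows the same grow-then-prune idea: grow a large tree, then keep the pruned version that performs best on held-out data.

```python
# Pruning a decision tree with a validation set via cost-complexity pruning (sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)                       # guard against tiny negative values
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)              # accuracy on the validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```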
Summary
A decision tree is a Machine Learning method that
can perform both classification and regression
tasks.
A decision tree represents a function as a tree.
Each decision tree can be interpreted as a set of
IF-THEN rules.
Decision trees have been used in many practical
applications.
30
Advantages & Disadvantages
Advantages
It is simple to understand, as it follows the same
process a human follows when making a
decision in real life.
It can be very useful for solving decision-related
problems.
It helps to think about all the possible outcomes of
a problem.
It requires less data cleaning than many other
algorithms.
31
Advantages & Disadvantages
Disadvantages
A decision tree can contain many layers, which
makes it complex.
It may suffer from overfitting, which can be
mitigated by the Random Forest algorithm.
With more class labels, the computational
complexity of the decision tree may increase.
32
Random forests
Random forests (RF) is a method by Leo Breiman
(2001) for both classification and regression.
Main idea: the prediction is based on a combination of many
decision trees, by taking the average of all individual
predictions.
Each tree in an RF is simple but random.
Each tree is grown differently, depending on the
choice of attributes and training data.
33
Random forests
RF is currently one of the most popular and accurate
methods [Fernández-Delgado et al., 2014]
It is also very general.
RF can be implemented easily and efficiently.
It can work with problems of very high dimensions,
without overfitting
34
How Random Forests Work
35
RF: three basic ingredients
Randomization and no pruning:
For each tree and at each node, we randomly select a subset of
attributes.
Find the best split, and then grow the appropriate subtrees.
Every tree is grown to its largest size without pruning.
Combination: each prediction is made by taking the
average of the predictions of the individual trees.
Bagging: the training set for each tree is generated by
sampling (with replacement) from the original data.
(A sketch of these three ingredients follows below.)
36
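A minimal sketch of these three ingredients with scikit-learn (assumed available): RandomForestClassifier grows unpruned trees on bootstrap samples, picks a random attribute subset at each node, and combines the trees' predictions.

```python
# Random forest: bagging + random attribute subsets per node + combining many trees (sketch).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # combination: aggregate the predictions of 100 trees
    max_features="sqrt",   # randomization: random subset of attributes at each node
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the data
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # accuracy of the combined forest on held-out data
```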
Exercise
1. Build the decision tree.
2. Predict:
   Customer (15, youth, medium, no, fair)
   Customer (16, senior, low, yes, excellent)
37