Decision Tree
Introduction to Machine Learning
Dr. Hikmat Ullah Khan
Assistant Professor
COMSATS Institute of Information Technology,
Wah Cantt, Pakistan
Email: Hikmat.ullah@ciitwah.edu.pk
Decision Tree Induction: An Example

Training data set: Buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
  age?
    <=30   -> student?        (no -> no, yes -> yes)
    31…40  -> yes
    >40    -> credit_rating?  (excellent -> no, fair -> yes)
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree; splitting attributes: Refund, MarSt, TaxInc):
  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?  (< 80K -> NO, >= 80K -> YES)
             Married          -> NO

Training Data -> Model: Decision Tree
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (the decision tree) from the Training Set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the Test Set to predict the unknown Class labels.
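The induction/deduction loop above can be sketched in a few lines of code. This is a minimal sketch using scikit-learn (the lecture itself points to WEKA; the library choice and the numeric encoding of the categorical attributes are my own assumptions for illustration):

```python
# Sketch of the induction/deduction framework with scikit-learn.
# Assumption: attributes are encoded numerically (Yes=1/No=0, Small=0/Medium=1/Large=2)
# because sklearn trees do not accept categorical strings directly.
from sklearn.tree import DecisionTreeClassifier

# Training Set (Tid omitted): Attrib1, Attrib2, Attrib3 (in K)
X_train = [
    [1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
    [0, 1, 60],  [1, 2, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90],
]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Induction: learn the model (the tree) from the training set
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# Deduction: apply the model to the unlabelled test set (Tids 11-15)
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]
print(model.predict(X_test))  # predicted Class labels
```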
Apply Model to Test Data

Test record:  Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the record at each node:
  Refund?  The record has Refund = No, so take the "No" branch down to MarSt.
  MarSt?   The record has Marital Status = Married, so take the "Married" branch.
  A leaf labelled NO is reached  ->  Assign Cheat to "No".

(The tree is the one built above: Refund? Yes -> NO; No -> MarSt?; Single, Divorced -> TaxInc? (< 80K -> NO, >= 80K -> YES); Married -> NO.)
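The traversal can also be written out as a hard-coded classifier for this particular tree. A minimal sketch (the function and field names are illustrative, not from the lecture):

```python
# Hard-coded version of the tree above: Refund -> MarSt -> TaxInc.
def classify_cheat(record):
    """Predict Cheat for one record given as a dict."""
    if record["Refund"] == "Yes":
        return "No"                                  # left subtree: leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                                  # Married: leaf NO
    # Single or Divorced: test Taxable Income
    return "No" if record["TaxableIncome"] < 80_000 else "Yes"

# The test record from the slides: Refund = No, Married, 80K
print(classify_cheat({"Refund": "No", "MaritalStatus": "Married",
                      "TaxableIncome": 80_000}))     # -> No
```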
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (continuous attributes are discretized in advance).
Test attributes are selected on the basis of a statistical measure (e.g., information gain).
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5, C5.0
SLIQ, SPRINT
General Structure

Let Dt be the set of training records that reach a node t.

General procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

(Illustrated on the Refund / Marital Status / Taxable Income training records shown earlier; the figure shows a node t holding its record set Dt, with the attribute test still undecided: "?")
Hunt's Algorithm: Example

Applying the procedure to the Refund / Marital Status / Taxable Income training records grows the tree step by step:

Step 1: a single node; predict the majority class, Don't Cheat.
Step 2: split on Refund.  Yes -> Don't Cheat;  No -> Don't Cheat (still impure).
Step 3: split the Refund = No branch on Marital Status.  Single, Divorced -> Cheat (still impure);  Married -> Don't Cheat.
Step 4: split the Single, Divorced branch on Taxable Income.  < 80K -> Don't Cheat;  >= 80K -> Cheat.
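The recursive, divide-and-conquer structure of this procedure can be sketched in code. A minimal sketch of the skeleton only (names are mine, and the split-selection step is passed in as a function, since the following slides cover the actual selection measures):

```python
# Sketch of the recursive structure of decision tree induction (Hunt's algorithm).
def build_tree(records, attributes, choose_best_split):
    """records: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                 # Dt is pure -> leaf labelled with that class
        return {"leaf": labels[0]}
    if not attributes:                        # nothing left to test -> majority-class leaf
        return {"leaf": max(set(labels), key=labels.count)}

    best = choose_best_split(records, attributes)     # e.g. highest information gain
    node = {"split_on": best, "children": {}}
    for value in {rec[best] for rec, _ in records}:   # one branch per distinct value
        subset = [(rec, lab) for rec, lab in records if rec[best] == value]
        remaining = [a for a in attributes if a != best]
        node["children"][value] = build_tree(subset, remaining, choose_best_split)
    return node
```

Here `choose_best_split` stands in for whichever statistical measure is used (information gain, Gini index, etc.), which is exactly the question the next slides address.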
Tree Induction
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as there are distinct values.
  CarType?  ->  Family | Sports | Luxury

Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType?  ->  {Sports, Luxury} | {Family}     OR     {Family, Luxury} | {Sports}
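"Finding the optimal partitioning" means scoring every way of dividing the distinct values into two non-empty subsets. A small sketch of just the enumeration (the impurity measure used to score each candidate split is covered later in these slides):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all ways to split a set of nominal values into two non-empty subsets."""
    values = sorted(values)
    # Keep the first value on the left side to avoid listing mirror splits twice.
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                      # skip the trivial split (everything on one side)
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "|", right)
# enumerates the 2^(k-1) - 1 = 3 binary partitions of {Family, Sports, Luxury}:
# {Family} vs {Sports, Luxury}, {Family, Luxury} vs {Sports}, {Family, Sports} vs {Luxury}
```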
Splitting Based on Ordinal Attributes

Multi-way split: use as many partitions as there are distinct values.
  Size?  ->  Small | Medium | Large

Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size?  ->  {Small, Medium} | {Large}     OR     {Small} | {Medium, Large}

What about this split?  {Small, Large} | {Medium}
Splitting Based on Continuous Attributes

Different ways of handling:
Discretization to form an ordinal categorical attribute (there are many ways to do this), as was done for the age attribute in the earlier example.
Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computationally intensive.

Examples for Taxable Income:
  (i) Binary split:     Taxable Income > 80K?  ->  Yes | No
  (ii) Multi-way split: Taxable Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
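For the binary case, "consider all possible splits and find the best cut" usually means sorting the attribute values and trying one candidate threshold between each pair of adjacent values. A minimal sketch scored with the Gini index, which is defined formally later in these slides (function names are mine):

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try a threshold between each pair of adjacent sorted values and return
    the cut v minimizing the weighted Gini of the (A < v) vs (A >= v) split."""
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint threshold
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Taxable Income values with the Cheat labels from the example data set
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(income, cheat))    # (97.5, 0.3) for this data
```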
Tree Induction
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Brief Review of Entropy

Entropy is the measure of uncertainty associated with a random variable.
  High entropy -> high uncertainty
  Low entropy  -> low uncertainty
It is also known as a measure of dispersion.

Entropy(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)

(m is the number of classes; the common binary case has m = 2.)
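As a quick numeric check of the m = 2 case, a minimal helper (the naming is mine):

```python
import math

def entropy(probabilities):
    """Entropy = sum over classes of -p_i * log2(p_i); p_i = 0 contributes 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: maximum uncertainty for m = 2
print(entropy([1.0, 0.0]))   # 0.0: no uncertainty (a pure node)
```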
Attribute Selection Measure:
Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - \sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Entropy = - (1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Entropy = - (2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Attribute Selection: Information Gain

Using the buys_computer training data shown earlier (14 tuples):
  Class P: buys_computer = "yes" (9 tuples)
  Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = - (9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940

Partitioning on age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means "age <= 30" covers 5 out of the 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

Age has the highest information gain, so it is selected as the splitting attribute at the root.
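These numbers are easy to reproduce. A minimal sketch that recomputes Info(D) and the four gains from the 14-row buys_computer table (the function and variable names are mine; the data is exactly the table from the slides):

```python
import math
from collections import Counter

# The 14 training tuples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
print(f"Info(D) = {info(labels):.3f}")                 # Info(D) = 0.940

for i, attr in enumerate(attributes):
    partitions = {}
    for row in data:
        partitions.setdefault(row[i], []).append(row[-1])
    info_a = sum(len(part) / len(data) * info(part) for part in partitions.values())
    print(attr, f"{info(labels) - info_a:.3f}")
# prints: age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (age has the highest gain; the slide's 0.246 / 0.151 differ only by intermediate rounding)
```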
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

If D is split on attribute A into two subsets D1 and D2, the gini index of the split is

  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)
Computation of Gini Index

Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
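A one-line helper reproduces these values (a minimal sketch; the function name is mine):

```python
def gini(counts):
    """Gini index from a list of per-class counts: 1 - sum of p_j^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
print(round(gini([9, 5]), 3))   # 0.459  -> gini(D) for the buys_computer data
```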
Third Measure: Classification Error

A third impurity measure is the classification error at a node t:

  Error(t) = 1 - \max_j p(j|t)
Three in One

For a two-class problem, entropy, the Gini index, and the classification error can be compared as functions of the class probability p: all three are maximized at p = 0.5 and are zero for pure nodes.
Tree Induction
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class.
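In practice, libraries also expose pre-pruning parameters that stop splitting earlier than full purity. A minimal scikit-learn sketch (the parameter values are illustrative assumptions, not from the lecture):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: splitting stops when a node is pure (default), or earlier when the
# tree reaches a maximum depth or a node becomes too small to split further.
model = DecisionTreeClassifier(
    criterion="entropy",      # information-gain-style impurity
    max_depth=3,              # do not grow deeper than 3 levels
    min_samples_split=4,      # need at least 4 records to attempt a split
    min_samples_leaf=2,       # every leaf must keep at least 2 records
)
```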
Decision Tree Based Classification

Advantages:
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Accuracy is comparable to other classification techniques for many simple data sets

Disadvantages:
  Not scalable: adding one attribute (or new data) means the whole tree has to be recomputed
  Accuracy can be poor for large, complex datasets
  Not robust: small changes in the training data can yield a very different tree, and attributes with many values are handled less well
WEKA Complete Book

The WEKA site provides a wiki/book covering all the concepts of machine learning and data mining:
https://www.cs.waikato.ac.nz/ml/weka/book.html

WEKA examples for decision trees have been uploaded as reading material.
Examples

Search for "Data Mining Lecture -- Decision Tree | Solved Example (Eng-Hindi)"
URL: https://www.youtube.com/watch?v=cKl7WV_EKDU

You should all solve the complete example on the weather data.
The video is about 30 minutes; working through the solution takes about 2 hours.
More Examples

Additional material for learning:
http://www.otnira.com/2013/03/17/decision-tree-induction/

A simple research paper based on the weather data and decision trees:
https://www.ijarcce.com/upload/2016/may-16/IJARCCE%20114.pdf
DT Classification Task (optional)

This slide repeats the induction/deduction framework shown earlier: a tree induction algorithm learns a model from the Training Set (Tids 1-10), and the model is then applied to the Test Set (Tids 11-15) to predict the unknown Class labels.
Everyone can build a decision tree in their mind for every task.