Machine Learning
Dr. Sunil Saumya
IIIT Dharwad
ML Algorithm: Decision Tree
● A decision tree builds classification or regression models in the form of a tree structure.
● It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed.
● The final result is a tree with decision nodes and leaf nodes.
○ A decision node has two or more branches.
○ A leaf node represents a classification or decision.
○ The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
● Decision trees can handle both categorical and numerical data.
Decision Tree understanding: example 1
Consider a dataset:

  Age | Lower berth (LB) | Ticket fare concession
  ----+------------------+-----------------------
   45 | Yes              | No concession
   60 | Yes              | Concession with LB
   62 | No               | Concession without LB
   58 | No               | No concession

How do we find who will get a concession?
Decision Tree understanding: example 1
Consider the same dataset. We can write a simple if-else statement to take the decision:

  Age | Lower berth (LB) | Ticket fare concession
  ----+------------------+-----------------------
   45 | Yes              | No concession
   60 | Yes              | Concession with LB
   62 | No               | Concession without LB
   58 | No               | No concession

  if Age < 60:
      print("No Concession")
  else:
      if Lower_berth == "Yes":
          print("Concession with LB")
      else:
          print("Concession without LB")

How do we find who will get a concession? Where is the tree?
Decision Tree understanding: example 1
● The nested if-else statements are our tree.

Will the passenger get a ticket fare concession?

  Age < 60?
  ├── Yes → No concession
  └── No  → Lower berth available?
            ├── Yes → Concession given with lower berth
            └── No  → Concession given without lower berth
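
A minimal sketch (not from the slides) that writes this tree as a Python function, to make the "nested if-else" reading concrete; the argument names are illustrative:

  def fare_concession(age, lower_berth_available):
      # The example-1 decision tree as nested if-else
      if age < 60:
          return "No concession"
      if lower_berth_available:
          return "Concession with lower berth"
      return "Concession without lower berth"

  # Check against the four rows of the example dataset
  for age, lb in [(45, True), (60, True), (62, False), (58, False)]:
      print(age, lb, "->", fare_concession(age, lb))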
Decision Tree understanding: example 2
● Consider the following dataset:
  [dataset shown as a figure on the slides; not reproduced here]
Decision Tree understanding: example 3
● [dataset and tree walk-through shown as figures on the slides; not reproduced here]
● The tree assigns Cheat to "No".
Decision tree geometric intuition
Decision tree: important points
● Programmatically, a decision tree is nothing but a giant structure of nested if-else conditions.
● Mathematically, a decision tree uses hyperplanes that run parallel to one of the axes, cutting the coordinate system into hyper-cuboids.
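
As a small illustration (not from the slides): a single split such as x1 <= 5.5 acts on one coordinate only, so it carves the plane into two axis-aligned regions. The points below mirror the review-length/readability example used later in the lecture.

  # A single axis-parallel split x1 <= t partitions 2-D points into two
  # axis-aligned regions (hyper-cuboids); repeated splits refine these boxes.
  points = [(1, 3), (2, 8), (9, 3.5), (10, 4), (12, 5)]
  t = 5.5
  left  = [p for p in points if p[0] <= t]
  right = [p for p in points if p[0] > t]
  print("region x1 <= 5.5:", left)
  print("region x1  > 5.5:", right)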
Decision Tree: core idea
● The core idea of building a decision tree is to identify the attribute at which we split the tree (or the dataset) such that the split gives the least impurity.
● There are several methods that can be used to find the best splitting attribute (a small sketch of both follows the list):
○ Gini index
○ Information gain
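
A minimal sketch of the two impurity measures named above (standard formulas, not code from the lecture), written for a list of class labels:

  from collections import Counter
  from math import log2

  def gini(labels):
      # Gini index: 1 - sum of squared class proportions
      n = len(labels)
      return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

  def entropy(labels):
      # Entropy: -sum of p * log2(p) over class proportions
      n = len(labels)
      return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

  print(gini([0, 0, 0, 1, 1]))     # 0.48
  print(entropy([0, 0, 0, 1, 1]))  # ~0.971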
Decision Tree: Example
● Consider the following dataset and draw a decision tree:

  Review Length | Readability | Class
  --------------+-------------+------
       10       |     4       |   0
        9       |    3.5      |   1
        2       |     8       |   0
        1       |     3       |   0
       12       |     5       |   1

● Step 1 is to find the split using information gain, which is computed from entropy.
○ Entropy is the uncertainty/randomness in the data; the more the randomness, the higher the entropy.
○ Information gain uses entropy to make decisions.
○ If the entropy after a split is low, the information gain is high, and there is a higher chance of getting pure subsets on the split.
Decision Tree: Example
● Consider the same dataset.
● Information gain = Entropy before split - Entropy after split
● Information gain (Class, ReviewLength) = Entropy (Class) - Entropy (Class, ReviewLength)
● Information gain (Class, Readability) = Entropy (Class) - Entropy (Class, Readability)
● Whichever attribute gives the higher information gain will be our splitting attribute.
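
Continuing the sketch (not from the slides): information gain as the parent entropy minus the size-weighted entropy of the child subsets produced by a split.

  from collections import Counter
  from math import log2

  def entropy(labels):
      n = len(labels)
      return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

  def information_gain(parent, children):
      # children: list of label lists obtained after the split
      n = len(parent)
      after = sum(len(ch) / n * entropy(ch) for ch in children)
      return entropy(parent) - after

  # e.g. a split that separates two 0s from the remaining {0, 1, 1}:
  print(information_gain([0, 0, 0, 1, 1], [[0, 0], [0, 1, 1]]))  # ~0.42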
Decision Tree: Example
● Consider the dataset sorted by Review Length; the candidate thresholds 1.5, 5.5, 9.5 and 11 lie midway between consecutive values:

  Review Length | Readability | Class
  --------------+-------------+------
        1       |     3       |   0
        2       |     8       |   0
        9       |    3.5      |   1
       10       |     4       |   0
       12       |     5       |   1

● Entropy (Class) = E(3, 2) = E(0.6, 0.4) = -(0.6 log2 0.6) - (0.4 log2 0.4) ≈ 0.97
● Information gain (Class, ReviewLength) = E(Class) - E(Class, ReviewLength). Consider the threshold 1.5:

  ReviewLength <= 1.5?
  ├── Yes → class counts: {1: 0, 0: 1}
  └── No  → class counts: {1: 2, 0: 2}
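
Worked numbers for this split (filled in here; the slides show only the counts):
  ● Left child {1: 0, 0: 1}: entropy 0 (pure). Right child {1: 2, 0: 2}: entropy 1.0.
  ● Entropy after split = (1/5)(0) + (4/5)(1.0) = 0.8
  ● Information gain at threshold 1.5 ≈ 0.97 - 0.8 ≈ 0.17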
Decision Tree: Example
● Entropy (Class) = E(3, 2) = E(0.6, 0.4) ≈ 0.97, as before.
● Information gain (Class, ReviewLength) = E(Class) - E(Class, ReviewLength). Now consider the threshold 5.5:

  ReviewLength <= 5.5?
  ├── Yes → class counts: {1: 0, 0: 2}
  └── No  → class counts: {1: 2, 0: 1}
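
Worked numbers for this split (filled in here for comparison):
  ● Left child {1: 0, 0: 2}: entropy 0 (pure). Right child {1: 2, 0: 1}: entropy ≈ 0.92.
  ● Entropy after split = (2/5)(0) + (3/5)(0.92) ≈ 0.55
  ● Information gain at threshold 5.5 ≈ 0.97 - 0.55 ≈ 0.42, higher than the ≈ 0.17 obtained at threshold 1.5.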
Decision Tree: Example
● Similarly, we try the other thresholds and find the information gain in each case.
● Then we pick the threshold that gives the maximum gain.
● For ReviewLength <= 5.5 we get the maximum gain, so the corresponding tree is:

  ReviewLength <= 5.5?
  ├── Yes → class counts: {1: 0, 0: 2}   Pure set; the class is 0.
  └── No  → class counts: {1: 2, 0: 1}   Impure set; a further split is required.
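
A small self-contained sketch (not from the slides) that evaluates all four candidate ReviewLength thresholds; it reproduces the choice of 5.5:

  from collections import Counter
  from math import log2

  def entropy(labels):
      n = len(labels)
      return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

  # (review_length, class) pairs from the example dataset
  data = [(1, 0), (2, 0), (9, 1), (10, 0), (12, 1)]
  labels = [c for _, c in data]

  for t in (1.5, 5.5, 9.5, 11):
      left = [c for x, c in data if x <= t]
      right = [c for x, c in data if x > t]
      after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
      print(f"threshold {t}: gain = {entropy(labels) - after:.3f}")
  # threshold 5.5 gives the largest gain (~0.42)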
Decision Tree: Example
● The pure left branch becomes a leaf with class 0; the right branch is impure, so a further split is required:

  ReviewLength <= 5.5?
  ├── Yes → Class 0
  └── No  → class counts: {1: 2, 0: 1}   Impure set; a further split is required.
Decision Tree: Example
● For the impure right branch (the rows with ReviewLength > 5.5), we repeat the procedure on the Readability attribute (candidate thresholds 3.8 and 4.5) and split at Readability <= 4.5:

  ReviewLength <= 5.5?
  ├── Yes → Class 0
  └── No  → Readability <= 4.5?
            ├── Yes → class counts: {1: 1, 0: 1}   Impure set; a further split is required.
            └── No  → class counts: {1: 1, 0: 0}   Pure set; the class is 1.
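
Worked numbers for the right-hand subset (filled in here; not shown on the slides):
  ● Parent counts {1: 2, 0: 1}: entropy ≈ 0.92.
  ● Readability <= 4.5: children {1: 1, 0: 1} and {1: 1}; entropy after split = (2/3)(1.0) + (1/3)(0) ≈ 0.67, gain ≈ 0.25.
  ● Readability <= 3.8: children {1: 1} and {1: 1, 0: 1}; the gain is again ≈ 0.25.
  ● On this small subset both candidate thresholds tie, and the slides proceed with Readability <= 4.5.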
Decision Tree: Example
● The "No" branch of Readability <= 4.5 is pure, so it becomes a leaf with class 1; the "Yes" branch is still impure and needs a further split:

  ReviewLength <= 5.5?
  ├── Yes → Class 0
  └── No  → Readability <= 4.5?
            ├── Yes → class counts: {1: 1, 0: 1}   Impure set; a further split is required.
            └── No  → Class 1
Decision Tree: Example
● The remaining impure node is split again on Readability, this time at Readability <= 3.8:

  ReviewLength <= 5.5?
  ├── Yes → Class 0
  └── No  → Readability <= 4.5?
            ├── Yes → Readability <= 3.8?
            │         ├── Yes → class counts: {1: 1, 0: 0}   (pure)
            │         └── No  → class counts: {1: 0, 0: 1}   (pure)
            └── No  → Class 1
Decision Tree: Example
● Both children are now pure, so they become leaves. The final classification tree:

  ReviewLength <= 5.5?
  ├── Yes → Class 0
  └── No  → Readability <= 4.5?
            ├── Yes → Readability <= 3.8?
            │         ├── Yes → Class 1
            │         └── No  → Class 0
            └── No  → Class 1

Final classification tree.
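
A small check (not part of the slides) that writes the final tree as nested if-else, echoing the "giant nested if-else" view from earlier, and verifies it on the five training rows:

  def predict(review_length, readability):
      # The final classification tree as nested if-else
      if review_length <= 5.5:
          return 0
      if readability <= 4.5:
          return 1 if readability <= 3.8 else 0
      return 1

  rows = [(10, 4, 0), (9, 3.5, 1), (2, 8, 0), (1, 3, 0), (12, 5, 1)]
  print(all(predict(rl, rd) == c for rl, rd, c in rows))  # True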