Decision Tree
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
Numeric Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit/loan approval: whether a loan application is safe or risky
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
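A minimal sketch of the two steps, assuming scikit-learn and a synthetic dataset standing in for the labeled tuples (all names below are illustrative, not taken from these notes):

```python
# Sketch: model construction on a training set, then accuracy estimation on an
# independent test set before using the model on new data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # stand-in data

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set; if it is
# acceptable, apply the model to new, unlabeled tuples.
print("Test-set accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```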
[Figure: possible splits on an attribute A - (a) A is discrete-valued, (b) A is continuous-valued,
(c) A is discrete-valued with two branches (binary split)]
Decision Tree
1. Learning
Training data are analyzed by a classification algorithm. Here, the class label attribute is loan
decision; the learned model, or classifier, is represented in the form of classification rules.
Decision Tree
2. Classification
Test data are used to estimate the accuracy of classification rules. If accuracy is
acceptable, the rules can be applied to the classification of new data tuples.
Decision Tree Induction
It is the learning of a decision tree from class-labeled
training tuples.
Internal node - a test on an attribute
Branch - an outcome of the test
Leaf node - holds a class label
A path is traced from the root to a leaf
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
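A minimal sketch of this greedy, top-down induction (an ID3-style illustration, assuming categorical attributes and tuples given as dicts; the function names here are made up for the example):

```python
# Greedy top-down, divide-and-conquer decision tree induction (sketch).
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy of the node minus the weighted entropy of its partitions.
    total = entropy(labels)
    n = len(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

def build_tree(rows, labels, attributes):
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != best])
    return node
```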
Brief Review of Entropy
Entropy for m classes: H = − Σ_{i=1}^{m} p_i log2(p_i)
[Figure: entropy curve for the two-class case, m = 2]
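A quick numeric check of the two-class case (a small sketch; the helper name is ours):

```python
# Binary (m = 2) entropy: H(p) = -p*log2(p) - (1-p)*log2(1-p).
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f} -> H = {binary_entropy(p):.3f}")
# H peaks at 1 bit when p = 0.5 and drops to 0 when the node is pure.
```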
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A
Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
Gain(age) = Info(D) − Info_age(D) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
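A small check of these numbers for the age attribute, assuming the usual class counts of this example (age splits the 14 tuples into partitions with 2/3, 4/0, and 3/2 yes/no counts; the counts are an assumption, since the data table is not reproduced in these notes):

```python
import math

def info(counts):
    # Expected information (entropy) from a list of per-class counts.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_after_split(partitions):
    # Weighted information after splitting D into the given partitions.
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

D = [9, 5]                                 # buys_computer: 9 "yes", 5 "no"
age_partitions = [[2, 3], [4, 0], [3, 2]]  # assumed yes/no counts per age group

print(round(info(D), 3))                                      # Info(D)     ~ 0.940
print(round(info_after_split(age_partitions), 3))             # Info_age(D) ~ 0.694
print(round(info(D) - info_after_split(age_partitions), 3))   # Gain(age)   ~ 0.246
```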
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
(a_i + a_{i+1}) / 2 is the midpoint between the adjacent values a_i and a_{i+1}
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
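A rough sketch of this search for the best split point (the values and labels in the example call are made up, and the helper names are ours):

```python
# Pick the midpoint with the minimum expected information requirement.
import math

def info(labels):
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_info = None, float("inf")
    for i in range(n - 1):
        a_i, a_next = pairs[i][0], pairs[i + 1][0]
        if a_i == a_next:
            continue
        midpoint = (a_i + a_next) / 2
        left = [l for v, l in pairs if v <= midpoint]   # D1: A <= split-point
        right = [l for v, l in pairs if v > midpoint]   # D2: A > split-point
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        if expected < best_info:
            best_point, best_info = midpoint, expected
    return best_point, best_info

point, exp_info = best_split_point([65, 70, 75, 80, 85], ["no", "no", "yes", "yes", "no"])
print(point, round(exp_info, 3))
```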
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a
large number of values
Example: product_id, an attribute that acts as a unique identifier
A split on product_id would result in a large number of partitions,
each containing just one tuple
Gain(product_id) is maximal, yet such a partitioning is useless for
classification
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. Consider income as the splitting attribute: SplitInfo_income(D) = 1.557, so
gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the
splitting attribute
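A quick check of the income example, assuming income splits the 14 tuples into partitions of sizes 4, 6, and 4 (an assumption, since the table is not shown here):

```python
# SplitInfo from partition sizes, then the gain ratio.
import math

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes)

gain_income = 0.029           # from the earlier information-gain computation
si = split_info([4, 6, 4])    # assumed partition sizes for income
print(f"SplitInfo_income(D) = {si:.3f}")            # ~ 1.557
print(f"GainRatio(income)   = {gain_income / si:.3f}")  # ~ 0.019
```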
Gini Index (CART, IBM IntelligentMiner)
Gini Index considers a binary split on each attribute
If a data set D contains examples from m classes, the gini index gini(D) is
defined as
gini(D) = 1 − Σ_{i=1}^{m} p_i^2
where p_i is the probability that a tuple in D belongs to class C_i, estimated by p_i = |C_i,D| / |D|
If a data set D is split (binary split) on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)
For each attribute, each possible binary split is considered
Reduction in impurity:
Δgini(A) = gini(D) − gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in
impurity) is chosen to split the node (need to enumerate all possible
splitting points for each attribute)
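A minimal sketch of these two quantities computed from per-class counts (the counts in the example call are illustrative):

```python
def gini(counts):
    # Gini impurity from a list of per-class counts: 1 - sum(p_i^2).
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts_d1, counts_d2):
    # Weighted Gini index of a binary split of D into D1 and D2.
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return n1 / n * gini(counts_d1) + n2 / n * gini(counts_d2)

D = [9, 5]                                        # e.g., 9 "yes" and 5 "no" tuples
reduction = gini(D) - gini_split([7, 3], [2, 2])  # reduction for one candidate split
print(round(gini(D), 3), round(reduction, 3))
```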
Computation of Gini Index
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 tuples in D2: {high}
gini_{income ∈ {low, medium}}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
Gini for {low, high} is 0.458 and for {medium, high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
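An arithmetic check of these values, assuming D1 = {low, medium} holds 7 "yes" / 3 "no" tuples and D2 = {high} holds 2 "yes" / 2 "no" (assumed counts, since the table is not reproduced here):

```python
gini_D  = 1 - (9/14)**2 - (5/14)**2   # ~ 0.459
gini_D1 = 1 - (7/10)**2 - (3/10)**2   # ~ 0.420  (D1 = {low, medium}, assumed counts)
gini_D2 = 1 - (2/4)**2 - (2/4)**2     # = 0.500  (D2 = {high}, assumed counts)
gini_income_low_medium = 10/14 * gini_D1 + 4/14 * gini_D2   # ~ 0.443
print(round(gini_D, 3), round(gini_income_low_medium, 3))
```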
Comparing Attribute Selection Measures
The three measures, in general, return good results but
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is
much smaller than the others
Gini index:
biased towards multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions
and purity in both partitions
PRUNING
If the information gain or gini index falls below a prespecified threshold, then further
partitioning of the given subset is halted.
Pre-pruning - halting tree construction early, i.e., deciding not to further split or partition the
subset of training tuples at a given node. The leaf holds the most frequent class among
the subset tuples.
Post-pruning - removes subtrees from a fully grown tree.
A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labelled with the most frequent class among the tuples of the subtree being replaced.
Cost complexity is a function of the number of leaves in the tree and the error rate (the
percentage of tuples misclassified by the tree).
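As a library-level illustration (a sketch with scikit-learn, not the exact procedure described above): min_impurity_decrease acts like a pre-pruning threshold, while ccp_alpha drives cost-complexity post-pruning.

```python
# Pre-pruning vs. post-pruning on a built-in dataset (purely illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting when the impurity reduction falls below a threshold.
pre_pruned = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune subtrees via cost complexity.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # one alpha along the pruning path
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, tree in [("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test acc:", round(tree.score(X_test, y_test), 3))
```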
Drawbacks of Decision Trees
Repetition - an attribute is repeatedly tested along a given branch of the tree
Replication - duplicate subtrees exist within the tree
Scalability Framework for RainForest
Separates the scalability aspects from the criteria that
determine the quality of the tree
Capable of handling training sets too large to fit in memory (only the much
smaller AVC-sets need to fit in memory)
Builds an AVC-list: AVC (Attribute, Value, Class_label)
AVC-set (of an attribute X )
Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
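A minimal sketch of an AVC-set for one attribute (the tiny in-memory rows below are only for illustration; in RainForest the counts would be aggregated while scanning a training set that need not fit in memory):

```python
# AVC-set: projection of the data onto (attribute value, class label) with counts.
from collections import defaultdict

def avc_set(rows, attribute, class_attr):
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[attribute]][row[class_attr]] += 1
    return {value: dict(by_class) for value, by_class in counts.items()}

rows = [
    {"age": "youth", "buys_computer": "no"},
    {"age": "youth", "buys_computer": "yes"},
    {"age": "middle_aged", "buys_computer": "yes"},
    {"age": "senior", "buys_computer": "no"},
]
print(avc_set(rows, "age", "buys_computer"))
# e.g. {'youth': {'no': 1, 'yes': 1}, 'middle_aged': {'yes': 1}, 'senior': {'no': 1}}
```

Note that the size of an AVC-set depends on the number of distinct attribute values and class labels, not on the number of training tuples, which is what makes the approach scalable.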
Homework problem:
Build a decision tree for the table given, based on information gain.
Entropy and Information Gain
Let’s use the IG-based criterion to construct a DT for the Tennis example
At root node, let’s compute IG of each of the 4 features
Consider feature “wind”. Root contains all examples S = [9+,5-]
H(S ) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94
Sweak = [6+, 2−] ⇒ H(Sweak ) = 0.811
Sstrong = [3+, 3−] ⇒ H(Sstrong) = 1
IG(S, wind) = H(S) − (|S_weak| / |S|) × H(S_weak) − (|S_strong| / |S|) × H(S_strong) = 0.94 − (8/14)(0.811) − (6/14)(1) = 0.048
Likewise, at root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S,temp) = 0.029
Thus we choose “outlook” feature to be tested at the root node
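A numeric check of the root-node computation for "wind" (a small sketch; the helper H is ours):

```python
# Entropy from positive/negative counts, then IG(S, wind) at the root.
import math

def H(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

H_S      = H(9, 5)        # ~ 0.94
H_weak   = H(6, 2)        # ~ 0.811
H_strong = H(3, 3)        # = 1.0
IG_wind  = H_S - 8/14 * H_weak - 6/14 * H_strong
print(round(IG_wind, 3))  # ~ 0.048
```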
Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
Rule: Iterate - for each child node, select the feature with the highest IG
Growing the tree
Proceeding as before, for level 2, left node, we can verify that
IG(S,temp) = 0.570, IG(S, humidity) = 0.970, IG(S, wind) = 0.019
Thus humidity chosen as the feature to be tested at level 2, left node
No need to expand the middle node (already “pure” - all “yes” training examples )
Can also verify that wind has the largest IG for the right node
Note: If a feature has already been tested along a path earlier, we don’t consider it again
When to stop growing the tree?
Stop expanding a node further (i.e., make it a leaf node) when
It consists of all training examples having the same label (the node becomes “pure”)
We run out of features to test along the path to that node
The DT starts to overfit (can be checked by monitoring
the validation set accuracy)
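One rough way to run that check (a sketch with scikit-learn and synthetic data, purely illustrative): grow trees of increasing depth and watch the validation accuracy.

```python
# Monitor validation accuracy as the tree is allowed to grow deeper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_acc = None, 0.0
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(f"Best depth by validation accuracy: {best_depth} (acc = {best_acc:.3f})")
```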