KEMBAR78
Decision Tree Learning Guide | PDF | Statistical Classification | Theoretical Computer Science
0% found this document useful (0 votes)
17 views33 pages

Decision Tree Learning Guide

Uploaded by

gopineedivignesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views33 pages

Decision Tree Learning Guide

Uploaded by

gopineedivignesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Decision Tree

Supervised vs. Unsupervised Learning

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data is unknown
 Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
2
Prediction Problems: Classification vs.
Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the training


set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
 Typical applications
 Credit/loan approval:

 Medical diagnosis: if a tumor is cancerous or benign

 Fraud detection: if a transaction is fraudulent

 Web page categorization: which category it is

3
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as

determined by the class label attribute


 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees, or

mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

 The known label of test sample is compared with the classified

result from the model


 Accuracy rate is the percentage of test set samples that are

correctly classified by the model


 Test set is independent of training set (otherwise overfitting)

 If the accuracy is acceptable, use the model to classify new data

 Note: If the test set is used to select models, it is called validation (test) set
4
A is discrete valued

A is continuous valued

A is discrete valued and


2 branches (binary value)
Decision Tree

1. Learning

Training data are analyzed by a classification algorithm. Here, Class label is loan
decision; learned model or classifier is represented in the form of classification rules.
Decision Tree

2. Classification

Test data are used to estimate the accuracy of classification rules. If accuracy is
acceptable, the rules can be applied to the classification of new data tuples.
Decision Tree Induction
It is the learning of decision tree from class labeled
training tuples.

Internal nodes-test on an attribute


Branch-outcome of the test
Leaf node-holds a class label
Path is traced from root to leaf
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-

conquer manner
 At start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they are

discretized in advance)
 Examples are partitioned recursively based on selected

attributes
 Test attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)


 Conditions for stopping partitioning
 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning –

majority voting is employed for classifying the leaf


 There are no samples left
11
Brief Review of Entropy

m=2

12
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
m
Info( D)   pi log2 ( pi )
i 1
 Information needed (after using A to split D into v partitions) to
classify D: v | D |
InfoA ( D)    Info( D j )
j

j 1 | D |
 Information gained by branching on attribute A

Gain(A)  Info(D)  InfoA(D)


13
Attribute Selection: Information Gain

Gain(age)  Info( D)  Infoage ( D)  0.246

Gain(income)  0.029
Gain( student )  0.151
Gain(credit _ rating )  0.048

15
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
18
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 Example, product id, attribute acts as an unique identifier
 A split on product id would result in large number of partitions,
each one containing just one tuple

 Gain (Product-Id) is maximal. Such a partitioning is useless for


classification
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
v | Dj | | Dj |
SplitInfoA ( D)    log2 ( )
j 1 | D| | D|
19
Gain Ratio for Attribute Selection (C4.5)
v | Dj | | Dj |
SplitInfoA ( D)    log2 ( )
j 1 | D| | D|

 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex. Consider income as the attribute

 gain_ratio(income) = 0.029/1.557 = 0.019


 The attribute with the maximum gain ratio is selected as the
splitting attribute

20
Gini Index (CART, IBM IntelligentMiner)
 Gini Index considers a binary split on each attribute
 If a data set D contains examples from m classes, gini index, gini(D) is
defined as

where pi is the probability that a tuple in D belongs to class Ci pi =


 If a data set D is split (binary split) on A into two subsets D1 and D2, the gini
index gini(D) is defined as
|D1| |D2 |
giniA ( D)  gini( D1)  gini( D2)
|D| |D|

For each attribute, each of the possible binary split is considered

21
Gini Index (CART, IBM IntelligentMiner)

 Reduction in Impurity:
gini( A)  gini(D)  giniA(D)

 The attribute provides the smallest ginisplit(D) (or the largest reduction in
impurity) is chosen to split the node (need to enumerate all the possible
splitting points for each attribute)

22
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
2 2
9 5
gini( D)  1        0.459
 14   14 
 Suppose the attribute income partitions D into 10 in D1: {low,
medium} and 4 in D2 giniincome{low,medium} ( D)   10 Gini( D1 )   4 Gini( D2 )
 14   14 

Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on the


{low,medium} (and {high}) since it has the lowest Gini index

23
Comparing Attribute Selection Measures

 The three measures, in general, return good results but


 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions
and purity in both partitions
24
PRUNING

If Information gain or
gini index falls below a
prespecified threshold,
then further
partitioning of the
given subset is halted.

Pre Pruning -halting its construction early. i.e. by deciding not to further split or partition the
subset of training tuples at a given node. The leaf may hold the most frequent class among
the subset tuples.
Post Pruning - removes the subtrees from a fully grown tree
A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labelled with the most frequent class among the subtree being replaced.

Cost complexity-number of leaves in the tree; error rate-% of tuples misclassified by the tree
Drawbacks of Decision Tree

Repitition
Drawbacks of Decision Tree

Replication
Scalability Framework for RainForest

 Separates the scalability aspects from the criteria that


determine the quality of the tree
 Capable of handling large data set that should fit in memory
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated

28
Home work problem.
Build a decision tree for the table given based on information gain.
Entropy and Information Gain
 Let’s use IG based criterion to construct a DT for the Tennis example
 At root node, let’s compute IG of each of the 4 features
 Consider feature “wind”. Root contains all examples S = [9+,5-]
H(S ) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94
Sweak = [6+, 2−] ⇒ H(Sweak ) = 0.811
Sstrong = [3+, 3−] ⇒ H(Sstrong) = 1
𝑆weak 𝑆strong
𝐼𝐺(𝑆, 𝑤𝑖𝑛𝑑) = 𝐻 𝑆 − 𝐻 𝑆weak − 𝐻 𝑆strong = 0.94 − 8/14 ∗ 0.811 − 6/14 ∗ 1 = 0.048
𝑆 𝑆

 Likewise, at root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S,temp) = 0.029
 Thus we choose “outlook” feature to be tested at the root node
 Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
 Rule: Iterate - for each child node, select the feature with the highest IG
Growing the tree

 Proceeding as before, for level 2, left node, we can verify that


 IG(S,temp) = 0.570, IG(S, humidity) = 0.970, IG(S, wind) = 0.019
 Thus humidity chosen as the feature to be tested at level 2, left node
 No need to expand the middle node (already “pure” - all “yes” training examples )
 Can also verify that wind has the largest IG for the right node
 Note: If a feature has already been tested along a path earlier, we don’t consider it again
When to stop growing the tree?

 Stop expanding a node further (i.e., make it a leaf node) when


 It consist of all training examples having the same label (the node becomes “pure”)
 We run out of features to test along the path to that node
 The DT starts to overfit (can be checked by monitoring
the validation set accuracy)

You might also like