Decision-Tree Learning
• Introduction
– Decision Trees
– TDIDT: Top-Down Induction of Decision Trees
• ID3
– Attribute selection
– Entropy, Information, Information Gain
– Gain Ratio
• C4.5
– Numeric Values
– Missing Values Acknowledgements: Many slides
– Pruning based on Frank & Witten, a few on
Kan, Steinbach & Kumar
• Regression and Model Tree
Decision Trees
• A decision tree consists of
• Nodes:
• test for the value of a certain attribute
• Edges:
• correspond to the outcome of a test
• connect to the next node or leaf
• Leaves:
• terminal nodes that predict the outcome
to classifiy an example:
1. start at the root
2. perform the test
3. follow the edge corresponding to outcome
4. goto 2. unless leaf
5. predict that outcome associated with the leaf
Decision Tree Learning
In Decision Tree Learning, a The training examples are used
new example is classified by for choosing appropriate tests
submitting it to a series of in the decision tree. Typically,
tests that determine the class a tree is built from top to
label of the example.These Training bottom, where tests that
tests are organized in a maximize the information gain
hierarchical structure called a about the classification are
decision tree. selected first.
New Example Classification
A Sample Task
Decision Tree Learning
Divide-And-Conquer Algorithms
• Family of decision tree learning algorithms
– TDIDT: Top-Down Induction of Decision Trees
• Learn trees in a Top-Down fashion:
– divide the problem in subproblems
– solve each problem
Basic Divide-And-Conquer Algorithm:
1.select a test for root node
Create branch for each possible outcome of the test
2.split instances into subsets
One for each branch extending from the node
3. repeat recursively for each branch, using only instances that
reach the branch
4. stop recursion for a branch if all its instances have the same
class
ID3 Algorithm
• Function ID3
– Input: Example set S
– Output: Decision Tree DT
• If all examples in S belong to the same class c
– return a new leaf and label it with c
• Else
– i. Select an attribute A according to some heuristic function
– ii. Generate a new node DT with A as test
– iii.For each Value vi of A
• (a) Let Si = all examples in S with A = v i
• (b) Use ID3 to construct a decision tree DTi for example set Si
• (c) Generate an edge that connects DT and DTi
A Different Decision Tree
•also explains all of the training data
•will it generalize well to new data?
Which attribute to select as the root?
What is a good Attribute?
• We want to grow a simple tree
→ a good attribute prefers attributes that split the data so that each
successor node is as pure as possible
– i.e., the distribution of examples in each node is so that it mostly
contains examples of a single class
• In other words:
– We want a measure that prefers attributes that have a high degree of „order“:
• Maximum order: All examples are of the same class
• Minimum order: All classes are equally likely
– → Entropy is a measure for (un-)orderedness
– Another interpretation:
• Entropy is the amount of information that is contained
• all examples of the same class → no information
Entropy (for two classes)
• S is a set of examples
• p⊕ is the proportion of examples in
class ⊕
• p⊖ = 1 − p⊕ is the proportion of
examples in class ⊖
Interpretation:
•amount of unorderedness in the class
distribution of S
Example: Attribute Outlook
• Outlook = sunny: 3 examples yes, 2 examples no
• Outlook = overcast: 4 examples yes, 0 examples no
Note: this
is normally
undefined.
• Outlook = rainy : 2 examples yes, 3 examples no Here: = 0
Entropy (for more classes)
• Entropy can be easily generalized for n > 2 classes
– p i is the proportion of examples in S that belong to the i-th class
Average Entropy / Information
• Problem:
– Entropy only computes the quality of a single (sub-)set of examples
• corresponds to a single value
– How can we compute the quality of the entire split?
• corresponds to an entire attribute
• Solution:
– Compute the weighted average over all sets resulting from the split
• weighted by their size
I (S , A) =
• Example:
• Average entropy for attribute Outlook:
Information Gain
• When an attribute A splits the set S into subsets Si
– we compute the average entropy
– and compare the sum to the entropy of the original set S
Information Gain for Attribute A
• The attribute that maximizes the difference is selected
i.e., the attribute that reduces the unorderedness most!
• Note:
maximizing information gain is equivalent to minimizing average
entropy, because E(S) is constant for all attributes A
Example
Example (Ctd.)
Outlook is selected as
Outlook
the root note
overcast Rain
Sunny
? Yes ?
further splitting
necessary
Outlook = overcast
contains only examples
of class yes
Example (Ctd.)
Gain(Temperature ) = 0.571 bits
Gain(Humidity ) = 0.971 bits Humidity is selected
Gain(Windy ) = 0.020 bits
Example (Ctd.)
Humidity is selected Outlook
overcast Rain
Sunny
Humidity Yes ?
normal high
Yes No
further splitting
necessary
Pure leaves
→ No further expansion necessary
Final decision tree
Properties of Entropy
• Entropy is the only function that satisfies all of the following three properties
• When node is pure, measure should be zero
• When impurity is maximal (i.e. all classes equally likely), measure should be
maximal
• Measure should obey multistage property:
• p, q, r are classes in set S, and T are examples of class t = q ˅ r
• → decisions can be made in several stages
• Simplification of computation of average entropy (information):
Highly-branching attributes
• Problematic: attributes with a large number of values
– extreme case: each example has its own value
• e.g. example ID; Day attribute in weather data
• Subsets are more likely to be pure if there is a large number of different
attribute values
– Information gain is biased towards choosing attributes with a large number of
values
• This may cause several problems:
– Overfitting
• selection of an attribute that is non-optimal for prediction
– Fragmentation
• data are fragmented into (too) many small sets
Decision Tree for Day attribute
Entropy of split:
Information gain is maximal for Day (0.940 bits)
Intrinsic Information of an Attribute
• Intrinsic information of a split
– entropy of distribution of instances into branches
– i.e. how much information do we need to tell which branch an instance belongs to
IntI (S , A) =
• Example:
– Intrinsic information of Day attribute:
• Observation:
– Attributes with higher intrinsic information are less useful
Gain Ratio
• modification of the information gain that reduces its bias towards multi-valued
attributes
• takes number and size of branches into account when choosing an attribute
• corrects the information gain by taking the intrinsic information of a split into
account
• Definition of Gain Ratio:
• Example:
• Gain Ratio of Day attribute
Gain ratios for weather data
• Day attribute would still win...
– one has to be careful which attributes to add...
• Nevertheless: Gain ratio is more reliable than Information Gain
Gini Index
• Many alternative measures to Information Gain
• Most popular altermative: Gini index
– used in e.g., in CART (Classification And Regression Trees)
– impurity measure (instead of entropy)
• average Gini index (instead of average entropy / information)
• Gini Gain
– could be defined analogously to information gain
– but typically avg. Gini index is minimized instead of maximizing Gini gain
Comparison among Splitting Criteria
• For a 2-class problem:
Industrial-strength algorithms
• For an algorithm to be useful in a wide range of real-world applications it
must:
– Permit numeric attributes
– Allow missing values
– Be robust in the presence of noise
– Be able to approximate arbitrary concept descriptions (at least in principle)
• → ID3 needs to be extended to be able to deal with real-world data
• Result: C4.5
– Best-known and (probably) most widely-used learning algorith
• original C-implementation at http://www.rulequest.com/Personal/
– Re-implementation of C4.5 Release 8 in Weka: J4.8
– Commercial successor: C5.0