
Decision tree learning

Inductive inference with decision trees

§ Decision tree learning is one of the most widely used and practical methods of inductive inference
§ Features:
§ Method for approximating discrete-valued functions (including boolean functions)
§ Learned functions are represented as decision trees (or as if-then-else rules)
§ Expressive hypothesis space, including disjunction
§ Robust to noisy data
Decision tree representation (PlayTennis)

⟨Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong⟩ → No
Decision trees expressivity
§ Decision trees represent a disjunction of conjunctions of constraints on the values of attributes:
(Outlook = Sunny ∧ Humidity = Normal) ∨
(Outlook = Overcast) ∨
(Outlook = Rain ∧ Wind = Weak)
Decision trees representation

[Figure: a decision tree partitioning the instance space into regions labeled + and –]
When to use Decision Trees
§ Problem characteristics:
§ Instances can be described by attribute-value pairs
§ Target function is discrete valued
§ Disjunctive hypothesis may be required
§ Possibly noisy training data samples
§ Robust to errors in training data
§ Missing attribute values
§ Different classification problems:
§ Equipment or medical diagnosis
§ Credit risk analysis
§ Several tasks in natural language processing
Top-down induction of Decision Trees
§ ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
§ Given a training set of examples, the algorithm for building a DT performs a search in the space of decision trees
§ The construction of the tree is top-down. The algorithm is greedy.
§ The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
§ Select the best attribute
§ A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to this value
§ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?

§ A statistical property called information gain measures how well a given attribute separates the training examples
§ Information gain uses the notion of entropy, commonly used in information theory
§ Information gain = expected reduction of entropy
Entropy in binary classification
§ Entropy measures the impurity of a collection of examples. It depends on the distribution of the random variable p.
§ S is a collection of training examples
§ p+ the proportion of positive examples in S
§ p– the proportion of negative examples in S
Entropy(S) ≡ – p+ log2 p+ – p– log2 p–    [taking 0 log2 0 = 0]
Entropy([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 (0) = 0
Entropy([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2 + 1/2 = 1    [log2 1/2 = –1]
Note: the log of a number < 1 is negative; with 0 ≤ p ≤ 1, we have 0 ≤ entropy ≤ 1
Entropy
Entropy in general
§ Entropy measures the amount of information in a random variable
H(X) = – p+ log2 p+ – p– log2 p–    X = {+, –}
for binary classification [two-valued random variable]
H(X) = – Σ i=1..c pi log2 pi = Σ i=1..c pi log2 (1/pi)    X = {1, …, c}
for classification in c classes
Example: rolling a die with 8 equally probable sides
H(X) = – Σ i=1..8 1/8 log2 1/8 = – log2 1/8 = log2 8 = 3
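To make the formulas above concrete, here is a minimal Python sketch (not from the original slides; the function name and input format are assumptions) that computes entropy from raw class counts:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts (0 log 0 taken as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([14, 0]))   # 0.0   -> pure collection
print(entropy([9, 5]))    # ~0.94 -> mixed collection
print(entropy([7, 7]))    # 1.0   -> maximally impure binary collection
print(entropy([1] * 8))   # 3.0   -> fair 8-sided die
```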
Entropy and information theory
§ Entropy specifies the average length (in bits) of the message needed to transmit the outcome of a random variable. This depends on the probability distribution.
§ An optimal-length code assigns ⌈– log2 p⌉ bits to a message with probability p. The most probable messages get the shortest codes.
§ Example: 8-sided [unbalanced] die
Face:        1      2      3      4      5      6      7      8
Probability: 4/16   4/16   2/16   2/16   1/16   1/16   1/16   1/16
Code length: 2 bits 2 bits 3 bits 3 bits 4 bits 4 bits 4 bits 4 bits
E = (1/4 log2 4) × 2 + (1/8 log2 8) × 2 + (1/16 log2 16) × 4 = 1 + 3/4 + 1 = 2.75
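As a quick check of the 2.75-bit figure, the expected optimal code length equals the entropy of the die's distribution; a small sketch (Python, assumed, not part of the slides):

```python
from math import log2

# Face probabilities of the unbalanced 8-sided die from the slide
probs = [4/16, 4/16, 2/16, 2/16, 1/16, 1/16, 1/16, 1/16]

# An optimal code assigns about -log2(p) bits to an outcome of probability p,
# so the expected message length is the entropy of the distribution
expected_bits = sum(-p * log2(p) for p in probs)
print(expected_bits)   # 2.75
```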
Information gain as entropy reduction
§ Information gain is the expected reduction in entropy caused by
partitioning the examples on an attribute.
§ The higher the information gain the more effective the attribute
in classifying training data.
§ Expected reduction in entropy knowing A:
Gain(S, A) = Entropy(S) – Σ v ∈ Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of possible values for A, and Sv is the subset of S for which A has value v
Example: expected information gain
§ Let
§ Values(Wind) = {Weak, Strong}
§ S = [9+, 5–]
§ SWeak = [6+, 2–]
§ SStrong = [3+, 3–]
§ Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) – 8/14 Entropy(SWeak) – 6/14 Entropy(SStrong)
= 0.94 – 8/14 × 0.811 – 6/14 × 1.00
= 0.048
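A small Python sketch of this computation (the helper names are ours, not from the slides); it reproduces the Wind example above:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

def information_gain(parent, partitions):
    """parent and partitions are (pos, neg) pairs; partitions must cover parent."""
    n = sum(parent)
    remainder = sum((sum(part) / n) * entropy(*part) for part in partitions)
    return entropy(*parent) - remainder

# S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
print(information_gain((9, 5), [(6, 2), (3, 3)]))   # ~0.048
```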
Which attribute is the best classifier?
Example
First step: which attribute to test at the root?

§ Which attribute should be tested at the root?


§ Gain(S, Outlook) = 0.246
§ Gain(S, Humidity) = 0.151
§ Gain(S, Wind) = 0.084
§ Gain(S, Temperature) = 0.029
§ Outlook provides the best prediction for the target
§ Let's grow the tree:
§ add to the tree a successor for each possible value of Outlook
§ partition the training samples according to the value of Outlook
After first step
Second step
§ Working on the Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 – 3/5 × 0.0 – 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 – 2/5 × 1.0 – 3/5 × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 – 2/5 × 0.0 – 2/5 × 1.0 – 1/5 × 0.0 = 0.570
§ Humidity provides the best prediction for the target
§ Let's grow the tree:
§ add to the tree a successor for each possible value of Humidity
§ partition the training samples according to the value of Humidity
Second and third steps

[Tree figure: the Sunny branch is split on Humidity, giving leaves {D1, D2, D8} → No and {D9, D11} → Yes; the Rain branch is split on Wind, giving leaves {D4, D5, D10} → Yes and {D6, D14} → No]
ID3: algorithm
ID3(X, T, Attrs)    X: training examples
                    T: target attribute (e.g. PlayTennis)
                    Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class the most common value of T in X
else
  A ← best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for the test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs – {A})
return Root
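The following Python sketch mirrors this pseudocode under some simplifying assumptions (examples as dicts mapping attribute names to values, the target stored under a separate key, and branches created only for attribute values actually present in the current subset, so the empty-subset case never arises):

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    n = len(examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attr, target):
    n = len(examples)
    remainder = 0.0
    for v in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == v]
        remainder += (len(subset) / n) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attrs):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples in one class: leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: the attribute with the highest information gain
    best = max(attrs, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attrs if a != best])
    return tree
```

Calling id3(examples, 'PlayTennis', ['Outlook', 'Temperature', 'Humidity', 'Wind']) on the PlayTennis data would return a nested dictionary mirroring the tree built in the slides above.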
Search space in Decision Tree learning
§ The search space is made by
partial decision trees
§ The algorithm is hill-climbing
§ The evaluation function is
information gain
§ The hypotheses space is complete
(represents all discrete-valued
functions)
§ The search maintains a single
current hypothesis
§ No backtracking; no guarantee of
optimality
§ It uses all the available examples
(not incremental)
§ May terminate earlier, accepting
noisy classes
Inductive bias in decision tree learning

§ What is the inductive bias of DT learning?

1. Shorter trees are preferred over longer trees
   Not enough: this is the bias exhibited by a simple breadth-first algorithm that generates all DTs and selects the shortest one
2. Prefer trees that place high information gain attributes close to the root
§ Note: DTs are not limited in expressiveness; they can represent all possible discrete-valued functions
Two kinds of biases
§ Preference or search biases (due to the search strategy)
§ ID3 searches a complete hypotheses space; the search strategy is
incomplete
§ Restriction or language biases (due to the set of hypotheses
expressible or considered)
§ Candidate-Elimination searches an incomplete hypotheses space; the
search strategy is complete
§ Learning a linear combination of weighted features in board games uses a combination of both kinds of biases.
Prefer shorter hypotheses: Occam's razor
§ Why prefer shorter hypotheses?
§ Arguments in favor:
§ There are fewer short hypotheses than long ones
§ If a short hypothesis fits the data, it is unlikely to be a coincidence
§ Elegance and aesthetics
§ Arguments against:
§ Not every short hypothesis is a reasonable one.
§ Occam's razor:"The simplest explanation is usually the best one."
§ a principle usually (though incorrectly) attributed to the 14th-century English logician and Franciscan friar, William of Ockham.
§ lex parsimoniae ("law of parsimony", "law of economy", or "law of
succinctness")
§ The term razor refers to the act of shaving away unnecessary
assumptions to get to the simplest explanation.
Issues in decision trees learning
§ Overfitting
§ Reduced error pruning
§ Rule post-pruning
§ Extensions
§ Continuous valued attributes
§ Alternative measures for selecting attributes
§ Handling training examples with missing attribute values
§ Handling attributes with different costs
§ Improving computational efficiency
§ Most of these improvements in C4.5 (Quinlan, 1993)
Overfitting: definition
§ Building trees that “adapt too much” to the training examples
may lead to “overfitting”.
§ Consider error of hypothesis h over
§ training data: errorD(h) empirical error
§ entire distribution X of data: errorX(h) expected error
§ Hypothesis h overfits the training data if there is an alternative hypothesis h′ ∈ H such that
errorD(h) < errorD(h′) and
errorX(h′) < errorX(h)
i.e. h′ behaves better over unseen data
Example

D15: ⟨Sunny, Hot, Normal, Strong⟩ → PlayTennis = No (a new, noisy training example)
Overfitting in decision trees

⟨Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No⟩

New noisy example causes splitting of second leaf node.


Overfitting in decision tree learning
Avoid overfitting in Decision Trees
§ Two strategies:
1. Stop growing the tree earlier, before perfect classification
2. Allow the tree to overfit the data, and then post-prune the tree
§ Training and validation set
§ split the training set in two parts (training and validation) and use the validation set to assess the utility of post-pruning
§ Reduced error pruning
§ Rule pruning
§ Other approaches
§ Use a statistical test to estimate effect of expanding or pruning
§ Minimum description length principle: uses a measure of complexity of
encoding the DT and the examples, and halt growing the tree when this
encoding size is minimal
Reduced-error pruning (Quinlan 1987)
§ Each node is a candidate for pruning
§ Pruning consists of removing the subtree rooted at a node: the node becomes a leaf and is assigned the most common classification
§ Nodes are removed only if the resulting tree performs no
worse on the validation set.
§ Nodes are pruned iteratively: at each iteration the node
whose removal most increases accuracy on the validation set is
pruned.
§ Pruning stops when no pruning increases accuracy
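One way to sketch this in Python (a simplified bottom-up variant, not the iterative best-node-first procedure described above): it reuses the nested-dict trees built by the earlier id3 sketch and prunes a subtree whenever replacing it with the majority class of the validation examples reaching it does no worse on those examples. All names are assumptions.

```python
from collections import Counter

def classify(tree, example, default=None):
    """Follow a nested-dict tree (leaves are class labels) for one example."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        branch = tree[attr].get(example.get(attr))
        if branch is None:            # unseen attribute value: fall back to default
            return default
        tree = branch
    return tree

def reduced_error_prune(tree, validation, target, default):
    """Bottom-up pruning driven by the validation examples reaching each node."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attr = next(iter(tree))
    for value, child in list(tree[attr].items()):
        reaching = [ex for ex in validation if ex.get(attr) == value]
        tree[attr][value] = reduced_error_prune(child, reaching, target, default)
    labels = [ex[target] for ex in validation]
    majority = Counter(labels).most_common(1)[0][0]
    subtree_hits = sum(classify(tree, ex, default) == ex[target] for ex in validation)
    leaf_hits = labels.count(majority)
    # Keep the subtree only if it beats the single-leaf replacement on validation data
    return tree if subtree_hits > leaf_hits else majority
```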
Effect of reduced error pruning
Rule post-pruning
1. Create the decision tree from the training set
2. Convert the tree into an equivalent set of rules
§ Each path corresponds to a rule
§ Each node along a path corresponds to a pre-condition
§ Each leaf's classification corresponds to the post-condition
3. Prune (generalize) each rule by removing those preconditions
whose removal improves accuracy …
§ … over validation set
§ … over training with a pessimistic, statistically inspired, measure
4. Sort the rules in estimated order of accuracy, and consider
them in sequence when classifying new instances
Converting to rules

(Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why convert to rules?
§ Each distinct path produces a different rule: a condition
removal may be based on a local (contextual) criterion. Node
pruning is global and affects all the rules
§ In rule form, tests are not ordered and there is no book-
keeping involved when conditions (nodes) are removed
§ Converting to rules improves readability for humans
Dealing with continuous-valued attributes
§ So far discrete values for attributes and for outcome.
§ Given a continuous-valued attribute A, dynamically create a
new attribute Ac
Ac = True if A < c, False otherwise
§ How to determine threshold value c ?
§ Example. Temperature in the PlayTennis example
§ Sort the examples according to Temperature
Temperature  40  48  60   72   80   90
PlayTennis   No  No  Yes  Yes  Yes  No
§ Determine candidate thresholds by averaging consecutive values where there is a change in classification: (48+60)/2 = 54 and (80+90)/2 = 85
§ Evaluate the candidate thresholds (attributes) according to information gain. The best is Temperature > 54. The new attribute competes with the other ones
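A short Python sketch of the candidate-threshold step (variable names are ours); it reproduces the thresholds 54 and 85 found above:

```python
# Midpoints between consecutive sorted values where the class label changes
temps  = [40, 48, 60, 72, 80, 90]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']

pairs = sorted(zip(temps, labels))
candidates = [(t1 + t2) / 2
              for (t1, l1), (t2, l2) in zip(pairs, pairs[1:])
              if l1 != l2]
print(candidates)   # [54.0, 85.0]
```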
Problems with information gain
§ Natural bias of information gain: it favours attributes with
many possible values.
§ Consider the attribute Date in the PlayTennis example.
§ Date would have the highest information gain since it perfectly
separates the training data.
§ It would be selected at the root resulting in a very broad tree
§ Very good on the training data, such a tree would perform poorly at predicting unknown instances: overfitting.
§ The problem is that the partition is too specific, too many small
classes are generated.
§ We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) ≡ – Σ i=1..c (|Si| / |S|) log2 (|Si| / |S|)
§ Si are the sets obtained by partitioning on value i of A
§ SplitInformation measures the entropy of S with respect to the values of A. The more uniformly dispersed the data, the higher it is.
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
§ GainRatio penalizes attributes that split the examples into many small classes, such as Date. Let |S| = n; Date splits the examples into n classes:
§ SplitInformation(S, Date) = –[(1/n log2 1/n) + … + (1/n log2 1/n)] = –log2 1/n = log2 n
§ Compare with an attribute A that splits the data into two even classes:
§ SplitInformation(S, A) = –[(1/2 log2 1/2) + (1/2 log2 1/2)] = –[–1/2 – 1/2] = 1
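A minimal sketch of these two measures in Python (function names assumed); it reproduces the Date and even-split comparisons above:

```python
from math import log2

def split_information(partition_sizes):
    """Entropy of S with respect to the partition induced by attribute A."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_information(partition_sizes)

print(split_information([1] * 14))   # log2(14) ~ 3.81  (Date-like attribute, n = 14)
print(split_information([7, 7]))     # 1.0              (two even classes)
```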
Adjusting gain-ratio
§ Problem: SplitInformation(S, A) can be zero or very small when |Si| ≈ |S| for some value i
§ To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
§ Other measures have been proposed:
§ Distance-based metric [Lopez-De Mantaras, 1991] on the partitions of
data
§ Each partition (induced by an attribute) is evaluated according to the
distance to the partition that perfectly classifies the data.
§ The partition closest to the ideal partition is chosen
Handling incomplete training data
§ How to cope with the problem that the value of some attribute
may be missing?
§ Example: Blood-Test-Result in a medical diagnosis problem
§ The strategy: use the other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at
the node
2. Assign a probability to each value, based on frequencies, and assign
values to missing attribute, according to this probability distribution
§ Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
Handling attributes with different costs
§ Instance attributes may have an associated cost: we would prefer decision trees that use low-cost attributes
§ ID3 can be modified to take costs into account:
1. Tan and Schlimmer (1990):
   Gain²(S, A) / Cost(S, A)
2. Nunez (1988):
   (2^Gain(S, A) – 1) / (Cost(A) + 1)^w,   w ∈ [0, 1]
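Transcribed directly into Python for clarity (a sketch of just the two formulas above; function names are ours):

```python
def tan_schlimmer(gain, cost):
    # Gain^2(S, A) / Cost(S, A)
    return gain ** 2 / cost

def nunez(gain, cost, w):
    # (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] weighting cost against gain
    return (2 ** gain - 1) / (cost + 1) ** w
```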
Classification: Basic Concepts,
Decision Trees
Classification Learning: Definition
! Given a collection of records (training set)
– Each record contains a set of attributes, one of the
attributes is the class
! Find a model for the class attribute as a function of the
values of the other attributes
! Goal: previously unseen records should be assigned a class
as accurately as possible
– Use test set to estimate the accuracy of the model
– Often, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it
Illustrating Classification Learning

Training set (Tid, Attrib1, Attrib2, Attrib3, Class):
 1  Yes  Large   125K  No
 2  No   Medium  100K  No
 3  No   Small    70K  No
 4  Yes  Medium  120K  No
 5  No   Large    95K  Yes
 6  No   Medium   60K  No
 7  Yes  Large   220K  No
 8  No   Small    85K  Yes
 9  No   Medium   75K  No
10  No   Small    90K  Yes

Test set (Class unknown):
11  No   Small    55K  ?
12  Yes  Medium   80K  ?
13  Yes  Large   110K  ?
14  No   Small    95K  ?
15  No   Large    67K  ?

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set.]
Examples of Classification Task
! Predicting tumor cells as benign or malignant
! Classifying credit card transactions as legitimate or fraudulent
! Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
! Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Learning Techniques
! Decision tree-based methods
! Rule-based methods
! Instance-based methods
! Probability-based methods
! Neural networks
! Support vector machines
! Logic-based methods
Example of a Decision Tree

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1  Yes  Single    125K  No
 2  No   Married   100K  No
 3  No   Single     70K  No
 4  Yes  Married   120K  No
 5  No   Divorced   95K  Yes
 6  No   Married    60K  No
 7  Yes  Divorced  220K  No
 8  No   Single     85K  Yes
 9  No   Married    75K  No
10  No   Single     90K  Yes

Model (decision tree):
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

Apply Model to Test Data

Test record: ⟨Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?⟩

Start at the root of the tree and follow the branches that match the record:
Refund = No → MarSt = Married → leaf NO

Assign Cheat to "No"
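The traversal just illustrated can be written down directly; a hand-coded Python sketch of the slide's tree (attribute names and the 80K threshold follow the figure):

```python
def classify(record):
    """Walk the decision tree from the 'Example of a Decision Tree' slide."""
    if record['Refund'] == 'Yes':
        return 'No'
    if record['MarSt'] == 'Married':
        return 'No'
    # Single or Divorced: test taxable income against the 80K threshold
    return 'No' if record['TaxInc'] < 80_000 else 'Yes'

test_record = {'Refund': 'No', 'MarSt': 'Married', 'TaxInc': 80_000}
print(classify(test_record))   # 'No'  -> Cheat is assigned "No"
```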
Decision Tree Learning: ID3
! Function ID3(Training-set, Attributes)
– If all elements in Training-set are in the same class, then return a leaf node labeled with that class
– Else if Attributes is empty, then return a leaf node labeled with the majority class in Training-set
– Else if Training-set is empty, then return a leaf node labeled with the default majority class
– Else
u Select and remove A from Attributes
u Make A the root of the current tree
u For each value V of A
§ Create a branch of the current tree labeled by V
§ Partition_V ← elements of Training-set with value V for A
§ Recursively build the subtree: ID3(Partition_V, Attributes)
§ Attach the result to branch V
Illustrative Training Set
Risk Assessment for Loan Applications
Client # Credit History Debt Level Collateral Income Level RISK LEVEL
1 Bad High None Low HIGH
2 Unknown High None Medium HIGH
3 Unknown Low None Medium MODERATE
4 Unknown Low None Low HIGH
5 Unknown Low None High LOW
6 Unknown Low Adequate High LOW
7 Bad Low None Low HIGH
8 Bad Low Adequate High MODERATE
9 Good Low None High LOW
10 Good High Adequate High LOW
11 Good High None Low HIGH
12 Good High None Medium MODERATE
13 Good High None High LOW
14 Bad High None Medium HIGH
ID3 Example (I)

1) Choose Income as the root of the tree. The branches partition the examples:
   Income = Low:    {1, 4, 7, 11}
   Income = Medium: {2, 3, 12, 14}
   Income = High:   {5, 6, 8, 9, 10, 13}
2) Income = Low: all examples are in the same class, HIGH. Return a leaf node.
3) Income = Medium: choose Debt Level as the root of the subtree.
   3a) Debt Level = Low: {3}, all in the same class, MODERATE. Return a leaf node.
   3b) Debt Level = High: {2, 12, 14}

ID3 Example (II)

3b) Choose Credit History as the root of the subtree. All resulting subsets are in a single class; return leaf nodes:
   3b')   Unknown: {2}  → HIGH
   3b'')  Bad:     {14} → HIGH
   3b''') Good:    {12} → MODERATE
4) Income = High: choose Credit History as the root of the subtree. All resulting subsets are in a single class; return leaf nodes:
   4a) Unknown: {5, 6}      → LOW
   4b) Bad:     {8}         → MODERATE
   4c) Good:    {9, 10, 13} → LOW

ID3 Example (III)

Attach the subtrees at the appropriate places. Final tree:

Income?
  Low    → HIGH
  Medium → Debt Level?
             Low  → MODERATE
             High → Credit History?
                      Unknown → HIGH
                      Bad     → HIGH
                      Good    → MODERATE
  High   → Credit History?
             Unknown → LOW
             Bad     → MODERATE
             Good    → LOW


Another Example

! Assume A1 is a binary feature (Gender: M/F)
! Assume A2 is a nominal feature (Color: R/G/B)

[Figure: decision trees testing A1 (M/F) and A2 (R/G/B), together with the corresponding rectangular decision regions in feature space]

Decision surfaces are axis-aligned hyper-rectangles
Non-Uniqueness
! Decision trees are not unique:
– Given a set of training instances T, there generally exist a number of decision trees that are consistent with (or fit) T

[Figure: the same training data (Tid, Refund, Marital Status, Taxable Income, Cheat) as before, with an alternative consistent tree that tests MarSt at the root: Married → NO; Single, Divorced → Refund? Yes → NO; No → TaxInc? < 80K → NO, > 80K → YES]
ID3’s Question
Given a training set, which of all of the decision trees consistent
with that training set should we pick?

More precisely:
Given a training set, which of all of the decision trees consistent
with that training set has the greatest likelihood of correctly
classifying unseen instances of the population?
ID3’s (Approximate) Bias
§ ID3 (and family) prefers simpler decision trees
§ Occam’s Razor Principle:
§ “It is vain to do with more what can be
done with less...Entities should not be
multiplied beyond necessity.”
§ Intuitively:
§ Always accept the simplest answer that fits
the data, avoid unnecessary constraints
§ Simpler trees are more general
ID3’s Question Revisited
! ID3 builds a decision tree by recursively selecting attributes and splitting the training data based on the values of these attributes

Practically:
Given a training set, how do we select attributes so that the
resulting tree is as small as possible, i.e. contains as few
attributes as possible?
Not All Attributes Are Created Equal
! Each attribute of an instance may be thought
of as contributing a certain amount of
information to its classification
– Think 20-Questions:
u What are good questions?
u Ones whose answers maximize information gained
– For example, determine shape of an object:
u Num. sides contributes a certain amount of information
u Color contributes a different amount of information

! ID3 measures the information gained by making each attribute the root of the current subtree, and subsequently chooses the attribute that produces the greatest information gain
Entropy (as information)
! Entropy at a given node t:
Entropy(t) = – Σ j p(j | t) log2 p(j | t)
(where p(j | t) is the relative frequency of class j at node t)

Based on Shannon’s information theory


For simplicity, assume only 2 classes Yes and No
Assume t is a set of messages sent to a receiver that must guess their class
If p( Yes | t )=1 (resp., p( No | t )=1), then the receiver guesses a new example as
Yes (resp., No). No message need be sent.
If p( Yes | t )=p( No | t )=0.5, then the receiver cannot guess and must be told the
class of a new example. A 1-bit message must be sent.
If 0<p( Yes | t )<1, then the receiver needs less than 1 bit on average to know the
class of a new example.
Entropy (as homogeneity)
! Think chemistry/physics
– Entropy is a measure of disorder or homogeneity
– Minimum (0.0) when homogeneous / perfect order
– Maximum (1.0, in general log C) when most heterogeneous /
complete chaos
! In ID3
– Minimum (0.0) when all records belong to one class, implying
most information
– Maximum (log C) when records are equally distributed among
all classes, implying least information
– Intuitively, the smaller the entropy the purer the partition
Examples of Computing Entropy

Entropy(t) = – Σ j p(j | t) log2 p(j | t)

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
                 Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
                 Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
Information Gain
! Information Gain:
GAIN_split = Entropy(p) – Σ i=1..k (ni / n) Entropy(i)
(where parent node p is split into k partitions, and ni is the number of records in partition i)
– Measures the reduction in entropy achieved because of the split
– ID3 chooses to split on the attribute that results in the largest reduction, i.e., maximizes GAIN
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Computing Gain

[Figure: before splitting, the parent node has class counts C0 = N00, C1 = N01 and entropy E0. Candidate split A? yields nodes N1 and N2 (class counts N10/N11 and N20/N21, entropies E1 and E2, weighted sum E12); candidate split B? yields nodes N3 and N4 (entropies E3 and E4, weighted sum E34).]

Gain = E0 – E12 vs. E0 – E34
Gain Ratio
! Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = – Σ i=1..k (ni / n) log2 (ni / n)
(where parent node p is split into k partitions, and ni is the number of records in partition i)

– Designed to overcome the disadvantage of GAIN


– Adjusts GAIN by the entropy of the partitioning (SplitINFO)
– Higher entropy partitioning (large number of small partitions)
is penalized
– Used by C4.5 (an extension of ID3)
Other Splitting Criterion: GINI Index

! GINI Index for a given node t:
GINI(t) = 1 – Σ j [p(j | t)]²
GINI_split = Σ i=1..k (ni / n) GINI(i)
– Maximum (1 – 1/nc) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information
– Minimize the GINI_split value
– Used by CART, SLIQ, SPRINT
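A small Python sketch of the GINI computations (function names assumed):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted GINI of a candidate split; each partition is a list of class counts."""
    n = sum(sum(part) for part in partitions)
    return sum((sum(part) / n) * gini(part) for part in partitions)

print(gini([0, 6]))                  # 0.0   -> pure node
print(gini([3, 3]))                  # 0.5   -> maximum for two classes (1 - 1/2)
print(gini_split([[3, 1], [1, 3]]))  # 0.375 -> weighted impurity of the split
```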
How to Specify Test Condition?
! Depends on attribute types
– Nominal
– Ordinal
– Continuous

! Depends on number of ways to split


– Binary split
– Multi-way split
Splitting Based on Nominal Attributes
! Multi-way split: use as many partitions as distinct values
CarType → {Family}, {Sports}, {Luxury}
! Binary split: divide values into two subsets
CarType → {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}
Need to find the optimal partitioning!
Splitting Based on Continuous Attributes
§ Different ways of handling
§ Multi-way split: form ordinal categorical attribute
§ Static – discretize once at the beginning
§ Dynamic – repeat on each new partition

§ Binary split: (A < v) or (A ≥ v)


§ How to choose v?

Need to find optimal partitioning!

Can use GAIN or GINI !


Overfitting and Underfitting
§ Overfitting:
§ Given a model space H, a specific model h ∈ H is said to overfit the training data if there exists some alternative model h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances

§ Underfitting:
§ The model is too simple, so that both training and test errors are large
Detecting Overfitting
Overfitting
Overfitting in Decision Tree Learning
! Overfitting results in decision trees that are more complex than
necessary
– Tree growth went too far
– Number of instances gets smaller as we build the tree (e.g., several
leaves match a single example)

! Training error no longer provides a good estimate of how well


the tree will perform on previously unseen records
Avoiding Tree Overfitting – Solution 1
! Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
u Stop if all instances belong to the same class
u Stop if all the attribute values are the same
– More restrictive conditions:
u Stop if number of instances is less than some user-specified threshold
u Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
u Stop if expanding the current node does not improve impurity measures
(e.g., GINI or GAIN)
Avoiding Tree Overfitting – Solution 2
! Post-pruning
– Split dataset into training and validation sets
– Grow full decision tree on training set
– While the accuracy on the validation set increases:
u Evaluate the impact of pruning each subtree, replacing its root by a leaf
labeled with the majority class for that subtree
u Replace subtree that most increases validation set accuracy (greedy
approach)
Decision Tree Based Classification
! Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Good accuracy (careful: no free lunch)
! Disadvantages:
– Axis-parallel decision boundaries
– Redundancy
– Need data to fit in memory
– Need to retrain with new data
Oblique Decision Trees
[Figure: an oblique split x + y < 1 separating Class = + from Class = –]
• Test condition may involve multiple attributes


• More expressive representation
• Finding optimal test condition is computationally expensive
Subtree Replication

[Figure: a decision tree over tests P, Q, R, S in which the same subtree appears under more than one branch]
• Same subtree appears in multiple branches


Decision Trees and Big Data
! For each node in the decision tree we need to scan the entire
data set

! Could just do multiple MapReduce steps across the entire data


set
– Ouch!

! Have each processor build a decision tree based on local data


– Combine decision trees
– Better?
Combining Decision Trees
! Meta Decision Tree
– Somehow decide which decision tree to use
– Hierarchical
– More training needed
! Voting
– Run classification data through all decision trees and then vote
! Merge Decision Trees
Specific techniques for dealing with overfitting
(Model selection provides general framework)
§ 1) Decision tree pruning, or grow the tree only up to a certain size.
§ Prevent splitting on features that are not clearly relevant.
§ Testing the relevance of features --- "does the split provide new information":
§ statistical tests ---> Section 18.3.5 R&N
§ 2) MDL (minimal description length): minimize
size(tree) + size(misclassifications(tree))
§ 3) Grow the full tree, then post-prune (rule post-pruning)
Decision Tree Representation

Outlook?
  Sunny    → Humidity?  (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind?  (Strong → No, Weak → Yes)

A Decision Tree for the concept PlayTennis
Appropriate Problems for Decision
Tree Learning
§ Instances are represented by discrete attribute-value
pairs (though the basic algorithm was extended to
real-valued attributes as well)
§ The target function has discrete output values (can
have more than two possible output values -->
classes)
§ Disjunctive hypothesis descriptions may be required
§ The training data may contain errors
§ The training data may contain missing attribute values
ID3: The Basic Decision Tree Learning Algorithm

§ Database: see [Mitchell, p. 59]

[Figure: the 14 training examples D1–D14 as one unpartitioned set] What is the "best" attribute? Answer: Outlook ["best" = with the highest information gain]
ID3 (Cont'd)

[Figure: the root tests Outlook. Sunny branch: {D1, D2, D8, D9, D11}; Overcast branch: {D3, D7, D12, D13}; Rain branch: {D4, D5, D6, D10, D14}]

What are the "best" attributes for the remaining branches? Humidity and Wind
What Attribute to choose to “best” split
a node?
§ Choose the attribute that minimize the Disorder (or Entropy)
in the subtree rooted at a given node.
§ Disorder and Information are related as follows: the more
disorderly a set, the more information is required to correctly
guess an element of that set.
§ Information: What is the best strategy for guessing a
number from a finite set of possible numbers? i.e., how many
questions do you need to ask in order to know the answer
(we are looking for the minimal number of questions). Answer: log2(|S|), where S is the set of numbers and |S| its cardinality.

E.g.: guessing a number in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} — Q1: is it smaller than 5? Q2: is it smaller than 2? …
What Attribute to choose to “best” split
a node? (Cont’d)
§ Log_2 |S| can also be thought of as the information value of
being told x (the number to be guessed) instead of having to
guess it.
§ Let U be a subset of S. What is the information value of being told x after finding out whether or not x ∈ U? Answer:
log2|S| – [P(x ∈ U) log2|U| + P(x ∉ U) log2|S – U|]
§ Let S = P ∪ N (positive and negative data). The information value of being told x after finding out whether x ∈ P or x ∈ N is
I({P, N}) = log2(|S|) – |P|/|S| log2|P| – |N|/|S| log2|N|
What Attribute to choose to “best” split
a node? (Cont’d)
§ We want to use this measure to choose an attribute that minimizes the disorder in the partitions it creates. Let {S_i | 1 ≤ i ≤ n} be a partition of S resulting from a particular attribute. The disorder associated with this partition is:
V({S_i | 1 ≤ i ≤ n}) = Σ_i (|S_i| / |S|) · I({P(S_i), N(S_i)})
where P(S_i) is the set of positive examples in S_i and N(S_i) the set of negative examples in S_i
Hypothesis Space Search in Decision
Tree Learning
§ Hypothesis Space: set of possible decision trees (i.e., the complete space of finite discrete-valued functions).
§ Search Method: simple-to-complex hill-climbing search (only a single current hypothesis is maintained, ≠ from the candidate-elimination method). No backtracking!!!
§ Evaluation Function: information gain measure
§ Batch Learning: ID3 uses all training examples at each step to make statistically-based decisions (≠ from the candidate-elimination method, which makes decisions incrementally) ==> the search is less sensitive to errors in individual training examples.
Inductive Bias in Decision Tree
Learning
§ ID3’s Inductive Bias: Shorter trees are preferred over
longer trees. Trees that place high information gain
attributes close to the root are preferred over those that
do not.
§ Note: this type of bias is different from the type of bias
used by Candidate-Elimination: the inductive bias of ID3
follows from its search strategy (preference or search
bias) whereas the inductive bias of the Candidate-
Elimination algorithm follows from the definition of its
hypothesis space (restriction or language bias).
Why Prefer Short Hypotheses?
§ Occam’s razor:
Prefer the simplest hypothesis that fits the data [William
of Occam (Philosopher), circa 1320]
§ Scientists seem to do that: e.g., physicists seem to prefer simple explanations for the motion of planets over more complex ones
§ Argument: Since there are fewer short hypotheses than long ones, it is
less likely that one will find a short hypothesis that coincidentally fits the
training data.
§ Problem with this argument: it can be made about many other
constraints. Why is the “short description” constraint more relevant than
others?
§ Nevertheless: Occam’s razor was shown experimentally to be a
successful strategy!
Issues in Decision Tree Learning: I.
Avoiding Overfitting the Data
§ Definition: Given a hypothesis space H, a hypothesis hÎH is
said to overfit the training data if there exists some
alternative hypothesis h’ÎH, such that h has smaller error
than h’ over the training examples, but h’ has a smaller error
than h over the entire distribution of instances. (See curves in
[Mitchell, p.67])
§ There are two approaches for overfitting avoidance in
Decision Trees:
§ Stop growing the tree before it perfectly fits the data
§ Allow the tree to overfit the data, and then post-prune it.
Issues in Decision Tree Learning: II. Other
Issues
§ Incorporating Continuous-Valued Attributes
§ Alternative Measures for Selecting Attributes
§ Handling Training Examples with Missing Attribute Values
§ Handling Attributes with Differing Costs