Module 3-Decision Tree Learning

The document describes module 2 on decision tree learning. It covers topics like decision tree representation, appropriate problems for decision tree learning, and the basic ID3 algorithm. The ID3 algorithm uses information gain, based on entropy, to determine the attribute that best splits the data when building a decision tree in a top-down greedy manner. It calculates the information gain of attributes like outlook, temperature, humidity and wind on a tennis play dataset to select the root node of the decision tree.

Uploaded by

ramya

Module 2

Decision Tree Learning


Topics to be covered:
1. Introduction
2. Decision Tree Representation
3. Appropriate Problems for Decision Tree Learning
4. Basic decision tree learning algorithm
5. Hypothesis space search in Decision Tree Learning
6. Inductive bias in Decision Tree Learning
7. Issues in Decision Tree Learning
1. Introduction
● Decision Tree Learning is one of the most widely used practical methods for inductive
inference, with applications such as diagnosing medical cases and assessing the credit risk
of loan applications.
● Method for approximating discrete-valued functions
● Robust to noisy data and capable of learning disjunctive expressions
● Learned trees can also be re-represented as a set of if-then-rules to improve
human readability
● Algorithms: ID3, ASSISTANT, C4.5
2. Decision Tree Representation
● A tree classifies instances:
○ Node: an attribute which describes an instance.
○ Branch: possible values of the attribute
○ Leaf: class to which the instances belong

● Procedure (of classifying):

○ An instance is classified by starting at the root node of the tree

○ Repeat until a leaf is reached:
   - test the attribute specified by the node
   - move down the tree branch corresponding to the value of that attribute in the given example

○ The class at the leaf is the classification of the instance
Figure: A Decision Tree for the concept PlayTennis.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes

The decision tree shown in the figure corresponds to the expression:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)


Example: Consider the following instance:

(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind=Strong)

● This instance will be sorted down to the leftmost branch of the decision tree shown in
previous slide and will be classified as a negative example.

● Tree predicts that “PlayTennis = No”


3. Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to the problems with the following
characteristics:

1. Instances are represented by attribute-value pairs: Instances are described by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree
learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild,
Cold).

2. The target function has discrete output values: The decision tree for the concept PlayTennis
assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods can
also learn functions with more than two possible output values.
3. Disjunctive descriptions may be required: As noted above, decision trees naturally
represent disjunctive expressions.

4. The training data may contain errors: Decision tree learning methods are robust to
errors, both errors in classifications of the training examples and errors in the attribute
values that describe these examples.

5. The training data may contain missing attribute values: Decision tree methods can be
used even when some training examples have unknown values (e.g., if the Humidity of the
day is known for only some of the training examples).

• Many practical problems such as learning to classify medical patients by their disease,
equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting
on payments, etc. have been found to fit these characteristics.

• Such problems, in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as classification problems.
4. Basic Decision Tree Learning Algorithm
● Decision Tree learning algorithms employ top-down greedy search through the space of
possible solutions.

● A general Decision Tree learning algorithm:


1. Perform a statistical test of each attribute to determine how well it classifies the training
examples when considered alone;
2. Select the attribute that performs best and use it as the root of the tree;
3. To decide the descendant node down each branch of the root (parent node), sort the
training examples according to the value related to the current branch and repeat the
process described in steps 1 and 2.

● The ID3 algorithm is one of the most commonly used Decision Tree learning algorithms, and it
applies this general approach to learning the decision tree.
ID3 Algorithm (Iterative Dichotomiser 3)
• ID3 algorithm uses “Information Gain” to determine how informative an attribute is (i.e.,
how well an attribute classifies the training examples).

• Information Gain is based on a measure called Entropy, which characterizes the
impurity of a collection of examples S (i.e., the more impure S is, the higher Entropy(S)):
Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖,
where p⊕ and p⊖ are the proportions of positive and negative examples in S, respectively,
and 0 · log2 0 is defined to be 0.

• Entropy(S) = 0 if S contains only positive or only negative examples:

p⊕ = 1, p⊖ = 0, Entropy(S) = – 1 · log2 1 – 0 · log2 0 = – 1 · 0 – 0 = 0

• Entropy(S) = 1 if S contains equal numbers of positive and negative examples:

p⊕ = ½, p⊖ = ½, Entropy(S) = – ½ · (–1) – ½ · (–1) = 1

• In the case that the target attribute can take n values:
Entropy(S) ≡ – ∑i pi log2 pi, i = 1..n
where pi is the proportion of examples in S having the target attribute value i.
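
As a quick check of the definition, the entropy of a collection can be computed from its class counts. The function below is a minimal sketch; the name `entropy` and the choice to pass a list of per-class counts are assumptions made for this illustration. Classes with a zero count are skipped, which matches the convention that 0 · log2 0 is treated as 0.

```python
from math import log2


def entropy(class_counts):
    """Entropy (in bits) of a collection, given the number of examples per class."""
    total = sum(class_counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0.
    return sum(-(c / total) * log2(c / total) for c in class_counts if c > 0)


print(entropy([7, 7]))             # equal split of two classes
print(entropy([14, 0]))            # pure collection
print(round(entropy([9, 5]), 2))   # 9 positive and 5 negative examples
```

The same function covers the n-class case, since it simply sums over however many counts are supplied.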
INFORMATION GAIN
Gain(S, A): expected reduction in entropy caused by partitioning the examples according to
this attribute A:
Gain(S, A) ≡ Entropy(S) – ∑v∈Values(A) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for A,
Sv is the subset of S for which A has the value v, and
|S| is the size of S.

Consider the following example. Target concept: PlayTennis

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Entropy Calculation:

Compute the entropy of the play-tennis example:

• We have two classes, YES and NO (i.e., number of classes c = 2)
• We have 14 instances, with 9 classified as YES and 5 as NO
• – p(YES) log2 p(YES) = – (9/14) log2 (9/14) = 0.41
• – p(NO) log2 p(NO) = – (5/14) log2 (5/14) = 0.53
• E(S) = 0.41 + 0.53 = 0.94

Information gain calculation for Attribute “wind”:

Compute the information gain for the attribute wind in the play-tennis data set:
• |S| = 14, attribute wind
• Two values: weak and strong
• |S_weak| = 8
• |S_strong| = 6

Now, let us determine E(S_weak):
• Instances = 8, YES = 6, NO = 2
• [6+, 2-]
• E(S_weak) = -(6/8) log2 (6/8) - (2/8) log2 (2/8) = 0.81

Now, let us determine E(S_strong):
• Instances = 6, YES = 3, NO = 3
• [3+, 3-]
• E(S_strong) = -(3/6) log2 (3/6) - (3/6) log2 (3/6) = 1.0

Going back to the information gain computation for the attribute wind:

Gain(S, wind) = E(S) - (|S_weak| / |S|) E(S_weak) - (|S_strong| / |S|) E(S_strong)
= 0.94 - (8/14) 0.81 - (6/14) 1.00
= 0.048
Information gain calculation for Attribute “humidity”:

|S| = 14, attribute humidity

Two values: high and normal
|S_high| = 7
|S_normal| = 7

For value high: [3+, 4-], E(S_high) = -(3/7) log2 (3/7) - (4/7) log2 (4/7) = 0.98

For value normal: [6+, 1-], E(S_normal) = -(6/7) log2 (6/7) - (1/7) log2 (1/7) = 0.59

Gain(S, humidity) = E(S) - (|S_high| / |S|) E(S_high) - (|S_normal| / |S|) E(S_normal)
= 0.94 - (7/14) 0.98 - (7/14) 0.59
= 0.15

So, humidity provides GREATER information gain than wind.
Information gain calculation for Attributes “outlook” and “temperature”:
Attribute outlook:

Gain(S, outlook) = 0.25

Attribute temperature:

Gain(S, temperature) = 0.03

Summary
• Gain(S, outlook) = 0.25
• Gain(S, temperature) = 0.03
• Gain(S, humidity) = 0.15
• Gain(S, wind) = 0.048

So, the attribute with the highest information gain is OUTLOOK; therefore
use outlook as the root node.
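
The four gains can be recomputed directly from the 14 training examples. The sketch below is illustrative only: the helper names (`entropy`, `gain`), the constants and the text-based encoding of the table are assumptions made for this example. The printed values should agree with the summary up to rounding.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
# One dict per training example D1..D14.
DATA = [dict(zip(ATTRIBUTES + [TARGET], line.split()))
        for line in RAW.strip().splitlines()]


def entropy(rows):
    """Entropy of the target attribute over a list of examples."""
    counts = Counter(row[TARGET] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attribute):
    """Expected reduction in entropy from partitioning rows on the attribute."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {a: gain(DATA, a) for a in ATTRIBUTES}
for attribute, g in gains.items():
    print(f"Gain(S, {attribute}) = {g:.3f}")
print("Root:", max(gains, key=gains.get))
```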
Decision Tree – Next Level

• After determining OUTLOOK as the root node, we need to expand the tree. Consider the
Sunny branch, S_sunny = {D1, D2, D8, D9, D11} = [2+, 3-]:

E(S_sunny) = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.97
• Gain(S_sunny, Humidity) = 0.97 – (3/5) 0.0 – (2/5) 0.0 = 0.97
• Gain(S_sunny, Wind) = 0.97 – (3/5) 0.918 – (2/5) 1.0 = 0.019
• Gain(S_sunny, Temperature) = 0.97 – (2/5) 0.0 – (2/5) 1.0 – (1/5) 0.0 = 0.57

Humidity has the highest information gain, so use this attribute to split the Sunny branch.

Continue ….. and Final DT

• Continue until all the examples are classified
– For the Rain branch, compute Gain(S_rain, Wind), Gain(S_rain, Humidity), Gain(S_rain, Temp)
– Gain(S_rain, Wind) is the highest

• All leaf nodes are associated with training examples from the same class (entropy = 0)
• The attribute temperature is not used
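
Putting the pieces together, the whole top-down procedure fits in a short recursive function. This is a minimal sketch of ID3 under simplifying assumptions (discrete attributes only, no pruning, no missing values); the function names and the nested-dictionary tree encoding are choices made for this illustration rather than anything prescribed by the slides.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
DATA = [dict(zip(ATTRIBUTES + [TARGET], line.split()))
        for line in RAW.strip().splitlines()]


def entropy(rows):
    counts = Counter(row[TARGET] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attribute):
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


def id3(rows, attributes):
    """Return a leaf label, or {best_attribute: {value: subtree}}."""
    labels = Counter(row[TARGET] for row in rows)
    if len(labels) == 1 or not attributes:
        # Pure node, or nothing left to test: predict the majority class.
        return labels.most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))   # greedy choice
    remaining = [a for a in attributes if a != best]      # no backtracking
    return {best: {value: id3([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


tree = id3(DATA, ATTRIBUTES)
print(tree)
```

On this data set the learned tree tests Outlook at the root, Humidity under Sunny and Wind under Rain, and never uses Temperature.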
5. Hypothesis space search in decision tree learning
● The hypothesis space searched by ID3 is the set of possible decision trees.
● ID3 performs a simple-to-complex, hill-climbing search through the hypothesis space,
starting from the empty tree and then progressing to more elaborate hypotheses in search of
a tree that correctly classifies the training data.
● Features of the Hypothesis space search in decision tree learning:
• Complete hypothesis space: any finite discrete-valued function can be expressed.
• Incomplete search: searches incompletely through the hypothesis space until the tree is
consistent with the data.
• Single hypothesis: only one current hypothesis (simplest one) is maintained.
• No backtracking: once an attribute is selected, this cannot be changed. Problem: might not
be the optimum solution (globally).
• Full training set at each step: attributes are selected by computing information gain on
all the training examples available at that step, rather than one example at a time.
Advantage: Robustness to errors. Problem: Non-incremental.
6. Inductive Bias in Decision Tree Learning
● Inductive bias is the set of assumptions that, together with the training data, deductively justify
the classifications assigned by the learner to future instances.
● The inductive bias of ID3 is the basis by which it chooses one of the
consistent hypotheses over the others.
● The ID3 search strategy
(a) selects in favor of shorter trees over longer ones
(b) selects trees that place the attributes with highest information gain closest to the root.

a. Restriction Biases and Preference Biases

Consider inductive bias exhibited by ID3 and by the CANDIDATE-ELIMINATION algorithm.

● ID3 considers a complete hypothesis space (i.e., one capable of expressing any finite discrete
valued function) but it searches incompletely through this complete hypothesis space, from
simple to complex hypotheses, until its termination condition is met.
● The CANDIDATE-ELIMINATION algorithm considers an incomplete hypothesis space (i.e., one that
can express only a subset of the potentially teachable concepts) but it searches this space
completely, finding every hypothesis consistent with the training data.

Thus,
Preference bias: The inductive bias of ID3 is a preference for certain hypotheses over others
(e.g., for shorter hypotheses), with no hard restriction on the hypotheses. This form of bias is
typically called a preference bias (or, alternatively, a search bias).

Restriction bias: The bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical
restriction on the set of hypotheses considered. This form of bias is typically called a restriction
bias (or, alternatively, a language bias).

Typically, a preference bias is more desirable than a restriction bias, because it allows the learner
to work within a complete hypothesis space that is assured to contain the unknown target
function.
b. Why Prefer Short Hypotheses?

William of Occam was one of the first to discuss this question, around the year 1320, so this bias
often goes by the name of Occam's razor.

Argument in favour:
● There are fewer short hypotheses than long hypotheses
○ a short hypothesis that fits the data is unlikely to be a coincidence
○ a long hypothesis that fits the data might be a coincidence

Argument opposed:
● There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of
nodes that use attributes beginning with "Z"
● The size of a hypothesis depends on the representation used internally by the learner, so two
learners using different representations can produce different hypotheses from the same
training examples
Example: two learners, both applying Occam's razor, would generalize in different ways if one used
the XYZ attribute to describe its examples and the other used only the attributes Outlook, Temperature,
Humidity, and Wind.
7. Issues in Decision Tree Learning
Practical issues in learning decision trees include:
○ Determining how deeply to grow the decision tree
○ Handling continuous attributes
○ Choosing an appropriate attribute selection measure
○ Handling training data with missing attribute values
○ Handling attributes with differing costs, and
○ Improving computational efficiency.

Let's discuss how these issues are addressed by extending the basic ID3 algorithm.

1. Avoiding Overfitting the Data


Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if
there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the
training examples, but h' has smaller error than h over the entire distribution of instances.
Example: When there is noise in the data, or when the number of training examples is
too small to produce a representative sample of the true target function, ID3 can
produce trees that overfit the training examples.

● How overfitting arises in ID3

As ID3 adds new nodes to grow the decision tree, the accuracy of the tree
measured over the training examples increases monotonically.

Here, it can be seen that the accuracy of the tree over the training data keeps increasing
once the tree size exceeds approximately 25 nodes, while its accuracy over the test
examples decreases.
● How can we prevent overfitting? Here are some common heuristics:
• Don't try to fit all examples; stop growing the tree before the training set is perfectly classified.
• Fit all examples, then post-prune the resultant tree.

● Methods to use a validation set to prevent overfitting:


a. Reduced Error Pruning
b. Rule Post-Pruning

a. Reduced Error Pruning:

○ Consider each of the decision nodes to be a candidate for pruning. Pruning a node
means replacing the subtree rooted at that node with a leaf labelled with the most
common class of the training examples associated with that node

○ Nodes are removed only if the resulting pruned tree performs no worse than
the original over the validation set
○ Nodes are pruned iteratively by choosing the node whose removal most increases
the accuracy of the decision tree over the validation set
○ Continue until further pruning is harmful (i.e., it decreases accuracy over the validation set)
○ Here the validation set used for pruning is distinct from both the training and test
sets
○ Disadvantage: when data is limited, withholding part of it for the validation set reduces
even further the number of examples available for training

The impact of reduced-error pruning on the accuracy of the decision tree is illustrated
here. The additional line in the figure shows accuracy over the test examples as the
tree is pruned. When pruning begins, the tree is at its maximum size and lowest
accuracy over the test set. As pruning proceeds, the number of nodes is reduced
and accuracy over the test set increases.
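
The pruning loop described above might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: each internal node is assumed to be a dict with the keys `attr`, `children` and `majority` (the most common training class at that node, assumed to have been recorded while growing the tree), a leaf is a plain label, and the small tree and validation set are hypothetical.

```python
import copy


def classify(tree, x):
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["children"].get(x[tree["attr"]], tree["majority"])
    return tree


def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)


def internal_paths(tree, path=()):
    """Yield the branch-value path leading to every decision node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_paths(child, path + (value,))


def pruned_copy(tree, path):
    """Copy of tree in which the node at `path` is replaced by its majority leaf."""
    if not path:
        return tree["majority"]
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return new_tree


def reduced_error_prune(tree, validation):
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < accuracy(tree, validation):
            break  # every possible pruning step would hurt: stop
        tree = best
    return tree


# Hypothetical tree whose Rain subtree overfits by testing Temperature.
tree = {"attr": "Outlook", "majority": "Yes", "children": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "children": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Temperature", "majority": "Yes",
             "children": {"Mild": "Yes", "Cool": "No"}}}}
validation = [
    ({"Outlook": "Sunny", "Humidity": "High", "Temperature": "Hot"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal", "Temperature": "Cool"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Temperature": "Cool"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "High", "Temperature": "Mild"}, "Yes"),
    ({"Outlook": "Overcast", "Humidity": "High", "Temperature": "Hot"}, "Yes"),
]
pruned = reduced_error_prune(tree, validation)
print(pruned["children"]["Rain"], accuracy(pruned, validation))
```

Because a pruned tree is accepted whenever it is no worse than the current one, ties are resolved in favour of the smaller tree.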
b. Rule Post-Pruning

● Rule post-pruning is a method for finding high-accuracy hypotheses. It involves the following steps:
○ Infer the decision tree from the training set, growing the tree until the training data is fit
as well as possible and allowing overfitting to occur
○ Convert the learned tree into an equivalent set of rules by creating one rule for each path
from the root to a leaf node
○ Prune each rule by removing any preconditions whose removal improves its estimated
accuracy
○ Sort the pruned rules by their estimated accuracy and consider them in this sequence
when classifying subsequent instances
● Structure of the rules:
○ One rule is generated for each leaf node in the tree
○ Antecedent: each attribute test along the path from the root to the leaf
○ Consequent: the classification at the leaf
○ Pruning a rule removes any antecedent whose removal does not worsen its estimated accuracy
2. Incorporating Continuous-Valued Attributes

● Continuous valued attributes can be partitioned into a discrete number of disjoint intervals
and then membership can be tested over these intervals.

Example: If the learning task PlayTennis includes a continuous-valued attribute “Temperature” in the
range 40-90, then testing each exact value of “Temperature” is a bad choice for classification: such a
test alone may perfectly classify the training examples, and therefore promise the highest
information gain, while remaining a poor predictor on the test set.

● The solution to this problem is to classify based not on the actual temperature, but on
dynamically determined intervals within which the temperature falls.
● For example, introduce boolean attributes T ≤ a, a < T ≤ b, b < T ≤ c and T > c in place of the
real-valued T, where a, b and c are thresholds chosen from the training data.
● In the PlayTennis example, there are two candidate thresholds, corresponding to the values of
Temperature at which the value of PlayTennis changes: (48 + 60)/2, and (80 + 90)/2.

● The information gain can then be computed for each of the candidate attributes,
Temperature>54 and Temperature>85, and the best can be selected (Temperature>54). This
dynamically created boolean attribute can then compete with the other discrete-valued
candidate attributes available for growing the decision tree.
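
The threshold selection can be reproduced with a short script. The six (Temperature, PlayTennis) pairs below are an assumption: they are the values commonly used with this textbook example and are consistent with the two candidate thresholds quoted above, but they are not listed on the slide. The function names are likewise hypothetical.

```python
from math import log2

# Assumed training values (not given on the slide).
temperatures = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]


def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(c) / n) * log2(labels.count(c) / n)
               for c in set(labels))


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b]


def gain_for_threshold(values, labels, threshold):
    """Information gain of the boolean attribute (value > threshold)."""
    below = [l for v, l in zip(values, labels) if v <= threshold]
    above = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy(labels)
            - len(below) / n * entropy(below)
            - len(above) / n * entropy(above))


thresholds = candidate_thresholds(temperatures, play_tennis)
best = max(thresholds,
           key=lambda t: gain_for_threshold(temperatures, play_tennis, t))
print(thresholds, best)
```

Only class-boundary midpoints are tried, which keeps the number of candidate tests small.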

3. Alternative Measures for Selecting Attributes

● The information gain measure has a bias that favors attributes with many values over those
with only a few.
For example: If you imagine an attribute Date with a unique value for each training example,
then Gain(S, Date) will yield H(S) = Entropy(S), since each subset Sv then contains a single
example and has zero entropy.

● Obviously no other attribute can do better. This will result in a very broad tree of depth 1.
● To guard against this, GainRatio(S, A) can be used instead of Gain(S, A):

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

where,

SplitInformation(S, A) ≡ – ∑v∈Values(A) P(Sv) log2 P(Sv)

with P(Sv) = |Sv| / |S| estimated by relative frequency as before
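
A small numeric sketch shows how the split information penalizes a many-valued attribute. The helper names are assumptions for this illustration; the gains 0.25 and 0.94 are the rounded values from the worked example, and the Date attribute with 14 unique values is the hypothetical attribute mentioned above.

```python
from math import log2


def split_information(subset_sizes):
    """Entropy of S with respect to the values of the attribute itself."""
    total = sum(subset_sizes)
    return sum(-(n / total) * log2(n / total) for n in subset_sizes if n > 0)


def gain_ratio(gain, subset_sizes):
    # Undefined when the attribute takes a single value (split information 0).
    return gain / split_information(subset_sizes)


# Outlook splits the 14 examples into Sunny/Overcast/Rain = 5/4/5.
outlook_ratio = gain_ratio(0.25, [5, 4, 5])
# Hypothetical Date attribute: 14 subsets of one example each, Gain = E(S).
date_ratio = gain_ratio(0.94, [1] * 14)

print(f"SplitInformation(Outlook) = {split_information([5, 4, 5]):.3f}")
print(f"SplitInformation(Date)    = {split_information([1] * 14):.3f}")
print(f"GainRatio(Outlook) = {outlook_ratio:.3f}")
print(f"GainRatio(Date)    = {date_ratio:.3f}")
```

With these numbers Date's score is divided by about 3.8 and Outlook's by about 1.6, so the gap narrows sharply, although on a set this small Date still scores somewhat higher.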

4. Handling Training Examples with Missing Attribute Values

● What happens if some of the training examples contain one or more "?", meaning "value not
known", instead of the actual attribute values?
● Here are some common ad hoc solutions:
• Substitute "?" by the most common value in that column.
• Substitute "?" by the most common value among all training examples that have been sorted into the
tree at that node.
• Substitute "?" by the most common value among all training examples that have been sorted into the
tree at that node with the same classification as the incomplete example.
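
The last two strategies differ only in which examples are counted, as the sketch below illustrates. The function name `most_common_value` and the six examples assumed to have reached the node are hypothetical.

```python
from collections import Counter


def most_common_value(rows, attribute, target=None, label=None):
    """Most common known value of the attribute among the rows at a node.

    If target and label are given, only rows with that classification count.
    """
    values = [row[attribute] for row in rows
              if row[attribute] != "?"
              and (label is None or row[target] == label)]
    return Counter(values).most_common(1)[0][0]


# Hypothetical examples sorted to one node; the last has an unknown Humidity.
node_rows = [
    {"Humidity": "High", "PlayTennis": "No"},
    {"Humidity": "High", "PlayTennis": "No"},
    {"Humidity": "High", "PlayTennis": "Yes"},
    {"Humidity": "Normal", "PlayTennis": "Yes"},
    {"Humidity": "Normal", "PlayTennis": "Yes"},
    {"Humidity": "?", "PlayTennis": "Yes"},
]
incomplete = node_rows[-1]

by_node = most_common_value(node_rows, "Humidity")
by_class = most_common_value(node_rows, "Humidity",
                             target="PlayTennis",
                             label=incomplete["PlayTennis"])
print(by_node, by_class)  # High Normal
```

Here the two strategies disagree: High is most common at the node overall, but Normal is most common among the examples that share the incomplete example's class.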
5. Handling attributes with differing costs

● In some learning tasks the instance attributes may have associated costs.

For example, in learning to classify medical diseases, patients can be described in terms of
attributes such as Temperature, Biopsy-Result, Pulse, Blood Test-Results, etc. These attributes vary
significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such
tasks, we would prefer decision trees that use low-cost attributes where possible, relying on
high-cost attributes only when needed to produce reliable classifications.

● This can be done by dividing the Gain by the cost of the attribute, Cost(A), so that lower-cost
attributes would be preferred.
● Use a CostedGain(S, A), which is defined along the lines of:

CostedGain(S, A) = (2^Gain(S, A) – 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] may be a constant that determines the relative importance of cost
versus information gain.
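
The effect of weighting gain against cost can be explored numerically. The sketch below assumes a measure of the form (2^Gain – 1) / (Cost + 1)^w with w in [0, 1], one of the cost-sensitive measures described in the literature. The function names, the attribute gains and the costs are all hypothetical values chosen for illustration.

```python
def gain_per_cost(gain, cost):
    """Simplest option: information gain divided by attribute cost."""
    return gain / cost


def costed_gain(gain, cost, w=1.0):
    """Assumed measure: (2^gain - 1) / (cost + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w


# Hypothetical (gain, cost) pairs for a cheap and an expensive attribute.
attributes = {"Temperature": (0.3, 1.0), "Biopsy-Result": (0.6, 50.0)}

for w in (0.0, 0.5, 1.0):
    scores = {name: costed_gain(g, c, w) for name, (g, c) in attributes.items()}
    print(f"w = {w}: choose {max(scores, key=scores.get)}")
```

With w = 0 the cost is ignored and the more informative attribute wins; as w grows, the cheaper attribute is preferred even though its gain is lower.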
End of Module 2
