UNIT-1
Module-2
                             DECISION TREE LEARNING
2.1 INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree. Learned trees can also be re-represented as sets
of if-then rules to improve human readability. These learning methods are among the most popular
of inductive inference algorithms and have been successfully applied to a broad range of tasks from
learning to diagnose medical cases to learning to assess credit risk of loan applicants.
2.2 DECISION TREE REPRESENTATION
Decision trees classify instances by sorting them down the tree from the root to some leaf node,
which provides the classification of the instance. Each node in the tree specifies a test of some
attribute of the instance, and each branch descending from that node corresponds to one of the
possible values for this attribute. An instance is classified by starting at the root node of the tree,
testing the attribute specified by this node, then moving down the tree branch corresponding to the
value of the attribute in the given example. This process is then repeated for the subtree rooted at
the new node.
                                       Fig 2.1 Decision Tree
Figure 2.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings
according to whether they are suitable for playing tennis.
           (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
For example, the instance would be sorted down the leftmost branch of this decision tree and would
therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no). In
general, decision trees represent a disjunction of conjunctions of constraints on the attribute values
of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests,
and the tree itself to a disjunction of these conjunctions. For example, the decision tree shown in
Figure 2.1 corresponds to the expression
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING:
Decision tree learning is generally best suited to problems with the following characteristics:
• Instances are represented by attribute-value pairs - Instances are described by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree
learning is when each attribute takes on small number of disjoint possible values (e.g., Hot, Mild,
Cold).
• The target function has discrete output values - The decision tree in Figure 2.1assigns a boolean
classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning
functions with more than two possible output values.
• Disjunctive descriptions may be required - Decision trees naturally represent disjunctive
expressions.
• The training data may contain errors - Decision tree learning methods are robust to errors, both
errors in classifications of the training examples and errors in the attribute values that describe these
examples.
 The training data may contain missing attribute values - Decision tree methods can be used
  even when some training examples have unknown values (e.g., if the Humidity of the day is
  known for only some of the training).
 Decision tree learning has therefore been applied to problems such as learning to classify medical
 patients by their disease, equipment malfunctions by their cause, and loan applicants by their
 likelihood of defaulting on payments. Such problems, in which the task is to classify examples
 into one of a discrete set of possible categories, are often referred to as classification problems.
THE BASIC DECISION TREE LEARNING ALGORITHM – ID3:
Most algorithms that have been developed for learning decision trees are variations on a core
algorithm that employs a top-down, greedy search through the space of possible decision trees.
This approach is exemplified by the ID3 algorithm.
Basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the
question "which attribute should be tested at the root of the tree?' To answer this question, each
instance attribute is evaluated using a statistical test to determine how well it alone classifies the
training examples. The best attribute is selected and used as the test at the root node of the tree. A
descendant of the root node is then created for each possible value of this attribute, and the training
examples are sorted to the appropriate descendant node (Leaf node). The entire process is then
repeated using the training examples associated with each descendant node to select the best
attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree,
in which the algorithm never backtracks to reconsider earlier choices.
Which Attribute Is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.
We would like to select the attribute that is most useful for classifying examples. What is a good
quantitative measure of the worth of an attribute? A statistical property, called information gain,
that measures how well a given attribute separates the training examples according to their target
classification. ID3 uses this information gain measure to select among the candidate attributes at
each step while growing the tree.
ENTROPY MEASURES HOMOGENEITY OF EXAMPLES:
In order to define information, gain precisely, we begin by defining a measure commonly used in
information theory, called entropy.
Definition of entropy:
It characterizes the impurity of an arbitrary collection of examples. Given a collection S samples,
containing positive and negative examples of some target concept, the entropy of S relative to this
Boolean classification is where p+ is the proportion of positive examples in S and, p- is the
proportion of negative examples in S.
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9
positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of
data).
One interpretation of entropy from information theory is that it specifies the minimum number of
bits of information needed to encode the classification of an arbitrary member of S (i.e., a member
of S drawn at random with uniform probability). For example, if p, is 1, the receiver knows the
drawn example will be positive, so no message need be sent, and the entropy is zero.
INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY
        Given entropy as a measure of the impurity in a collection of training examples, we can
now define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy caused
by partitioning the examples according to this attribute. More precisely, the information gain,
Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as:
where Values(A) is the set of all possible values for attribute A, and S, is the subset of S for which
attribute A has value
To illustrate the operation of ID3, consider the learning task represented by the training examples
of Table 3.2. Here the target attribute PlayTennis, which can have values yes or no for different
Saturday mornings, is to be predicted based on other attributes of the morning in question. Consider
the first step through the algorithm, in which the topmost node of the decision tree is created. Which
attribute should be tested first in the tree? ID3 determines the information gain for each candidate
attribute (i.e., Outlook, Temperature, Humidity, and Wind), then gain for two of these attributes
is shown in Figure 3.3
(Please refer to the problems solved in the class).
The information gain values for all four attributes are
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
According to the information gain measure, the Outlook attribute provides the best prediction of
the target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the
decision attribute for the root node, and branches are created below the root for each of its possible
values (i.e. Sunny, Overcast, and Rain).
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING:
As with other inductive learning methods, ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is the set
of possible decision trees. ID3 performs a simple-to complex, hill-climbing search through this
hypothesis space, beginning with the empty tree, then considering progressively more elaborate
hypotheses in search of a decision tree that correctly classifies the training data. The evaluation
function that guides this hill-climbing search is the information gain measure.
Capabilities and Limitations of ID3 Algorithm:
       ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued
        functions, relative to the available attributes. ID3 avoids one of the major risks of methods
        that search incomplete hypothesis spaces.
       ID3 maintains only a single current hypothesis as it searches through the space of decision
        trees.
       ID3 in its pure form performs no backtracking in its search
       ID3 uses all training examples at each step in the search to make statistically based
        decisions regarding how to refine its current hypothesis.
       ID3 can be easily extended to handle noisy training data by modifying its termination
        criterion to accept hypotheses that imperfectly fit the training data.
INDUCTIVE BIAS IN DECISION TREE LEARNING:
The ID3 search strategy (a) selects in favour of shorter trees over longer ones, and (b) selects trees
that place the attributes with highest information gain closest to the root.
Approximate inductive bias of ID3: Shorter trees are preferred over larger trees.
A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer
trees. Trees that place high information gain attributes close to the root are preferred over those that
do not.
Why Prefer Short Hypotheses?
Is ID3's inductive bias favouring shorter decision trees a sound basis for generalizing beyond the
training data? Philosophers and others have debated this question for centuries, and the debate
remains unresolved to this day. William of Occam was one of the first to discuss the question,
around the year 1320, so this bias often goes by the name of Occam's razor.
Occam's razor: Prefer the simplest hypothesis that fits the data.
ISSUES IN DECISION TREE LEARNING
Practical issues in learning decision trees include determining how deeply to grow the decision
tree, handling continuous attributes, choosing an appropriate attribute selection measure, handling
training data with missing attribute values, handling attributes with differing costs, and improving
computational efficiency.
        1. Avoiding Overfitting the Data - We will say that a hypothesis overfits the training
            examples if some other hypothesis that fits the training examples less well actually
            performs better over the entire distribution of instances (i.e., including instances
            beyond the training set).
        Given a hypothesis space H, a hypothesis h ɛ H is said to overfit the training data if there
        exists some alternative hypothesis h' ɛ H, such that h has smaller error than h' over the
        training examples, but h' has a smaller error than h over the entire distribution of
        instances.
       2. REDUCED ERROR PRUNING -
               How exactly might we use a validation set to prevent overfitting? One approach,
       called reduced-error pruning is to consider each of the decision nodes in the tree to be
       candidates for pruning. Pruning a decision node consists of removing the subtree rooted at
       that node, making it a leaf node, and assigning it the most common classification of the
       training examples affiliated with that node. Nodes are removed only if the resulting pruned
       tree performs no worse than-the original over the validation set.
       As pruning proceeds, the number of nodes is reduced and accuracy over the test set
       increases. Here, the available data has been split into three subsets: the training examples,
       the validation examples used for pruning the tree, and a set of test examples used to provide
       an unbiased estimate of accuracy over future unseen examples.
       Why convert the decision tree to rules before pruning? There are three main advantages.
    • Converting to rules allows distinguishing among the different contexts in which a
      decision node is used.
     Converting to rules removes the distinction between attribute tests that occur near
       the root of the tree and those that occur near the leaves.
       Converting to rules improves readability. Rules are often easier for people to
        understand.
3. Incorporating Continuous-Valued Attributes
    Our initial definition of ID3 is restricted to attributes that take on a discrete set of values.
First, the target attributes whose value is predicted by the learned tree must be discrete
valued. Second, the attributes tested in the decision nodes of the tree must also be discrete
valued.
4. Alternative Measures for Selecting Attributes
       There is a natural bias in the information gain measure that favors attributes with
many values over those with few values. As an extreme example, consider the attribute
Date, which has a very large number of possible values (e.g., March 4, 1979).
5. Handling Training Examples with Missing Attribute Values
        In certain cases, the available data may be missing values for some attributes. For
example, in a medical domain in which we wish to predict patient outcome based on various
laboratory tests, it may be that the lab test Blood-Test-Result is available only for a subset
of the patients. In such cases, it is common to estimate the missing attribute value based on
other examples for which this attribute has a known value.
6. Handling Attributes with Differing Costs
         In some learning tasks the instance attributes may have associated costs. For
example, in learning to classify medical diseases we might describe patients in terms of
attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc. These
attributes vary significantly in their costs, both in terms of monetary cost and cost to patient
comfort. In such tasks, we would prefer decision trees that use low-cost attributes where
possible, relying on high-cost attributes only when needed to produce reliable
classifications.