DATA MINING.
COURSE INSTRUCTOR : Sheza Naeem
Lecture# 19
Decision trees and decision rules are data-mining methodologies applied in many real-world applications
as a powerful solution to classification problems. Therefore, at the beginning, let us briefly summarize
the basic principles of classification. In general,
classification is a process of learning a function that maps a data item into one of several predefined
classes. Every classification based on inductive-learning algorithms is given as input a set of samples
that consist of vectors of attribute values (also called
feature vectors) and a corresponding class. The goal of learning is to create a classification model,
known as a classifier, which will predict, based on the values of its available input attributes, the class for
some entity (a given sample). In other words,
classification is the process of assigning a discrete label value (class) to an unlabeled record, and a
classifier is a model (a result of classification) that predicts one attribute (the class of a sample) when the
other attributes are given. In doing so, samples
are divided into predefined groups. For example, a simple classification might group customer billing
records into two specific classes: those who pay their bills within 30 days and those who take longer
than 30 days to pay. Different classification methodologies are applied today in almost every discipline
where the task of classification, because of the large amount of data, requires automation of the
process. Examples of classification methods used as a part of data-mining applications include
classifying trends in financial markets and identifying objects in large image databases. A more
formalized approach to classification problems is given through its graphical interpretation. A data set
with n features may be thought of as a collection of discrete points (one per example) in an n-
dimensional space. A classification rule is a hypercube that contains one or more of these points.
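To make this setting concrete, the following minimal sketch in Python assumes the scikit-learn library is available; the billing attributes, values, and class labels are invented for illustration and are not part of the lecture material.

from sklearn.tree import DecisionTreeClassifier

# Each row is a feature vector describing one customer billing record
# (hypothetical attributes: amount owed, number of previous late payments).
X = [[120.0, 0],
     [340.0, 3],
     [ 80.0, 1],
     [500.0, 4]]
# Predefined class labels for the same records.
y = ["within_30_days", "longer", "within_30_days", "longer"]

# Learning creates a classification model (a classifier) from the labeled samples.
clf = DecisionTreeClassifier().fit(X, y)

# The classifier assigns a class to a new, unlabeled record.
print(clf.predict([[200.0, 2]]))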
DECISION TREES
A particularly efficient method for producing classifiers from data is to generate a
decision tree. The decision-tree representation is the most widely used logic method. There is a large
number of decision-tree induction algorithms described primarily in the machine-learning and applied-
statistics literature. They are supervised learning methods that construct decision trees from a set of
input–output samples. Decision-tree induction is an efficient nonparametric method for classification and regression. A
decision tree is a hierarchical model for supervised learning where the local region is identified in a
sequence of recursive splits through decision nodes with test functions. A decision tree is also a
nonparametric model in the sense that we do not assume any parametric form for the class density.
A typical decision-tree learning system adopts a top-down strategy that searches for a
solution in a part of the search space. It guarantees that a simple, but not necessarily the simplest,
tree will be found. A decision tree consists of nodes where attributes are tested. In a univariate tree,
for each internal node, the test uses only one of the attributes for testing. The outgoing branches of a
node correspond to all the possible outcomes of the test at the node. A simple decision tree for
classification of samples with two input attributes X and Y is given in Figure 6.2.
All samples with feature values X > 1 and Y = B belong to Class2, while the samples with values X < 1
belong to Class1, whatever the value for feature Y. The samples, at a nonleaf node in the tree
structure, are thus partitioned along the branches, and each child node gets its corresponding subset of
samples. Decision trees that use univariate splits have a simple representational form, making it
relatively easy for the user to understand the inferred model; at the same time, they represent a
restriction on the expressiveness of the model. In general, any restriction on a particular tree
representation can significantly restrict the functional form and thus the approximation power of the
model. A well-known tree-growing algorithm for generating decision trees based on univariate splits is
Quinlan’s ID3, with an extended version called C4.5. Greedy search methods, which involve growing and
pruning decision-tree structures, are typically employed in these algorithms to explore the exponential
space of possible models.
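The univariate tree of Figure 6.2 can be written out directly as nested tests. The sketch below follows only the splits stated above (samples with X > 1 and Y = B go to Class2, samples failing the X > 1 test go to Class1); the leaf returned for the remaining values of Y is a placeholder, since those branches are not spelled out here.

def classify(x, y):
    # Root node: univariate test on attribute X only.
    if x > 1:
        # Internal node: test on attribute Y; one branch per possible outcome.
        if y == "B":
            return "Class2"
        # Leaves for the other values of Y are not specified in the text above.
        return "Class1"
    # Samples that fail the X > 1 test belong to Class1, whatever the value of Y.
    return "Class1"

print(classify(2.5, "B"))  # Class2
print(classify(0.3, "C"))  # Class1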
The ID3 algorithm starts with all the training samples at the root node of the tree. An attribute is
selected to partition these samples. For each value of the attribute, a branch is created, and the
corresponding subset of samples that have the attribute
value specified by the branch is moved to the newly created child node. The algorithm is applied
recursively to each child node until all samples at a node are of one class.
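The recursive partitioning just described can be sketched compactly as follows. The data layout (a list of dictionaries with a "class" key) and the names build_tree and select_attribute are assumptions made for illustration; attribute selection is passed in as a function so that the entropy-based choice discussed next can be plugged in.

def build_tree(samples, attributes, select_attribute):
    classes = {s["class"] for s in samples}
    # Stop once all samples at the node are of one class: the node becomes a leaf.
    if len(classes) == 1:
        return classes.pop()
    if not attributes:
        # No attributes left to test; return the majority class at this node.
        return max(classes, key=lambda c: sum(s["class"] == c for s in samples))
    # Select an attribute to partition the samples at this node.
    attr = select_attribute(samples, attributes)
    tree = {"attribute": attr, "branches": {}}
    # Create a branch per attribute value and move the matching subset
    # of samples to the newly created child node, then recurse.
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree["branches"][value] = build_tree(subset, remaining, select_attribute)
    return tree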
Every path from the root to a leaf in the decision tree represents a classification rule. Note that the critical
decision in such a top-down decision-tree-generation algorithm is the choice of attribute at a node.
Attribute selection in ID3 and C4.5 algorithms is based on minimizing an information entropy measure
applied to the examples at a node. The approach based on information entropy aims at minimizing
the number of tests needed to classify a sample in the database. The attribute selection part of ID3
is based on the assumption that the complexity of the decision tree is strongly related to the amount of
information conveyed by the value of the given attribute. An information-based heuristic selects the
attribute providing the highest information gain,
i.e., the attribute that minimizes the information needed in the resulting subtree to classify the sample.
An extension of ID3 is the C4.5 algorithm, which extends the domain of classification from categorical
attributes to numeric ones. The measure favors attributes that result in partitioning the data into
subsets that have low class entropy, i.e., when the majority of examples in each subset belong to a single class.
The algorithm basically chooses the attribute that provides the maximum degree of discrimination
between classes locally.
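A sketch of this entropy-based heuristic, compatible with the build_tree sketch above, might look as follows; it handles categorical attributes only, and the function names are invented for illustration.

from math import log2
from collections import Counter

def entropy(samples):
    # Class entropy: -sum(p * log2(p)) over the class proportions at a node.
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(samples, attr):
    # Gain = entropy before the split minus the weighted entropy of the subsets.
    total = len(samples)
    remainder = 0.0
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(samples) - remainder

def select_attribute(samples, attributes):
    # ID3 chooses the attribute with the highest information gain.
    return max(attributes, key=lambda a: information_gain(samples, a))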
To apply these methods, which are based on the inductive-learning approach, several key
requirements have to be satisfied:
1. Attribute-value description—The data to be analyzed must be in a flat-file form—all information
about one object or example must be expressible in terms of a fixed collection of properties or
attributes. Each attribute may have either discrete or numeric values, but the attributes used to
describe samples must not vary from one case to another. This restriction rules out domains in
which samples have an inherently variable structure (a small flat-file example is sketched after this list).
2. Predefined classes—The categories to which samples are to be assigned must have been established
beforehand. In the terminology of machine learning, this is supervised learning.
3. Discrete classes—The classes must be sharply delineated: a case either does or does not belong to a
particular class. It is expected that there will be far more samples than classes.
4. Sufficient data—Inductive generalization given in the form of a decision tree proceeds by identifying
patterns in data. The approach is valid only if a sufficient number of robust patterns can be distinguished from
chance coincidences. As this differentiation usually depends on statistical tests, there must be a sufficient
number of samples to allow these tests to be effective. The amount of data required is affected by
factors such as the number of properties and classes
and the complexity of the classification model. As these factors increase, more data will be needed to
construct a reliable model.
5. “Logical” classification models—These methods construct only classifiers that can be expressed
as decision trees or decision rules. These forms essentially restrict the description of a class to a logical
expression whose primitives are statements about the values of particular attributes. Some
applications require weighted attributes or their arithmetic combinations for a reliable description of
classes. In these situations logical models become very complex, and, in general, they are not
effective.
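Requirements 1, 2, and 3 can be illustrated by tying them to the earlier sketches: the samples below are in flat-file form, every record is described by the same fixed collection of attributes, and the classes are predefined and discrete. The billing attributes and values are invented, and the final call reuses the build_tree and select_attribute sketches given earlier.

samples = [
    {"region": "north", "late_payments": "few",  "class": "within_30_days"},
    {"region": "north", "late_payments": "many", "class": "longer"},
    {"region": "south", "late_payments": "few",  "class": "within_30_days"},
    {"region": "south", "late_payments": "many", "class": "longer"},
]

tree = build_tree(samples, ["region", "late_payments"], select_attribute)
print(tree)  # e.g. {'attribute': 'late_payments', 'branches': {'few': 'within_30_days', 'many': 'longer'}}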