4
Data Mining & Warehousing
                 MODULE 3
         CS 402 Data Mining & Warehousing   February 27, 20201 / 112
Outline:
1 Classification Models
     Introduction to Classification and Prediction
     Issues regarding classification and prediction
     Decision Tree- ID3
       Tree Pruning
     Decision Tree- C4.5
     Naive Bayes Classifier
                              CS 402 Data Mining & Warehousing   February 27, 20202 / 112
                       Classification Models   Introduction to Classification and Prediction
Classification and PredictionI
    Classification and prediction are two forms of data analysis
  ● They can be used to extract models describing important data classes or
    to predict future data trends
  ● Classification and prediction have numerous applications, including
    fraud detection, target marketing, performance prediction,
    manufacturing, and medical diagnosis.
                            CS 402 Data Mining & Warehousing                         February 27, 20203 / 112
                         Classification Models   Introduction to Classification and Prediction
Classification and
PredictionII
Classification
  ● Classification is the process of predicting the class of given data
     points.
    Classes are sometimes called as targets/ labels or categories.
  ● Examples:
       ● A bank loans officer needs analysis of her data in order to learn which
         loan applicants are: “safe” and which are “risky”
       ● A marketing manager at AllElectronics needs data analysis to help guess
         whether a customer with a given profile will buy a new computer: ”Yes”
         or ”No”
       ● A medical researcher wants to analyze breast cancer data in order to
         predict which one of three specific treatments a patient should receive:
         “treatment A,” “treatment B”, or “treatment C
  ● In each of these examples, the data analysis task is classification, where
    a model or classifier is constructed to predict categorical labels
                              CS 402 Data Mining & Warehousing                         February 27, 20204 / 112
                            Classification Models   Introduction to Classification and Prediction
Classification and PredictionIII
Classification (contd...)
  ● Classification process can also be viewed as the learning of a mapping
    or function, y = f ( X ) , that can predict the associated class label y of a
    given tuple X .
  ● Da ta classification is a two-step process
        ●
          Step 1: Building the Classifier or Model
        ● Step 2: Using Classifier for Classification
                                 CS 402 Data Mining & Warehousing                         February 27, 20205 / 112
                             Classification Models   Introduction to Classification and Prediction
Classification and
PredictionIV
Classification (contd...)
        ● A classifier
  1 Building           is built or
             the Classifier     describing
                                   Model        a predetermined set of data classes or
            concepts.
            This step is the learning step or the training phase.
        ●   Builds the classifier by analyzing or “learning from” a training set made
            up of database tuples and their associated class labels.
            Training tuples are used for training.
        ●   A tuple, X , is represented by an n-dimensional attribute vector,
            X =( x 1, x 2, ..., x n)
        ●   X depicting n measurements made on the tuple from n database
            attributes, respectively, A 1 , A 2 , ..., A n
        ●   X is assumed to belong to a predefined class as determined by another
            database attribute called the class label attribute, Y.
        ●   In the context of classification, data tuples can be referred to as
            samples, examples, instances, data points, or objects
                                  CS 402 Data Mining & Warehousing                         February 27, 20206 / 112
                            Classification Models   Introduction to Classification and Prediction
Classification and PredictionV
Classification (contd...)
  1 Building   the Classifier or Model
         ● The learning process of classifier is supervised, where class label of each
           training tuple is provided or known.
           The training data are analyzed by a classification algorithm. Then
         ● the learned model or classifier is represented in the form of
           classification rules.
  2   Classification:
           Test data are used to estimate the accuracy of the classification rules.
         ● If the accuracy is considered acceptable, the rules can be applied to the
           classification of new data tuples.
         ● A test set is used, made up of test tuples and their associated class
           labels.
         ● These tuples are randomly selected from the general data set.
                                 CS 402 Data Mining & Warehousing                         February 27, 20207 / 112
                     Classification Models   Introduction to Classification and Prediction
Classification and
PredictionVI
                                Figure:Training
                          CS 402 Data Mining & Warehousing                         February 27, 20208 / 112
                     Classification Models   Introduction to Classification and Prediction
Classification and
PredictionVII
                            Figure:Classification
                          CS 402 Data Mining & Warehousing                         February 27, 20209 / 112
                         Classification Models   Introduction to Classification and Prediction
Classification and
PredictionVIII
Prediction
  ● Predicts a continuous-valued function, or ordered value, as opposed
    to a categorical label
  ● The goal of prediction is to forecast or deduce the value of an
    attribute based on values of other attributes.
    Focus on numeric prediction
  ● Regression analysis is a statistical methodology that is most often
    used for numeric prediction, hence the two terms are often used
    synonymously
  ● Examples
       ● Marketing manager would like to predict how much money a given
         customer will spend during a sale at AllElectronics
         Predicting age of a person
       ● Predicting whether stock price of a company will increase tomorrow
                             CS 402 Data Mining & Warehousing
                        Classification Models   Introduction to Classification and Prediction
Classification and
PredictionIX
Prediction (contd...)
  ● Prediction can also be viewed as a mapping or function, y =f ( X ,
    where X is the input tuple. and the output y is a continuous or
    ordered value
  ● Data prediction is a two-step process, similar to that of data
    classification
    Training: Training set used to train the predictor model.
  ● Prediction: Predicts a numerical, continuous-values (ordered)
                             CS 402 Data Mining & Warehousing
                          Classification Models   Issues regarding classification and prediction
Issues regarding classification and
predictionI
  ● Describes issues regarding preprocessing the data for classification
    and prediction.
  ● Data cleaning: for Noisy and Missing Values
       ● Preprocessing of data in order to remove or reduce noise(smoothing )
         and the treatment of missing values(fill with probable values)
       ● Most classification algorithms have some mechanisms for handling noisy
         or missing data, this step can help reduce confusion during learning.
  ● Relevance analysis: detecting redundant and irrelevant attribute
    which do not contribute to the classification or prediction task
         Many of the attributes in the data may be redundant.
       ● Correlation analysis can be used to identify whether any two given
         attributes are statistically related.
       ● Ex: Strong correlation between A and  1 A would
                                                      2   suggest that one of
         the two could be removed from further analysis.
       ● A database may also contain irrelevant attributes
                              CS 402 Data Mining & Warehousing
                           Classification Models   Issues regarding classification and prediction
Issues regarding classification and
predictionII
      ● Attribute subset selection can be used in these cases to find a reduced                     set
           of attributes
   ● Data transformation and reduction: prevent attributes with initially
     large ranges (like, say, income) from outweighing attributes and
     reduce the data size
        ● Normalization involves scaling all values for a given attribute so that
          they fall within a small specified range, such as −1.0 to 1.0, or 0.0 to
          1.0
        ● Generalizing it to higher-level concepts: use concept hierarchies, useful
          for continuous-valued attributes.
        ● Ex: income can be generalized to discrete ranges, such as low,
          medium, and high.
        ● Generalization compresses the original training data, fewer input/
          output operations may be involved during learning
        ● Data reduction: wavelet transformation and principle components
          analysis, discretization techniques, such as binning, histogram analysis,
          and clustering
                               CS 402 Data Mining & Warehousing
                           Classification Models   Issues regarding classification and prediction
Issues regarding classification and
 Comparing Classification and Prediction Methods
predictionIII
  ● Accuracy:
       ● Classifier: ability to correctly predict the class label of new or previously
         unseen data
       ● Predictor: refers to how well a given predictor can guess the value of
         the predicted attribute for new or previously unseen data
  ● Speed:Computational costs involved in generating and using the
    given classifier or predictor
  ● Robustness:ability of the classifier or predictor to make correct
    predictions given noisy data or data with missing values
                               CS 402 Data Mining & Warehousing
                          Classification Models   Issues regarding classification and prediction
Issues regarding classification and
 Comparing Classification and Prediction Methods
predictionIV
  ● Robustness:ability of the classifier or predictor to make correct
    predictions given noisy data or data with missing values
  ● Scalability:ability to construct the classifier or predictor efficiently
    given large amounts of data.
  ● Interpretability:
       ● level of understanding and insight that is provided by the classifier or
         predictor.
       ● Interpretability is subjective and therefore more difficult to assess.
                              CS 402 Data Mining & Warehousing
                         Classification Models   Issues regarding classification and prediction
Issues regarding classification and
predictionV
Classification Methods
    Decision Tree Induction - ID3, C4.5 - M3
  ● Naive Bayes Classifier - M3
    Rule based classification- 1R - M4
  ● Classification using Neural Networks-Back
    propagation  - M4
    Support Vector  Machine(SVM Classifier) - M4
  ● Lazy Learners-K Nearest Neighbor Classifier -M4
                             CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
InductionI
Example
    Consider the following situation: Somebody is hunting for a job.
  ● At the very beginning, he decides that he will consider only those jobs
    for which the monthly salary is at least Rs.50,000.
  ● Our job hunter does not like spending much time traveling to place of
    work. He is comfortable only if the commuting time is less than one
    hour.
  ● Also, he expects the company to arrange for a free coffee every
    morning!
  ● The decisions to be made before deciding to accept or reject a job
    offer can be schematically represented as in Figure below.
  ● This figure represents a decision tree
                             CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionII
                      CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree InductionIII
  ● The term “tree” refers to the concept of a tree in graph theory in
    mathematics
  ● A decision tree is a graph-theoretical tree, all terminology related to
    graph-theoretical trees can be applied to describe decision trees also.
    Nodes or vertices shown as ellipses are called the leaf nodes.
  ● All other nodes, except the root node, are called the internal nodes
                              CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionIV
Two
  ● types of decision
    Classification      trees
                    trees: Tree models where the target variable can take a
    discrete set of values are called classification trees. In these tree
    structures, leaves represent class labels and branches represent
    conjunctions of features that lead to those class labels.
    Regression trees: Decision trees where the target variable can take
  ● continuous values (real numbers) like the price of a house, or a
    patient’s length of stay in a hospital, are called regression trees.
                              CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionV
Classification
  ● Consider trees:  an given
               the data Example
                              in Table below which specify the features of
     certain vertebrates and the class to which they belong.
  ● For each species, four features have been identified: “gives birth”,
     ”aquatic animal”, “aerial animal” and “has legs”.
  ● There are five class labels, namely, “amphibian”, “bird”, “fish”,
     “mammal” and “reptile”.
  ● The problem is how to use this data to identify the class of a newly
     discovered vertebrate.
                              CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionVI
                      CS 402 Data Mining & Warehousing
                          Classification Models   Decision Tree- ID3
Descision Tree
InductionVII
Construction of the tree: Step 1
  ● Split the dataset into disjoint subsets according to the values of the
    feature “gives birth”
  ● There are only two possible values for this feature, we have only two
    subsets
  ● Subset 1: Consists of those examples for which the value of “gives
    birth” is “yes”
    Subset 2: for which the value is “no”
  ● Tree in figure represents the classification tree at after step 1
                               CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionVIII
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionIX
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionX
                      CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXI
Construction
  ● We split of the examples
             these  tree: Step based
                               2     on the values of the feature “aquatic
    animal”
    There are three possible values for this feature
  ● Two appear in Table 8.2, Accordingly, we need consider only two
    subsets : Table 8.4, 8.5
  ● Table 8.4: contains only one example and hence no further splitting is
    required. It leads to the assignment of the class label “fish”.
  ● Table 8.5: need to be split into subsets based on the values of “aerial
    animal”. Subset 1: value as ”yes”, Subset 2: values as ”no”
  ● It can be seen that these subsets immediately lead to unambiguous
    assignment of class labels: The value of “no” leads to “mammal” and
    the value “yes” leads to ”bird”.
                              CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXII
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXIII
                      CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXIV
Construction ofSplit
    Table 8.3:  the tree:
                     basedStep
                            on 3the values of the feature “aquatic animal”;
  ● Split into three subsets; Table 8.6 - ”yes”, Table 8.7 - ”No”, Table
    8.8 - ”semi”
     Now Split resulting subsets based on the values of “has legs”, etc.
  ● Finally we get the classification tree
                              CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXV
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXVI
                      CS 402 Data Mining & Warehousing
                          Classification Models   Decision Tree- ID3
Descision Tree
InductionXVII
Classification tree in rule format
                               CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- ID3
Descision Tree
InductionXVIII
Some Remarks on Classification Tree
                            CS 402 Data Mining & Warehousing
                           Classification Models   Decision Tree- ID3
Descision Tree
InductionXIX
  ● Elements of a classification tree
        ● Nodes represents features in the data-set, Branches in the tree are
          identified by the values of features, The leaf nodes identified by are the
          class labels.
      Order in which the features are selected
  ●
        ● The tree depends on the order in which the features are selected, There is
          no theoretical justification for this choice.
      Stopping criteria
  ●
        ● There will be a large number of features each feature having several
          possible values results complex classification trees
        ● Criteria 1 : All (or nearly all) of the examples at the node have the
          same class.
        ● Criteria 2 : There are no remaining features to distinguish among the
          examples.
        ● Criteria 3 :The tree has grown to a predefined size limit.
                                CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree InductionXX
Feature selection measures
? Which which attribute is to be to placed at the root or as internal nodes If
     a dataset consists of n attributes: it is complicated problem
     Random selection may give us bad results with low accuracy
  ● Solution : Use feature selection measures - assign numerical values to
    the various features such that the values reflect the relative importance
    of the various features
  ● Two of the popular feature selection measures are information gain
    and Gini index.
                              CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
InductionXXI
Entropy
  ● The degree to which a subset of examples contains only a single class is
     known as purity
    Any subset composed of only a single class is called a pure class
    Entropy is a measure of “impurity ” in a dataset.
    If there are only two possible classes, entropy values can range
  ● from 0
    to
    For1n classes, entropy ranges from 0 to log (n).            2
    Minimum entropy value - sample is completely homogeneous
  ● Maximum value - data are as diverse as possible
                             CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree InductionXXII
Entropy
    Consider a segment S of a dataset having c number of class labels
    Let p ibe the proportion of examples in S having the ith class label.
  ● The entropy of S is defined as:
    Special case: If S has only two class labels, say, “yes” and “no”. If p
  ● is the proportion label “yes” then the proportion of label “no” will be
    1 − p. Here entropy of Sis given by:
                             CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXXIII
                      CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- ID3
Descision Tree
InductionXXIV
Entropy of Table 8.1
                            CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- ID3
Descision Tree
InductionXXV
Entropy of Table 8.1
                            CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXXVI
                      CS 402 Data Mining & Warehousing
                 Classification Models
                 Decision Tree- ID3
Descision Tree InductionXXVII
                      CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
 Entropy of Table 8.2
InductionXXVIII
                            CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- ID3
Descision Tree
InductionXXIX
Entropy of Table 8.3
                            CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree InductionXXX
Information gain
    Let S be a set of examples, A be a feature (or, an attribute)
    S vbe the subset of S with A = v.
  ● V alues(A) be the set of all possible values of A.
  ● Then the information gain of an attribute A relative to the set S,
    denoted by Gain (S, A):
  ● where |S| denotes the number of elements in S.
                              CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
InductionXXXI
    Consider data in table 8.1
  ● ǁSǁ =10, Entropy(S) =2.2464
                             CS 402 Data Mining & Warehousing
                 Classification Models
                 Decision Tree- ID3
Descision Tree InductionXXXII
                      CS 402 Data Mining & Warehousing
                 Classification Models
                 Decision Tree- ID3
Descision Tree InductionXXXIII
                      CS 402 Data Mining & Warehousing
                 Classification Models
                 Decision Tree- ID3
Descision Tree InductionXXXIV
                      CS 402 Data Mining & Warehousing
                Classification Models
                Decision Tree- ID3
Descision Tree InductionXXXV
                     CS 402 Data Mining & Warehousing
                     Classification Models   Decision Tree- ID3
Descision Tree
InductionXXXVI
 ● Task: Find Computations of Gain(S, aerialanimal) and
   Gain(S, haslegs)
                         CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXXXVII
 Problem:Use ID3 algorithm to construct a decision tree for the data in
Table below
                             CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXXXVIII
 Step 1
   Create a root node for the tree
 ● Total no of features = 4
   Decide which feature is to be
 ● placed at the root node.
   Calculate the information gains corresponding to each of the four
   features.
        Calculation of Entropy (S)
      ● Calculation of Gain (S, outlook), Gain (S, temperature), Gain (S,
        humidity) and Gain (S, wind)
                             CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXXXIX
Calculation of Gain (S, outlook), Gain (S, temperature)
                             CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXL
                      CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
InductionXLI
Calculation of Gain (S, temperature)
                             CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXLII
                      CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree
InductionXLIII
Calculation of Gain (S, humidity) and Gain (S, wind)
Step 2
                             CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree InductionXLIV
  ● Find the highest information gain which is the maximum among
    Gain(S, outlook), Gain(S, temperature), Gain(S, humidity) and
    Gain(S, wind).
  ● Therefore, we place “outlook” at the root node.
                             CS 402 Data Mining & Warehousing
                           Classification Models   Decision Tree- ID3
Descision Tree InductionXLV
Step 3
  ● Follow the steps for Node 1 (S (1) S outlook=sunny ), |S ( 1 ) | =5
  ● Find Entropy(S (1) ), Gain(S (1) , temperature), Gain(S             (1)
                                                                              , humidity),
    Gain(S (1) , wind)
                                CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXLVI
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionXLVII
                      CS 402 Data Mining & Warehousing
                  Classification Models
                  Decision Tree- ID3
Descision Tree InductionXLVIII
                       CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- ID3
Descision Tree
InductionXLIX
Step 4
  ● Find maximum of Gain(S (1) , temperature), Gain(S (1) , humidity),
           (
    Gain(S 1) , wind) = humidity
  ● Place humidity at node 1, split it into branches acording to the values
    of humidity.
                              CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionL
                      CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- ID3
Descision Tree InductionLI
Step 5
    Node 4 have the same class label “no” (Play Tennis= no)
    Node 5 have the same class label “yes” (Play Tennis= Yes)
  ● So we represent Node 4 as a leaf node with value “no” and Node 5 as a
    leaf node with value “yes”.
  ● Node 2 have the same class label “yes”. So we convert Node 2 as a
    leaf node with value “ yes.
  ● For Node 3: S ( 3 ) =S outlook=rain , Maximum gain = Gain(S (3) ,
    humidity)
  ● The branches resulting from splitting this node 3 corresponding to the
    values “high” and “normal” of “humidity” lead to leaf nodes with class
    labels “no” and ”yes”
                            CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLII
                      CS 402 Data Mining & Warehousing
                        Classification Models   Decision Tree- ID3
Descision Tree InductionLIII
The Algorithm
  ● The algorithm uses information gain to select the most useful
    attribute for classification.
  ● Assume that there are only two class labels, namely, ” + ” and ”
    − ”.
                             CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLIV
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLV
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLVI
                      CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLVII
                      CS 402 Data Mining & Warehousing
                           Classification Models   Decision Tree- ID3
Descision Tree
InductionLVIII
Advantages of ID3 prediction rules are created from the training data.
    Understandable
  ● Builds the fastest tree.
    Builds a short tree.
  ● Only need to test
  ● enough attributes until
    all data is classified.
    Finding leaf
Disadvantages of nodes
                 ID3 enables test data to be pruned, reducing number of
    tests.
    Data may be over-fitted or overclassified, if a small sample is tested.
  ● Only one attribute at a time is tested for making a decision.
                                CS 402 Data Mining & Warehousing
                          Classification Models   Decision Tree- ID3
Descision Tree InductionLIX
  ● Classifying continuous data may be computationally expensive, as
    many trees must be generated to see where to break the continuum
Tree Pruning
  ● When a decision tree is built, many of the branches will reflect anomalies in
    the training data due to noise or outliers.
    This will lead to the problem of overfitting
  ● Pruning is a technique that reduces the size of decision trees by removing
    sections of the tree that provide little power to classify instances.
  ● Tree pruning methods helps to remove the least reliable branches in the
    decision tree.
                               CS 402 Data Mining & Warehousing
                           Classification Models   Decision Tree- ID3
Descision Tree
InductionLX
Tree Pruning
  ● Pruned trees tend to be smaller and less complex and, thus, easier to
    comprehend.
  ● Pruned trees are usually faster and better at correctly classifying independent
    test data (i.e., of previously unseen tuples) than unpruned trees. There are two
  ● common approaches to tree pruning: prepruning and postpruning.
                                CS 402 Data Mining & Warehousing
                 Classification Models   Decision Tree- ID3
Descision Tree
InductionLXI
                      CS 402 Data Mining & Warehousing
                            Classification Models   Decision Tree- ID3
Descision Tree
InductionLXII
Prepruning
     A tree is “pruned” by halting its construction early
  ● Apply pruning earlier, that is, before it reaches the point where it perfectly
    classifies the training data.
  ● This is done by deciding not to further split or partition the subset of
    training tuples at a given node
  ● Upon halting, the node becomes a leaf.
                                 CS 402 Data Mining & Warehousing
                           Classification Models   Decision Tree- ID3
Descision Tree
InductionLXIII
Prepruning
  ● The leaf may hold the most frequent class among the subset tuples or the
    probability distribution of those tuples.
  ● Usually the goodness of a split is assessed using measures such as statistical
    significance, information gain, Gini index etc.
  ● Here, The partitioning of the given subset is halted when the split falls
    below a prespecified threshold
                                CS 402 Data Mining & Warehousing
                          Classification Models   Decision Tree- ID3
Descision Tree
InductionLXIV
Postpruning
    Allow the tree to overfit the data, and then post-prune the tree. It
    removes subtrees from a “fully grown” tree.
  ● A subtree at a given node is pruned by removing its branches and replacing it
    with a leaf.
  ● The leaf is labeled with the most frequent class among the subtree being
    replaced
                               CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- C4.5
Decision Tree- C4.5I
     C4.5 uses an extension to information gain known as gain ratio
  ● It applies a kind of normalization to information gain using a “split
    information” value defined analogously with Info(D) as:
  ● The gain ratio is defined as
Example:Computation of gain ratio for the attribute income
  ● A test on income splits the data of into three partitions, namely low,
    medium, and high, containing four, six, and four tuples, respectively
  ● To compute the gain ratio of income:
                              CS 402 Data Mining & Warehousing
                       Classification Models   Decision Tree- C4.5
Decision Tree-
C4.5II
    We have Gain(income) = 0.029
  ● Therefore, GainRatio(income) = 0.029/0.926 = 0.031
                            CS 402 Data Mining & Warehousing
                         Classification Models   Decision Tree- C4.5
Decision Tree-
C4.5III
The C4.5 algorithm improves ID3 in the following ways:
  1Handling both continuous and discrete attributes - In order to handle
    continuous attributes, C4.5 creates a threshold and then splits the list
    into those whose attribute value is above the threshold and those that
    are less than or equal to it.
  2Handling training data with missing attribute values - C4.5 allows
    attribute values to be marked as ? for missing. Missing attribute
    values are simply not used in gain and entropy calculations.
3Handling attributes with differing costs.
  4Pruning trees after creation - C4.5 goes back through the tree once it’s
    been created and attempts to remove branches that do not help by
    replacing them with leaf nodes.
                             CS 402 Data Mining & Warehousing
                             Classification Models   Decision Tree- C4.5
Decision Tree-
C4.5IV
ID3 v/s C4.5
                                            ID3                                          C4.5
 Splitting Criteria         Information Gain                               Gain Ratio
 Attribute type supported   Handles on Categorical Attributes              Both Categorical and Numerical
 Missing Values             Do not handle missing values                   Handles missing values
 Pruning                    No pruning is Done                             Error based pruning is used
 Outlier Detection          Susceptible to outliers                        Susceptible to Outliers
                                  CS 402 Data Mining & Warehousing
                        Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierI
  ● Predict class membership probabilities, such as the probability that a
    given tuple belongs to a particular class.
  ● Assumption 1: All the features are independent and are unrelated to
    each other.
  ● ie., Presence or absence of a feature does not influence the presence or
    absence of any other feature.
  ● Assumption 2: The data has class-conditional independence: ie., the
    effect of an attribute value on a given class is independent of the
    values of the other attributes
  ● It is made to simplify the computations involved true in many real
    world problems, in this sense, is considered “naive.”
                             CS 402 Data Mining & Warehousing
                         Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierII
Basic Idea
  ● Consider a training data set consisting of N examples having n
    features.
    Let the features be named as (F1 , ..., Fn ).
  ● A feature vector is of the form (f 1 , f 2 , ..., f n )
    Let the set of class labels be {c , 1c ,2..., c }
  ● Let X be the test instance havingp the feature vector
    X =(x 1 , x 2 , ..., x n ).
                              CS 402 Data Mining & Warehousing
                        Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierIII
Basic Idea
  ● To find the most appropriate class label that should be assigned to the
    test instance X . Compute the following conditional probabilities, and
    choose the maximum among them P (c |1X ) , P (c2 | X ) , ..., P (c | X )
                                                          p
  ● Let the maximum probability be P (c |i X ) . Then, we     choose c as
                                                                        i
    the most appropriate class label for feature vector X.
  ● The direct computation of the probabilities are difficult for a number of
    reasons. The Bayes’ theorem is used
                             CS 402 Data Mining & Warehousing
                        Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierIV
Bayes’ theorem: Computation of probabilities
  ● Using Bayes’ theorem, we have:
  ● Since , the data has class-conditional independence, we note that the
    events ”x 1 | c k ”, ”x 2 | c k ”, ..., ”x n | ck ” are independent
                             CS 402 Data Mining & Warehousing
                          Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierV
Bayes’ theorem: Computation of probabilities
  ● Since the denominator P ( X ) is independent of the class labels, we
    have:
  ● It is enough to find the maximum among the following values:
    P (ck| X ) =P ( x 1| c )k P ( x2 | c k) ...P ( xn | c )k P (ck), where k =
    1, ...,
    p
Algorithm: Naive Bayes
                               CS 402 Data Mining & Warehousing
                      Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierVI
           Figure:Step 1, 2 : Training, Other Steps: Testing
                           CS 402 Data Mining & Warehousing
                         Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierVII
Example 1: Naive Bayes
                              CS 402 Data Mining & Warehousing
                            Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierVIII
Figure:Use naive Bayes algorithm to classify a particular species if its features are
(Slow, Rarely, No)
                                 CS 402 Data Mining & Warehousing
                       Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierIX
    The features are F1 =”Swim”, F2 =”Fly”, F3 =”Crawl”.
  ● The class labels are c1 =”Animal”, c2 =”Bird”, c3 =”Fish”
  ● The test instance x 1 = ”Slow”, x 2 = ”Rarely”, x 3 = ”No”
  ● Construct frequency table:
                            CS 402 Data Mining & Warehousing
                       Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierX
  ● Step 1: Compute following probabilities
                            CS 402 Data Mining & Warehousing
                         Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierXI
  ● Step 2: Construct the table of conditional probabilities:
  ● Conditional probabilities are calculated as follows
                              CS 402 Data Mining & Warehousing
                      Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXII
  ● Step 3: Compute products qk
                           CS 402 Data Mining & Warehousing
                 Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXIII
                      CS 402 Data Mining & Warehousing
                Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXIV
                     CS 402 Data Mining & Warehousing
                       Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierXV
Example 2: Naive Bayes Classfier
                            CS 402 Data Mining & Warehousing
                Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXVI
                     CS 402 Data Mining & Warehousing            February 27, 2020100 / 112
                         Classification Models    Naive Bayes Classifier
Naive Bayes
ClassifierXVII
    Features: age, income, student, and credit rating
   ● Class labels: c1 =yes, c2 = no
    ●
     Test instance:
   ● Step 1: Compute following probabilities
                              CS 402 Data Mining & Warehousing             February 27, 2020101 / 112
                        Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierXVIII
  ● Step 2: Calculate Conditional probabilities
                             CS 402 Data Mining & Warehousing            February 27, 2020102 / 112
                       Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierXIX
  ● Step 3, 4, 5: Compute products P ( X | buyscomputer = yes), P ( X |
    buyscomputer = no), q1, q2, maxq1, q2, predictclass
                            CS 402 Data Mining & Warehousing            February 27, 2020103 / 112
               Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXX
Question 3:
                    CS 402 Data Mining & Warehousing            February 27, 2020104 / 112
                        Classification Models   Naive Bayes Classifier
Naive Bayes
For Numerical or continuous valued attributes
ClassifierXXI
  ● Find mean an variance of each numerical attribute for each class label
                             CS 402 Data Mining & Warehousing            February 27, 2020105 / 112
                           Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXXII
For Numerical or continuous valued attributes
 1.Calculate class probabilities: P (male), P (f e mal e)
2.For given sample, Calculate Conditional probabilities:                       P (x k | c i ) as
     follows
     :
 3. Where µ and σ are the the mean and standard deviation of attribute
     A k , respectively.
 4. Find P ( X | c 1 ), P ( X | c 2 ), q1, q2, max{q1, q2} to predict the class
                                CS 402 Data Mining & Warehousing            February 27, 2020106 / 112
                  Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXXIII
Q3 Answer
                       CS 402 Data Mining & Warehousing            February 27, 2020107 / 112
                        Classification Models    Naive Bayes Classifier
Naive Bayes ClassifierXXIV
Question 4: Use naive Bayes algorithm to determine whether a red domestic
SUV car is a stolen car or not using the following data (Ans: No)
                             CS 402 Data Mining & Warehousing             February 27, 2020108 / 112
                        Classification Models   Naive Bayes Classifier
Naive Bayes ClassifierXXV
Zero frequency problem in Naive Bayes Classifier
  ● We calculate P ( X | C i ), as the product of the probabilities
    P ( x1 | C i) , P ( x2 | C i), ..., P ( xn | C i)
  ● Suppose if count of any x k is zero; ie., there are no training tuples
    representing x k for the class C i
    Which results in P ( x k| C )i =0
  ● Since it appears in the product, the entire term becomes zero ie.,
    P ( X | Ci ) = 0
  ● ie., A zero probability cancels the effects of all of the other
    probabilities (on C i ) involved in the product.
  ● But, even though, without the zero probability, we may have ended up
    with a high probability, suggesting that X belonged to class C i
                             CS 402 Data Mining & Warehousing            February 27, 2020109 / 112
                        Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXXVI
  ● Solution: For a large training database, D, adding one to each count
    that we need would only make a negli- gible difference in the
    estimated probability value
    Yet would conveniently avoid the case of probability values of zero
  ● This technique for probability estimation is known as the Laplacian
    correction or Laplace estimator
Example:
  ● Suppose that for the class buys computer = yes in some training
    database, D, containing 1,000 tuples
  ● We have 0 tuples with income = low, 990 tuples with income =
    medium, and 10 tuples with income = high
  ● The probabilities of these events are 0, 0.990 (= 999/1000), and
    0.010 (=10/1000)
                             CS 402 Data Mining & Warehousing            February 27, 2020110 / 112
                        Classification Models   Naive Bayes Classifier
Naive Bayes
ClassifierXXVII
  ● Using the Laplacian correction for the three quantities, we pretend
    that we have 1 more tuple for each income-value pair.
  ● ie;, count(low)=1, count(medium)=991, count(high)=11, Therefore,
    Total tuples = 1003.
  ● Now we can calculate the new probabilities as follows:
  ● The “corrected ” probability estimates are close to their “uncorrected
    ” counterparts, yet the zero probability value is avoided
                             CS 402 Data Mining & Warehousing            February 27, 2020111 / 112
Classification Models   Naive Bayes Classifier
     CS 402 Data Mining & Warehousing            February 27, 2020112 / 112