
Classification: Basic Concepts and Methods


Basic concepts

What is classification?

• A bank loan officer

• Analyzing data to learn which loan applications are ‘safe’ or ‘risky’

• Marketing manager

• Analyzing data to predict whether a customer will buy a new computer

• Medical researcher

• Analyzing data to predict which one of the three specific treatments a patient should
receive

• These are examples of classification


• A model or classifier is constructed to predict class labels

• ‘safe’ or ‘risky’ for loan application data

• ‘yes’ or ‘no’ for marketing data

• ‘treatment A’, ‘treatment B’, ‘treatment C’ for medical data

• The categories can be represented by discrete values

• Ordering among values has no meaning

• Eg: values 1, 2, and 3 to represent treatments A, B, and C

• No ordering is implied among the groups


• Marketing manager wants to predict
• How much a given customer will spend during a sale

• Realtor wants to know


• The average house prices next year in different residential areas

• A career planner wants to forecast


• The average yearly income of students

• These applications are examples of numeric prediction

• The model predicts

• Continuous-valued function

• Ordered value
• Regression analysis

• Statistical methodology

• Used for numeric prediction

• Ranking is another type of numerical prediction

• The model predicts the ordered values

• Eg:

• A web search engine ranks the relevant web pages with respect to a given query

• Higher-ranked webpages are more relevant to the query


• Two major types of prediction problem

• Classification

• numeric prediction
General approach to classification

• Data classification is a two-step process

• Learning step

• Classification model is constructed


• Classification step

• The model is used to predict class labels for given data


• Learning step (aka training phase)

• Classification algorithm builds classifier by

• Analyzing/learning from a ‘training set’

• Training set

• Made up of database tuples & their class labels

• A tuple X is an n-dimensional attribute vector

• Depicting n measurements made on n attributes


• Each tuple X

• Is assumed to belong to a predefined class

• Class label attribute

• Discrete-valued and unordered

• Categorical (nominal)

• Each value serves as a category/class

• The individual tuples making up the training set are referred to as training tuples

• They are randomly sampled from the database under analysis


• Data tuples referred as

• Samples

• Examples

• Instances

• Data points

• Objects
• In Supervised learning

• Class label of each training tuple is provided

• Encompasses
• Classification
• Regression
• Ranking

• Contrasts with unsupervised learning (eg: clustering)


• Target value of each training tuple is not known
• Number/set of classes also not known
• Eg:
• If the ‘loan_decision’ label is unavailable in the training set
• Clustering can be used to determine ‘groups of like tuples’
• Semi-supervised classification

• Builds a classifier based on a limited number of labelled training tuples and

• A large number of unlabelled training tuples


• Zero-Shot Learning (ZSL)

• If a class label appears only after the classification model is built

• During the training phase there are no labelled training tuples for such a class label
• Weakly supervised learning

• Semi-supervised

• Zero-shot learning

• The supervision information available for training the model is weaker than in fully supervised learning


• The first step of the classification process can be viewed as the

• Learning of a mapping or function

• y = f(X)

• Predicting the class label y of a given tuple X

• A mapping or function that separates the data classes

• Mapping is represented in the form of


• Classification rules
• Decision trees
• Mathematical formulae
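For illustration, a learned mapping y = f(X) expressed as classification rules can be written directly as a small function. This is only a sketch: the attribute names and thresholds below are hypothetical, chosen to make the idea concrete.

```python
# Hypothetical mapping y = f(X) for the loan-application example,
# expressed as IF-THEN classification rules.
def f(applicant):
    """Predict the class label ('safe' or 'risky') for one tuple X."""
    if applicant["income"] >= 50_000 and applicant["years_employed"] >= 2:
        return "safe"
    return "risky"

print(f({"income": 60_000, "years_employed": 3}))   # -> safe
print(f({"income": 20_000, "years_employed": 1}))   # -> risky
```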
• Classification accuracy
• In the second step
• Model is used for classification
• First,
• Predictive accuracy of the classifier is estimated
• Don’t use training set to measure the accuracy
• Because the classifier tends to overfit
• During learning it incorporates anomalies that do not represent the general data set
• Use test set
• Made up of test tuples and their associated class labels
• Independent of the training tuples
• Accuracy
• Percentage of test tuples that are correctly classified
• Associated class label is compared with predicted label
• If the accuracy is considered acceptable, the classifier can be used to classify future data
• Future data are also known as ‘unknown’ or ‘previously unseen’ data
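A minimal sketch of this two-step process is shown below; the data set and the use of scikit-learn are illustrative assumptions, not part of the slides.

```python
# Sketch of the two-step process: learn on a training set,
# then estimate accuracy on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# The test set is never used during the learning step,
# so the accuracy estimate is not inflated by overfitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # learning step
y_pred = clf.predict(X_test)                            # classification step
print("accuracy on test set:", accuracy_score(y_test, y_pred))
```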
Decision Tree Induction

• learning of decision trees from class-labelled training tuples

• Decision tree

• Flowchart-like tree structure

• Internal node (nonleaf node) denotes test on an attribute

• Branch represents the outcome of the test

• Leaf node (terminal node) holds a class label


• Internal nodes are denoted by rectangles

• Leaf nodes are denoted by ovals (or circles)

• Some decision tree algorithms produce only binary trees

• Each internal node branches to exactly two other nodes

• Others produce non-binary trees


• How is a decision tree used for classification?

• Given tuple X, for which the associated class label is unknown

• The attribute values are tested against the decision tree

• A path is traced from the root to a leaf node (which holds the class prediction for that tuple)

• Decision trees can easily be converted to decision rules

• Eg:

Age    Student  Credit Rating  Class
Youth  yes      excellent      ?
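A small sketch of tracing such a tuple through a tree is given below; the tree is a hypothetical buys_computer tree, not necessarily the one pictured on the original slide.

```python
# Hypothetical decision tree for buys_computer, written as nested dicts.
# Internal nodes test an attribute; leaves hold a class label.
tree = {
    "attr": "age",
    "branches": {
        "youth":       {"attr": "student",
                        "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior":      {"attr": "credit_rating",
                        "branches": {"excellent": "no", "fair": "yes"}},
    },
}

def classify(node, tuple_x):
    """Trace a path from the root to a leaf and return its class label."""
    while isinstance(node, dict):                 # stop when we reach a leaf
        node = node["branches"][tuple_x[node["attr"]]]
    return node

x = {"age": "youth", "student": "yes", "credit_rating": "excellent"}
print(classify(tree, x))   # -> 'yes'
```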
• Why are decision tree classifiers popular?
• Construction of decision tree classifier does not require
• Any domain knowledge
• Parameter setting

• Decision tree classifier


• Appropriate for exploratory knowledge discovery
• Handles multidimensional data
• Easy to assimilate by humans
• Simple & fast
• Have good accuracy
• Used in areas such as
• Medicine
• Manufacturing & production
• Financial analysis
• Astronomy
• Molecular biology
Decision Tree Induction

• ID3 Decision tree algorithm


• Developed by J. Ross Quinlan during the late 1970s and early 1980s
• An expansion of earlier work on concept learning systems by
• E.B. Hunt, J. Martin, and P.T. Stone
• Quinlan later presented C4.5
• Became a benchmark against which newer supervised learning algorithms are compared

• Classification and Regression Trees (CART)


• Proposed by group of statisticians
• L. Breiman, J. Friedman, R. Olshen, and C. Stone
• Describes the generation of binary decision trees
• ID3, C4.5 and CART
• Adopt a greedy (non-backtracking) approach
• Construct trees in a top-down, recursive, divide-and-conquer manner
• Starts with a training set of tuples & their associated class labels
• Training set
• Recursively partitioned into smaller subsets as the tree is being built
• Algorithm is called with three parameters

• D
• is a data partition
• Complete set of training tuples and their associated class labels

• Attribute_list
• List of attributes describing the tuples

• Attribute_selection_method
• Heuristic procedure
• Selects the attribute that ‘best’ discriminates the given tuples
• Employs an attribute selection measure
• Information gain/Gini impurity
• Gini impurity
• Enforces the resulting tree to be binary
• Information gain
• Allows multiway splits
• Tree starts as a single node, N, representing the training tuples in D

• If tuples in D are all of same class,


• then node N becomes a leaf
• Leaf is labelled with that class

• Otherwise, the algorithm calls Attribute_selection_method to determine


• ‘splitting criterion’
• Tells which attribute to test at node N
• By determining the ‘best’ way to partition the tuples in D into individual classes
• Also tells which branches to grow
• Indicates the splitting attribute
• Also indicates split-point/splitting subset
• The splitting criterion is determined so that each resulting partition is as ‘pure’ as possible
• Pure partition
• If all tuples in it belong to same class
• The node N is labelled with the splitting criterion, which serves as
• Test at the node
• A branch is grown from node N for each of the outcomes of the splitting criterion
• Tuples in D are partitioned accordingly
• Let A be the splitting attribute
• A has v distinct values {a1,a2,…,av} based on training data
• A is discrete-valued
• Outcomes of the test at node N directly correspond to the known values of A
• A branch is created for each known value, aj of A
• Labeled with that value
• Partition Dj is the subset of class-labelled tuples in D having value aj of A
• Since, all tuples in a given partition have same value for A,
• A does not need to be considered in any future partitioning of tuples
• Therefore, it is removed from attribute_list
• A is continuous-valued
• Test at node N has two possible outcomes, corresponding to the conditions
• A<=split_point and A>split_point
• Split_point is the split point returned by attribute_selection_method
• Two branches are grown from N and labelled according to the outcomes above
• Tuples are partitioned such that D1 holds the subset of class-labelled tuples in D for
which A <= split_point, while D2 holds the rest

• A is discrete-valued and a binary tree


• Test at node N is of the form “A ∈ SA?”
• where SA is the splitting subset for A, returned by the attribute selection method
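A compact sketch of this greedy, top-down, recursive divide-and-conquer procedure is given below. The toy data and the simple majority-error selection heuristic are stand-ins for a real attribute selection measure such as information gain or Gini impurity.

```python
from collections import Counter

def majority_error_select(D, attribute_list):
    """Placeholder attribute_selection_method: pick the attribute whose
    split leaves the fewest majority-vote misclassifications."""
    def errors(A):
        total = 0
        for a_j in {x[A] for x, _ in D}:
            part = [label for x, label in D if x[A] == a_j]
            total += len(part) - Counter(part).most_common(1)[0][1]
        return total
    return min(attribute_list, key=errors)

def build_tree(D, attribute_list, attribute_selection_method=majority_error_select):
    """Greedy, top-down, recursive divide-and-conquer induction over
    categorical attributes. D is a list of (attribute_dict, class_label)."""
    labels = [label for _, label in D]
    if len(set(labels)) == 1:                      # pure partition -> leaf
        return labels[0]
    if not attribute_list:                         # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    A = attribute_selection_method(D, attribute_list)    # splitting attribute
    node = {"attr": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]    # A removed from attribute_list
    for a_j in {x[A] for x, _ in D}:                     # one branch per known value of A
        D_j = [(x, label) for x, label in D if x[A] == a_j]
        node["branches"][a_j] = build_tree(D_j, remaining, attribute_selection_method)
    return node

D = [({"age": "youth", "student": "no"}, "no"),
     ({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "senior", "student": "no"}, "yes")]
print(build_tree(D, ["age", "student"]))
```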
• Attribute selection measures

• Information Gain
• Gain Ratio
• Gini Impurity
• Other Attribute selection measures

• Information Gain - biased toward multivalued attributes

• Gain Ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others

• Gini Impurity
• biased toward multivalued attributes
• Faces difficulty when the number of classes is large

• Other attribute selection measure


• CHAID
• C-SEP
• G-Statistic
• RSS (Residual Sum of Squares)
• Used to find the best split point when the class label is continuous (as in regression trees)
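For reference, a small sketch of how the main measures can be computed for one candidate split; the class counts below are made-up toy numbers chosen to resemble the usual buys_computer-style example.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the classes in the partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum p_i^2; 0 for a pure partition."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Gain(A) = Info(D) - sum (|D_j|/|D|) * Info(D_j) for a candidate split."""
    n = len(labels)
    info_after = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - info_after

# Toy example: 9 'yes' and 5 'no' tuples, split by a hypothetical attribute
# into three partitions.
D = ["yes"] * 9 + ["no"] * 5
split = [["yes", "yes", "no", "no", "no"],        # e.g. value 1 of the attribute
         ["yes", "yes", "yes", "yes"],            # e.g. value 2
         ["yes", "yes", "yes", "no", "no"]]       # e.g. value 3
print("entropy(D)       =", round(entropy(D), 3))                    # ~0.940
print("gini(D)          =", round(gini(D), 3))
print("information gain =", round(information_gain(D, split), 3))    # ~0.246
```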
• Tree Pruning

• Addresses overfitting problem

• Use statistical measure to remove the least-reliable branches

• Pruned trees

• Smaller

• Less complex
• Two common pruning approaches

• Prepruning

• Post pruning

• Prepruning

• Tree is pruned by halting its construction early

• Upon halting, node becomes a leaf

• Leaf holds the most frequent class label among the subset of tuples
• Postpruning

• Removes subtrees from a ‘fully grown’ tree

• A subtree at a given node is pruned by removing its branches & replacing it with a leaf

• The leaf is labelled with the most frequent class label among the tuples in the subtree being replaced

• Repetition and replication can make decision trees overly large and redundant

• Repetition: an attribute is repeatedly tested along a given branch

• Replication: duplicate subtrees exist within the tree

• Use of multivariate splits can prevent these problems

• A rule-based classifier can be constructed by extracting IF-THEN rules from a decision tree
Bayes Classification Methods

• Statistical classifiers

• Predict class membership probabilities

• Probability that a given tuple belongs to a particular class

• Based on Bayes’ theorem

• Simple Bayesian classifier is known as naïve Bayesian Classifier

• Exhibits high accuracy and speed when applied to large DBs

• Assumes ‘class-conditional independence’

• The effect of an attribute value on a given class is independent of the values of the other attributes
Bayes’ Theorem

• Let X be a data tuple

• In Bayesian terms, X is considered as ‘evidence’ defined by ‘n’ attributes

• Let H be the hypothesis that the data tuple X belongs to a specified class C

• For classification problems,

• We want to determine P(H|X)

• That is, the probability that tuple X belongs to class C


• P(H|X) is the posterior probability (a posteriori probability) of H conditioned on X

• The probability that hypothesis H holds after considering the given data X

• Eg: the probability that customer X will buy a computer, given X’s age and income
• P(H) is the prior probability, or a priori probability, of H

• This is the probability that any given customer will buy a computer

• Regardless of

• Age

• Income

• Any other information

• Posterior probability, P(H|X), is based on more information

• The prior probability P(H) is independent of X

• Eg: the prior probability that a patient has COVID-19, regardless of their symptoms


• P(X|H) is the conditional probability of X conditioned on H

• Given that we already know the customer will buy a computer, what is the probability
that they are 35 years old and earn $40,000?

• In classification, P(X|H) is referred to as likelihood

• Difference between P(H|X) and P(X|H)

• P(H|X)

• If a person has fever and cough, what is the probability that they have covid-19?

• P(X|H)

• If a person has covid-19, what is the probability that they will have fever and cough?
• P(X) is the prior probability of X

• Referred to as ‘marginal probability’

• It is the probability that a person from our set of customers is 35 years old and earns
$40000

• Probability that a patient has fever and cough

• P(H), P(X|H) and P(X) may be estimated from the given data

• Bayes’ theorem provides a way of calculating the posterior probability:
P(H|X) = P(X|H) P(H) / P(X)



• Suppose there are m classes C1,C2,…Cm

• Given a tuple X

• We want to predict which class it belongs to


• Bayes Classifier

• First calculates the posterior probabilities for each of the m classes

• P(Ci|X)(i=1,…m)

• Then predicts that tuple X belongs to the class with the highest posterior
probability

• Eg:

• Given a customer X who is 35 years old and earns $40,000

• We want to predict if the customer will buy a computer

• Suppose P(buy computer|X)=0.8 and P(not buy computer|X)=0.2

• Bayes classifier will predict that the customer X will buy a computer
• Bayes Classifier

• Optimal

• Obtains smallest classification error rate compared to all other classifiers

• Probabilistic method

• Could make a wrong prediction for any given tuple

• If P(not buy computer|X)=0.2

• Means, there is 20% chance that the prediction is incorrect

• Predicts class with maximum posterior probability

• Probability of wrong prediction is lowest


• Bayes Classifier

• Plays a foundational role in statistical machine learning community

• Naïve Bayesian classifier

• K-nearest-neighbor

• Logistic regression

• Bayesian network

• Can be viewed as approximated Bayes classifiers

• Many neural network and curve-fitting algorithms output

• The maximum a posteriori hypothesis


• To calculate the posterior probabilities

• P(Ci|X)(i=1,2,..,m)

• We need to estimate

• Conditional probabilities P(X|Ci)(i=1,2,..,m)

• Priors P(Ci)(i=1,2,..,m)

• Estimating priors P(Ci)(i=1,2,…,m) is easy

• Estimating conditional probability P(X|Ci)(i=1,2,…,m) is very challenging

• Assume there are n binary attributes A1,A2,A3…,An

• We need to estimate the conditional probability for 2^n possible attribute-value combinations


• Simple solution is to use naïve Bayesian classifier
• Naïve Bayesian Classification

• Simple Bayesian classifier

• Let

• D be a training set of tuples and their associated class labels

• each tuple is represented by an n-dimensional attribute vector

• X = (x1, x2, …, xn), where n is the number of attributes A1, A2, …, An

• Suppose there are m classes C1,C2,…Cm

• Given a tuple X,

• the classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X
• The Bayesian classifier predicts that tuple X belongs to class Ci if and only if

• P(Ci|X) > P(Cj|X) for all j ≠ i, i.e., Ci maximizes P(Ci|X)

• The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis

• By Bayes’ theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes,

• We only need to find which class maximizes P(X|Ci) P(Ci)


• If the class prior probabilities are not known

• Assume that the classes are equally likely

• P(C1)=P(C2)=….P(Cm)

• maximize P(X|Ci)

• Otherwise maximize P(X|Ci)P(Ci)

• P(Ci)=|Ci,D|/|D|

• Where |Ci,D| is the number of training tuples of class Ci in D


• Computation of P(X|Ci) is made under the assumption of

• Class-conditional independence

• Presumes that the attributes’ values are conditionally independent of one another, given the class

• Thus P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• For each attribute, we look at whether the attribute is

• Categorical

• Continuous-valued

• To compute P(X|Ci), consider the following

• If Ak is categorical, P(xk|Ci) is the number of tuples of class Ci in D having the value xk for
Ak, divided by |Ci,D|, the number of tuples of class Ci in D

• If Ak is continuous-valued, P(xk|Ci) is typically assumed to follow a Gaussian distribution with
mean μCi and standard deviation σCi estimated from the class-Ci tuples, so that
P(xk|Ci) = g(xk, μCi, σCi) = (1 / (√(2π) σCi)) e^( −(xk − μCi)² / (2σCi²) )
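A minimal sketch of both cases is shown below; the tiny training set and attribute names are invented, and the Gaussian form for the continuous attribute is the standard assumption stated above.

```python
import math
from collections import Counter

# Toy training data: (student, income) with class label buys_computer.
D = [({"student": "yes", "income": 40.0}, "yes"),
     ({"student": "yes", "income": 35.0}, "yes"),
     ({"student": "no",  "income": 90.0}, "no"),
     ({"student": "no",  "income": 60.0}, "yes"),
     ({"student": "no",  "income": 95.0}, "no")]

def p_categorical(attr, value, ci):
    """P(x_k | C_i) for a categorical attribute: count / |C_i,D|."""
    in_class = [x for x, c in D if c == ci]
    return sum(x[attr] == value for x in in_class) / len(in_class)

def p_gaussian(attr, value, ci):
    """P(x_k | C_i) for a continuous attribute, assuming a Gaussian g(x, mu, sigma)."""
    vals = [x[attr] for x, c in D if c == ci]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return math.exp(-(value - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posterior_score(x, ci):
    """P(X|C_i) P(C_i) under class-conditional independence (P(X) omitted: constant)."""
    prior = Counter(c for _, c in D)[ci] / len(D)
    return prior * p_categorical("student", x["student"], ci) * p_gaussian("income", x["income"], ci)

x = {"student": "yes", "income": 45.0}
print(max(["yes", "no"], key=lambda ci: posterior_score(x, ci)))   # -> 'yes'
```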


• Example
Therefore the naïve Bayesian classifier predicts buys_computer = yes for tuple X
• Numerical example
• Sometimes P(xk|Ci) may end up with probability 0

• P(student = yes|buys_computer = no) = 0

• Plugging this zero value would return a zero probability for P(X|Ci )

• Even though, without the zero probability, the class may have ended up with a high probability

• A single zero probability cancels the effects of all the other probabilities in the product

• Laplacian correction/Laplacian estimator technique may be used


• training database, D, containing 1000 tuples

• 0 tuples with income=low

• 990 tuples with income=medium

• 10 tuples with income=high

• The probabilities are 0, 0.990 and 0.010

• Using the Laplacian correction, we pretend we have one more tuple for each income value,
giving 1/1003, 991/1003, and 11/1003, i.e. approximately 0.001, 0.988, and 0.011

• Thus, zero probability is avoided while the other estimates change only slightly
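The same correction, sketched in code with the counts from this example:

```python
# Laplacian correction (add-one smoothing): add 1 to each count so that no
# estimated probability is exactly zero. Counts follow the example above.
counts = {"low": 0, "medium": 990, "high": 10}     # income counts among 1000 tuples

n = sum(counts.values())
uncorrected = {k: v / n for k, v in counts.items()}
corrected   = {k: (v + 1) / (n + len(counts)) for k, v in counts.items()}

print(uncorrected)   # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(corrected)     # approx {'low': 0.001, 'medium': 0.988, 'high': 0.011}
```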


Model selection using statistical tests of significance
• Suppose we have generated two classification models M1 and M2

• We performed 10-fold cross validation to obtain mean error rate for each

• How to determine which model is best?

• It may seem intuitive to select the model with the lowest error rate

• But,

• Mean error rates are just estimates of the error

• There can be considerable variance between error rates within any given 10-fold cross-validation

• Mean error rates obtained for M1 and M2 may be different


• Difference may not be statistically significant
• To determine if there is any ‘real’ difference in the mean error rates of 2 models
• Employ a test of statistical significance

• Eg:
• The null hypothesis is that the two models are the same (any observed difference is due to chance)
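A minimal sketch of such a test using a paired t-test over per-fold error rates; the error rates below are invented, and scipy is used as one possible implementation.

```python
from scipy import stats

# Hypothetical per-fold error rates from the same 10-fold cross-validation
# partitions, so the samples are paired fold by fold.
err_m1 = [0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.10, 0.12]
err_m2 = [0.14, 0.16, 0.13, 0.15, 0.12, 0.15, 0.14, 0.17, 0.12, 0.14]

# Null hypothesis: the two models have the same mean error rate.
t_stat, p_value = stats.ttest_rel(err_m1, err_m2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant: reject the null hypothesis.")
else:
    print("Difference may be due to chance: do not reject the null hypothesis.")
```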
Comparing classifiers based on cost-benefit ROC Curves

• True Positives (correctly predicting a cancerous patient as cancerous)

• True Negatives (correctly predicting a non-cancerous patient as non-cancerous)

• False Positives (incorrectly predicting non-cancerous patient as cancerous)

• False Negatives (incorrectly predicting a cancerous patient as non-cancerous)

• Useful in assessing the costs and benefits associated with a classification model

• Cost associated with a false negative is far greater than false positive

• One type of error can be weighted more heavily than another by assigning a different cost to each

• Can incorporate costs and benefits by computing the average cost per decision
Receiver Operating Characteristic Curves

• Useful tool for comparing two classification models

• Come from signal detection theory

• Shows the trade-off between

• True positive rate (TPR)

• False positive rate (FPR)


• Given a test set and a model

• TPR is the proportion of positive (or “yes”) tuples correctly labelled by the model

• FPR is the proportion of negative (or “no”) tuples that are mislabelled as positive

• TPR=TP/P, which is sensitivity

• FPR=FP/N, which is 1-specificity


Tuple Id  Class  Prob.  TP  FN  FP  TN  TPR  FPR
 1        P      0.90    1   4   0   5  0.2  0
 2        P      0.80    2   3   0   5  0.4  0
 3        N      0.70    2   3   1   4  0.4  0.2
 4        P      0.60    3   2   1   4  0.6  0.2
 5        P      0.55    4   1   1   4  0.8  0.2
 6        N      0.54    4   1   2   3  0.8  0.4
 7        N      0.53    4   1   3   2  0.8  0.6
 8        N      0.51    4   1   4   1  0.8  0.8
 9        P      0.50    5   0   4   1  1.0  0.8
10        N      0.40    5   0   5   0  1.0  1.0
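The TPR and FPR columns can be reproduced by sorting the tuples by decreasing probability and lowering the threshold one tuple at a time; a sketch using the labels and scores from the table:

```python
# Tuples sorted by decreasing predicted probability of the positive class,
# exactly as in the table above (scores shown for reference).
classes = ["P", "P", "N", "P", "P", "N", "N", "N", "P", "N"]
scores  = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]

P = classes.count("P")
N = classes.count("N")
tp = fp = 0
roc_points = []
for label in classes:                     # lower the threshold one tuple at a time
    if label == "P":
        tp += 1
    else:
        fp += 1
    roc_points.append((tp / P, fp / N))   # (TPR, FPR) after including this tuple

for tpr, fpr in roc_points:
    print(f"TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```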
• To assess the accuracy of a model, we can measure the area under the curve (AUC)

• Several software packages are able to perform such calculation

• The closer the area is to 0.5, the less accurate the corresponding model is

• A model with perfect accuracy will have an AUC of 1.0.
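For example, with the ranking from the table above, the area can be computed with the trapezoidal rule over the ROC points or with a library routine such as scikit-learn's roc_auc_score (assumed available):

```python
from sklearn.metrics import roc_auc_score

# 1 = positive, 0 = negative; same ordering and scores as the ROC table above.
y_true  = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]

print("AUC =", roc_auc_score(y_true, y_score))   # -> 0.76 for this ranking
```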
