Classification: Basic
Concepts and Methods
Basic concepts
What is classification?
• A bank loan officer
• Analyzing data to learn which loan applications are ‘safe’ or ‘risky’
• Marketing manager
• Analyzing data to predict whether or not a customer will buy a new computer
• Medical researcher
• Analyzing data to predict which one of the three specific treatments a patient should
receive
• These are some of the examples of classification
• A model or classifier is constructed to predict class labels
• ‘safe’ or ‘risky’ for loan application data
• ‘yes’ or ‘no’ for marketing data
• ‘treatment A’, ‘treatment B’, ‘treatment C’ for medical data
• The categories can be represented by discrete values
• Ordering among values has no meaning
• Eg: values 1,2, and 3 to represent treatments A,B, and C
• No ordering implied among the group
• Marketing manager wants to predict
• How much a given customer will spend during a sale
• Realtor wants to know
• The average house pricing of next year in different residential areas
• A career planner wants to forecast
• The average yearly income of students
• These applications are examples of numeric prediction
• The model predicts
• Continuous-valued function
• Ordered value
• Regression analysis
• Statistical methodology
• Used for numeric prediction
• Ranking is another type of numerical prediction
• The model predicts the ordered values
• Eg:
• A web search engine ranks the relevant web pages with respect to a given query
• Higher-ranked webpages are more relevant to the query
• Two major types of prediction problem
• Classification
• numeric prediction
General approach to classification
• Data classification is a two step process
• Learning step
• Classification model is constructed
• Classification step
• The model is used to predict class labels for given data
• Learning step (aka training phase)
• Classification algorithm builds classifier by
• Analyzing/learning from a ‘training set’
• Training set
• Made up of database tuples & their class labels
• A tuple X is an n-dimensional attribute vector
• Depicting n measurements made from n attributes
• Each tuple X
• Is assumed to belong to a predefined class
• Class label attribute
• Discrete-valued and unordered
• Categorical(nominal)
• Each value serves as a category /class
• The individual tuples making up the training set are referred to as training tuples
• Randomly sampled from database under analysis
• Data tuples referred as
• Samples
• Examples
• Instances
• Data points
• Objects
• In Supervised learning
• Class label of each training tuple is provided
• Encompasses
• Classification
• Regression
• Ranking
• Contrasts with unsupervised learning (eg: clustering)
• Target value of each training tuple is not known
• Number/set of classes also not known
• Eg:
• If ‘loan_decision’ data is unavailable in the training set
• Clustering determines ‘groups of like tuples’
• Semi-supervised classification
• Build a classifier based on limited number of labelled training tuples &
• Large number of unlabelled training tuples
• Zero-Shot Learning (ZSL)
• If a class label appears only after the classification model is built
• During the training phase there are no labelled training tuples for such a class label
• Weakly supervised learning
• Semi-supervised
• Zero-shot learning
• The supervision information used to train the model is weaker than in fully supervised learning
• The first step of the classification process can be viewed as the
• Learning of a mapping or function
• y = f(X)
• Predicting the class label y of a given tuple X
• Mapping or function that separates data classes
• Mapping is represented in the form of
• Classification rules
• Decision trees
• Mathematical formulae
• Classification accuracy
• In the second step
• Model is used for classification
• First,
• Predictive accuracy of the classifier is estimated
• Do not use the training set to measure the accuracy
• Because the classifier tends to overfit the data
• During learning it may incorporate anomalies that do not represent the general data set
• Use test set
• Made up of test tuples and their associated class labels
• Independent of the training tuples
• Accuracy
• Percentage of test tuples that are correctly classified
• Associated class label is compared with predicted label
• If accuracy is accepted, classifier can be used to predict future data
• Future data also known as ‘unknown’ or ‘previously unseen’ data
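A minimal sketch of this two-step process, assuming a synthetic dataset and a decision tree classifier purely for illustration (neither comes from the examples above):

```python
# Two-step classification: learning step (fit on a training set),
# classification step (predict labels for an independent test set).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tuples X (attribute vectors) and their class labels y (synthetic, for illustration).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Independent training and test sets: the test set is never used for learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning step: the algorithm builds a classifier by analyzing the training set.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Classification step: predict class labels for previously unseen tuples;
# accuracy is the percentage of test tuples that are correctly classified.
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```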
Decision Tree Induction
• learning of decision trees from class-labelled training tuples
• Decision tree
• Flow-chart like tree structure
• Internal node (nonleaf node) denotes test on an attribute
• Branch represents the outcome of the test
• Leaf node (terminal node) holds a class label
• Internal nodes are denoted by rectangles
• Leaf nodes are denoted by ovals (or circles)
• Some decision tree algorithms produce only binary trees
• Each internal node branches to exactly two other nodes
• Others produce non-binary trees
• How is a decision tree used for classification?
• Given tuple X, for which the associated class label is unknown
• The attribute values are tested against the decision tree
• A path is traced from the root to a leaf node (which holds the class prediction for that tuple)
• Decision trees can easily be converted to decision rules
• Eg: classify the following tuple by tracing it through the tree (see the sketch below)

  Age     Student   Credit Rating   Class
  Youth   yes       excellent       ?
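As an illustration of tracing a tuple from root to leaf, the sketch below hard-codes a small tree in the spirit of the classic buys_computer example; the exact tree structure is an assumption for illustration, not a tree given in these notes.

```python
# Hand-coded decision tree (assumed structure). Each `if` test is an internal node,
# each `return` is a leaf holding a class label, so every root-to-leaf path is one IF-THEN rule.
def classify(age, student, credit_rating):
    if age == "youth":                       # test at the root node
        return "yes" if student == "yes" else "no"
    elif age == "middle_aged":
        return "yes"                         # leaf node: class label
    else:                                    # age == "senior"
        return "yes" if credit_rating == "fair" else "no"

# The tuple from the table above: (age=youth, student=yes, credit_rating=excellent).
print(classify("youth", "yes", "excellent"))   # -> "yes"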
• Why are decision tree classifiers popular?
• Construction of decision tree classifier does not require
• Any domain knowledge
• Parameter setting
• Decision tree classifier
• Appropriate for exploratory knowledge discovery
• Handles multidimensional data
• Easy to assimilate by humans
• Simple & fast
• Have good accuracy
• Used in areas such as
• Medicine
• Manufacturing & production
• Financial analysis
• Astronomy
• Molecular biology
Decision Tree Induction
• ID3 Decision tree algorithm
• Developed by J. Ross Quinlan during the late 1970s and early 1980s
• An expansion of earlier work on concept learning systems by
• E. B. Hunt, J. Marin, and P. T. Stone
• Quinlan later presented C4.5
• Became a benchmark against which newer supervised learning algorithms are compared
• Classification and Regression Trees (CART)
• Proposed by a group of statisticians
• L. Breiman, J. Friedman, R. Olshen, and C. Stone
• Describes the generation of binary decision trees
• ID3, C4.5 and CART
• Adopt a greedy (non-backtracking) approach
• Construct the tree in a top-down, recursive, divide-and-conquer manner
• Starts with a training set of tuples & their associated class labels
• Training set
• Recursively partitioned into smaller subsets as the tree is being built
• Algorithm is called with three parameters
• D
• is a data partition
• Complete set of training tuples and their associated class labels
• Attribute_list
• List of attributes describing the tuples
• Attribute_selection_method
• Heuristic procedure
• Selects the attribute that ‘best’ discriminates the given tuples
• Employs an attribute selection measure
• Information gain/Gini impurity
• Gini impurity
• Forces the resulting tree to be binary
• Information gain
• Allows multiway splits
• Tree starts as a single node, N, representing the training tuples in D
• If tuples in D are all of same class,
• then node N becomes a leaf
• Leaf is labelled with that class
• Otherwise, the algorithm calls Attribute_selection_method to determine
• ‘splitting criterion’
• Tells which attribute to test at node N
• By determining the ‘best’ way to partition the tuples in D into individual classes
• Also tells which branches to grow
• Indicates the splitting attribute
• Also indicates split-point/splitting subset
• The splitting criterion is determined so that each resulting partition is as ‘pure’ as possible
• Pure partition
• If all tuples in it belong to same class
• The node N is labelled with the splitting criterion, which serves as
• Test at the node
• A branch is grown from node N for each of the outcomes of the splitting criterion
• Tuples in D are partitioned accordingly
• Let A be the splitting attribute
• A has v distinct values {a1,a2,…,av} based on training data
• A is discrete-valued
• Outcomes of the test at node N directly correspond to the known values of A
• A branch is created for each known value, aj of A
• Labeled with that value
• Partition Dj is the subset of class-labelled tuples in D having value aj of A
• Since all tuples in a given partition have the same value for A,
• A does not need to be considered in any future partitioning of tuples
• Therefore, it is removed from attribute_list
• A is continuous-valued
• Test at node N has two possible outcomes, corresponding to the conditions
• A<=split_point and A>split_point
• Split_point is the split point returned by attribute_selection_method
• Two branches are grown from N and labelled according to the previous outcomes
• Tuples are partitioned such that D1 holds the subset of class-labelled tuple in D for
which A<=split_point, while D2 holds the rest
• A is discrete-valued and a binary tree must be produced
• Test at node N is of the form “A ∈ SA?”
• where SA is the splitting subset for A, returned by attribute_selection_method
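As a rough illustration of the recursive divide-and-conquer procedure just described, the sketch below assumes discrete-valued attributes with multiway splits, represents the training partition D as a list of (attribute-dict, class-label) pairs, and takes the attribute selection method as a pluggable function; these representational details are assumptions made for the sketch, not part of the original algorithm listing.

```python
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """Greedy, top-down, recursive tree construction (discrete attributes, multiway splits).
    D: list of (attribute_dict, class_label) pairs.
    attribute_selection_method(D, attrs): returns the 'best' splitting attribute for D."""
    labels = [label for _, label in D]

    # If all tuples in D are of the same class, node N becomes a leaf labelled with that class.
    if len(set(labels)) == 1:
        return labels[0]

    # If no attributes remain, return a leaf labelled with the majority class.
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]

    # Otherwise, let the attribute selection method pick the splitting attribute.
    A = attribute_selection_method(D, attribute_list)
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]   # A is removed from attribute_list

    # Grow one branch per known value a_j of A; partition D_j holds the tuples with that value.
    for value in set(x[A] for x, _ in D):
        Dj = [(x, label) for x, label in D if x[A] == value]
        node["branches"][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return node
```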
• Attribute selection measures
• Information Gain
• Gain Ratio
• Gini Impurity
• Other Attribute selection measures
• Information Gain - biased toward multivalued attributes
• Gain Ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others
• Gini Impurity
• Biased toward multivalued attributes
• Faces difficulty when the number of classes is large
• Other attribute selection measure
• CHAID
• C-SEP
• G-Statistic
• RSS (residual sum of squares) can be used to find the best split point when the class label (target) is continuous, as in regression trees
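A minimal sketch of how the measures named above can be computed: entropy-based information gain, Gini impurity, and an RSS-based split point for a continuous target. The data representation (a list of (attribute-dict, label) pairs, or parallel value/target lists) is an assumption made for illustration.

```python
import math

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple in this partition."""
    n = len(labels)
    counts = Counter(labels) if (Counter := __import__("collections").Counter) else None
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(D, attribute):
    """Info(D) minus the weighted entropy of the partitions induced by `attribute`.
    D is a list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in D]
    before = entropy(labels)
    after = 0.0
    for value in set(x[attribute] for x, _ in D):
        subset = [label for x, label in D if x[attribute] == value]
        after += len(subset) / len(D) * entropy(subset)
    return before - after

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = __import__("collections").Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_rss_split(values, targets):
    """For a continuous target, choose the split point minimizing the residual sum of
    squares of the two resulting partitions (as in regression trees)."""
    def rss(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    pairs = sorted(zip(values, targets))
    best_split, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2      # midpoint between adjacent values
        left = [y for v, y in pairs if v <= split]
        right = [y for v, y in pairs if v > split]
        total = rss(left) + rss(right)
        if total < best_total:
            best_split, best_total = split, total
    return best_split, best_total
```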
• Tree Pruning
• Addresses overfitting problem
• Use statistical measure to remove the least-reliable branches
• Pruned trees
• Smaller
• Less complex
• Two common pruning approaches
• Prepruning
• Postpruning
• Prepruning
• Tree is pruned by halting its construction early
• Upon halting, node becomes a leaf
• Leaf holds the most frequent class label among the subset of tuples
• Postpruning
• Removes subtrees from a ‘fully grown’ tree
• A subtree at a given node is pruned by removing its branches and replacing it with a leaf
• The leaf is labelled with the most frequent class label among the tuples in the subtree being replaced
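A rough sketch of how prepruning and postpruning look in practice with scikit-learn; the synthetic dataset and the particular hyperparameter values are illustrative assumptions, not settings prescribed by these notes.

```python
# Prepruning via early-stopping hyperparameters; postpruning via cost-complexity pruning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: halt tree construction early by bounding depth and minimum leaf size.
prepruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
prepruned.fit(X_train, y_train)

# Postpruning: grow a full tree, then use cost-complexity pruning to remove the
# least-reliable subtrees (a larger ccp_alpha gives a smaller, less complex tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]          # one candidate alpha
postpruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print("prepruned accuracy :", prepruned.score(X_test, y_test))
print("postpruned accuracy:", postpruned.score(X_test, y_test))
```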
• Repetition and replication are two problems that can arise in decision trees
• Repetition: an attribute is repeatedly tested along a given branch
• Replication: duplicate subtrees exist within the tree
• The use of multivariate splits can prevent these problems
• A rule-based classifier can be constructed by extracting IF-THEN rules from a decision tree
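One convenient way to view a fitted tree in rule-like form is scikit-learn's export_text, which prints each root-to-leaf path as nested conditions; the iris dataset below is used only as a stand-in, not as data from these notes.

```python
# Render a fitted decision tree as nested IF-style conditions; each printed
# root-to-leaf path corresponds to one IF-THEN rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))
```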
Bayes Classification Methods
• Statistical classifiers
• Predict class membership probabilities
• Probability that a given tuple belongs to a particular class
• Based on Bayes’ theorem
• Simple Bayesian classifier is known as naïve Bayesian Classifier
• Exhibits high accuracy and speed when applied to large DBs
• Assumes ‘class-conditional independence’
• The effect of an attribute value on a given class is independent of the values of the other
attributes
Bayes’ Theorem
• Let X be a data tuple
• In Bayesian terms, X is considered as ‘evidence’ defined by ‘n’ attributes
• Let H be the hypothesis that the data tuple X belongs to a specified class C
• For classification problems,
• We want to determine P(H|X)
• That is, the probability that tuple X belongs to class C, given the attribute description of X
• P(H|X) is the posterior probability/posteriori probability
• The probability of a hypothesis H being true after considering the given data X
• Eg: suppose customers are described by age and income, X is a 35-year-old customer earning $40,000, and H is the hypothesis that the customer will buy a computer
• P(H) is the prior probability, or a priori probability, of H
• This is the probability that any given customer will buy a computer
• Regardless of
• Age
• Income
• Any other information
• Posterior probability, P(H|X), is based on more information
• Prior probability, P(H) is independent of X
• Eg: H = ‘the patient has COVID-19’
• P(X|H) is the conditional probability of X conditioned on H
• Given that we already know the customer will buy a computer, what is the probability
that they are 35 years old and earn $40,000?
• In classification, P(X|H) is referred to as likelihood
• Difference between P(H|X) and P(X|H)
• P(H|X)
• If a person has fever and cough, what is the probability that they have covid-19?
• P(X|H)
• If a person has covid-19, what is the probability that they will have fever and cough?
• P(X) is the prior probability of X
• Referred to as ‘marginal probability’
• It is the probability that a person from our set of customers is 35 years old and earns
$40000
• Probability that a patient has fever and cough
• P(H), P(X|H), and P(X) may be estimated from the given data
• Bayes’ theorem provides a way of calculating the posterior probability:
• P(H|X) = P(X|H) P(H) / P(X)
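A tiny numeric illustration of these four quantities, using invented counts (the numbers are not data from these notes):

```python
# H: "customer buys a computer"; X: "customer is 35 years old and earns $40,000".
# All counts below are hypothetical.
n_total     = 1000   # all customers
n_buy       = 600    # customers who buy a computer
n_X         = 50     # customers matching X's description
n_X_and_buy = 40     # customers matching X who also buy

P_H         = n_buy / n_total          # prior P(H)
P_X         = n_X / n_total            # marginal P(X)
P_X_given_H = n_X_and_buy / n_buy      # likelihood P(X|H)

# Bayes' theorem: posterior = likelihood * prior / marginal
P_H_given_X = P_X_given_H * P_H / P_X
print(P_H_given_X)   # 0.8 (matches the direct count 40/50)
```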
• Suppose there are m classes C1,C2,…Cm
• Given a tuple X
• We want to predict which class it belongs to
• Bayes Classifier
• First calculates the posterior probabilities for each of the m classes
• P(Ci|X)(i=1,…m)
• Then predicts that tuple X belongs to the class with the highest posterior
probability
• Eg:
• Given a customer X who is 35 years old and earns $40,000
• We want to predict if the customer will buy a computer
• Suppose P(buy computer|X)=0.8 and P(not buy computer|X)=0.2
• Bayes classifier will predict that the customer X will buy a computer
• Bayes Classifier
• Optimal
• Achieves the smallest classification error rate of any classifier (when the true underlying probabilities are known)
• Probabilistic method
• Could make a wrong prediction for any given tuple
• If P(not buy computer|X)=0.2
• Means, there is 20% chance that the prediction is incorrect
• Predicts class with maximum posterior probability
• Probability of wrong prediction is lowest
• Bayes Classifier
• Plays a foundational role in statistical machine learning community
• Naïve Bayesian classifier
• K-nearest-neighbor
• Logistic regression
• Bayesian network
• Can be viewed as approximations of the Bayes classifier
• Many neural network and curve-fitting algorithms output the
• Maximum a posteriori (MAP) hypothesis
• To calculate the posterior probabilities
• P(Ci|X)(i=1,2,..,m)
• We need to estimate
• Conditional probabilities P(X|Ci)(i=1,2,..,m)
• Priors P(Ci)(i=1,2,..,m)
• Estimating priors P(Ci)(i=1,2,…,m) is easy
• Estimating the conditional probabilities P(X|Ci) (i=1,2,…,m) is very challenging
• Assume there are n binary attributes A1, A2, A3, …, An
• We would need to estimate a conditional probability for each of the 2^n possible attribute-value combinations per class
• Simple solution is to use naïve Bayesian classifier
• Naïve Bayesian Classification
• Simple Bayesian classifier
• Let
• D be a training set of tuples and their associated class labels
• each tuple is represented by an n-dimensional attribute vector
• X = (x1, x2, x3, …, xn), where n is the number of attributes A1, A2, A3, …, An
• Suppose there are m classes C1,C2,…Cm
• Given a tuple X,
• the classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X
• The Bayesian classifier predicts that tuple X belongs to class Ci if and only if
• P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i (i.e., Ci maximizes P(Ci|X))
• The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis
• By Bayes’ theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes,
• We only need to find the class that maximizes P(X|Ci) P(Ci)
• If the class prior probabilities are not known
• Assume that the classes are equally likely
• P(C1)=P(C2)=….P(Cm)
• maximize P(X|Ci)
• Otherwise maximize P(X|Ci)P(Ci)
• P(Ci)=|Ci,D|/|D|
• Where |Ci,D| is the number of training tuples of class Ci in D
• Computation of P(X|Ci) is made on the assumption
• Class-conditional independence
• Presumes that the attributes’ values are conditionally independent of one another, given the class label
• For each attribute, we look at whether the attribute is
• Categorical
• Continuous-valued
• To compute P(X|Ci), consider the following
• If Ak is categorical, P(xk|Ci) is the number of tuples of class Ci in D having the value xk for
Ak, divided by |Ci,D|, the number of tuples of class Ci in D
• If Ak is continuous-valued, then P(xk|Ci) is typically estimated with a Gaussian distribution: P(xk|Ci) = g(xk, μCi, σCi) = (1/(√(2π)·σCi)) · exp(−(xk − μCi)² / (2σCi²)), where μCi and σCi are the mean and standard deviation of attribute Ak over the class-Ci tuples
• Example
Therefore the naïve Bayesian classifier predicts buys_computer = yes for tuple X
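A minimal naïve Bayes sketch over categorical attributes; the toy training tuples below are a hypothetical stand-in for the buys_computer table (which is not reproduced in these notes), so the printed prediction only illustrates the mechanics.

```python
from collections import Counter, defaultdict

def train_naive_bayes(D):
    """D: list of (attribute_dict, class_label) pairs.
    Returns priors P(Ci) plus per-class, per-attribute value counts for P(xk|Ci)."""
    class_counts = Counter(label for _, label in D)
    priors = {c: n / len(D) for c, n in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[c][attr][value] = count
    for x, c in D:
        for attr, value in x.items():
            cond[c][attr][value] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Choose the class maximizing P(Ci) * prod_k P(xk|Ci) (class-conditional independence)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond[c][attr][value] / class_counts[c]   # estimate of P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy training data (a stand-in for the buys_computer table).
D = [({"age": "youth", "student": "no"}, "no"),
     ({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "middle_aged", "student": "no"}, "yes"),
     ({"age": "senior", "student": "yes"}, "yes"),
     ({"age": "senior", "student": "no"}, "no")]

priors, cond, class_counts = train_naive_bayes(D)
print(predict({"age": "youth", "student": "yes"}, priors, cond, class_counts))  # -> "yes"
```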
• Numerical example
• Sometimes P(xk|Ci) may end up with probability 0
• P(student = yes|buys_computer = no) = 0
• Plugging in this zero value would return a zero probability for P(X|Ci)
• Even though, without the zero probability,
• The class may have ended up with a high probability
• A zero probability cancels the effects of all the other probabilities in the product
• Laplacian correction/Laplacian estimator technique may be used
• training database, D, containing 1000 tuples
• 0 tuples with income=low
• 990 tuples with income=medium
• 10 tuples with income=high
• The probabilities are 0, 0.990, and 0.010
• Using the Laplacian correction, we add 1 to each count (and 3 to the denominator), giving 1/1003 ≈ 0.001, 991/1003 ≈ 0.988, and 11/1003 ≈ 0.011
• Thus, the zero probability is avoided while the corrected estimates stay close to the originals
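The same arithmetic as a short sketch:

```python
# Laplacian correction (add-one smoothing) for the income example above:
# counts of low / medium / high among 1000 class-C tuples.
counts = {"low": 0, "medium": 990, "high": 10}

uncorrected = {k: v / 1000 for k, v in counts.items()}
corrected   = {k: (v + 1) / (1000 + len(counts)) for k, v in counts.items()}

print(uncorrected)   # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(corrected)     # {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}
```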
Model selection using statistical tests of significance
• Suppose we have generated two classification models M1 and M2
• We performed 10-fold cross validation to obtain mean error rate for each
• How to determine which model is best?
• It may seem intuitive to select the model with the lowest error rate
• But,
• Mean error rates are just estimates of the error
• There can be considerable variance between error rates within any given 10-fold cross-validation
• Mean error rates obtained for M1 and M2 may be different
• Difference may not be statistically significant
• To determine whether there is any ‘real’ difference in the mean error rates of the two models
• Employ a test of statistical significance
• Eg: a paired t-test on the per-fold error rates
• The null hypothesis is that the two models are the same, i.e., the difference between their mean error rates is zero
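One common way to carry out such a test is a paired t-test on the per-fold error rates from the same cross-validation splits; the synthetic dataset and the two particular models below are illustrative assumptions.

```python
# Paired t-test on per-fold error rates from the same 10-fold cross-validation.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both models

err_m1 = 1 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
err_m2 = 1 - cross_val_score(GaussianNB(), X, y, cv=cv)

t_stat, p_value = ttest_rel(err_m1, err_m2)
print("mean error M1:", err_m1.mean(), "mean error M2:", err_m2.mean())
print("p-value:", p_value)   # small p-value -> reject the null hypothesis that the models are the same
```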
Comparing classifiers based on cost-benefit analysis and ROC curves
• True Positives (correctly predicting a cancerous patient as cancerous)
• True Negatives (correctly predicting a non-cancerous patient as non-cancerous)
• False Positives (incorrectly predicting non-cancerous patient as cancerous)
• False Negatives (incorrectly predicting a cancerous patient as non-cancerous)
• Useful in assessing the costs and benefits associated with a classification model
• The cost associated with a false negative is far greater than that of a false positive
• We can weight one type of error more heavily than the other by assigning a different cost to each
• Can incorporate costs and benefits by computing the average cost per decision
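A small sketch of computing the average cost per decision from confusion-matrix counts; all counts and costs below are invented for illustration.

```python
# Average cost per decision = total cost of all decisions / number of decisions.
tp, fn, fp, tn = 90, 10, 40, 860          # hypothetical confusion-matrix counts

cost_tp = 1.0      # follow-up cost for a correctly flagged patient
cost_fn = 100.0    # missing a cancerous patient is very costly
cost_fp = 5.0      # a false alarm is comparatively cheap
cost_tn = 0.0

total_cost = tp * cost_tp + fn * cost_fn + fp * cost_fp + tn * cost_tn
print(total_cost / (tp + fn + fp + tn))   # (90*1 + 10*100 + 40*5 + 0) / 1000 = 1.29
```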
Receiver Operating Characteristic Curves
• Useful tool for comparing two classification models
• Come from signal detection theory
• Shows the trade-off between
• True positive rate (TPR)
• False positive rate (FPR)
• Given a test set and a model
• TPR is the proportion of positive (or “yes”) tuples correctly labelled by the model
• FPR is the proportion of negative (or “no”) tuples that are mislabelled as positive
• TPR=TP/P, which is sensitivity
• FPR=FP/N, which is 1-specificity
  Tuple ID   Class   Prob.   TP   FN   FP   TN   TPR   FPR
  1          P       0.90     1    4    0    5   0.2   0
  2          P       0.80     2    3    0    5   0.4   0
  3          N       0.70     2    3    1    4   0.4   0.2
  4          P       0.60     3    2    1    4   0.6   0.2
  5          P       0.55     4    1    1    4   0.8   0.2
  6          N       0.54     4    1    2    3   0.8   0.4
  7          N       0.53     4    1    3    2   0.8   0.6
  8          N       0.51     4    1    4    1   0.8   0.8
  9          P       0.50     5    0    4    1   1.0   0.8
  10         N       0.40     5    0    5    0   1.0   1.0
• To assess the accuracy of a model, we can measure the area under the curve (AUC)
• Several software packages are able to perform such calculations
• The closer the area is to 0.5, the less accurate the corresponding model is
• A model with perfect accuracy will have an AUC of 1.0.
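A short sketch that reproduces the (FPR, TPR) points implied by the table above and computes the AUC; the class labels and probabilities are taken from that table.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]                       # P -> 1, N -> 0
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]

# drop_intermediate=False keeps every threshold, so the points match the table's rows
# (plus a leading (0, 0) point).
fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
print(list(zip(fpr, tpr)))

print("AUC:", roc_auc_score(y_true, y_score))   # 0.76 for this example
```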