
Chapter 6.

Classification: Basic Concepts


 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

1
Supervised vs. Unsupervised Learning (1)
 Supervised learning (classification)
 Supervision: The training data, such as observations or measurements, are accompanied by labels indicating the classes to which they belong
 New data is classified based on the models built from the training set
Training data with class labels:

Outlook  | Temp | Humidity | Windy | Play Golf
Rainy    | Hot  | High     | False | No
Rainy    | Hot  | High     | True  | No
Overcast | Hot  | High     | False | Yes
Sunny    | Mild | High     | False | Yes
Sunny    | Cool | Normal   | False | Yes
Sunny    | Cool | Normal   | True  | No
Overcast | Cool | Normal   | True  | Yes
Rainy    | Mild | High     | False | No

[Figure: training instances are fed to a model-learning step that builds the model; test instances are then given to the model, which predicts Positive or Negative]
2
Supervised vs. Unsupervised Learning (2)
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of observations or measurements, establish the possible existence
of classes or clusters in the data

3
Prediction Problems: Classification vs. Numeric Prediction
 Classification
 Predict categorical class labels (discrete or nominal)
 Construct a model based on the training set and the class labels (the values in a
classifying attribute) and use it in classifying new data
 Numeric prediction
 Model continuous-valued functions (i.e., predict unknown or missing values)

4
Classification—Model Construction, Validation and Testing
 Model Construction and Training
 Model: Represented as decision trees, rules, mathematical formulas, or other forms
 Assumption: Each sample belongs to a predefined class, as indicated by its class label attribute
 Training Set: The set of samples used for model construction

5
Classification—Model Construction, Validation and Testing
 Model Validation and Testing:
 Test: Estimate the accuracy of the model
 The known label of each test sample is compared with the model's classification result
 Accuracy: % of test set samples that are correctly classified by the model
 The test set is independent of the training set
 Validation: If a held-out set is used to select or refine models, it is called a validation (or development) set
 Model Deployment: If the accuracy is acceptable, use the model to classify new data

6
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

7
Decision Tree Induction: An Example
 Decision tree construction: a top-down, recursive, divide-and-conquer process
 Training data set: Play Golf?

Outlook  | Temp | Humidity | Windy | Play Golf
Rainy    | Hot  | High     | False | No
Rainy    | Hot  | High     | True  | No
Overcast | Hot  | High     | False | Yes
Sunny    | Mild | High     | False | Yes
Sunny    | Cool | Normal   | False | Yes
Sunny    | Cool | Normal   | True  | No
Overcast | Cool | Normal   | True  | Yes
Rainy    | Mild | High     | False | No
Rainy    | Cool | Normal   | False | Yes
Sunny    | Mild | Normal   | False | Yes
Rainy    | Mild | Normal   | True  | Yes
Overcast | Mild | High     | True  | Yes
Overcast | Hot  | Normal   | False | Yes
Sunny    | Mild | High     | True  | No

 Resulting tree:
outlook?
  Sunny → windy? (False → Yes, True → No)
  Overcast → Yes
  Rainy → humidity? (Normal → Yes, High → No)
8 https://www.saedsayad.com/decision_tree.htm
Decision Tree Induction: Algorithm
Basic algorithm
 Tree is constructed in a top-down, recursive, divide-and-conquer manner
 At start, all the training examples are at the root
 Examples are partitioned recursively based on selected attributes
 At each node, the splitting attribute is selected based on the training examples at that node, using a heuristic or statistical measure (e.g., information gain, Gini index)

9
Decision Tree Induction: Algorithm
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning
 There are no samples left
 Prediction
 Majority voting is employed for classifying the leaf

10
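To make the algorithm above concrete, here is a minimal Python sketch of top-down, recursive, divide-and-conquer induction, using information gain as the attribute selection measure and majority voting at the leaves. The list-of-dicts data layout, the default target name "Play Golf", and all helper names are assumptions made for illustration, not part of the chapter.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute `attr`."""
    labels = [r[target] for r in rows]
    info_after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        info_after += len(subset) / len(rows) * entropy(subset)   # Info_A(D)
    return entropy(labels) - info_after

def build_tree(rows, attrs, target="Play Golf"):
    labels = [r[target] for r in rows]
    # Stopping conditions from the slide: pure node, or no attributes left;
    # the leaf then predicts by majority voting.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:                          # divide ...
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attrs if a != best]
        tree[best][value] = build_tree(subset, rest, target)       # ... and conquer
    return tree
```

Calling build_tree on the 14 Play Golf tuples with attrs = ["Outlook", "Temp", "Humidity", "Windy"] should reproduce the tree shown above, with outlook at the root.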
How to Handle Continuous-Valued Attributes?
 Method 1: Discretize continuous values and treat them as categorical values
 E.g., age: < 20, 20..30, 30..40, 40..50, > 50
 Method 2: Determine the best split point for a continuous-valued attribute A
 Sort the values, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
 Possible split points: the midpoints (ai + ai+1)/2 of adjacent sorted values
 e.g., (15+18)/2 = 16.5, 19.5, 21.5, 23, 24.5, 27, 30, …
 The point with the maximum information gain for A is selected as the split-
point for A
 Split: Based on split point P
 The set of tuples in D satisfying A ≤ P vs. those with A > P

11
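A small sketch of Method 2: sort the values, form midpoint split points, and keep the one with the highest information gain. The tiny value/label lists are assumptions made only for illustration.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    ys = [l for _, l in pairs]
    base = entropy(ys)
    best = None
    for i in range(len(pairs) - 1):
        p = (pairs[i][0] + pairs[i + 1][0]) / 2              # midpoint split point
        left = [l for v, l in pairs if v <= p]               # A <= P
        right = [l for v, l in pairs if v > p]               # A > P
        gain = base - (len(left) / len(ys)) * entropy(left) \
                    - (len(right) / len(ys)) * entropy(right)
        if best is None or gain > best[1]:
            best = (p, gain)
    return best                                              # (split point, its gain)

print(best_split_point([15, 18, 21, 22, 24, 25, 29, 31],
                       ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]))
```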
Pro’s and Con’s
Pro’s
 Easy to explain (even to non-experts)
 Easy to implement (widely available in software packages)
 Efficient
 Can tolerate missing data
 White box
 No need to normalize data
 Non-parametric: no assumption on the data distribution, no assumption of attribute independence
 Can work on various attribute types

12
Con’s
Con’s
 Unstable: sensitive to noise
 Accuracy may not be good enough (depending on your data)
 Finding the optimal tree is NP-hard, so greedy algorithms are used
 Overfitting

13
Splitting Measures: Information Gain
Entropy (Information Theory)
 A measure of the uncertainty associated with a random variable
 Calculation: For a discrete random variable Y taking m distinct values {y1, y2, …, ym}:
H(Y) = − Σ_{i=1}^{m} pi log2(pi), where pi = P(Y = yi)
 Interpretation
 Higher entropy → higher uncertainty
 Lower entropy → lower uncertainty
 Conditional entropy: H(Y | X) = Σ_x P(X = x) H(Y | X = x)

[Figure: entropy of a binary variable (m = 2) as a function of p]

14
Information Gain: An Attribute Selection Measure
 Select the attribute with the highest information gain (used in typical
decision tree induction algorithm: ID3/C4.5)
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ_{i=1}^{m} pi log2(pi)
 Information needed (after using A to split D into v partitions) to classify D:
InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
 Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)
15
Example: Attribute Selection with Information Gain
Training data: the Play Golf data set (14 tuples) shown earlier; candidate splitting attribute: outlook

outlook  | yes | no | I(yes, no)
rainy    |  2  | 3  | 0.971
overcast |  4  | 0  | 0
sunny    |  3  | 2  | 0.971

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Info_outlook(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "outlook = rainy" has 5 out of 14 samples, with 2 yes's and 3 no's.

Hence Gain(outlook) = Info(D) − Info_outlook(D) = 0.246
16
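A quick numeric check of the worked example above, reproducing Info(D), Info_outlook(D), and Gain(outlook) from the counts in the table; the helper name I is an assumption for illustration.

```python
from math import log2

def I(*counts):
    """Entropy of a class distribution given raw counts, e.g. I(9, 5)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = I(9, 5)                      # 14 tuples: 9 yes, 5 no
info_outlook = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)
gain = info_D - info_outlook
print(f"{info_D:.3f} {info_outlook:.3f} {gain:.3f}")   # 0.940 0.694 0.247
# (the slide's 0.246 comes from subtracting the already-rounded 0.940 and 0.694)
```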
Example: Attribute Selection with Information Gain
Training data: the same Play Golf data set; per-value class counts for the remaining attributes:

Temp | Yes | No | I(Yes, No)
Hot  |  2  | 2  | ?
Mild |  4  | 2  | ?
Cool |  3  | 1  | ?

Windy | Yes | No | I(Yes, No)
True  |  ?  | ?  | ?
False |  ?  | ?  | ?

Humidity | Yes | No | I(Yes, No)
Normal   |  6  | 1  | ?
High     |  3  | 4  | ?

Similarly, we can get Gain(Temp) = 0.029, Gain(Humidity) = 0.151, Gain(Windy) = 0.048
17
Gain Ratio: A Refined Measure for Attribute Selection
 Information gain measure is biased towards attributes with a large number of
values (e.g. ID)
 Gain ratio: Overcomes the problem (as a normalization to information gain)
SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 The attribute with the maximum gain ratio is selected as the splitting attribute
 Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R. Quinlan
 Example
 SplitInfo_temp(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
 GainRatio(temp) = 0.029 / 1.557 = 0.019

18
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

19
Bayes’ Theorem: Basics
 Total probability theorem:
P(B) = Σ_i P(B | Ai) P(Ai)
 Bayes' theorem:
P(H | X) = P(X | H) P(H) / P(X) ∝ P(X | H) P(H)
posterior probability (what we should choose) ∝ likelihood (what we just see) × prior probability (what we knew previously)

 X: a data sample ("evidence"); H: the hypothesis that X belongs to class C
 Prediction can be done based on Bayes' theorem; classification is to derive the maximum posteriori (MAP) class
20
Bayes’ Theorem Example: Picnic Day
 The morning is cloudy 
 What is the chance of rain? P(Rain | Cloud) = ?
 50% of all rainy days start off cloudy. P(Cloud | Rain) = 50%
 Cloudy mornings are common (40% of days start cloudy) P(Cloud) = 40%
 This is usually a dry month (only 3 of 30 days tend to be rainy) P(Rain) = 10%
P(Rain | Cloud) = P(Rain) P(Cloud | Rain) / P(Cloud) = 10% * 50% / 40% = 12.5%

 The chance of rain is probably not as high as expected 


 Bayes' Theorem lets us move back and forth between the posterior and the likelihood (e.g., P(Rain | Cloud) and P(Cloud | Rain)) and test our beliefs against reality, which is the most important trick in Bayesian inference

21
Naïve Bayes Classifier: Making a Naïve Bayes Assumption
 Based on Bayes' Theorem, we can derive a Bayes classifier to compute the posterior probability of classifying an object X to a class C
 P(C|X) ∝ P(X|C)P(C) = P(x1|C)P(x2|x1,C)…P(xn|x1,…,xn−1,C)P(C)

 A naïve Bayes assumption to simplify the complex dependencies: features are conditionally independent, given the class label!
 P(C|X) ∝ P(X|C)P(C) ≈ P(x1|C)P(x2|C)…P(xn|C)P(C)

 Super efficient: each feature only conditions on the class (boils down to sample
counting)
 Achieves surprisingly comparable performance
22
Naïve Bayes Classifier: Categorical vs. Continuous Valued Features
 If feature xk is categorical, p(xk = vk |Ci ) is the # of tuples in Ci with xk = vk ,
divided by |Ci, D| (# of tuples of Ci in D)

P(X | Ci) = Π_k p(xk | Ci) = p(x1 | Ci) · p(x2 | Ci) ··· p(xn | Ci)

 If feature xk is continuous-valued, p(xk = vk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
p(xk = vk | Ci) = N(xk; μCi, σCi) = (1 / (√(2π) σCi)) · exp( −(xk − μCi)² / (2 σCi²) )

23
Naïve Bayes Classifier Example 1: Training Dataset

Class label: play golf = 'yes' or play golf = 'no'
[Figure: the Play Golf training data set]

24
Naïve Bayes Classifier Example P(Yes | Sunny)

25
Naïve Bayes Classifier Example: P(No | Sunny)

26
Naïve Bayes Classifier Example: Likelihood Tables

27
Naïve Bayes Classifier Example: Likelihood Tables

P(Yes | x) = P(x | Yes) · P(Yes) / P(x)

P(No | x) = P(x | No) · P(No) / P(x)

P(x) = P(x | yes) · P(yes) + P(x | no) · P(no)

28
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability be non-zero
 Otherwise, the predicted probability will be zero
P(X | Ci) = Π_k p(xk | Ci) = p(x1 | Ci) · p(x2 | Ci) ··· p(xn | Ci)
 Example. Suppose a dataset with 1,000 tuples:
income = low (0), income= medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 (or a small integer) to each case
Prob(income = low) = 1/(1000 + 3)
Prob(income = medium) = (990 + 1)/(1000 + 3)
Prob(income = high) = (10 + 1)/(1000 + 3)
 The “corrected” probability estimates are close to their “uncorrected”
counterparts
29
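A minimal sketch of a categorical naïve Bayes classifier that applies the Laplacian correction discussed above; the class layout, the tiny Play Golf style fragment, and all names are assumptions made for illustration, not a reference implementation. The usage line mirrors the P(Yes | Sunny) vs. P(No | Sunny) comparison sketched on the earlier example slides.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, X, y):
        self.class_counts = Counter(y)                     # counts for the priors P(C)
        self.feat_counts = defaultdict(Counter)            # (class, feature) -> value counts
        self.values = defaultdict(set)                     # feature -> observed values
        for xi, yi in zip(X, y):
            for feat, val in xi.items():
                self.feat_counts[(yi, feat)][val] += 1
                self.values[feat].add(val)
        return self

    def predict(self, x):
        n = sum(self.class_counts.values())
        best_class, best_score = None, float("-inf")
        for c, cc in self.class_counts.items():
            score = cc / n                                  # prior P(C)
            for feat, val in x.items():
                # Laplacian correction: add 1 to every count so no factor is zero
                num = self.feat_counts[(c, feat)][val] + 1
                den = cc + len(self.values[feat])
                score *= num / den                          # P(x_k | C)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Usage on a made-up fragment of Play Golf style data:
X = [{"outlook": "Sunny", "windy": False}, {"outlook": "Rainy", "windy": True},
     {"outlook": "Overcast", "windy": False}, {"outlook": "Sunny", "windy": True}]
y = ["Yes", "No", "Yes", "No"]
print(NaiveBayes().fit(X, y).predict({"outlook": "Sunny", "windy": False}))  # "Yes"
```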
Naïve Bayes Classifier: Strength vs. Weakness
Strength
 Performance: A naïve Bayesian classifier has comparable performance with decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct—prior knowledge can be combined with
observed data

30
Naïve Bayes Classifier: Strength vs. Weakness
Weakness
 Assumption: class-conditional independence of attributes, which causes a loss of accuracy
 E.g., Patient’s Profile: (age, family history),
 Patient’s Symptoms: (fever, cough),
 Patient’s Disease: (lung cancer, diabetes).
 Dependencies among these cannot be modeled by Naïve Bayes Classifier
 How to deal with these dependencies?
Use Bayesian Belief Networks (chapter 7)

31
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

32
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
 Eager learning (the above discussed methods): Given a set of training tuples,
constructs a classification model before receiving new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form an implicit global approximation to the target function
 Eager: must commit to a single hypothesis that covers the entire instance space

33
Lazy Learner: Instance-Based Methods
 Instance-based learning:
 Store training examples and delay the processing (“lazy evaluation”) until a
new instance must be classified
 Typical approaches
 k-nearest neighbor approach
 Instances represented as points in a Euclidean space.
 Locally weighted regression
 Constructs local approximation
 Case-based reasoning
 Uses symbolic representations and knowledge-based inference

34
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: query point xq surrounded by + and − training examples, with the piecewise-linear 1-NN decision surface]
35
Discussion on the k-NN Algorithm
 k-NN for real-valued prediction for a given unknown tuple
 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weight the contribution of each of the k neighbors according to their distance to the query xq
 Give greater weight to closer neighbors, e.g., w = 1 / d(xq, xi)²
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes
 To overcome it, axes stretch or elimination of the least relevant attributes

36
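A minimal sketch of the distance-weighted k-NN rule above with w = 1 / d(xq, xi)²; the Euclidean distance and the toy 2-D points are assumptions made only for illustration.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature vector."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(x, query)
        votes[label] += 1.0 / (d * d) if d > 0 else float("inf")  # closer -> larger weight
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "+"
```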
Selection of k for kNN
The number of neighbors k
 Small k: overfitting (high var., low bias)
 Big k: bringing too many irrelevant points (high bias, low var.)

37 http://scott.fortmann-roe.com/docs/BiasVariance.html
Case-Based Reasoning (CBR)
 CBR: Uses a database of problem solutions to solve new problems
 Store symbolic description (tuples or cases)—not points in a Euclidean space
 Applications: Customer-service (product-related diagnosis), legal ruling
 Methodology
 Instances represented by rich symbolic descriptions (e.g., function graphs)
 Search for similar cases, multiple retrieved cases may be combined
 Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
 Challenges
 Find a good similarity metric
 Indexing based on syntactic similarity measure, and when failure, backtracking,
and adapting to additional cases
38
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

39
Linear Regression Problem: Example
 Mapping from independent attributes to continuous value: x => y
 {living area} => Price of the house
 {college; major; GPA} => Future Income
[Figure: scatter plot of house price vs. living area]
40
Linear Regression Problem: Model
 Linear regression
 Data: n independent objects
 Observed values: yi, i = 1, 2, …, n
 p-dimensional attributes: xi = (xi1, xi2, …, xip)^T, i = 1, 2, …, n
 Model:
 Weight vector: w = (w1, w2, …, wp)
 yi = w^T xi + b
 The weight vector w and the bias b are the model parameters learned from the data

41
Linear Regression Model: Solution
Least Square Method
 Cost / loss function: L(w, b) = Σ_{i=1}^{n} (yi − w xi − b)²
 Optimization goal: argmin_{(w,b)} L(w, b)
 Closed-form solution (single attribute):
w = Σ_{i=1}^{n} xi (yi − ȳ) / ( Σ_{i=1}^{n} xi² − (1/n)(Σ_{i=1}^{n} xi)² )
b = (1/n) Σ_{i=1}^{n} (yi − w xi)

42
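A quick NumPy check of the single-attribute closed-form solution above; the toy (living area, price) pairs are made up for illustration, and np.polyfit is used only to confirm that the formulas give the same fit.

```python
import numpy as np

x = np.array([50.0, 70.0, 90.0, 110.0, 130.0])     # living area
y = np.array([150.0, 200.0, 260.0, 300.0, 360.0])  # price

n = len(x)
w = np.sum(x * (y - y.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
b = np.mean(y - w * x)
print(w, b)                       # slope and intercept of the least-squares line
print(np.polyfit(x, y, 1))        # NumPy's own degree-1 fit should agree: [w, b]
```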
Logistic Regression: General Ideas
 How to solve “classification” problems by regression?
 Key idea of Logistic Regression
 We need to transform the real value Y into a probability value in (0, 1)
 Sigmoid function (a differentiable function):
σ(z) = 1 / (1 + e^(−z)) = e^z / (e^z + 1)
 Projects (−∞, +∞) to (0, 1)
 Not only logistic regression uses this function; it is also used in neural networks and deep learning
 The projected value changes sharply around the zero point
 Notice that ln( y / (1 − y) ) = w^T x + b

[Figure: the sigmoid function]

43
Logistic Regression: An Example
Suppose we only consider the year as the feature
 Data points are converted by the sigmoid function (the "activation" function)

[Figure: tenured (1) vs. not-tenured (0) data points plotted against year, fitted with a sigmoid curve that crosses 0.5 near year 6]

44
Logistic Regression: Model
 The prediction function to learn
 Probability that Y = 1:
p(Y = 1 | X = x; w) = Sigmoid( w0 + Σ_{i=1}^{n} wi · xi )
 w = (w0, w1, w2, …, wn) are the parameters

 For a single data object with attributes xi and class label yi:
 Suppose p(ŷi = 1 | xi, w) = pi; then p(ŷi = 0 | xi, w) = 1 − pi
 p(ŷi = yi) = pi^{yi} (1 − pi)^{1−yi}

 Maximum likelihood estimation:
L = Π_i pi^{yi} (1 − pi)^{1−yi} = Π_i [ exp(w^T xi) / (1 + exp(w^T xi)) ]^{yi} [ 1 / (1 + exp(w^T xi)) ]^{1−yi}
45
Logistic Regression: Optimization
 Maximum likelihood estimation:
L = Π_i pi^{yi} (1 − pi)^{1−yi} = Π_i [ exp(w^T xi) / (1 + exp(w^T xi)) ]^{yi} [ 1 / (1 + exp(w^T xi)) ]^{1−yi}
 Log likelihood:
l(w) = Σ_{i=1}^{N} [ yi log p(Y = 1 | X = xi; w) + (1 − yi) log(1 − p(Y = 1 | X = xi; w)) ]
     = Σ_{i=1}^{N} [ yi xi^T w − log(1 + exp(w^T xi)) ]

 There is no closed-form solution
 Gradient descent / ascent
46
Gradient Descent
 Gradient Descent is an iterative optimization algorithm for finding the minimum
of a function (e.g., the negative log likelihood)
 For a function F(x), starting from a point a, F decreases fastest if we move in the direction of the negative gradient at a:
a_new = a − γ ∇F(a), where γ is the step size (learning rate)
 When the gradient is zero, we arrive at a local minimum
47
Gradient Descent
[footnote: It is gradient ascent for maximizing log likelihood]
Note: the index i used on the previous slides is switched to l here, and i now refers to the components of the weight vector w
[Slides 48–50: derivation of the gradient of the log likelihood and the iterative weight-update rule; the equations did not survive extraction — see the sketch below]
48–50
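Since the derivation on slides 48–50 did not survive extraction, here is a hedged sketch of fitting logistic regression by gradient ascent on the log likelihood l(w) shown earlier; the gradient used is the standard X^T(y − p), and the toy data, the learning rate eta, and the iteration count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iters=2000):
    """X: (N, d) feature matrix; y: (N,) labels in {0, 1}. A bias column is added."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])          # w_0 acts as the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                                 # p(Y=1 | x; w) for every object
        grad = Xb.T @ (y - p)                               # gradient of the log likelihood
        w += eta * grad / len(y)                            # ascent step (maximize l(w))
    return w

# Toy 1-D example in the spirit of the "years -> tenured" slide:
X = np.array([[1.0], [2.0], [3.0], [5.0], [7.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(np.array([1.0, 6.0]) @ w))   # P(Y=1) near the decision boundary (~0.5)
```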
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

51
Model Evaluation and Selection
 Evaluation metrics
 How can we measure accuracy?
 Other metrics to consider?
 Use a validation or test set of class-labeled tuples, instead of the training set, when assessing accuracy
 Methods for estimating a classifier’s accuracy
 Holdout method
 Cross-validation
 Bootstrap (not covered)
 Comparing classifiers:
 ROC Curves
52
Classifier Evaluation Metrics: Confusion Matrix
 Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
 In a confusion matrix w. m classes, CMi,j indicates # of tuples in class i that
were labeled by the classifier as class j
 May have extra rows/columns to provide totals
 Example of Confusion Matrix:

Actual class\Predicted class play_golf = yes play_golf = no Total


play_golf = yes 6954 46 7000
play_golf = no 412 2588 3000
Total 7366 2634 10000
53
Classifier Evaluation Metrics: Accuracy, Error Rate,
Sensitivity and Specificity
A\P   C    ¬C
C     TP   FN    P
¬C    FP   TN    N
      P'   N'    All
(rows: real-world truth; columns: predictions)

 Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN) / All
 Error rate: 1 − accuracy, or Error rate = (FP + FN) / All

 Class imbalance problem
 One class may be rare, e.g., fraud, or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Measures that handle the class imbalance problem:
 Sensitivity (recall): true positive recognition rate, Sensitivity = TP / P
 Specificity: true negative recognition rate, Specificity = TN / N
54
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
A\P C ¬C
C TP FN P
¬C FP TN N
P’ N’ All

 Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
Precision P = TP / (TP + FP)
 Recall (completeness): what % of positive tuples did the classifier label as positive?
Recall R = TP / (TP + FN)
 Range: [0, 1]
https://en.wikipedia.org/wiki/Precision_and_recall
55
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 The "inverse" relationship between precision & recall
 We want one number to say whether a classifier is good or not
 F measure (or F-score): the harmonic mean of precision and recall
 In general, it is the weighted harmonic mean of precision & recall:
F_β = 1 / ( α · (1/P) + (1 − α) · (1/R) ) = ( (β² + 1) · P · R ) / ( β² · P + R )
(assigning β times as much weight to recall as to precision)
 F1-measure (balanced F-measure): when β = 1,
F1 = 2 · P · R / (P + R)

56
Classifier Evaluation Metrics: Example
 Using the confusion matrix below, calculate the measures just introduced
Actual Class\Predicted class cancer = yes cancer = no Total
cancer = yes 90 210 300
cancer = no 140 9560 9700
Total 230 9770 10000

 Sensitivity =
 Specificity =
 Accuracy =
 Error rate =
 Precision =
 Recall =
 F1 =
57
Classifier Evaluation Metrics: Example
 Using the same confusion matrix, calculate the measures just introduced
Actual Class\Predicted class cancer = yes cancer = no Total
cancer = yes 90 210 300
cancer = no 140 9560 9700
Total 230 9770 10000

 Sensitivity = TP/P = 90/300 = 30%


 Specificity = TN/N = 9560/9700 = 98.56%
 Accuracy = (TP + TN)/All = (90+9560)/10000 = 96.50%
 Error rate = (FP + FN)/All = (140 + 210)/10000 = 3.50%
 Precision = TP/(TP + FP) = 90/(90 + 140) = 90/230 = 39.13%
 Recall = TP/ (TP + FN) = 90/(90 + 210) = 90/300 = 30.00%
 F1 = 2 P × R /(P + R) = 2 × 39.13% × 30.00%/(39.13% + 30%) = 33.96%
58
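A quick check of the worked example, computing all of the metrics directly from the confusion matrix counts.

```python
# Cancer confusion matrix from the slide: TP=90, FN=210, FP=140, TN=9560
TP, FN, FP, TN = 90, 210, 140, 9560
P, N, All = TP + FN, FP + TN, TP + FN + FP + TN

sensitivity = TP / P                       # recall, 0.30
specificity = TN / N                       # 0.9856
accuracy    = (TP + TN) / All              # 0.965
error_rate  = (FP + FN) / All              # 0.035
precision   = TP / (TP + FP)               # 0.3913
recall      = TP / (TP + FN)               # 0.30
f1          = 2 * precision * recall / (precision + recall)   # 0.3396
print(sensitivity, specificity, accuracy, error_rate, precision, recall, f1)
```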
Training Error VS Testing Error

[Figure: training error and testing error as functions of model complexity — the underfitting region has high training and testing error; the overfitting region has low training error but increasing testing error]

59
Classifier Evaluation: Holdout
 Holdout method

 Given data is randomly partitioned into two independent sets

 Training set (e.g., 2/3) for model construction

 Test set (e.g., 1/3) for accuracy estimation

 Repeated random sub-sampling validation: a variation of holdout

 Repeat holdout k times, accuracy = avg. of the accuracies obtained

60
Classifier Evaluation: Cross-Validation
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
 At the i-th iteration, use Di as the test set and the others as the training set
 Leave-one-out: k folds where k = # of tuples, for small-sized data
 *Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

61
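A minimal sketch of stratified 10-fold cross-validation using scikit-learn; the decision-tree model and the synthetic data are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # k = 10 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())    # average accuracy over the 10 folds
```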
Model Selection: ROC Curves

 ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false positive rate
 The area under the ROC curve (AUC) is a measure of the accuracy of the model
 Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
 The vertical axis represents the true positive rate (TP/P); the horizontal axis represents the false positive rate (FP/N); the plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0; the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model

[Figure: ROC curve plotted as true positive rate vs. false positive rate, with the diagonal reference line]
62
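A minimal sketch of computing an ROC curve and its AUC from predicted class probabilities with scikit-learn; the logistic-regression scorer and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)   # ranks test tuples by decreasing score
print("AUC =", roc_auc_score(y_te, probs))      # 1.0 is perfect, 0.5 is the diagonal
```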
Chapter 6. Classification: Basic Concepts
 Classification: Basic Concepts

 Decision Tree Induction

 Bayes Classification Methods

 Lazy Learners (or learning from your neighbors)

 Linear Classifiers

 Model Evaluation and Selection

 Techniques to Improve Classification Accuracy

 Summary

63
Techniques to Improve Classification Accuracy
 Introducing Ensemble Methods

 Bagging

 Boosting

 Random Forests

 Imbalanced Classification

64
Ensemble Methods: Increasing the Accuracy
Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*

65
Ensemble Methods: Increasing the Accuracy
 What are the requirements to generate an improved model?
 Example: majority voting
Base model and ensemble (majority voting) performance on three test tuples x1, x2, x3 (✓ = correct, ✗ = wrong):

Case 1 (ensemble has positive effect):   M1: ✓ ✓ ✗   M2: ✗ ✓ ✓   M3: ✓ ✗ ✓   Voting: ✓ ✓ ✓
Case 2 (ensemble has no effect):         M1: ✓ ✓ ✗   M2: ✓ ✓ ✗   M3: ✓ ✓ ✗   Voting: ✓ ✓ ✗
Case 3 (ensemble has negative effect):   M1: ✓ ✗ ✗   M2: ✗ ✓ ✗   M3: ✗ ✗ ✓   Voting: ✗ ✗ ✗

 Base models should be
 Accurate
 Diverse

66
Ensemble Methods: Increasing the Accuracy
Popular ensemble methods
 Bagging: Trains each model using a subset of the training set, and models
learned in parallel
 Boosting: Trains each new model instance to emphasize the training instances
that previous models mis-classified, and models learned in order

[Figure: schematic comparison of bagging (models trained in parallel) and boosting (models trained sequentially)]
67
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority vote

Training
 For i = 1 to k
 create bootstrap sample, Di, by sampling D with replacement;
 use Di and the learning scheme to derive a model, Mi ;

Classification: classify an unknown sample X


 let each of the k models classify X and return the majority vote

Prediction:
 To predict continuous variables, use average prediction instead of vote

68
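A minimal from-scratch sketch of the bagging procedure above: k bootstrap samples, one model per sample, majority vote at prediction time. Using decision trees as the base models and the synthetic data set are assumptions made for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap: sample D with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]         # majority vote

X, y = make_classification(n_samples=200, random_state=0)
models = bagging_fit(X, y)
print(bagging_predict(models, X[0]), y[0])
```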
Boosting
 Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
 How boosting works?
 A series of k classifiers are iteratively learned
 After a classifier Mi is learned, set the subsequent classifier, Mi+1, to pay more
attention to the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
 Boosting algorithm can be extended for numeric prediction

69
Adaboost (Freund and Schapire, 1997)
1. Assign initial weights {wn(1)} to each training tuple
2. Train the base classifier M1 on the weighted dataset
3. Update the weights based on the current model, giving {wn(2)}, …, {wn(k)}, and train M2, …, Mk in turn
4. After the base classifiers are trained, they are combined to give the final classifier:
M*(x) = sign( Σ_{i=1}^{k} αi Mi(x) )

Two "weighting" strategies:
1. Assign weights to each training example
2. Sample the dataset based on the weight distribution
70
Adaboost (Freund and Schapire, 1997)
 Input: Training set D = {(x1, y1), (x2, y2), …, (xn, yn)}
 Initialize all weights {wn(1)} to 1/N
 For round i = 1 to k:
 Fit a classifier Mi by minimizing the weighted error function
Ji = Σ_{n=1}^{N} wn(i) I( Mi(xn) ≠ yn )
 Evaluate the error rate εi = Ji / Σn wn(i) (stop iterating if εi < threshold) and the base model Mi's vote
αi = (1/2) ln( (1 − εi) / εi )
 Update the weights:
wn(i+1) = wn(i) exp{ αi · I( Mi(xn) ≠ yn ) }
 The final model is given by weighted voting based on {αi}


71
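A minimal sketch of the AdaBoost loop above (the weight-based variant) with decision stumps from scikit-learn as base classifiers; mapping the labels to ±1 so the final vote is sign(Σ αi Mi(x)), along with the data and round count, are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, random_state=0)
y = np.where(y01 == 1, 1, -1)                # labels in {+1, -1}

N, k = len(X), 20
w = np.full(N, 1.0 / N)                      # initial weights w_n^(1) = 1/N
models, alphas = [], []
for i in range(k):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    miss = (pred != y)                       # I(M_i(x_n) != y_n)
    eps = np.sum(w * miss) / np.sum(w)       # weighted error rate epsilon_i
    if eps <= 0 or eps >= 0.5:               # degenerate round, stop early
        break
    alpha = 0.5 * np.log((1 - eps) / eps)    # the base model's vote alpha_i
    w = w * np.exp(alpha * miss)             # up-weight the misclassified tuples
    w = w / w.sum()
    models.append(stump)
    alphas.append(alpha)

final = np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
print("training accuracy:", np.mean(final == y))
```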
Gradient Boosting
 Operates on:
 A differentiable loss function
 A weak learner to make predictions (usually trees)
 An additive model to add weak learners to minimize the loss function
 Each iteration adds an additional weak learner to the previous model:
new model = previous model + new weak learner
 Scalable implementation: XGBoost

72
Random Forest: Basic Concepts
 Random Forest (first proposed by L. Breiman in 2001)
 Bagging with decision trees as base models
 Data bagging
 Use a subset of training data by sampling with replacement for each tree
 Feature bagging (an advantage of using decision trees as base models – more diversity):
 At each node, use a random selection of attributes as candidates and split by the best attribute among them

 During classification, each tree votes and the most popular class is returned

73
Random Forest
 Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
 Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each split, and
faster than typical bagging or boosting

74
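A minimal usage sketch of a random forest in scikit-learn; the max_features setting controls the per-node random attribute selection (in the spirit of Forest-RI) described above, and the synthetic data is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100,      # number of bagged trees
                            max_features="sqrt",   # attributes tried at each split
                            random_state=0).fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```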
Ensemble Methods Recap
 Random forest and XGBoost are the most commonly used algorithms for tabular data
 Pros
 Good performance for tabular data, requires no data scaling
 Can scale to large datasets
 Can handle missing data to some extent
 Cons
 Can overfit to training data if not tuned properly
 Lack of interpretability (compared to decision trees)

75
Classification of Class-Imbalanced Data Sets
 Traditional methods assume a balanced distribution of classes and equal error costs. But in real-world situations, we may face imbalanced data sets, which have rare positive examples but numerous negative ones.

 Medical diagnosis: Medical screening for a condition


is usually performed on a large population of people
without the condition, to detect a small minority
with it (e.g., HIV prevalence in the USA is ~0.4%)
 Fraud detection: About 2% of credit card accounts
are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
 Product defect, accident (oil-spill), disk drive failures, etc.
76
Classification of Class-Imbalanced Data Sets
 Typical methods on imbalanced data (Balance the training set)
 Oversampling: Oversample the minority class
 Under-sampling: Randomly eliminate tuples from the majority class
 Synthesizing: Synthesize new minority-class examples

77
Classification of Class-Imbalanced Data Sets
Typical methods on imbalanced data (At the algorithm level)
 Threshold-moving: Move the decision threshold, t, so that the rare class tuples
are easier to classify, and hence, less chance of costly false negative errors
 Class weight adjusting: Since false negative costs more than false positive, we
can give larger weight to false negative
 Ensemble techniques: Ensemble multiple classifiers introduced in the following
chapter

[Figure: moving the decision threshold t along the classifier's output scores]

78
Evaluate imbalanced data classifier
 Can we use accuracy to evaluate an imbalanced-data classifier?
 Accuracy simply counts the number of errors. If a data set has 2% positive labels and 98% negative labels, a classifier that maps all inputs to the negative class will get an accuracy of 98%!
 ROC Curve

79
Summary
 Classification: Model construction from a set of training data
 Effective and scalable methods
 Decision tree induction, Bayes classification methods, lazy learners, linear classifier

 No single method has been found to be superior over all others for all data sets

 Evaluation metrics: Accuracy, sensitivity, specificity, precision, recall, F measure

 Model evaluation: Holdout, cross-validation, bootstrapping, ROC curves (AUC)

 Techniques to improve classification accuracy: Ensemble Methods (bagging, boosting,


random forests), imbalanced classification

80
References (1)
 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
 J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001
 J. Pan and D. Manocha. Bi-level locality sensitive hashing for k-nearest neighbor
computation. IEEE ICDE 2012
81
References (2)
 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and
training time of thirty-three old and new classification algorithms. Machine Learning, 2000
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
Blackwell Business, 1994
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997
 S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,
Data Mining and Knowledge Discovery 2(4): 345-389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
 J. R. Quinlan. Bagging, boosting, and c4.5. AAAI’96
 E. Alpaydin. Introduction to Machine Learning (2nd ed.). MIT Press, 2011
82
References (3)
 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufman, 1991
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques,
2ed. Morgan Kaufmann, 2005
 T. Chen and C. Guestrin. Xgboost: A Scalable Tree Boosting System. ACM SIGKDD 2016
 C. Aggarwal. Data Classification. Morgan Springer, 2015
83
