Classification Algorithms
Linear Regression
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary
Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
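A minimal sketch of the contrast, assuming Python with scikit-learn (the toy data and model choices are illustrative, not from the slides):

```python
# Supervised vs. unsupervised learning in a few lines (illustrative, scikit-learn).
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]

# Supervised: class labels accompany the training data.
DecisionTreeClassifier().fit(X, ["a", "a", "b", "b"])

# Unsupervised: no labels; the algorithm looks for clusters on its own.
print(KMeans(n_clusters=2, n_init=10).fit(X).labels_)  # e.g. [0 0 1 1], arbitrary ids
```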
Prediction Problems: Classification vs. Numeric Prediction
Classification
  predicts categorical class labels (discrete or nominal)
  constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Numeric prediction
  models continuous-valued functions, i.e., predicts unknown or missing numeric values
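The same model family can serve either task; a small illustrative sketch (assuming scikit-learn, with made-up toy values):

```python
# Classification (categorical labels) vs. numeric prediction (continuous values).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]
clf = DecisionTreeClassifier().fit(X, ["no", "no", "yes", "yes"])  # class labels
reg = DecisionTreeRegressor().fit(X, [1.0, 1.5, 3.2, 4.8])         # numeric targets
print(clf.predict([[2]]), reg.predict([[2]]))                      # ['yes'] [3.2]
```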
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the class predicted by the model
    The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic (overfitting)
  If the accuracy is acceptable, use the model to classify new data
Note: if the test set is also used to select models, it is called a validation (test) set
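A minimal sketch of the two-step process, assuming Python with scikit-learn; the dataset and split are illustrative, not from the slides:

```python
# Minimal sketch of the two-step classification process (illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then (if acceptable) classify new, unseen data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```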
Process (1): Model Construction
The training data is fed to a classification algorithm, which produces the classifier (model).

Training data:
NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model), expressed as a rule:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
The classifier is first evaluated on testing data, and then applied to unseen data.

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
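A small sketch of this step, assuming Python; the rule and tuples come from the slides above, while the helper name `tenured_rule` is purely illustrative:

```python
# Apply the learned rule (IF rank = 'professor' OR years > 6 THEN tenured = 'yes')
# to the testing data and to the unseen tuple. Helper name is illustrative.
def tenured_rule(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]

# Estimate accuracy on the test set (Merlisa is misclassified, so 3/4 here).
correct = sum(tenured_rule(r, y) == label for _, r, y, label in testing)
print("test accuracy:", correct / len(testing))

# Classify the unseen tuple.
print("Jeff ->", tenured_rule("Professor", 4))   # 'yes'
```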
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary
Decision Tree Induction: An Example
Training data set: buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm); a minimal sketch follows below
  The tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  There are no samples left
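A minimal sketch of this greedy, recursive procedure in plain Python, assuming categorical attributes and using the information-gain measure defined on the next slides (function and variable names are illustrative, not from any particular library):

```python
# Greedy top-down decision tree induction for categorical attributes (sketch).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from splitting on `attr`.
    total, n, split = entropy(labels), len(rows), 0.0
    for value in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == value]
        split += len(sub) / n * entropy(sub)
    return total - split

def build_tree(rows, labels, attrs):
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[value] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best])
    return (best, tree)

# Example call: rows are dicts keyed by attribute name, e.g.
#   build_tree([{"age": "<=30", "student": "no"}, ...], ["no", ...], ["age", "student"])
```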
Brief Review of Entropy
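As a refresher (a standard definition, consistent with the Info(D) formula on the next slide; the original slide's plot is not reproduced here):

```latex
% Entropy of a discrete distribution over m outcomes with probabilities p_i:
H = -\sum_{i=1}^{m} p_i \log_2 p_i
% For two classes, H reaches its maximum of 1 bit at p_1 = p_2 = 0.5
% and drops to 0 when either class has probability 1.
```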
Attribute Selection Measure: Information Gain (ID3/C4.5)
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
(Training data: the buys_computer table shown earlier.)

  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

  Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Here \frac{5}{14} I(2,3) means that “age <= 30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s. Hence

  Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

Since age has the highest information gain, it is chosen as the splitting attribute at the root; see the verification sketch below.
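A quick numerical check of these gains, assuming Python (the same entropy/information-gain logic sketched earlier; the data is copied from the table above):

```python
# Verify Gain(age), Gain(income), Gain(student), Gain(credit_rating)
# on the buys_computer data (values copied from the table above).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

rows = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
labels = [r[-1] for r in rows]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

info_D = entropy(labels)                       # 0.940
for name, col in attrs.items():
    info_A = sum(
        len(sub) / len(rows) * entropy([r[-1] for r in sub])
        for v in set(r[col] for r in rows)
        for sub in [[r for r in rows if r[col] == v]]
    )
    print(f"Gain({name}) = {info_D - info_A:.3f}")
# Slide values: 0.246, 0.029, 0.151, 0.048 (third-decimal differences are rounding).
```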
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
  Too many branches, some of which may reflect anomalies due to noise or outliers
  Poor accuracy on unseen samples
Two approaches to avoid overfitting (a pruning sketch follows below)
  Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold
    It is difficult to choose an appropriate threshold
  Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees
    Use a set of data different from the training data to decide which is the “best pruned tree”
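One way to realize postpruning in practice is scikit-learn's cost-complexity pruning; this is a specific pruning strategy chosen for illustration, not necessarily the one the slides have in mind. A held-out validation set picks the "best pruned tree":

```python
# Postpruning sketch: grow a full tree, generate a sequence of pruned trees
# via cost-complexity pruning, and keep the one that scores best on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths derived from the fully grown tree.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train).ccp_alphas

# Progressively pruned trees; select by validation accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val))
print("validation accuracy:", best.score(X_val, y_val), "leaves:", best.get_n_leaves())
```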
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes’ theorem
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
Bayes’ Theorem: Basics
Total probability theorem:
  P(B) = \sum_{i=1}^{M} P(B \mid A_i) P(A_i)
Bayes’ theorem:
  P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} = P(X \mid H) \times P(H) / P(X)
Let X be a data sample (“evidence”); its class label is unknown
Let H be the hypothesis that X belongs to class C
Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
  E.g., X will buy a computer, regardless of age, income, …
P(X) (evidence): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
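A tiny numeric illustration of the theorem; the probabilities below are made up for illustration and are not from the slides:

```python
# Bayes' theorem with illustrative (made-up) numbers:
# H = "customer buys a computer", X = "customer is 31..40 with medium income".
p_h = 0.6          # prior P(H)
p_x_given_h = 0.3  # likelihood P(X|H)
p_x = 0.2          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)                       # 0.9
```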
Prediction Based on Bayes’ Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
  P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} = P(X \mid H) \times P(H) / P(X)
Informally, this can be viewed as
  posterior = likelihood × prior / evidence
Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
Practical difficulty: this requires initial knowledge of many probabilities, which involves significant computational cost
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm
Classification is to derive the maximum posteriori probability, i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem:
  P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
Since P(X) is constant for all classes, only
  P(X \mid C_i)\, P(C_i)
needs to be maximized
Naïve Bayes Classifier: Training Dataset
Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example
P(Ci):
  P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
  P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
  P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
  P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
Therefore, X belongs to class “buys_computer = yes” (see the verification sketch below)
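A short sketch that recomputes these numbers directly from the table, assuming Python (so the counts above can be checked):

```python
# Recompute the naive Bayes scores P(X|Ci) * P(Ci) for
# X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the buys_computer table above.
rows = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")            # the tuple to classify

for c in ("yes", "no"):
    in_class = [r for r in rows if r[-1] == c]
    prior = len(in_class) / len(rows)            # P(Ci)
    likelihood = 1.0
    for j, value in enumerate(x):                # naive (conditional independence) assumption
        likelihood *= sum(r[j] == value for r in in_class) / len(in_class)
    print(c, round(prior * likelihood, 3))       # yes: 0.028, no: 0.007
```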
Comparison of Decision Tree and Bayesian Classification