Classification: Basic Concepts -
Bayes Classification Methods
Session 07 and 08
Learning Outcomes
• LO 3 : Design data warehouse and data mining model.
• LO 4 : Analyse the implementation of data warehouse
and data mining techniques that are appropriate to the
need.
Acknowledgments
These slides have been adapted from Han,
J., Kamber, M., & Pei, J. (2012). Data
Mining: Concepts and Techniques. 3rd
edition. Morgan Kaufmann. San Francisco.
Chapter 9
Bina Nusantara
Topic
• Bayes Theorem
• Naïve Bayesian Classification
• Model Evaluation and Selection
Bayes Theorem
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, the naïve Bayesian
classifier, performs comparably to decision tree
and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayes Classifier
• A probabilistic framework for solving classification
problems
• Conditional probability:
$P(C \mid A) = \frac{P(A, C)}{P(A)}$
$P(A \mid C) = \frac{P(A, C)}{P(C)}$
• Bayes theorem:
$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (posterior probability), the
probability that the hypothesis holds given the observed data
sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood), the probability of observing the sample X,
given that the hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayesian Theorem
• Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows Bayes' theorem:
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
• Informally, this can be written as
posterior = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has
meningitis?
$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
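The meningitis calculation can be checked with a few lines of Python. This is only a sketch: the helper name `posterior` is an assumption made for illustration, and exact fractions are used so that no rounding error creeps in.

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence

p_s_given_m = Fraction(1, 2)    # meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50000)        # prior probability of meningitis
p_s = Fraction(1, 20)           # prior probability of stiff neck

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
print(p_m_given_s, float(p_m_given_s))  # 1/5000 0.0002
```

Even though a stiff neck is a common symptom of meningitis, the posterior stays small because the prior P(M) is tiny.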
Bayesian Classifiers
• Consider each attribute and class label as random
variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
$P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C)\,P(C)}{P(A_1 A_2 \ldots A_n)}$
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
Naïve Bayesian Classification
Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works
as follows:
1. Let D be a training set of tuples and their associated class
labels. As usual, each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2, …, xn), depicting n measurements
made on the tuple from n attributes, respectively, A1, A2, …,
An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a
tuple, X, the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X. That
is, the naïve Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
Naïve Bayesian Classification
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X)
is maximized is called the maximum posteriori
hypothesis. By Bayes' theorem,
$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$
Naïve Bayesian Classifier: Training Dataset
Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
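The worked example above can be reproduced by counting directly over the 14 training tuples. The following is a minimal sketch, not a production implementation: the function names `train` and `score` are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def train(rows):
    """Count class frequencies and (attribute, value, class) frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = Counter()
    for row in rows:
        for k, value in enumerate(row[:-1]):
            cond_counts[(k, value, row[-1])] += 1
    return class_counts, cond_counts

def score(x, label, class_counts, cond_counts):
    """Return P(X|Ci) * P(Ci), assuming class-conditional independence."""
    result = class_counts[label] / sum(class_counts.values())
    for k, value in enumerate(x):
        result *= cond_counts[(k, value, label)] / class_counts[label]
    return result

class_counts, cond_counts = train(DATA)
x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c, class_counts, cond_counts) for c in class_counts}
prediction = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, prediction)
```

P(X) is never computed because it is the same for both classes and does not change which score is larger.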
Avoiding the 0-Probability Problem
• Naïve Bayesian prediction requires each conditional prob.
be non-zero. Otherwise, the predicted prob. will be zero
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
• Example: suppose a dataset has 1000 tuples, with income = low (0 tuples),
income = medium (990 tuples), and income = high (10 tuples)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their
“uncorrected” counterparts
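The Laplacian correction for the income example can be sketched as follows. The function name `laplace_estimates` is an assumption for illustration; it adds 1 to every count and enlarges the denominator by the number of distinct values so that the estimates still sum to 1.

```python
from fractions import Fraction

def laplace_estimates(counts):
    """Add-one (Laplacian) correction over a dict of value -> count."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(n + 1, total) for value, n in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
corrected = laplace_estimates(income_counts)
for value, p in corrected.items():
    print(value, p, round(float(p), 4))
```

The zero count for income = low becomes 1/1003, so a product of conditional probabilities that includes it is no longer forced to zero.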
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore
loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospital patients: Profile (age, family history, etc.),
Symptoms (fever, cough, etc.), Disease (lung cancer,
diabetes, etc.)
• Dependencies among these cannot be modeled by
Naïve Bayesian Classifier
• How to deal with these dependencies? Bayesian Belief
Networks (Chapter 9)
Play-tennis example: estimating P(xi|C)

Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N

P(p) = 9/14             P(n) = 5/14

outlook
P(sunny|p) = 2/9        P(sunny|n) = 3/5
P(overcast|p) = 4/9     P(overcast|n) = 0
P(rain|p) = 3/9         P(rain|n) = 2/5
temperature
P(hot|p) = 2/9          P(hot|n) = 2/5
P(mild|p) = 4/9         P(mild|n) = 2/5
P(cool|p) = 3/9         P(cool|n) = 1/5
humidity
P(high|p) = 3/9         P(high|n) = 4/5
P(normal|p) = 6/9       P(normal|n) = 1/5
windy
P(true|p) = 3/9         P(true|n) = 3/5
P(false|p) = 6/9        P(false|n) = 2/5
December 17, 2019 Data Mining: Concepts and Techniques 21
Likelihood Table
Example:
The posterior probability can be calculated by first constructing a frequency
table for each attribute against the target, then transforming the frequency
tables into likelihood tables, and finally using the naïve Bayesian equation to
calculate the posterior probability for each class. The class with the highest
posterior probability is the outcome of the prediction.
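As an illustration of the first two steps, the sketch below builds the frequency table and then the likelihood table for a single attribute, outlook, using the 14 play-tennis records. The variable names are assumptions chosen for this example.

```python
from collections import Counter
from fractions import Fraction

# (outlook, class) pairs from the play-tennis table; P = play, N = don't play
OUTLOOK = [
    ("sunny", "N"), ("sunny", "N"), ("overcast", "P"), ("rain", "P"),
    ("rain", "P"), ("rain", "N"), ("overcast", "P"), ("sunny", "N"),
    ("sunny", "P"), ("rain", "P"), ("sunny", "P"), ("overcast", "P"),
    ("overcast", "P"), ("rain", "N"),
]

# Step 1: frequency table, i.e. how often each value occurs with each class
frequency = Counter(OUTLOOK)
class_totals = Counter(label for _, label in OUTLOOK)

# Step 2: likelihood table, i.e. each frequency divided by its class total
likelihood = {
    (value, label): Fraction(frequency[(value, label)], class_totals[label])
    for value in ("sunny", "overcast", "rain")
    for label in ("P", "N")
}

for (value, label), p in likelihood.items():
    print(f"P({value}|{label}) = {p}")
```

Fractions are printed in lowest terms, so 3/9 appears as 1/3.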
Play-tennis example: classifying
• An unseen sample X = <rain, hot, high, false>
• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286
• Sample X is classified in class n (don’t play)
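The classification of the unseen sample can be verified with exact fractions. This sketch copies only the probabilities needed for X = <rain, hot, high, false>; the dictionary layout is an assumption made for illustration.

```python
from fractions import Fraction as F
from math import prod

prior = {"p": F(9, 14), "n": F(5, 14)}
likelihood = {
    "p": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}

x = ["rain", "hot", "high", "false"]

# P(X|c) * P(c) for each class, assuming class-conditional independence
scores = {c: prior[c] * prod(likelihood[c][v] for v in x) for c in prior}
prediction = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, round(float(s), 6))
print("predicted class:", prediction)
```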
Example of Naïve Bayes Classifier

Tid   Refund          Marital Status   Taxable Income   Evade
      (categorical)   (categorical)    (continuous)     (class)
1     Yes             Single           125K             No
2     No              Married          100K             No
3     No              Single           70K              No
4     Yes             Married          120K             No
5     No              Divorced         95K              Yes
6     No              Married          60K              No
7     Yes             Divorced         220K             No
8     No              Single           85K              Yes
9     No              Married          75K              No
10    No              Single           90K              Yes

• Class: P(C) = Nc/N
– e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik| / Nck
– where |Aik| is the number of instances having attribute
Ai and belonging to class Ck
– Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Married, Income = 120K)
Naïve Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/3
P(Marital Status=Divorced|Yes) = 1/3
P(Marital Status=Married|Yes) = 0
For taxable income (modeled with a normal density per class):
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25

• P(X|Class=No) = P(Refund=No|Class=No)
× P(Married|Class=No)
× P(Income=120K|Class=No)
= 4/7 × 4/7 × 0.0072 = 0.0024
• P(X|Class=Yes) = P(Refund=No|Class=Yes)
× P(Married|Class=Yes)
× P(Income=120K|Class=Yes)
= 1 × 0 × 1.2 × 10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes),
P(No|X) > P(Yes|X)
=> Class = No
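The income terms 0.0072 and 1.2 × 10^-9 are consistent with evaluating a normal density at 120 using each class's sample mean and sample variance. The sketch below assumes that model; the helper `gaussian_pdf` is a name chosen for illustration. With the unrounded density, P(X|No) comes out near 0.00235, which the slide reports as 0.0024 after rounding the density to 0.0072 first.

```python
import math
import statistics

# Taxable income (in thousands) per class, taken from the training table
income = {"No": [125, 100, 70, 120, 60, 220, 75], "Yes": [95, 85, 90]}

def gaussian_pdf(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

density = {}
for label, values in income.items():
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance (divides by n - 1)
    density[label] = gaussian_pdf(120, mean, variance)

# Test record: Refund = No, Marital Status = Married, Income = 120K
p_x_given_no = (4 / 7) * (4 / 7) * density["No"]
p_x_given_yes = 1 * 0 * density["Yes"]

score_no = p_x_given_no * (7 / 10)    # times P(No)
score_yes = p_x_given_yes * (3 / 10)  # times P(Yes)
print(density, score_no, score_yes)
print("Class =", "No" if score_no > score_yes else "Yes")
```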
Model Evaluation and Selection
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among
competing models?
Metrics for Performance
Evaluation
• Focus on the predictive capability of a model
– Rather than how long it takes to classify or build
models, scalability, etc.
• Confusion Matrix:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance
Evaluation…
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)
• Most widely-used metric:
$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Metrics for Evaluating Classifier
Performance
• Measures for assessing how good or how “accurate” your
classifier is at predicting the class label of tuples.
Metrics for Evaluating Classifier
Performance
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by the
classifier
• Sensitivity is also referred to as the true positive (recognition)
rate (i.e., the proportion of positive tuples that are correctly
identified)
• Specificity is the true negative rate (i.e., the proportion of
negative tuples that are correctly identified)
• Precision can be thought of as a measure of exactness (i.e.,
what percentage of tuples labeled as positive are actually such).
• Recall is a measure of completeness (what percentage of
positive tuples are labeled as such). If recall seems familiar,
that’s because it is the same as sensitivity (or the true positive
rate).
Metrics for Evaluating Classifier
Performance
• True positives (TP) :
These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
• True negatives (TN) :
These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
• False positives (FP) :
These are the negative tuples that were incorrectly labeled as positive
(e.g., tuples of class buys_computer = no for which the classifier
predicted buys_computer = yes). Let FP be the number of false
positives.
• False negatives (FN) :
These are the positive tuples that were mislabeled as negative (e.g.,
tuples of class buys_computer = yes for which the classifier predicted
buys_computer = no). Let FN be the number of false negatives.
Confusion Matrix – Example (1)
Results from a classification algorithm:

Vertex ID   Actual Class   Predicted Class
1           +              +
2           +              +
3           +              +
4           +              +
5           +              -
6           -              +
7           -              +
8           -              -

Corresponding confusion matrix for the given table:

                  Predicted Class
                  +       -
Actual Class  +   4       1       C=5
              -   2       1       D=3
                  A=6     B=2     T=8

• True positive = 4
• False positive = 2
• True negative = 1
• False negative = 1
Confusion Matrix Performance Metrics –
Example (1)
1. Accuracy is proportion of correct predictions
2. Error rate is proportion of incorrect predictions
3. Recall is the proportion of “+” data points predicted as “+”
4. Precision is the proportion of data points predicted as “+” that are truly “+”
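The four measures can be computed for Example (1) straight from the eight actual and predicted labels. This is a small sketch; the variable names are assumptions for illustration, and the counts are derived from the labels rather than typed in.

```python
# Actual and predicted classes for vertices 1..8 in Example (1)
actual    = ["+", "+", "+", "+", "+", "-", "-", "-"]
predicted = ["+", "+", "+", "+", "-", "+", "+", "-"]

pairs = list(zip(actual, predicted))
tp = pairs.count(("+", "+"))  # positive tuples labeled positive
fn = pairs.count(("+", "-"))  # positive tuples labeled negative
fp = pairs.count(("-", "+"))  # negative tuples labeled positive
tn = pairs.count(("-", "-"))  # negative tuples labeled negative
total = len(pairs)

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(tp, fn, fp, tn)
print(accuracy, error_rate, recall, round(precision, 3))
```

Accuracy and error rate always sum to 1, while recall and precision can differ substantially when false positives and false negatives are unbalanced.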
Metrics for Evaluating Classifier
Performance – Example (2)
• The confusion matrix is a useful tool for analyzing how
well your classifier can recognize tuples of different
classes.
Macro-averaged measure and Micro-averaged measure
• When multiple class labels are to be retrieved, averaging
the evaluation measures can give a view on the general
results. There are two names to refer to averaged results:
micro-averaged and macro-averaged results.
http://www.clips.uantwerpen.be/~vincent/pdf/microaverage.pdf
Macro-averaged measure
• L = {λj : j = 1...q} is the set of all labels. [. . . ]
• Consider a binary evaluation measure B(tp, tn, fp, fn) that
is calculated based on the number of true positives (tp),
true negatives (tn), false positives (fp) and false negatives
(fn).
• Let tpλ, fpλ, tnλ and fnλ be the number of true positives,
false positives, true negatives and false negatives after
binary evaluation for a label λ. [. . . ]
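With this notation, the usual definitions are: the macro-average applies B to each label's counts and averages the q results, while the micro-average sums the counts over all labels first and applies B once. The sketch below uses precision as B and two labels with hypothetical counts; none of the numbers are taken from a real dataset.

```python
def precision(tp, tn, fp, fn):
    """Binary measure B(tp, tn, fp, fn); precision ignores tn and fn."""
    return tp / (tp + fp)

# Hypothetical per-label counts, for illustration only
counts = {
    "label_1": {"tp": 10, "tn": 170, "fp": 10, "fn": 10},
    "label_2": {"tp": 90, "tn": 80, "fp": 10, "fn": 20},
}

def macro_average(measure, counts):
    """Average of the measure computed separately for each label."""
    return sum(measure(**c) for c in counts.values()) / len(counts)

def micro_average(measure, counts):
    """Measure computed once on the counts summed over all labels."""
    pooled = {key: sum(c[key] for c in counts.values())
              for key in ("tp", "tn", "fp", "fn")}
    return measure(**pooled)

macro = macro_average(precision, counts)
micro = micro_average(precision, counts)
print(round(macro, 2), round(micro, 2))
```

The macro-average weights every label equally, so the small label pulls it down; the micro-average is dominated by the label with more predictions.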
Macro-averaged measure
• Micro averaged measure = (100) / (100 + 20) = 0.83
Micro-averaged measure
Holdout Method and
Random Subsampling
• The holdout method is what we have alluded to so far in our discussions about accuracy.
– In this method, the given data are randomly partitioned into two independent sets, a training set
and a test set.
– Typically, two-thirds of the data are allocated to the training set, and the remaining
one-third is allocated to the test set.
– The training set is used to derive the model.
– The model’s accuracy is then estimated with the test set. The estimate is pessimistic because
only a portion of the initial data is used to derive the model.
• Random subsampling is a variation of the holdout method in which the holdout method is repeated k
times. The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
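Both procedures can be sketched with the standard library alone. Everything here is an assumption made for illustration: the 30 toy records, the helper names, and the majority-class "model", which merely stands in for whatever classifier is being evaluated.

```python
import random
from collections import Counter

def holdout_split(data, rng, test_fraction=1 / 3):
    """Randomly partition data into (training set, test set)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def majority_classifier(train):
    """Stand-in model: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)

def random_subsampling(data, k, seed=0):
    """Repeat the holdout method k times and average the accuracies."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(data, rng)
        accuracies.append(accuracy(majority_classifier(train), test))
    return sum(accuracies) / k

# Hypothetical records: (record id, class label)
data = [(i, "yes") for i in range(20)] + [(i, "no") for i in range(20, 30)]

train, test = holdout_split(data, random.Random(0))
print(len(train), len(test))
print(random_subsampling(data, k=10))
```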
Cross-Validation
• In k-fold cross-validation, the initial data are randomly
partitioned into k mutually exclusive subsets or “folds,”
D1, D2, …, Dk, each of approximately equal size.
• Training and testing are performed k times. In iteration i,
partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model.
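The partitioning step can be sketched on record indices, leaving the choice of model open. The function names and the values n = 14 and k = 5 are assumptions for illustration; each iteration yields the training indices and the test indices for fold Di.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding by k keeps fold sizes within one element of each other
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n, k, seed=0):
    """Yield (train, test) index lists; fold i is the test set in iteration i."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(cross_validation_splits(n=14, k=5))
for train, test in splits:
    print(len(train), len(test), sorted(test))
```

Every record is used for testing exactly once and for training k - 1 times, which is what distinguishes this from repeated holdout.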
Cross-validation
Cross-validation, also called rotation estimation, is a way
to analyze how a predictive data mining model will perform
on an unknown dataset, i.e., how well the model generalizes.
Strategy:
1. Divide up the dataset into two non-overlapping subsets
2. One subset is called the “test” and the other the “training”
3. Build the model using the “training” dataset
4. Obtain predictions of the “test” set
5. Utilize the “test” set predictions to calculate all the
performance metrics
Typically cross-validation is performed for multiple iterations,
selecting a different non-overlapping test and training set each time
Types of Cross-validation
• hold-out: Random 1/3rd of the data is used as
test and remaining 2/3rd as training
• k-fold: Divide the data into k partitions, use one
partition as test and remaining k-1 partitions for
training
• Leave-one-out: Special case of k-fold, where k equals the
number of data points
Note: Selection of data points is typically done in stratified manner, i.e.,
the class distribution in the test set is similar to the training set
Other Resources
• Performance Metrics for Graph Mining Tasks, Oak Ridge
National Laboratory
• http://www.saedsayad.com/naive_bayesian.htm
Example
• https://www.geeksforgeeks.org/naive-bayes-classifiers/
• https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
• https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/