KEMBAR78
Bayesian Classification, Nearest | PDF | Bayesian Inference | Statistical Classification
0% found this document useful (0 votes)
8 views46 pages

Bayesian Classification, Nearest

The document discusses classification methods in data mining, focusing on Bayesian classification and the Naïve Bayes classifier, which uses Bayes' theorem for probabilistic predictions. It explains how to estimate probabilities from data and highlights the advantages and disadvantages of Naïve Bayes, including its independence assumption. Additionally, it covers k-Nearest Neighbor classification, detailing its lazy learning approach and the importance of selecting an appropriate value for k.

Uploaded by

Sumit Chauhan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views46 pages

Bayesian Classification, Nearest

The document discusses classification methods in data mining, focusing on Bayesian classification and the Naïve Bayes classifier, which uses Bayes' theorem for probabilistic predictions. It explains how to estimate probabilities from data and highlights the advantages and disadvantages of Naïve Bayes, including its independence assumption. Additionally, it covers k-Nearest Neighbor classification, detailing its lazy learning approach and the importance of selecting an appropriate value for k.

Uploaded by

Sumit Chauhan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 46

Data Mining

Classification Methods
Bayesian Classification, Nearest
Neighbor, Ensemble Methods
Bayesian Classification: Why?

● A statistical classifier: performs probabilistic prediction,


i.e., predicts class membership probabilities
● Foundation: Based on Bayes’ Theorem.
● Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
● Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data

* Data Mining: Concepts and Techniques 2


Bayes’ Rule

Who is who in Bayes’ rule


Example of Bayes Theorem
● Given:
● A doctor knows that meningitis causes stiff neck 50% of the time
● Prior probability of any patient having meningitis is 1/50,000
● Prior probability of any patient having stiff neck is 1/20

● If a patient has stiff neck, what’s the probability


he/she has meningitis?
Choosing Hypotheses
● Maximum Likelihood
hypothesis:

● Generally we want the most


probable hypothesis given
training data.This is the
maximum a posteriori
hypothesis:
● Useful observation: it does not
depend on the denominator
P(d)
Bayesian Classifiers
● Consider each attribute and class label as random
variables

● Given a record with attributes (A1, A2,…,An)


● Goal is to predict class C
● Specifically, we want to find the value of C that maximizes
P(C| A1, A2,…,An )

● Can we estimate P(C| A1, A2,…,An ) directly from


data?
Bayesian Classifiers
● Approach:
● compute the posterior probability P(C | A1, A2, …, An) for all values
of C using the Bayes theorem

● Choose value of C that maximizes


P(C | A1, A2, …, An)

● Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

● How to estimate P(A1, A2, …, An | C )?


Naïve Bayes Classifier
● Assume independence among attributes Ai when class is
given:
● P(A , A , …, A |C) = P(A | C ) P(A | C )… P(A | C )
1 2 n 1 j 2 j n j
● Can estimate P(A | C ) for all A and C .
i j i j
● This is a simplifying assumption which may be violated in
reality
● The Bayesian classifier that uses the Naïve Bayes assumption
and computes the MAP hypothesis is called Naïve Bayes
classifier
How to Estimate Probabilities
from Data?
● Class: P(C) = Nc/N
● e.g., P(No) = 7/10,
P(Yes) = 3/10

● For discrete attributes:


P(Ai | Ck) = |Aik|/ Nc
k
● where |Aik| is number of
instances having attribute Ai and
belongs to class Ck
● Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
How to Estimate Probabilities
from Data?
● For continuous attributes:
● Discretize the range into bins
● one ordinal attribute per bin
● violates independence assumption
● Two-way split: (A < v) or (A > v)
● choose only one of the two splits as new attribute
● Probability density estimation:
● Assume attribute follows a normal distribution
● Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
● Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
How to Estimate Probabilities from
Data?
● Normal distribution:

● One for each (Ai,ci) pair

● For (Income, Class=No):


● If Class=No
● sample mean = 110

● sample variance = 2975


Naïve Bayesian Classifier:
Training Dataset

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

New Data:
X = (age <=30,
Income = medium,
Student = yes
Credit_rating = Fair)

* Data Mining: Concepts and Techniques 12


Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

First step: Compute P(C) The prior probability of each class can be
computed based on the training tuples:
P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

Second step: compute P(X|Ci)


P(X|buys_computer=yes)= P(age=youth|buys_computer=yes)x
P(income=medium|buys_computer=yes) x
P(student=yes|buys_computer=yes)x
P(credit_rating=fair|buys_computer=yes)
= 0.044
P(age=youth|buys_computer=yes)=0.222
P(income=medium|buys_computer=yes)=0.444
P(student=yes|buys_computer=yes)=6/9=0.667
P(credit_rating=fair|buys_computer=yes)=6/9=0.667
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

Second step: compute P(X|Ci)


P(X|buys_computer=no)= P(age=youth|buys_computer=no)x
P(income=medium|buys_computer=no) x
P(student=yes|buys_computer=no) x
P(credit_rating=fair|buys_computer=no)
= 0.019
P(age=youth|buys_computer=no)=3/5=0.666
P(income=medium|buys_computer=no)=2/5=0.400
P(student=yes|buys_computer=no)=1/5=0.200
P(credit_rating=fair|buys_computer=no)=2/5=0.400
Naïve Bayesian Classifier:
An Example

Given X (age=youth, income=medium, student=yes, credit=fair)


Maximize P(X|Ci)P(Ci), for i=1,2

We have computed in the first and second steps:


P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
P(X|buys_computer=yes)= 0.044
P(X|buys_computer=no)= 0.019

Third step: compute P(X|Ci)P(Ci) for each class


P(X|buys_computer=yes)P(buys_computer=yes)=0.044 x 0.643=0.028
P(X|buys_computer=no)P(buys_computer=no)=0.019 x 0.357=0.007
The naïve Bayesian Classifier predicts X belongs to class (“buys_computer =
yes”)
Example
Training set :
(Öğrenme Kümesi)

Given a Test Record:


Example of Naïve Bayes Classifier
Given a Test Record:

● P(X|Class=No) = P(Refund=No|Class=No)
× P(Married| Class=No)
× P(Income=120K| Class=No)
= 4/7 × 4/7 × 0.0072 = 0.0024

● P(X|Class=Yes) = P(Refund=No| Class=Yes)


× P(Married| Class=Yes)
× P(Income=120K| Class=Yes)
= 1 × 0 × 1.2 × 10-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes)


Therefore P(No|X) > P(Yes|X)
=> Class = No
Avoiding the 0-Probability Problem

● If one of the conditional probability is zero, then the


entire expression becomes zero
● Probability estimation:

c: number of classes
p: prior probability
m: parameter
Naïve Bayes (Summary)
● Advantage
● Robust to isolated noise points
● Handle missing values by ignoring the instance during probability
estimate calculations
● Robust to irrelevant attributes
● Disadvantage
● Assumption: class conditional independence, which may cause
loss of accuracy
● Independence assumption may not hold for some attribute.
Practically, dependencies exist among variables
● Use other techniques such as Bayesian Belief Networks (BBN)
Remember

● Bayes’ rule can be turned into a classifier


● Maximum A Posteriori (MAP) hypothesis estimation
incorporates prior knowledge; Max Likelihood (ML) doesn’t
● Naive Bayes Classifier is a simple but effective Bayesian
classifier for vector data (i.e. data with several attributes)
that assumes that attributes are independent given the
class.
● Bayesian classification is a generative approach to
classification
Classification Paradigms
● In fact, we can categorize three fundamental approaches
to classification:
● Generative models: Model p(x|Ck) and P(Ck) separately
and use the Bayes theorem to find the posterior
probabilities P(Ck|x)
● E.g. Naive Bayes, Gaussian Mixture Models, Hidden Markov
Models,…
● Discriminative models:
● Determine P(Ck|x) directly and use in decision
● E.g. Linear discriminant analysis, SVMs, NNs,…
● Find a discriminant function f that maps x onto a class
label directly without calculating probabilities

Slide from B.Yanik


Bayesian Belief Networks
● Bayesian belief network allows a subset of the variables to
be conditionally independent
● A graphical model of causal relationships (neden sonuç
ilişkilerini simgeleyen bir çizge tabanlı model)
● Represents dependency among the variables
● Gives a specification of joint probability distribution
❑ Nodes: random variables

X Y ❑ Links: dependency
❑ X and Y are the parents of Z, and Y is the
parent of P
Z
P ❑ No dependency between Z and P
❑ Has no loops or cycles

* Data Mining: Concepts and Techniques 23


Bayesian Belief Network: An Example

Family Smoke The conditional probability table


History r (CPT) for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)

LC 0.8 0.5 0.7 0.1


LungC Emphy
~LC 0.2 0.5 0.3 0.9
ancer sema

CPT shows the conditional probability for


each possible combination of its parents
Derivation of the probability of a
Positiv Dyspne particular combination of values of X,
eXRay a from CPT:

Bayesian Belief Networks

* Data Mining: Concepts and Techniques 24


Training Bayesian Networks
● Several scenarios:
● Given both the network structure and all variables observable:
learn only the CPTs
● Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
● Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
● Unknown structure, all hidden variables: No good algorithms
known for this purpose

● Ref. D. Heckerman: Bayesian networks for data mining

* Data Mining: Concepts and Techniques 25


Lazy Learners
● The classification algorithms presented before are eager
learners
● Construct a model before receiving new tuples to classify
● Learned models are ready and eager to classify previously
unseen tuples
● Lazy learners
● The learner waits till the last minute before doing any model
construction
● In order to classify a given test tuple
● Store training tuples
● Wait for test tuples
● Perform generalization based on similarity between test and the
stored training tuples
Lazy vs Eager
Eager Learners Lazy Learners

• Do lot of work on training data • Do less work on training data

• Do less work when test tuples are • Do more work when test tuples are
presented presented
Basic k-Nearest Neighbor Classification
● Given training data
● Define a distance metric between points in input space
D(x1,xi)
● E.g., Eucledian distance, Weighted Eucledian, Mahalanobis
distance, TFIDF, etc.

● Training method:
● Save the training examples
● At prediction time:
● Find the k training examples (x1,y1),…(xk,yk) that are closest
to the test example x given the distance D(x1,xi)
● Predict the most frequent class among those yi’s.
Nearest-Neighbor Classifiers
● Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

● To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
K-Nearest Neighbor Model

● Classification:

● Regression:
K-Nearest Neighbor Model: Weighted
by Distance

● Classification:

● Regression:

31
Definition of Nearest Neighbor

K-nearest neighbors of a record x are data points that


have the k smallest distance to x
The decision boundary implemented
by 3NN
The boundary is always the perpendicular bisector
of the line between two points (Voronoi tessellation)

Slide by Hinton
Nearest Neighbor Classification…

● Choosing the value of k:


● If k is too small, sensitive to noise points
● If k is too large, neighborhood may include points from other
classes
Determining the value of k

● In typical applications k is in units or tens rather than in


hundreds or thousands
● Higher values of k provide smoothing that reduces the
risk of overfitting due to noise in the training data
● Value of k can be chosen based on error rate measures
● We should also avoid over-smoothing by choosing k=n,
where n is the total number of tuples in the training data
set
Determining the value of k
● Given training examples
● Use N fold cross validation
● Search over K = (1,2,3,…,Kmax). Choose search size Kmax
based on compute constraints
● Calculated the average error for each K:
● Calculate predicted class for each training point

(using all other points to build the model)


● Average over all training examples
● Pick K to minimize the cross validation error
Example

Example from J. Gamper


Choosing k

Slide from J. Gamper


Nearest neighbor Classification…
● k-NN classifiers are lazy learners
● It does not build models explicitly
● Unlike eager learners such as decision tree induction and
rule-based systems
● Adv: No training time
● Disadv:
● Testing time can be long, classifying unknown records are
relatively expensive
● Curse of Dimensionality : Can be easily fooled in high
dimensional spaces
● Dimensionality reduction techniques are often used
Ensemble Methods

● One of the eager methods => builds model over


the training set

● Construct a set of classifiers from the training


data

● Predict class label of previously unseen records


by aggregating predictions made by multiple
classifiers
General Idea
Why does it work?

● Suppose there are 25 base classifiers


● Each classifier has error rate, ε = 0.35
● Assume classifiers are independent
● Probability that the ensemble classifier makes a wrong
prediction:
Examples of Ensemble Methods

● How to generate an ensemble of classifiers?


● Bagging

● Boosting

● Random Forests
Bagging: Bootstrap AGGregatING
● Bootstrap: data resampling
● Generate multiple training sets
● Resample the original training data
● With replacement

● Data sets have different “specious” patterns


● Sampling with replacement
● Each sample has probability (1 – 1/n)n of being selected

● Build classifier on each bootstrap sample


● Specious patterns will not correlate
● Underlying true pattern will be common to many
● Combine the classifiers: Label new test examples by a majority vote
among classifiers
Boosting

● An iterative procedure to adaptively change


distribution of training data by focusing more on
previously misclassified records
● Initially, all N records are assigned equal weights
● Unlike bagging, weights may change at the end of
boosting round
● The final classifier is the weighted combination of
the weak classifiers.
Boosting

● Records that are wrongly classified will have their


weights increased
● Records that are classified correctly will have
their weights decreased

• Example 4 is hard to classify


• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds

You might also like