Bayes Classifiers
We are about to see some of the mathematical formalisms and examples, but keep in mind the basic idea: find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.
• Bayesian classifiers use Bayes' theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)
• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
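Concretely, picking the most probable class only needs the numerators. Here is a minimal Python sketch; the function name, variable names, and the numbers are hypothetical, chosen just to show the idea:

# Compare p(d | cj) * p(cj) across classes; p(d) is a shared
# denominator, so it can be ignored when picking the winner.
def most_probable_class(likelihoods, priors):
    # likelihoods: {class: p(d | class)}, priors: {class: p(class)}
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# Hypothetical numbers, just to show the call:
print(most_probable_class({"c1": 0.30, "c2": 0.10},
                          {"c1": 0.40, "c2": 0.60}))  # -> c1 (0.12 vs 0.06)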
Assume that we have two classes: c1 = male and c2 = female. (Note: "Drew" can be a male or a female name; think of Drew Carey and Drew Barrymore.)

We have a person whose gender we do not know, say "drew" or d. Classifying drew as male or female is equivalent to asking which is more probable, i.e. which is greater: p(male | drew) or p(female | drew).

p(male | drew) = p(drew | male) p(male) / p(drew)

Here p(drew | male) is the probability of being called "drew" given that you are a male, and p(male) is the probability of being a male. p(drew), the probability of being named "drew", is actually irrelevant, since it is the same for all classes.
This is Officer Drew. Is Officer Drew a male or a female? Luckily, we have a small database with names and gender. We can use it to apply Bayes' rule:

p(cj | d) = p(d | cj) p(cj) / p(d)

Name     Gender
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male
Among the three males, one is named Drew, so p(drew | male) = 1/3 and p(male) = 3/8; among the five females, two are named Drew, so p(drew | female) = 2/5 and p(female) = 5/8. Since p(drew) = 3/8 is the same for both classes, we only need to compare the numerators:

p(male | drew):   p(drew | male) p(male) = 1/3 * 3/8 = 0.125
p(female | drew): p(drew | female) p(female) = 2/5 * 5/8 = 0.250

Officer Drew is more likely to be a female. Officer Drew IS a female!
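Here is a small Python sketch of the same computation, estimating each probability by counting in the eight-row table above (the variable names are my own):

# The name/gender database from the table above.
data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
        ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
        ("Nina", "Female"), ("Sergio", "Male")]

def posterior_numerator(name, gender):
    # p(name | gender) * p(gender); p(name) is the same for both
    # classes, so it is left out of the comparison.
    in_class = [n for n, g in data if g == gender]
    p_name_given_class = sum(n == name for n in in_class) / len(in_class)
    p_class = len(in_class) / len(data)
    return p_name_given_class * p_class

print(posterior_numerator("Drew", "Male"))    # 1/3 * 3/8 = 0.125
print(posterior_numerator("Drew", "Female"))  # 2/5 * 5/8 = 0.250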
So far we have only considered Bayes classification when we have one attribute (the "name"). But we may have many features. How do we use all the features?

p(cj | d) = p(d | cj) p(cj) / p(d)
Name     Over 170 cm  Eye    Hair length  Gender
Drew     No           Blue   Short        Male
Claudia  Yes          Brown  Long         Female
Drew     No           Blue   Long         Female
Drew     No           Blue   Long         Female
Alberto  Yes          Brown  Short        Male
Karin    No           Blue   Long         Female
Nina     Yes          Brown  Short        Female
Sergio   Yes          Blue   Long         Male
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)

That is, the probability of class cj generating instance d equals the probability of class cj generating the observed value for feature 1, multiplied by the probability of class cj generating the observed value for feature 2, and so on.
Officer Drew is blue-eyed, over 170 cm tall, and has long hair.

p(officer drew | cj) = p(over_170cm = yes | cj) * p(eye = blue | cj) * …

p(officer drew | Female) = 2/5 * 3/5 * …
p(officer drew | Male) = 2/3 * 2/3 * …
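A sketch of this product in Python, with each conditional probability estimated by counting in the feature table above. The names are mine, and the third factor (hair length) simply continues the "…" in the slide, computed from the same table:

# Rows: (name, over_170cm, eye, hair, gender), from the table above.
rows = [
    ("Drew",    "No",  "Blue",  "Short", "Male"),
    ("Claudia", "Yes", "Brown", "Long",  "Female"),
    ("Drew",    "No",  "Blue",  "Long",  "Female"),
    ("Drew",    "No",  "Blue",  "Long",  "Female"),
    ("Alberto", "Yes", "Brown", "Short", "Male"),
    ("Karin",   "No",  "Blue",  "Long",  "Female"),
    ("Nina",    "Yes", "Brown", "Short", "Female"),
    ("Sergio",  "Yes", "Blue",  "Long",  "Male"),
]
COLS = {"over_170cm": 1, "eye": 2, "hair": 3}  # column positions

def naive_likelihood(observed, gender):
    # p(d | cj) estimated as the product of p(di | cj) over features,
    # under the independence assumption.
    in_class = [r for r in rows if r[4] == gender]
    p = 1.0
    for feature, value in observed.items():
        i = COLS[feature]
        p *= sum(r[i] == value for r in in_class) / len(in_class)
    return p

officer_drew = {"over_170cm": "Yes", "eye": "Blue", "hair": "Long"}
print(naive_likelihood(officer_drew, "Female"))  # 2/5 * 3/5 * 4/5 = 0.192
print(naive_likelihood(officer_drew, "Male"))    # 2/3 * 2/3 * 1/3 ~ 0.148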
The naïve Bayes classifier is often represented as this type of graph…

[Figure: a class node cj with an arrow from it to each feature node p(d1 | cj), p(d2 | cj), …, p(dn | cj)]

Note the direction of the arrows, which state that each class causes certain features, with a certain probability.
Naïve Bayes is fast and space efficient. We can look up all the probabilities with a single scan of the database and store them in a (small) table…

Gender   Over 190 cm
Male     Yes  0.15
         No   0.85
Female   Yes  0.01
         No   0.99

Gender   Long hair
Male     Yes  0.05
         No   0.95
Female   Yes  0.70
         No   0.30

…
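A sketch of that single scan in Python: one pass over the database accumulates counts, which are then normalised into the p(di | cj) lookup tables. The dict layout is my own choice:

from collections import defaultdict

def train(rows, class_key):
    # rows: list of dicts, e.g. {"gender": "Male", "long_hair": "No", ...}
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)   # keyed by (feature, value, class)
    for row in rows:                    # the single scan
        cj = row[class_key]
        class_counts[cj] += 1
        for feature, value in row.items():
            if feature != class_key:
                feature_counts[(feature, value, cj)] += 1
    # Normalise counts into p(feature = value | class) and p(class).
    cond = {k: n / class_counts[k[2]] for k, n in feature_counts.items()}
    priors = {cj: n / len(rows) for cj, n in class_counts.items()}
    return cond, priors

cond, priors = train(
    [{"gender": "Male", "long_hair": "No"},
     {"gender": "Female", "long_hair": "Yes"}],
    class_key="gender")
print(cond[("long_hair", "Yes", "Female")])  # 1.0 on this toy input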
Naïve Bayes is NOT sensitive to irrelevant features…

Suppose we are trying to classify a person's gender based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)

p(Jessica | cj) = p(eye = brown | cj) * p(wears_dress = yes | cj) * …

p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * …
p(Jessica | Male) = 9,001/10,000 * 2/10,000 * …

The eye-color factors are almost the same for both classes, so they barely affect the result! However, this assumes that we have good enough estimates of the probabilities, so the more data the better.
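To make the "almost the same" point concrete, here is the slide's arithmetic in Python: the eye-colour factors nearly cancel in the ratio of the two products, while the wears_dress factor dominates.

# Numbers from the Jessica example above.
p_jessica_female = (9_000 / 10_000) * (9_975 / 10_000)   # eye * dress
p_jessica_male   = (9_001 / 10_000) * (2 / 10_000)
print(p_jessica_female)                      # ~0.8978
print(p_jessica_male)                        # ~0.00018
# The irrelevant eye factor contributes almost exactly 1 to the ratio:
print((9_000 / 10_000) / (9_001 / 10_000))   # ~0.99989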
An obvious point: I have used a simple two-class problem, with two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes, or feature values.

Animal   Mass > 10 kg
Cat      Yes  0.15
         No   0.85
Dog      Yes  0.91
         No   0.09
Pig      Yes  0.99
         No   0.01

Animal   Color
Cat      Black  0.33
         White  0.23
         Brown  0.44
Dog      Black  0.97
         White  0.03
         Brown  0.90
Pig      Black  0.04
         White  0.01

…
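A sketch of the corresponding multi-class computation in Python, reading the conditional probabilities straight from the tables above. The uniform class priors and the handling of the Pig/Brown entry (lost from the slide) are my assumptions:

# p(di | cj) read from the two tables above.
p_mass = {"Cat": {"Yes": 0.15, "No": 0.85},
          "Dog": {"Yes": 0.91, "No": 0.09},
          "Pig": {"Yes": 0.99, "No": 0.01}}
p_color = {"Cat": {"Black": 0.33, "White": 0.23, "Brown": 0.44},
           "Dog": {"Black": 0.97, "White": 0.03, "Brown": 0.90},
           "Pig": {"Black": 0.04, "White": 0.01}}  # Brown entry not given

def classify(mass_over_10kg, color):
    # Priors assumed uniform, so comparing likelihoods is enough;
    # .get(..., 0.0) covers the missing Pig/Brown entry.
    scores = {c: p_mass[c][mass_over_10kg] * p_color[c].get(color, 0.0)
              for c in p_mass}
    return max(scores, key=scores.get)

print(classify("Yes", "Black"))  # -> "Dog" (0.91 * 0.97 dominates)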
Problem! Naïve Bayes assumes independence of features…

[Figure: the naïve Bayes graph again, with p(d | cj) factored into p(d1 | cj), p(d2 | cj), …, p(dn | cj)]

Gender   Over 6 foot
Male     Yes  0.15
         No   0.85
Female   Yes  0.01
         No   0.99

Gender   Over 200 pounds
Male     Yes  0.11
         No   0.80
Female   Yes  0.05
         No   0.95

But being over 6 foot and being over 200 pounds are clearly related, so these two features are not independent.
Solution: consider the relationships between attributes…

[Figure: the same graph, now with an arc connecting the height feature to the weight feature, so the weight probabilities are conditioned on height]

Gender   Over 6 foot
Male     Yes  0.15
         No   0.85
Female   Yes  0.01
         No   0.99

Gender   Over 200 pounds
Male     Yes and Over 6 foot       0.11
         No and Over 6 foot        0.59
         Yes and NOT Over 6 foot   0.05
         No and NOT Over 6 foot    0.35
But how do we find the set of connecting arcs??
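One common fix, sketched below, keeps the chain-rule factorisation p(d | cj) = p(height | cj) * p(weight | height, cj) for the related pair of attributes. Reading the slide's "Yes and Over 6 foot" rows as the conditional p(weight | height, Male) is my assumption about its notation:

# Relaxing independence for one pair of attributes (Male class only,
# since the slide gives only the Male side of the conditioned table).
p_height_male = {"Yes": 0.15, "No": 0.85}   # p(over 6 foot | Male)
p_weight_given_height_male = {               # p(over 200 lb | height, Male),
    ("Yes", "Yes"): 0.11,                    # assumed reading of the table
    ("No",  "Yes"): 0.59,
    ("Yes", "No"):  0.05,
    ("No",  "No"):  0.35,
}

def likelihood_male(height_over_6ft, weight_over_200lb):
    # p(d | Male) = p(height | Male) * p(weight | height, Male)
    return (p_height_male[height_over_6ft] *
            p_weight_given_height_male[(weight_over_200lb, height_over_6ft)])

print(likelihood_male("Yes", "Yes"))  # 0.15 * 0.11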
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features