Modeling Social Data, Lecture 8: Classification
classification
chris.wiggins@columbia.edu
2017-03-10
wat?
example: spam/ham
(cf. jake’s great deck on this)
Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
(this and the following naive Bayes slides are from Jake Hofman, Columbia University, "Classification: Naive Bayes," February 27, 2015)
classification?
build a theory of 3’s?
1-slide summary of classification
• banana or orange?
• what would Gauss do? (see the sketch just below)
[figure: fruit plotted by length vs. height]
• more features: length, height, price, smell, time of purchase
• game theory: "assume the worst"
• large deviation theory: "maximum margin"
• boosting (1997), SVMs (1990s)
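One concrete reading of "what would Gauss do?" (my sketch, not from the lecture): fit a Gaussian to each class in the length/height plane and assign a new fruit to the class with the larger likelihood times prior. All numbers and names below are invented for illustration.

```python
import numpy as np

# Toy "fruit" data: length and height (cm) for each class; numbers are made up.
rng = np.random.default_rng(0)
bananas = rng.normal(loc=[18.0, 3.5], scale=[2.0, 0.5], size=(100, 2))
oranges = rng.normal(loc=[7.5, 7.5], scale=[1.0, 1.0], size=(100, 2))

def log_gaussian(x, mu, var):
    """Log density of an axis-aligned Gaussian (independent features)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x, classes):
    """classes: dict name -> (samples, prior); return the most likely class name."""
    scores = {name: log_gaussian(x, s.mean(0), s.var(0)) + np.log(p)
              for name, (s, p) in classes.items()}
    return max(scores, key=scores.get)

classes = {"banana": (bananas, 0.5), "orange": (oranges, 0.5)}
print(classify(np.array([16.0, 4.0]), classes))   # -> "banana"
```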
1-slide summary of classification
• up- or down-regulated?
• learn predictive features from data
[figure: candidate rules, e.g., "acgt" & gene 45 down?, "cat" & gene 11 up?, "tag" & gene 34 up?, "gataca" & gene 37 down?, "gaga" & gene 1066 up?, "gaga" & gene 137 up?]
example: bad bananas
example @ NYT in CAR (computer-assisted reporting)
Figure 1: Tabuchi article

example in CAR (computer-assisted reporting)
◮ cf. Freedman's "Statistical Models and Shoe Leather"¹
◮ Takata airbag fatalities
◮ 2,219 labeled² examples from 33,204 comments
◮ cf. Box's "Science and Statistics"³

computer-assisted reporting
◮ Impact
Figure 3: impact
conjecture: cost function?
fallback: probability
review: regression as probability
classification as probability
binary/dichotomous/boolean features + NB
digression: bayes rule
generalize, maintain linearity
Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
Diagnoses a la Bayes¹
• You're testing for a rare disease:
  • 1% of the population is infected
• You have a highly sensitive and specific test:
  • 99% of sick patients test positive
  • 99% of healthy patients test negative
• Given that a patient tests positive, what is the probability that the patient is sick?
¹ Wiggins, SciAm 2006
Diagnoses a la Bayes
Population: 10,000 ppl
  1% Sick: 100 ppl
    99% Test +: 99 ppl
    1% Test −: 1 person
  99% Healthy: 9,900 ppl
    1% Test +: 99 ppl
    99% Test −: 9,801 ppl

So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)!

The small error rate on the large healthy population produces many false positives.
Natural frequencies a la Gigerenzer²
² http://bit.ly/ggbbc
Inverting conditional probabilities
Bayes' Theorem
Equate the far right- and left-hand sides of the product rule
$p(y|x)\,p(x) = p(x,y) = p(x|y)\,p(y)$
and divide to get the probability of y given x from the probability of x given y:
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$
where $p(x) = \sum_{y \in \Omega_Y} p(x|y)\,p(y)$ is the normalization constant.
Diagnoses a la Bayes
Given that a patient tests positive, what is the probability that the patient is sick?
$p(\mathrm{sick}|+) = \frac{\overbrace{p(+|\mathrm{sick})}^{99/100}\;\overbrace{p(\mathrm{sick})}^{1/100}}{\underbrace{p(+)}_{99/100^2 + 99/100^2 = 198/100^2}} = \frac{99}{198} = \frac{1}{2}$
where $p(+) = p(+|\mathrm{sick})\,p(\mathrm{sick}) + p(+|\mathrm{healthy})\,p(\mathrm{healthy})$.
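The same 50% answer, checked numerically; this is my sketch (not the lecture's code), using only the rates stated on the slide, first via Bayes' rule and then via Gigerenzer-style natural frequencies for a population of 10,000.

```python
p_sick = 0.01                      # 1% of the population is infected
p_pos_given_sick = 0.99            # sensitivity: 99% of sick patients test positive
p_neg_given_healthy = 0.99         # specificity: 99% of healthy patients test negative

p_healthy = 1 - p_sick
p_pos_given_healthy = 1 - p_neg_given_healthy
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy
print(p_pos_given_sick * p_sick / p_pos)                # P(sick|+) = 0.5

# natural frequencies for a population of 10,000 people
true_pos = 10_000 * p_sick * p_pos_given_sick           # 99 people
false_pos = 10_000 * p_healthy * p_pos_given_healthy    # 99 people
print(true_pos / (true_pos + false_pos))                # also 0.5
```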
(Super) Naive Bayes
We can use Bayes' rule to build a one-word spam classifier:
$p(\mathrm{spam}|\mathrm{word}) = \frac{p(\mathrm{word}|\mathrm{spam})\,p(\mathrm{spam})}{p(\mathrm{word})}$
where we estimate these probabilities with ratios of counts:
$\hat{p}(\mathrm{word}|\mathrm{spam}) = \frac{\#\ \text{spam docs containing word}}{\#\ \text{spam docs}}$
$\hat{p}(\mathrm{word}|\mathrm{ham}) = \frac{\#\ \text{ham docs containing word}}{\#\ \text{ham docs}}$
$\hat{p}(\mathrm{spam}) = \frac{\#\ \text{spam docs}}{\#\ \text{docs}}$
$\hat{p}(\mathrm{ham}) = \frac{\#\ \text{ham docs}}{\#\ \text{docs}}$
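A minimal sketch (mine, not the lecture's enron_naive_bayes.sh) of the same count-ratio estimates in Python; the counts plugged in at the bottom are the ones reported for "money" on the slides that follow.

```python
def one_word_spam_posterior(n_spam, n_ham, n_spam_with_word, n_ham_with_word):
    """P(spam | word present), estimated from raw document counts."""
    n_docs = n_spam + n_ham
    p_spam, p_ham = n_spam / n_docs, n_ham / n_docs                 # priors
    p_word_given_spam = n_spam_with_word / n_spam                   # likelihoods
    p_word_given_ham = n_ham_with_word / n_ham
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham  # evidence
    return p_word_given_spam * p_spam / p_word

# Enron counts for "money": 1500 spam / 3672 ham docs, 194 / 50 contain the word
print(one_word_spam_posterior(1500, 3672, 194, 50))   # ~0.795, matching P(spam|money)
```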
(Super) Naive Bayes
$ ./enron_naive_bayes.sh meeting
1500 spam examples
3672 ham examples
16 spam examples containing meeting
153 ham examples containing meeting
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(meeting|spam) = .0106
estimated P(meeting|ham) = .0416
P(spam|meeting) = .0923
(Super) Naive Bayes
$ ./enron_naive_bayes.sh money
1500 spam examples
3672 ham examples
194 spam examples containing money
50 ham examples containing money
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(money|spam) = .1293
estimated P(money|ham) = .0136
P(spam|money) = .7957
(Super) Naive Bayes
$ ./enron_naive_bayes.sh enron
1500 spam examples
3672 ham examples
0 spam examples containing enron
1478 ham examples containing enron
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(enron|spam) = 0
estimated P(enron|ham) = .4025
P(spam|enron) = 0
Naive Bayes
Represent each document by a binary vector $\vec{x}$ where $x_j = 1$ if the $j$-th word appears in the document ($x_j = 0$ otherwise).
Modeling each word as an independent Bernoulli random variable, the probability of observing a document $\vec{x}$ of class $c$ is:
$p(\vec{x}|c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$
where $\theta_{jc}$ denotes the probability that the $j$-th word occurs in a document of class $c$.
Naive Bayes
Using this likelihood in Bayes' rule and taking a logarithm, we have:
$\log p(c|\vec{x}) = \log \frac{p(\vec{x}|c)\,p(c)}{p(\vec{x})} = \sum_j x_j \log \frac{\theta_{jc}}{1 - \theta_{jc}} + \sum_j \log(1 - \theta_{jc}) + \log \frac{\theta_c}{p(\vec{x})}$
where $\theta_c$ is the probability of observing a document of class $c$.
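A short sketch (mine, not the lecture's code) of the Bernoulli naive Bayes model above: estimate $\theta_{jc}$ and the class priors from a toy binary document-term matrix, then score a new document with the log-linear form. The toy corpus is invented, and the clipping of $\theta$ is my own guard against log(0), not something on the slides.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """X: (n_docs, n_words) binary array; y: class labels.
    Returns per-class word probabilities theta[c][j] and class priors prior[c]."""
    theta, prior = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        # clip keeps the logs finite for words never/always seen in a class
        theta[c] = np.clip(Xc.mean(axis=0), 1e-9, 1 - 1e-9)
        prior[c] = len(Xc) / len(X)
    return theta, prior

def log_posterior(x, theta, prior, c):
    """log p(c|x) up to the constant -log p(x); note it is linear in x."""
    t = theta[c]
    return (x * np.log(t / (1 - t))).sum() + np.log(1 - t).sum() + np.log(prior[c])

# toy corpus: 4 documents over a 3-word vocabulary; 1 = spam, 0 = ham
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
theta, prior = fit_bernoulli_nb(X, y)
x_new = np.array([1, 0, 0])
print({c: log_posterior(x_new, theta, prior, c) for c in (0, 1)})
```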
(a) big picture: surrogate convex loss functions
general
Figure 4: Reminder: Surrogate Loss Functions
boosting
Figure 5: ‘Cited by 12599’
tangent: logistic function as surrogate loss function
◮ define $f(x) \equiv \log\left[p(y=1|x)/p(y=-1|x)\right] \in \mathbb{R}$
◮ $p(y=1|x) + p(y=-1|x) = 1 \;\Rightarrow\; p(y|x) = 1/(1 + \exp(-y f(x)))$
◮ $-\log_2 p(\{y_i\}_{i=1}^N) = \sum_i \log_2\!\left(1 + e^{-y_i f(x_i)}\right) \equiv \sum_i \ell(y_i f(x_i))$
◮ $\ell'' > 0$ and $\ell(\mu) > \mathbf{1}[\mu < 0]\ \forall\, \mu \in \mathbb{R}$ (checked numerically below)
◮ ∴ maximizing the log-likelihood is minimizing a surrogate convex loss function for classification
◮ but $\sum_i \log_2\!\left(1 + e^{-y_i w^T h(x_i)}\right)$ is not as easy as $\sum_i e^{-y_i w^T h(x_i)}$
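A quick numerical check (mine, not from the slides) of the two claims about $\ell(\mu) = \log_2(1 + e^{-\mu})$: it upper-bounds the 0-1 loss and is convex.

```python
import numpy as np

mu = np.linspace(-5, 5, 201)
logistic = np.log2(1 + np.exp(-mu))        # surrogate loss l(mu)
zero_one = (mu < 0).astype(float)          # 0-1 loss 1[mu < 0]

assert np.all(logistic > zero_one)         # l(mu) > 1[mu < 0] everywhere on the grid
assert np.all(np.diff(logistic, 2) > 0)    # positive discrete second differences: convex
print("logistic surrogate upper-bounds the 0-1 loss and is convex on this grid")
```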
boosting¹
L: the exponential surrogate loss function, summed over examples:
◮ $L[F] = \sum_i \exp(-y_i F(x_i))$
◮ $= \sum_i \exp\!\left(-y_i \sum_{t'=1}^{t} w_{t'} h_{t'}(x_i)\right) \equiv L_t(\mathbf{w}_t)$
◮ Draw $h_t \in \mathcal{H}$, a large space of rules, s.t. $h(x) \in \{-1, +1\}$
◮ label $y \in \{-1, +1\}$

boosting¹
L: the exponential surrogate loss function, summed over examples:
◮ $L_{t+1}(\mathbf{w}_t; w) \equiv \sum_i d_i^t \exp(-y_i w\, h_{t+1}(x_i))$
◮ $= \sum_{i:\, y_i = h_{t+1}(x_i)} d_i^t\, e^{-w} + \sum_{i:\, y_i \neq h_{t+1}(x_i)} d_i^t\, e^{+w} \equiv e^{-w} D_+ + e^{+w} D_-$
◮ $\therefore\ w_{t+1} = \operatorname{argmin}_w L_{t+1}(w) = \tfrac{1}{2} \log(D_+/D_-)$
◮ $L_{t+1}(w_{t+1}) = 2\sqrt{D_+ D_-} = 2\sqrt{\nu_+(1-\nu_+)}\, D$, where $0 \leq \nu_+ \equiv D_+/D = D_+/L_t \leq 1$
◮ update example weights: $d_i^{t+1} = d_i^t\, e^{\mp w}$

Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞⁴, …
⁴ Duchi + Singer, "Boosting with structural sparsity," ICML '09
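A minimal sketch (mine, not the lecture's code) of one boosting round as derived above: given a weak rule $h_{t+1}(x) \in \{-1,+1\}$ and current example weights $d_i^t$, set $w = \tfrac{1}{2}\log(D_+/D_-)$ and reweight the examples. The toy labels and rule are invented.

```python
import numpy as np

def boosting_round(d, y, h_x):
    """One coordinate-descent step on the exponential loss.
    d: example weights d_i^t; y, h_x: labels and rule outputs, each in {-1,+1}."""
    D_plus = d[y == h_x].sum()            # weight on examples the rule gets right
    D_minus = d[y != h_x].sum()           # weight on examples the rule gets wrong
    w = 0.5 * np.log(D_plus / D_minus)    # argmin of e^{-w} D+ + e^{+w} D-
    d_new = d * np.exp(-w * y * h_x)      # e^{-w} if correct, e^{+w} if not
    return w, d_new

# toy data: six points, a rule that errs on the last two
y = np.array([+1, +1, +1, -1, -1, -1])
h_x = np.array([+1, +1, +1, -1, +1, +1])
w, d = boosting_round(np.ones(6), y, h_x)
print(w, d)                               # misclassified examples get up-weighted
print(d.sum(), 2 * np.sqrt(4 * 2))        # new loss equals 2*sqrt(D+ D-)
```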
svm