Modeling Social Data, Lecture 8: Classification
classification
chris.wiggins@columbia.edu
2017-03-10
wat?
example: spam/ham
(cf. jake’s great deck on this)
Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
(this and the following naive Bayes slides are from Jake Hofman, Columbia University, "Classification: Naive Bayes," February 27, 2015)
classification?
build a theory of 3’s?
1-slide summary of classification
• banana or orange?
• what would Gauss do? (see the sketch just below)
[figure: fruit plotted by length vs. height]
• more features: length, height, price, smell, time of purchase
• game theory: "assume the worst"
• large deviation theory: "maximum margin"
• boosting (1997), SVMs (1990s)
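One concrete reading of "what would Gauss do?" (my sketch, not from the lecture): fit a Gaussian to each class in the length/height plane and assign a new fruit to the class with the larger likelihood times prior. All numbers and names below are invented for illustration.

```python
import numpy as np

# Toy "fruit" data: length and height (cm) for each class; numbers are made up.
rng = np.random.default_rng(0)
bananas = rng.normal(loc=[18.0, 3.5], scale=[2.0, 0.5], size=(100, 2))
oranges = rng.normal(loc=[7.5, 7.5], scale=[1.0, 1.0], size=(100, 2))

def log_gaussian(x, mu, var):
    """Log density of an axis-aligned Gaussian (independent features)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x, classes):
    """classes: dict name -> (samples, prior); return the most likely class name."""
    scores = {name: log_gaussian(x, s.mean(0), s.var(0)) + np.log(p)
              for name, (s, p) in classes.items()}
    return max(scores, key=scores.get)

classes = {"banana": (bananas, 0.5), "orange": (oranges, 0.5)}
print(classify(np.array([16.0, 4.0]), classes))   # -> "banana"
```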
1-slide summary of classification
• up- or down-regulated?
• learn predictive features from data
[figure: candidate rules, e.g., "acgt" & gene 45 down?, "cat" & gene 11 up?, "tag" & gene 34 up?, "gataca" & gene 37 down?, "gaga" & gene 1066 up?, "gaga" & gene 137 up?]
example: bad bananas
example @ NYT in CAR (computer-assisted reporting)
Figure 1: Tabuchi article

example in CAR (computer-assisted reporting)
◮ cf. Freedman's "Statistical Models and Shoe Leather"¹
◮ Takata airbag fatalities
◮ 2,219 labeled² examples from 33,204 comments
◮ cf. Box's "Science and Statistics"³

computer-assisted reporting
◮ Impact
Figure 3: impact
conjecture: cost function?
fallback: probability
review: regression as probability
classification as probability
binary/dichotomous/boolean features + NB
digression: bayes rule
generalize, maintain linearity
Learning by example
• How did you solve this problem?
• Can you make this process explicit (e.g. write code to do so)?
Diagnoses a la Bayes¹
• You're testing for a rare disease:
  • 1% of the population is infected
• You have a highly sensitive and specific test:
  • 99% of sick patients test positive
  • 99% of healthy patients test negative
• Given that a patient tests positive, what is the probability that the patient is sick?
¹ Wiggins, SciAm 2006
Diagnoses a la Bayes
Population: 10,000 ppl
  1% Sick: 100 ppl
    99% Test +: 99 ppl
    1% Test −: 1 person
  99% Healthy: 9,900 ppl
    1% Test +: 99 ppl
    99% Test −: 9,801 ppl

So given that a patient tests positive (198 ppl), there is a 50% chance the patient is sick (99 ppl)!

The small error rate on the large healthy population produces many false positives.
Natural frequencies a la Gigerenzer²
² http://bit.ly/ggbbc
Inverting conditional probabilities
Bayes' Theorem
Equate the far right- and left-hand sides of the product rule
$p(y|x)\,p(x) = p(x,y) = p(x|y)\,p(y)$
and divide to get the probability of y given x from the probability of x given y:
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$
where $p(x) = \sum_{y \in \Omega_Y} p(x|y)\,p(y)$ is the normalization constant.
Diagnoses a la Bayes
Given that a patient tests positive, what is the probability that the patient is sick?
$p(\mathrm{sick}|+) = \frac{\overbrace{p(+|\mathrm{sick})}^{99/100}\;\overbrace{p(\mathrm{sick})}^{1/100}}{\underbrace{p(+)}_{99/100^2 + 99/100^2 = 198/100^2}} = \frac{99}{198} = \frac{1}{2}$
where $p(+) = p(+|\mathrm{sick})\,p(\mathrm{sick}) + p(+|\mathrm{healthy})\,p(\mathrm{healthy})$.
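The same 50% answer, checked numerically; this is my sketch (not the lecture's code), using only the rates stated on the slide, first via Bayes' rule and then via Gigerenzer-style natural frequencies for a population of 10,000.

```python
p_sick = 0.01                      # 1% of the population is infected
p_pos_given_sick = 0.99            # sensitivity: 99% of sick patients test positive
p_neg_given_healthy = 0.99         # specificity: 99% of healthy patients test negative

p_healthy = 1 - p_sick
p_pos_given_healthy = 1 - p_neg_given_healthy
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * p_healthy
print(p_pos_given_sick * p_sick / p_pos)                # P(sick|+) = 0.5

# natural frequencies for a population of 10,000 people
true_pos = 10_000 * p_sick * p_pos_given_sick           # 99 people
false_pos = 10_000 * p_healthy * p_pos_given_healthy    # 99 people
print(true_pos / (true_pos + false_pos))                # also 0.5
```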
(Super) Naive Bayes
We can use Bayes' rule to build a one-word spam classifier:
$p(\mathrm{spam}|\mathrm{word}) = \frac{p(\mathrm{word}|\mathrm{spam})\,p(\mathrm{spam})}{p(\mathrm{word})}$
where we estimate these probabilities with ratios of counts:
$\hat{p}(\mathrm{word}|\mathrm{spam}) = \frac{\#\ \text{spam docs containing word}}{\#\ \text{spam docs}}$
$\hat{p}(\mathrm{word}|\mathrm{ham}) = \frac{\#\ \text{ham docs containing word}}{\#\ \text{ham docs}}$
$\hat{p}(\mathrm{spam}) = \frac{\#\ \text{spam docs}}{\#\ \text{docs}}$
$\hat{p}(\mathrm{ham}) = \frac{\#\ \text{ham docs}}{\#\ \text{docs}}$
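A minimal sketch (mine, not the lecture's enron_naive_bayes.sh) of the same count-ratio estimates in Python; the counts plugged in at the bottom are the ones reported for "money" on the slides that follow.

```python
def one_word_spam_posterior(n_spam, n_ham, n_spam_with_word, n_ham_with_word):
    """P(spam | word present), estimated from raw document counts."""
    n_docs = n_spam + n_ham
    p_spam, p_ham = n_spam / n_docs, n_ham / n_docs                 # priors
    p_word_given_spam = n_spam_with_word / n_spam                   # likelihoods
    p_word_given_ham = n_ham_with_word / n_ham
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham  # evidence
    return p_word_given_spam * p_spam / p_word

# Enron counts for "money": 1500 spam / 3672 ham docs, 194 / 50 contain the word
print(one_word_spam_posterior(1500, 3672, 194, 50))   # ~0.795, matching P(spam|money)
```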
(Super) Naive Bayes
$ ./enron_naive_bayes.sh meeting
1500 spam examples
3672 ham examples
16 spam examples containing meeting
153 ham examples containing meeting
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(meeting|spam) = .0106
estimated P(meeting|ham) = .0416
P(spam|meeting) = .0923
(Super) Naive Bayes
$ ./enron_naive_bayes.sh money
1500 spam examples
3672 ham examples
194 spam examples containing money
50 ham examples containing money
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(money|spam) = .1293
estimated P(money|ham) = .0136
P(spam|money) = .7957
(Super) Naive Bayes
$ ./enron_naive_bayes.sh enron
1500 spam examples
3672 ham examples
0 spam examples containing enron
1478 ham examples containing enron
estimated P(spam) = .2900
estimated P(ham) = .7100
estimated P(enron|spam) = 0
estimated P(enron|ham) = .4025
P(spam|enron) = 0
Naive Bayes
Represent each document by a binary vector $\vec{x}$ where $x_j = 1$ if the $j$-th word appears in the document ($x_j = 0$ otherwise).
Modeling each word as an independent Bernoulli random variable, the probability of observing a document $\vec{x}$ of class $c$ is:
$p(\vec{x}|c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$
where $\theta_{jc}$ denotes the probability that the $j$-th word occurs in a document of class $c$.
Naive Bayes
Using this likelihood in Bayes' rule and taking a logarithm, we have:
$\log p(c|\vec{x}) = \log \frac{p(\vec{x}|c)\,p(c)}{p(\vec{x})} = \sum_j x_j \log \frac{\theta_{jc}}{1 - \theta_{jc}} + \sum_j \log(1 - \theta_{jc}) + \log \frac{\theta_c}{p(\vec{x})}$
where $\theta_c$ is the probability of observing a document of class $c$.
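A short sketch (mine, not the lecture's code) of the Bernoulli naive Bayes model above: estimate $\theta_{jc}$ and the class priors from a toy binary document-term matrix, then score a new document with the log-linear form. The toy corpus is invented, and the clipping of $\theta$ is my own guard against log(0), not something on the slides.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """X: (n_docs, n_words) binary array; y: class labels.
    Returns per-class word probabilities theta[c][j] and class priors prior[c]."""
    theta, prior = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        # clip keeps the logs finite for words never/always seen in a class
        theta[c] = np.clip(Xc.mean(axis=0), 1e-9, 1 - 1e-9)
        prior[c] = len(Xc) / len(X)
    return theta, prior

def log_posterior(x, theta, prior, c):
    """log p(c|x) up to the constant -log p(x); note it is linear in x."""
    t = theta[c]
    return (x * np.log(t / (1 - t))).sum() + np.log(1 - t).sum() + np.log(prior[c])

# toy corpus: 4 documents over a 3-word vocabulary; 1 = spam, 0 = ham
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
theta, prior = fit_bernoulli_nb(X, y)
x_new = np.array([1, 0, 0])
print({c: log_posterior(x_new, theta, prior, c) for c in (0, 1)})
```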
(a) big picture: surrogate convex loss functions
general
Figure 4: Reminder: Surrogate Loss Functions
boosting
Figure 5: ‘Cited by 12599’
tangent: logistic function as surrogate loss function
◮ define $f(x) \equiv \log\left[p(y=1|x)/p(y=-1|x)\right] \in \mathbb{R}$
◮ $p(y=1|x) + p(y=-1|x) = 1 \;\Rightarrow\; p(y|x) = 1/(1 + \exp(-y f(x)))$
◮ $-\log_2 p(\{y_i\}_{i=1}^N) = \sum_i \log_2\!\left(1 + e^{-y_i f(x_i)}\right) \equiv \sum_i \ell(y_i f(x_i))$
◮ $\ell'' > 0$ and $\ell(\mu) > \mathbf{1}[\mu < 0]\ \forall\, \mu \in \mathbb{R}$ (checked numerically below)
◮ ∴ maximizing the log-likelihood is minimizing a surrogate convex loss function for classification
◮ but $\sum_i \log_2\!\left(1 + e^{-y_i w^T h(x_i)}\right)$ is not as easy as $\sum_i e^{-y_i w^T h(x_i)}$
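A quick numerical check (mine, not from the slides) of the two claims about $\ell(\mu) = \log_2(1 + e^{-\mu})$: it upper-bounds the 0-1 loss and is convex.

```python
import numpy as np

mu = np.linspace(-5, 5, 201)
logistic = np.log2(1 + np.exp(-mu))        # surrogate loss l(mu)
zero_one = (mu < 0).astype(float)          # 0-1 loss 1[mu < 0]

assert np.all(logistic > zero_one)         # l(mu) > 1[mu < 0] everywhere on the grid
assert np.all(np.diff(logistic, 2) > 0)    # positive discrete second differences: convex
print("logistic surrogate upper-bounds the 0-1 loss and is convex on this grid")
```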
boosting¹
L: the exponential surrogate loss function, summed over examples:
◮ $L[F] = \sum_i \exp(-y_i F(x_i))$
◮ $= \sum_i \exp\!\left(-y_i \sum_{t'=1}^{t} w_{t'} h_{t'}(x_i)\right) \equiv L_t(\mathbf{w}_t)$
◮ Draw $h_t \in \mathcal{H}$, a large space of rules, s.t. $h(x) \in \{-1, +1\}$
◮ label $y \in \{-1, +1\}$

boosting¹
L: the exponential surrogate loss function, summed over examples:
◮ $L_{t+1}(\mathbf{w}_t; w) \equiv \sum_i d_i^t \exp(-y_i w\, h_{t+1}(x_i))$
◮ $= \sum_{i:\, y_i = h_{t+1}(x_i)} d_i^t\, e^{-w} + \sum_{i:\, y_i \neq h_{t+1}(x_i)} d_i^t\, e^{+w} \equiv e^{-w} D_+ + e^{+w} D_-$
◮ $\therefore\ w_{t+1} = \operatorname{argmin}_w L_{t+1}(w) = \tfrac{1}{2} \log(D_+/D_-)$
◮ $L_{t+1}(w_{t+1}) = 2\sqrt{D_+ D_-} = 2\sqrt{\nu_+(1-\nu_+)}\, D$, where $0 \leq \nu_+ \equiv D_+/D = D_+/L_t \leq 1$
◮ update example weights: $d_i^{t+1} = d_i^t\, e^{\mp w}$

Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, L1, L∞⁴, …
⁴ Duchi + Singer, "Boosting with structural sparsity," ICML '09
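A minimal sketch (mine, not the lecture's code) of one boosting round as derived above: given a weak rule $h_{t+1}(x) \in \{-1,+1\}$ and current example weights $d_i^t$, set $w = \tfrac{1}{2}\log(D_+/D_-)$ and reweight the examples. The toy labels and rule are invented.

```python
import numpy as np

def boosting_round(d, y, h_x):
    """One coordinate-descent step on the exponential loss.
    d: example weights d_i^t; y, h_x: labels and rule outputs, each in {-1,+1}."""
    D_plus = d[y == h_x].sum()            # weight on examples the rule gets right
    D_minus = d[y != h_x].sum()           # weight on examples the rule gets wrong
    w = 0.5 * np.log(D_plus / D_minus)    # argmin of e^{-w} D+ + e^{+w} D-
    d_new = d * np.exp(-w * y * h_x)      # e^{-w} if correct, e^{+w} if not
    return w, d_new

# toy data: six points, a rule that errs on the last two
y = np.array([+1, +1, +1, -1, -1, -1])
h_x = np.array([+1, +1, +1, -1, +1, +1])
w, d = boosting_round(np.ones(6), y, h_x)
print(w, d)                               # misclassified examples get up-weighted
print(d.sum(), 2 * np.sqrt(4 * 2))        # new loss equals 2*sqrt(D+ D-)
```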
svm