
Logistic Regression

Rishabh Iyer
University of Texas at Dallas

based on the slides of Nicholas Ruozzi and Vibhav Gogate


Last Time

• Supervised learning via naive Bayes

• Use MLE to estimate a distribution $p(x, y) = p(y)\,p(x \mid y)$

• Classify by looking at the conditional distribution, 𝑝(𝑦|𝑥)

• Today: logistic regression

Logistic Regression
• Learn 𝑝(𝑌|𝑋) directly from the data

• Assume a particular functional form, e.g., a linear classifier


• $p(Y = 1 \mid x) = 1$ on one side of the boundary and $0$ on the other

• Not differentiable…

• Makes it difficult to learn

• Can’t handle noisy labels

[Figure: a linear decision boundary separating the region where $p(Y = 1 \mid x) = 0$ from the region where $p(Y = 1 \mid x) = 1$]
Logistic Regression

• Learn 𝑝(𝑦|𝑥) directly from the data

• Assume a particular functional form:

$$p(Y = -1 \mid x) = \frac{1}{1 + \exp\left(w^T x + b\right)}$$

$$p(Y = 1 \mid x) = \frac{\exp\left(w^T x + b\right)}{1 + \exp\left(w^T x + b\right)}$$
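To make the functional form concrete, here is a minimal NumPy sketch of the two formulas above (the function name logistic_probs and the argument names x, w, b are illustrative, not from the slides):

```python
import numpy as np

def logistic_probs(x, w, b):
    """Evaluate p(Y = 1 | x) and p(Y = -1 | x) for the logistic model above."""
    z = np.dot(w, x) + b               # the linear score w^T x + b
    p_pos = 1.0 / (1.0 + np.exp(-z))   # equals exp(z) / (1 + exp(z))
    p_neg = 1.0 - p_pos                # equals 1 / (1 + exp(z))
    return p_pos, p_neg
```

Only one of the two probabilities actually needs to be computed, since they sum to one.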
Logistic Function in 𝑚 Dimensions

$$p(Y = -1 \mid x) = \frac{1}{1 + \exp\left(w^T x + b\right)}$$

Can be applied to discrete and continuous features.
Functional Form: Two classes
• Given some $w$ and $b$, we can classify a new point $x$ by assigning the label $1$ if $p(Y = 1 \mid x) > p(Y = -1 \mid x)$ and $-1$ otherwise

• This leads to a linear classification rule:

• Classify as a $1$ if $w^T x + b > 0$

• Classify as a $-1$ if $w^T x + b < 0$
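Because $p(Y = 1 \mid x) > p(Y = -1 \mid x)$ exactly when $\exp(w^T x + b) > 1$, i.e., when $w^T x + b > 0$, the rule can be implemented without evaluating the probabilities at all. A minimal sketch (the names classify, x, w, b are illustrative):

```python
import numpy as np

def classify(x, w, b):
    """Label a point with the linear rule above: +1 if w^T x + b > 0, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1
```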
Learning the Weights

• To learn the weights, we maximize the conditional likelihood


$$w^*, b^* = \arg\max_{w, b} \prod_{i=1}^{N} p\left(y^{(i)} \mid x^{(i)}, w, b\right)$$

• This is not the same strategy that we used in the case of naive Bayes

• For naive Bayes, we maximized the (joint) log-likelihood
Generative vs. Discriminative Classifiers
Generative classifier (e.g., Naïve Bayes):

• Assume some functional form for $p(x \mid y)$ and $p(y)$

• Estimate the parameters of $p(x \mid y)$ and $p(y)$ directly from the training data

• Use Bayes rule to calculate $p(y \mid x)$

• This is a generative model: the computation of $p(Y \mid X)$ is indirect, through Bayes rule

• As a result, can also generate a sample of the data via $p(x) = \sum_y p(y)\,p(x \mid y)$

Discriminative classifier (e.g., Logistic Regression):

• Assume some functional form for $p(y \mid x)$

• Estimate the parameters of $p(y \mid x)$ directly from the training data

• This is a discriminative model: directly learn $p(y \mid x)$

• But cannot obtain a sample of the data, as $p(x)$ is not available

• Useful for discriminating labels
Learning the Weights
$$
\begin{aligned}
\ell(w, b) &= \ln \prod_{i=1}^{N} p\left(y^{(i)} \mid x^{(i)}, w, b\right) \\
&= \sum_{i=1}^{N} \ln p\left(y^{(i)} \mid x^{(i)}, w, b\right) \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \ln p\left(Y = 1 \mid x^{(i)}, w, b\right) + \left(1 - \frac{y^{(i)}+1}{2}\right) \ln p\left(Y = -1 \mid x^{(i)}, w, b\right) \right] \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \ln \frac{p\left(Y = 1 \mid x^{(i)}, w, b\right)}{p\left(Y = -1 \mid x^{(i)}, w, b\right)} + \ln p\left(Y = -1 \mid x^{(i)}, w, b\right) \right] \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \left(w^T x^{(i)} + b\right) - \ln\left(1 + \exp\left(w^T x^{(i)} + b\right)\right) \right]
\end{aligned}
$$
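The last line of this derivation can be evaluated directly. A minimal NumPy sketch, assuming X is an $N \times d$ matrix of feature vectors and y an array of $\pm 1$ labels (these names are illustrative); np.logaddexp(0, z) computes $\ln(1 + e^z)$ without overflow:

```python
import numpy as np

def log_likelihood(X, y, w, b):
    """Conditional log-likelihood l(w, b) from the last line of the derivation."""
    z = X @ w + b                          # w^T x^(i) + b for every example
    return np.sum((y + 1) / 2 * z - np.logaddexp(0.0, z))
```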
Learning the Weights
The same derivation as on the previous slide gives

$$\ell(w, b) = \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \left(w^T x^{(i)} + b\right) - \ln\left(1 + \exp\left(w^T x^{(i)} + b\right)\right) \right]$$

This is concave in $w$ and $b$: take derivatives and solve!
Learning the Weights
Again,

$$\ell(w, b) = \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \left(w^T x^{(i)} + b\right) - \ln\left(1 + \exp\left(w^T x^{(i)} + b\right)\right) \right]$$

Unfortunately, there is no closed form solution for the maximizer.
Learning the Weights

• Can apply gradient ascent to maximize the conditional likelihood


$$\frac{\partial \ell}{\partial b} = \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} - p\left(Y = 1 \mid x^{(i)}, w, b\right) \right]$$

$$\frac{\partial \ell}{\partial w_j} = \sum_{i=1}^{N} x_j^{(i)} \left[ \frac{y^{(i)}+1}{2} - p\left(Y = 1 \mid x^{(i)}, w, b\right) \right]$$
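Plugging these two gradients into a loop gives a simple batch gradient-ascent sketch; the learning rate and iteration count below are arbitrary illustrative choices, not values from the slides:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the conditional log-likelihood.
    X: (N, d) feature matrix, y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p_pos = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # p(Y = 1 | x^(i), w, b)
        err = (y + 1) / 2 - p_pos                   # common factor in both partial derivatives
        w += lr * (X.T @ err)                       # dl/dw_j = sum_i x_j^(i) * err_i
        b += lr * np.sum(err)                       # dl/db   = sum_i err_i
    return w, b
```

In practice one would monitor $\ell(w, b)$ across iterations and stop once it plateaus.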
Priors
• Can define priors on the weights to prevent overfitting

• Normal distribution, zero mean, covariance $\sigma^2 I$

$$p(w) = \prod_j \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{w_j^2}{2\sigma^2}\right)$$

• “Pushes” parameters towards zero

• Regularization

• Helps avoid very large weights and overfitting

Priors as Regularization
• The log-MAP objective with this Gaussian prior is then

$$\ln \left[\prod_{i=1}^{N} p\left(y^{(i)} \mid x^{(i)}, w, b\right) p(w)\, p(b)\right] = \sum_{i=1}^{N} \ln p\left(y^{(i)} \mid x^{(i)}, w, b\right) - \frac{\lambda}{2} \|w\|_2^2 + \text{const}, \qquad \lambda = \frac{1}{\sigma^2}$$

• Quadratic penalty: drives weights towards zero

• Adds a negative linear term to the gradients

• Different priors can produce different kinds of regularization

• This quadratic penalty $\frac{\lambda}{2}\|w\|_2^2$ is sometimes called an ℓ2 regularizer
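As a sketch of how the prior changes the update (the extra negative linear term mentioned above), here is one gradient-ascent step on the regularized objective; the regularization strength lam, playing the role of $\lambda = 1/\sigma^2$, and the learning rate are illustrative:

```python
import numpy as np

def map_step(X, y, w, b, lam=0.1, lr=0.1):
    """One gradient-ascent step on the l2-regularized (log-MAP) objective."""
    p_pos = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = (y + 1) / 2 - p_pos
    w = w + lr * (X.T @ err - lam * w)   # the Gaussian prior contributes the -lam * w term
    b = b + lr * np.sum(err)             # (no penalty applied to the bias here)
    return w, b
```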
Regularization

[Figure: comparison of ℓ1 and ℓ2 regularization]
Naïve Bayes vs. Logistic Regression
• Non-asymptotic analysis (for Gaussian NB)

• Convergence rate of the parameter estimates as the size of the training data tends to infinity ($n$ = # of attributes in $X$)

• Naïve Bayes needs $O(\log n)$ samples

• NB converges quickly to its (perhaps less helpful) asymptotic estimates

• Logistic Regression needs $O(n)$ samples

• LR converges more slowly but makes no independence assumptions (typically less biased)
[Ng & Jordan, 2002]
NB vs. LR (on UCI datasets)

[Figure: learning curves (error vs. sample size $m$) for Naïve Bayes and Logistic Regression on UCI datasets]

[Ng & Jordan, 2002]
LR in General
• Suppose that $y \in \{1, \dots, R\}$, i.e., that there are $R$ different class labels

• Can define a collection of weights and biases as follows

• Choose a vector of biases and a matrix of weights such that


for $k \neq R$,

$$p(Y = k \mid x) = \frac{\exp\left(b_k + \sum_i w_{ki} x_i\right)}{1 + \sum_{j < R} \exp\left(b_j + \sum_i w_{ji} x_i\right)}$$

and

$$p(Y = R \mid x) = \frac{1}{1 + \sum_{j < R} \exp\left(b_j + \sum_i w_{ji} x_i\right)}$$
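A minimal NumPy sketch of these $R$-class probabilities, treating class $R$ as the reference class; here W is an $(R-1) \times d$ weight matrix, b an $(R-1)$-vector, and the names are illustrative:

```python
import numpy as np

def multiclass_probs(x, W, b):
    """Return [p(Y=1|x), ..., p(Y=R|x)] using class R as the reference class."""
    scores = np.exp(W @ x + b)        # exp(b_k + sum_i w_ki x_i) for each k < R
    denom = 1.0 + scores.sum()        # shared denominator
    return np.append(scores / denom, 1.0 / denom)
```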
