Logistic Regression
Rishabh Iyer
University of Texas at Dallas
based on the slides of Nick Ruozzi and Vibhav Gogate
Last Time
• Supervised learning via naive Bayes
• Use MLE to estimate a distribution 𝑝(𝑥, 𝑦) = 𝑝(𝑦) 𝑝(𝑥|𝑦)
• Classify by looking at the conditional distribution, 𝑝(𝑦|𝑥)
• Today: logistic regression
Logistic Regression
• Learn 𝑝(𝑌|𝑋) directly from the data
• Assume a particular functional form, e.g., a hard linear classifier:
  𝑝(𝑌 = 1|𝑥) = 1 on one side of the decision boundary and 𝑝(𝑌 = 1|𝑥) = 0 on the other
• Not differentiable, which makes it difficult to learn
• Can't handle noisy labels
Logistic Regression
• Learn 𝑝(𝑦|𝑥) directly from the data
• Assume a particular functional form:

$$
p(Y = -1 \mid x) = \frac{1}{1 + \exp\big(w^{T} x + b\big)}
\qquad
p(Y = 1 \mid x) = \frac{\exp\big(w^{T} x + b\big)}{1 + \exp\big(w^{T} x + b\big)}
$$
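To make the two formulas concrete, here is a minimal sketch in Python/NumPy; the weights `w`, bias `b`, and query point `x` below are made-up values for illustration only, not anything from the slides.

```python
import numpy as np

def class_probabilities(x, w, b):
    """Return p(Y = -1 | x) and p(Y = +1 | x) under the logistic model."""
    z = np.dot(w, x) + b
    p_pos = 1.0 / (1.0 + np.exp(-z))   # equals exp(z) / (1 + exp(z))
    return {-1: 1.0 - p_pos, +1: p_pos}

# Made-up weights, bias, and query point (illustration only)
w = np.array([1.5, -2.0])
b = 0.5
x = np.array([0.2, 0.4])
print(class_probabilities(x, w, b))    # the two probabilities sum to 1
```

Computing 𝑝(𝑌 = 1|𝑥) as 1/(1 + exp(−𝑧)) is algebraically the same as the formula above, but avoids overflow when 𝑤ᵀ𝑥 + 𝑏 is large and positive.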
Logistic Function in 𝑚 Dimensions
$$
p(Y = -1 \mid x) = \frac{1}{1 + \exp\big(w^{T} x + b\big)}
$$

Can be applied to discrete and continuous features.
Functional Form: Two classes
• Given some 𝑤 and 𝑏, we can classify a new point 𝑥 by assigning the label 1 if 𝑝(𝑌 = 1|𝑥) > 𝑝(𝑌 = −1|𝑥) and −1 otherwise
• This leads to a linear classification rule:
  • Classify as 1 if 𝑤ᵀ𝑥 + 𝑏 > 0
  • Classify as −1 if 𝑤ᵀ𝑥 + 𝑏 < 0
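To see why this rule is linear, take the ratio of the two class probabilities defined earlier:

$$
\frac{p(Y = 1 \mid x)}{p(Y = -1 \mid x)} = \exp\big(w^{T} x + b\big) > 1
\quad\Longleftrightarrow\quad
w^{T} x + b > 0
$$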
Learning the Weights
• To learn the weights, we maximize the conditional likelihood
$$
(w^{*}, b^{*}) = \arg\max_{w,\, b} \prod_{i=1}^{N} p\big(y^{(i)} \mid x^{(i)}, w, b\big)
$$
• This is not the same strategy that we used in the case of naive Bayes
• For naive Bayes, we maximized the joint log-likelihood of 𝑝(𝑥, 𝑦); here we maximize the conditional likelihood
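As a concrete sketch of the objective (assuming labels 𝑦⁽ⁱ⁾ ∈ {−1, +1}), the conditional log-likelihood can be computed as below; it uses the compact identity 𝑝(𝑦|𝑥) = 1/(1 + exp(−𝑦(𝑤ᵀ𝑥 + 𝑏))), which is equivalent to the two class-probability formulas above. The data `X`, `y` are made-up for illustration.

```python
import numpy as np

def conditional_log_likelihood(X, y, w, b):
    """Sum_i ln p(y_i | x_i, w, b) for labels y_i in {-1, +1}.

    Uses the compact form p(y | x) = 1 / (1 + exp(-y * (w^T x + b))).
    """
    z = X @ w + b
    return -np.sum(np.log1p(np.exp(-y * z)))

# Made-up data (illustration only): 3 points with 2 features each
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, +1, +1])
print(conditional_log_likelihood(X, y, np.array([1.0, -1.0]), 0.0))
```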
Generative vs. Discriminative Classifiers
Generative classifiers (e.g., naïve Bayes):
• Assume some functional form for 𝑝(𝑥|𝑦) and 𝑝(𝑦)
• Estimate the parameters of 𝑝(𝑥|𝑦) and 𝑝(𝑦) directly from the training data
• Use Bayes rule to calculate 𝑝(𝑦|𝑥); the computation of 𝑝(𝑌|𝑋) is indirect
• Since 𝑝(𝑥) = Σ_𝑦 𝑝(𝑦) 𝑝(𝑥|𝑦) is available, can also generate samples of the data

Discriminative classifiers (e.g., logistic regression):
• Assume some functional form for 𝑝(𝑦|𝑥)
• Estimate the parameters of 𝑝(𝑦|𝑥) directly from the training data
• Learn 𝑝(𝑦|𝑥) directly; useful for discriminating between labels
• Cannot generate samples of the data, since 𝑝(𝑥) is not available
Learning the Weights
$$
\begin{aligned}
\ell(w, b) &= \ln \prod_{i=1}^{N} p\big(y^{(i)} \mid x^{(i)}, w, b\big) \\
&= \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid x^{(i)}, w, b\big) \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \ln p\big(Y=1 \mid x^{(i)}, w, b\big) + \left(1 - \frac{y^{(i)}+1}{2}\right) \ln p\big(Y=-1 \mid x^{(i)}, w, b\big) \right] \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \ln \frac{p\big(Y=1 \mid x^{(i)}, w, b\big)}{p\big(Y=-1 \mid x^{(i)}, w, b\big)} + \ln p\big(Y=-1 \mid x^{(i)}, w, b\big) \right] \\
&= \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} \big(w^{T} x^{(i)} + b\big) - \ln\Big(1 + \exp\big(w^{T} x^{(i)} + b\big)\Big) \right]
\end{aligned}
$$
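The last step uses two identities that follow directly from the assumed functional form:

$$
\ln \frac{p\big(Y = 1 \mid x^{(i)}, w, b\big)}{p\big(Y = -1 \mid x^{(i)}, w, b\big)} = w^{T} x^{(i)} + b,
\qquad
\ln p\big(Y = -1 \mid x^{(i)}, w, b\big) = -\ln\Big(1 + \exp\big(w^{T} x^{(i)} + b\big)\Big)
$$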
Learning the Weights
• ℓ(𝑤, 𝑏) is concave in 𝑤 and 𝑏: take derivatives and set them to zero!
• However, there is no closed-form solution, so we optimize iteratively
Learning the Weights
• Can apply gradient ascent to maximize the conditional likelihood
$$
\frac{\partial \ell}{\partial b} = \sum_{i=1}^{N} \left[ \frac{y^{(i)}+1}{2} - p\big(Y=1 \mid x^{(i)}, w, b\big) \right]
$$

$$
\frac{\partial \ell}{\partial w_j} = \sum_{i=1}^{N} x_j^{(i)} \left[ \frac{y^{(i)}+1}{2} - p\big(Y=1 \mid x^{(i)}, w, b\big) \right]
$$
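A minimal gradient-ascent sketch of these updates in Python/NumPy, assuming labels in {−1, +1}; the learning rate, iteration count, and toy data are illustrative choices, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood (labels in {-1, +1})."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = (y + 1) / 2                     # (y_i + 1) / 2, i.e. {-1, +1} -> {0, 1}
    for _ in range(n_iters):
        p1 = sigmoid(X @ w + b)         # p(Y = 1 | x_i, w, b) for every i
        residual = t - p1               # (y_i + 1)/2 - p(Y = 1 | x_i, w, b)
        w += lr * (X.T @ residual)      # dl/dw_j = sum_i x_ij * residual_i
        b += lr * np.sum(residual)      # dl/db   = sum_i residual_i
    return w, b

# Made-up, linearly separable toy data (illustration only)
X = np.array([[0.0, 0.1], [0.2, 0.3], [0.9, 1.0], [1.1, 0.8]])
y = np.array([-1, -1, +1, +1])
w, b = fit_logistic_regression(X, y)
print(np.sign(X @ w + b))               # should recover y on this toy set
```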
Priors
• Can define priors on the weights to prevent overfitting
• Normal distribution, zero mean, covariance 𝜎²𝐼
$$
p(w) = \prod_{j} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{w_j^{2}}{2\sigma^{2}}\right)
$$
• “Pushes” parameters towards zero
• Regularization
• Helps avoid very large weights and overfitting
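Taking the log of this prior shows where the penalty on the next slide comes from (writing 𝜆 = 1/𝜎²):

$$
\ln p(w) = -\frac{1}{2\sigma^{2}} \sum_{j} w_j^{2} + \text{const} = -\frac{\lambda}{2} \lVert w \rVert_2^{2} + \text{const}
$$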
Priors as Regularization
• The log-MAP objective with this Gaussian prior is then

$$
\ln \prod_{i=1}^{N} p\big(y^{(i)} \mid x^{(i)}, w, b\big)\, p(w)\, p(b) = \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid x^{(i)}, w, b\big) - \frac{\lambda}{2} \lVert w \rVert_2^{2} + \text{const}
$$

• Quadratic penalty: drives weights towards zero
• Adds a negative linear term (−𝜆𝑤) to the gradients
• Different priors can produce different kinds of regularization
• Sometimes called an ℓ2 regularizer
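A sketch of how the prior changes the earlier gradient-ascent update: the weight gradient picks up an extra −𝜆𝑤 term, while the bias is left unregularized here (a common but not mandatory choice). The regularization strength `lam` is an illustrative value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic_regression(X, y, lam=0.1, lr=0.1, n_iters=1000):
    """Gradient ascent on the L2-regularized (MAP) objective, labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = (y + 1) / 2
    for _ in range(n_iters):
        residual = t - sigmoid(X @ w + b)
        w += lr * (X.T @ residual - lam * w)   # prior adds a -lambda * w term
        b += lr * np.sum(residual)             # bias left unregularized here
    return w, b
```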
Regularization
[Figure: geometric comparison of ℓ1 and ℓ2 regularization]
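For reference, the two penalties compared in the figure (𝜆 controls the strength in both cases):

$$
\ell_1: \ \lambda \lVert w \rVert_1 = \lambda \sum_j \lvert w_j \rvert,
\qquad
\ell_2: \ \frac{\lambda}{2} \lVert w \rVert_2^{2} = \frac{\lambda}{2} \sum_j w_j^{2}
$$

The ℓ1 penalty tends to drive some weights exactly to zero (sparse solutions), while the ℓ2 penalty shrinks all weights smoothly without forcing exact zeros.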
Naïve Bayes vs. Logistic Regression
• Non-asymptotic analysis (for Gaussian NB)
• Convergence rate of parameter estimates as the size of the training data tends to infinity (𝑛 = # of attributes in 𝑋)
  • Naïve Bayes needs 𝑂(log 𝑛) samples; NB converges quickly to its (perhaps less helpful) asymptotic estimates
  • Logistic regression needs 𝑂(𝑛) samples; LR converges more slowly but makes no independence assumptions (typically less biased)
[Ng & Jordan, 2002]
NB vs. LR (on UCI datasets)
[Figure: naïve Bayes vs. logistic regression as a function of sample size 𝑚, on UCI datasets] [Ng & Jordan, 2002]
LR in General
• Suppose that 𝑦 ∈ {1, … , 𝑅}, i.e., that there are 𝑅 different class
labels
• Can define a collection of weights and biases as follows
• Choose a vector of biases 𝑏 and a matrix of weights 𝑊 such that, for 𝑘 ≠ 𝑅,

$$
p(Y = k \mid x) = \frac{\exp\big(b_k + \sum_i w_{ki} x_i\big)}{1 + \sum_{j < R} \exp\big(b_j + \sum_i w_{ji} x_i\big)}
$$

and

$$
p(Y = R \mid x) = \frac{1}{1 + \sum_{j < R} \exp\big(b_j + \sum_i w_{ji} x_i\big)}
$$
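A minimal sketch of these 𝑅-class probabilities in Python/NumPy; the matrix `W` (shape (𝑅−1) × 𝑑), the bias vector, and the example 𝑥 are made-up values for illustration only.

```python
import numpy as np

def multiclass_probabilities(x, W, biases):
    """p(Y = k | x) for k = 1..R, with class R as the reference class.

    W has shape (R-1, d); biases has shape (R-1,).
    """
    scores = np.exp(biases + W @ x)     # exp(b_k + sum_i w_ki x_i) for k < R
    denom = 1.0 + scores.sum()          # 1 + sum_{j < R} exp(b_j + sum_i w_ji x_i)
    return np.append(scores / denom, 1.0 / denom)   # last entry is p(Y = R | x)

# Made-up example with R = 3 classes and d = 2 features (illustration only)
W = np.array([[1.0, -1.0], [0.5, 0.5]])
biases = np.array([0.0, -0.5])
x = np.array([0.3, 0.7])
probs = multiclass_probabilities(x, W, biases)
print(probs, probs.sum())               # the R probabilities sum to 1
```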