Background: Generative and Discriminative Classifiers
Logistic Regression
Important analytic tool in natural and
social sciences
Baseline supervised machine learning
tool for classification
Is also the foundation of neural
networks
Generative and Discriminative
Classifiers
Naive Bayes is a generative classifier
by contrast:
Logistic regression is a discriminative
classifier
Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else
Generative models try to model how data is placed throughout the space, while discriminative models draw boundaries in the data space.
A discriminative model directly learns the conditional probability
distribution P(y|x). Recall that a generative model learns the joint
probability P(x,y) and then transforms it into P(y|x) using Bayes'
rule.
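As a quick refresher, the Bayes' rule step a generative model relies on can be written out as follows (a standard formulation, added here for reference):

```latex
% Bayes' rule: how a generative model turns the joint P(x,y) it learns
% into the conditional P(y|x) it needs at prediction time.
\begin{align*}
  P(y \mid x) \;=\; \frac{P(x, y)}{P(x)}
              \;=\; \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}
\end{align*}
% A discriminative model (logistic regression, SVM, decision tree)
% parameterizes P(y|x) directly and skips this step.
```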
Naive Bayes classifiers and hidden Markov models are examples of
generative classifiers.
Logistic regression, SVMs, and tree-based classifiers (e.g., decision
trees) are examples of discriminative classifiers.
Generative vs Discriminative Classifiers
Naive Bayes (generative): ĉ = argmax_c P(d|c) P(c)   (likelihood × prior, via Bayes' rule)
Logistic Regression (discriminative): ĉ = argmax_c P(c|d)   (the posterior, modeled directly)
Components of a probabilistic machine learning classifier
Given m input/output pairs (x^(i), y^(i)):
◦ a feature representation of the input: a vector of features [x1, x2, …, xn]
◦ a classification function that computes ŷ, the estimated class, via p(y|x): the sigmoid
◦ an objective function for learning: cross-entropy loss
◦ an algorithm for optimizing the objective function: stochastic gradient descent
The two phases of logistic regression
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x, we compute p(y|x)
using the learned weights w and b, and return
whichever label (y = 1 or y = 0) has the higher probability
Classification in Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input: a document x and a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Features in logistic regression
• For feature xi, the weight wi tells us how important xi is
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = -10
• xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation x
How to do classification
For each feature xi, the weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:
z = Σi wi xi + b = w∙x + b
If this sum z is high, we say y=1; if low, then y=0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a
number!
Solution: use a function of z that goes from 0 to 1
The very useful sigmoid or logistic function:
σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression
We’ll compute w∙x+b
And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
P(y=1) = σ(w∙x+b)
P(y=0) = 1 − σ(w∙x+b)
By the way: P(y=0) can also be written as σ(−(w∙x+b))
Because 1 − σ(z) = σ(−z)
Turning a probability into a classifier
ŷ = 1 if P(y=1|x) > 0.5, otherwise ŷ = 0
0.5 here is called the decision boundary
The probabilistic classifier
[Plot: P(y=1) = σ(w∙x + b) as an S-shaped function of w∙x + b]
Turning a probability into a classifier
ŷ = 1 if w∙x+b > 0
ŷ = 0 if w∙x+b ≤ 0
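A minimal Python sketch of this pipeline: compute z = w∙x + b, squash it with the sigmoid, and threshold at 0.5. The feature values and weights below are made up for illustration:

```python
import math

def sigmoid(z):
    """Map any real number z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Binary logistic regression for one observation.

    x: list of feature values, w: list of weights, b: bias.
    Returns (P(y=1|x), predicted label).
    """
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    p = sigmoid(z)                  # P(y=1|x) = sigma(w.x + b)
    y_hat = 1 if p > 0.5 else 0     # 0.5 is the decision boundary, i.e. z > 0
    return p, y_hat

# Hypothetical example: two features with weights of opposite sign.
p, y_hat = predict(x=[2.0, 1.0], w=[1.5, -3.0], b=0.5)
print(p, y_hat)   # p = sigma(3.0 - 3.0 + 0.5) = sigma(0.5) ≈ 0.62 -> y_hat = 1
```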
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
Features can be built for any classification task, e.g. period disambiguation:
"This ends in a period."  →  End of sentence
"The house at 465 Main St. is new."  →  Not end of sentence
Classifying sentiment for input x
Suppose w =
b = 0.1
Classifying sentiment for input x
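A sketch of the computation, assuming the six feature values and weights from the textbook's worked sentiment example (the slide's own figure is an assumption here); the point is the arithmetic:

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Assumed example features for the review above (from the textbook's worked example):
# x1 = count of positive-lexicon words, x2 = count of negative-lexicon words,
# x3 = 1 if "no" in doc, x4 = count of 1st/2nd-person pronouns,
# x5 = 1 if "!" in doc, x6 = ln(word count of doc)
x = [3, 2, 1, 3, 0, 4.19]
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # = 0.833
p_pos = sigmoid(z)                             # P(y=1|x) ≈ 0.70
p_neg = 1 - p_pos                              # P(y=0|x) ≈ 0.30
print(round(z, 3), round(p_pos, 2), round(p_neg, 2))
```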
Classification in (binary) logistic regression:
summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, …, wn]
◦ wi for each feature xi
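Putting the summary together, the full binary logistic regression classifier can be written as (a reconstruction from the definitions above):

```latex
% Binary logistic regression, assembled from the components above.
\begin{align*}
  z &= w \cdot x + b \;=\; \sum_{i=1}^{n} w_i x_i + b \\
  P(y=1 \mid x) &= \sigma(z) = \frac{1}{1 + e^{-z}},
  \qquad P(y=0 \mid x) = 1 - \sigma(z) \\
  \hat{y} &=
    \begin{cases}
      1 & \text{if } P(y=1 \mid x) > 0.5 \ \text{(equivalently, } w \cdot x + b > 0\text{)} \\
      0 & \text{otherwise}
    \end{cases}
\end{align*}
```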
Learning: Cross-Entropy Loss
Wait, where did the w's come from?
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
Intuition of negative log likelihood loss
= cross-entropy loss
A case of conditional maximum likelihood
estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label p(y|x)
Maximize: p(y|x) = ŷ^y (1−ŷ)^(1−y)      (where ŷ = σ(w∙x+b); this is ŷ when y=1 and 1−ŷ when y=0)
Now take the log of both sides (mathematically handy)
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Whatever values maximize log p(y|x) will also maximize p(y|x)
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label p(y|x)
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Minimize: the cross-entropy loss L_CE(ŷ, y) = −[y log ŷ + (1−y) log(1−ŷ)]      (flip the sign to turn maximizing into minimizing)
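As a small Python function, the loss above is a direct transcription of the formula (here y_hat stands for the model's estimate σ(w∙x+b)):

```python
import math

def binary_cross_entropy(y_hat, y):
    """Cross-entropy loss for one observation.

    y_hat: the model's estimate P(y=1|x) = sigma(w.x + b)
    y:     the true label, 1 or 0
    """
    # L_CE = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```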
Let's see if this works for our sentiment
example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate
. So why was it so enjoyable ? For one thing , the cast is great .
Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you
.
Let's see if this works for our sentiment
example
True value is y=1. How well is our model doing?
Pretty well! What's the loss?
Let's see if this works for our sentiment
example
Suppose true value instead was y=0.
What's the loss?
Let's see if this works for our sentiment
example
The loss when the model was right (if true y=1) is lower than the loss when the model was wrong (if true y=0):
Sure enough, loss was bigger when model was wrong!
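Plugging in numbers: if we assume the model's estimate for this review is P(y=1|x) ≈ 0.70 (the value from the classification sketch above), the two losses work out as follows:

```python
import math

p_pos = 0.70                             # assumed model estimate P(y=1|x) for this review

loss_if_y_is_1 = -math.log(p_pos)        # ≈ 0.36  (model fairly confident and right)
loss_if_y_is_0 = -math.log(1 - p_pos)    # ≈ 1.20  (model fairly confident and wrong)
print(round(loss_if_y_is_1, 2), round(loss_if_y_is_0, 2))
```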
Stochastic Gradient Descent
Our goal: minimize the loss
Intuition of gradient descent
How do I get to the bottom of this river canyon?
Look around me 360°
Find the direction of steepest slope down
Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
So here (where the slope is negative) we'll move positive
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
Gradient Descent: Find the gradient of the loss
function at the current point and move in the
opposite direction.
How much do we move in that direction? The step size is controlled by the learning rate η.
Now let's consider N dimensions
We want to know where in the N-dimensional
space (of the N parameters that make up θ ) we
should move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b
Visualizing the gradient vector at the red point:
it has two dimensions, shown in the x-y plane
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative ∂L/∂wi of the loss
The gradient is then defined as a vector of these
partials.
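Written out in standard notation, the gradient and the update rule it drives are:

```latex
% The gradient collects one partial derivative per parameter; gradient descent
% then takes a step of size eta in the opposite direction.
\begin{align*}
  \nabla_{\theta} L\big(f(x;\theta),\, y\big) &=
    \left[ \frac{\partial L}{\partial w_1},\;
           \frac{\partial L}{\partial w_2},\; \ldots,\;
           \frac{\partial L}{\partial w_n},\;
           \frac{\partial L}{\partial b} \right] \\
  \theta^{(t+1)} &= \theta^{(t)} - \eta\, \nabla_{\theta} L\big(f(x;\theta^{(t)}),\, y\big)
\end{align*}
```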
What are these partial derivatives for logistic regression?
The loss function:
L_CE(ŷ, y) = −[y log σ(w∙x+b) + (1−y) log(1−σ(w∙x+b))]
The elegant derivative of this function (see textbook 5.8 for the derivation):
∂L_CE/∂wj = [σ(w∙x+b) − y] xj
∂L_CE/∂b = σ(w∙x+b) − y
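As code, this derivative says the gradient is just the prediction error times the input. A minimal sketch (the function name is mine, not from the slides):

```python
import math

def lr_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss for one observation.

    Returns dL/dw (a list, one entry per feature) and dL/db.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))       # sigma(w.x + b)
    error = y_hat - y                        # sigma(w.x + b) - y
    dw = [error * xi for xi in x]            # dL/dw_j = (sigma(w.x+b) - y) * x_j
    db = error                               # dL/db   = (sigma(w.x+b) - y)
    return dw, db
```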
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like
regular parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent: an example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume all 3 parameters (2 weights and 1 bias) in θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent (w1 = w2 = b = 0; x1 = 3; x2 = 2; y = 1)
The update step for θ is: θ^(t+1) = θ^(t) − η ∇L(f(x;θ), y)
The gradient vector has 3 dimensions (∂L/∂w1, ∂L/∂w2, ∂L/∂b), where ∂L/∂wj = (σ(w∙x+b) − y) xj:
∂L/∂w1 = (σ(0) − 1) · x1 = (0.5 − 1) · 3 = −1.5
∂L/∂w2 = (σ(0) − 1) · x2 = (0.5 − 1) · 2 = −1.0
∂L/∂b  = (σ(0) − 1)      = (0.5 − 1)     = −0.5
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector θ1 by
moving θ0 in the opposite direction from the gradient (η = 0.1):
θ1 = θ0 − η ∇L = [0, 0, 0] − 0.1 · [−1.5, −1.0, −0.5] = [0.15, 0.1, 0.05]
So after one step of gradient descent the weights are w1 = 0.15, w2 = 0.1, and b = 0.05
Note that enough negative examples would eventually make w2 negative
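Here is the same single update as runnable code, reproducing the numbers above (x1 = 3, x2 = 2, true y = 1, all parameters starting at zero, η = 0.1):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

x = [3.0, 2.0]          # x1 = 3 positive-lexicon words, x2 = 2 negative-lexicon words
y = 1                   # true label: positive
w, b = [0.0, 0.0], 0.0  # theta_0: all parameters zero
eta = 0.1               # learning rate

# Forward pass and gradient: sigma(0) = 0.5, so the error is 0.5 - 1 = -0.5
z = sum(wi * xi for wi, xi in zip(w, x)) + b
error = sigmoid(z) - y
grad_w = [error * xi for xi in x]   # [-1.5, -1.0]
grad_b = error                      # -0.5

# One gradient-descent step: theta_1 = theta_0 - eta * gradient
w = [wi - eta * gwi for wi, gwi in zip(w, grad_w)]
b = b - eta * grad_b
print([round(wi, 2) for wi in w], round(b, 2))   # [0.15, 0.1] 0.05
```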
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
It is more common to compute the gradient over batches of
training instances.
Batch training: the entire dataset
Mini-batch training: m examples (e.g., 512 or 1024)
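To make the difference concrete, here is a minimal sketch of a mini-batch SGD training loop for binary logistic regression; the dataset and hyperparameter values are made up for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, n_features, eta=0.1, batch_size=2, epochs=100):
    """Mini-batch SGD for binary logistic regression.

    data: list of (x, y) pairs, where x is a list of feature values and y is 0 or 1.
    Setting batch_size=1 gives plain SGD; batch_size=len(data) gives batch training.
    """
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        random.shuffle(data)                        # stochastic: random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = [0.0] * n_features, 0.0
            for x, y in batch:                      # accumulate gradient over the mini-batch
                err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
                grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
                grad_b += err
            # Step opposite the averaged gradient
            w = [wi - eta * g / len(batch) for wi, g in zip(w, grad_w)]
            b -= eta * grad_b / len(batch)
    return w, b

# Tiny hypothetical dataset: [positive-lexicon count, negative-lexicon count] -> sentiment
data = [([3, 0], 1), ([2, 1], 1), ([0, 3], 0), ([1, 2], 0)]
w, b = train(data, n_features=2)
print(w, b)   # w1 ends up positive, w2 negative
```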