
Machine Learning

2EL1730

Lecture 2
Linear and logistic regression

Fragkiskos Malliaros and Maria Vakalopoulou

Friday, December 04, 2020


About Me

• Undergrad at National Technical University of Athens, Greece

• Ph.D in CS and CV at National Technical University of Athens, Greece

• Postdoc researcher at CentraleSupelec

• Assistant Professor at CentraleSupelec (since Jan. 2019)

• Research interests: ML, DL, Computer Vision, Medical Imaging and Remote
Sensing

2
Acknowledgements

• The lecture is partially based on material by


– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Nina Balcan and Matt Gormley (CMU)

Thank you!

3
Last lecture

4
ML Pipeline

[Pipeline diagram: Input Data → Feature Selection → Model Training → Model;
Test Data is fed to the trained model to produce the output (e.g., a prediction)]

5
Bias-Variance Tradeoff

• The center of the target is a model that


perfectly predicts the correct values
• We can repeat our entire model building
process to get a number of separate hits
on the target
– Each hit represents an individual
realization of the model
• Bias measures how far, in general,
these models' predictions are from the
correct value
• Variance is how much the predictions for
a given point vary between different
realizations of the model

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
6
ROC Curve

• ROC = Receiver Operating Characteristic


• Summarized by the area under the curve (AUROC or AUC)

• Plot TPR vs. FPR for all possible thresholds
(typically generated by the classifier); the ideal point is TPR = 1, FPR = 0

• Shows the trade-off between sensitivity (TPR) and (1 − specificity) (FPR)
for all possible thresholds (cutoff values between the positive and negative class)

• The larger the AUC, the better the overall performance

Source: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Visual description of ROC curve: https://www.youtube.com/watch?v=OAl6eAyP-yo 7
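
Illustrative sketch (not from the slides): computing the ROC curve and AUC with scikit-learn from predicted scores. The toy labels and scores below are made up for the example.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and classifier scores (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

# TPR and FPR for every possible threshold on the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: the larger, the better the overall performance
auc = roc_auc_score(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC: %.3f" % auc)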
Linear regression

8
Regression – Motivation

• Regression analysis is a statistical tool for investigating the relationship

between a dependent variable and one or more independent variables.

• Application 1: Predict movie rating


automatically

• Application 2: Can I predict the price of a 70 m² house on Rue de Rivoli?

• Application 3: How many followers will I get on Twitter?

[Image caption: Zillow Kaggle competition ($1.2M prize)]

9
Regression – Introduction (1/2)

• What do all those problems have in common?


– Continuous outputs y (e.g., a rating: a real number between 0-5, # of followers,
house price)
• The task of predicting continuous outputs is called regression

Regression is one of the simplest supervised learning approaches to learn relationships


between input variables (features) and output variables (predictions)

[Diagram: Data + Labels → ML algorithm → Predictor]

10
Regression – Introduction (2/2)

• What do I need in order to predict these outputs?


– Features (inputs): denoted by x
– Training examples: many x^(i) for which y^(i) is known (e.g., many
movies for which we know the rating)
– A model: a function f that represents the relationship between x
and y
– A loss or a cost or an objective function, which tells us how well our
model approximates the training examples
– Optimization: a way of finding the parameters of our model that
minimizes the loss function

11
Example

• Blue circles are data points (i.e., training examples)
– Goal: we want to fit a curve (green) to these data points

• Key questions:
– How do we parameterize the model?
– What loss function should we use to judge the fit?
– How do we optimize the fit to the unseen test data?

12
Linear Regression

• Given an input x we would like to compute an output y
• In linear regression we assume that y and x are related with the following equation:

y = θx

where θ is a parameter of the model
(in the figure, x is the observed value and y is what we are trying to predict)

13
Linear Regression – The General Model

• So far we assumed that the line passes through the origin
• What if the line does not?
• Simply change the model to

y = θ0 + θ1x

(θ0 is the intercept on the y-axis)
• (The above can be generalized to multi-dimensional data points)
• Idea: we can use least squares to determine θ0, θ1

14
Linear Regression – Problem Formulation

Data: Inputs are continuous vectors of length K (dimensions, features).
Outputs are continuous scalars.

Prediction: Output is a linear function of the inputs,

y = hθ(x) = θ^T x = θ1 x1 + θ2 x2 + … + θK xK

(We assume x1 = 1, so θ1 plays the role of the intercept)

Learning: find the parameters θ that minimize some objective function

15
Least-Squares Regression (1/2)

• Learning: find the parameters θ that minimize some objective function
(a very general optimization setup; many ways to solve it)

[Objective function] We minimize the sum of the squares:

J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x^(i) − y^(i))²

Why?
– It reduces the distance between the true measurements and the predicted
hyperplane (line in 1-D)
16
Least-Squares Regression (2/2)

Define a model: a hypothesis hθ(x) parameterized by θ

In our case, it's a linear model: hθ(x) = θ^T x

17
Optimizing the Objective

• Learning: Three approaches to solving

• Approach 1: Gradient Descent (or Batch GD)


– Take larger steps in the opposite direction of the gradient
(For a single update, it computes the gradient using the whole dataset)

• Approach 2: Stochastic Gradient Descent (or Incremental GD)


– Take many small steps in the opposite direction of the gradient
(Updates the gradient using a single sample of the training set)

• Approach 3: Closed form solution


– Set derivatives equal to zero and solve for parameters

18
Gradient Descent – The Idea (1/3)

• Start from a random point u


• How do I get closer to the solution?

19
Gradient Descent – The Idea (2/3)

• Start from a random point u


• How do I get closer to the solution?
• Follow the direction opposite to the gradient
– The gradient points in the direction of steepest ascent, so its opposite is the steepest descent

20
Gradient Descent – The Idea (3/3)

• Start from a random point u


• How do I get closer to the solution?
• Follow the direction opposite to the gradient
– The gradient points in the direction of steepest ascent, so its opposite is the steepest descent

u ← u − λ ∇f(u)

λ: learning rate

21
A1: (Batch) Gradient Descent

In order to apply GD to Linear Regression, we need to compute the gradient of the
objective function, i.e., the vector of partial derivatives (∂J/∂θ1, …, ∂J/∂θK)

More details about the derivative in a while
22
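
Illustrative sketch (not from the slides): a minimal NumPy implementation of batch gradient descent for the least-squares objective J(θ) = ½ Σ_i (θ^T x^(i) − y^(i))². The learning rate, tolerance, and synthetic data are my own illustrative choices, not values from the lecture.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, max_iters=10000, tol=1e-6):
    """Minimize J(theta) = 0.5 * sum((X @ theta - y)**2) with batch GD."""
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(max_iters):
        # Full-batch gradient: X^T (X theta - y), uses the whole dataset
        grad = X.T @ (X @ theta - y)
        theta -= lr * grad
        # Convergence check: stop when the L2 norm of the gradient is small
        if np.linalg.norm(grad) < tol:
            break
    return theta

# Synthetic 1-D example with an intercept column (x1 = 1)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)
X = np.column_stack([np.ones_like(x), x])   # first feature fixed to 1

theta = batch_gradient_descent(X, y, lr=0.01)
print("Estimated parameters:", theta)       # should be close to [2, 3]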
Gradient Descent – Convergence

• There are many possible ways to detect convergence. For example, we
could check whether the L2 norm of the gradient is below some small
tolerance

• Alternatively we could check that the reduction in the objective function


from one iteration to the next is small
23
A2: Stochastic Gradient Descent (SGD)

• Applied to Linear Regression, SGD is called the Least Mean
Squares (LMS) algorithm
• We need a per-example objective:

J^(i)(θ) = (1/2) (θ^T x^(i) − y^(i))²

24
Partial Derivatives for Linear Regression (1/2)

26
Partial Derivatives for Linear Regression (2/2)

Per-example gradient (used by SGD, aka. LMS):

∂J^(i)(θ)/∂θ_k = (θ^T x^(i) − y^(i)) x_k^(i)

Full gradient (used by Gradient Descent):

∂J(θ)/∂θ_k = Σ_{i=1}^{m} (θ^T x^(i) − y^(i)) x_k^(i)

27
Least Mean Squares (LMS)

28
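
Illustrative sketch (not from the slides): the LMS / stochastic gradient descent update in NumPy, using the per-example gradient (θ^T x^(i) − y^(i)) x^(i). The learning rate, number of epochs, and data are illustrative assumptions.

import numpy as np

def lms_sgd(X, y, lr=0.05, epochs=50, seed=0):
    """Least Mean Squares: SGD on the per-example objective 0.5*(theta^T x - y)^2."""
    rng = np.random.default_rng(seed)
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(epochs):
        for i in rng.permutation(m):          # visit examples in random order
            error = X[i] @ theta - y[i]       # scalar residual for one example
            theta -= lr * error * X[i]        # per-example gradient step
    return theta

# Same toy layout as before: first feature fixed to 1
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)
X = np.column_stack([np.ones_like(x), x])

print("LMS estimate:", lms_sgd(X, y))         # roughly [2, 3]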
A3: Closed Form (Analytical) Solution

• For some objectives we can also find the optimal solution


analytically
– This is the case for linear least-squares regression

• Basic idea:
– Write the cost function J(θ) in a matrix form
– Compute the derivative of J(θ) using the matrix form and set to zero
– Then,

θ = (X^T X)^(−1) X^T y

(the matrix (X^T X)^(−1) X^T is the Moore-Penrose pseudoinverse of X)

Source: https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
29
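
Illustrative sketch (not from the slides): the closed-form (normal equation) solution in NumPy, with the Moore-Penrose pseudoinverse as a numerically robust alternative. The toy data are illustrative.

import numpy as np

# Toy data: y ≈ 2 + 3x, with x1 = 1 as the intercept feature
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])

# Normal equations: theta = (X^T X)^(-1) X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is singular: Moore-Penrose pseudoinverse
theta_pinv = np.linalg.pinv(X) @ y

print(theta_normal, theta_pinv)   # both close to [2, 3]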
Least Squares – Summary

• Learning: Three approaches to solving

• Approach 1: Gradient Descent (or Batch GD)


– Take larger steps in the opposite direction of the gradient
– Pros: conceptually simple, guaranteed convergence
– Cons: batch, often slow to converge
• Approach 2: Stochastic Gradient Descent (or Incremental GD)
– Take many small steps in the opposite direction of the gradient
– Pros: memory efficient, fast convergence, less prone to local optima
– Cons: convergence in practice requires tuning of hyper-parameters
• Approach 3: Closed form solution
– Set derivatives equal to zero and solve for parameters
– Pros: one shot algorithm!
– Cons: does not scale to large datasets (matrix inverse is bottleneck)
30
More Powerful Models – Fitting a Polynomial

• What if our linear model is not good? How can we create a more
complicated model?
• We can create a more complicated model by defining input
variables that are combinations of components of x
• Example: an M-th order polynomial function of a one-dimensional
feature x:

y = θ0 + θ1 x + θ2 x² + … + θM x^M

• Still, a linear model
o The model is linear in the coefficients associated with the features
– where x^j is the j-th power of x

• Use SGD to optimize the model (code sketch below)

31
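
Illustrative sketch (not from the slides): building the powers of x as features and fitting the resulting linear model by least squares. The choice of M values, the sin-shaped toy data, and the use of a direct least-squares solver instead of SGD are my own illustrative assumptions.

import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are x^0, x^1, ..., x^M: still a linear model in the coefficients."""
    return np.column_stack([x**j for j in range(M + 1)])

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

for M in (0, 1, 3, 9):
    X = polynomial_design_matrix(x, M)
    theta = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares fit
    train_mse = np.mean((X @ theta - y) ** 2)
    print(f"M={M}: training MSE = {train_mse:.4f}")   # drops as M grows (overfitting risk)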
0th Order Polynomial

32
1st Order Polynomial

33
3rd Order Polynomial

34
9th Order Polynomial

35
Generalization

• Generalization: model’s ability to perform well on unseen data


• The last model with M=9 overfits the data (it also models the noise)
• Not a problem if we have lots of training examples
• Let's see how the mean-squared error (MSE) behaves

[Plot: MSE on the Training and Test sets as a function of the polynomial order M]
36
Polynomial Coefficients

[Table: values of the model parameters θ0, …, θ9 for polynomial models of
increasing complexity (left to right: more complex models)]

37
Occam’s Razor and Model Complexity

“Among competing hypotheses (models), the one with the fewest


assumptions should be selected”
[Figure: values of the parameters θ for three models — Θ(1) is a "complex model",
Θ(2) and Θ(3) are "simple models"]
38
Model Complexity

“Among competing hypotheses (models), the one with the fewest


assumptions should be selected”

• A "simple" model is one where θ has few non-zero parameters

– Only a few features are relevant (sparsity)
– ‖θ‖1 will be small (L1 norm)

• A "simple" model is one where θ is almost uniform

– Few features are significantly more relevant than others
– ‖θ‖2 will be small (L2 norm)

Occam’s Razor (wikipedia): https://en.wikipedia.org/wiki/Occam%27s_razor


39
Regularization

• Regularization is the process of penalizing model complexity


during training

• The loss function will be similar to before, with an additional

term, called the regularization term, that controls model complexity

• Two main approaches in regularized linear regression:


– Ridge regression: adds an L2 regularization term to linear
regression
– LASSO regression: adds an L1 regularization term to linear
regression

40
Ridge Regression

• Adds an L2 regularizer to Linear Regression:

J_Ridge(θ) = J(θ) + λ ‖θ‖2²

(the L2 penalty prefers uniform parameters; the added term measures model complexity)

Hyperparameter λ: how much should we trade off accuracy versus complexity
Ridge Regression

• Adds an L2 regularizer to Linear Regression (prefers uniform parameters)

• Pros
• Handles multicollinearity
• Biased, but smaller variance and smaller MSE (Mean Squared Error)
• Cons
• Shrinks coefficients toward zero but not exactly to zero, so it cannot produce a
parsimonious model
LASSO Regression

• Adds an L1 regularizer to Linear Regression:

J_LASSO(θ) = J(θ) + λ ‖θ‖1

(the L1 penalty yields sparse parameters; the added term measures model complexity)

Hyperparameter λ: how much should we trade off accuracy versus complexity
LASSO Regression

• Adds an L1 regularizer to Linear Regression (yields sparse parameters)

• Pros
• Enforces sparsity
• Cons
• If a group of predictors is correlated, LASSO tends to pick only one of them
Elastic Net Regression

• Adds both an L1 and an L2 regularizer to Linear Regression:

J_ElasticNet(θ) = J(θ) + λ1 ‖θ‖1 + λ2 ‖θ‖2²

• Pros
• Enforces sparsity
• No limitation on the number of selected variables; encourages grouping of
correlated variables
• Cons
• It can suffer from double shrinkage
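
Illustrative sketch (not from the slides): comparing the three regularized variants in scikit-learn on the same toy data. The value of alpha (scikit-learn's name for λ), the l1_ratio, and the data are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data with correlated and partly irrelevant features
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)          # feature 1 ~ feature 0 (correlated)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

models = {
    "ridge": Ridge(alpha=1.0),                            # L2: shrinks, keeps all features
    "lasso": Lasso(alpha=0.1),                            # L1: sparse, may drop one of two correlated features
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),   # L1 + L2
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:12s} coefficients: {np.round(model.coef_, 2)}")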
Regularizers Sum up

• [Plot of the regularizer functions: L1 (red), L2 (blue), L1 & L2 combined (yellow)]


Linear regression

• Sum up
• Simple approach
• It is used to predict a continuous dependent variable
• Works very well in practice
• Sometimes preferred when there is little training data
• Interpretable model

47
Classification as
regression

48
Classification vs. Regression

Given training pairs (x^(i), y^(i)), find f such that f(x^(i)) ≈ y^(i)

Binary classification: y ∈ {−1, +1} (or {0, 1})
Multiclass classification: y ∈ {1, …, C}
Regression: y ∈ ℝ

• Empirical error of f on the training set, given a loss function ℓ:
(1/m) Σ_{i=1}^{m} ℓ(f(x^(i)), y^(i))

49
Classification as Regression (1/2)

• Idea: suppose that we have a binary classification problem

• Assuming the standard model for linear regression, f(x) = θ^T x

• How to obtain θ?
– Use the least-squares solution presented earlier

• How do I compute the label for a new example?

50
Classification as Regression (2/2)

• One-dimensional example. The classifier has the form

f(x, w) = w0 + w^T x

• A reasonable decision rule is

y = +1 if f(x, w) ≥ 0, and y = −1 otherwise

51
Decision Rules

• How can I mathematically write this rule?

y = sign(f(x, w)) = sign(w0 + w^T x)

• This specifies a linear classifier: it has a linear boundary (a hyperplane)
which separates the space into two half-spaces

52
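
Illustrative sketch (not from the slides): the "classification as regression" idea in NumPy — fit least squares against ±1 labels and classify with the sign of the prediction. The 2-D toy data and variable names are my own illustrative choices.

import numpy as np

rng = np.random.default_rng(5)

# Two Gaussian blobs labelled -1 and +1 (toy 2-D data)
X_pos = rng.normal(loc=[+2, +2], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Add the intercept feature and fit ordinary least squares to the +/-1 targets
Xb = np.column_stack([np.ones(len(X)), X])
w = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Decision rule: y = +1 if f(x, w) >= 0, -1 otherwise (linear boundary: a line in 2-D)
preds = np.where(Xb @ w >= 0, 1, -1)
print("Training accuracy:", np.mean(preds == y))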
Example in 1-D

In 1-D, the linear boundary is simply a threshold

53
Example in 2-D

In 2-D, the linear boundary is a line

54
Example in 3-D

In 3-D, the linear boundary is a plane

55
Classification vs Regression

• The predictive value is continuous and not probabilistic

• Sensitive to imbalanced data when using linear regression for

classification

56
Logistic regression

57
Logistic Regression (1/2)

• Focus on binary classification. Logistic regression aims to model p(y = 1 | x)
by training a classifier of the form y(x) = f(θ^T x)

• Intuitively, it does not make sense for y^(i) to take values larger than 1
or smaller than 0
– Note that the sign() function presented before (classification as a regression task) is
not very useful here

58
Logistic Regression (2/2)

• How do we turn a real-valued expression z = θ^T x
into a probability?

• Replace the sign() with the sigmoid or logistic function:

σ(z) = 1 / (1 + e^(−z)),   where z = θ^T x

[Plot: the sigmoid σ(z) as a function of z]
59
Problem Formulation

• Given the logistic regression model, how do we fit θ for it?

• Probabilistic interpretation (Bernoulli distribution – success/failure):
– p(y=1|x;θ) = y(x)   [probability of success (i.e., y=1)]

– p(y=0|x;θ) = 1 − y(x)   [probability of failure (i.e., y=0)]

– Combined: p(y|x;θ) = (y(x))^y (1 − y(x))^(1−y)

Recall that y(x) is the hypothesis σ(θ^T x)
(the probability that the label of x is 1)

• How to do training?
– Assuming that the m training examples were generated independently, we can
write down the likelihood of the parameters:

L(θ) = Π_{i=1}^{m} p(y^(i)|x^(i);θ) = Π_{i=1}^{m} (y(x^(i)))^{y^(i)} (1 − y(x^(i)))^{1−y^(i)}
60
Likelihood Maximization

• We can learn the parameters of the model by maximizing the likelihood

(Maximum Likelihood Estimation (MLE))
• In practice we maximize the log-likelihood log L(θ), which is a concave function
• log is monotonically increasing: the log of a function achieves its
maximum value at the same point as the function itself

• How do we maximize the likelihood?

– Take the derivative and do gradient ascent (maximization problem)
– Or, take the negative of the log-likelihood function −log L(θ) and perform
(stochastic) gradient descent

Source: https://homes.cs.washington.edu/~marcotcr/blog/concavity/ 61
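
Illustrative sketch (not from the slides): the MLE training loop in NumPy — gradient ascent on the log-likelihood of the Bernoulli model p(y=1|x) = σ(θ^T x). The learning rate, iteration count, and toy data are illustrative assumptions, not values from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, iters=2000):
    """Gradient ascent on the log-likelihood of p(y=1|x) = sigmoid(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)          # predicted probability that y = 1
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        theta += lr * grad / len(y)     # ascent step (averaged over the m examples)
    return theta

# Toy 1-D data: labels in {0, 1}, with an intercept feature x1 = 1
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=200)
p_true = sigmoid(1.0 + 2.0 * x)
y = (rng.uniform(size=200) < p_true).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta = fit_logistic_regression(X, y)
print("Estimated theta:", theta)        # roughly [1, 2]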
Summary

• Probabilistic interpretation of classification


– We can have something similar for linear regression
• Extension to multiple classes
– Softmax regression
– Other ways to deal with multiclass classification?
• Quick to train
• Fast at classification
• Good accuracy for many simple datasets
• Resistant to overfitting
– Regularized logistic regression

Softmax regression: http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/


62
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

63
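
Illustrative usage sketch (not from the slides) for the scikit-learn class referenced above; the synthetic dataset and the regularization strength C are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the inverse of the regularization strength (regularized logistic regression)
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Predicted probabilities for the first test example:", clf.predict_proba(X_test[:1]))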
Next Class

• Probabilistic classifiers
• Refresh your knowledge of Bayes' rule

64
Thank You!

http://www.travelandleisure.com/
65
