
Machine Learning

2EL1730

Lecture 2
Linear and logistic regression

Fragkiskos Malliaros and Maria Vakalopoulou

Friday, December 04, 2020


About Me

• Undergrad at National Technical University of Athens, Greece

• Ph.D in CS and CV at National Technical University of Athens, Greece

• Postdoc researcher at CentraleSupelec

• Assistant Professor at CentraleSupelec (since Jan. 2019)

• Research interests: ML, DL, Computer Vision, Medical Imaging and Remote
Sensing

2
Acknowledgements

• The lecture is partially based on material by


– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Nina Balcan and Matt Gormley (CMU)

Thank you!

3
Last lecture

4
ML Pipeline

[Pipeline diagram: Input Data → Feature Selection → Model Training → Model;
Test Data is fed to the trained model to produce the output (e.g., a prediction)]

5
Bias-Variance Tradeoff

• The center of the target is a model that


perfectly predicts the correct values
• We can repeat our entire model building
process to get a number of separate hits
on the target
– Each hit represents an individual
realization of the model
• Bias measures how far, in general,
these models' predictions are from the
correct value
• Variance is how much the predictions for
a given point vary between different
realizations of the model

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
6
ROC Curve

• ROC = Receiver Operating Characteristic


• Summarized by the area under the curve (AUROC or AUC)

• Plot TPR vs. FPR for all possible thresholds
(typically generated by the classifier); the ideal point is TPR = 1, FPR = 0

• Shows the trade-off between sensitivity (TPR) and (1 − specificity) (FPR)
for all possible thresholds (cutoff values between the positive and negative class)

• The larger the AUC, the better the overall performance

Source: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Visual description of ROC curve: https://www.youtube.com/watch?v=OAl6eAyP-yo 7
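
Illustrative sketch (not from the slides): computing the ROC curve and AUC with scikit-learn from predicted scores. The toy labels and scores below are made up for the example.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and classifier scores (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

# TPR and FPR for every possible threshold on the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: the larger, the better the overall performance
auc = roc_auc_score(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC: %.3f" % auc)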
Linear regression

8
Regression – Motivation

• Regression analysis is a statistical tool for investigating the relationship

between a dependent variable and one or more independent variables.

• Application 1: Predict movie rating


automatically

• Application 2: Can I predict the price of a 70 m² house on Rue de Rivoli?

• Application 3: How many followers will I get on Twitter?

[Image caption: Zillow Kaggle competition ($1.2M prize)]

9
Regression – Introduction (1/2)

• What do all those problems have in common?


– Continuous outputs y (e.g., a rating: a real number between 0-5, # of followers,
house price)
• The task of predicting continuous outputs is called regression

Regression is one of the simplest supervised learning approaches to learn relationships


between input variables (features) and output variables (predictions)

[Diagram: Data + Labels → ML algorithm → Predictor]

10
Regression – Introduction (2/2)

• What do I need in order to predict these outputs?


– Features (inputs): denoted by x
– Training examples: many x^(i) for which y^(i) is known (e.g., many
movies for which we know the rating)
– A model: a function f that represents the relationship between x
and y
– A loss or a cost or an objective function, which tells us how well our
model approximates the training examples
– Optimization: a way of finding the parameters of our model that
minimizes the loss function

11
Example

• Blue circles are data points (i.e., training examples)
– Goal: we want to fit a curve (green) to these data points

• Key questions:
– How do we parameterize the model?
– What loss function should we use to judge the fit?
– How do we optimize the fit to the unseen test data?

12
Linear Regression

• Given an input x we would like to compute an output y
• In linear regression we assume that y and x are related with the following equation:

y = θx

where θ is a parameter of the model
(in the figure, x is the observed value and y is what we are trying to predict)

13
Linear Regression – The General Model

• So far we assumed that the line passes through the origin
• What if the line does not?
• Simply change the model to

y = θ0 + θ1x

(θ0 is the intercept on the y-axis)
• (The above can be generalized to multi-dimensional data points)
• Idea: we can use least squares to determine θ0, θ1

14
Linear Regression – Problem Formulation

Data: Inputs are continuous vectors of length K (dimensions, features).
Outputs are continuous scalars.

Prediction: Output is a linear function of the inputs,

y = hθ(x) = θ^T x = θ1 x1 + θ2 x2 + … + θK xK

(We assume x1 = 1, so θ1 plays the role of the intercept)

Learning: find the parameters θ that minimize some objective function

15
Least-Squares Regression (1/2)

• Learning: find the parameters θ that minimize some objective function
(a very general optimization setup; many ways to solve it)

[Objective function] We minimize the sum of the squares:

J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x^(i) − y^(i))²

Why?
– It reduces the distance between the true measurements and the predicted
hyperplane (line in 1-D)
16
Least-Squares Regression (2/2)

Define a model: a hypothesis hθ(x) parameterized by θ

In our case, it's a linear model: hθ(x) = θ^T x

17
Optimizing the Objective

• Learning: Three approaches to solving

• Approach 1: Gradient Descent (or Batch GD)


– Take larger steps in the opposite direction of the gradient
(For a single update, it computes the gradient using the whole dataset)

• Approach 2: Stochastic Gradient Descent (or Incremental GD)


– Take many small steps in the opposite direction of the gradient
(Updates the gradient using a single sample of the training set)

• Approach 3: Closed form solution


– Set derivatives equal to zero and solve for parameters

18
Gradient Descent – The Idea (1/3)

• Start from a random point u


• How do I get closer to the solution?

19
Gradient Descent – The Idea (2/3)

• Start from a random point u


• How do I get closer to the solution?
• Follow the direction opposite to the gradient
– The gradient points in the direction of steepest ascent, so its opposite is the steepest descent

20
Gradient Descent – The Idea (3/3)

• Start from a random point u


• How do I get closer to the solution?
• Follow the direction opposite to the gradient
– The gradient points in the direction of steepest ascent, so its opposite is the steepest descent

u ← u − λ ∇f(u)

λ: learning rate

21
A1: (Batch) Gradient Descent

In order to apply GD to Linear Regression, we need to compute the gradient of the
objective function, i.e., the vector of partial derivatives (∂J/∂θ1, …, ∂J/∂θK)

More details about the derivative in a while
22
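
Illustrative sketch (not from the slides): a minimal NumPy implementation of batch gradient descent for the least-squares objective J(θ) = ½ Σ_i (θ^T x^(i) − y^(i))². The learning rate, tolerance, and synthetic data are my own illustrative choices, not values from the lecture.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, max_iters=10000, tol=1e-6):
    """Minimize J(theta) = 0.5 * sum((X @ theta - y)**2) with batch GD."""
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(max_iters):
        # Full-batch gradient: X^T (X theta - y), uses the whole dataset
        grad = X.T @ (X @ theta - y)
        theta -= lr * grad
        # Convergence check: stop when the L2 norm of the gradient is small
        if np.linalg.norm(grad) < tol:
            break
    return theta

# Synthetic 1-D example with an intercept column (x1 = 1)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)
X = np.column_stack([np.ones_like(x), x])   # first feature fixed to 1

theta = batch_gradient_descent(X, y, lr=0.01)
print("Estimated parameters:", theta)       # should be close to [2, 3]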
Gradient Descent – Convergence

• There are many possible ways to detect convergence. For example, we
could check whether the L2 norm of the gradient is below some small
tolerance

• Alternatively we could check that the reduction in the objective function


from one iteration to the next is small
23
A2: Stochastic Gradient Descent (SGD)

• Applied to Linear Regression, SGD is called the Least Mean
Squares (LMS) algorithm
• We need a per-example objective:

J^(i)(θ) = (1/2) (θ^T x^(i) − y^(i))²

24
Partial Derivatives for Linear Regression (1/2)

26
Partial Derivatives for Linear Regression (2/2)

Per-example gradient (used by SGD, aka. LMS):

∂J^(i)(θ)/∂θ_k = (θ^T x^(i) − y^(i)) x_k^(i)

Full gradient (used by Gradient Descent):

∂J(θ)/∂θ_k = Σ_{i=1}^{m} (θ^T x^(i) − y^(i)) x_k^(i)

27
Least Mean Squares (LMS)

28
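
Illustrative sketch (not from the slides): the LMS / stochastic gradient descent update in NumPy, using the per-example gradient (θ^T x^(i) − y^(i)) x^(i). The learning rate, number of epochs, and data are illustrative assumptions.

import numpy as np

def lms_sgd(X, y, lr=0.05, epochs=50, seed=0):
    """Least Mean Squares: SGD on the per-example objective 0.5*(theta^T x - y)^2."""
    rng = np.random.default_rng(seed)
    m, k = X.shape
    theta = np.zeros(k)
    for _ in range(epochs):
        for i in rng.permutation(m):          # visit examples in random order
            error = X[i] @ theta - y[i]       # scalar residual for one example
            theta -= lr * error * X[i]        # per-example gradient step
    return theta

# Same toy layout as before: first feature fixed to 1
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)
X = np.column_stack([np.ones_like(x), x])

print("LMS estimate:", lms_sgd(X, y))         # roughly [2, 3]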
A3: Closed Form (Analytical) Solution

• For some objectives we can also find the optimal solution


analytically
– This is the case for linear least-squares regression

• Basic idea:
– Write the cost function J(θ) in a matrix form
– Compute the derivative of J(θ) using the matrix form and set to zero
– Then,

θ = (X^T X)^(−1) X^T y

(the matrix (X^T X)^(−1) X^T is the Moore-Penrose pseudoinverse of X)

Source: https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
29
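
Illustrative sketch (not from the slides): the closed-form (normal equation) solution in NumPy, with the Moore-Penrose pseudoinverse as a numerically robust alternative. The toy data are illustrative.

import numpy as np

# Toy data: y ≈ 2 + 3x, with x1 = 1 as the intercept feature
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])

# Normal equations: theta = (X^T X)^(-1) X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is singular: Moore-Penrose pseudoinverse
theta_pinv = np.linalg.pinv(X) @ y

print(theta_normal, theta_pinv)   # both close to [2, 3]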
Least Squares – Summary

• Learning: Three approaches to solving

• Approach 1: Gradient Descent (or Batch GD)


– Take larger steps in the opposite direction of the gradient
– Pros: conceptually simple, guaranteed convergence
– Cons: batch, often slow to converge
• Approach 2: Stochastic Gradient Descent (or Incremental GD)
– Take many small steps in the opposite direction of the gradient
– Pros: memory efficient, fast convergence, less prone to local optima
– Cons: convergence in practice requires tuning of hyper-parameters
• Approach 3: Closed form solution
– Set derivatives equal to zero and solve for parameters
– Pros: one shot algorithm!
– Cons: does not scale to large datasets (matrix inverse is bottleneck)
30
More Powerful Models – Fitting a Polynomial

• What if our linear model is not good? How can we create a more
complicated model?
• We can create a more complicated model by defining input
variables that are combinations of components of x
• Example: an M-th order polynomial function of a one-dimensional
feature x:

y = θ0 + θ1 x + θ2 x² + … + θM x^M

• Still, a linear model
o The model is linear in the coefficients associated with the features
– where x^j is the j-th power of x

• Use SGD to optimize the model (code sketch below)

31
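
Illustrative sketch (not from the slides): building the powers of x as features and fitting the resulting linear model by least squares. The choice of M values, the sin-shaped toy data, and the use of a direct least-squares solver instead of SGD are my own illustrative assumptions.

import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are x^0, x^1, ..., x^M: still a linear model in the coefficients."""
    return np.column_stack([x**j for j in range(M + 1)])

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

for M in (0, 1, 3, 9):
    X = polynomial_design_matrix(x, M)
    theta = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares fit
    train_mse = np.mean((X @ theta - y) ** 2)
    print(f"M={M}: training MSE = {train_mse:.4f}")   # drops as M grows (overfitting risk)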
0th Order Polynomial

32
1st Order Polynomial

33
3rd Order Polynomial

34
9th Order Polynomial

35
Generalization

• Generalization: model’s ability to perform well on unseen data


• The last model with M=9 overfits the data (it also models the noise)
• Not a problem if we have lots of training examples
• Let's see how the mean-squared error (MSE) behaves

[Plot: MSE on the Training and Test sets as a function of the polynomial order M]
36
Polynomial Coefficients

[Table: values of the model parameters θ0, …, θ9 for polynomial models of
increasing complexity (left to right: more complex models)]

37
Occam’s Razor and Model Complexity

“Among competing hypotheses (models), the one with the fewest


assumptions should be selected”
[Figure: values of the parameters θ for three models — Θ(1) is a "complex model",
Θ(2) and Θ(3) are "simple models"]
38
Model Complexity

“Among competing hypotheses (models), the one with the fewest


assumptions should be selected”

• A "simple" model is one where θ has few non-zero parameters

– Only a few features are relevant (sparsity)
– ‖θ‖1 will be small (L1 norm)

• A "simple" model is one where θ is almost uniform

– Few features are significantly more relevant than others
– ‖θ‖2 will be small (L2 norm)

Occam’s Razor (wikipedia): https://en.wikipedia.org/wiki/Occam%27s_razor


39
Regularization

• Regularization is the process of penalizing model complexity


during training

• The loss function will be similar to before, with an additional

term, called the regularization term, that controls model complexity

• Two main approaches in regularized linear regression:


– Ridge regression: adds an L2 regularization term to linear
regression
– LASSO regression: adds an L1 regularization term to linear
regression

40
Ridge Regression

• Adds an L2 regularizer to Linear Regression:

J_Ridge(θ) = J(θ) + λ ‖θ‖2²

(the L2 penalty prefers uniform parameters; the added term measures model complexity)

Hyperparameter λ: how much should we trade off accuracy versus complexity
Ridge Regression

• Adds an L2 regularizer to Linear Regression (prefers uniform parameters)

• Pros
• Handles multicollinearity
• Biased, but smaller variance and smaller MSE (Mean Squared Error)
• Cons
• Shrinks coefficients toward zero but not exactly to zero, so it cannot produce a
parsimonious model
LASSO Regression

• Adds an L1 regularizer to Linear Regression:

J_LASSO(θ) = J(θ) + λ ‖θ‖1

(the L1 penalty yields sparse parameters; the added term measures model complexity)

Hyperparameter λ: how much should we trade off accuracy versus complexity
LASSO Regression

• Adds an L1 regularizer to Linear Regression (yields sparse parameters)

• Pros
• Enforces sparsity
• Cons
• If a group of predictors is correlated, LASSO tends to pick only one of them
Elastic Net Regression

• Adds both an L1 and an L2 regularizer to Linear Regression:

J_ElasticNet(θ) = J(θ) + λ1 ‖θ‖1 + λ2 ‖θ‖2²

• Pros
• Enforces sparsity
• No limitation on the number of selected variables; encourages grouping of
correlated variables
• Cons
• It can suffer from double shrinkage
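
Illustrative sketch (not from the slides): comparing the three regularized variants in scikit-learn on the same toy data. The value of alpha (scikit-learn's name for λ), the l1_ratio, and the data are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data with correlated and partly irrelevant features
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)          # feature 1 ~ feature 0 (correlated)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

models = {
    "ridge": Ridge(alpha=1.0),                            # L2: shrinks, keeps all features
    "lasso": Lasso(alpha=0.1),                            # L1: sparse, may drop one of two correlated features
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),   # L1 + L2
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:12s} coefficients: {np.round(model.coef_, 2)}")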
Regularizers Sum up

• [Plot of the regularizer functions: L1 (red), L2 (blue), L1 & L2 combined (yellow)]


Linear regression

• Sum up
• Simple approach
• It is used to predict a continuous dependent variable
• Works very well in practice
• Sometimes preferred when there is little training data
• Interpretable model

47
Classification as
regression

48
Classification vs. Regression

Given training pairs (x^(i), y^(i)), find f such that f(x^(i)) ≈ y^(i)

Binary classification: y ∈ {−1, +1} (or {0, 1})
Multiclass classification: y ∈ {1, …, C}
Regression: y ∈ ℝ

• Empirical error of f on the training set, given a loss function ℓ:
(1/m) Σ_{i=1}^{m} ℓ(f(x^(i)), y^(i))

49
Classification as Regression (1/2)

• Idea: suppose that we have a binary classification problem

• Assuming the standard model for linear regression, f(x) = θ^T x

• How to obtain θ?
– Use the least-squares solution presented earlier

• How do I compute the label for a new example?

50
Classification as Regression (2/2)

• One-dimensional example. The classifier has the form

f(x, w) = w0 + w^T x

• A reasonable decision rule is

y = +1 if f(x, w) ≥ 0, and y = −1 otherwise

51
Decision Rules

• How can I mathematically write this rule?

y = sign(f(x, w)) = sign(w0 + w^T x)

• This specifies a linear classifier: it has a linear boundary (a hyperplane)
which separates the space into two half-spaces

52
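
Illustrative sketch (not from the slides): the "classification as regression" idea in NumPy — fit least squares against ±1 labels and classify with the sign of the prediction. The 2-D toy data and variable names are my own illustrative choices.

import numpy as np

rng = np.random.default_rng(5)

# Two Gaussian blobs labelled -1 and +1 (toy 2-D data)
X_pos = rng.normal(loc=[+2, +2], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Add the intercept feature and fit ordinary least squares to the +/-1 targets
Xb = np.column_stack([np.ones(len(X)), X])
w = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Decision rule: y = +1 if f(x, w) >= 0, -1 otherwise (linear boundary: a line in 2-D)
preds = np.where(Xb @ w >= 0, 1, -1)
print("Training accuracy:", np.mean(preds == y))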
Example in 1-D

In 1-D, the linear boundary is simply a threshold

53
Example in 2-D

In 2-D, the linear boundary is a line

54
Example in 3-D

In 3-D, the linear boundary is a plane

55
Classification vs Regression

• The predictive value is continuous and not probabilistic

• Sensitive to imbalanced data when using linear regression for

classification

56
Logistic regression

57
Logistic Regression (1/2)

• Focus on binary classification. Logistic regression aims to model p(y = 1 | x)
by training a classifier of the form y(x) = f(θ^T x)

• Intuitively, it does not make sense for y^(i) to take values larger than 1
or smaller than 0
– Note that the sign() function presented before (classification as a regression task) is
not very useful here

58
Logistic Regression (2/2)

• How do we turn a real-valued expression z = θ^T x
into a probability?

• Replace the sign() with the sigmoid or logistic function:

σ(z) = 1 / (1 + e^(−z)),   where z = θ^T x

[Plot: the sigmoid σ(z) as a function of z]
59
Problem Formulation

• Given the logistic regression model, how do we fit θ for it?

• Probabilistic interpretation (Bernoulli distribution – success/failure):
– p(y=1|x;θ) = y(x)   [probability of success (i.e., y=1)]

– p(y=0|x;θ) = 1 − y(x)   [probability of failure (i.e., y=0)]

– Combined: p(y|x;θ) = (y(x))^y (1 − y(x))^(1−y)

Recall that y(x) is the hypothesis σ(θ^T x)
(the probability that the label of x is 1)

• How to do training?
– Assuming that the m training examples were generated independently, we can
write down the likelihood of the parameters:

L(θ) = Π_{i=1}^{m} p(y^(i)|x^(i);θ) = Π_{i=1}^{m} (y(x^(i)))^{y^(i)} (1 − y(x^(i)))^{1−y^(i)}
60
Likelihood Maximization

• We can learn the parameters of the model by maximizing the likelihood

(Maximum Likelihood Estimation (MLE))
• In practice we maximize the log-likelihood log L(θ), which is a concave function
• log is monotonically increasing: the log of a function achieves its
maximum value at the same point as the function itself

• How do we maximize the likelihood?

– Take the derivative and do gradient ascent (maximization problem)
– Or, take the negative of the log-likelihood function −log L(θ) and perform
(stochastic) gradient descent

Source: https://homes.cs.washington.edu/~marcotcr/blog/concavity/ 61
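
Illustrative sketch (not from the slides): the MLE training loop in NumPy — gradient ascent on the log-likelihood of the Bernoulli model p(y=1|x) = σ(θ^T x). The learning rate, iteration count, and toy data are illustrative assumptions, not values from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, iters=2000):
    """Gradient ascent on the log-likelihood of p(y=1|x) = sigmoid(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)          # predicted probability that y = 1
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        theta += lr * grad / len(y)     # ascent step (averaged over the m examples)
    return theta

# Toy 1-D data: labels in {0, 1}, with an intercept feature x1 = 1
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=200)
p_true = sigmoid(1.0 + 2.0 * x)
y = (rng.uniform(size=200) < p_true).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta = fit_logistic_regression(X, y)
print("Estimated theta:", theta)        # roughly [1, 2]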
Summary

• Probabilistic interpretation of classification


– We can have something similar for linear regression
• Extension to multiple classes
– Softmax regression
– Other ways to deal with multiclass classification?
• Quick to train
• Fast at classification
• Good accuracy for many simple datasets
• Resistant to overfitting
– Regularized logistic regression

Softmax regression: http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/


62
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

63
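
Illustrative usage sketch (not from the slides) for the scikit-learn class referenced above; the synthetic dataset and the regularization strength C are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the inverse of the regularization strength (regularized logistic regression)
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Predicted probabilities for the first test example:", clf.predict_proba(X_test[:1]))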
Next Class

• Probabilistic classifiers
• Refresh your knowledge of Bayes' rule

64
Thank You!

http://www.travelandleisure.com/
65
