
Regression Analysis

Dr. Venkateswara Rao

Note: These slides were assembled by Dr. K. V. Rao, with grateful acknowledgement of the many others who made their course materials available online.
Outline
• Problem Formulation
• Linear Regression Model
• Optimization
– Gradient descent algorithm
– Mini-batch gradient descent algorithm
– Stochastic gradient descent algorithm
• Vectorization
• Closed Form Solution
• Practical advice
• Probabilistic interpretation of Linear Regression.
Problem Formulation
• Given a training set of pairs $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)}$ is a continuous target variable, the task is to predict $y^{(m+j)}$ for new inputs $x^{(m+j)}$, $j \geq 1$.
Let us start with an Example
• How do we predict housing prices?
– Collect data regarding housing prices and how they relate to size in square feet.
Example problem

• "Given this data, a friend has a house 750 square feet


- how much can they be expected to get?“

• Straight line through data


– Maybe $150 000

• Second order polynomial


– Maybe $200 000
Linear Regression
Linear Regression
• With our training set defined - how do we use it?
– Take training set
– Pass into a learning algorithm
– Algorithm outputs a function (denoted h )
– This function takes an input (e.g. size of new house)
• Tries to output the estimated value of Y
• How do we represent hypothesis h ?
– We will represent h as hθ(x) = θ0 + θ1x
– Means Y is a linear function of x
– θi are parameters

• A linear regression with one variable is also called univariate linear regression
• So in summary
– A hypothesis takes in some variable
– Uses parameters determined by a learning system
– Outputs a prediction based on that input
Which line is best ?
• Many lines are possible !!
Which is the best?
• A cost function lets us figure out
how to fit the best straight line
to our data.
• What makes a line different ?
– Parameters θ0, θ1
• Which is the best line ?
– The line that minimizes the difference between the actual
and estimated prices.
• What is our objective ?
– Choose these parameters θ0, θ1 so that hθ(x) is close to y
for our training examples, i.e. minimize the difference
between h(x) and y for each/any/every example.
Objective
• Loss function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$
• $\underset{\theta_0, \theta_1}{\text{Minimize}}\; J(\theta)$
• How do we achieve this?
– That is where gradient descent helps us! (A small sketch of computing this loss is given below.)
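As a concrete illustration, here is a minimal Python/NumPy sketch of this loss; the data values and the function name compute_cost are made up for the example.

import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x          # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy housing-style data (hypothetical: sizes in square feet vs. price in $1000s)
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(compute_cost(0.0, 0.2, x, y))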
Gradient

• The gradient is just the vector of partial derivatives


Graphical representation of a
convex cost function
[Figure: plot of the convex cost function J(θ) versus θ1]

• If the slope is negative we have to increment θ1 to reach the minimum value of J(θ).
• If the slope is positive we have to decrement θ1 to reach the minimum value of J(θ).
Is this okay?
Gradient Descent Algorithm
• Initialize θ0, θ1
• Repeat the following until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (simultaneously for j = 0 and j = 1)
– That is, when the gradient is negative we are incrementing θj, and when the gradient is positive we are decrementing θj.
• j = 0: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)$
• j = 1: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) \cdot x^{(i)}$
Gradient Descent
• So, gradient descent is an optimization algorithm used to find the values of the parameters of a function that minimize a cost function.
• Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm. A minimal code sketch of the algorithm is given below.
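Below is a minimal Python/NumPy sketch of batch gradient descent for the univariate hypothesis hθ(x) = θ0 + θ1x, on made-up housing-style numbers. Note the tiny learning rate, which the unscaled feature values force; feature scaling (discussed later) avoids this.

import numpy as np

def gradient_descent(x, y, alpha=1e-7, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(error) / m                  # dJ/dtheta0
        grad1 = np.sum(error * x) / m              # dJ/dtheta1
        theta0 -= alpha * grad0                    # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(gradient_descent(x, y))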
Multivariate Linear Regression
• Linear Regression with multiple input variables/features.

• More notations
– n: number of features (n = 4)
– m : number of examples (i.e. number of rows in a table)
– x^(i): the vector of inputs for an example (so a vector of the four feature values for the ith input example)
• x^(3) is, for example, the 3rd house, and contains the four features associated with that house
– x_j^(i): the value of feature j in the ith training example
• x_2^(3) is, for example, the number of bedrooms in the third house
What is the form of our hypothesis?

• Previously our hypothesis took the following form.


– hθ(x) = θ0 + θ1x
• Here we have two parameters (theta 0 and theta 1) determined by our cost function
• One variable x
• Now we have multiple features
– hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

• Let us take x0 = 1 for convenience of notation
– hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
– hθ(x) = θᵀx (a small numerical sketch of this follows)
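As a small illustration of the x0 = 1 trick, here is a Python/NumPy sketch; the θ values and feature values are hypothetical and chosen only to show the dot-product form of the hypothesis.

import numpy as np

theta = np.array([80.0, 0.1, 10.0, -5.0, -0.5])   # hypothetical theta_0 ... theta_4
x = np.array([2104.0, 5.0, 1.0, 45.0])            # raw features x1 .. x4
x_with_bias = np.concatenate(([1.0], x))          # prepend x0 = 1
prediction = theta @ x_with_bias                  # h_theta(x) = theta^T x
print(prediction)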
Gradient descent for multiple variables

• Our cost function is
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$
• Gradient descent: repeat until convergence
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneously update every j = 0, …, n)

Gradient descent for multiple variables

• When n = 1, this reduces to the two univariate update rules given earlier.
• When n ≥ 1, the updates are
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$ for j = 0, 1, …, n
Gradient Descent in practice 1: Feature Scaling

• Range differences between features
– x1 = size (0 - 2000 square feet)
– x2 = number of bedrooms (1 - 5)
– This means the contours generated if we plot θ1 vs. θ2 have a very tall and thin shape due to the huge range difference
• Running gradient descent on this kind of cost
function can take a long time to find the global
minimum
• Feature scaling
– If you have a problem with multiple features
– You should make sure those features have a similar scale
• Means gradient descent will converge more quickly
Some feature scaling methods
• Max/Min Scaling
• Mean Normalization
• Z-score Scaling
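A minimal Python/NumPy sketch of the three scaling methods, using made-up house sizes:

import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # e.g. house sizes in square feet

# Max/Min scaling: maps values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: subtract the mean, divide by the range
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Z-score scaling: subtract the mean, divide by the standard deviation
z_score = (x - x.mean()) / x.std()

print(min_max, mean_norm, z_score, sep="\n")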
Learning Rate α
• Focus on the learning rate (α)
– How to choose α?
• Make sure gradient descent is working
• Plot J(θ) vs. the number of iterations
– (i.e. plot J(θ) over the course of gradient descent)
• If gradient descent is working, then J(θ) should decrease after every iteration
Learning Rate α
• Checking it is working: if you plot J(θ) vs. iterations and see the value increasing, you probably need a smaller α.
– The cause is that you are minimizing a function which looks like this
– But you overshoot, so reduce the learning rate so that you actually reach the minimum (green line). A small sketch of tracking J(θ) per iteration follows.

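A minimal Python/NumPy sketch (hypothetical data) of recording J(θ) at every iteration so it can be plotted against the iteration count; if the recorded values increase, α is too large.

import numpy as np

def gradient_descent_with_history(X, y, alpha, iterations=200):
    """Track J(theta) each iteration so it can be plotted against iteration count."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iterations):
        error = X @ theta - y
        history.append((error @ error) / (2 * m))      # J(theta) this iteration
        theta -= (alpha / m) * (X.T @ error)
    return theta, history

# Hypothetical scaled data; if history is increasing, alpha is too large.
X = np.c_[np.ones(4), np.array([0.9, 0.3, 0.4, -0.2])]
y = np.array([460.0, 232.0, 315.0, 178.0])
theta, history = gradient_descent_with_history(X, y, alpha=0.1)
print(history[0], history[-1])   # J should decrease after every iteration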
Vector Notation
• $X = \begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}_{m \times (n+1)} \qquad Y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}_{m \times 1}$

• $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{(n+1) \times 1} \qquad h_\theta(x) = x_0\theta_0 + \cdots + x_n\theta_n$, so the vector of all predictions is $X\theta$

• $J(\theta) = \frac{1}{2m}\, (X\theta - Y)^T_{1 \times m}\, (X\theta - Y)_{m \times 1}$
Vectorization

• $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \big( x^{(i)}\theta - y^{(i)} \big)\, x_j^{(i)}$

• $\Theta_{(n+1) \times 1} := \Theta_{(n+1) \times 1} - \frac{\alpha}{m}\, X^T_{(n+1) \times m}\, \big( X\theta - Y \big)_{m \times 1}$ (a code sketch of this update follows)
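The vectorized update translates almost directly into code. A minimal Python/NumPy sketch with a hypothetical design matrix:

import numpy as np

def vectorized_gradient_step(theta, X, y, alpha):
    """One update: Theta := Theta - (alpha/m) * X^T (X theta - Y)."""
    m = X.shape[0]
    return theta - (alpha / m) * (X.T @ (X @ theta - y))

# Hypothetical design matrix with the leading column of ones (x0 = 1).
X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.8]])
y = np.array([2.0, 1.0, 3.0])
theta = np.zeros(2)
for _ in range(500):
    theta = vectorized_gradient_step(theta, X, y, alpha=0.1)
print(theta)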
Closed form solution
• Given X and Y, our aim is to find θ so that Y = Xθ.
• Y = Xθ
• $X^T Y = X^T X \theta$ (multiply both sides by $X^T$)
• $(X^T X)^{-1} X^T Y = \theta$ (a small numerical sketch follows)
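A minimal Python/NumPy sketch of the normal equation on the small example matrix used earlier; np.linalg.solve is used rather than forming the inverse explicitly.

import numpy as np

# Hypothetical data: X includes the column of ones, y is the target vector.
X = np.array([[1.0, 2104.0], [1.0, 1416.0], [1.0, 1534.0], [1.0, 852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y; solving the linear system avoids an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)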
Why not closed form?
• The problem with this operation is the time complexity of calculating the inverse of an n×n matrix, which is O(n³); as n increases it can take a very long time to finish.
• When n is small (say n < 1000 or n < 10000) you can consider the normal equation the better option for calculating theta; for larger values Gradient Descent is much faster, so the main reason is time.
• Closed form works better when the input size is smaller. No need to choose α and no need to iterate.
Batch, Stochastic and Mini-Batch Gradient
Descent
• In Gradient Descent, or Batch Gradient Descent, we use the whole training data set for updating theta (or w).
• In Stochastic Gradient Descent, we use only a single training example for updating theta.
• Mini-batch Gradient Descent lies between these two extremes: we use a mini-batch (a small portion) of the training data for updating theta. A sketch contrasting the three update schemes follows.
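A minimal Python/NumPy sketch of mini-batch gradient descent on hypothetical data; setting batch_size to m recovers batch gradient descent and setting it to 1 recovers stochastic gradient descent.

import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=2, epochs=100, seed=0):
    """Mini-batch GD: each update uses only a small slice of the training data."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                 # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * grad
    return theta

X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.8], [1.0, 0.1]])
y = np.array([2.0, 1.0, 3.0, 1.5])
print(minibatch_gradient_descent(X, y))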
Linear Regression – Probabilistic Interpretation
• Assume our Y is generated by
Y = θᵀx + M = hθ(x) + M,
where M is Gaussian noise with mean 0 and variance 1.
• So the training data is
$\big(x^{(1)}, h_\theta(x^{(1)}) + M^{(1)}\big), \big(x^{(2)}, h_\theta(x^{(2)}) + M^{(2)}\big), \ldots, \big(x^{(m)}, h_\theta(x^{(m)}) + M^{(m)}\big)$,
where $M^{(1)}, M^{(2)}, \ldots, M^{(m)}$ are independent random variables with mean 0 and variance 1.
Linear Regression – Probabilistic Interpretation
• A Gaussian RV Z with mean µ and variance σ² has p.d.f.
$f_Z(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$
• So, we are assuming: $f_M(m) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{m^2}{2}}$
• Equivalently, $f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\big(y - h_\Theta(x)\big)^2}{2}}$
• The likelihood $f_{D|\Theta}(d|\Theta)$ of the training data d is therefore:
$f_{D|\Theta}(d|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\big(y^{(i)} - h_\Theta(x^{(i)})\big)^2}{2}}$
• The log likelihood is
$\log f_{D|\Theta}(d|\Theta) = m \log \frac{1}{\sqrt{2\pi}} - \sum_{i=1}^{m} \frac{\big(y^{(i)} - h_\Theta(x^{(i)})\big)^2}{2}$
• The maximum likelihood estimate of Θ therefore maximizes
$-\sum_{i=1}^{m} \big(y^{(i)} - h_\Theta(x^{(i)})\big)^2$, i.e. it
• minimizes $\sum_{i=1}^{m} \big(y^{(i)} - h_\Theta(x^{(i)})\big)^2$ over Θ.
Why is Probabilistic Interpretation important?
• Casting an ML approach within a statistical framework
clarifies the assumptions that have been made (perhaps
implicitly). E.g. in linear regression:
– Noise is additive: Y = θT X + M
– Noise on each observation is independent and identically
distributed
– Noise is Gaussian – it is this which leads directly to the use of a squared loss $\big(y - h_\Theta(x)\big)^2$.
– Changing the noise model would lead to a different loss function.
• We can leverage the wealth of results and approaches
from probability/statistics, and perhaps gain new insights.
E.g. in linear regression:
– Without regularization, our estimate of θ is the maximum
likelihood estimate.
Polynomial Regression

Note: These slides were assembled by Dr. K. V. Rao, with grateful acknowledgement of the many others who made their course materials available online.
Outline
• Choice of features
• Polynomial Regression
• Overfitting
• Regularization
• Lasso Regression
• Ridge Regression
• ElasticNet Regression
Choice of features
• How you can get different learning algorithms by
choosing appropriate features
• Polynomial regression for non-linear function
• Example
– House price prediction
• Two features
– Frontage - width of the plot of land along road (x1)
– Depth - depth away from road (x2)
– You don't have to use just two features
• Can create new features
– Might decide that an important feature is the land area
• So, create a new feature = frontage * depth (x3)
• h(x) = θ0 + θ1x3
– Area is a better indicator
– Often, by defining new features you may get a better model
Polynomial Regression
• Polynomial regression
– May fit the data better
– θ0 + θ1x + θ2x², e.g. here we have a quadratic function
– For housing data could use a quadratic function
• But may not fit the data so well - inflection point means housing prices decrease when
size gets really big
• So instead must use a cubic function
• How do we fit the model to this data
– To map our old linear hypothesis and cost functions to these polynomial
descriptions the easy thing to do is set
• x1 = x, x2 = x², x3 = x³
– By selecting the features like this and applying the linear regression algorithms
you can do polynomial linear regression
– Remember, feature scaling becomes even more important here
• Instead of a conventional polynomial you could use variable^(1/something), i.e. square root, cube root, etc.
• Lots of features - later we will look at developing an algorithm to choose the best features. A minimal sketch of polynomial regression follows.
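A minimal Python/NumPy sketch of polynomial regression by constructing the features x, x², x³ and then fitting a linear model (here with a least-squares solver); the sin(2πx) data mirrors the curve-fitting example on the next slide, and the degree is an arbitrary choice for illustration.

import numpy as np

# Hypothetical sketch: build polynomial features and reuse linear regression.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).standard_normal(10)

degree = 3
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, x^3

# Feature scaling matters here because x and x^3 can have very different ranges;
# for this toy x in [0, 1] the ranges are comparable, so we fit directly.
theta = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)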
Polynomial Curve Fitting

[Figure: a training data set of m = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t; the green curve shows the function sin(2πx) used to generate the data.]

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_M x^M$

What M should we choose? → Model Selection
Given M, what θ's should we choose? → Parameter Selection


Sum-of-Squares Error Function

Loss function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2$

Note: We use θ and w interchangeably; here $t_n = y^{(i)}$ and $y(x_n, w) = h_\theta(x^{(i)})$.


0th Order Polynomial

How do M, the quality of the fit, and the capability to generalize relate to each other?
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Data set size: m = 15, 9th order polynomial
Data set size: m = 100, 9th order polynomial

Increasing the size of the data set alleviates the over-fitting problem.
Polynomial Coefficients
Regularization

• Penalize large coefficient values

Loss function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2 + \lambda \lVert \theta \rVert_l$

Idea: penalize high weights that contribute to high variance and sensitivity to outliers.
Lasso regression
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2 + \lambda \lVert \theta \rVert_1$
• L1-norm
• w = θ
Ridge Regression
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2 + \frac{\lambda}{2} \lVert \theta \rVert_2^2$
• L2-norm
• w = θ = β
ElasticNet Regression
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2 + \lambda_1 \lVert \theta \rVert_1 + \lambda_2 \lVert \theta \rVert_2^2$
• A short sketch comparing the three penalties follows.
l-norm for different values of l
• From left to right l = 4, l = 2, l = 1, l = 0.5, l = 0.1.
L2-norm Regularization: 9th Order Polynomial
L2-norm Regularization: Polynomial Coefficients as the weight of regularization increases
Reference Books
Thank you !

Happy Learning !!
