Regression Analysis
Dr. Venkateswara Rao
Note: These slides were assembled by Dr. K. V. Rao, with grateful acknowledgement of the many others who made their course materials available online.
Outline
• Problem Formulation
• Linear Regression Model
• Optimization
– Gradient descent algorithm
– Mini-batch gradient descent algorithm
– Stochastic gradient descent algorithm
• Vectorization
• Closed Form Solution
• Practical advice
• Probabilistic interpretation of Linear Regression.
Problem Formulation
• Given a training set of pairs
$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, where
$x^{(i)} \in \mathbb{R}^d$ and $y^{(i)}$ is a continuous target variable, the
task is to predict $y^{(m+j)}$ for new inputs $x^{(m+j)}$, $j \ge 1$.
Let us start with an Example
• How do we predict housing prices
– Collect data regarding housing prices and how they relate
to size in square feet.
Example problem
• "Given this data, a friend has a house 750 square feet
- how much can they be expected to get?“
• Straight line through data
– Maybe $150,000
• Second order polynomial
– Maybe $200,000
Linear Regression
• With our training set defined - how do we use it?
– Take training set
– Pass into a learning algorithm
– Algorithm outputs a function (denoted h )
– This function takes an input (e.g. size of new house)
• Tries to output the estimated value of Y
• How do we represent hypothesis h ?
– We will represent h as hθ(x) = θ0 + θ1x
– This means y is a linear function of x
– The θi are the parameters
• A linear regression with one variable is also called univariate linear
regression
• So in summary
– A hypothesis takes in some variable
– Uses parameters determined by a learning system
– Outputs a prediction based on that input
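A minimal sketch of this hypothesis in Python (not from the slides; the parameter values are illustrative, not learned):

```python
# Univariate hypothesis h_theta(x) = theta0 + theta1 * x.
def h(theta0, theta1, x):
    """Predict the target (e.g. house price) for an input x (e.g. size in square feet)."""
    return theta0 + theta1 * x

# Example: parameters produced by some learning algorithm, then used for a prediction.
theta0, theta1 = 50.0, 0.13       # illustrative values only
print(h(theta0, theta1, 750))     # predicted price for a 750 sq. ft. house
```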
Which line is best ?
• Many lines are possible !!
Which is the best?
• A cost function lets us figure out
how to fit the best straight line
to our data.
• What makes a line different ?
– Parameters θ0, θ1
• Which is the best line ?
– The line that minimizes the difference between the actual
and estimated prices.
• What is our objective ?
– Choose these parameters θ0, θ1 so that hθ(x) is close to y
for our training examples, i.e. minimize the difference
between h(x) and y for each/any/every example.
Objective
• Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• $\underset{\theta_0,\,\theta_1}{\text{Minimize}}\; J(\theta)$
• How do we achieve this ?
– That is where gradient descent helps us !
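A minimal sketch of this squared-error cost in Python (illustrative; `x` and `y` are assumed to be lists of training inputs and targets):

```python
# Squared-error cost J(theta) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
def cost(theta0, theta1, x, y):
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Tiny toy data set: sizes in sq. ft., prices in $1000s.
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]
print(cost(50.0, 0.13, x, y))
```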
Gradient
• The gradient is just the vector of partial derivatives
Graphical representation of a
convex cost function
(Figure: the convex cost $J(\theta)$ plotted against $\theta_1$.)
• If the slope is negative we have
to increment $\theta_1$ to reach
the minimum value of $J(\theta)$.
• If the slope is positive we have
to decrement $\theta_1$ to reach
the minimum value of $J(\theta)$.
Is this okay?
Gradient Descent Algorithm
• Initialize $\theta_0, \theta_1$.
• Repeat the following until convergence (updating both parameters simultaneously):
$\theta_j := \theta_j - \alpha\,\dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, for $j = 0, 1$
– That is, when the gradient is negative we increment $\theta_j$,
and when the gradient is positive we decrement $\theta_j$.
• The partial derivatives are:
– $j=0$: $\dfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
– $j=1$: $\dfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$
Gradient Descent
• So, gradient descent is an optimization
algorithm used to find the values of the
parameters of a function that minimize a
cost function.
• Gradient descent is best used when the
parameters cannot be calculated
analytically (e.g. using linear algebra) and
must be searched for by an optimization
algorithm.
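A minimal gradient-descent sketch for the univariate case, using the two partial derivatives above (illustrative Python; the defaults for `alpha` and `iters` are arbitrary choices, not values from the slides):

```python
def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Univariate linear regression fit by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    history = []                                            # J(theta) per iteration, for diagnostics
    for _ in range(iters):
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m # dJ/dtheta1
        theta0 -= alpha * grad0                             # simultaneous update
        theta1 -= alpha * grad1
        history.append(sum(e ** 2 for e in errors) / (2 * m))
    return theta0, theta1, history

# With un-scaled features (e.g. sizes in the thousands), alpha must be made much
# smaller or the iteration diverges -- see the feature-scaling slides below.
# If alpha is small enough, `history` decreases on every iteration.
```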
Multivariate Linear Regression
• Linear Regression with multiple input variables/features.
• More notations
– n: number of features (n = 4)
– m : number of examples (i.e. number of rows in a table)
– $x^{(i)}$: the input vector for the $i$-th example (so a vector of the four
features for the $i$-th input example)
• $x^{(3)}$, for example, is the 3rd house, and contains the four features
associated with that house
– $x_j^{(i)}$: the value of feature $j$ in the $i$-th training example
• $x_2^{(3)}$, for example, is the number of bedrooms in the third house
What is the form of our hypothesis?
• Previously our hypothesis took the following form.
– hθ(x) = θ0 + θ1x
• Here we had two parameters (θ0 and θ1) determined by
our cost function
• One variable x
• Now we have multiple features
– hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
• Let us take x0 = 1 for convenience of notation
– hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
– hθ(x) = θᵀx
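A brief NumPy sketch of the x0 = 1 convention and the resulting dot-product form (illustrative; the array values are made up):

```python
import numpy as np

x = np.array([2104, 5, 1, 45])                   # one example: size, bedrooms, floors, age
x = np.insert(x, 0, 1)                            # prepend x0 = 1  ->  [1, 2104, 5, 1, 45]
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])    # theta0 .. theta4 (illustrative values)
prediction = theta @ x                            # h_theta(x) = theta^T x
print(prediction)
```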
Gradient descent for multiple variables
• Our cost function is
$J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• Gradient descent: repeat until convergence
$\theta_j := \theta_j - \alpha\,\dfrac{\partial}{\partial \theta_j} J(\theta)$ (simultaneously for every $j = 0, \dots, n$)
Gradient descent for multiple variables
• When $n = 1$ this reduces to the two update rules given earlier.
• When $n \ge 1$ the update for every $j$ is
$\theta_j := \theta_j - \alpha\,\dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Gradient Descent in practice 1: Feature
Scaling
• Range difference of features
– x1 = size (0 - 2000 square feet)
– x2 = number of bedrooms (1-5)
– Means the contours generated if we
plot θ1 vs. θ2 give a very tall and thin
shape due to the huge range difference
• Running gradient descent on this kind of cost
function can take a long time to find the global
minimum
• Feature scaling
– If you have a problem with multiple features
– You should make sure those features have a similar scale
• Means gradient descent will converge more quickly
Some feature scaling methods
• Max/Min Scaling
• Mean Normalization
• Z-score Scaling
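A sketch of the three scaling methods in NumPy (illustrative function names; the example matrix reuses the size/bedrooms features from above):

```python
import numpy as np

def min_max_scale(x):
    """Max/Min scaling: map each feature (column) into the range [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def mean_normalize(x):
    """Mean normalization: subtract the column mean, divide by the column range."""
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def z_score_scale(x):
    """Z-score scaling: subtract the column mean, divide by the column standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
print(z_score_scale(X))   # the size and bedroom columns now have comparable ranges
```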
Learning Rate α
• Focus on the learning rate (α)
– How do we choose α ?
• Make sure gradient descent is
working
• Plot J(θ) vs. the number of iterations
(i.e. plot J(θ) over the course
of gradient descent)
• If gradient descent is working
then J(θ) should decrease after
every iteration
Learning Rate α
• Checking that it is working: if you plot J(θ) vs. iterations and see
the value increasing, you probably need a
smaller α
– The cause is that you are minimizing a bowl-shaped function,
– but with too large an α you overshoot, so reduce the learning rate so you actually reach
the minimum
Vector Notation
$$X = \begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}_{m \times (n+1)} \qquad Y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}_{m \times 1}$$

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{(n+1) \times 1} \qquad h_\theta(x) = x_0\theta_0 + \dots + x_n\theta_n, \qquad h_\theta(X) = X\theta$$

$$J(\theta) = \frac{1}{2m}\,(X\theta - Y)^{T}_{1 \times m}\,(X\theta - Y)_{m \times 1}$$
Vectorization
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right) x_j^{(i)}$$

$$\Theta_{(n+1)\times 1} := \Theta_{(n+1)\times 1} - \frac{\alpha}{m}\, X^{T}_{(n+1)\times m}\,\left(X\theta - Y\right)_{m\times 1}$$
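The vectorized update above translates almost line for line into NumPy (a sketch, assuming `X` already contains the x0 = 1 column and the features are scaled; `alpha` and `iters` are illustrative defaults):

```python
import numpy as np

def vectorized_gradient_descent(X, Y, alpha=0.1, iters=5000):
    """Batch gradient descent with the update  theta := theta - (alpha/m) * X^T (X theta - Y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        theta -= (alpha / m) * X.T @ (X @ theta - Y)
    return theta
```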
Closed form solution
• Given X and Y, our aim is to find 𝜃 so that Y = 𝑋𝜃.
• $Y = X\theta$
• Multiply both sides by $X^{T}$ (in general $X$ is not square, so $X^{-1}$ does not exist):
$X^{T}Y = X^{T}X\,\theta$
• $\theta = (X^{T}X)^{-1} X^{T} Y$
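A direct NumPy sketch of this normal-equation solution (illustrative; in practice `np.linalg.lstsq` or a pseudo-inverse is preferred to an explicit inverse for numerical stability):

```python
import numpy as np

def normal_equation(X, Y):
    """Closed-form solution  theta = (X^T X)^(-1) X^T Y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ Y   # pinv is more robust than inv if X^T X is singular

# Example: the same design matrix as in the vector-notation slide (first column is x0 = 1).
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
Y = np.array([460, 232, 315, 178], dtype=float)
print(normal_equation(X, Y))
```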
Why not closed form?
• The problem with this operation is the time
complexity of computing the inverse of an n×n matrix,
which is O(n³); as n increases it can take a very
long time to finish.
• When n is small (say n < 1000 or n < 10000), the
normal equation is the better option for calculating
theta; for larger n, gradient descent is
much faster, so the only reason to avoid the closed form is time.
• Closed form works better when the input size is
smaller. No need to choose 𝛼 and no need to iterate.
Batch, Stochastic and Mini-Batch Gradient
Descent
• In Gradient Descent, or Batch Gradient Descent, we use the
whole training set for each update of theta (or w).
• In Stochastic Gradient Descent, we use only a single training
example for each update of theta.
• Mini-batch Gradient Descent lies between these two
extremes: we use a mini-batch (a small portion)
of the training data for each update of theta.
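A sketch contrasting the three schemes; only the amount of data used per update changes (illustrative NumPy; the batch size and learning rate are arbitrary choices):

```python
import numpy as np

def minibatch_gradient_descent(X, Y, alpha=0.01, epochs=100, batch_size=32):
    """batch_size = m gives batch GD; batch_size = 1 gives stochastic GD."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)                     # shuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            theta -= (alpha / len(idx)) * Xb.T @ (Xb @ theta - Yb)
    return theta
```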
Linear Regression – Probabilistic Interpretation
• Assume our $Y$ is generated by
$Y = \theta^{T}X + M = h_\theta(x) + M$,
where $M$ is Gaussian noise with mean 0 and variance 1.
• So the training data is
$\left(x^{(1)},\, h_\theta(x^{(1)}) + M^{(1)}\right), \left(x^{(2)},\, h_\theta(x^{(2)}) + M^{(2)}\right), \dots, \left(x^{(m)},\, h_\theta(x^{(m)}) + M^{(m)}\right)$,
where $M^{(1)}, M^{(2)}, \dots, M^{(m)}$ are independent random variables with
mean 0 and variance 1.
Linear Regression – Probabilistic Interpretation
• A Gaussian RV $Z$ with mean $\mu$ and variance $\sigma^2$ has p.d.f.
$$f_Z(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$$
• So we are assuming:
$$f_M(m) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{m^2}{2}}$$
• Equivalently,
$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\left(y - h_\Theta(x)\right)^2}{2}}$$
• The likelihood $f_{D|\Theta}(d|\Theta)$ of the training data $d$ is therefore:
$$f_{D|\Theta}(d|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2}{2}}$$
• The log likelihood is
$$\log f_{D|\Theta}(d|\Theta) = m\log\frac{1}{\sqrt{2\pi}} - \sum_{i=1}^{m} \frac{\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2}{2}$$
• The maximum likelihood estimate of $\Theta$ therefore maximizes
$$\max_{\Theta}\; -\sum_{i=1}^{m}\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2$$
• and, equivalently, minimizes
$$\min_{\Theta}\; \sum_{i=1}^{m}\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2$$
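A small numeric check of this equivalence (a sketch, not from the slides): minimizing the Gaussian negative log-likelihood with a generic optimizer recovers essentially the same θ as ordinary least squares. The data here is synthetic and the helper name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # x0 = 1 column plus one feature
theta_true = np.array([2.0, 3.0])
Y = X @ theta_true + rng.normal(0.0, 1.0, 50)                 # additive Gaussian noise, variance 1

def neg_log_likelihood(theta):
    residuals = Y - X @ theta
    return 0.5 * np.sum(residuals ** 2)        # the log-likelihood up to additive constants

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_mle, theta_ls)                     # the two estimates agree
```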
Why is Probabilistic Interpretation important?
• Casting an ML approach within a statistical framework
clarifies the assumptions that have been made (perhaps
implicitly). E.g. in linear regression:
– Noise is additive: Y = θT X + M
– Noise on each observation is independent and identically
distributed
– Noise is Gaussian – it is this which leads directly to the use of the
squared loss $\left(y - h_\Theta(x)\right)^2$.
– Changing the noise model would lead to a different loss function.
• We can leverage the wealth of results and approaches
from probability/statistics, and perhaps gain new insights.
E.g. in linear regression:
– Without regularization, our estimate of θ is the maximum
likelihood estimate.
Polynomial Regression
Outline
• Choice of features
• Polynomial Regression
• Overfitting
• Regularization
• Lasso Regression
• Ridge Regression
• ElasticNet Regression
Choice of features
• How you can get different learning algorithms by
choosing appropriate features
• Polynomial regression for non-linear function
• Example
– House price prediction
• Two features
– Frontage - width of the plot of land along road (x1)
– Depth - depth away from road (x2)
– You don't have to use just two features
• Can create new features
– Might decide that an important feature is the land area
• So, create a new feature = frontage * depth (x3)
• h(x) = θ0 + θ1x3
– Area is a better indicator
– Often, by defining new features you may get a better model
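A one-line version of this feature construction (illustrative NumPy; the `frontage` and `depth` arrays are made up):

```python
import numpy as np

frontage = np.array([60.0, 40.0, 85.0, 30.0])    # width of the plot along the road (x1)
depth = np.array([100.0, 120.0, 70.0, 200.0])    # depth away from the road (x2)
area = frontage * depth                           # new feature x3 = frontage * depth
X = np.column_stack([np.ones_like(area), area])   # design matrix for h(x) = theta0 + theta1 * x3
```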
Polynomial Regression
• Polynomial regression
– May fit the data better
– θ0 + θ1x + θ2x², e.g. here we have a quadratic
function
– For the housing data we could use a quadratic function
• But it may not fit the data so well - a downward-turning quadratic predicts that housing prices decrease when
the size gets really big
• So instead we must use a cubic function
• How do we fit the model to this data
– To map our old linear hypothesis and cost functions to these polynomial
descriptions the easy thing to do is set
• x1 = x, x2 = x², x3 = x³
– By selecting the features like this and applying the linear regression algorithms
you can do polynomial linear regression
– Remember, feature scaling becomes even more important here
• Instead of a conventional polynomial you could use powers of the form
x^(1/k) - i.e. square root, cube root, etc.
• Lots of features - later we look at developing an algorithm to choose the
best features
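A sketch of the feature-mapping trick: build x, x², x³ columns and then run ordinary linear regression on them (illustrative NumPy; the least-squares solver stands in for any of the fitting methods above, and the sin(2πx) data mirrors the curve-fitting example that follows):

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input x to the columns [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

X = polynomial_features(x, degree=3)              # cubic model
theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # polynomial regression = linear regression on mapped features
print(theta)
```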
Polynomial Curve Fitting
Plot of a training data set of m = 10
points, shown as blue circles, each
comprising an observation of the
input variable x along with the
corresponding target variable t.
The green curve shows the
function sin(2πx) used to generate
the data.
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_M x^M$
What M should we choose?→Model Selection
Given M, what 𝜃′𝑠 should we choose? →Parameter Selection
Sum-of-Squares Error Function
Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Note: We use $\theta$ and $w$ interchangeably; in the figures, $t_n = y^{(i)}$ and $y(x_n, w) = h_\theta(x^{(i)})$.
0th Order Polynomial
How do M, the quality of the fit, and the ability to generalize
relate to each other?
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Data Set Size:
m = 15, 9th Order Polynomial
Data Set Size:
m = 100, 9th Order Polynomial
Increasing the size of the data sets alleviates the over-fitting problem.
Polynomial Coefficients
Regularization
• Penalize large coefficient values
Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\,\|\theta\|_l$
Idea: penalize high weights that contribute to high
variance and sensitivity to outliers.
Lasso regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\,\|\theta\|_1$
• L1-norm penalty
• $w = \theta$
Ridge Regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \dfrac{\lambda}{2}\,\|\theta\|_2^2$
• L2-norm penalty
• $w = \theta = \beta$
ElasticNet Regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda_1\,\|\theta\|_1 + \lambda_2\,\|\theta\|_2^2$
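A sketch of the ridge case, which admits a closed form; here the common convention minimize ‖Xθ − Y‖² + λ‖θ‖² is used (an assumption, differing from the slides' 1/2m scaling only by a constant factor on λ), and the intercept is conventionally left unpenalized. Lasso and elastic net have no closed form and are usually fit with iterative solvers such as coordinate descent (e.g. scikit-learn's `Lasso` and `ElasticNet`).

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Ridge regression: minimize ||X theta - Y||^2 + lam * ||theta||^2 (bias column unpenalized)."""
    n = X.shape[1]
    I = np.eye(n)
    I[0, 0] = 0.0                                 # do not penalize theta0 (the intercept)
    return np.linalg.solve(X.T @ X + lam * I, X.T @ Y)

# Example reusing the polynomial-features idea: a regularized 9th-order fit keeps the coefficients small.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
X = np.column_stack([x ** d for d in range(10)])
print(ridge_closed_form(X, y, lam=1e-3))
```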
l-norm for different values of l
• From left to right l = 4, l = 2, l = 1, l = 0.5, l = 0.1.
L2-norm Regularization:
9th Order Polynomial
Polynomial Coefficients
Weight of regularization increases
Thank you !
Happy Learning !!