Regression Analysis
Dr. Venkateswara Rao
Note: These slides were assembled by Dr. K. V. Rao, with grateful acknowledgement of the many others who made their course materials available online.
Outline
• Problem Formulation
• Linear Regression Model
• Optimization
– Gradient descent algorithm
– Mini-batch gradient descent algorithm
– Stochastic gradient descent algorithm
• Vectorization
• Closed Form Solution
• Practical advice
• Probabilistic interpretation of Linear Regression.
Problem Formulation
• Given a training set of pairs
$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, where
$x^{(i)} \in \mathbb{R}^d$ and $y^{(i)}$ is a continuous target variable, the
task is to predict $y^{(m+j)}$ for new inputs $x^{(m+j)}$, $j \ge 1$.
Let us start with an Example
• How do we predict housing prices
– Collect data regarding housing prices and how they relate
to size in square feet.
Example problem
• "Given this data, a friend has a house 750 square feet
- how much can they be expected to get?“
• Straight line through data
– Maybe $150,000
• Second order polynomial
– Maybe $200,000
Linear Regression
• With our training set defined - how do we use it?
– Take training set
– Pass into a learning algorithm
– Algorithm outputs a function (denoted h )
– This function takes an input (e.g. size of new house)
• Tries to output the estimated value of Y
• How do we represent hypothesis h ?
– We will represent h as hθ(x) = θ0 + θ1x
– This means y is a linear function of x
– The θi are the parameters
• A linear regression with one variable is also called univariate linear
regression
• So in summary
– A hypothesis takes in some variable
– Uses parameters determined by a learning system
– Outputs a prediction based on that input
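A minimal sketch of this hypothesis in Python (not from the slides; the parameter values are illustrative, not learned):

```python
# Univariate hypothesis h_theta(x) = theta0 + theta1 * x.
def h(theta0, theta1, x):
    """Predict the target (e.g. house price) for an input x (e.g. size in square feet)."""
    return theta0 + theta1 * x

# Example: parameters produced by some learning algorithm, then used for a prediction.
theta0, theta1 = 50.0, 0.13       # illustrative values only
print(h(theta0, theta1, 750))     # predicted price for a 750 sq. ft. house
```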
Which line is best ?
• Many lines are possible !!
Which is the best?
• A cost function lets us figure out
how to fit the best straight line
to our data.
• What makes a line different ?
– Parameters θ0, θ1
• Which is the best line ?
– The line that minimizes the difference between the actual
and estimated prices.
• What is our objective ?
– Choose these parameters θ0, θ1 so that hθ(x) is close to y
for our training examples, i.e. minimize the difference
between h(x) and y for each/any/every example.
Objective
• Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• $\underset{\theta_0,\,\theta_1}{\text{Minimize}}\; J(\theta)$
• How do we achieve this ?
– That is where gradient descent helps us !
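A minimal sketch of this squared-error cost in Python (illustrative; `x` and `y` are assumed to be lists of training inputs and targets):

```python
# Squared-error cost J(theta) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2
def cost(theta0, theta1, x, y):
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Tiny toy data set: sizes in sq. ft., prices in $1000s.
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]
print(cost(50.0, 0.13, x, y))
```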
Gradient
• The gradient is just the vector of partial derivatives
Graphical representation of a
convex cost function
(Figure: the convex cost $J(\theta)$ plotted against $\theta_1$.)
• If the slope is negative we have
to increment $\theta_1$ to reach
the minimum value of $J(\theta)$.
• If the slope is positive we have
to decrement $\theta_1$ to reach
the minimum value of $J(\theta)$.
Is this okay?
Gradient Descent Algorithm
• Initialize $\theta_0, \theta_1$.
• Repeat the following until convergence (updating both parameters simultaneously):
$\theta_j := \theta_j - \alpha\,\dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, for $j = 0, 1$
– That is, when the gradient is negative we increment $\theta_j$,
and when the gradient is positive we decrement $\theta_j$.
• The partial derivatives are:
– $j=0$: $\dfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
– $j=1$: $\dfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$
Gradient Descent
• So, gradient descent is an optimization
algorithm used to find the values of the
parameters of a function that minimize a
cost function.
• Gradient descent is best used when the
parameters cannot be calculated
analytically (e.g. using linear algebra) and
must be searched for by an optimization
algorithm.
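A minimal gradient-descent sketch for the univariate case, using the two partial derivatives above (illustrative Python; the defaults for `alpha` and `iters` are arbitrary choices, not values from the slides):

```python
def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Univariate linear regression fit by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    history = []                                            # J(theta) per iteration, for diagnostics
    for _ in range(iters):
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m # dJ/dtheta1
        theta0 -= alpha * grad0                             # simultaneous update
        theta1 -= alpha * grad1
        history.append(sum(e ** 2 for e in errors) / (2 * m))
    return theta0, theta1, history

# With un-scaled features (e.g. sizes in the thousands), alpha must be made much
# smaller or the iteration diverges -- see the feature-scaling slides below.
# If alpha is small enough, `history` decreases on every iteration.
```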
Multivariate Linear Regression
• Linear Regression with multiple input variables/features.
• More notations
– n: number of features (n = 4)
– m : number of examples (i.e. number of rows in a table)
– $x^{(i)}$: the input vector for the $i$-th example (so a vector of the four
features for the $i$-th input example)
• $x^{(3)}$, for example, is the 3rd house, and contains the four features
associated with that house
– $x_j^{(i)}$: the value of feature $j$ in the $i$-th training example
• $x_2^{(3)}$, for example, is the number of bedrooms in the third house
What is the form of our hypothesis?
• Previously our hypothesis took the following form.
– hθ(x) = θ0 + θ1x
• Here we had two parameters (θ0 and θ1) determined by
our cost function
• One variable x
• Now we have multiple features
– hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
• Let us take x0 = 1 for convenience of notation
– hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
– hθ(x) = θᵀx
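A brief NumPy sketch of the x0 = 1 convention and the resulting dot-product form (illustrative; the array values are made up):

```python
import numpy as np

x = np.array([2104, 5, 1, 45])                   # one example: size, bedrooms, floors, age
x = np.insert(x, 0, 1)                            # prepend x0 = 1  ->  [1, 2104, 5, 1, 45]
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])    # theta0 .. theta4 (illustrative values)
prediction = theta @ x                            # h_theta(x) = theta^T x
print(prediction)
```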
Gradient descent for multiple variables
• Our cost function is
$J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• Gradient descent: repeat until convergence
$\theta_j := \theta_j - \alpha\,\dfrac{\partial}{\partial \theta_j} J(\theta)$ (simultaneously for every $j = 0, \dots, n$)
Gradient descent for multiple variables
• When $n = 1$ this reduces to the two update rules given earlier.
• When $n \ge 1$ the update for every $j$ is
$\theta_j := \theta_j - \alpha\,\dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Gradient Descent in practice 1: Feature
Scaling
• Range difference of features
– x1 = size (0 - 2000 square feet)
– x2 = number of bedrooms (1-5)
– Means the contours generated if we
plot θ1 vs. θ2 give a very tall and thin
shape due to the huge range difference
• Running gradient descent on this kind of cost
function can take a long time to find the global
minimum
• Feature scaling
– If you have a problem with multiple features
– You should make sure those features have a similar scale
• Means gradient descent will converge more quickly
Some feature scaling methods
• Max/Min Scaling
• Mean Normalization
• Z-score Scaling
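A sketch of the three scaling methods in NumPy (illustrative function names; the example matrix reuses the size/bedrooms features from above):

```python
import numpy as np

def min_max_scale(x):
    """Max/Min scaling: map each feature (column) into the range [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def mean_normalize(x):
    """Mean normalization: subtract the column mean, divide by the column range."""
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def z_score_scale(x):
    """Z-score scaling: subtract the column mean, divide by the column standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
print(z_score_scale(X))   # the size and bedroom columns now have comparable ranges
```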
Learning Rate α
• Focus on the learning rate (α)
– How do we choose α ?
• Make sure gradient descent is
working
• Plot J(θ) vs. the number of iterations
(i.e. plot J(θ) over the course
of gradient descent)
• If gradient descent is working
then J(θ) should decrease after
every iteration
Learning Rate α
• Checking that it is working: if you plot J(θ) vs. iterations and see
the value increasing, you probably need a
smaller α
– The cause is that you are minimizing a bowl-shaped function,
– but with too large an α you overshoot, so reduce the learning rate so you actually reach
the minimum
Vector Notation
$$X = \begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}_{m \times (n+1)} \qquad Y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}_{m \times 1}$$

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{(n+1) \times 1} \qquad h_\theta(x) = x_0\theta_0 + \dots + x_n\theta_n, \qquad h_\theta(X) = X\theta$$

$$J(\theta) = \frac{1}{2m}\,(X\theta - Y)^{T}_{1 \times m}\,(X\theta - Y)_{m \times 1}$$
Vectorization
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right) x_j^{(i)}$$

$$\Theta_{(n+1)\times 1} := \Theta_{(n+1)\times 1} - \frac{\alpha}{m}\, X^{T}_{(n+1)\times m}\,\left(X\theta - Y\right)_{m\times 1}$$
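The vectorized update above translates almost line for line into NumPy (a sketch, assuming `X` already contains the x0 = 1 column and the features are scaled; `alpha` and `iters` are illustrative defaults):

```python
import numpy as np

def vectorized_gradient_descent(X, Y, alpha=0.1, iters=5000):
    """Batch gradient descent with the update  theta := theta - (alpha/m) * X^T (X theta - Y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        theta -= (alpha / m) * X.T @ (X @ theta - Y)
    return theta
```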
Closed form solution
• Given X and Y, our aim is to find 𝜃 so that Y = 𝑋𝜃.
• $Y = X\theta$
• Multiply both sides by $X^{T}$ (in general $X$ is not square, so $X^{-1}$ does not exist):
$X^{T}Y = X^{T}X\,\theta$
• $\theta = (X^{T}X)^{-1} X^{T} Y$
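A direct NumPy sketch of this normal-equation solution (illustrative; in practice `np.linalg.lstsq` or a pseudo-inverse is preferred to an explicit inverse for numerical stability):

```python
import numpy as np

def normal_equation(X, Y):
    """Closed-form solution  theta = (X^T X)^(-1) X^T Y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ Y   # pinv is more robust than inv if X^T X is singular

# Example: the same design matrix as in the vector-notation slide (first column is x0 = 1).
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
Y = np.array([460, 232, 315, 178], dtype=float)
print(normal_equation(X, Y))
```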
Why not closed form?
• The problem with this operation is the time
complexity of computing the inverse of an n×n matrix,
which is O(n³); as n increases it can take a very
long time to finish.
• When n is small (say n < 1000 or n < 10000), the
normal equation is the better option for calculating
theta; for larger n, gradient descent is
much faster, so the only reason to avoid the closed form is time.
• Closed form works better when the input size is
smaller. No need to choose 𝛼 and no need to iterate.
Batch, Stochastic and Mini-Batch Gradient
Descent
• In Gradient Descent, or Batch Gradient Descent, we use the
whole training set for each update of theta (or w).
• In Stochastic Gradient Descent, we use only a single training
example for each update of theta.
• Mini-batch Gradient Descent lies between these two
extremes: we use a mini-batch (a small portion)
of the training data for each update of theta.
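A sketch contrasting the three schemes; only the amount of data used per update changes (illustrative NumPy; the batch size and learning rate are arbitrary choices):

```python
import numpy as np

def minibatch_gradient_descent(X, Y, alpha=0.01, epochs=100, batch_size=32):
    """batch_size = m gives batch GD; batch_size = 1 gives stochastic GD."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)                     # shuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            theta -= (alpha / len(idx)) * Xb.T @ (Xb @ theta - Yb)
    return theta
```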
Linear Regression – Probabilistic Interpretation
• Assume our $Y$ is generated by
$Y = \theta^{T}X + M = h_\theta(x) + M$,
where $M$ is Gaussian noise with mean 0 and variance 1.
• So the training data is
$\left(x^{(1)},\, h_\theta(x^{(1)}) + M^{(1)}\right), \left(x^{(2)},\, h_\theta(x^{(2)}) + M^{(2)}\right), \dots, \left(x^{(m)},\, h_\theta(x^{(m)}) + M^{(m)}\right)$,
where $M^{(1)}, M^{(2)}, \dots, M^{(m)}$ are independent random variables with
mean 0 and variance 1.
Linear Regression – Probabilistic Interpretation
• A Gaussian RV $Z$ with mean $\mu$ and variance $\sigma^2$ has p.d.f.
$$f_Z(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$$
• So we are assuming:
$$f_M(m) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{m^2}{2}}$$
• Equivalently,
$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\left(y - h_\Theta(x)\right)^2}{2}}$$
• The likelihood $f_{D|\Theta}(d|\Theta)$ of the training data $d$ is therefore:
$$f_{D|\Theta}(d|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2}{2}}$$
• The log likelihood is
$$\log f_{D|\Theta}(d|\Theta) = m\log\frac{1}{\sqrt{2\pi}} - \sum_{i=1}^{m} \frac{\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2}{2}$$
• The maximum likelihood estimate of $\Theta$ therefore maximizes
$$\max_{\Theta}\; -\sum_{i=1}^{m}\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2$$
• and, equivalently, minimizes
$$\min_{\Theta}\; \sum_{i=1}^{m}\left(y^{(i)} - h_\Theta(x^{(i)})\right)^2$$
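A small numeric check of this equivalence (a sketch, not from the slides): minimizing the Gaussian negative log-likelihood with a generic optimizer recovers essentially the same θ as ordinary least squares. The data here is synthetic and the helper name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # x0 = 1 column plus one feature
theta_true = np.array([2.0, 3.0])
Y = X @ theta_true + rng.normal(0.0, 1.0, 50)                 # additive Gaussian noise, variance 1

def neg_log_likelihood(theta):
    residuals = Y - X @ theta
    return 0.5 * np.sum(residuals ** 2)        # the log-likelihood up to additive constants

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_mle, theta_ls)                     # the two estimates agree
```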
Why is Probabilistic Interpretation important?
• Casting an ML approach within a statistical framework
clarifies the assumptions that have been made (perhaps
implicitly). E.g. in linear regression:
– Noise is additive: Y = θT X + M
– Noise on each observation is independent and identically
distributed
– Noise is Gaussian – it is this which leads directly to the use of the
squared loss $\left(y - h_\Theta(x)\right)^2$.
– Changing the noise model would lead to a different loss function.
• We can leverage the wealth of results and approaches
from probability/statistics, and perhaps gain new insights.
E.g. in linear regression:
– Without regularization, our estimate of θ is the maximum
likelihood estimate.
Polynomial Regression
Outline
• Choice of features
• Polynomial Regression
• Overfitting
• Regularization
• Lasso Regression
• Ridge Regression
• ElasticNet Regression
Choice of features
• How you can get different learning algorithms by
choosing appropriate features
• Polynomial regression for non-linear function
• Example
– House price prediction
• Two features
– Frontage - width of the plot of land along road (x1)
– Depth - depth away from road (x2)
– You don't have to use just two features
• Can create new features
– Might decide that an important feature is the land area
• So, create a new feature = frontage * depth (x3)
• h(x) = θ0 + θ1x3
– Area is a better indicator
– Often, by defining new features you may get a better model
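A one-line version of this feature construction (illustrative NumPy; the `frontage` and `depth` arrays are made up):

```python
import numpy as np

frontage = np.array([60.0, 40.0, 85.0, 30.0])    # width of the plot along the road (x1)
depth = np.array([100.0, 120.0, 70.0, 200.0])    # depth away from the road (x2)
area = frontage * depth                           # new feature x3 = frontage * depth
X = np.column_stack([np.ones_like(area), area])   # design matrix for h(x) = theta0 + theta1 * x3
```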
Polynomial Regression
• Polynomial regression
– May fit the data better
– θ0 + θ1x + θ2x², e.g. here we have a quadratic
function
– For the housing data we could use a quadratic function
• But it may not fit the data so well - a downward-turning quadratic predicts that housing prices decrease when
the size gets really big
• So instead we must use a cubic function
• How do we fit the model to this data
– To map our old linear hypothesis and cost functions to these polynomial
descriptions the easy thing to do is set
• x1 = x, x2 = x², x3 = x³
– By selecting the features like this and applying the linear regression algorithms
you can do polynomial linear regression
– Remember, feature scaling becomes even more important here
• Instead of a conventional polynomial you could use powers of the form
x^(1/k) - i.e. square root, cube root, etc.
• Lots of features - later we look at developing an algorithm to choose the
best features
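A sketch of the feature-mapping trick: build x, x², x³ columns and then run ordinary linear regression on them (illustrative NumPy; the least-squares solver stands in for any of the fitting methods above, and the sin(2πx) data mirrors the curve-fitting example that follows):

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input x to the columns [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

X = polynomial_features(x, degree=3)              # cubic model
theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # polynomial regression = linear regression on mapped features
print(theta)
```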
Polynomial Curve Fitting
Plot of a training data set of m = 10
points, shown as blue circles, each
comprising an observation of the
input variable x along with the
corresponding target variable t.
The green curve shows the
function sin(2πx) used to generate
the data.
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_M x^M$
What M should we choose?→Model Selection
Given M, what 𝜃′𝑠 should we choose? →Parameter Selection
Sum-of-Squares Error Function
Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Note: We use $\theta$ and $w$ interchangeably; in the figures, $t_n = y^{(i)}$ and $y(x_n, w) = h_\theta(x^{(i)})$.
0th Order Polynomial
How do M, the quality of the fit, and the ability to generalize
relate to each other?
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Data Set Size:
m = 15, 9th Order Polynomial
Data Set Size:
m = 100, 9th Order Polynomial
Increasing the size of the data sets alleviates the over-fitting problem.
Polynomial Coefficients
Regularization
• Penalize large coefficient values
Loss function: $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\,\|\theta\|_l$
Idea: penalize high weights that contribute to high
variance and sensitivity to outliers.
Lasso regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\,\|\theta\|_1$
• L1-norm penalty
• $w = \theta$
Ridge Regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \dfrac{\lambda}{2}\,\|\theta\|_2^2$
• L2-norm penalty
• $w = \theta = \beta$
ElasticNet Regression
• $J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda_1\,\|\theta\|_1 + \lambda_2\,\|\theta\|_2^2$
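A sketch of the ridge case, which admits a closed form; here the common convention minimize ‖Xθ − Y‖² + λ‖θ‖² is used (an assumption, differing from the slides' 1/2m scaling only by a constant factor on λ), and the intercept is conventionally left unpenalized. Lasso and elastic net have no closed form and are usually fit with iterative solvers such as coordinate descent (e.g. scikit-learn's `Lasso` and `ElasticNet`).

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Ridge regression: minimize ||X theta - Y||^2 + lam * ||theta||^2 (bias column unpenalized)."""
    n = X.shape[1]
    I = np.eye(n)
    I[0, 0] = 0.0                                 # do not penalize theta0 (the intercept)
    return np.linalg.solve(X.T @ X + lam * I, X.T @ Y)

# Example reusing the polynomial-features idea: a regularized 9th-order fit keeps the coefficients small.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
X = np.column_stack([x ** d for d in range(10)])
print(ridge_closed_form(X, y, lam=1e-3))
```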
l-norm for different values of l
• From left to right l = 4, l = 2, l = 1, l = 0.5, l = 0.1.
L2-norm Regularization:
9th Order Polynomial
Polynomial Coefficients
Weight of regularization increases
Thank you !
Happy Learning !!