GradientDescent-Regression Slides

Regression is a fundamental supervised learning algorithm used to predict continuous values based on input data. It includes models like linear and polynomial regression, where the relationship between variables is established through parameters that are learned from data. Techniques such as least squares and gradient descent are employed to optimize these models and minimize prediction errors.

Uploaded by

7601nandu

Regression

● Regression is one of the most fundamental machine learning algorithms.
● Most of us have come across regression problems at some stage of our lives:
○ Predicting the price of a house given the house features
○ Modelling the relationship between height and weight
● Regression is a supervised learning problem: during training we provide the
algorithm with the true value for each data point. The trained model is then
used to predict continuous values.
Regression
Let us assume that we are given m data points D = {<x1,y1>, <x2,y2>, …, <xm,ym>}.
● Then the problem is to determine a function f such that f(x) is the best
predictor for y, with respect to D.

There are several common regression models, such as:
● Linear regression
● Polynomial regression
Linear Regression
● The simplest form, which assumes a linear relationship between the independent
variables and the dependent variable.
● Example:
○ Predict the price of a house based on factors like area, location, etc.
○ Predict height from age

The goal is to build a system that can take a vector x ∈ R^n as input and
predict the value of a scalar y ∈ R as its output. The output of linear
regression is a linear function of the input. Let ŷ be the value that our
model predicts y should take on. We define the output to be ŷ = w^T x,
where w ∈ R^n is a vector of parameters.
Linear Regression
Parameters are values that control the behavior of the system. In this case, w_i is the
coefficient that we multiply by feature x_i before summing up the contributions from all
the features. We can think of w as a set of weights that determine how
each feature affects the prediction. If a feature x_i receives a positive weight w_i,
then increasing the value of that feature increases the value of our prediction ŷ.
If a feature receives a negative weight, then increasing the value of that feature
decreases the value of our prediction. If a feature's weight is large in magnitude,
then it has a large effect on the prediction. If a feature's weight is zero, it has no
effect on the prediction.
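
A minimal sketch of the prediction ŷ = w^T x in plain Python may make the role of the weights concrete. The weights and feature values below are hypothetical, chosen only so that one weight is positive, one negative, and one zero.

```python
def predict(w, x):
    """Return y_hat = w^T x for a weight vector w and a feature vector x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical weights: positive, negative, and zero.
w = [2.0, -1.0, 0.0]
x = [3.0, 4.0, 10.0]

y_hat = predict(w, x)                          # 2*3 + (-1)*4 + 0*10
y_hat_more_x0 = predict(w, [4.0, 4.0, 10.0])   # raising feature 0 raises y_hat
y_hat_more_x1 = predict(w, [3.0, 5.0, 10.0])   # raising feature 1 lowers y_hat
y_hat_more_x2 = predict(w, [3.0, 4.0, 99.0])   # feature 2 has no effect
```

Changing the third feature leaves the prediction unchanged because its weight is zero, which matches the description above.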
Linear Regression
● Consider a simple linear regression model y = 𝛽0 + 𝛽1x + 𝜀, where
y is termed the dependent variable and x the independent variable.
● The terms 𝛽0 and 𝛽1 are the parameters of the model.
○ 𝛽0 is termed the intercept.
○ 𝛽1 is termed the slope parameter. Together these are called the regression
coefficients.
○ The error component 𝜀 accounts for the failure of the data to lie on
a straight line. It represents the difference between the true and
observed realization of y.
Least Squares Linear Regression
● Least squares linear regression is a method used to find the best-fitting
line through a set of data points. The goal is to minimize the
sum of the squared differences between the observed data points and
the predicted values on the line.
● We can write the model for each observation as yi = 𝛽0 + 𝛽1xi + 𝜀i (i = 1, 2, …, n).
● The least squares method minimizes the sum of squared differences:
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
● We have to find the values of 𝛽0 and 𝛽1 that minimize the above error.
Least Squares Linear Regression
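How to learn 𝛽0 and 𝛽1

It can be solved in different ways. We can find a closed-form solution given the
training examples.
● Line slope = cov(x, y) / var(x)
● Line intercept = mean(y) - slope * mean(x)
● The slope 𝛽1 is calculated as
$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
● The intercept 𝛽0 is calculated as
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$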
Least Squares Linear Regression
● For a linear regression model with multiple features, the relationship
between the target variable y and the input features X is expressed as:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p + \varepsilon$$
Least Squares Linear Regression
Writing this in matrix form, we get:
$$y = X\theta + \varepsilon$$
y is an n×1 vector of observed outputs.
X is an n×(p+1) matrix of inputs (independent variables), where n is the number of
observations and p+1 is the number of columns: the p features plus the intercept term,
which is typically a first column of ones.
θ is a (p+1)×1 vector of the model parameters (coefficients).
Least Squares Linear Regression
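We want to minimize the sum of squared differences between the observed outputs
and the predictions made by the linear model:
$$J(\theta) = \sum_{i=1}^{n} (y_i - x_i^T \theta)^2$$
where x_i^T is the i-th row of X. Using matrix notation we can rewrite it as
$$J(\theta) = (y - X\theta)^T (y - X\theta)$$
To find the least squares solution, we need to find the point at which the gradient of
the objective function is zero. The gradient of J(θ) with respect to θ is given by:
$$\nabla_\theta J(\theta) = -2 X^T (y - X\theta)$$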
Least Squares Linear Regression
Setting the gradient to 0 and solving for θ, we get
$$X^T X \theta = X^T y$$
The above equation is the normal equation for least squares linear regression.
The closed-form solution for linear regression is
$$\theta = (X^T X)^{-1} X^T y$$
It is important to note that the closed-form solution exists only when the matrix
X^T X is invertible. It is also best suited to smaller datasets, where matrix inversion is
computationally feasible.
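
The normal equation can be sketched in plain Python as follows. This is an illustrative implementation, not the slides' own: the helper names (`transpose`, `matmul`, `solve`, `normal_equation`) are hypothetical, the data are made up from y = 1 + 2·x1 + 3·x2 with no noise, and the linear system X^T X θ = X^T y is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
X = [[1.0] + row for row in features]  # first column of ones = intercept
y = [1.0 + 2.0 * a + 3.0 * b for a, b in features]

theta = normal_equation(X, y)  # approximately [1, 2, 3]
```

The singular-matrix check corresponds to the invertibility condition noted above: if X^T X is not invertible, the sketch raises an error instead of returning a solution.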
Least Squares Linear Regression

● In least squares linear regression, a closed-form solution allows us to
calculate the regression coefficients (e.g., β0 and β1) in one step using
matrix algebra, specifically by solving the normal equation.
Example
Let's assume we have the following data points: x = [1, 2, 3, 4, 5], y = [2, 3, 5, 4, 6].
Compute β0 and β1.
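
One way to work this example is with the slope = cov(x, y) / var(x) formula. By hand: mean(x) = 3 and mean(y) = 4; the sum of (xi − 3)(yi − 4) is 4 + 1 + 0 + 0 + 4 = 9, and the sum of (xi − 3)² is 4 + 1 + 0 + 1 + 4 = 10. So β1 = 9/10 = 0.9 and β0 = 4 − 0.9 · 3 = 1.3. The sketch below repeats the same arithmetic in Python; the function name is hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Closed-form least squares fit of y = beta0 + beta1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
beta0, beta1 = fit_simple_linear_regression(x, y)
print(f"beta0 ~ {beta0:.2f}, beta1 ~ {beta1:.2f}")
```

The fitted line is therefore ŷ = 1.3 + 0.9x, up to floating-point rounding.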
Polynomial Regression
● It is an extension of linear regression where the relationship between the
independent variable x and the dependent variable y is modeled as an nth-degree
polynomial. It allows for more complex, non-linear
relationships between the input and output variables by fitting a
polynomial curve to the data.
In polynomial regression, the model takes the form:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \varepsilon$$
Polynomial Regression
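Polynomial regression can be solved using matrix algebra. Given a design
matrix X (which contains the original and polynomial-transformed features),
the parameters θ can be estimated by:
$$\theta = (X^T X)^{-1} X^T y$$
where, for m data points x_1, …, x_m and a polynomial of degree n, X is:
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{bmatrix}$$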
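
As a small illustration of the design matrix for polynomial regression, the sketch below builds one row [1, x, x², …, xⁿ] per data point. The function name and the sample inputs are hypothetical.

```python
def polynomial_design_matrix(xs, degree):
    """Return rows [1, x, x**2, ..., x**degree], one per input value."""
    return [[x ** k for k in range(degree + 1)] for x in xs]


# Hypothetical inputs: three data points, degree-2 polynomial.
X = polynomial_design_matrix([1, 2, 3], degree=2)
for row in X:
    print(row)
```

Once X is built this way, the model is linear in the parameters, so the same least squares machinery used for linear regression applies unchanged.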
Learning Parameters: Gradient Descent
Gradient descent (GD) is a mechanism in supervised learning to learn the parameters
of a neural network by navigating the error surface in an efficient and principled
way. It is used to find the function parameters (coefficients) that minimize a loss
function.
Error surfaces, on the other hand, are graphical representations of the
relationship between the model's parameters and the corresponding error values.

Let θ be the vector of parameters, θ = [ω, b] ∈ R^2, where ω and b are randomly initialized.

Let the change in the values of ω and b be ∆θ = [∆ω, ∆b].


Learning Parameters: Gradient Descent
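● We move only by a small amount η. Moving in the direction of ∆θ in a small
step η∆θ, the resultant vector is θnew (shown in red in the figure).

θnew = θ + η · ∆θ. Now, how do we find the value of ∆θ?

Let us denote ∆θ = u. Then by the Taylor series, L(θ + ηu) can be written as

$$L(\theta + \eta u) = L(\theta) + \eta\, u^T \nabla L(\theta) + \frac{\eta^2}{2!}\, u^T \nabla^2 L(\theta)\, u + \frac{\eta^3}{3!} \dots + \frac{\eta^4}{4!} \dots$$
$$\approx L(\theta) + \eta\, u^T \nabla L(\theta) \quad [\eta \text{ is typically small, so } \eta^2, \eta^3, \dots \to 0]$$

where ∇L(θ) is the gradient vector
$$\nabla L(\theta) = \left[ \frac{\partial L}{\partial \omega}, \frac{\partial L}{\partial b} \right]^T$$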


Learning Parameters: Gradient Descent
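L(θ + ηu) ≈ L(θ) + η u^T ∇L(θ)

Note that the move (ηu) would be favorable only if L(θ + ηu) − L(θ) < 0 [i.e., if the new
loss is less than the previous loss].

This implies that u^T ∇L(θ) must be less than zero.

Let β be the angle between u and ∇L(θ). Then we know that
$$-1 \le \cos(\beta) = \frac{u^T \nabla L(\theta)}{\|u\| \, \|\nabla L(\theta)\|} \le 1$$
Multiplying throughout by k = ||u|| · ||∇L(θ)||:
$$-k \le k \cos(\beta) = u^T \nabla L(\theta) \le k$$
So L(θ + ηu) − L(θ) = η u^T ∇L(θ) = η k cos(β) is most negative when cos(β) = −1,
i.e., when the angle is 180°.

Therefore u should point in the direction opposite to the gradient.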


Gradient Descent: Example
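
As a minimal illustration of the rule that the step should point opposite to the gradient, the sketch below applies the update θ ← θ − η∇L(θ) to a hypothetical quadratic loss L(ω, b) = (ω − 3)² + (b + 1)², whose minimum is at ω = 3, b = −1. The loss, the starting point, and the learning rate are all assumptions made for illustration.

```python
def loss(theta):
    w, b = theta
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def gradient(theta):
    """Gradient of the hypothetical loss: [dL/dw, dL/db]."""
    w, b = theta
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


def gradient_descent(theta, lr=0.1, steps=100):
    history = [loss(theta)]
    for _ in range(steps):
        grad = gradient(theta)
        # u = -grad: move a small step in the direction opposite to the gradient.
        theta = [t - lr * g for t, g in zip(theta, grad)]
        history.append(loss(theta))
    return theta, history


# Fixed start instead of random initialization, to keep the run reproducible.
theta_final, losses = gradient_descent([0.0, 0.0])
print(theta_final, losses[0], losses[-1])
```

With this step size the loss shrinks at every iteration and the parameters approach [3, −1]. A much larger η would violate the small-step assumption behind the Taylor approximation and can make the loss increase.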


Batch Gradient Descent
● Batch gradient descent: gradient descent uses all n training examples for each
weight update.
In this approach, the algorithm calculates the gradient of the cost function using the
entire dataset before updating the weights (or parameters).

Single weight update per epoch: after calculating the average gradient using all
training examples, the algorithm updates the model's weights once. This process is
repeated for multiple iterations until the cost function converges to a minimum.
Batch Gradient Descent
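● Consider the example of linear regression hθ(x) = θ0 + θ1x, and let the cost function
be the mean squared error (the factor 1/2 is a common convention that simplifies the gradient):
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

To update the parameters with gradient descent, we need the gradients with respect to θ0 and θ1:
$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

Note that it uses all m training examples for each weight update.

In ML applications, where we may have millions of data points, this becomes very expensive.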
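
A minimal sketch of batch gradient descent for hθ(x) = θ0 + θ1x follows. It assumes the cost J = (1/2m) Σ (hθ(xi) − yi)², so the gradients are the averages of (hθ(xi) − yi) and (hθ(xi) − yi)·xi over all m examples. The function name, learning rate, epoch count, and zero initialization are illustrative assumptions; the data reuse the x and y values from the earlier example.

```python
def batch_gradient_descent(x, y, lr=0.05, epochs=5000):
    """Fit h(x) = theta0 + theta1 * x with one update per pass over all data."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0  # zeros instead of random init, for reproducibility
    for _ in range(epochs):
        # Prediction errors on the entire dataset.
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        # Gradients averaged over all m training examples.
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Single weight update per epoch.
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1


x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
theta0, theta1 = batch_gradient_descent(x, y)
print(f"theta0 ~ {theta0:.3f}, theta1 ~ {theta1:.3f}")
```

On this data the iterates approach the closed-form least squares values (intercept 1.3, slope 0.9). Every epoch touches all m examples, which is what makes the method expensive on very large datasets.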


Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variation of the gradient descent
algorithm where we update the model parameters (weights) using one training
example at a time.

Compared with batch gradient descent, SGD speeds things up by
updating the model's parameters after processing each individual training example.

This makes the algorithm faster and more memory-efficient, especially for large
datasets, but it can introduce some noise into the updates.

In the above cost function, take m = 1, so a weight update happens for each data
point.
Stochastic Gradient Descent
Algorithm:
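
A minimal sketch of an SGD loop for hθ(x) = θ0 + θ1x, assuming the per-example cost (1/2)(hθ(x) − y)² (the cost above with m = 1). Shuffling the examples each epoch, the fixed seed, the learning rate, and the function names are illustrative assumptions, not part of the original algorithm listing.

```python
import random


def mean_squared_error(theta0, theta1, x, y):
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def stochastic_gradient_descent(x, y, lr=0.01, epochs=2000, seed=0):
    """Update theta0, theta1 after every single training example."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order
        for i in indices:
            error = theta0 + theta1 * x[i] - y[i]
            # Gradient of the one-example cost: [error, error * x_i].
            theta0 -= lr * error
            theta1 -= lr * error * x[i]
    return theta0, theta1


x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
theta0, theta1 = stochastic_gradient_descent(x, y)
print(theta0, theta1, mean_squared_error(theta0, theta1, x, y))
```

Because each update uses a single example, the parameters keep jittering around the least squares solution rather than settling exactly on it; this is the noise mentioned above. A smaller or decaying learning rate reduces the jitter.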

Mini-Batch Gradient Descent
The algorithm updates the parameters after it sees a mini-batch of data points,
i.e., after every batch-size number of examples.

Averaging the gradients over the mini-batch gives a better estimate of the gradient
direction than a single example, and the estimate becomes more consistent as the
number of samples in the batch grows.
Mini-batch SGD is the default option for training neural networks.
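
A minimal sketch of mini-batch gradient descent for hθ(x) = θ0 + θ1x, assuming the same squared-error cost averaged over each mini-batch. The batch size of 2, the learning rate, the fixed seed, and the function names are illustrative assumptions.

```python
import random


def mean_squared_error(theta0, theta1, x, y):
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def minibatch_gradient_descent(x, y, batch_size=2, lr=0.01, epochs=2000, seed=0):
    """Update the parameters once per mini-batch of `batch_size` examples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # last batch may be smaller
            errors = [theta0 + theta1 * x[i] - y[i] for i in batch]
            # Average the gradients over the mini-batch.
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * x[i] for e, i in zip(errors, batch)) / len(batch)
            theta0 -= lr * grad0
            theta1 -= lr * grad1
    return theta0, theta1


x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
theta0, theta1 = minibatch_gradient_descent(x, y)
print(theta0, theta1, mean_squared_error(theta0, theta1, x, y))
```

Setting batch_size to 1 recovers per-example SGD, and setting it to the dataset size recovers batch gradient descent, so the batch size trades update noise against cost per update.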
