What Is Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y) and a
series of other variables (known as independent variables).
In this session, you will get a quick introduction to machine learning
models. You will learn the basics of simple linear regression and
understand how to find the best-fit line for a model and the various
parameters associated with it. After completing the session, you will have
an idea of when to use which type of algorithm, i.e., supervised or
unsupervised. You will also be able to determine the strength of a linear
regression model.
This session will cover the following topics:
Introduction to machine learning
Supervised and unsupervised learning methods
Linear regression model
Residuals
Residual sum of squares (RSS) and R² (R-squared)
Machine Learning Models
Machine learning models can be classified into the following three types
based on the task performed and the nature of the output:
1. Regression: The output variable to be predicted is a continuous
variable, e.g., predicting the house price based on the size of the
house, availability of schools in the area, and other essential factors.
2. Classification: The output variable to be predicted is
a categorical variable, e.g., designating some papers as ‘Secret’
or ‘Confidential’.
3. Clustering: No predefined notion of a label is allocated to the
groups/clusters formed, e.g., grouping together users with similar
viewing patterns on Netflix, in order to recommend similar content.
Based on the learning approach, you can classify machine learning models
into two broad categories:
Supervised learning methods: These use past data with labels for
building the model. Regression and classification algorithms fall
under this category.
Unsupervised learning methods: No predefined labels are
assigned to the past data in this type. Clustering algorithms fall under
this category.
You can find the machine learning classification below.
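As a rough illustration of this classification, here is a minimal, hypothetical sketch using scikit-learn (the data and the specific algorithms are assumptions chosen purely for illustration):
```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Toy data: one numeric feature (e.g., house size) for six observations
X = np.array([[50], [60], [80], [100], [120], [150]])

# Supervised, regression: the label is continuous (e.g., house price)
y_price = np.array([150, 180, 230, 300, 360, 450])
reg = LinearRegression().fit(X, y_price)

# Supervised, classification: the label is categorical (e.g., 0 = budget, 1 = premium)
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)

# Unsupervised, clustering: no labels at all; the algorithm forms the groups itself
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print(reg.predict([[90]]), clf.predict([[90]]), clusters)
```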
Regression Line
Let’s now delve deeper into linear regression. You will learn to identify
whether a given problem is a regression problem or a classification
problem.
Imagine that you need to predict the final score of a cricket team before
the innings end. You can build a regression model to be able to make a
decent prediction.
Example: The run rate helps in predicting the final score.
A simple linear regression model attempts to explain the relationship
between a dependent and an independent variable using a straight line.
The independent variable is also known as the predictor variable, and
the dependent variable is also known as the output variable.
Best Fit Line
In regression, the best fit line is the line that best captures the trend in the
given scatter plot. Let’s see how you can define the notion of a best fit line.
Let’s reiterate what you have learnt so far:
You started with a scatter plot to check the relationship between the
sales and the marketing budget.
You found residuals and the RSS for any given line passing through
the scatter plot.
Then you found the equation of the best fit line by minimising the
RSS and also found the optimal values of β₀ and β₁.
So, you know that the best fit line is obtained by minimising a quantity
called the residual sum of squares (RSS). You will now be introduced to the
concept of gradient descent.
Gradient descent: Gradient descent is an optimisation algorithm that
optimises the objective function (cost function for linear regression) to
reach the optimal solution.
There are two ways to minimise the cost function (RSS):
1. Differentiation: You differentiate the cost function with respect to
β₀ and β₁, equate the two resulting equations to zero, and solve them to
find β₀ and β₁.
2. Gradient descent approach: You start with some initial values of β₀ and β₁ and
iteratively move towards better values of β₀ and β₁ so as to minimise the
cost function. A minimal sketch of both approaches follows this list.
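The following is a minimal NumPy sketch of both approaches on made-up marketing-spend vs sales numbers (the data, learning rate and iteration count are assumptions chosen purely for illustration, not the Excel demonstration):
```python
import numpy as np

# Hypothetical data: marketing spend (x) and sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# 1. Differentiation (closed-form OLS): set the partial derivatives of
#    RSS = sum((y - b0 - b1*x)^2) with respect to b0 and b1 to zero and solve.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("closed form:", b0, b1)

# 2. Gradient descent: start from arbitrary values and repeatedly step
#    in the direction opposite to the gradient of the RSS.
g0, g1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = g0 + g1 * x
    g0 -= lr * (-2 * np.sum(y - pred))        # dRSS/db0
    g1 -= lr * (-2 * np.sum((y - pred) * x))  # dRSS/db1
print("gradient descent:", g0, g1)
```
Both approaches should arrive at (approximately) the same β₀ and β₁, since they minimise the same cost function.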
Let’s now see a demonstration of obtaining a best fit line and finding out
the RSS for the marketing spend vs sales example in Excel.
Strength of Simple Linear Regression
Once you have determined the best fit line, there are a few critical
questions that you need to answer, such as:
How well does the best fit line represent the scatter plot?
How well does the best fit line predict the new data?
Residual Sum of Squares (RSS)
In the previous example of marketing spend (in lakh rupees) and sales
amount (in crore rupees), let’s assume that you get the same data in
different units: marketing spend (in lakh rupees) and sales amount (in
dollars). Do you think there will be any change in the value of the RSS due
to a change in the units (as compared to the value calculated in the Excel
demonstration)?
Yes, the value of the RSS will change because the units are
changed.
✓ Correct
Feedback:
Correct. The RSS for any regression line is given by the
expression RSS = ∑(yᵢ − ŷᵢ)², where ŷᵢ is the predicted value. RSS is the sum of the squared differences
between the actual and the predicted values, and its value will change if
the units change, as it has the units of y². For example, (₹140 − ₹70)² =
4900, whereas (2 USD − 1 USD)² = 1. So, the value of the RSS is
different and varies when the units are changed.
Experiment with the graphic below and look at how the position of the regression line and the
values of the RSS and the TSS (the total sum of squares, ∑(yᵢ − ȳ)²) change with a change in the values of β₀
and β₁.
Let’s go back to our Excel demonstration and see how you can find out the
TSS and then the R² for the example for which we had performed
regression.
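For intuition, here is a minimal NumPy sketch of computing the RSS, the TSS and R² on made-up numbers (these are not the actual Excel figures):
```python
import numpy as np

# Hypothetical marketing spend (x) and sales (y) figures
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Best-fit coefficients from ordinary least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x

rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss
print(rss, tss, r_squared)
```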
Comprehension
The plot below represents a scatter plot of two variables, X and Y, with the
Y variable being dependent on X. Let’s assume that the line with the
equation Y = X/2 + 3 plotted in the graph represents the best fit line. This
is the same line whose equation you found earlier.
You can find the value of the residual for each point, e.g., for x = 2, the
residual would be 5 - 4 = 1.
Apart from R², there is one more concept, named the residual standard error
(RSE), which is linked to the RSS.
As you have seen, the residual standard error (RSE) = √(RSS/df), where df
(degrees of freedom) = n − 2 and ‘n’ is the number of data points. You have also seen that
the RSE has the same disadvantage as the RSS, i.e., both depend on the scale of Y.
Now, let us summarise the topics you have studied till now.
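As a small illustrative sketch (using made-up residuals, not the Excel data), the RSE can be computed directly from the RSS:
```python
import numpy as np

# Residuals of a fitted simple linear regression (illustrative values)
residuals = np.array([0.1, -0.2, 0.3, -0.4, 0.2])
n = len(residuals)

rss = np.sum(residuals ** 2)
rse = np.sqrt(rss / (n - 2))  # df = n - 2 for simple linear regression
print(rse)
```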
Summary
This brings us to the end of the session on simple linear regression. In the
next session, you will work on simple linear regression in Python so that
you understand the practical aspects of it. Before that, let’s go through a
brief summary of what you learnt in this session.
Machine learning models can be classified into the following two
categories on the basis of the learning algorithm:
Supervised learning method: Past data with labels is
available to build the model.
Regression: The output variable is continuous in
nature.
Classification: The output variable is categorical in
nature.
Unsupervised learning method: Past data with labels is not
available.
Clustering: There is no predefined notion of labels.
The past data set is divided into two parts in the supervised learning
method:
Training data: Used for the model to learn during modelling
Testing data: Used by the trained model for prediction and
model evaluation
Linear regression models can be classified into two types depending
upon the number of independent variables:
Simple linear regression: This is used when the number of
independent variables is 1.
Multiple linear regression: This is used when the number
of independent variables is more than 1.
The equation of the best fit regression line Y = β₀ + β₁X, can be
found by minimising the cost function (RSS in this case, using the
ordinary least squares method), which is done using the following
two methods:
Differentiation
Gradient descent
The strength of a linear regression model is mainly explained
by R², where R² = 1 - (RSS/TSS).
RSS: Residual sum of squares
TSS: Total sum of squares
RSE helps in measuring the lack of fit of a model on given data.
The closeness of the estimated regression coefficients to the true
ones can be estimated using the RSE. It is related to the RSS by the
formula RSE = √(RSS/df), where df = n − 2 and n is the number of data
points.
Simple Linear Regression in Python
Welcome to the session ‘Simple Linear Regression in Python’. So far,
we have discussed the theory part of simple linear regression. Now, let’s
move on to building a simple linear regression model in Python.
In this session, you will learn about the generic steps required to build a
simple linear regression model. You will first read and visualise the data
set. Next, you will split the data set into train and test sets. You will then
build the model on the training data and draw inferences. You will use the
advertising dataset given in ISLR and analyse the relationship between
‘TV advertising’ and ‘Sales’ using a simple linear regression model. You
will learn to make a linear model using two different libraries
– statsmodels and SKLearn.
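As a preview of that workflow, here is a minimal sketch; the file name advertising.csv and the column names ‘TV’ and ‘Sales’ are assumptions based on the ISLR advertising data set, and the exact steps in the session may differ:
```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read and inspect the data (file name and columns assumed for illustration)
adv = pd.read_csv("advertising.csv")
X, y = adv[["TV"]], adv["Sales"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

# Using statsmodels: add a constant explicitly to get the intercept term
X_train_sm = sm.add_constant(X_train)
lr_sm = sm.OLS(y_train, X_train_sm).fit()
print(lr_sm.summary())              # coefficients, p-values, R-squared, etc.

# Using sklearn
lr_sk = LinearRegression().fit(X_train, y_train)
print(lr_sk.intercept_, lr_sk.coef_)
print(lr_sk.score(X_test, y_test))  # R-squared on the test set
```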
Assumptions of Simple Linear Regression
Before moving on to the Python code, we need to address an important
aspect of linear regression: the assumptions of linear regression.
While building a linear model, you assume that the target variable and the
input variables are linearly dependent.
You are making inferences on the ‘population’ using a ‘sample’. The
assumption that variables are linearly dependent is not enough to
generalise the results you obtain on a sample to the population, which is
much larger in size than the sample. Thus, you need to have certain
assumptions in place in order to make inferences.
Let’s understand the importance of each assumption one by one (a minimal sketch for checking some of them on the residuals follows this list):
There is a linear relationship between X and Y:
X and Y should display some sort of a linear
relationship; otherwise, there is no use in fitting a linear model
between them.
Error terms are normally distributed with mean zero (not X,
Y):
There is no problem if the error terms are not normally
distributed if you just wish to fit a line and not make any
further interpretations.
If you are willing to make some inferences on the model that
you have built (you will see this in the coming segments), you
need to have a notion of the distribution of the error terms.
One particular repercussion of the error terms not being
normally distributed is that the p-values obtained during the
hypothesis test to determine the significance of the
coefficients become unreliable. (You’ll see this in a later
segment)
The assumption of normality is made, as it has been observed
that the error terms generally follow a normal distribution
with a mean equal to zero in most cases.
Error terms are independent of each other:
The error terms should not be dependent on one another (like
in a time-series data wherein the next value is dependent on
the previous one).
Error terms have constant variance (homoscedasticity):
The variance should not increase (or decrease) as the error
values change.
The variance should not follow any pattern as the error terms
change.
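A minimal sketch of how you might eyeball some of these assumptions from the residuals of a fitted model (the data here is simulated purely for illustration; the session’s own demonstration may use different plots):
```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative fitted values and residuals from some simple linear regression
rng = np.random.default_rng(0)
y_pred = np.linspace(5, 25, 100)
residuals = rng.normal(loc=0, scale=1.5, size=100)  # should centre on zero

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Normality of error terms with mean zero: histogram of residuals
axes[0].hist(residuals, bins=15)
axes[0].set_title("Distribution of residuals")

# Independence and homoscedasticity: residuals vs fitted values should look
# like a patternless band of roughly constant spread around zero
axes[1].scatter(y_pred, residuals)
axes[1].axhline(0)
axes[1].set_title("Residuals vs fitted values")
plt.show()
```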
Moving From SLR to MLR – New Considerations
When moving from a simple linear regression model to a multiple linear
regression model, you have to look at a few things.
The new aspects to consider when moving from simple to multiple linear
regression are as follows:
Overfitting
As you keep adding variables, the model may become far too complex.
It may end up memorising the training data and, consequently, fail to
generalise.
A model is generally said to overfit when the training accuracy is high
while the test accuracy is very low. A quick sketch of checking this appears after this list.
Multicollinearity
This refers to associations between predictor variables, which you will
study later.
Feature selection
Selecting an optimal set from a pool of given features, many of which
might be redundant, becomes an important task.
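As a rough sketch of how overfitting shows up in practice (on hypothetical data, using polynomial features merely to simulate ‘adding too many variables’), a large gap between the train and test scores is the warning sign:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy data with a single underlying predictor
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(40, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Adding many derived variables makes the model far more complex
poly = PolynomialFeatures(degree=10)
X_train_big = poly.fit_transform(X_train)
X_test_big = poly.transform(X_test)

model = LinearRegression().fit(X_train_big, y_train)
print("train R^2:", model.score(X_train_big, y_train))  # tends to be high
print("test R^2: ", model.score(X_test_big, y_test))    # tends to be much lower
```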
You learnt about multiple linear regression models and new aspects like
overfitting, multicollinearity, and feature selection. In the next segment,
you will study multicollinearity.
Multicollinearity
In the last segment, you learnt about the new considerations that are
required to be made when moving to multiple linear regression. Let’s now
look at the next aspect, i.e., multicollinearity.
Multicollinearity refers to the phenomenon of having related predictor
(independent) variables in the input data set. In simple terms, in a model
that has been built using several independent variables, some of these
variables might be interrelated, making their presence in the model
redundant. Dropping some of these related independent variables is one
way of dealing with multicollinearity.
Multicollinearity affects the following:
Interpretation
The usual reading of a coefficient, i.e., ‘the change in Y for a unit change in X
when all other variables are held constant’, no longer holds, because correlated
predictors do not vary independently of one another.
Inference
The estimated coefficients swing widely from sample to sample, and their
p-values become unreliable.
You saw two basic ways of dealing with multicollinearity:
1. Looking at pairwise correlations
Looking at the correlation between different pairs of independent
variables
2. Checking the variance inflation factor (VIF)
Sometimes, pairwise correlations are not enough.
Instead of depending on just one other variable, an independent variable may
depend upon a combination of several other variables.
VIF calculates how well one independent variable is explained by all the
other independent variables combined.
The VIF is given by VIFᵢ = 1 / (1 − Rᵢ²).
Here, ‘i’ refers to the ith variable, which is represented as a linear
combination of the rest of the independent variables, and Rᵢ² is the R² obtained
when the ith variable is regressed on all the other independent variables. You
will see VIF in action during the Python demonstration on multiple linear regression.
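Ahead of that demonstration, here is a minimal sketch of computing VIFs with statsmodels on made-up data (the variable names x1, x2 and x3 are assumptions for illustration):
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is largely a combination of x1 and x2, so its
# VIF will be high even though no single pairwise correlation is close to 1
rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.2, size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_i = 1 / (1 - R_i^2), computed for each predictor in turn
X_const = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vifs)
```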
The common heuristic we follow for the VIF values is:
> 10: VIF value is definitely high, and the variable should be eliminated.
> 5: Can be okay, but it is worth inspecting.
< 5: Good VIF value; no need to eliminate this variable.
Once you have detected the multicollinearity present in the data set, how
exactly do you deal with it?
Some methods that can be used to deal with multicollinearity are as
follows:
Dropping variables
Drop the variable that is highly correlated with others
Pick the business interpretable variable
Creating a new variable using the interactions of the older variables
Add interaction features, i.e., features derived using some of the original
features
Variable transformations
Principal component analysis (covered in a later module)
We have learnt about multicollinearity and how to deal with it. In the next
segment, you will learn how to handle the categorical variables present in
a data set.
Dealing With Categorical Variables
So far, you have worked with numerical variables, but often, you will have
non-numeric variables in data sets. These variables are also known as
categorical variables. Obviously, these variables cannot be used directly in
the model, as they are non-numeric.
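One common way to handle them, which the upcoming demonstration will likely cover, is to convert categorical variables into numeric dummy variables; here is a minimal pandas sketch on made-up data:
```python
import pandas as pd

# Hypothetical data set with a categorical column
df = pd.DataFrame({
    "area": [1200, 1500, 900, 1100],
    "furnishing": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# Dummy (one-hot) encoding: each category becomes a 0/1 column.
# drop_first=True drops one level to avoid a redundant, perfectly collinear column.
dummies = pd.get_dummies(df["furnishing"], drop_first=True)
df_encoded = pd.concat([df.drop(columns="furnishing"), dummies], axis=1)
print(df_encoded)
```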