
What Is Regression?

Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

In this session, you will get a quick introduction to machine learning models. You will learn the basics of simple linear regression and understand how to find the best-fitted line for a model and the various parameters associated with it. After completing the session, you will have an idea of when to use which type of algorithm, i.e., supervised or unsupervised, and you will be able to determine the strength of a linear regression model.

This session will cover the following topics:

 Introduction to machine learning

 Supervised and unsupervised learning methods

 Linear regression model

 Residuals

 Residual sum of squares (RSS) and R² (R-squared)

Machine Learning Models

Machine learning models can be classified into the following three types based on the task performed and the nature of the output:

1. Regression: The output variable to be predicted is a continuous variable, e.g., predicting the house price based on the size of the house, availability of schools in the area, and other essential factors.

2. Classification: The output variable to be predicted is a categorical variable, e.g., designating some papers as ‘Secret’ or ‘Confidential’.

3. Clustering: No predefined notion of a label is allocated to the groups/clusters formed, e.g., grouping together users with similar viewing patterns on Netflix in order to recommend similar content.

You can also classify machine learning models into two broad categories based on the learning method:

 Supervised learning methods: These use past data with labels for building the model. Regression and classification algorithms fall under this category.

 Unsupervised learning methods: No predefined labels are assigned to the past data in this type. Clustering algorithms fall under this category.


Regression Line

Let’s now delve deeper into linear regression. You will learn to identify
whether a given problem is a regression problem or a classification
problem.

Imagine that you need to predict the final score of a cricket team before
the innings end. You can build a regression model to be able to make a
decent prediction.

Example – the current run rate helps in predicting the final score.

A simple linear regression model attempts to explain the relationship between a dependent and an independent variable using a straight line. The independent variable is also known as the predictor variable, and the dependent variable is also known as the output variable.

Best Fit Line

In regression, the best fit line is the line that comes closest to all the points in a given scatter plot. Let’s see how you can define this notion precisely.

Let’s reiterate what you have learnt so far:

 You started with a scatter plot to check the relationship between the
sales and the marketing budget.

 You found residuals and the RSS for any given line passing through
the scatter plot.

 Then you found the equation of the best fit line by minimising the
RSS and also found the optimal values of β₀ and β₁.

So, you know that the best fit line is obtained by minimising a quantity
called the residual sum of squares (RSS). You will now be introduced to the
concept of gradient descent.

Gradient descent: Gradient descent is an optimisation algorithm that optimises the objective function (the cost function, in the case of linear regression) to reach the optimal solution.

There are two ways to minimise the cost function (RSS); a short Python sketch of both follows this list:

1. Differentiation: We differentiate the cost function, equate the derivatives to zero, and solve the two resulting equations to find β₀ and β₁.

2. Gradient descent approach: You take some initial β₀ and β₁ values and iteratively move towards better values of β₀ and β₁ so as to minimise the cost function.
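
Here is the sketch promised above, showing both methods in Python. The data, learning rate, and iteration count are illustrative assumptions; the point is simply that gradient descent converges to the same β₀ and β₁ that the closed-form differentiation approach gives.

```python
import numpy as np

# Illustrative data: e.g., marketing spend (x) vs sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Method 1: Differentiation (closed-form OLS estimates)
b1_closed = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_closed = y.mean() - b1_closed * x.mean()

# Method 2: Gradient descent on the cost function RSS = sum((y - y_pred)^2)
b0, b1 = 0.0, 0.0      # arbitrary starting values
lr = 0.01              # learning rate (illustrative choice)
for _ in range(10000):
    y_pred = b0 + b1 * x
    grad_b0 = -2 * np.sum(y - y_pred)        # dRSS/db0
    grad_b1 = -2 * np.sum((y - y_pred) * x)  # dRSS/db1
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(b0_closed, b1_closed)  # closed-form beta_0 and beta_1
print(b0, b1)                # gradient descent arrives at (almost) the same values
```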

Let’s now see a demonstration of obtaining a best fit line and finding out the RSS for the marketing spends vs sales example in Excel.

Strength of Simple Linear Regression

Once you have determined the best fit line, there are a few critical
questions that you need to answer, such as:

 How well does the best fit line represent the scatter plot?

 How well does the best fit line predict the new data?

Residual Sum of Squares (RSS)


In the previous example of marketing spend (in lakh rupees) and sales
amount (in crore rupees), let’s assume that you get the same data in
different units: marketing spend (in lakh rupees) and sales amount (in
dollars). Do you think there will be any change in the value of the RSS due
to a change in the units (as compared to the value calculated in the Excel
demonstration)?

Yes, the value of the RSS will change because the units are changed.

Feedback: Correct. The RSS for any regression line is given by the expression RSS = Σᵢ(yᵢ − ŷᵢ)², where ŷᵢ is the value predicted by the line. The RSS is the sum of the squared differences between the actual and the predicted values, and its value will change if the units change, as it has the units of y². For example, (₹140 − ₹70)² = 4,900, whereas (2 USD − 1 USD)² = 1. So, the value of the RSS is different and varies when the units are changed.

Interact with the graphic below and look at how the position of the regression line and the values of the RSS and the TSS (the total sum of squares, TSS = Σᵢ(yᵢ − ȳ)²) change with a change in the values of β₀ and β₁.

Let’s go back to our Excel demonstration and see how you can find out the
TSS and then the R² for the example for which we had performed
regression.
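
If you would rather follow the same computation in Python than in Excel, here is a minimal sketch; the actual and predicted values below are illustrative, not taken from the demonstration.

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])  # observed values (illustrative)
y_pred = np.array([3.5, 4.5, 7.5, 8.5])    # values predicted by the fitted line

rss = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss                        # R² = 1 - (RSS/TSS)

print(rss, tss, r_squared)  # 1.0, 20.0, 0.95
```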
Comprehension

The plot below represents a scatter plot of two variables, X and Y, with the
Y variable being dependent on X. Let’s assume that the line with the
equation Y = X/2 + 3 plotted in the graph represents the best fit line. This
is the same line whose equation you found earlier.

You can find the value of the residual for each point, e.g., for x = 2, the
residual would be 5 - 4 = 1.
Apart from R², there is one more concept, named RSE (residual standard error), which is linked to RSS.

As you have seen, the residual standard error (RSE) is given by RSE = √(RSS/df), where df (degrees of freedom) = n − 2 and ‘n’ is the number of data points. You have also seen that the RSE has the same disadvantage as the RSS, i.e., both depend on the scale of Y. Now, let us summarise the topics you have studied till now.
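
Continuing the illustrative sketch from earlier, the RSE follows directly from the RSS:

```python
n = len(y_actual)        # number of data points from the earlier sketch
df = n - 2               # degrees of freedom for simple linear regression
rse = np.sqrt(rss / df)  # residual standard error, in the units of y

print(rse)  # sqrt(1.0 / 2) ≈ 0.707
```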

Summary

This brings us to the end of the session on simple linear regression. In the
next session, you will work on simple linear regression in Python so that
you understand the practical aspects of it. Before that, let’s go through a
brief summary of what you learnt in this session.

 Machine learning models can be classified into the following two categories on the basis of the learning algorithm:

 Supervised learning method: Past data with labels is available to build the model.

 Regression: The output variable is continuous in nature.

 Classification: The output variable is categorical in nature.

 Unsupervised learning method: Past data with labels is not available.

 Clustering: There is no predefined notion of labels.

 The past data set is divided into two parts in the supervised learning method:

 Training data: Used for the model to learn during modelling

 Testing data: Used by the trained model for prediction and model evaluation

 Linear regression models can be classified into two types depending upon the number of independent variables:

 Simple linear regression: This is used when the number of independent variables is 1.

 Multiple linear regression: This is used when the number of independent variables is more than 1.

 The equation of the best fit regression line, Y = β₀ + β₁X, can be found by minimising the cost function (RSS in this case, using the ordinary least squares method), which is done using the following two methods:

 Differentiation

 Gradient descent

 The strength of a linear regression model is mainly explained by R², where R² = 1 − (RSS/TSS).

 RSS: Residual sum of squares

 TSS: Total sum of squares

 RSE helps in measuring the lack of fit of a model on given data. The closeness of the estimated regression coefficients to the true ones can be estimated using the RSE. It is related to the RSS by the formula RSE = √(RSS/df), where df = n − 2 and n is the number of data points.

Welcome to the session ‘Simple Linear Regression in Python’. So far, we have discussed the theory part of simple linear regression. Now, let’s move on to building a simple linear regression model in Python.

In this session, you will learn about the generic steps required to build a simple linear regression model. You will first read and visualise the data set. Next, you will split the data set into train and test sets. You will then build the model on the training data and draw inferences. You will use the advertising data set given in ISLR and analyse the relationship between ‘TV advertising’ and ‘Sales’ using a simple linear regression model. You will learn to make a linear model using two different libraries – statsmodels and SKLearn.
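
As a preview of these steps, here is a minimal sketch using statsmodels and sklearn. The file name 'advertising.csv' and the column names 'TV' and 'Sales' are assumptions based on the ISLR advertising data set; adjust them to match your copy of the data.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Read the data (file and column names assumed from the ISLR data set)
advertising = pd.read_csv('advertising.csv')
X = advertising['TV']
y = advertising['Sales']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100
)

# statsmodels does not add an intercept by default, so add the constant explicitly
X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit()

print(lr.params)     # beta_0 (const) and beta_1 (TV)
print(lr.summary())  # R-squared, coefficient p-values, and other statistics
```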

Assumptions of Simple Linear Regression


Before moving on to the Python code, we need to address an important
aspect of linear regression: the assumptions of linear regression.

While building a linear model, you assume that the target variable and the input variables are linearly dependent.

You are making inferences on the ‘population’ using a ‘sample’. The assumption that the variables are linearly dependent is not enough to generalise the results you obtain on a sample to the population, which is much larger in size than the sample. Thus, you need to have certain assumptions in place in order to make inferences.

Let’s understand the importance of each assumption one by one:

 There is a linear relationship between X and Y:

 X and Y should display some sort of a linear relationship; otherwise, there is no use in fitting a linear model between them.

 Error terms are normally distributed with mean zero (not X, Y):

 There is no problem if the error terms are not normally distributed if you just wish to fit a line and not make any further interpretations.

 If you are willing to make some inferences on the model that you have built (you will see this in the coming segments), you need to have a notion of the distribution of the error terms. One particular repercussion of the error terms not being normally distributed is that the p-values obtained during the hypothesis test to determine the significance of the coefficients become unreliable (you will see this in a later segment).

 The assumption of normality is made, as it has been observed that the error terms generally follow a normal distribution with a mean equal to zero in most cases.
 Error terms are independent of each other:

 The error terms should not be dependent on one another (as in time-series data, wherein the next value is dependent on the previous one).

 Error terms have constant variance (homoscedasticity):

 The variance should not increase (or decrease) as the error values change.

 The variance should not follow any pattern as the error terms change.
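
Once a model has been built, these assumptions can be checked visually from the residuals. A minimal sketch, assuming the fitted model lr, the training matrix X_train_sm, and the target y_train from the earlier statsmodels sketch:

```python
import matplotlib.pyplot as plt

residuals = y_train - lr.predict(X_train_sm)

# Normality: the histogram of residuals should look roughly bell-shaped around 0
plt.hist(residuals, bins=15)
plt.title('Distribution of residuals')
plt.show()

# Homoscedasticity: residuals vs fitted values should show no visible pattern
plt.scatter(lr.predict(X_train_sm), residuals)
plt.axhline(0)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```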

Moving From SLR to MLR – New Considerations

When moving from a simple linear regression model to a multiple linear regression model, there are a few new aspects to consider. These are as follows:

Overfitting

As you keep adding variables, the model may become far too complex. It may end up memorising the training data and, consequently, fail to generalise. A model is generally said to overfit when the training accuracy is high while the test accuracy is very low.

Multicollinearity

This refers to associations between predictor variables, which you will study later.

Feature selection
Selecting an optimal set from a pool of given features, many of which
might be redundant, becomes an important task.

You learnt about multiple linear regression models and new aspects like
overfitting, multicollinearity, and feature selection. In the next segment,
you will study multicollinearity.

Multicollinearity

In the last segment, you learnt about the new considerations that are
required to be made when moving to multiple linear regression. Let’s now
look at the next aspect, i.e., multicollinearity.

Multicollinearity refers to the phenomenon of having related predictor (independent) variables in the input data set. In simple terms, in a model that has been built using several independent variables, some of these variables might be interrelated, due to which the presence of that variable in the model is redundant. You drop some of these related independent variables as a way of dealing with multicollinearity.

Multicollinearity affects the following:

Interpretation

The usual interpretation of a regression coefficient, i.e., the change in Y when that variable changes while all the other variables are held constant, no longer holds reliably.

Inference

The coefficients become unstable, and their p-values are no longer reliable.

You saw two basic ways of dealing with multicollinearity:

1. Looking at pairwise correlations

Looking at the correlation between different pairs of independent variables

2. Checking the variance inflation factor (VIF)

Sometimes, pairwise correlations are not enough.

Instead of just one variable, the independent variable may depend upon a
combination of other variables.

VIF calculates how well one independent variable is explained by all the
other independent variables combined.
The VIF is given by:

VIFᵢ = 1 / (1 − Rᵢ²)

Here, ‘i’ refers to the ith variable, which is represented as a linear combination of the rest of the independent variables, and Rᵢ² is the R² obtained when the ith variable is regressed on all the other independent variables. You will see VIF in action during the Python demonstration on multiple linear regression.

The common heuristic we follow for the VIF values is:

> 10: VIF value is definitely high, and the variable should be eliminated.

> 5: Can be okay, but it is worth inspecting.

< 5: Good VIF value; no need to eliminate this variable.
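
Both checks take only a few lines in Python. A minimal sketch, assuming X is a pandas DataFrame that contains only the numeric independent variables:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 1. Pairwise correlations between the independent variables
print(X.corr())

# 2. VIF for each independent variable
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif.sort_values(by='VIF', ascending=False))
```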

Once you have detected the multicollinearity present in the data set, how
exactly do you deal with it?

Some methods that can be used to deal with multicollinearity are as follows:

Dropping variables

 Drop the variable that is highly correlated with others.

 Pick the variable that is more interpretable for the business.

Creating a new variable using the interactions of the older variables

 Add interaction features, i.e., features derived using some of the original features.

Variable transformations

Principal component analysis (covered in a later module)

We have learnt about multicollinearity and how to deal with it. In the next
segment, you will learn how to handle the categorical variables present in
a data set.

Dealing With Categorical Variables

So far, you have worked with numerical variables, but often, you will have
non-numeric variables in data sets. These variables are also known as
categorical variables. Obviously, these variables cannot be used directly in
the model, as they are non-numeric.
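
A common way of handling them, which you will see in the Python demonstrations, is to convert each category into dummy (0/1) variables. A minimal sketch using pandas, with a hypothetical 'furnishing' column:

```python
import pandas as pd

df = pd.DataFrame({
    'furnishing': ['furnished', 'semi-furnished', 'unfurnished', 'furnished']
})

# One 0/1 column per category; drop the first level to avoid the dummy
# variable trap (perfect multicollinearity between the dummy columns)
dummies = pd.get_dummies(df['furnishing'], drop_first=True)
df = pd.concat([df.drop(columns='furnishing'), dummies], axis=1)
print(df)
```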
