What Is Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y) and a
series of other variables (known as independent variables).
In this session, you will get a quick introduction to machine learning
models. You will learn the basics of simple linear regression and
understand how to find the best-fit line for a model and the various
parameters associated with it. After completing the session, you will have
an idea of when to use which type of algorithm, i.e., supervised or
unsupervised. You will also be able to determine the strength of a linear
regression model.
This session will cover the following topics:
Introduction to machine learning
Supervised and unsupervised learning methods
Linear regression model
Residuals
Residual sum of squares (RSS) and R² (R-squared)
Machine Learning Models
Machine learning models can be classified into the following three types
based on the task performed and the nature of the output:
1. Regression: The output variable to be predicted is a continuous
variable, e.g., predicting the house price based on the size of the
house, availability of schools in the area, and other essential factors.
2. Classification: The output variable to be predicted is
a categorical variable, e.g., designating some papers as ‘Secret’
or ‘Confidential’.
3. Clustering: No predefined notion of a label is allocated to the
groups/clusters formed, e.g., grouping together users with similar
viewing patterns on Netflix, in order to recommend similar content.
Based on the learning approach, you can classify machine learning models
into two broad categories:
Supervised learning methods: These use past data with labels for
building the model. Regression and classification algorithms fall
under this category.
Unsupervised learning methods: No predefined labels are
assigned to the past data in this type. Clustering algorithms fall under
this category.
You can find the machine learning classification below.
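As a rough illustration of this classification, here is a minimal, hypothetical sketch using scikit-learn (the data and the specific algorithms are assumptions chosen purely for illustration):
```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Toy data: one numeric feature (e.g., house size) for six observations
X = np.array([[50], [60], [80], [100], [120], [150]])

# Supervised, regression: the label is continuous (e.g., house price)
y_price = np.array([150, 180, 230, 300, 360, 450])
reg = LinearRegression().fit(X, y_price)

# Supervised, classification: the label is categorical (e.g., 0 = budget, 1 = premium)
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)

# Unsupervised, clustering: no labels at all; the algorithm forms the groups itself
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print(reg.predict([[90]]), clf.predict([[90]]), clusters)
```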
Regression Line
Let’s now delve deeper into linear regression. You will learn to identify
whether a given problem is a regression problem or a classification
problem.
Imagine that you need to predict the final score of a cricket team before
the innings end. You can build a regression model to be able to make a
decent prediction.
Example: The run rate helps in predicting the final score.
A simple linear regression model attempts to explain the relationship
between a dependent and an independent variable using a straight line.
The independent variable is also known as the predictor variable, and
the dependent variable is also known as the output variable.
Best Fit Line
In regression, the best fit line is the line that best captures the trend in the
given scatter plot. Let’s see how you can define the notion of a best fit line.
Let’s reiterate what you have learnt so far:
You started with a scatter plot to check the relationship between the
sales and the marketing budget.
You found residuals and the RSS for any given line passing through
the scatter plot.
Then you found the equation of the best fit line by minimising the
RSS and also found the optimal values of β₀ and β₁.
So, you know that the best fit line is obtained by minimising a quantity
called the residual sum of squares (RSS). You will now be introduced to the
concept of gradient descent.
Gradient descent: Gradient descent is an optimisation algorithm that
optimises the objective function (cost function for linear regression) to
reach the optimal solution.
There are two ways to minimise the cost function (RSS):
1. Differentiation: You differentiate the cost function with respect to
β₀ and β₁, equate the two resulting equations to zero, and solve them to
find β₀ and β₁.
2. Gradient descent approach: You start with some initial values of β₀ and β₁ and
iteratively move towards better values of β₀ and β₁ so as to minimise the
cost function. A minimal sketch of both approaches follows this list.
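The following is a minimal NumPy sketch of both approaches on made-up marketing-spend vs sales numbers (the data, learning rate and iteration count are assumptions chosen purely for illustration, not the Excel demonstration):
```python
import numpy as np

# Hypothetical data: marketing spend (x) and sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# 1. Differentiation (closed-form OLS): set the partial derivatives of
#    RSS = sum((y - b0 - b1*x)^2) with respect to b0 and b1 to zero and solve.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("closed form:", b0, b1)

# 2. Gradient descent: start from arbitrary values and repeatedly step
#    in the direction opposite to the gradient of the RSS.
g0, g1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = g0 + g1 * x
    g0 -= lr * (-2 * np.sum(y - pred))        # dRSS/db0
    g1 -= lr * (-2 * np.sum((y - pred) * x))  # dRSS/db1
print("gradient descent:", g0, g1)
```
Both approaches should arrive at (approximately) the same β₀ and β₁, since they minimise the same cost function.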
Let’s now see a demonstration of obtaining a best fit line and finding out
the RSS for the marketing spend vs sales example in Excel.
Strength of Simple Linear Regression
Once you have determined the best fit line, there are a few critical
questions that you need to answer, such as:
How well does the best fit line represent the scatter plot?
How well does the best fit line predict the new data?
Residual Sum of Squares (RSS)
In the previous example of marketing spend (in lakh rupees) and sales
amount (in crore rupees), let’s assume that you get the same data in
different units: marketing spend (in lakh rupees) and sales amount (in
dollars). Do you think there will be any change in the value of the RSS due
to a change in the units (as compared to the value calculated in the Excel
demonstration)?
Yes, the value of the RSS will change because the units are
changed.
✓ Correct
Feedback:
Correct. The RSS for any regression line is given by the
expression RSS = ∑(yᵢ − ŷᵢ)², where ŷᵢ is the predicted value. RSS is the sum of the squared differences
between the actual and the predicted values, and its value will change if
the units change, as it has the units of y². For example, (₹140 − ₹70)² =
4900, whereas (2 USD − 1 USD)² = 1. So, the value of the RSS is
different and varies when the units are changed.
Experiment with the graphic below and look at how the position of the regression line and the
values of the RSS and the TSS (the total sum of squares, ∑(yᵢ − ȳ)²) change with a change in the values of β₀
and β₁.
Let’s go back to our Excel demonstration and see how you can find out the
TSS and then the R² for the example for which we had performed
regression.
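For intuition, here is a minimal NumPy sketch of computing the RSS, the TSS and R² on made-up numbers (these are not the actual Excel figures):
```python
import numpy as np

# Hypothetical marketing spend (x) and sales (y) figures
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Best-fit coefficients from ordinary least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x

rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss
print(rss, tss, r_squared)
```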
Comprehension
The plot below represents a scatter plot of two variables, X and Y, with the
Y variable being dependent on X. Let’s assume that the line with the
equation Y = X/2 + 3 plotted in the graph represents the best fit line. This
is the same line whose equation you found earlier.
You can find the value of the residual for each point, e.g., for x = 2, the
residual would be 5 - 4 = 1.
Apart from R², there is one more concept, named the residual standard error
(RSE), which is linked to the RSS.
As you have seen, the residual standard error (RSE) = √(RSS/df), where df
(degrees of freedom) = n − 2 and ‘n’ is the number of data points. You have also seen that
the RSE has the same disadvantage as the RSS, i.e., both depend on the scale of Y.
Now, let us summarise the topics you have studied till now.
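As a small illustrative sketch (using made-up residuals, not the Excel data), the RSE can be computed directly from the RSS:
```python
import numpy as np

# Residuals of a fitted simple linear regression (illustrative values)
residuals = np.array([0.1, -0.2, 0.3, -0.4, 0.2])
n = len(residuals)

rss = np.sum(residuals ** 2)
rse = np.sqrt(rss / (n - 2))  # df = n - 2 for simple linear regression
print(rse)
```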
Summary
This brings us to the end of the session on simple linear regression. In the
next session, you will work on simple linear regression in Python so that
you understand the practical aspects of it. Before that, let’s go through a
brief summary of what you learnt in this session.
Machine learning models can be classified into the following two
categories on the basis of the learning algorithm:
Supervised learning method: Past data with labels is
available to build the model.
Regression: The output variable is continuous in
nature.
Classification: The output variable is categorical in
nature.
Unsupervised learning method: Past data with labels is not
available.
Clustering: There is no predefined notion of labels.
The past data set is divided into two parts in the supervised learning
method:
Training data: Used for the model to learn during modelling
Testing data: Used by the trained model for prediction and
model evaluation
Linear regression models can be classified into two types depending
upon the number of independent variables:
Simple linear regression: This is used when the number of
independent variables is 1.
Multiple linear regression: This is used when the number
of independent variables is more than 1.
The equation of the best fit regression line Y = β₀ + β₁X, can be
found by minimising the cost function (RSS in this case, using the
ordinary least squares method), which is done using the following
two methods:
Differentiation
Gradient descent
The strength of a linear regression model is mainly explained
by R², where R² = 1 - (RSS/TSS).
RSS: Residual sum of squares
TSS: Total sum of squares
RSE helps in measuring the lack of fit of a model on given data.
The closeness of the estimated regression coefficients to the true
ones can be estimated using the RSE. It is related to the RSS by the
formula RSE = √(RSS/df), where df = n − 2 and n is the number of data
points.
Simple Linear Regression in Python
Welcome to the session ‘Simple Linear Regression in Python’. So far,
we have discussed the theory part of simple linear regression. Now, let’s
move on to building a simple linear regression model in Python.
In this session, you will learn about the generic steps required to build a
simple linear regression model. You will first read and visualise the data
set. Next, you will split the data set into train and test sets. You will then
build the model on the training data and draw inferences. You will use the
advertising dataset given in ISLR and analyse the relationship between
‘TV advertising’ and ‘Sales’ using a simple linear regression model. You
will learn to make a linear model using two different libraries
– statsmodels and SKLearn.
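As a preview of that workflow, here is a minimal sketch; the file name advertising.csv and the column names ‘TV’ and ‘Sales’ are assumptions based on the ISLR advertising data set, and the exact steps in the session may differ:
```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read and inspect the data (file name and columns assumed for illustration)
adv = pd.read_csv("advertising.csv")
X, y = adv[["TV"]], adv["Sales"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

# Using statsmodels: add a constant explicitly to get the intercept term
X_train_sm = sm.add_constant(X_train)
lr_sm = sm.OLS(y_train, X_train_sm).fit()
print(lr_sm.summary())              # coefficients, p-values, R-squared, etc.

# Using sklearn
lr_sk = LinearRegression().fit(X_train, y_train)
print(lr_sk.intercept_, lr_sk.coef_)
print(lr_sk.score(X_test, y_test))  # R-squared on the test set
```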
Assumptions of Simple Linear Regression
Before moving on to the Python code, we need to address an important
aspect of linear regression: the assumptions of linear regression.
While building a linear model, you assume that the target variable and the
input variables are linearly dependent.
You are making inferences on the ‘population’ using a ‘sample’. The
assumption that variables are linearly dependent is not enough to
generalise the results you obtain on a sample to the population, which is
much larger in size than the sample. Thus, you need to have certain
assumptions in place in order to make inferences.
Let’s understand the importance of each assumption one by one (a minimal sketch for checking some of them on the residuals follows this list):
There is a linear relationship between X and Y:
X and Y should display some sort of a linear
relationship; otherwise, there is no use in fitting a linear model
between them.
Error terms are normally distributed with mean zero (not X,
Y):
There is no problem if the error terms are not normally
distributed if you just wish to fit a line and not make any
further interpretations.
If you are willing to make some inferences on the model that
you have built (you will see this in the coming segments), you
need to have a notion of the distribution of the error terms.
One particular repercussion of the error terms not being
normally distributed is that the p-values obtained during the
hypothesis test to determine the significance of the
coefficients become unreliable. (You’ll see this in a later
segment)
The assumption of normality is made, as it has been observed
that the error terms generally follow a normal distribution
with a mean equal to zero in most cases.
Error terms are independent of each other:
The error terms should not be dependent on one another (like
in a time-series data wherein the next value is dependent on
the previous one).
Error terms have constant variance (homoscedasticity):
The variance should not increase (or decrease) as the error
values change.
The variance should not follow any pattern as the error terms
change.
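A minimal sketch of how you might eyeball some of these assumptions from the residuals of a fitted model (the data here is simulated purely for illustration; the session’s own demonstration may use different plots):
```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative fitted values and residuals from some simple linear regression
rng = np.random.default_rng(0)
y_pred = np.linspace(5, 25, 100)
residuals = rng.normal(loc=0, scale=1.5, size=100)  # should centre on zero

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Normality of error terms with mean zero: histogram of residuals
axes[0].hist(residuals, bins=15)
axes[0].set_title("Distribution of residuals")

# Independence and homoscedasticity: residuals vs fitted values should look
# like a patternless band of roughly constant spread around zero
axes[1].scatter(y_pred, residuals)
axes[1].axhline(0)
axes[1].set_title("Residuals vs fitted values")
plt.show()
```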
Moving From SLR to MLR – New Considerations
When moving from a simple linear regression model to a multiple linear
regression model, you have to look at a few things.
The new aspects to consider when moving from simple to multiple linear
regression are as follows:
Overfitting
As you keep adding variables, the model may become far too complex.
It may end up memorising the training data and, consequently, fail to
generalise.
A model is generally said to overfit when the training accuracy is high
while the test accuracy is very low. A quick sketch of checking this appears after this list.
Multicollinearity
This refers to associations between predictor variables, which you will
study later.
Feature selection
Selecting an optimal set from a pool of given features, many of which
might be redundant, becomes an important task.
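As a rough sketch of how overfitting shows up in practice (on hypothetical data, using polynomial features merely to simulate ‘adding too many variables’), a large gap between the train and test scores is the warning sign:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy data with a single underlying predictor
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(40, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Adding many derived variables makes the model far more complex
poly = PolynomialFeatures(degree=10)
X_train_big = poly.fit_transform(X_train)
X_test_big = poly.transform(X_test)

model = LinearRegression().fit(X_train_big, y_train)
print("train R^2:", model.score(X_train_big, y_train))  # tends to be high
print("test R^2: ", model.score(X_test_big, y_test))    # tends to be much lower
```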
You learnt about multiple linear regression models and new aspects like
overfitting, multicollinearity, and feature selection. In the next segment,
you will study multicollinearity.
Multicollinearity
In the last segment, you learnt about the new considerations that are
required to be made when moving to multiple linear regression. Let’s now
look at the next aspect, i.e., multicollinearity.
Multicollinearity refers to the phenomenon of having related predictor
(independent) variables in the input data set. In simple terms, in a model
that has been built using several independent variables, some of these
variables might be interrelated, making their presence in the model
redundant. Dropping some of these related independent variables is one
way of dealing with multicollinearity.
Multicollinearity affects the following:
Interpretation
The usual reading of a coefficient, i.e., ‘the change in Y for a unit change in X
when all other variables are held constant’, no longer holds, because correlated
predictors do not vary independently of one another.
Inference
The estimated coefficients swing widely from sample to sample, and their
p-values become unreliable.
You saw two basic ways of dealing with multicollinearity:
1. Looking at pairwise correlations
Looking at the correlation between different pairs of independent
variables
2. Checking the variance inflation factor (VIF)
Sometimes, pairwise correlations are not enough.
Instead of depending on just one other variable, an independent variable may
depend upon a combination of several other variables.
VIF calculates how well one independent variable is explained by all the
other independent variables combined.
The VIF is given by VIFᵢ = 1 / (1 − Rᵢ²).
Here, ‘i’ refers to the ith variable, which is represented as a linear
combination of the rest of the independent variables, and Rᵢ² is the R² obtained
when the ith variable is regressed on all the other independent variables. You
will see VIF in action during the Python demonstration on multiple linear regression.
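Ahead of that demonstration, here is a minimal sketch of computing VIFs with statsmodels on made-up data (the variable names x1, x2 and x3 are assumptions for illustration):
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is largely a combination of x1 and x2, so its
# VIF will be high even though no single pairwise correlation is close to 1
rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.2, size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_i = 1 / (1 - R_i^2), computed for each predictor in turn
X_const = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vifs)
```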
The common heuristic we follow for the VIF values is:
> 10: VIF value is definitely high, and the variable should be eliminated.
> 5: Can be okay, but it is worth inspecting.
< 5: Good VIF value; no need to eliminate this variable.
Once you have detected the multicollinearity present in the data set, how
exactly do you deal with it?
Some methods that can be used to deal with multicollinearity are as
follows:
Dropping variables
Drop the variable that is highly correlated with others
Pick the business interpretable variable
Creating a new variable using the interactions of the older variables
Add interaction features, i.e., features derived using some of the original
features
Variable transformations
Principal component analysis (covered in a later module)
We have learnt about multicollinearity and how to deal with it. In the next
segment, you will learn how to handle the categorical variables present in
a data set.
Dealing With Categorical Variables
So far, you have worked with numerical variables, but often, you will have
non-numeric variables in data sets. These variables are also known as
categorical variables. Obviously, these variables cannot be used directly in
the model, as they are non-numeric.
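One common way to handle them, which the upcoming demonstration will likely cover, is to convert categorical variables into numeric dummy variables; here is a minimal pandas sketch on made-up data:
```python
import pandas as pd

# Hypothetical data set with a categorical column
df = pd.DataFrame({
    "area": [1200, 1500, 900, 1100],
    "furnishing": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# Dummy (one-hot) encoding: each category becomes a 0/1 column.
# drop_first=True drops one level to avoid a redundant, perfectly collinear column.
dummies = pd.get_dummies(df["furnishing"], drop_first=True)
df_encoded = pd.concat([df.drop(columns="furnishing"), dummies], axis=1)
print(df_encoded)
```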