Linear Regression
Linear Regression
● The first machine learning algorithm we
will explore is also one of the oldest!
Linear Regression
● Linear Regression
○ Theory of Linear Regression
○ Simple Implementation with Python
○ Scikit-Learn Overview
○ Linear Regression with Scikit-learn
○ Polynomial Regression
○ Regularization
○ Overview of Project Dataset
Let’s get started!
Introduction to Linear
Regression
Algorithm Theory - Part One
History and Motivation
Linear Regression
● Before we do any coding, we will have a
deep dive into building out an intuition
of the theory and motivation behind
Linear Regression.
Linear Regression
● This will include understanding:
○ Brief History
○ Linear Relationships
○ Ordinary Least Squares
○ Cost Functions
○ Gradient Descent
○ Vectorization
Introduction to Linear
Regression
Brief History
Linear Regression
● The history of the “invention” of linear
regression is a bit muddled.
● The linear regression methods based on
least squares grew out of a need for
mathematically improving navigation
methods based on astronomy during the
Age of Exploration in the 1700s.
Linear Regression
● 1722 - Roger Cotes discovers combining
different observations yields better
estimates of the true value.
● 1750 - Tobias Mayer explores averaging
different results under similar conditions
in studying librations of the moon.
Linear Regression
● 1757 - Roger Joseph Boscovich further
develops combining observations while
studying the shape of the Earth.
● 1788 - Pierre-Simon Laplace develops
similar averaging theories in explaining
the differences in motion between
Jupiter and Saturn.
Linear Regression
● 1805 - First public exposition on Linear
Regression with the least squares method
published by Adrien-Marie Legendre -
Nouvelles Méthodes pour la
Détermination des Orbites des Comètes
Linear Regression
● 1809 - Carl Friedrich Gauss publishes his
methods of calculating orbits of celestial
bodies.
● Claiming to have invented least-squares
back in 1795!
Linear Regression
● 1808 - Robert Adrain published his
formulation of least squares (a year
before publication by Gauss).
Introduction to Linear
Regression
Linear Relationships
Linear Regression
● Put simply, a linear relationship implies
some constant straight line relationship.
● The simplest possible being y = x.
Linear Regression
● Here we see x = [1,2,3] and y = [1,2,3]
Linear Regression
● We could then, based on the three real
data points, build out the relationship
y=x as our “fitted” line.
Linear Regression
● This implies that for some new x value
we can predict its related y.
Linear Regression
● But what happens with real data?
Linear Regression
● How do we draw a “fitted” line?
Linear Regression
● How do we draw a better “fitted” line?
Linear Regression
● Fundamentally, we understand we want
to minimize the overall distance from the
points to the line.
Linear Regression
● We also know we can measure this error
from the real data points to the line,
known as the residual error.
Linear Regression
● Some lines will clearly be better fits than
others.
Linear Regression
● We can also see the residuals can be
both positive and negative.
Introduction to Linear
Regression
Ordinary Least Squares
Linear Regression
● Ordinary Least Squares works by
minimizing the sum of the squares of the
differences between the observed
dependent variable (values of the
variable being observed) in the given
dataset and those predicted by the linear
function.
Linear Regression
● We can visualize squared error to
minimize:
Linear Regression
● Having a squared error will help us
simplify our calculations later on when
setting up a derivative.
Linear Regression
● Let’s explore Ordinary Least Squares by
converting a real data set into
mathematical notation, then working to
solve for a linear relationship between
features and a target variable!
Introduction to Linear
Regression
Algorithm Theory - Part Two
OLS Equations
Linear Regression
● Linear Regression OLS Theory
○ We know the equation of a simple
straight line:
■ y = mx + c
● m is the slope or gradient
● c is the intercept with the y-axis (the
value of y when x is zero, i.e. the point (0, c))
Linear Regression
● Linear Regression OLS Theory
○ We can see for y=mx+c there is only
room for one possible feature x.
○ OLS will allow us to directly solve for
the slope m and intercept c.
○ We will later see we’ll need tools like
gradient descent to scale this to
multiple features.
Linear Regression
● Let’s explore how we could translate a
real data set into mathematical notation
for linear regression.
● Then we’ll solve a simple case of one
feature to explore OLS in action.
● Afterwards we’ll focus on gradient
descent for real world data set situations.
Linear Regression
● Linear Regression allows us to build a
relationship between multiple features
to estimate a target output/predicted
value
Area m2 Bedrooms Bathrooms Price
200 3 2 $500,000
190 2 1 $450,000
230 3 3 $650,000
180 1 1 $400,000
210 2 2 $550,000
Linear Regression
● We can translate this data into
generalized mathematical notation
● [x: features, y: predicted value]
X y
Area m2 Bedrooms Bathrooms Price
200 3 2 $500,000
190 2 1 $450,000
230 3 3 $650,000
180 1 1 $400,000
210 2 2 $550,000
Linear Regression
● We can translate this data into
generalized mathematical notation
X y
x1 x2 x3 y
200 3 2 $500,000
190 2 1 $450,000
230 3 3 $650,000
180 1 1 $400,000
210 2 2 $550,000
Linear Regression
● We can translate this data into
generalized mathematical notation
X y
x1 x2 x3 y
x11 3 2 $500,000
x21 2 1 $450,000
x31 3 3 $650,000
x41 1 1 $400,000
x51 2 2 $550,000
Linear Regression
● We can translate this data into
generalized mathematical notation
X y
x1 x2 x3 y
x11 x12 x13 y1
x21 x22 x23 y2
x31 x32 x33 y3
x41 x42 x43 y4
x51 x52 x53 y5
Linear Regression
● Now let’s build out a linear relationship
between the features X and label y.
X y
x1 x2 x3 y
x11 x12 x13 y1
x21 x22 x23 y2
x31 x32 x33 y3
x41 x42 x43 y4
x51 x52 x53 y5
Linear Regression
● Now let’s build out a linear relationship
between the features x and label y.
X y
x1 x2 x3 y
Linear Regression
● Reformat for y = x equation
y X
y x1 x2 x3
Linear Regression
● Each feature should have some Beta
coefficient associated with it.
y X
y x1 x2 x3
Linear Regression
● This is the same as the common notation
for a simple straight line: y=mx+c
y X
y x1 x2 x3
Linear Regression
● This is stating there is some Beta
coefficient for each feature to minimize
error. y X
y x1 x2 x3
Linear Regression
● We can also express this equation as a
sum:
y X
y x1 x2 x3
Linear Regression
● Note the y hat symbol (ŷ) denotes a
prediction.
Linear Regression
● Line equation:
ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Linear Regression
● For simple problems with one X feature
we can easily solve for the Beta values
with an analytical solution.
● Let’s quickly solve a simple example
problem, then we will see that for
multiple features we will need gradient
descent.
Linear Regression
● As we expand to more than a single
feature, however, an analytical solution
quickly becomes unscalable.
● Instead we shift our focus to minimizing
a cost function with gradient descent.
Linear Regression
● We can use gradient descent to minimize a
cost function and calculate the Beta values!
Introduction to Linear
Regression
Algorithm Theory - Part Three
Cost Function
Linear Regression
● What we know so far:
○ Linear Relationships
■ y = mx+c
○ Ordinary Least Squares (OLS)
■ Solve simple linear regression
○ Not scalable for multiple features
○ Translating real data to Matrix Notation
○ Generalized formula for Beta coefficients
Linear Regression
● Remember we are searching for Beta
values for a best-fit line.
Linear Regression
● The equation below simply defines our
line, but how do we choose the Beta
coefficients?
Linear Regression
● We’ve decided to define “best-fit” as
minimizing the squared error, better
known as the mean squared error (MSE).
Linear Regression
● What is a Cost Function?
○ It is a function that measures the performance
of a Machine Learning model for given data.
○ Cost Function quantifies the error between
predicted values and expected values and
presents it in the form of a single real number.
Linear Regression
● What is a Cost Function?
○ In this situation, the event we are finding the
cost of is the difference between the estimated
values (the hypothesis) and the real values —
the actual data we are trying to fit a line to.
https://medium.com/@lachlanmiller_52885/understanding-and-calculating-the-cost-function-for-linear-regression-39b8a3519fcb
Linear Regression
● What is a Cost Function?
Linear Regression
● What is a Cost Function?
○ The goal here is to find a line of best fit. A line that
approximates the values most accurately.
Linear Regression
● What is a Cost Function?
○ Here are some random
guesses for the slope of
each line
Linear Regression
● What is a Cost Function?
We have three hypotheses — three potential sets of predicted values that might
represent a line of best fit. The slope for each line is as follows:
best_fit_2 looks pretty good. But we are data scientists, we don’t guess;
we conduct analysis and make well-founded statements using
mathematics.
Linear Regression
● Our cost function can be defined by the
squared error formula:
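● The formula here (shown as an image in the original slides) is the standard mean squared error cost function; a reconstruction in LaTeX, consistent with the 1/(2m) constant used in the worked example below:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

where h_\theta(x^{(i)}) is the hypothesis (predicted) value for sample i, y^{(i)} is the actual value, and m is the number of samples.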
Linear Regression
● Remember a cost function maps event or
values of one or more variables onto a real
number.
● In this case, the event we are finding the cost
of is the difference between estimated
values (the hypothesis) and the real values
— the actual data we are trying to fit a line to.
Linear Regression
• m is the number of samples — in this case, we
have three samples for X.
• Those are 1, 2 and 3. So 1/(2m) is a constant. It
turns out to be 1/6, or 0.1667.
Linear Regression
• Now we have sigma. This means the sum.
• In this case, the sum from i = 1 to m, or 1 to 3.
Linear Regression
• We repeat the calculation to the right of the
sigma for each sample.
• The actual calculation is just the hypothesis
value for each x minus the actual value of y. Then
you square whatever you get.
Linear Regression
• The final result will be a single number.
• We repeat this process for all the hypotheses, in this
case best_fit_1, best_fit_2 and best_fit_3. Whichever has the
lowest result, or the lowest “cost”, is the best fit of the three
hypotheses.
Linear Regression
● Calculating the Cost Function by Hand
Let’s run the calculation for best_fit_1
Linear Regression
● Calculating the Cost Function by Hand
Let’s run the calculation for best_fit_1
1/(2m) = 1/6
Linear Regression
● Calculating the Cost Function by Hand
Let’s run the calculation for best_fit_1
(0.50 − 1.00)² = 0.25
Linear Regression
● Calculating the Cost Function by Hand
Let’s run the calculation for best_fit_1
(1.00 − 2.50)² = 2.25
Linear Regression
● Calculating the Cost Function by Hand
Let’s run the calculation for best_fit_1
(1.50 − 3.50)² = 4.00
Linear Regression
J = 1/6 * (0.25 + 2.25 + 4.00)
J = 1.083
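● A small Python sketch that reproduces this hand calculation; the actual y values and the best_fit_1 hypothesis values are taken from the worked example above, and the same function could be applied to the other hypotheses’ predicted values:

import numpy as np

def cost(predictions, y_actual):
    # Squared-error cost: J = 1/(2m) * sum((prediction - actual)^2)
    m = len(y_actual)
    return np.sum((np.asarray(predictions) - np.asarray(y_actual)) ** 2) / (2 * m)

# Values from the worked example: x = [1, 2, 3], actual y = [1.00, 2.50, 3.50]
y_actual = [1.00, 2.50, 3.50]
best_fit_1 = [0.50, 1.00, 1.50]   # hypothesis values for a slope of 0.5

print(cost(best_fit_1, y_actual))  # ~1.083, matching the hand calculation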
Linear Regression
• Repeat the same process for all the other hypotheses
and we get:
• A lower cost is desirable. A lower cost represents a
smaller difference. By minimizing the cost, we are
finding the best fit.
• Out of the three hypotheses presented, best_fit_2 has
the lowest cost.
Linear Regression
The orange
line, best_fit_2, is the
best fit of the three.
We can see this is
likely the case by
visual inspection, but
now we have a more
defined process for
confirming our
observations.
Linear Regression
● Unfortunately, it is not scalable to try to
get an analytical solution to minimize
this cost function.
● In the next lecture we will learn to use
gradient descent to minimize this cost
function.
Introduction to Linear
Regression
Algorithm Theory - Part Four
Gradient Descent
Linear Regression
● We just figured out a cost function to
minimize!
● Taking the cost function derivative and
then solving for zero to get the set of
Beta coefficients will be too difficult to
solve directly through an analytical
solution.
Linear Regression
● Instead we can describe this cost
function through vectorized matrix
notation and use gradient descent to
have a computer figure out the set of
Beta coefficient values that minimize the
cost/loss function.
Linear Regression
● Our goals:
○ Find a set of Beta coefficient values
that minimizes the error (cost
function)
○ Leverage computational power
instead of having to manually attempt
to analytically solve the derivative.
Linear Regression
● What is Gradient Descent?
Gradient Descent (GD) is an efficient optimization
algorithm that attempts to find a local or global
minimum of a function.
In other words, GD is a general method for
minimizing a function, in this case the Mean
Squared Error cost function.
Linear Regression
● What is Gradient Descent?
Gradient Descent basically just does what we were
doing by hand — change the theta values, or
parameters, bit by bit, until we hopefully arrive at a
minimum.
Linear Regression
● What is Gradient Descent?
Gradient descent enables a model to learn
the gradient or direction that the model
should take in order to reduce errors
(differences between actual y and
predicted y).
Linear Regression
● What is Gradient Descent?
Direction in the simple linear regression
example refers to how the model
parameters m and c should be tweaked or
corrected to further reduce the cost
function
Linear Regression
● What is Gradient Descent?
As the model iterates, it gradually
converges towards a minimum where
further tweaks to the parameters produce
little or zero changes in the loss — also
referred to as convergence.
Linear Regression
● Gradient descent can be defined by the
following formula:
Source: https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch04.html
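● The formula (shown as an image in the original slide, from the cited source) is the standard gradient descent update rule; a reconstruction in LaTeX:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

where \alpha is the learning rate (step size) and \partial J / \partial \theta_j is the gradient of the cost function with respect to parameter \theta_j (each of our Beta coefficients).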
Linear Regression
At this point the
model
has optimized the
weights such that
they minimize the
cost function.
Linear Regression
Linear Regression
The first thing to notice is
the thick red line. This is the
line estimated from the
initial values of m and c.
You can see that this
doesn’t fit the data points
well at all, and because of
this it has the highest
error (MSE).
Linear Regression
However, you can see the
lines gradually moving
toward the data points until
a line of best fit (the thick
blue line) is identified.
In other words, upon each
iteration the model has
learned better values for m
and c until it finds the values
that minimize the cost
function.
Linear Regression
● What is Gradient Descent?
The alternative to the gradient descent
process would be brute forcing a potentially
infinite combination of parameters until the
set that minimizes the cost is identified.
For obvious reasons this isn’t really feasible.
Linear Regression
● What is Gradient Descent?
Gradient descent, therefore, enables the
learning process to make corrective updates
to the learned estimates that move the
model toward an optimal combination of
parameters.
Linear Regression
● Let’s visually explore what this looks like
in the case of a single Beta value.
Linear Regression
● Common mountain analogy
Linear Regression
● This is exactly what gradient descent
does!
● It even looks similar for the case of a
single coefficient search.
Linear Regression
● 1 dimensional cost function (single Beta)
Linear Regression
● Choose a starting point
Linear Regression
● Calculate gradient at that point
Linear Regression
● Step forward proportional to negative
gradient
Linear Regression
● Repeat the steps
Linear Regression
● Note how we are essentially mapping the
gradient!
Linear Regression
● Eventually we will find the Beta that
minimizes the cost function!
Linear Regression
● Steps are proportional to negative
gradient!
Linear Regression
● Steeper gradient at start gives larger
steps.
Linear Regression
● Smaller gradient at end gives smaller
steps.
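● A minimal Python sketch of these steps for a single-feature line y = m*x + c; the data, learning rate, and iteration count are all made-up illustration values:

import numpy as np

# Made-up data roughly following a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

m, c = 0.0, 0.0      # starting point for the parameters
alpha = 0.05         # learning rate (step size)
n = len(x)

for _ in range(2000):
    y_pred = m * x + c
    # Gradients of the cost J = 1/(2n) * sum((y_pred - y)^2)
    dm = np.sum((y_pred - y) * x) / n
    dc = np.sum(y_pred - y) / n
    # Step proportional to the negative gradient
    m -= alpha * dm
    c -= alpha * dc

print(f"m = {m:.3f}, c = {c:.3f}")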
Linear Regression
● Finally! We can now leverage all our
computational power to find optimal
Beta coefficients that minimize the cost
function producing the line of best fit!
● We are now ready to code out Linear
Regression!
Simple Linear
Regression
Linear Regression
● Now that we understand what is
happening “under the hood” for linear
regression, let’s begin by coding through
an example of simple linear regression.
Linear Regression
● Simple Linear Regression
○ Limited to one X feature (y=mx+c)
○ We will create a best-fit line to map
out a linear relationship between
total advertising spend and
resulting sales.
● Let’s head over to the notebook!
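● As a rough sketch of what the notebook might do with NumPy’s polynomial fitting; the file name 'advertising.csv' and the column names 'total_spend' and 'sales' are placeholders, not necessarily the names used in the course notebook:

import numpy as np
import pandas as pd

# Placeholder file/column names -- adjust to the actual notebook data
df = pd.read_csv('advertising.csv')
X = df['total_spend']
y = df['sales']

# Fit a degree-1 polynomial (a straight line); returns [slope, intercept]
slope, intercept = np.polyfit(X, y, deg=1)

# Predict sales for a new advertising spend value
predicted_sales = slope * 200 + intercept
print(predicted_sales)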
Linear Regression
● Simple Linear Regression Exercise
○ Years Old
○ Experience Years
○ Academic Years
○ Salary
● Predict the Salary for the following
people:
Scikit-Learn
Overview
Scikit-Learn
● We’ve seen that NumPy has some built-in
capabilities for simple linear regression,
but when it comes to more complex
models, we’ll need Scikit-Learn!
● Before we jump straight into machine
learning with Scikit-Learn and Python,
let’s understand the philosophy behind
sklearn.
Scikit-Learn
● Scikit-learn is a library containing many
machine learning algorithms.
● It utilizes a generalized “estimator API”
framework for calling its models.
● This means the way algorithms are
imported, fitted, and used is uniform
across all algorithms.
Scikit-Learn
● This allows users to easily swap algorithms
in and out and test various approaches.
● This uniform framework also means users
can easily apply almost any algorithm
effectively without truly understanding
what the algorithm is doing!
Scikit-Learn
● Scikit-learn also comes with many
convenience tools, including train test
split functions, cross validation tools, and
a variety of reporting metric functions.
● This leaves Scikit-Learn as a “one-stop
shop” for many of our machine learning
needs.
Scikit-Learn
● Philosophy of Scikit-Learn
○ Scikit-Learn’s approach to model
building focuses on applying
models and performance metrics.
○ This is a more pragmatic industry
style approach rather than an
academic approach of describing
the model and its parameters.
Scikit-Learn
● Philosophy of Scikit-Learn
○ Academic users used to R-style
reporting may also want to explore
the statsmodels Python library if
interested in a more statistical
description of models, such as
significance levels.
Scikit-Learn
● Let’s quickly review the framework of
Scikit-Learn for the supervised machine
learning process.
● We will quickly see how the code directly
relates to the process theory!
Supervised Machine Learning Process
● Recall that we will perform a Train | Test
split for supervised learning.
Area m2 Bedrooms Bathrooms Price
TRAIN:
200 3 2 $500,000
190 2 1 $450,000
230 3 3 $650,000
TEST:
180 1 1 $400,000
210 2 2 $550,000
Supervised Machine Learning Process
● Also recall there are 4 main components
after a Train | Test split:
Area m2 Bedrooms Bathrooms (X) | Price (y)
X TRAIN / y TRAIN:
200 3 2 | $500,000
190 2 1 | $450,000
230 3 3 | $650,000
X TEST / y TEST:
180 1 1 | $400,000
210 2 2 | $550,000
Supervised Machine Learning Process
● Scikit-Learn easily does this split (as well
as more advanced cross-validation)
Area m2 Bedrooms Bathrooms Price
TRAIN:
200 3 2 $500,000
190 2 1 $450,000
230 3 3 $650,000
TEST:
180 1 1 $400,000
210 2 2 $550,000
Scikit-Learn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
Supervised Machine Learning Process
● Also recall that we want to compare
predictions to the y test labels.
Predictions | Area m2 Bedrooms Bathrooms | Price
TEST:
$410,000 | 180 1 1 | $400,000
$540,000 | 210 2 2 | $550,000
Scikit-Learn
from sklearn.model_family import ModelAlgo
Scikit-Learn
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1,param2)
Scikit-Learn
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1,param2)
mymodel.fit(X_train,y_train)
Scikit-Learn
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1,param2)
mymodel.fit(X_train,y_train)
predictions = mymodel.predict(X_test)
Scikit-Learn
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1,param2)
mymodel.fit(X_train,y_train)
predictions = mymodel.predict(X_test)
from sklearn.metrics import error_metric
Scikit-Learn
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1,param2)
mymodel.fit(X_train,y_train)
predictions = mymodel.predict(X_test)
from sklearn.metrics import error_metric
performance = error_metric(y_test,predictions)
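● As a concrete, illustrative instance of this framework using Linear Regression; it assumes a feature matrix X and label vector y have already been created, and the test size and random state are just example settings:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X (features) and y (labels) are assumed to already exist
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
performance = mean_squared_error(y_test, predictions)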
Scikit-Learn
● This framework will be similar for any
supervised machine learning algorithm.
● Let’s begin exploring it further with
Linear Regression!
Linear Regression with
Scikit-Learn
Part One:
Data Setup and Model Training
Linear Regression
● Previously, we explored “Is there a
relationship between total advertising
spend and sales?”
● Now we want to expand this to “What is
the relationship between each
advertising channel
(TV,Radio,Newspaper) and sales?”
Linear Regression
● Let’s jump into the Jupyter notebook to
answer this question.
Performance
Evaluation
Regression Metrics
Evaluating Regression
● Now that we have a fitted model that
can perform predictions based on
features, how do we decide if those
predictions are any good?
● Fortunately we have the known test
labels to compare our results to.
Evaluating Regression
● Let’s take a moment now to discuss
evaluating Regression Models
● Regression is a task where a model
attempts to predict continuous
values (unlike categorical values,
which is classification).
Evaluating Regression
● For example, attempting to predict the
price of a house given its features is a
regression task.
● Attempting to predict the country a
house is in given its features would be
a classification task.
Evaluating Regression
● You may have heard of some
evaluation metrics like accuracy or
recall.
● These sorts of metrics aren’t useful for
regression problems; we need metrics
designed for continuous values!
Evaluating Regression
● Let’s discuss some of the most
common evaluation metrics for
regression:
○ Mean Absolute Error
○ Mean Squared Error
○ Root Mean Square Error
Evaluating Regression
● The metrics shown here apply to any
regression task, not just Linear
Regression!
Evaluating Regression
● Mean Absolute Error (MAE)
○ This is the mean of the absolute
value of errors.
○ Easy to understand
Evaluating Regression
● MAE won’t punish large errors, however.
https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Evaluating Regression
● We want our error metrics to account
for these!
Evaluating Regression
● Mean Squared Error (MSE)
○ Large errors are “punished” more
than with MAE making MSE more
popular.
Evaluating Regression
● Mean Squared Error (MSE)
○ Issue with MSE:
■ Different units than original y.
■ It reports units of y squared!
Evaluating Regression
● Root Mean Square Error (RMSE)
○ This is the root of the mean of the
squared errors.
○ Most popular (has same units as y)
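● A short sketch of calculating these three metrics with Scikit-Learn; y_test and predictions are assumed to come from an earlier fit/predict step:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# y_test and predictions are assumed to exist from a fitted model
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)   # RMSE: same units as y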
Machine Learning
● Most common question:
○ “What is a good value for RMSE?”
● Context is everything!
● An RMSE of $10 is fantastic for
predicting the price of a house, but
horrible for predicting the price of a
candy bar!
Machine Learning
● Compare your error metric to the
average value of the label in your data
set to try to get an intuition of its overall
performance.
● Domain knowledge also plays an
important role here!
Machine Learning
● Context of importance is also necessary
to consider.
○ We may create a model to predict how
much medication to give, in which
case small fluctuations in RMSE may
actually be very significant.
Machine Learning
● Context of importance is also necessary
to consider.
○ If we create a model to try to improve
on some runner’s performance, we
would need some baseline RMSE to
compare to.
Evaluating Regression
● Let’s quickly jump back to the notebook
and calculate these metrics with
Scikit-Learn!
Evaluating Residuals
Linear Regression
● Often for Linear Regression it is a good
idea to separately evaluate the residuals
(y − ŷ) and not just calculate performance
metrics (e.g. RMSE).
● Let’s explore why this is important.
Linear Regression
● Anscombe’s Quartet:
Linear Regression
● Clearly Linear Regression is not suitable!
Linear Regression
● But how can we tell whether Linear
Regression is suitable when we’re dealing
with more than one x feature?
● We cannot see this discrepancy of fit
visually if we have multiple features!
Linear Regression
● What we could do is plot residual error
against true y values.
● Consider an appropriate data set:
Linear Regression
● The residual errors should be random
and close to a normal distribution.
Linear Regression
● Residual plot shows residual error vs.
true y value.
Linear Regression
● There should be no clear line or curve.
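● A sketch of these residual plots with Matplotlib and Seaborn; y_test and predictions are assumed to exist from a fitted model:

import matplotlib.pyplot as plt
import seaborn as sns

# Residuals: difference between true values and predictions
residuals = y_test - predictions

# Residuals vs. true y values -- should look like random scatter around 0
plt.scatter(y_test, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('True y')
plt.ylabel('Residual (y - y_hat)')
plt.show()

# Distribution of residuals -- should be roughly normal and centered at 0
sns.histplot(residuals, bins=20, kde=True)
plt.show()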
Linear Regression
● What about non-valid datasets?
Linear Regression
● What about non-valid datasets? Example 1
Linear Regression
● Residual plot showing a clear pattern,
indicating Linear Regression is not valid,
and we should choose another model.
Linear Regression
● Residual plot showing a clear pattern,
indicating Linear Regression is not valid!
Linear Regression
● What about non-valid datasets? Example 2
Linear Regression
● Residual plot showing a clear pattern,
indicating Linear Regression is not valid!
Linear Regression
● Let’s explore creating these plots with
Python and our model results!
Model Deployment
Linear Regression
● We’re almost done with our first
machine learning run through!
● Let’s quickly review what we’ve done so
far in the ML process.
Supervised Machine Learning Process
● Recall the Supervised ML Process
X and y Data → Training Data Set → Fit/Train Model → Adjust as Needed → Deploy Model
X and y Data → Test Data Set → Evaluate Performance (feeding back into Adjust as Needed)
Linear Regression
● We will explore polynomial regression
and regularization later on as model
adjustments.
● For now, let’s focus on a simple
“deployment” of our model by saving
and loading it, then applying to new
data.
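● A minimal sketch of this save/load step using joblib; the variable final_model, the file name 'final_model.joblib', and new_data are placeholders for illustration:

from joblib import dump, load

# Save the trained model to disk (placeholder file name)
dump(final_model, 'final_model.joblib')

# Later, or in another script: load the model and predict on new data
loaded_model = load('final_model.joblib')
new_predictions = loaded_model.predict(new_data)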