Lecture 16 Regression

The document provides an overview of data analysis and statistical modeling, focusing on regression analysis and its applications. It outlines key concepts, notations, and requirements for regression, as well as guidelines for using regression equations for predictions. Additionally, it includes example calculations and Python code for finding and visualizing regression equations.


Data Analysis and Statistical Modeling
Dr. Syed Faisal Bukhari
Associate Professor
Department of Data Science
Faculty of Computing and Information Technology
University of the Punjab
Textbooks

• Probability & Statistics for Engineers & Scientists, Ninth Edition, Ronald E. Walpole, Raymond H. Myers
• Elementary Statistics: Picturing the World, 6th Edition, Ron Larson and Betsy Farber
• Elementary Statistics, 13th Edition, Mario F. Triola

Dr. Faisal Bukhari, DDS, PU, Lahore


Reference books
• Probability and Statistical Inference, Ninth Edition, Robert V. Hogg, Elliot A. Tanis, Dale L. Zimmerman
• Probability Demystified, Allan G. Bluman
• Practical Statistics for Data Scientists: 50 Essential Concepts, Peter Bruce and Andrew Bruce
• Schaum's Outline of Probability, Second Edition, Seymour Lipschutz, Marc Lipson
• Python for Probability, Statistics, and Machine Learning, José Unpingco

References
• Probability & Statistics for Engineers & Scientists, Ninth Edition, Ronald E. Walpole, Raymond H. Myers
• Elementary Statistics, Tenth Edition, Mario F. Triola

These notes contain material from the above resources.


Basic Concepts of Regression
In some cases, two variables are related in a
deterministic way, meaning that given a value for
one variable, the value of the other variable is
automatically determined without any error.

For example, the total cost y of an item with a list price of x and a sales tax of 5% can be found by using the deterministic equation y = 1.05x. If an item is priced at $100, its total cost is $105.


Probabilistic Models
In other cases, two variables are related in a probabilistic way, meaning that one variable is not determined completely by the other variable.

For example, a child’s height is not determined completely by the height of the father (or mother).

Sir Francis Galton (1822–1911) studied the phenomenon of heredity and showed that when tall or short couples have children, the heights of those children tend to regress, or revert, to the more typical mean height for people of the same gender.

Notations
The regression equation expresses a relationship between x (called the explanatory variable, predictor variable, or independent variable) and $\hat{y}$ (called the response variable or dependent variable).

The typical equation of a straight line, y = mx + b, is expressed in the form $\hat{y} = b_0 + b_1 x$ or $\hat{y} = a + bx$, where $b_0$ (or a) is the y-intercept and $b_1$ (or b) is the slope.


The given notation shows that $b_0$ and $b_1$ are sample statistics used to estimate the population parameters $\beta_0$ and $\beta_1$.

We will use paired sample data to estimate the regression equation. Using only sample data, we can’t find the exact values of the population parameters $\beta_0$ and $\beta_1$, but we can use the sample data to estimate them with $b_0$ and $b_1$.
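
The idea that b0 and b1 only estimate the population parameters can be illustrated with a small simulation. The sketch below is not from the lecture. It assumes a hypothetical population with intercept 2 and slope 0.5 plus normally distributed noise, draws two random samples, and fits each one with scipy.stats.linregress. The helper name draw_sample_and_fit, the sample size, and the noise level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical population parameters (assumed for illustration only)
BETA_0 = 2.0   # population y-intercept
BETA_1 = 0.5   # population slope


def draw_sample_and_fit(rng, n=50, noise_sd=1.0):
    """Draw n paired (x, y) values from the assumed population and return (b0, b1)."""
    x = rng.uniform(0, 10, size=n)
    y = BETA_0 + BETA_1 * x + rng.normal(0, noise_sd, size=n)
    result = linregress(x, y)
    return result.intercept, result.slope


rng = np.random.default_rng(seed=0)
b0_first, b1_first = draw_sample_and_fit(rng)
b0_second, b1_second = draw_sample_and_fit(rng)

print(f"Population: beta0 = {BETA_0}, beta1 = {BETA_1}")
print(f"Sample 1:   b0 = {b0_first:.3f}, b1 = {b1_first:.3f}")
print(f"Sample 2:   b0 = {b0_second:.3f}, b1 = {b1_second:.3f}")
```

Each sample should give estimates that are near, but not equal to, the assumed population values, and the two samples should disagree slightly with each other.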


Requirements
1. The sample of paired (x, y) data is a random sample
of quantitative data.

2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern.

3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.


Requirements
Note: Requirements 2 and 3 above are simplified attempts at checking these formal requirements for regression analysis:
• For each fixed value of x, the corresponding values of y have a distribution that is bell-shaped.
• For the different fixed values of x, the distributions of the corresponding y-values all have the same variance.
• For the different fixed values of x, the distributions of the corresponding y-values have means that lie along the same straight line.
• The y-values are independent.

Requirements
Results are not seriously affected if departures from
normal distributions and equal variances are not too
extreme.



Definitions
Given a collection of paired sample data, the regression equation
$$\hat{y} = b_0 + b_1 x$$
algebraically describes the relationship between the two variables. The graph of the regression equation is called the regression line (or line of best fit, or least-squares line).


Notation for Regression Equation
                                      Population Parameter         Sample Statistic
y-intercept of regression equation    $\beta_0$                    $b_0$
Slope of regression equation          $\beta_1$                    $b_1$
Equation of the regression line       $Y = \beta_0 + \beta_1 x$    $\hat{y} = b_0 + b_1 x$

Finding the slope $b_1$ and y-intercept $b_0$ in the regression equation $\hat{y} = b_0 + b_1 x$

Slope:
$$b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

y-intercept:
$$b_0 = \bar{y} - b_1 \bar{x}$$
or
$$b_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$
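
The two expressions for the y-intercept are algebraically equivalent. As a short check, the derivation sketch below substitutes $\bar{x} = \sum x / n$, $\bar{y} = \sum y / n$, and the slope formula into $b_0 = \bar{y} - b_1 \bar{x}$. The shorthand $D = n(\sum x^2) - (\sum x)^2$ for the common denominator is introduced here for brevity and is not part of the original notes.

```latex
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x}
     = \frac{\sum y}{n} - \frac{\sum x}{n} \cdot \frac{n(\sum xy) - (\sum x)(\sum y)}{D} \\
    &= \frac{(\sum y)\,D - (\sum x)\left[\, n(\sum xy) - (\sum x)(\sum y) \,\right]}{nD} \\
    &= \frac{n(\sum y)(\sum x^2) - (\sum y)(\sum x)^2 - n(\sum x)(\sum xy) + (\sum x)^2(\sum y)}{nD} \\
    &= \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{D}
\end{aligned}
```

The second and fourth terms in the third line cancel, and dividing the remaining terms by n gives the second form of the intercept formula.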


Example Finding the Regression Equation
Use the given sample data to find the regression
equation.

x 3 1 3 5
y 5 8 6 4



REQUIREMENT CHECK: The data are a simple random sample. The accompanying Python-generated scatterplot shows points that approximate a straight-line pattern. There are no outliers. We can proceed to find the slope and intercept of the regression line.


x     y     xy    $x^2$   $y^2$
3     5     15    9       25
1     8     8     1       64
3     6     18    9       36
5     4     20    25      16

Totals: $\sum x = 12$, $\sum y = 23$, $\sum xy = 61$, $\sum x^2 = 44$, $\sum y^2 = 141$
$$b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
$$b_1 = \frac{4(61) - (12)(23)}{4(44) - (12)^2} = \frac{-32}{32} = -1$$
$$\bar{x} = \frac{12}{4} = 3$$
$$\bar{y} = \frac{23}{4} = 5.75$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
$$b_0 = 5.75 - (-1)(3)$$
$$b_0 = 8.75$$
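
The hand calculation can be checked by coding the summation formulas directly. The sketch below uses only built-in Python and the four sample pairs from the example. The variable names are illustrative, and both forms of the intercept formula are evaluated to confirm that they agree.

```python
# Sample data from the worked example
x = [3, 1, 3, 5]
y = [5, 8, 6, 4]
n = len(x)

# Column totals, matching the table of sums
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope and intercept from the summation formulas
denominator = n * sum_x2 - sum_x ** 2
b1 = (n * sum_xy - sum_x * sum_y) / denominator
b0_from_means = sum_y / n - b1 * (sum_x / n)
b0_from_sums = (sum_y * sum_x2 - sum_x * sum_xy) / denominator

print(sum_x, sum_y, sum_xy, sum_x2)      # 12 23 61 44
print(b1, b0_from_means, b0_from_sums)   # -1.0 8.75 8.75
```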


import numpy as np
from scipy.stats import linregress

# Given data points
x = np.array([3, 1, 3, 5])  # Independent variable
y = np.array([5, 8, 6, 4])  # Dependent variable

# Calculate the slope and intercept using linregress
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Formulate the regression equation; {slope:+.2f} always prints the sign,
# so a negative slope is shown as "-1.00x" rather than "+ -1.00x"
regression_equation = f"y = {intercept:.2f} {slope:+.2f}x"

# Output results
print("Slope:", slope)
print("Intercept:", intercept)
print("Regression Equation:", regression_equation)

Explanation
np.array: Converts lists to NumPy arrays for easy manipulation.
linregress: Calculates the slope, intercept, and other regression statistics.
Print statements: Display the results, including the slope, intercept, and formatted regression equation.

This code will output:
Slope: -1.0
Intercept: 8.75
Regression Equation: y = 8.75 -1.00x
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Given data points
x = np.array([3, 1, 3, 5])  # Independent variable
y = np.array([5, 8, 6, 4])  # Dependent variable

# Calculate the slope and intercept using linregress
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Generate predicted y values based on the regression line
y_pred = intercept + slope * x

# Create scatter plot of the original data points
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='blue', label="Data Points")

# Plot the regression line; {slope:+.2f} prints the slope with its sign,
# so the legend is correct for positive and negative slopes alike
plt.plot(x, y_pred, color='red',
         label=f"Regression Line: y = {intercept:.2f} {slope:+.2f}x")

# Add titles and labels
plt.title("Scatter Plot with Regression Line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True)

# Show the plot
plt.show()

Knowing the slope $b_1$ and y-intercept $b_0$, we can now express the estimated equation of the regression line as
$$\hat{y} = b_0 + b_1 x$$
$$\hat{y} = 8.75 - 1x$$
We should realize that this equation is an estimate of the true regression equation $Y = \beta_0 + \beta_1 x$. This estimate is based on one particular set of sample data, but another sample drawn from the same population would probably lead to a slightly different equation.
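
The name "least-squares line" refers to the fact that the fitted line minimizes the sum of squared residuals, where a residual is the vertical distance between an observed y and the value predicted by the line. As a rough illustration, the sketch below compares that sum for the fitted line with two nearby lines. The helper sum_squared_residuals and the two comparison lines are hypothetical and chosen only for contrast.

```python
import numpy as np

# Sample data from the worked example
x = np.array([3, 1, 3, 5])
y = np.array([5, 8, 6, 4])


def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances between the data and the line b0 + b1*x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))


# Fitted line from the example: y-hat = 8.75 - 1x
sse_fitted = sum_squared_residuals(8.75, -1.0)

# Hypothetical nearby lines, chosen only for comparison
sse_steeper = sum_squared_residuals(8.75, -1.1)
sse_higher = sum_squared_residuals(9.00, -1.0)

print(sse_fitted)                 # 0.75
print(sse_steeper > sse_fitted)   # True
print(sse_higher > sse_fitted)    # True
```

Comparing two alternatives does not prove the minimum, but any other choice of intercept and slope would likewise give a sum larger than 0.75 for these data.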
[Figure: Scatter plot and regression line]


Using the Regression Equation for Predictions
Regression equations are often useful for predicting the value of one variable, given some particular value of the other variable.

• If the regression line fits the data quite well, then it makes sense to use its equation for predictions, provided that we don’t go beyond the scope of the available values.


Using the Regression Equation for Predictions
In predicting a value of y based on some given value of x:
1. If there is not a linear correlation, the best predicted y-value is $\bar{y}$.

2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.
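
The two-case rule can be sketched as a small function. The notes do not say how to decide whether a linear correlation exists, so this sketch assumes one common choice: treat the correlation as present when the two-sided p-value reported by scipy.stats.linregress is below a significance level alpha (0.05 here). The function name best_predicted_y and the alpha default are illustrative assumptions.

```python
import numpy as np
from scipy.stats import linregress


def best_predicted_y(x_sample, y_sample, x_new, alpha=0.05):
    """Return the best predicted y for x_new following the two-case rule.

    Assumption: "there is a linear correlation" is decided by the two-sided
    p-value from linregress being below alpha.
    """
    result = linregress(x_sample, y_sample)
    if result.pvalue < alpha:
        # Linear correlation: substitute x_new into the regression equation
        return result.intercept + result.slope * x_new
    # No linear correlation: fall back to the sample mean of y
    return float(np.mean(y_sample))


# Sample data from the worked example
x = np.array([3, 1, 3, 5])
y = np.array([5, 8, 6, 4])
print(best_predicted_y(x, y, x_new=4))
```

For the example data the p-value falls below 0.05, so the regression branch is used and the prediction at x = 4 is 8.75 - 4 = 4.75. With only four data points that conclusion is fragile, and a stricter alpha would switch the prediction to the mean 5.75.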


[Figure: Procedure for Predicting]


Guidelines for Using the Regression Equation
1. If there is no linear correlation, don’t use the regression equation to make predictions.

2. When using the regression equation for predictions, stay within the scope of the available sample data. If you find a regression equation that relates women’s heights and shoe sizes, it’s absurd to predict the shoe size of a woman who is 10 ft tall.


Guidelines for Using the Regression Equation
3. A regression equation based on old data is not necessarily valid now. The regression equation relating used-car prices and ages of cars is no longer usable if it’s based on data from the 1990s.

4. Don’t make predictions about a population that is different from the population from which the sample data were drawn. If we collect sample data from men and develop a regression equation relating age and TV remote-control usage, the results don’t necessarily apply to women. If we use state averages to develop a regression equation relating SAT math scores and SAT verbal scores, the results don’t necessarily apply to individuals.
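
The guideline about staying within the scope of the sample data can be enforced in code. The sketch below is one possible guard, not a standard library feature: the hypothetical helper predict_within_scope refuses to predict when the requested x lies outside the range of the sampled x-values. The other guidelines (outdated data, a different population) depend on how the data were collected and cannot be checked this way.

```python
import numpy as np
from scipy.stats import linregress


def predict_within_scope(x_sample, y_sample, x_new):
    """Predict y for x_new, refusing to extrapolate beyond the sampled x-values."""
    x_sample = np.asarray(x_sample, dtype=float)
    x_min, x_max = x_sample.min(), x_sample.max()
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the sampled range [{x_min}, {x_max}]"
        )
    result = linregress(x_sample, y_sample)
    return result.intercept + result.slope * x_new


# Sample data from the worked example
x = [3, 1, 3, 5]
y = [5, 8, 6, 4]

print(predict_within_scope(x, y, 2))   # 6.75

try:
    predict_within_scope(x, y, 10)
except ValueError as error:
    print("Refused:", error)
```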
