Linear Regression Analysis
Dr. Linta Rose
rose.l@incois.gov.in
Recall: Covariance
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$$
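As a minimal sketch, the sample covariance can be computed directly from its definition. The function name `sample_covariance` and the small data set are illustrative assumptions, not part of the slides.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length samples (n - 1 denominator)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical illustrative data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
cov_xy = sample_covariance(xs, ys)
print(cov_xy)
```

A positive value indicates that x and y tend to rise together. The magnitude depends on the units of both variables.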
Correlation coefficient
Pearson's Correlation Coefficient is standardized covariance (unitless):
$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}$$
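A short sketch of Pearson's r, assuming the same n - 1 denominator for the covariance and both variances (so the denominators cancel). The function name `pearson_r` and the data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # The common factor 1 / (n - 1) cancels between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustrative data (not from the slides).
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0])
print(r)
```

Because r is unitless, rescaling x or y (for example, metres to centimetres) leaves it unchanged.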
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -0.6, r = 0 (top row) and r = +1, r = +0.3, r = 0 (bottom row)]
Linear Correlation
[Figure: scatter plots of Y against X; left panels: linear relationships, right panels: curvilinear relationships]
Linear Correlation
[Figure: scatter plots of Y against X; left panels: strong relationships, right panels: weak relationships]
Linear Correlation
[Figure: scatter plot against X showing no relationship]
Linear regression
In correlation, the two variables are treated as equals. In
regression, one variable is treated as the independent (predictor)
variable X and the other as the dependent (outcome) variable Y.
Prediction
If you know something about X, this knowledge helps you predict
something about Y.
Uses of Regression Analysis
Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
These purposes frequently overlap in practice.
What is “Linear”?
Remember this:
Y = mX + B? Here m is the slope.
What’s Slope?
A slope of 2 means that every 1-unit change in X yields a
2-unit change in Y.
Predicted value for an individual…
$$y_i = b_0 + b_1 x_i + \text{random error}_i$$
The part b0 + b1 x_i is fixed: it lies exactly on the line.
The random error follows a normal distribution.
The values of the regression parameters b0 and b1 are
not known. We estimate them from data.
Regression Line
[Figure: Statistical relation between Lot size and Man-Hour. Scatter plot with a point labelled (Xi, Yi); x-axis: Lot size (0 to 90), y-axis: Man-Hour (0 to 180)]
We will write an estimated regression line based on
sample data as
$$\hat{y} = b_0 + b_1 x$$
The method of least squares chooses the values of b0
and b1 that minimize the sum of squared errors
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
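To make the criterion concrete, the sketch below evaluates the sum of squared errors for two candidate lines on a small data set. The function name `sse` and the data are illustrative assumptions. The second line is the least-squares line for this particular data.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared errors of the line y = b0 + b1 * x over the sample."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

sse_guess = sse(0.0, 2.0, xs, ys)   # an arbitrary candidate line
sse_best = sse(-0.5, 2.2, xs, ys)   # the least-squares line for this data
print(sse_guess, sse_best)
```

For this data the arbitrary line gives an SSE of 2.0 and the least-squares line gives 1.8. No other choice of b0 and b1 yields a smaller value.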
Minimise the sum of squares of the errors
Using calculus, solve for b0 and b1 to get the position of the line
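The calculus step can be sketched as follows: set the partial derivatives of SSE with respect to b0 and b1 to zero. This is the standard derivation, written in the notation of these slides. The slides themselves do not spell it out.

```latex
\frac{\partial SSE}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
\qquad
\frac{\partial SSE}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0

% Rearranging gives the two normal equations:
\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i
\qquad
\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2

% The first equation gives b_0 = \bar{y} - b_1 \bar{x}; substituting into the second gives
b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
```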
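$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
or
$$b_1 = r \frac{S_y}{S_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$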
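The three expressions for the slope are algebraically equivalent. The sketch below computes each one so they can be compared numerically. All function names and the data are illustrative assumptions.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope from the deviation form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_raw_sums(xs, ys):
    """Slope from the raw-sums form: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


def slope_from_correlation(xs, ys):
    """Slope as r * S_y / S_x, with sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    return r * s_y / s_x


# Hypothetical illustrative data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1, slope_from_raw_sums(xs, ys), slope_from_correlation(xs, ys))
```

The deviation form is usually the safer choice numerically, since the raw-sums form subtracts two large, nearly equal quantities when the x values are far from zero.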
The Fit Parameters
Define sums of squares:
The quality of fit is parameterized by r², the square of the correlation coefficient r.
$$b_1 = r \frac{S_y}{S_x}$$
Estimation of Mean Response
Fitted regression line can be used to estimate the
mean value of y for a given value of x.
Example
The weekly advertising expenditure (x) and weekly sales
(y) are presented in the following table.
y x
1250 41
1380 54
1425 63
1425 54
1450 48
1300 46
1400 62
1510 61
1575 64
1650 71
Point Estimation of Mean Response
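From the previous table we have:
$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$
The least squares estimates of the regression coefficients
are:
$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} \approx 10.8$$
$$b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.8(56.4) \approx 828$$
(The slope is rounded from about 10.787. The value 828 for b0 is obtained with the unrounded slope.)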
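The sums and estimates above can be checked with a few lines of code. This sketch uses the advertising data from the example table. The variable names are illustrative.

```python
# Weekly sales (y) and weekly advertising expenditure (x) from the example table.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
adv = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

n = len(adv)
sum_x = sum(adv)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in adv)
sum_xy = sum(x * y for x, y in zip(adv, sales))

# Raw-sums form of the least-squares slope, then the intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(n, sum_x, sum_x2, sum_y, sum_xy)
print(round(b1, 1), round(b0))
```

Rounded to one decimal place the slope is 10.8, and the intercept rounds to 828, matching the values on the slide.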
Point Estimation of Mean Response
The estimated regression function is:
$$\hat{y} = 828 + 10.8x$$
Sales = 828 + 10.8 × Expenditure
This means that if the weekly advertising expenditure is
increased by $1, we would expect the weekly sales to
increase by $10.80.
Point Estimation of Mean Response
Fitted values for the sample data are obtained by
substituting the x value into the estimated regression
function.
For example, if the advertising expenditure is $50,
then the estimated sales are:
$$\text{Sales} = 828 + 10.8(50) = 1368$$
This is called the point estimate (forecast) of the mean
response (sales).
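The point estimate is a direct evaluation of the fitted line. A minimal sketch, using the rounded coefficients from the slides (the function name `predict_sales` is an illustrative assumption):

```python
B0 = 828.0   # estimated intercept from the slides (rounded)
B1 = 10.8    # estimated slope from the slides (rounded)


def predict_sales(expenditure):
    """Point estimate of mean weekly sales for a given advertising expenditure."""
    return B0 + B1 * expenditure


forecast = predict_sales(50)
print(round(forecast, 1))
```

The same function gives the fitted value for any x in the sample. Estimates for x far outside the observed range (41 to 71) should be treated with caution.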
Residual
The difference between the observed value y_i and the
corresponding fitted value ŷ_i:
$$e_i = y_i - \hat{y}_i$$
Residuals are highly useful for studying whether a
given regression model is appropriate for the data at
hand.
Example: weekly advertising expenditure
y x y-hat Residual (e)
1250 41 1270.8 -20.8
1380 54 1411.2 -31.2
1425 63 1508.4 -83.4
1425 54 1411.2 13.8
1450 48 1346.4 103.6
1300 46 1324.8 -24.8
1400 62 1497.6 -97.6
1510 61 1486.8 23.2
1575 64 1519.2 55.8
1650 71 1594.8 55.2
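The fitted values and residuals in the table can be reproduced as follows. This sketch uses the rounded coefficients 828 and 10.8, as the table does. The variable names are illustrative.

```python
# Weekly sales (y) and advertising expenditure (x) from the example table.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
adv = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

b0, b1 = 828.0, 10.8  # rounded estimates used in the slides

fitted = [b0 + b1 * x for x in adv]
residuals = [y - y_hat for y, y_hat in zip(sales, fitted)]

# Print one row per observation: y, x, y-hat, residual.
for y, x, y_hat, e in zip(sales, adv, fitted, residuals):
    print(f"{y:6d} {x:4d} {y_hat:8.1f} {e:8.1f}")
```

Positive residuals mark weeks where sales exceeded the fitted line, negative residuals weeks where they fell short.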
Regression Standard Error
Approximately 95% of the observations should fall within
±2 standard errors of the regression from the
regression line. This is also a quick approximation of a
95% prediction interval.
For simple linear regression, the standard error is the square
root of the average squared residual:
$$s_{y.x}^2 = \frac{1}{n-2} \sum e_i^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2$$
To estimate the standard error, use
$$s_{y.x} = \sqrt{s_{y.x}^2}$$
s_{y.x} estimates the standard deviation of the error term in the
statistical model for simple linear regression.
The standard error of Y given X is the average variability around the regression
line at any given value of X. It is assumed to be equal at all values of X.
[Figure: S_{y/x} marked at several values of X]
Regression Standard Error
y x y-hat Residual (e) square(e)
1250 41 1270.8 -20.8 432.64
1380 54 1411.2 -31.2 973.44
1425 63 1508.4 -83.4 6955.56
1425 54 1411.2 13.8 190.44
1450 48 1346.4 103.6 10732.96
1300 46 1324.8 -24.8 615.04
1400 62 1497.6 -97.6 9525.76
1510 61 1486.8 23.2 538.24
1575 64 1519.2 55.8 3113.64
1650 71 1594.8 55.2 3047.04
ŷ = 828 + 10.8x; total of squared residuals = 36124.76
S_{y.x} = sqrt(36124.76 / (10 - 2)) = 67.19818
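The table total and the standard error can be reproduced with the sketch below, which again assumes the rounded coefficients 828 and 10.8. It also applies the ±2 standard error rule of thumb from the earlier slide at x = 50. The variable names are illustrative.

```python
import math

# Weekly sales (y) and advertising expenditure (x) from the example table.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
adv = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

b0, b1 = 828.0, 10.8  # rounded estimates used in the slides
n = len(adv)

# Sum of squared residuals, then divide by n - 2 and take the square root.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(adv, sales))
s_yx = math.sqrt(sse / (n - 2))

# Rough 95% prediction interval at x = 50 using the +/- 2 standard error rule.
y_hat_50 = b0 + b1 * 50
lower, upper = y_hat_50 - 2 * s_yx, y_hat_50 + 2 * s_yx

print(round(sse, 2), round(s_yx, 5))
print(round(lower, 1), round(upper, 1))
```

The divisor is n - 2 rather than n because two parameters, b0 and b1, were estimated from the same data.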
Analysis of Residual
To examine whether the regression model is
appropriate for the data being analyzed, we can
check the residual plots.
Residual plots are:
Plot a histogram of the residuals
Plot residuals against the fitted values.
Plot residuals against the independent variable.
Plot residuals over time if the data are chronological.
Residual plots
The residuals should have no systematic pattern.
The residual plot to the right shows a scatter of the
points with no unusual individual observations
or systematic change as x increases.
[Figure: Degree Days Residual Plot; residuals (-1 to 1) against Degree Days (0 to 60)]
Residual plots
The points in this
residual plot have a
curved pattern, so a
straight line fits poorly.
Residual plots
The points in this plot
show more spread for
larger values of the
explanatory variable x,
so prediction will be less
accurate when x is large.
ANOVA
Analysis of variance (ANOVA) is a statistical technique that is
used to check if the means of two or more groups are
significantly different from each other.
An ANOVA test is a way to find out if survey or experiment
results are significant.
ANOVA compares the samples on the basis of their means.
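One common way to carry out this comparison is the one-way ANOVA F statistic: the ratio of the variability between group means to the variability within groups. The slides do not give this formula, so the sketch below is an assumption about the intended test. The function name and the data are illustrative.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared distance of each mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared distance of each value from its group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical illustrative data (not from the slides).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F suggests that at least one group mean differs from the others. Judging significance requires comparing F with the F distribution on (k - 1, N - k) degrees of freedom.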