03 - Simple Linear Regression

The document provides an overview of regression analysis, focusing on simple linear regression to predict the value of a dependent variable based on one independent variable. It discusses estimating model coefficients, assessing model validity through R², F-tests, and t-tests, and highlights the importance of checking assumptions related to the error term in regression models. Examples, including dog show performance and car sale prices, illustrate the application of these concepts in practical scenarios.

Uploaded by

mukta.vrumol

ECON 251 Research Methods

3. Statistical Inference: Simple Linear Regression and Correlation

Introduction

• In this section we employ regression analysis to examine the relationship among quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable, y) based on the value of other variables (independent variables x1, x2, …, xk).
• To start we will focus on “simple” “linear” regression, which means we are using only one independent variable to explain the behavior of the dependent variable, and the relationship between the two variables is “linear,” not curved.
• In the next section we will relax both of these assumptions, and use “multiple” regression and “curvilinear” relationships.

The Population Regression Model

• The simple linear population regression line: y = β0 + β1x + ε
    y = dependent variable
    x = independent variable
    β0 = y-intercept
    β1 = slope of the line (rise/run)
    ε = error variable
• β0 and β1 are unknown population parameters. We estimate these parameters by taking a sample from the data.
• The estimated sample regression line: ŷ = b0 + b1x
    ŷ = sample regression line
    x = independent variable
    b0 = y-intercept of the sample line
    b1 = slope of the sample line (rise/run)
    e = residual
• Since b0 and b1 are estimates of population parameters based on sample data, they are sample statistics.

Estimating the Model Coefficients

• Given a set of x and y data, there are a number of ways which could be used to formulate the line that best characterizes those data points.
• When we conduct regression analysis, we select the line that minimizes the sum of squared vertical differences between the points and the line.
• We often refer to this technique as “ordinary least squares” (OLS) regression.

Compare two lines, the first upward sloping, the second horizontal

[Figure: the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2), an upward-sloping line, and a horizontal line at y = 2.5.]

Line 1 (upward sloping)
Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = ____

Line 2 (horizontal)
Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = ____

The smaller the sum of squared differences, the better the fit of the line to the data.
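The comparison of the two lines can be checked with a few lines of Python. This is only a sketch: the four points are read off the slide, and the upward-sloping line is assumed to be y = x, which is what the terms (2 − 1), (4 − 2), (1.5 − 3) and (3.2 − 4) imply. The helper name `sum_squared_differences` is made up for this illustration.

```python
# Points read off the slide's figure.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]


def sum_squared_differences(points, predict):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - predict(x)) ** 2 for x, y in points)


# Line 1 (upward sloping): assumed to be y = x, matching the slide's terms.
line1 = sum_squared_differences(points, lambda x: x)

# Line 2 (horizontal): y = 2.5 for every x.
line2 = sum_squared_differences(points, lambda x: 2.5)

print(f"Line 1: {line1:.2f}")
print(f"Line 2: {line2:.2f}")
print("Better fit:", "line 1" if line1 < line2 else "line 2")
```

Neither line is necessarily the least squares line; the comparison only shows how the criterion ranks two candidate lines.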

Calculating coefficients

• To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:
    b1 = cov(x, y) / sx²
    b0 = ȳ − b1·x̄
• The regression equation that estimates the equation of the first order linear model is:
    ŷ = b0 + b1x

Example – Dog Show

• A dog trainer has noticed that his dog Chip seems to perform better in dog shows if he has been given “speedy biscuits, the biscuit with a zip!” prior to the show. To examine this possible relationship, he records the number of biscuits Chip consumed prior to each show, and the score (1-10, with 10 being the highest) from each show.
  • Find b0
  • Find b1
  • Interpret the slope coefficient.

Example – Dog Show

Show    # of dog biscuits (xi)   Show score (yi)   xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1       2                        4
2       6                        8
3       4                        7
4       8                        9
Total
Avg

cov(x, y) = ____        sx² = ____

Example – Dog Show

[Figure: scatter plot of show score (y) against number of biscuits (x) with the fitted regression line.]

• This is the ____ of the line. For each additional biscuit given to Chip, the estimated score Chip receives increases by __.
• This is the intercept. Don’t interpret as the estimated score Chip receives if he is given ____ biscuits.
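The table for the dog show example can be filled in programmatically. The sketch below assumes the usual sample definitions, cov(x, y) = Σ(xi − x̄)(yi − ȳ)/(n − 1) and sx² = Σ(xi − x̄)²/(n − 1), and then applies b1 = cov(x, y)/sx² and b0 = ȳ − b1·x̄. The variable names are illustrative only.

```python
# Dog show data from the table: biscuits eaten (x) and show score (y).
biscuits = [2, 6, 4, 8]
scores = [4, 8, 7, 9]

n = len(biscuits)
x_bar = sum(biscuits) / n
y_bar = sum(scores) / n

# Column totals of the table: cross products and squared deviations.
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(biscuits, scores))
sum_xx = sum((x - x_bar) ** 2 for x in biscuits)

# Sample covariance and sample variance (divide by n - 1).
cov_xy = sum_xy / (n - 1)
var_x = sum_xx / (n - 1)

# Least squares estimates of the slope and the intercept.
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

print(f"x-bar = {x_bar}, y-bar = {y_bar}")
print(f"cov(x, y) = {cov_xy:.3f}, sx^2 = {var_x:.3f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
```

Because the (n − 1) divisors cancel, the slope can equally be computed as the ratio of the two column totals.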

Example: Odometer Reading

• A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
• A random sample of 100 cars is selected, and the data recorded.
• Find the regression line.

[Data table: the odometer reading is the independent variable x; the selling price is the dependent variable y.]

Using the computer

• Tools > Data analysis > Regression > [Shade the y range and the x range] > OK

[Figure: scatter plot of price against odometer reading with the fitted line. The line meets the vertical axis at 6533, but there are no data near an odometer reading of 0.]

• The intercept is b0 = 6533. Do not interpret the intercept as the “price of cars that have not been driven.”
• The slope of the line is b1 = −0.0312. For each additional mile on the odometer, the price decreases by an average of $0.0312.

Example – Armani’s Pizza

• Armani’s Pizza is considering locating at the OSU campus. To do their financial analysis, they first need to estimate sales for their product.
• They have data from their existing 10 locations on other college campuses.

Example – Armani’s Pizza

• Begin by plotting the data,
• followed by calculating the statistics of your regression,
• and interpret the results.

Example – Armani’s Pizza Excel Output

[Excel regression output]

Model Assessment

• Once our model is estimated, the next step is to assess how successful we have been in accomplishing our objective.
• Recall that our principal objective in regression analysis is to understand the behavior of the dependent variable.
• There are three means by which we will make this assessment:
  1. R² (Coefficient of Determination)
  2. F-Test for Overall Validity of the Model
     - this test becomes much more important when using multiple regression.
  3. T-test for the Slope
     - using b1 (estimate of the slope)

1. R² (Coefficient of Determination) to Evaluate Model

• When we want to measure the strength of the linear relationship, we use the coefficient of determination.
• To understand the significance of this coefficient note:
    SST = SSR + SSE
    Overall variability in y: Sum of Squares Total (SST)
    = The regression model: Sum of Squares Regression (SSR)
    + The error: Sum of Squares for Error (SSE)

Two data points (x1, y1) and (x2, y2) of a certain sample are shown.

[Figure: the two points and the regression line, with y1 and y2 marked on the vertical axis and x1 and x2 on the horizontal axis.]

Total variation in y = Variation explained by the regression line + Unexplained variation (error)
SST = SSR + SSE

Variation in y (SST) = SSR + SSE

• R² measures the proportion of the variation in y that is explained by the variation in x.
• R² takes on any value between zero and one.
    R² = 1: ______ match between the line and the data points.
    R² = 0: There is ___ linear relationship between x and y.

Example – Car Sale Price

• Find the Coefficient of Determination for the Car Sale Price Example. What does it tell you about the model?
• Solution
  • Using the computer
    - From the regression output we have [Excel regression output]

Example – Armani Pizza

• Find the Coefficient of Determination for the Armani’s Pizza example; what does it tell you about the model? You are given values of SSR = 81702.499, and SSE = 6331.901.
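Since R² is the share of the total variation in y that the regression explains, it can be computed from the two sums of squares given for the Armani’s Pizza example. The sketch below assumes the standard definition R² = SSR / SST with SST = SSR + SSE.

```python
# Sums of squares given for the Armani's Pizza example.
ssr = 81702.499  # variation explained by the regression
sse = 6331.901   # unexplained variation (error)

# Total variation in y.
sst = ssr + sse

# Coefficient of determination: proportion of variation explained.
r_squared = ssr / sst

print(f"SST = {sst:.3f}")
print(f"R^2 = {r_squared:.4f}")
```

A value close to one says that most of the variation in sales across the existing locations is explained by the variation in the independent variable.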

Example – Armani Pizza

• Solving with Excel

[Excel regression output]

2. The F-Test for Overall Validity of the Model

• We will rely much more heavily on this test when we introduce multiple regression. This is because when doing simple regression this F-test exactly duplicates the results of the t-test for slope we will do next. Because it approaches the question differently, however, it remains useful.
• The multiple regression version of this test asks:
  • Is there at least one independent variable linearly related to the dependent variable?
• To answer the question, we test the hypothesis:
    H0: β1 = β2 = … = βk = 0
    H1: At least one βi is not equal to zero.
• If at least one βi is not equal to zero, the model is valid.
• In simple regression, the test simplifies to:
    H0: β1 = 0
    H1: β1 ≠ 0
• If you reject H0 in favor of H1, the model is valid.

F Distribution

• Non-negative, positively skewed distribution
• The center of this distribution is approximately 1. The inverse of a value below 1 on the left side of the distribution gives you the equivalent value above 1 on the right side of the distribution (with the degrees of freedom swapped), and vice versa.
• Degrees of freedom are k for the numerator and n-k-1 for the denominator

[Figure: F distribution starting at 0, with the rejection region of area α to the right of Fα,k,n-k-1.]

Hypothesis Testing

• To test these hypotheses we perform an analysis of variance procedure.
• The F test
  • Construct the F statistic: F = MSR / MSE, where MSR = SSR/k and MSE = SSE/(n-k-1)
  • k is the number of independent variables used in the model
  • Rejection region: F > Fα,k,n-k-1
• SST = SSR + SSE. Large F results from a large SSR. Then, much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus, the model is valid.
• There is an important caveat to this test. It is always an upper-tail test (despite what H1 would suggest).

Example – Car Sale Price

[Excel ANOVA output, annotated with SSR, SSE, MSR, MSE, F = MSR/MSE, and the p-value for the F-test for overall validity of the model.]

Fα,k,n-k-1 = F0.05,1,100-1-1 = 3.94
F = 182.1056 > 3.94

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. β1 is not equal to zero; thus, x is linearly related to y. This linear regression model is valid.

Also, the p-value (Significance F) = 4.4435E-24. Clearly, α = 0.05 > 4.4435E-24, and the null hypothesis is rejected.

1. Draw the F distribution for this test, labeling as much as possible.
2. Using the Armani Pizza example output (four slides back), conduct the F-test for overall model validity.

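One way to carry out the F-test for the Armani’s Pizza example is sketched below. It uses the SSR and SSE values given earlier, takes n = 10 from the ten existing locations and k = 1 for the single independent variable, and assumes α = 0.05. The critical value F0.05,1,8 = 5.32 is copied from a standard F table rather than computed.

```python
# Inputs for the Armani's Pizza example.
ssr = 81702.499  # sum of squares regression
sse = 6331.901   # sum of squares for error
n = 10           # existing locations in the sample
k = 1            # independent variables in the model

# Mean squares: each sum of squares divided by its degrees of freedom.
msr = ssr / k
mse = sse / (n - k - 1)

# F statistic.
f_stat = msr / mse

# Upper-tail critical value F(0.05; 1, 8), taken from an F table.
F_CRITICAL = 5.32

reject_h0 = f_stat > F_CRITICAL

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}")
print(f"F = {f_stat:.2f}, critical value = {F_CRITICAL}")
print("Reject H0: the model is valid" if reject_h0 else "Do not reject H0")
```

The comparison is one-sided by construction: only large values of F count as evidence against H0.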
3. T-Test for the slope

• When no linear relationship exists between two variables, the regression line should be horizontal.

[Figure: two scatter plots.]
    Linear relationship. Different inputs (x) yield different outputs (y). The slope is not equal to zero.
    No linear relationship. Different inputs (x) yield the same output (y). The slope is equal to zero.

Hypothesis Testing

• We can test if the slope is non-zero in simple regression.
  • We can draw inference about β1 from b1 by testing
    - H0: β1 = 0
    - H1: β1 ≠ 0 (or < 0, or > 0)
  • The test statistic is t = (b1 − β1) / sb1, where sb1 is the standard error of b1.
  • If the error variable is normally distributed, the statistic is Student t distributed with d.f. = ____.
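As an illustration, the slope test can be run by hand on the four dog show observations. This sketch assumes the usual simple-regression formulas: the standard error of estimate is sε = sqrt(SSE/(n − 2)), the standard error of the slope is sb1 = sε / sqrt(Σ(xi − x̄)²), and the statistic has n − 2 degrees of freedom. The two-tail critical value t0.025,2 = 4.303 is copied from a t table.

```python
import math

# Dog show data: biscuits eaten (x) and show score (y).
biscuits = [2, 6, 4, 8]
scores = [4, 8, 7, 9]
n = len(biscuits)

x_bar = sum(biscuits) / n
y_bar = sum(scores) / n
sum_xx = sum((x - x_bar) ** 2 for x in biscuits)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(biscuits, scores))

# Least squares estimates.
b1 = sum_xy / sum_xx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals and standard error of estimate.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(biscuits, scores))
s_e = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic under H0: beta1 = 0.
s_b1 = s_e / math.sqrt(sum_xx)
t_stat = (b1 - 0) / s_b1

# Two-tail critical value t(0.025; n - 2 = 2), taken from a t table.
T_CRITICAL = 4.303
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"b1 = {b1:.2f}, s_b1 = {s_b1:.4f}, t = {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With only four observations the critical value is large, so the test is demanding even when the fit looks good.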

Example – Car Sale Price

• Solving by hand
  • To compute “t” we need the values of b1 and sb1.
• Using the computer

[Excel regression output]

There is _________ evidence to infer that the odometer reading affects the auction selling price.

Example – Armani Pizza

• Evaluate the model used in the Armani’s Pizza example by testing the value of the slope. You are given:
  • H0: β1 = 0
  • H1: β1 ≠ 0
• Test statistic is ____ extreme than the critical value, therefore, _____________________________________________

Example – Armani Pizza

• Solving with Excel
• How does the result from the t-test for the slope compare to the F-test for the overall validity of the model?

Example – Artificial Data

• Run regressions on the following data sets. Calculate slope, intercept and R² in each case. Which model is the best?
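The link between the two tests can be checked numerically with the car sale price figures quoted earlier. In simple regression the F statistic equals the square of the slope's t statistic, and the same holds for the critical values. The sketch below takes F = 182.1056 and F0.05,1,98 = 3.94 from the slides; everything else is derived from them.

```python
import math

# Values quoted for the car sale price example.
f_stat = 182.1056   # F statistic from the ANOVA output
f_critical = 3.94   # F(0.05; 1, 98)

# With one independent variable, F = t**2, so |t| = sqrt(F).
t_abs = math.sqrt(f_stat)

# The same relation ties the critical values together:
# sqrt(F(0.05; 1, 98)) matches the two-tail value t(0.025; 98).
t_critical = math.sqrt(f_critical)

print(f"|t| implied by F: {t_abs:.4f}")
print(f"t critical implied by F critical: {t_critical:.3f}")
print("Same decision from both tests:",
      (f_stat > f_critical) == (t_abs > t_critical))
```

The sign of t is lost when squaring; for the car data the slope is negative, so the t statistic itself is negative.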

Assumption Violations & Other Dangers

• The error (ε) term is a critical part of the regression model. There are three requirements involving the distribution of ε which must be satisfied for our model results to be valid.
  - The probability distribution of ε is normal, with a mean of 0.
  - The standard deviation of ε is σε for all values of x.
  - The set of errors associated with different values of y are all independent.
• Other assumptions that when violated can threaten the usefulness of your results:
  - No unnecessary outliers
  - No serious multicollinearity
    - Only a possible problem in multiple regression. We will address later.

Residual Analysis

• Examining the residuals (or standardized residuals), we can identify violations of the required conditions.
• Start by graphing the residuals.
  • Create a histogram of the residuals (assumption #1).
  • Place the residuals on the vertical axis, and graph against the predicted values of y (ŷ) (assumption #2), and if a time series, against time (assumption #3).

1. Normality of errors.
  • Use Excel to obtain the standardized residual histogram.
  • Examine the histogram and look for a bell shaped diagram with mean close to zero.

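The residuals that these checks rely on are simple to compute once the line has been fitted. The sketch below does so for the dog show data and standardizes each residual by dividing by the standard error of estimate, sqrt(SSE/(n − 2)). That scaling is an assumption: software packages differ slightly in how they standardize. It also applies the customary |standard residual| > 2 screen for outliers.

```python
import math

# Dog show data: biscuits eaten (x) and show score (y).
biscuits = [2, 6, 4, 8]
scores = [4, 8, 7, 9]
n = len(biscuits)

x_bar = sum(biscuits) / n
y_bar = sum(scores) / n
sum_xx = sum((x - x_bar) ** 2 for x in biscuits)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(biscuits, scores))
b1 = sum_xy / sum_xx
b0 = y_bar - b1 * x_bar

# Predicted values and residuals (observed minus predicted).
predicted = [b0 + b1 * x for x in biscuits]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

# Standardize by the standard error of estimate (assumed scaling).
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / s_e for e in residuals]

# Flag observations whose standardized residual exceeds 2 in magnitude.
suspects = [i + 1 for i, z in enumerate(standardized) if abs(z) > 2]

for y_hat, e, z in zip(predicted, residuals, standardized):
    print(f"y-hat = {y_hat:.1f}  residual = {e:+.2f}  standardized = {z:+.2f}")
print("Suspected outliers (show numbers):", suspects)
```

Plotting `residuals` against `predicted` gives the spread check for constant variance, and a histogram of `standardized` gives the normality check.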
1. Normality of errors

[Figure: a partial list of standard residuals.]

2. Constant Variance of Errors (homoskedasticity)

• The residuals, our estimate of the errors, seem to have an approximately equal “spread” around the regression line.

[Figure: residuals plotted against ŷ, with roughly constant spread.]

2. Constant Variance of Errors (homoskedasticity)

• When the requirement of a constant variance is violated we have heteroskedasticity.
• In this case, the “spread” of the residuals varies at different points along the regression line.

[Figure: residuals plotted against ŷ. The spread increases with ŷ.]

3. Independence of errors (Non-Autocorrelation)

• A time series exists if the data is collected over time. When the data is a time series, error terms are frequently related to one another, thereby violating the independence assumption.
• When consecutive error terms measured across time are correlated with one another, the errors are autocorrelated (sometimes referred to as serially correlated).
• Examining the residuals over time, no pattern should be observed if the errors are independent.

3. Independence of errors (Non-Autocorrelation)

• Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

[Figure: two plots of residuals against time.]
    Note the runs of positive residuals, replaced by runs of negative residuals.
    Note the oscillating behavior of the residuals around zero.

4. No unnecessary outliers

• An outlier is an observation that is unusually small or large.
• Several possibilities need to be investigated when an outlier is observed:
  • There was an error in recording the value.
  • The point does not belong in the sample.
  • The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier if its |standard residual| > 2.
• Once identified, the cause will determine the appropriate response. If it is either the first or second situation, delete it, and re-estimate your model. If, on the other hand, it appears to be a valid observation, keep it in, but recognize the impact it is having on your model.

4. No unnecessary outliers

[Figure: two scatter plots, one labeled “An outlier” and one labeled “An influential observation.”]

• The outlier causes a shift in the regression line
• … but, some outliers may be very influential

5. No serious multicollinearity

• Multicollinearity is defined as a linear association between independent variables used in a regression.
• Obviously, this cannot be a problem when we are conducting simple linear regression, as there is only one independent variable.
• We will introduce this topic when doing multiple regression. At that point you will see that when we do multiple regression, multicollinearity is a common problem.

Applying the Regression Equation

• Once we have assessed how well the model fits the data, and verified the assumptions hold, we are then ready to use our model to:
  • Estimate the value of the dependent variable based on the value of the independent variable.
  • Establish confidence intervals and prediction intervals for our estimates of the value of the dependent variable.
  • Interpret the slope of the regression, and thereby gain insight into the relationship between our independent variable and our dependent variable.

Estimating the value of y

• Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer.
• Predict Chip’s score in the dog show if he is given 5 biscuits.
• Predict the amount of annual pizza sales for Armani at OSU (assume 36,000 students).
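A point estimate is obtained by substituting the given x into the fitted line ŷ = b0 + b1x. The sketch below does this for the first two questions. The car coefficients (b0 = 6533, b1 = −0.0312) are the ones quoted for the odometer example; the dog show coefficients (b0 = 3, b1 = 0.8) are an assumption obtained by applying the least squares formulas to the four shows. The Armani’s Pizza coefficients are not given in the text, so that prediction is left out. The function name `predict` is illustrative.

```python
def predict(b0, b1, x):
    """Point estimate of y from the fitted line y-hat = b0 + b1 * x."""
    return b0 + b1 * x


# Car example: coefficients quoted on the odometer slide.
car_price = predict(6533, -0.0312, 40_000)

# Dog show example: coefficients from a least squares fit to the
# four shows (assumed here, not quoted in the slides).
chip_score = predict(3.0, 0.8, 5)

print(f"Predicted price at 40,000 miles: ${car_price:,.2f}")
print(f"Predicted score with 5 biscuits: {chip_score:.1f}")
```

Both x values lie inside the range of the sample data, which is where a fitted line can be trusted most.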

Prediction interval (and confidence interval)

• We frequently want more than just a point estimate for our estimate of the dependent variable.
• Two types of intervals can be used to discover how closely the predicted value will match the true value of y.
  • Prediction interval - for a particular value of y,
  • Confidence interval - for the expected value of y (we won’t cover this one)

Example – Car Sale Price (“prediction” interval)

• Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
• Solution
  • The dealer would like to predict the price of a single car, so a prediction interval is used.
  • The prediction interval (95%) uses t.025,98.

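The slides' own interval formula did not survive, so the following is a sketch under the textbook formula for a simple regression prediction interval: ŷ ± tα/2,n−2 · sε · sqrt(1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²), where xg is the given value of x. It is applied to the dog show data at xg = 5 biscuits, with t0.025,2 = 4.303 copied from a t table. The variable names are illustrative.

```python
import math

# Dog show data: biscuits eaten (x) and show score (y).
biscuits = [2, 6, 4, 8]
scores = [4, 8, 7, 9]
n = len(biscuits)

x_bar = sum(biscuits) / n
y_bar = sum(scores) / n
sum_xx = sum((x - x_bar) ** 2 for x in biscuits)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(biscuits, scores))
b1 = sum_xy / sum_xx
b0 = y_bar - b1 * x_bar

# Standard error of estimate.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(biscuits, scores))
s_e = math.sqrt(sse / (n - 2))

# Point estimate at the given x.
x_given = 5
y_hat = b0 + b1 * x_given

# Half-width of the 95% prediction interval for a single observation.
T_CRITICAL = 4.303  # t(0.025; n - 2 = 2), taken from a t table
half_width = T_CRITICAL * s_e * math.sqrt(
    1 + 1 / n + (x_given - x_bar) ** 2 / sum_xx
)

lower, upper = y_hat - half_width, y_hat + half_width
print(f"y-hat = {y_hat:.2f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

The interval is wide because n = 4 leaves only two degrees of freedom; the upper limit even exceeds the maximum possible score of 10, a reminder that the interval comes from the model rather than from the scoring rules.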
Example – Armani Pizza

• Provide a 95% interval for your Armani’s Pizza estimate from above (36,000 students). You are interested in an interval estimate for a single store, and are given the following:
