DAM Class 21-24 Regression Analysis
Regression Analysis
Class 21, 22, 23, and 24
(Doane, Seward, and Chowdhury, Chapter 13)
(Jaggia and Kelly, Chapter 14)
Prof. Arnab Adhikari
Simple Linear Regression
1
Scatter Plots
➢ Scatter plots can convey patterns in data pairs that would not be apparent from
a table.
➢A scatter plot is a starting point for bivariate data analysis in which we
investigate the association and relationship between two quantitative variables.
2
Scatter Plots
➢A scatter plot is typically the precursor to more complex analytical techniques.
➢Example: the relationship between the average price per gallon of regular unleaded
gasoline and the average price per gallon of premium gasoline across all 50 states.
➢The figure below shows a positive association between the two prices.
3
Relationship between
Correlation and Causation
Feature | Correlation | Causal Relationship
Definition | Variables move together | One variable causes the other
Directionality | No assumption made | A leads to B
Proof required | Simple statistical analysis | Experimental or longitudinal data
Example | A company finds that higher advertising spend is correlated with higher sales: as advertising increases, sales tend to increase, but this does not prove that the ads caused sales to go up. Both could be related through a third factor (e.g., seasonal demand). | A/B testing (split testing) shows that changing the "Buy Now" button color from blue to red increases conversions by 15%; this is strong evidence that the button color change caused more users to buy. Causation is typically established through experiments such as randomized controlled trials (RCTs) or A/B tests.
4
Correlation Coefficient, r
➢ A visual display is a good first step in analysis, but we would also like to quantify
the strength of the association between two variables.
➢Therefore, accompanying the scatter plot is the sample correlation coefficient (also
called the Pearson correlation coefficient.)
➢This statistic measures the degree of linearity in the relationship between two
random variables X and Y and is denoted r.
➢Its value will fall in the interval [−1, 1].
➢When r is near 0, there is little or no linear relationship between X and Y.
➢An r value near +1 indicates a strong positive relationship, while an r value near −1
indicates a strong negative relationship.
5
Covariance
➢The covariance of two random variables X and Y (denoted σXY ) measures the
degree to which the values of X and Y change together.
➢The units of measurement for the covariance are unpredictable because the
magnitude and/or units of measurement of X and Y may differ. For this reason,
analysts generally work with the correlation coefficient, which is a standardized
value of the covariance that ensures a range between −1 and +1.
6
Correlation Coefficient, r
➢ Conceptually, a correlation coefficient is the covariance divided by the product of
the standard deviations (denoted σX and σY for a population or sX and sY for a sample).
For a population, the correlation coefficient is indicated by the lowercase Greek letter
ρ (rho), while for a sample we use the lowercase Roman letter r.
➢The sample correlation coefficient is a statistic that describes the degree of linearity
between paired observations on two quantitative variables X and Y.
-1 ≤ r ≤ +1.
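A minimal numerical sketch of this definition, assuming Python with NumPy rather than the Excel sheets used in class, and borrowing the study-hours data from Problem 2 later in this deck:

```python
# Sketch: r as the covariance divided by the product of the standard deviations.
import numpy as np

x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])      # study hours (Problem 2)
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])   # exam scores (Problem 2)

s_xy = np.cov(x, y, ddof=1)[0, 1]                         # sample covariance
r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r, np.corrcoef(x, y)[0, 1])                         # both ≈ 0.628
```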
7
Tests for Significant Correlation
Using Student’s t
➢ Step 1: State the Hypothesis and determine test direction: One-tailed vs. Two-
tailed
Left-tailed test: H0: ρ = 0 vs. H1: ρ < 0
Two-tailed test: H0: ρ = 0 vs. H1: ρ ≠ 0
Right-tailed test: H0: ρ = 0 vs. H1: ρ > 0
ρ = 0 means the population correlation coefficient equals zero.
8
Tests for Significant Correlation
Using Student’s t
➢Step 4: Make the Decision. Reject H0 if tcalc falls in the rejection region (for a two-tailed test, if |tcalc| > tα/2 with d.f. = n − 2) or, equivalently, if the p-value is less than α.
9
Problem 1
In its admission decision process, a university’s MBA programme examines an
applicant’s GMAT scores, comprising verbal and quantitative components. Thirty
MBA applicants are randomly selected from 1,961 MBA applicant records at a
public university in the Midwest. The scatter plot of the two components yields a
sample correlation coefficient of r = 0.4356. Is this correlation significant?
10
Problem 1-Solution
H0: ρ = 0
H1: ρ ≠ 0
Test: two-tailed
Sample size (n): 30
Degrees of freedom (n − 2): 28
Level of significance (α): 0.05
Sample correlation coefficient (r): 0.4356
tcalc = r√(n − 2) / √(1 − r²) = 2.561
tcrit = t.025, 28 ≈ 2.048
Critical value approach: |tcalc| > tcrit, so reject the null hypothesis.
p-value approach: two-tailed p-value ≈ 0.016 < 0.05, so reject the null hypothesis.
Interpretation:
The correlation is statistically significant; the admissions office can expect verbal and quantitative scores to vary together for these applicants.
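The same test can be reproduced outside Excel; a hedged sketch in Python with SciPy, using only the r = 0.4356 and n = 30 reported on the slide:

```python
# Sketch: t test for zero correlation, t = r*sqrt(n-2)/sqrt(1-r^2).
import math
from scipy import stats

r, n = 0.4356, 30
df = n - 2
t_calc = r * math.sqrt(df) / math.sqrt(1 - r**2)   # ≈ 2.56
t_crit = stats.t.ppf(1 - 0.05 / 2, df)             # ≈ 2.05 for a two-tailed test at α = .05
p_value = 2 * stats.t.sf(abs(t_calc), df)          # ≈ 0.016

print(t_calc, t_crit, p_value)
# |t_calc| > t_crit and p < .05: reject H0, the correlation is significant.
```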
11
Correlation Coefficient, r
➢The table below shows that, as sample size increases, the critical value of r becomes smaller.
➢Thus, in very large samples, even very small sample correlations can be judged "significant."
➢While a larger sample does give a better estimate of the true value of ρ, a larger sample does
not mean that the correlation is stronger nor does its increased significance imply increased
importance.
➢In large samples, small correlations may be significant, even if the scatter plot shows little
evidence of linearity. Thus, a significant correlation may lack practical importance.
12
Simple Linear Regression
➢ Simple Regression analyzes the relationship between two variables.
➢It specifies one dependent (response) variable and one independent (predictor) variable.
➢The hypothesized relationship here is linear, of the form
➢ Y = y-intercept + slope × X.
➢The response variable is the dependent variable. This is the Y variable.
➢The predictor variable is the independent variable. This is the X variable.
➢Only the dependent variable (not the independent variable) is treated as a random variable.
13
Simple Linear Regression
➢The intercept and slope of an estimated regression can provide useful information.
➢The slope tells us how much, and in which direction, the response variable will
change for each one unit increase in the explanatory variable.
14
Fitting a Regression on a Scatter Plot-
Example
15
Simple Linear Regression Model
➢The assumed model for a linear relationship is
▪ y = β0 + β1x + ε.
➢The relationship holds for all pairs (xi, yi).
➢The error term ε is not observable and is assumed to be independently normally distributed
with a mean of 0 and a standard deviation σ.
➢The unknown parameters are:
▪ β0 (intercept)
▪ β1 (slope)
➢The fitted regression model, ŷ = b0 + b1x, is used to predict the expected value of Y for a given
value of X.
➢The fitted coefficients are
▪ b0 the estimated intercept
▪ b1 the estimated slope
16
Ordinary Least Squares (OLS) Formulas
Slope and Intercept
➢The ordinary least squares (OLS) method estimates the slope and intercept of the
regression line so that the sum of squared residuals is minimized, which ensures the best fit.
➢The sum of the residuals = 0.
➢The sum of the squared residuals is SSE.
17
Ordinary Least Squares (OLS) Formulas
➢The OLS estimator for the slope is
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², or equivalently b1 = r(sy / sx),
and the estimator for the intercept is b0 = ȳ − b1x̄.
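As an illustration, these formulas can be evaluated directly; a minimal Python/NumPy sketch (an assumption of this note, not part of the Excel workbook), using the Problem 2 study-hours data from later in the deck:

```python
# Sketch: OLS slope and intercept computed from the textbook formulas.
import numpy as np

x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])       # study hours
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])    # exam scores

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                              # line passes through (x̄, ȳ)
print(b1, b0)   # ≈ 1.9641 and 49.4771, matching the Excel output shown later
```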
18
Sources of Variation in Y
➢ In a regression, we seek to explain the variation in the dependent variable around its mean.
We express the total variation as a sum of squares (denoted SST): SST = Σ(yi − ȳ)².
19
Sources of Variation in Y
➢The explained variation in Y (denoted SSR) is the sum of the squared differences between the
conditional mean ŷi (conditioned on a given value xi) and the unconditional mean ȳ (the same for all
xi): SSR = Σ(ŷi − ȳ)².
➢The unexplained variation in Y (denoted SSE) is the sum of squared residuals, sometimes referred
to as the error sum of squares: SSE = Σ(yi − ŷi)².
➢If the fit is good, SSE will be relatively small compared to SST. If each observed data value yi is
exactly the same as its estimate ŷi (i.e., a perfect fit), then SSE will be zero. There is no upper limit
on SSE.
20
Ordinary Least Squares (OLS) Formulas
The OLS formulas give unbiased and consistent estimates of β0 and β1. The
OLS regression line always passes through the point (x̄, ȳ) for any data, as
illustrated in the figure below.
21
Prediction Using Regression
➢ One of the main uses of regression is to make predictions.
➢Once we have a fitted regression equation that shows the estimated relationship
between X (the independent variable) and Y (the dependent variable), we can
plug in any value of X to obtain the prediction for Y.
Expression | Prediction question
Sales = 268 + 7.37 Ads | If the firm spends $10 million on advertising, what is its predicted sales?
DrugCost = 410 + 550 Dependents | If an employee has four dependents, what is the predicted annual drug cost?
Rent = 150 + 1.05 SqFt | What is the predicted rent on an 800-square-foot apartment?
MPG = 49.22 − 0.079 Horsepower | If an engine has 200 horsepower, what is the predicted fuel efficiency?
22
Simple Linear Regression-
Examples
Expression | Explanation
Sales = 268 + 7.37 Ads | Each extra $1 million of advertising will generate $7.37 million of sales on average. The firm would average $268 million of sales with zero advertising. However, the intercept may not be meaningful because Ads = 0 may be outside the range of observed data.
DrugCost = 410 + 550 Dependents | Each extra dependent raises the mean annual prescription drug cost by $550. An employee with zero dependents averages $410 in prescription drug costs.
Rent = 150 + 1.05 SqFt | Each extra square foot adds $1.05 to monthly apartment rent. The intercept is not meaningful because no apartment can have SqFt = 0.
MPG = 49.22 − 0.079 Horsepower | Each unit increase in engine horsepower decreases fuel efficiency by 0.079 miles per gallon. The intercept is not meaningful because a zero-horsepower engine does not exist.
23
Prediction Using Regression
Expression | Prediction
Sales = 268 + 7.37 Ads | If the firm spends $10 million on advertising, its predicted sales are Sales = 268 + 7.37(10) = 341.7.
DrugCost = 410 + 550 Dependents | If an employee has four dependents, the predicted annual drug cost is DrugCost = 410 + 550(4) = 2,610.
Rent = 150 + 1.05 SqFt | The predicted rent on an 800-square-foot apartment is Rent = 150 + 1.05(800) = 990.
MPG = 49.22 − 0.079 Horsepower | If an engine has 200 horsepower, the predicted fuel efficiency is MPG = 49.22 − 0.079(200) = 33.42.
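The arithmetic behind the table is just substitution into the fitted equation; a small illustrative Python helper (hypothetical, not from the textbook):

```python
# Sketch: plugging an X value into a fitted bivariate regression equation.
def predict(intercept: float, slope: float, x: float) -> float:
    return intercept + slope * x

print(predict(268, 7.37, 10))        # Sales ≈ 341.7
print(predict(410, 550, 4))          # DrugCost = 2,610
print(predict(150, 1.05, 800))       # Rent = 990
print(predict(49.22, -0.079, 200))   # MPG ≈ 33.42
```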
24
Prediction Using Regression
➢Predictions from our fitted regression model are stronger within the range of our
sample x values.
➢The relationship seen in the scatter plot may not be true for values far outside
our observed x range.
➢Extrapolation outside the observed range of x is always tempting but should be
approached with caution.
25
Simple Linear Regression-
Cause and Effect?
➢ When interpreting a regression model, one must remember that the proposed
relationship between the explanatory variable and response variable is not an
assumption of causation.
➢One cannot conclude that the explanatory variable causes a change in the response
variable.
➢Example
• Crime Rate = 0.125 + 0.031 Unemployment Rate
➢The slope value, 0.031, means that for each one-unit increase in the unemployment
rate, we expect to see an increase of .031 in the crime rate.
➢Does this mean being out of work causes crime to increase?
➢No, there are many lurking variables that could further explain the change in crime
rates (e.g., poverty rate, education level, or police presence.)
➢When we propose a regression model, we might have a causal mechanism in mind, but
cause and effect are not proven by a simple regression. We cannot assume that the
explanatory variable is “causing” the variation we see in the response variable.
26
Fitting a Regression on a
Scatter Plot
➢ Slope Interpretation:
▪The slope of -0.0785 says that for each
additional unit of engine horsepower, the
miles per gallon decreases by 0.0785 mile.
▪ This estimated slope is a statistic because a
different sample might yield a different
estimate of the slope.
➢Intercept Interpretation:
▪The intercept value of 49.216 suggests that
when the engine has no horsepower, the
fuel efficiency would be quite high.
▪However, the intercept has little meaning in
this case, not only because zero horsepower
makes no logical sense, but also because
extrapolating to x = 0 is beyond the range of
the observed data.
27
Regression Caveats
➢The “fit” of the regression does not depend on the sign of its slope. The sign of the
fitted slope merely tells whether X has a positive or negative association with Y.
➢View the intercept with skepticism unless x = 0 is logically possible and is within
the observed range of X.
➢Regression does not demonstrate cause and effect between X and Y. A good fit
only shows that X and Y vary together. Both could be affected by another variable
or by the way the data are defined.
28
Assessing Fit- Coefficient of Determination
➢Because the magnitude of SSE is dependent on sample size and on the units of
measurement (e.g., dollars, kilograms, ounces), we want a unit-free benchmark to
assess the fit of the regression equation.
➢We can obtain a measure of relative fit by comparing SSR to SST. Recall that the total
variation in Y can be decomposed as SST = SSR + SSE, so dividing by SST gives 1 = SSR/SST + SSE/SST.
➢The first proportion, SSR/SST, has a special name: the coefficient of determination, or R2.
You can calculate this statistic in two ways: R2 = SSR/SST = 1 − SSE/SST.
29
Assessing Fit- Coefficient of Determination
➢The range of the coefficient of determination is 0 ≤ R2 ≤ 1. The highest possible R2 is 1
because, if the regression gives a perfect fit, then SSE = 0
➢The lowest possible R2 is 0 because, if knowing the value of X does not help predict
the value of Y, then SSE = SST.
➢Because a coefficient of determination always lies in the range 0 ≤ R2 ≤ 1, it is often
expressed as a percent of variation explained.
➢The unexplained variation reflects factors not included in our model or just plain
random variation.
➢In a bivariate regression, R2 is the square of the correlation coefficient r.
➢Thus, if r = .50, then R2 = .25.
➢It is tempting to think that a low R2 indicates that the model is not useful. Yet in some
applications (e.g., predicting crude oil future prices), even a slight improvement in
predictive power can translate into millions of dollars.
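A short sketch (assuming Python/NumPy and the Problem 2 study-hours data from later in the deck) confirming that SSR/SST, 1 − SSE/SST, and r² all give the same R²:

```python
# Sketch: coefficient of determination computed three equivalent ways.
import numpy as np

x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])

b1, b0 = np.polyfit(x, y, 1)             # least-squares slope and intercept
y_hat = b0 + b1 * x
sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained variation
sse = np.sum((y - y_hat) ** 2)           # unexplained variation

print(ssr / sst, 1 - sse / sst)          # both ≈ 0.394
print(np.corrcoef(x, y)[0, 1] ** 2)      # r² gives the same value
```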
30
Test For Significance -Standard Error
of Regression
➢The standard error (se) is an overall measure of model fit: se = √(SSE / (n − 2)).
➢If the fitted model’s predictions are perfect (SSE = 0), then se = 0. Thus, a small se
indicates a better fit.
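A quick check of this formula, using the SSE and n that appear in the Problem 2 Excel output later in the deck (Python is an assumption of this sketch):

```python
# Sketch: standard error of the regression, s_e = sqrt(SSE / (n - 2)).
import math

sse, n = 1568.55879, 10
print(math.sqrt(sse / (n - 2)))   # ≈ 14.00, matching Excel's "Standard Error"
```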
31
Test For Significance -Confidence
Intervals for Slope and Intercept
➢Once we have the standard error se, we construct confidence intervals for the
coefficients as bj ± tα/2 · s(bj), where for a simple regression s(b1) = se / √Σ(xi − x̄)² and
s(b0) = se · √(1/n + x̄² / Σ(xi − x̄)²), with d.f. = n − 2.
32
Hypothesis Tests
➢Is the true slope different from zero? This is an important question because if β1 = 0,
then X is not associated with Y and the regression model collapses to a constant β0 plus a
random error term: y = β0 + ε.
➢ NOTE: The test for zero slope is the same as the test for zero correlation. That is, the t test for zero
slope will always yield exactly the same tcalc as the t test for zero correlation.
33
Analysis of Variance (ANOVA):
Overall Fit
Decomposition of Variance
The decomposition of variance may be written as Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)², that is, SST = SSR + SSE.
34
ANOVA Table
35
Problem 2
Investigate the relationship between the number of hours a student studies (X)
and his/her exam score (Y), where X is an independent variable. Please refer to
the workbook “Problem 2” of the Excel problem sheet “Simple Linear regression
Analysis.”
36
Problem 2-Solution
Regression Statistics
Multiple R | 0.6278
R Square | 0.3941
Adjusted R Square | 0.3184
Standard Error | 14.0025
Observations | 10

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 1020.341 | 1020.341 | 5.204 | 0.0520
Residual | 8 | 1568.559 | 196.070
Total | 9 | 2588.9

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 49.4771 | 10.0665 | 4.9150 | 0.0012 | 26.2638 | 72.6904
Study Hours (X) | 1.9641 | 0.8610 | 2.2812 | 0.0520 | −0.0213 | 3.9495
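For readers working outside Excel, a hedged sketch that reproduces this output with Python's statsmodels (the data values are those listed on the later Problem 2 slides):

```python
# Sketch: simple linear regression of exam score on study hours.
import numpy as np
import statsmodels.api as sm

hours = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
score = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])

model = sm.OLS(score, sm.add_constant(hours)).fit()
print(model.params)      # intercept ≈ 49.477, slope ≈ 1.964
print(model.rsquared)    # ≈ 0.394
print(model.pvalues)     # slope p-value ≈ 0.052, matching "Significance F"
```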
37
Confidence and Prediction
Intervals for Y
➢ The regression line is an estimate of the conditional mean of Y, that is, the expected value of Y for a
given value of X, denoted E(Y | xi).
➢But the estimate may be too high or too low. To make this point estimate more useful, we need
an interval estimate to show a range of likely values.
➢To do this, we insert the xi value into the fitted regression equation, calculate the estimate ŷi, and
use the following formulas:
▪ Confidence interval for the conditional mean of Y: ŷi ± tα/2 · se · √(1/n + (xi − x̄)² / Σ(xi − x̄)²)
▪ Prediction interval for an individual value of Y: ŷi ± tα/2 · se · √(1 + 1/n + (xi − x̄)² / Σ(xi − x̄)²)
➢Prediction intervals are wider than confidence intervals because individual Y values vary more than
the mean of Y.
38
Difference between the
Confidence and Prediction
Intervals for Y
Feature | Confidence Interval (CI) | Prediction Interval (PI)
Definition | Interval estimate for the mean of the dependent variable (Y) at a given X | Interval estimate for an individual future value of Y at a given X
Purpose | Shows how precisely you have estimated the mean response | Shows the range in which a new observation is likely to fall
Width | Narrower | Wider (includes both model uncertainty and individual variability)
Formula basis | Based on the standard error of the mean | Based on the standard error of an individual prediction
Use case | "What's the likely average score for students who study 5 hours?" | "What score might a new student who studies 5 hours get?"
39
Problem 2-Prediction interval
All rows use the fitted equation Predicted Score ŷ = 49.4771 + 1.9641 × Study Hours, a 95% confidence level (α = 0.05), d.f. = n − 2 = 8, t = 2.306, se = 14.0025, and Σ(xi − x̄)² = 264.5.

Study Hours (X) | Exam Score (Y) | (xi − x̄)² | Predicted ŷ | PI lower | PI upper
1 | 53 | 90.25 | 51.44 | 12.68 | 90.21
5 | 74 | 30.25 | 59.30 | 23.71 | 94.88
7 | 59 | 12.25 | 63.23 | 28.65 | 97.80
8 | 43 | 6.25 | 65.19 | 30.96 | 99.42
10 | 56 | 0.25 | 69.12 | 35.24 | 103.00
11 | 84 | 0.25 | 71.08 | 37.20 | 104.96
14 | 96 | 12.25 | 76.97 | 42.40 | 111.55
15 | 69 | 20.25 | 78.94 | 43.91 | 113.96
15 | 84 | 20.25 | 78.94 | 43.91 | 113.96
19 | 83 | 72.25 | 86.80 | 48.96 | 124.63
Total Σ(xi − x̄)² = 264.5
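A hedged check of the first row of the table above (x = 1), computed from the prediction-interval formula with SciPy; all inputs are the values shown on this slide:

```python
# Sketch: 95% prediction interval for an individual exam score at x = 1 study hour.
import math
from scipy import stats

n, x_bar, s_xx = 10, 10.5, 264.5        # sample size, mean hours, Σ(xi − x̄)²
se, b0, b1 = 14.0025, 49.4771, 1.9641   # from the fitted Problem 2 regression
x_new = 1
y_hat = b0 + b1 * x_new                 # ≈ 51.44

t_crit = stats.t.ppf(0.975, n - 2)      # ≈ 2.306
half = t_crit * se * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
print(y_hat - half, y_hat + half)       # ≈ 12.68 to 90.21, matching the first row
```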
40
Residual Tests
➢Three Important Assumptions
▪The errors (residuals) are normally distributed.
▪The errors have constant variance (i.e., they are homoscedastic).
▪The errors are independent (i.e., they are nonautocorrelated).
➢Violation of Assumption 1: Non-normal Errors
▪Non-normality of errors is a mild violation since the regression parameter
estimates b0 and b1 and their variances remain unbiased and consistent.
▪Confidence intervals for the parameters may be untrustworthy because normality
assumption is used to justify using Student’s t distribution.
➢Non-normal Errors
▪A large sample size would compensate.
▪Outliers could pose serious problems.
41
Residual Tests
➢Normal Probability Plot
• The Normal Probability Plot tests the assumption
◦ H0: Errors are normally distributed
◦ H1: Errors are not normally distributed
If H0 is true, the residual probability plot should be linear as shown in the example.
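For a quick visual check, a normal probability (Q-Q) plot can be drawn in a few lines; a sketch assuming Python with statsmodels, SciPy, and Matplotlib, applied to the Problem 2 residuals:

```python
# Sketch: normal probability plot of the Problem 2 residuals.
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

hours = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
score = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
resid = sm.OLS(score, sm.add_constant(hours)).fit().resid

stats.probplot(resid, dist="norm", plot=plt)   # roughly linear points support H0
plt.show()
```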
42
Residual Tests
➢What to Do About Non-Normality?
▪Trim outliers only if they clearly are mistakes.
▪Increase the sample size if possible.
▪Try a logarithmic transformation of both X and Y.
43
Residual Tests
➢Violation of Assumption 2: Nonconstant Variance
The ideal condition is if the error magnitude is constant (i.e., errors are homoscedastic).
▪Heteroscedastic (nonconstant) errors increase or decrease with X.
▪In the most common form of heteroscedasticity, the variances of the estimators are likely to
be understated.
▪This results in overstated t-statistics and artificially narrow confidence intervals.
44
Residual Tests
➢Tests for Heteroscedasticity
▪Plot the residuals against X. Ideally, there is no pattern in the residuals moving from left to right.
▪Although many patterns of non-constant variance might exist, the “fan-out” pattern (increasing
residual variance) is most common.
▪Less frequently, we might see a “funnel-in” pattern, which shows decreasing residual variance.
▪The residuals always have a mean of zero, whether the residuals exhibit homoscedasticity or
heteroscedasticity.
45
Residual Tests
➢Transform both X and Y, for example, by taking logs.
➢Although violation of the constant variance assumption can widen the confidence
intervals for the coefficients, heteroscedasticity does not bias the estimates.
➢Violation of Assumption 3: Autocorrelated Errors
▪Autocorrelation is a pattern of non-independent errors.
▪In a first-order autocorrelation, et is correlated with et-1.
▪The estimated variances of the OLS estimators are biased, resulting in confidence intervals
that are too narrow, overstating the model’s fit.
➢Runs Test for Autocorrelation
▪In the runs test, count the number of residual sign reversals (i.e., how often does the
residual cross the zero centerline?).
▪If the pattern is random, the number of sign changes should be n/2.
▪Fewer than n/2 would suggest positive autocorrelation.
▪More than n/2 would suggest negative autocorrelation.
46
Residual Tests
Durbin-Watson (DW) Test
DW = Σt=2..n (et − et−1)² / Σt=1..n et²; a DW value near 2 indicates little or no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.
47
Problem 2- Durbin-Watson
(DW) Test
Actual Score (Y) = Predicted Score (ŷ) + Residual; the squared-error and difference columns below are computed from the standardized residuals.

Obs | Predicted Score (ŷ) | Residual | Standardized Residual | Actual Score (Y) | et² | et − et−1 | (et − et−1)²
1 | 51.441 | 1.559 | 0.118 | 53 | 0.014 | n/a | n/a
2 | 59.298 | 14.702 | 1.114 | 74 | 1.240 | 0.996 | 0.991
3 | 63.226 | −4.226 | −0.320 | 59 | 0.102 | −1.434 | 2.056
4 | 65.190 | −22.190 | −1.681 | 43 | 2.825 | −1.361 | 1.852
5 | 69.118 | −13.118 | −0.994 | 56 | 0.987 | 0.687 | 0.472
6 | 71.082 | 12.918 | 0.979 | 84 | 0.957 | 1.972 | 3.889
7 | 76.974 | 19.026 | 1.441 | 96 | 2.077 | 0.463 | 0.214
8 | 78.938 | −9.938 | −0.753 | 69 | 0.567 | −2.194 | 4.814
9 | 78.938 | 5.062 | 0.383 | 84 | 0.147 | 1.136 | 1.291
10 | 86.795 | −3.795 | −0.287 | 83 | 0.083 | −0.671 | 0.450

Durbin-Watson test statistic = Σ(et − et−1)² / Σ et² ≈ 1.78. None of the residuals is flagged as unusual, and a DW value slightly below 2 indicates at most mild positive autocorrelation.
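The same statistic is available directly in statsmodels; a sketch under the assumption that the Problem 2 data are analyzed in Python rather than Excel:

```python
# Sketch: Durbin-Watson statistic for the Problem 2 residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

hours = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
score = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
resid = sm.OLS(score, sm.add_constant(hours)).fit().resid

print(durbin_watson(resid))   # ≈ 1.78; values near 2 indicate little autocorrelation
```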
48
Unusual Observations
➢Standardized Residuals
▪ In a regression, we look for observations that are unusual.
▪An observation could be unusual because its Y-value is poorly predicted by the regression
model (unusual residual) or because its unusual X-value greatly affects the regression line
(high leverage).
▪Tests for unusual residuals and high leverage are important diagnostic tools in evaluating the
fitted regression.
▪If the standardized residual is 2 or more in absolute value, the observation is classified as unusual.
▪If the standardized residual is 3 or more in absolute value, it is classified as an outlier.
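A minimal sketch of this rule applied to the Problem 2 fit (Python/statsmodels assumed; the scaling here divides by se, so values differ slightly from Excel's standardized-residual column, but the flags agree):

```python
# Sketch: flagging unusual residuals and outliers by the 2/3 rule.
import numpy as np
import statsmodels.api as sm

hours = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
score = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
fit = sm.OLS(score, sm.add_constant(hours)).fit()

z = fit.resid / np.sqrt(fit.mse_resid)    # residuals scaled by the regression standard error
for zi in z:
    label = "outlier" if abs(zi) >= 3 else "unusual" if abs(zi) >= 2 else "ok"
    print(round(zi, 3), label)            # no observation exceeds 2 for this data set
```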
49
High Leverage
➢ A high leverage statistic indicates that the observation is far from the mean of X.
➢Such observations have great influence on the regression estimates because they are
at the “end of the lever.”
➢The leverage for observation i is denoted hi and is calculated as hi = 1/n + (xi − x̄)² / Σ(xi − x̄)².
➢As a rule of thumb for a simple regression, a leverage statistic that exceeds 4/n is
unusual (if xi = x̄, the leverage statistic hi is 1/n, so the rule of thumb is just four
times this value).
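A small sketch of this rule of thumb (Python/NumPy assumed), which reproduces the leverage column in the table on the next slide:

```python
# Sketch: leverage h_i = 1/n + (x_i − x̄)² / Σ(x_i − x̄)², flagged against 4/n.
import numpy as np

x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])   # study hours (Problem 2)
n = len(x)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

for xi, hi in zip(x, h):
    print(xi, round(hi, 4), "unusual" if hi > 4 / n else "usual")
# Only x = 1 (h ≈ 0.441 > 0.4) is flagged, matching the next slide.
```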
50
Problem 2- Leverage Test
Exam Score (Y) Study Hours (X) (xi-Xbar)^2 Leverage
53 1 90.25 0.44120983 unusual
74 5 30.25 0.21436673 usual
59 7 12.25 0.1463138 usual
43 8 6.25 0.12362949 usual
56 10 0.25 0.10094518 usual
84 11 0.25 0.10094518 usual
96 14 12.25 0.1463138 usual
69 15 20.25 0.17655955 usual
84 15 20.25 0.17655955 usual
83 19 72.25 0.3731569 usual
Total 264.5
51
Other Regression Problems
➢Outliers
Outliers may be caused by
▪an error in recording data
▪impossible data
▪an observation that has been influenced by an unspecified “lurking” variable that
should have been controlled but wasn’t.
To fix the problem,
▪ delete the observation(s) if they are clearly erroneous, or
▪ formulate a multiple regression model that includes the lurking variable.
52
Other Regression Problems
➢Model Misspecification
▪If a relevant predictor has been omitted, then the model is misspecified.
▪Use multiple regression instead of bivariate regression.
➢Ill-Conditioned Data
▪Well-conditioned data values are of the same general order of magnitude.
▪Ill-conditioned data have unusually large or small data values and can cause loss
of regression accuracy or awkward estimates.
▪Avoid mixing magnitudes by adjusting the magnitude of your data before
running the regression.
53
Other Regression Problems
➢ Spurious correlation
➢In a spurious correlation, two variables appear related because of the way they
are defined.
▪For example, consider the hypothesis that a state’s spending on education is a
linear function of its prison population. Such a hypothesis seems absurd, and we
would expect the regression to be insignificant. But if the variables are defined
as totals without adjusting for population, we will observe a significant
correlation.
54
Other Regression Problems
➢Model Form and Variable Transforms
▪Sometimes a nonlinear model is a better fit than a linear model.
55
Multiple Linear Regression
56
Introduction
➢Multiple regression is an extension of simple regression to include more than one
independent variable.
➢Limitations of simple regression:
▪ often simplistic
▪ biased estimates if relevant predictors are omitted
▪ lack of fit does not show that X is unrelated to Y if the true model is multivariate
➢Y is the response variable and is assumed to be related to the k predictors (X1, X2, …, Xk)
by a linear equation called the population regression model:
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε
➢The unknown regression coefficients β0, β1, β2, …, βk are parameters and are denoted
by Greek letters.
57
Regression Terminology, continued
The sample estimates of the regression coefficients are denoted by Roman
letters b0, b1, b2, …, bk. The predicted value of the response variable is
denoted ŷ and is calculated by inserting the values of the predictors into
the estimated regression equation:
ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk
58
Regression Terminology,
continued
The figure below illustrates the idea of a multiple regression model. One of our
objectives in regression modeling is to know whether or not we
have a parsimonious model. A parsimonious regression model is a lean model,
that is, one that has only useful predictors. If an estimated coefficient has a
positive (+) sign, then higher X values are associated with higher Y values, and
conversely if an estimated coefficient has a negative sign.
59
Regression Terminology,
continued
To obtain a fitted regression, we need n observed values of the response variable
Y and its proposed predictors 𝑋1 , 𝑋2 , … , 𝑋𝑘 . Graphical examples are shown
below for one and two predictor variables.
60
Data Format
➢A common mistake is to assume that the model with the best fit is preferred.
➢Sometimes a model with a low R2 may give useful predictions, while a
model with a high R2 may conceal problems.
➢The model should therefore be analyzed thoroughly, not judged on fit alone.
61
Four Criteria for Regression
Assessment
➢Four Criteria for Regression Assessment
➢Logic: Is there an a priori reason to expect a causal relationship between the predictors
and the response variable?
➢Fit: Does the overall regression show a significant relationship between the predictors
and the response variable?
➢Parsimony: Does each predictor contribute significantly to the explanation? Are some
predictors not worth the trouble?
➢Stability: Are the predictors related to one another so strongly that regression estimates
become erratic?
62
F Test for Significance
• For a regression with k predictors, the hypotheses to be tested are
H0: All the true coefficients are zero (β1 = β2 = ⋯ = βk = 0)
H1: At least one of the coefficients is nonzero (βj ≠ 0 for at least one j)
63
F Test for Significance
➢When Fcalc is close to 1 the values of MSR and MSE are close in magnitude.
This suggests that none of the predictors provides a good predictive model
for Y (i.e., all βj are equal to 0).
➢When the value of MSR is much greater than MSE, this suggests that at least
one of the predictors in the regression model is significant (i.e., at least one βj is
not equal to 0).
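A quick numerical check of Fcalc = MSR/MSE, using the ANOVA values from the home-price example in Problem 1 later in this deck (SciPy assumed for the p-value):

```python
# Sketch: overall F test from the ANOVA sums of squares.
from scipy import stats

ssr, sse, k, n = 232450.1, 10720.25, 3, 30   # home-price ANOVA values
msr, mse = ssr / k, sse / (n - k - 1)
f_calc = msr / mse
p_value = stats.f.sf(f_calc, k, n - k - 1)
print(f_calc, p_value)   # ≈ 187.9 with p ≈ 1e-17: reject H0, at least one slope is nonzero
```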
➢Coefficient of Determination (R2)
▪R2, the coefficient of determination, is the most common measure of
overall fit.
▪It can be calculated in one of two ways: R2 = SSR/SST = 1 − SSE/SST.
64
F Test for Significance
➢For the home price data, the R2 statistic indicates that 95.6 percent of the variation in
selling price is “explained” by our three predictors
➢While this indicates a very good fit, there is still some unexplained variation.
➢Adding more predictors can never decrease the R2.
➢However, when R2 is already high, there is not a lot of room for improvement.
Adjusted R2
➢It is generally possible to raise the coefficient of determination R2 by including
additional predictors.
➢The adjusted coefficient of determination penalizes the inclusion of useless predictors.
➢The adjusted R2 is always less than the coefficient of determination.
➢For n observations and k predictors, the adjusted R2 is computed as
R2adj = 1 − (1 − R2)(n − 1) / (n − k − 1).
65
F Test for Significance
➢As you add predictors, R2 will not decrease. The adjusted R2 may increase, remain the same, or
decrease, depending on whether the added predictors increase R2 sufficiently to offset
the penalty.
➢If R2adj is substantially smaller than R2, it suggests that the model contains useless
predictors. For the home price data with three predictors, both statistics are similar
(R2 = .956 and R2adj = .951), which suggests that the model does not contain useless
predictors.
➢There is no fixed rule of thumb for comparing R2 and R2adj.
➢A smaller gap between R2 and R2adj indicates a more parsimonious model.
➢A large gap would suggest that if some weak predictors were deleted, a
leaner model would be obtained without losing very much predictive power.
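A one-function sketch of the adjusted R² formula, checked against the home-price values quoted above (Python assumed):

```python
# Sketch: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1).
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.955915, 30, 3))   # ≈ 0.9508, matching the Excel output
```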
66
How Many Predictors?
➢One way to prevent overfitting the model is to limit the number of predictors
based on the sample size.
➢When n/k is small, the R2 no longer gives a reliable indication of fit.
➢Suggested rules are:
➢Evans’ Rule (conservative): n/k ≥ 10 (at least 10 observations per predictor)
➢Doane’s Rule (relaxed): n/k ≥ 5 (at least 5 observations per predictor)
➢If we cannot reject the hypothesis that a coefficient is zero, then the corresponding
predictor does not contribute to the prediction of Y.
67
How Many Predictors?
➢The test statistic for the coefficient of predictor Xj is tcalc = bj / s(bj), with d.f. = n − k − 1.
➢To reject H0 we compare tcalc with the critical value tα (or tα/2) for the chosen hypothesis, or reject when the p-value < α.
➢The confidence interval for coefficient βj is bj ± tα/2 · s(bj).
68
Standard Error
➢The standard error of the regression (se) is another important measure of fit.
➢For n observations and k predictors, se = √(SSE / (n − k − 1)).
69
Problem 1
Investigate how home size (denoted by SqFt), lot size in thousands of square
feet (denoted by LotSize), and number of bathrooms (denoted by Baths)
influence the selling price (denoted by Price), where SqFt, LotSize, and
Baths are independent variables. Please refer to the workbook “Problem
1” of the Excel problem sheet “Multiple Linear Regression Analysis.”
70
Problem 1- solution
Regression Statistics
Multiple R 0.977709
R Square 0.955915
Adjusted R Square 0.950828
Standard Error 20.3056
Observations 30
ANOVA
df SS MS F Significance F
Regression 3 232450.1 77483.36 187.9217 9.74E-18
Residual 26 10720.25 412.3173
Total 29 243170.3
71
What is Multicollinearity?
➢Multicollinearity occurs when the independent variables X1, X2, …, Xm are
intercorrelated instead of being independent.
➢Collinearity occurs if only two predictors are correlated.
➢The degree of multicollinearity is the real concern.
➢Variance Inflation
▪Multicollinearity induces variance inflation when predictors are strongly
intercorrelated.
▪This results in wider confidence intervals for the true coefficients β1, β2, …, βm
and makes the t statistics less reliable.
▪The separate contribution of each predictor in “explaining” the response variable
is difficult to identify.
➢Klein’s Rule: We should worry about the stability of the regression coefficient
estimates only when a pairwise predictor correlation exceeds the multiple
correlation coefficient R
72
Problem 1- Multicollinearity
SqFt LotSize Baths
SqFt 1
LotSize 0.615298 1
Baths 0.686313 0.380535 1
Regression Statistics
Multiple R 0.977709
R Square 0.955915
Adjusted R Square 0.950828
Standard Error 20.3056
Observations 30
73
Variance Inflation Factor
(VIF)
➢The matrix scatter plots and correlation matrix only show correlations between
any two predictors.
➢The variance inflation factor (VIF) is a more comprehensive test for
multicollinearity.
➢For a given predictor j, the VIF is defined as VIFj = 1 / (1 − Rj²), where Rj² is the coefficient of determination from regressing predictor j on the other k − 1 predictors.
74
Variance Inflation Factor
(VIF)
➢There is no limit on the magnitude of the VIF.
➢A VIF of 10 says that the other predictors “explain” 90% of the variation in predictor j.
➢A large VIF is a warning to consider whether predictor j really belongs to the model.
VIF | Interpretation
VIF = 1 | No multicollinearity.
1 < VIF < 5 | Moderate correlation (usually acceptable).
VIF > 5 | Potential multicollinearity problem.
VIF > 10 | Serious multicollinearity concern.
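A hedged sketch of the definition: regress each predictor on the others and take 1/(1 − Rj²). The data below are hypothetical stand-ins, since the home-price spreadsheet itself is not reproduced on the slides (Python/statsmodels assumed):

```python
# Sketch: computing VIFs from auxiliary regressions on simulated predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sqft = rng.normal(2000, 400, 30)                        # hypothetical home sizes
lotsize = 0.6 * sqft / 100 + rng.normal(0, 3, 30)       # deliberately correlated with sqft
baths = rng.integers(1, 4, 30).astype(float)            # hypothetical bath counts
X = np.column_stack([sqft, lotsize, baths])

for j, name in enumerate(["SqFt", "LotSize", "Baths"]):
    others = np.delete(X, j, axis=1)
    r2_j = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(name, round(1 / (1 - r2_j), 2))               # VIF_j = 1 / (1 − R_j²)
```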
75
Problem 1-Variance Inflation
Factor (VIF)
Variable | VIF
SqFt | 2.61
LotSize | 1.61
Baths | 1.90
76
Analyze residuals to check for violations
of residual assumptions.
➢ The least squares method makes several assumptions about the (unobservable)
random errors εi. Clues about these errors may be found in the residuals ei.
➢Assumption 1: The errors are normally distributed.
➢Assumption 2: The errors have constant variance (i.e., they are homoscedastic).
➢Assumption 3: The errors are independent (i.e., they are non-autocorrelated).
➢Non-Normal Errors
➢Test
H0: Errors are normally distributed
H1: Errors are not normally distributed
➢Create a histogram of residuals (plain or standardized) to visually reveal any outliers
or serious asymmetry.
➢The normal probability plot will also visually test for normality.
➢Except when there are major outliers, non-normal residuals are usually considered a
mild violation.
77
Analyze residuals to check for violations
of residual assumptions.
➢Non-constant Variance (Heteroscedasticity)
• The hypotheses are:
H0: Errors have constant variance (homoscedastic)
H1: Errors have non-constant variance (heteroscedastic)
• Constant variance can be visually tested by examining scatter plots of the
residuals against each predictor.
• Ideally there will be no pattern.
• Violation of the constant variance assumption is potentially serious.
78
Analyze residuals to check for violations
of residual assumptions.
➢The errors are independent (i.e., they are non-autocorrelated).
▪Autocorrelation is a pattern of non-independent errors that violates the
assumption that each error is independent of its predecessor.
▪This is a problem with time series data.
▪Autocorrelated errors result in biased estimated variances, which will result in
narrow confidence intervals and large t statistics.
▪The model’s fit may be overstated.
▪Test the hypotheses:
H0: Errors are non-autocorrelated
H1: Errors are autocorrelated
▪We will use the observable residuals e1, e2, …, en for evidence of autocorrelation
and the Durbin-Watson test statistic DW.
79
Analyze residuals to check for violations
of residual assumptions.
➢Test the hypotheses:
H0: Errors are nonautocorrelated
H1: Errors are autocorrelated
➢We will use the observable residuals e1, e2, …, en for evidence of autocorrelation
and the Durbin-Watson test statistic: DW = Σt=2..n (et − et−1)² / Σt=1..n et²
81
Identify unusual residuals and
tell when they are outliers.
➢Unusual Observations
➢An observation may be unusual because
▪ The fitted model’s prediction is poor (unusual residuals), or
▪ One or more predictors may have a large influence on the regression
estimates (unusual leverage).
➢Unusual Residuals
▪ To check for unusual residuals, simply inspect the residuals to find
instances where the model does not predict well.
▪ Apply the Empirical Rule: standardize the residuals; those more than
2se from zero are unusual, and those more than 3se from zero are outliers.
82
Binary or Categorical
Predictor
➢What Is a Binary or Categorical Predictor?
▪A binary predictor has two values (usually 0 and 1) to denote the presence or
absence of a condition.
▪For example, for n graduates from an MBA program:
Employed = 1
Unemployed = 0
▪These variables are also called dummy, dichotomous, or indicator variables.
▪For clarity, name the binary variable after the characteristic that
corresponds to the value 1.
83
Testing a Binary for
Significance
➢More than one binary occurs when the number of categories to be coded
exceeds two.
➢For example, for the variable GPA by class level, each category is a binary
variable:
Freshman = 1 if a freshman, 0 otherwise
Sophomore = 1 if a sophomore, 0 otherwise
Junior = 1 if a junior, 0 otherwise
Senior = 1 if a senior, 0 otherwise
Master’s = 1 if a master’s candidate, 0 otherwise
Doctoral = 1 if a PhD candidate, 0 otherwise
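Coding such categories is routine with pandas; a small sketch (the column values are illustrative only):

```python
# Sketch: turning a multi-category variable into 0/1 dummy columns.
import pandas as pd

level = pd.Series(["Freshman", "Senior", "Master's", "Junior"], name="ClassLevel")
print(pd.get_dummies(level))   # one 0/1 column per class level
# In a fitted regression, one category is normally omitted and serves as the baseline.
```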
84
Effects of a Binary Predictor
➢A binary predictor is sometimes called a shift variable because it shifts the
regression plane up or down.
➢Suppose X1 is a binary predictor that can take on only the values of 0 or 1.
➢Its contribution to the regression is either b1 or nothing, resulting in an intercept of
either b0 (when X1 = 0) or b0 + b1 (when X1 = 1).
➢The slope does not change, only the intercept is shifted. For example:
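A tiny illustration of the shift, with made-up coefficients (not taken from any example in the deck):

```python
# Sketch: a binary predictor shifts the intercept but not the slope.
def fitted(x1_binary: int, x2: float, b0=10.0, b1=5.0, b2=2.0) -> float:
    return b0 + b1 * x1_binary + b2 * x2

print(fitted(0, 100))   # intercept b0:       10 + 200 = 210.0
print(fitted(1, 100))   # intercept b0 + b1:  15 + 200 = 215.0
```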
85
Problem 2
Investigate how home size (denoted by SqFt) and location (binary
variable) can influence the selling price (denoted by Price), where home
size and location are independent variables. Please refer to the workbook
“Problem 2” of the Excel problem sheet “Multiple Linear Regression
Analysis.” It shows that 20 home sales are in two different subdivisions,
Oakknoll and Hidden Hills. OakKnoll = 1 if home is in Oak Knoll
subdivision, 0 otherwise
86
Problem 2-Solution
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.9604
R Square | 0.9223
Adjusted R Square | 0.9132
Standard Error | 29.6697
Observations | 20

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 2 | 177706.8 | 88853.4 | 100.94 | 3.69E-10
Residual | 17 | 14964.95 | 880.29
Total | 19 | 192671.7

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 10.619 | 44.772 | 0.237 | 0.8154 | −83.843 | 105.080
SqFt | 0.1987 | 0.0142 | 14.008 | 9.13E-11 | 0.1688 | 0.2286
OakKnoll | 33.538 | 14.333 | 2.340 | 0.0317 | 3.299 | 63.778
87
Purpose of data conditioning
and stepwise regression.
➢Outliers
An outlier may be due to an error in recording the data. If so, the observation
should be deleted. But how can you tell? Impossible or truly bizarre data values
are apparent reasons to discard an observation.
➢Missing Predictors
An outlier also may be an observation that has been influenced by an unspecified
“lurking” variable that should have been controlled but wasn’t. In this case, we
should try to identify the lurking variable and formulate a multiple regression
model that includes both predictors. If there are unspecified “lurking” variables,
our fitted regression model will not give accurate predictions.
88
Purpose of data conditioning
and stepwise regression.
➢Ill-Conditioned Data
▪All variables in the regression should be of the same general order of magnitude (not too
small, not too large). If your coefficients come out in exponential notation (e.g., 7.3154
E+06), you probably should adjust the decimal point in one or more variables to a
convenient magnitude, as long as you treat all the values in the same data column
consistently.
➢Missing Data
▪If many values in a data column are missing, we might want to discard that variable. If
a Y data value is missing, we must discard the entire observation. If any X data values are
missing, the conservative action is to discard the entire observation. However, because
discarding an entire observation would mean losing other good information, statisticians
have developed procedures for imputing missing values, such as using the mean of
the X data column or by a regression procedure to “fit” the missing X-value from the
complete observations. Imputing missing values requires specialized software and expert
statistical advice.
89
Purpose of data conditioning
and stepwise regression.
➢Model Specification Errors
If you estimate a linear model when actually a nonlinear model is required, or
when you omit a relevant predictor, then you have a misspecified model. How
can you detect misspecification? You can…
What are the cures for misspecification? Start by looking for a missing relevant
predictor, seek a model with a better theoretical basis, or try transforming your
variables (e.g., using a logarithm transform, which can create a linear model
from a nonlinear model).
90
Purpose of data conditioning
and stepwise regression.
➢Stepwise and Best Subsets Regression
▪Stepwise regression uses the power of the computer to fit the best model using 1, 2, 3, . . . , k predictors.
▪backward elimination,
▪we start with all predictors in the model, removing at each stage the predictor with the highest p-value
greater than α, stopping when there are no more predictors with p-values greater than α to be removed.
▪This method tends to increase effective α and may not yield the highest R2 for the number of predictors.
▪forward selection,
▪we start with the single best predictor, adding at each stage the next best predictor in terms of
increasing R2, until there are no more predictors with p-values less than α to be added to the model.
▪A drawback of this method is that adding a predictor may render one or more already-added predictors
insignificant, and special software is needed for all but the simplest models.
▪best subsets regression, we try all possible combinations of predictors and then choose the best model for
the number of predictors we feel are justified (e.g., by Evans’ Rule). This is the most comprehensive
method, though it may present a lot of output.
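As a rough illustration of the backward-elimination procedure described above, here is a hedged Python/statsmodels sketch; backward_eliminate is a hypothetical helper, not a library function, and real stepwise tools handle more edge cases:

```python
# Sketch: backward elimination by p-value (drop the weakest predictor, refit, repeat).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, alpha: float = 0.05):
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")      # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return fit                         # every remaining predictor is significant
        cols.remove(worst)                     # drop the least significant predictor
    return sm.OLS(y, np.ones(len(y))).fit()    # fall back to an intercept-only model
```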
91
Problem 1-Backward
elimination
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.9752
R Square | 0.9511
Adjusted R Square | 0.9475
Standard Error | 20.988
Observations | 30

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 2 | 231276.6 | 115638.3 | 262.51 | 2.03E-18
Residual | 27 | 11893.74 | 440.51
Total | 29 | 243170.3

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | −23.206 | 30.515 | −0.760 | 0.4536 | −85.818 | 39.406
SqFt | 0.1871 | 0.0125 | 14.937 | 1.43E-14 | 0.1614 | 0.2128
LotSize | 6.603 | 1.465 | 4.507 | 0.0001 | 3.597 | 9.609
92