MULTIPLE REGRESSION PART 1
Topics Outline
Multiple Regression Model
Inferences about Regression Coefficients
F Test for the Overall Fit
Residual Analysis
Collinearity
Multiple Regression Model
Multiple regression models use two or more explanatory (independent) variables to predict the
value of a response (dependent) variable. With k explanatory variables, the multiple regression
model is expressed as follows:
\[
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
\]
Here α, β1, β2, ..., βk are the parameters and the error term ε is a random variable which
accounts for the variability in y that cannot be explained by the linear effect of the k explanatory
variables. The assumptions about the error term in the multiple regression model parallel those
for the simple regression model.
Regression Assumptions
1. Linearity
The error term ε is a random variable with mean 0.
Implication: For given values of x1, x2, ..., xk, the expected, or average, value of y is given by
\[
E(y) = \mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
\]
(The relationship is linear, because each term on the right-hand side of the equation is additive, and
the regression parameters do not enter the equation in a nonlinear manner, such as βi² xi. The graph
of the relationship is no longer a line, however, because there are more than two variables involved.)
2. Independence
The values of ε are statistically independent.
Implication: The value of y for a particular set of values for the explanatory variables is not
related to the value of y for any other set of values.
3. Normality
The error term ε is a normally distributed random variable (with mean 0 and standard deviation σ).
Implication: Because α, β1, β2, ..., βk are constants for the given values of x1, x2, ..., xk,
the response variable y is also a normally distributed random variable
(with mean μy = α + β1 x1 + β2 x2 + ... + βk xk and standard deviation σ).
4. Equal spread
The standard deviation σ of ε is the same for all values of the explanatory variables x1, x2, ..., xk.
Implication: The standard deviation of y about the regression surface equals σ and is the same for
all values of x1, x2, ..., xk.
Assumption 1 implies that the true population surface (plane, line) is
\[
E(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
\]
Sometimes we refer to this surface as the surface (plane, line) of means.
In simple linear regression, the slope β represents the change in the mean of y per unit
change in x and does not take into account any other variables.
In the multiple regression model, the slope β1 represents the change in the mean of y per unit
change in x1, taking into account the effect of x2, x3, ..., xk.
The estimation process for multiple regression is shown in Figure 1. As in the case of simple
linear regression, you use a simple random sample and the least squares method, that is,
minimizing the sum of squared residuals,
\[
\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ,
\]
to compute sample regression coefficients a, b1, ..., bk as estimates of the population
parameters α, β1, ..., βk.
(In multiple regression, the presentation of the formulas for the regression coefficients involves
the use of matrix algebra and is beyond the scope of this course.)
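Although the formulas are outside the scope of the course, the least squares idea itself can be illustrated in a few lines. The following is only a sketch, assuming plain Python with no external libraries; the function names and the toy data are hypothetical and are not the OmniPower sample. It solves the so-called normal equations by elimination and reports the minimized sum of squared residuals.

```python
def fit_least_squares(xs, ys):
    """Return [a, b1, ..., bk] minimizing the sum of squared residuals.

    xs: list of rows, each row holding the k explanatory values of one observation.
    ys: list of observed responses.
    """
    # Design matrix: the leading 1 in each row carries the intercept a.
    rows = [[1.0] + [float(v) for v in row] for row in xs]
    p = len(rows[0])

    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]

    return [aug[i][p] / aug[i][i] for i in range(p)]


def sum_squared_residuals(coefs, xs, ys):
    """Sum of (y_i - yhat_i)^2 for the equation yhat = a + b1*x1 + ... + bk*xk."""
    total = 0.0
    for row, y in zip(xs, ys):
        y_hat = coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
        total += (y - y_hat) ** 2
    return total


# Hypothetical toy data (NOT the OmniPower sample): six observations, k = 2.
toy_x = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
toy_y = [3.1, 5.9, 6.2, 9.0, 8.8, 12.1]

coefs = fit_least_squares(toy_x, toy_y)
sse = sum_squared_residuals(coefs, toy_x, toy_y)
print("a, b1, b2 =", [round(c, 4) for c in coefs])
print("sum of squared residuals =", round(sse, 4))
```

Any other choice of a, b1, b2 gives a sum of squared residuals at least as large as the one printed here, which is what "least squares" means. In practice you would rely on Excel or a statistics package rather than a hand-written solver.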
[Figure 1: flow diagram with the following four boxes]
1. Multiple regression model: y = α + β1 x1 + β2 x2 + ... + βk xk + ε (σ = st. dev. of ε).
   True population surface: μy = α + β1 x1 + β2 x2 + ... + βk xk.
   Regression parameters: α, β1, β2, ..., βk, σ.
2. Sample data: observed values of x1, x2, ..., xk and y.
3. Compute the sample statistics a, b1, b2, ..., bk, s_e
   and the estimated regression equation ŷ = a + b1 x1 + b2 x2 + ... + bk xk.
4. The values of a, b1, b2, ..., bk, s_e provide the estimates of α, β1, β2, ..., βk, σ.
Figure 1 The estimation process for multiple regression
The sample statistics a, b1, ..., bk provide the following estimated multiple regression equation
\[
\hat{y} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
\]
where a is again the y-intercept, and b1 through bk are the slopes. This is the equation of the
fitted surface, also known as the least squares surface (plane, line).
Graphically, you are no longer fitting a line to a set of points. If there are exactly two explanatory
variables, you are fitting a plane to the data in three-dimensional space. There is one dimension
for the response variable and one for each of the two explanatory variables.
If there are more than two explanatory variables, then you can only imagine the regression surface;
drawing in four or more dimensions is impossible.
Interpretation of Regression Coefficients
The intercept a is the predicted value of y when all of the x's equal zero. (Of course, this makes
sense only if it is practical for all of the x's to equal zero, which is seldom the case.)
Each slope coefficient is the predicted change in y per unit change in a particular x,
holding constant the effect of the other x variables. For example, b1 is the predicted change in y
when x1 increases by one unit and the other x's in the equation, x2 through xk, remain constant.
Example 1
OmniFoods
OmniFoods is a large food products company. The company is planning a nationwide introduction
of OmniPower, a new high-energy bar. Originally marketed to runners, mountain climbers, and
other athletes, high-energy bars are now popular with the general public. OmniFoods is anxious to
capture a share of this thriving market. The business objective facing the marketing manager at
OmniFoods is to develop a model to predict monthly sales volume per store of OmniPower bars
and to determine what variables influence sales. Two explanatory variables are considered here:
x1 = the price of an OmniPower bar, measured in cents, and
x2 = the monthly budget for in-store promotional expenditures, measured in dollars.
In-store promotional expenditures typically include signs and displays, in-store coupons, and free
samples. The response variable y is the number of OmniPower bars sold in a month.
Data are collected (and stored in OmniPower.xlsx) from a sample of 34 stores in a supermarket
chain selected for a test-market study of OmniPower.
Store    Number of Bars    Price (cents)    Promotion ($)
1        4141              59               200
2        3842              59               200
...
33       3354              99               600
34       2927              99               600
Here is the regression output.
Regression Statistics
Multiple R           0.8705
R Square             0.7577
Adjusted R Square    0.7421
Standard Error       638.0653
Observations         34

ANOVA
              df    SS             MS             F        Significance F
Regression     2    39472730.77    19736365.39    48.48    0.0000
Residual      31    12620946.67    407127.31
Total         33    52093677.44

             Coefficients   Standard Error   t Stat     P-value   Lower 95%    Upper 95%
Intercept    5837.5208      628.1502          9.2932    0.0000    4556.3999    7118.6416
Price         -53.2173        6.8522         -7.7664    0.0000     -67.1925     -39.2421
Promotion       3.6131        0.6852          5.2728    0.0000       2.2155       5.0106
The computed values of the regression coefficients are a = 5,837.5208, b1 = -53.2173, b2 = 3.6131.
Therefore, the multiple regression equation (representing the fitted regression plane) is
\[
\hat{y} = 5837.5208 - 53.2173 x_1 + 3.6131 x_2
\]
or
Predicted Bars = 5,837.5208 - 53.2173 Price + 3.6131 Promotion
Interpretation of intercept
The sample y-intercept (a = 5,837.5208 ≈ 6,000) estimates the number of OmniPower bars sold in
a month if the price is $0.00 and the total amount spent on promotional expenditures is also $0.00.
Because these values of price and promotion are outside the range of price and promotion used in
the test-market study, and because they make no sense in the context of the problem, the value of
a has little or no practical interpretation.
Interpretation of slope coefficients
The slope of price with OmniPower sales (b1 = -53.2173) indicates that, for a given amount of
monthly promotional expenditures, the predicted sales of OmniPower are estimated to decrease
by 53.2173 ≈ 53 bars per month for each 1-cent increase in the price.
The slope of monthly promotional expenditures with OmniPower sales (b2 = 3.6131) indicates that,
for a given price, the estimated sales of OmniPower are predicted to increase by 3.6131 ≈ 4 bars for
each additional $1 spent on promotions.
These estimates allow you to better understand the likely effect that price and promotion
decisions will have in the marketplace. For example, a 10-cent decrease in price is predicted to
increase sales by 532.173 ≈ 532 bars, with a fixed amount of monthly promotional expenditures.
A $100 increase in promotional expenditures is predicted to increase sales by 361.31 ≈ 361 bars,
for a given price.
Predicting the Response Variable
What are the predicted sales for a store charging 79 cents per bar during a month in which
promotional expenditures are $400?
Using the multiple regression equation with x1 = 79 and x2 = 400,
\[
\hat{y} = 5837.5208 - 53.2173(79) + 3.6131(400) = 3{,}078.57
\]
Thus, stores charging 79 cents per bar and spending $400 in promotional expenditures are predicted
to sell 3,078.57 ≈ 3,079 OmniPower bars per month.
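The prediction and the "what if" effects quoted earlier can be reproduced with a short calculation. This is a minimal sketch using the rounded coefficients printed in the regression output; the function and variable names are illustrative only.

```python
# Rounded coefficients as printed in the regression output.
A = 5837.5208         # intercept a
B_PRICE = -53.2173    # slope b1 (bars per 1-cent increase in price)
B_PROMO = 3.6131      # slope b2 (bars per additional $1 of promotion)


def predict_bars(price_cents, promotion_dollars):
    """Predicted monthly sales from the fitted regression plane."""
    return A + B_PRICE * price_cents + B_PROMO * promotion_dollars


pred = predict_bars(79, 400)

# Effect of a 10-cent price cut, promotion held fixed.
gain_from_price_cut = predict_bars(69, 400) - predict_bars(79, 400)

# Effect of an extra $100 of promotion, price held fixed.
gain_from_promo = predict_bars(79, 500) - predict_bars(79, 400)

print(round(pred, 2), round(gain_from_price_cut, 3), round(gain_from_promo, 2))
```

With these rounded coefficients the prediction comes out at about 3,078.59 rather than 3,078.57; the small gap is presumably because the value in the text was computed from the unrounded coefficients held by the software.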
Interpretation of s_e, r², and r
The interpretation of these quantities is almost exactly the same as in simple regression.
The standard error of estimate s_e is essentially the standard deviation of residuals,
but it is now given by the following equation
\[
s_e = \sqrt{\frac{\sum_i e_i^2}{n - k - 1}}
\]
where n is the number of observations and k is the number of explanatory variables in the equation.
Fortunately, you can interpret s_e exactly as before. It is a measure of the typical prediction error
when the multiple regression equation is used to predict the response variable.
The coefficient of determination r² is again the proportion of variation in the response variable
y explained by the combined set of explanatory variables x1, x2, ..., xk. In fact, it even has the
same formula as before:
\[
r^2 = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \frac{SSR}{SST}
\]
In the OmniPower example (see Excel output),
SSR = 39,472,730.77
and
SST = 52,093,677.44
Thus,
\[
r^2 = \frac{SSR}{SST} = \frac{39{,}472{,}730.77}{52{,}093{,}677.44} = 0.7577
\]
The coefficient of determination indicates that 75.77% ≈ 76% of the variation in sales is
explained by the variation in the price and in the promotional expenditures.
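As a check, the summary measures in the Excel output can be recomputed from the ANOVA sums of squares. The sketch below assumes only the figures printed in the output; the variable names are illustrative.

```python
import math

n, k = 34, 2                  # observations and explanatory variables
ssr = 39472730.77             # regression sum of squares
sse = 12620946.67             # residual (error) sum of squares
sst = 52093677.44             # total sum of squares

# Standard error of estimate: square root of SSE / (n - k - 1).
s_e = math.sqrt(sse / (n - k - 1))

# Coefficient of determination and its square root ("Multiple R" in the output).
r2 = ssr / sst
r = math.sqrt(r2)

print(round(s_e, 4), round(r2, 4), round(r, 4))
```

The results agree with the "Standard Error", "R Square", and "Multiple R" lines of the output (638.0653, 0.7577, and 0.8705). Note also that SSR + SSE = SST.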
The square root of r² is the correlation r between the fitted values ŷ and the observed values y
of the response variable in both simple and multiple regression.
A graphical indication of the correlation can be seen in the plot of fitted (predicted) y values versus
observed y values. If the regression equation gave perfect predictions, all of the points in this plot
would lie on a 45° line, because each fitted value would equal the corresponding observed value.
Although a perfect fit virtually never occurs, the closer the points are to a 45° line, the better the fit is.
The correlation in the OmniPower example is r = √0.7577 = 0.87, indicating a strong
relationship between the two explanatory variables and the response variable. This is confirmed
by the scatterplot of ŷ values versus y values:
[Figure: Predicted versus Observed Bars (scatterplot; x-axis: Observed Bars, y-axis: Predicted Bars)]
Inferences about Regression Coefficients
t Tests for Significance
In a simple linear regression model, to test a hypothesis H0: β = 0 concerning the population
slope β, we used the test statistic
\[
t = \frac{b}{SE_b}
\]
with df = n - 2 degrees of freedom.
Similarly, in multiple regression, to test a hypothesis concerning the population slope βj
for variable xj (holding constant the effects of all other explanatory variables),
\[
H_0: \beta_j = 0 \qquad H_a: \beta_j \neq 0
\]
we use the test statistic
\[
t = \frac{b_j}{SE_{b_j}}
\]
with df = n - k - 1, where k is the number of the explanatory
variables in the regression equation.
In our example, to determine whether variable x2 (amount of promotional expenditures) has a
significant effect on sales, taking into account the price of OmniPower bars, the null and
alternative hypotheses are
\[
H_0: \beta_2 = 0 \qquad H_a: \beta_2 \neq 0
\]
The test statistic is
\[
t = \frac{b_2}{SE_{b_2}} = \frac{3.6131}{0.6852} = 5.2728
\]
with df = n - k - 1 = 34 - 2 - 1 = 31.
The P-value is extremely small. Therefore, we reject the null hypothesis that there is no
relationship between x2 (promotional expenditures) and y (sales) and conclude that there is a
statistically significant relationship between promotional expenditures and sales, taking into
account the price x1.
For the slope of sales with price, the respective test statistic and P-value are: t = -7.7664, P-value ≈ 0.
Thus, there is a significant relationship between price x1 and sales, taking into account the
promotional expenditures x2.
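The two t statistics can be recomputed directly from the coefficients and standard errors in the Excel output. The sketch below assumes those rounded values. In place of an exact P-value, it compares |t| with the two-sided 5% critical value t* = 2.0395 for 31 degrees of freedom that these notes quote in the confidence-interval discussion. The names are illustrative.

```python
n, k = 34, 2
df = n - k - 1            # degrees of freedom for each t test

T_CRIT = 2.0395           # two-sided 5% critical value for df = 31 (quoted in the notes)

# (coefficient b_j, standard error SE_bj) from the regression output.
coefficients = {
    "Price": (-53.2173, 6.8522),
    "Promotion": (3.6131, 0.6852),
}

t_stats = {name: b / se for name, (b, se) in coefficients.items()}
significant = {name: abs(t) > T_CRIT for name, t in t_stats.items()}

for name in coefficients:
    print(name, "t =", round(t_stats[name], 3), "significant at 5%:", significant[name])
```

Because the inputs are rounded to four decimals, the ratios match the "t Stat" column only to about three decimals (for example, roughly 5.273 against the reported 5.2728).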
If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that
the corresponding explanatory variable has no linear relationship to y. It means that there is no
evidence that the corresponding explanatory variable contributes to modeling y after allowing for
all the other explanatory variables.
The parameter βj in a multiple regression model can be quite different from zero even when
there is no simple linear relationship between xj and y. The coefficient of xj in a multiple
regression depends as much on the other explanatory variables as it does on xj. It is even possible
that the multiple regression slope changes sign when a new variable enters the regression model.
Confidence Intervals
To estimate the value of a population slope βj in multiple regression, we can use the following
confidence interval
\[
b_j \pm t^{*} \, SE_{b_j}
\]
where t* is the critical value for a t distribution with df = n - k - 1 degrees of freedom.
To construct a 95% confidence interval estimate of the population slope β1 (the effect of price
x1 on sales y, holding constant the effect of promotional expenditures x2), the critical value of t
at the 95% confidence level with 31 degrees of freedom is t* = 2.0395.
(Note: For df = 30, the t-Table gives t* = 2.042.)
Then, using the information from the Excel output,
\[
b_1 \pm t^{*} \, SE_{b_1} = -53.2173 \pm 2.0395(6.8522) = -53.2173 \pm 13.9752
\]
that is, from -67.1925 to -39.2421.
Taking into account the effect of promotional expenditures, the estimated effect of a 1-cent
increase in price is to reduce mean sales by approximately 39.2 to 67.2 bars. You have 95%
confidence that this interval correctly estimates the relationship between these variables.
From a hypothesis-testing viewpoint, because this confidence interval does not include 0,
you conclude that the regression coefficient β1 has a significant effect.
The 95% confidence interval for the slope of sales with promotional expenditures is
\[
b_2 \pm t^{*} \, SE_{b_2} = 3.6131 \pm 2.0395(0.6852) = 3.6131 \pm 1.3975
\]
that is, from 2.2156 to 5.0106.
Thus, taking into account the effect of price, the estimated effect of each additional dollar of
promotional expenditures is to increase mean sales by approximately 2.22 to 5.01 bars. You have
95% confidence that this interval correctly estimates the relationship between these variables.
From a hypothesis-testing viewpoint, because this confidence interval does not include 0,
you can conclude that the regression coefficient β2 has a significant effect.
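Both intervals follow the same recipe, so they can be computed with one small helper. This is a sketch assuming the rounded coefficients and standard errors from the Excel output and the critical value t* = 2.0395 given above; the helper name is hypothetical.

```python
def slope_interval(b, se, t_star):
    """Confidence interval b ± t* × SE for a regression slope."""
    margin = t_star * se
    return b - margin, b + margin


T_STAR = 2.0395   # 95% confidence, df = 31 (from the notes)

price_ci = slope_interval(-53.2173, 6.8522, T_STAR)
promo_ci = slope_interval(3.6131, 0.6852, T_STAR)

for name, (low, high) in [("Price", price_ci), ("Promotion", promo_ci)]:
    excludes_zero = low > 0 or high < 0
    print(name, round(low, 4), "to", round(high, 4), "| excludes 0:", excludes_zero)
```

The endpoints agree with the "Lower 95%" and "Upper 95%" columns of the output to within rounding in the last decimal place.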
F Test for the Overall Fit
In simple linear regression, the t test and the F test provide the same conclusion; that is, if the
null hypothesis is rejected, we conclude that β ≠ 0. In multiple regression, the t test and the F
test have different purposes. The t test of significance for a specific regression coefficient in
multiple regression is a test for the significance of adding that variable into a regression model,
given that the other variables are included. In other words, the t test for the regression coefficient
is actually a test for the contribution of each explanatory variable. The overall F test is used to
determine whether there is a significant relationship between the response variable and the entire
set of explanatory variables. We also say that it determines the explanatory power of the model.
The null and alternative hypotheses for the F test are:
H0: β1 = β2 = ... = βk = 0
(There is no significant relationship between the
response variable and the explanatory variables.)
Ha: At least one βj ≠ 0
(There is a significant relationship between the response
variable and at least one of the explanatory variables.)
Failing to reject the null hypothesis implies that the explanatory variables are of little or no use in
explaining the variation in the response variable; that is, the regression model predicts no better
than just using the mean. Rejection of the null hypothesis implies that at least one of the
explanatory variables helps explain the variation in y and therefore, the regression model is useful.
The ANOVA table for multiple regression has the following form.
Source of     Degrees of    Sum of     Mean Squares             F statistic    P-value
Variation     Freedom       Squares    (Variance)
Regression    k             SSR        MSR = SSR/k              F = MSR/MSE    Prob > F
Error         n - k - 1     SSE        MSE = SSE/(n - k - 1)
Total         n - 1         SST

The F test statistic follows an F-distribution with k and (n - k - 1) degrees of freedom.
For our example, the hypotheses are:
H0: β1 = β2 = 0
Ha: β1 and/or β2 is not equal to zero
The corresponding F distribution has df1 = 2 and df2 = n - 2 - 1 = 34 - 3 = 31 degrees of
freedom. The test statistic is F = 48.4771 and the corresponding P-value is
P-value = FDIST(48.4771,2,31) = 0.00000000029
We reject H 0 and conclude that at least one of the explanatory variables (price and/or
promotional expenditures) is related to sales.
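The F statistic can be rebuilt from the ANOVA sums of squares. The sketch below also approximates the P-value without a statistics library, using a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do here because k = 2. For other values of k you would need a general routine such as the Excel FDIST function used above. The function name is hypothetical.

```python
n, k = 34, 2
ssr = 39472730.77   # regression sum of squares
sse = 12620946.67   # error sum of squares

msr = ssr / k                 # mean square for regression
mse = sse / (n - k - 1)       # mean square error
f_stat = msr / mse


def f_upper_tail_two_numerator_df(f, df2):
    """P(F > f) for an F distribution with 2 and df2 degrees of freedom.

    Valid ONLY when the numerator degrees of freedom are exactly 2.
    """
    return (1 + 2 * f / df2) ** (-df2 / 2)


p_value = f_upper_tail_two_numerator_df(f_stat, n - k - 1)
print(round(f_stat, 4), p_value)
```

The computed F is about 48.477 and the P-value is on the order of 3 × 10^-10, in line with the FDIST result quoted in the text.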
Residual Analysis
Three types of residual plots are appropriate for multiple regression.
1. Residuals versus ŷ (the predicted values of y)
This plot should look patternless. If the residuals show a pattern (e.g. a trend, bend, clumping),
there is evidence of a possible curvilinear effect in at least one explanatory variable, a possible
violation of the assumption of equal variance, and/or the need to transform the y variable.
2. Residuals versus each x
Patterns in the plot of the residuals versus an explanatory variable may indicate the existence of a
curvilinear effect and, therefore, the need to add a curvilinear explanatory variable to the
multiple regression model.
3. Residuals versus time
This plot is used to investigate patterns in the residuals in order to validate the independence
assumption when one of the x-variables is related to time or is itself time.
Below are the residual plots for the OmniPower sales example. There is very little or no pattern
in the relationship between the residuals and the predicted value of y, the value of x1 (price), or
the value of x2 (promotional expenditures). Thus, you can conclude that the multiple regression
model is appropriate for predicting sales.
There is no need to plot the residuals versus time because the data were not collected in time order.
[Figure: Residuals versus Predicted Bars (x-axis: Predicted Bars, y-axis: Residuals)]
[Figure: Price Residual Plot (x-axis: Price, y-axis: Residuals)]
[Figure: Promotion Residual Plot (x-axis: Promotion, y-axis: Residuals)]
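Each point in a residual plot is an observed value minus the corresponding fitted value. As an illustration, the sketch below computes the residuals for the four stores listed in the data excerpt (stores 1, 2, 33, and 34), using the rounded coefficients from the regression output. It covers only those four rows, not the full sample of 34 stores, and the names are illustrative.

```python
# (bars sold, price in cents, promotion in dollars) for the four stores
# shown in the data excerpt; the other 30 stores are not listed in these notes.
stores = {
    1: (4141, 59, 200),
    2: (3842, 59, 200),
    33: (3354, 99, 600),
    34: (2927, 99, 600),
}


def fitted(price, promotion):
    """Fitted value from the estimated regression equation (rounded coefficients)."""
    return 5837.5208 - 53.2173 * price + 3.6131 * promotion


# Residual = observed y minus fitted y-hat.
residuals = {
    store: bars - fitted(price, promotion)
    for store, (bars, price, promotion) in stores.items()
}

for store, e in residuals.items():
    bars, price, promotion = stores[store]
    print(store, round(fitted(price, promotion), 1), round(e, 1))
```

Plotting such residuals for all 34 stores against the fitted values, against price, and against promotion gives the three plots described above.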
The third regression assumption states that the errors are normally distributed. We can check it
the same way as in simple regression, by forming a histogram or a normal probability
(Q-Q) plot of the residuals. If the third assumption holds, the histogram should be approximately
symmetric and bell-shaped, and the points in the normal probability plot should be close to a 45°
line. But if there is obvious skewness, too many residuals more than, say, two standard
deviations from the mean, or some other nonnormal property, this indicates a violation of the
third assumption.
Neither the histogram nor the normal probability plot for the OmniPower example shows any
severe signs of departure from normality.
[Figure: Histogram of Residual (x-axis: residual bins, y-axis: Frequency)]
[Figure: Q-Q Normal Plot of Residual (x-axis: Z-Value, y-axis: Standardized Q-Value)]
Collinearity
Most explanatory variables in a multiple regression problem are correlated to some degree with
one another. For example, in the OmniPower case the correlation matrix is
                  Price (x1)    Promotion (x2)    Bars (y)
Price (x1)         1.0000
Promotion (x2)    -0.0968        1.0000
Bars (y)          -0.7351        0.5351            1.0000
The correlation between price and promotion is -0.0968. Thus, we find some (very weak) degree of
linear association between the two explanatory variables.
Low correlations among the explanatory variables generally do not result in serious deterioration
of the quality of the least squares estimates. However, when the explanatory variables are highly
correlated, it becomes difficult to determine the separate effect of any particular explanatory
variable on the response variable. We interpret the regression coefficients as measuring the change
in the response variable when the corresponding explanatory variable increases by 1 unit while all
the other explanatory variables are held constant. The interpretation may be impossible when the
explanatory variables are highly correlated, because when the explanatory variable changes by 1
unit, some or all of the other explanatory variables will change.
Collinearity (also called multicollinearity or intercorrelation) is a condition that exists when
two or more of the explanatory variables are highly correlated with each other. When highly
correlated explanatory variables are included in the regression model, they can adversely affect the
regression results. Two of the most serious problems that can arise are:
1. The estimated regression coefficients may be far from the population parameters, including
the possibility that the statistic and the parameter being estimated may have opposite signs.
For example, the true slope β2 might actually be +10 and b2, its estimate, might turn out to be -3.
2. You might find a regression that is very highly significant based on the F test but for which
not even one of the t tests of the individual x variables is significant. Thus, variables that are
really related to the response variable can look like they aren't related, based on their P-values.
In other words, the regression result is telling you that the x variables taken as a group explain
a lot about y, but it is impossible to single out any particular x variables as being responsible.
Statisticians have developed several routines for determining whether collinearity is high enough
to cause problems. Here are the three most widely used techniques:
1. Pairwise correlations between x's
The rule of thumb suggests that collinearity is a potential problem if the absolute value of the
correlation between any two explanatory variables exceeds 0.7.
(Note: Some statisticians suggest a cutoff of 0.5 instead of 0.7.)
2. Pairwise correlations between y and x's
The rule of thumb suggests that collinearity may be a serious problem if any of the pairwise
correlations among the x variables is larger than the largest of the correlations between the y
variable and the x variables.
3. Variance inflation factors
The statistic that measures the degree of collinearity of the j-th explanatory variable with the
other explanatory variables is called the variance inflation factor (VIF) and is found as:
\[
VIF_j = \frac{1}{1 - r_j^2}
\]
where r_j² is the coefficient of determination for a regression model using variable xj as the
response variable and all other x variables as explanatory variables.
The VIF tells how much the variance of the regression coefficient has been inflated due to
collinearity. The higher the VIF, the higher the standard error of its coefficient and the less it can
contribute to the regression model. More specifically, r_j² shows how well the j-th explanatory
variable can be predicted by the other explanatory variables. The 1 - r_j² term measures what that
explanatory variable has left to bring to the model. If r_j² is high, then not only is that variable
superfluous, but it can damage the regression model.
Since r_j² cannot be less than zero, the minimum value of the VIF is 1.
If a set of explanatory variables is uncorrelated, then each r_j² = 0.0 and each VIFj is equal to 1.
As r_j² increases, VIFj increases also. For example, if r_j² = 0.9, then VIFj = 1/(1 - 0.9) = 10;
if r_j² = 0.99, then VIFj = 1/(1 - 0.99) = 100.
How large the VIFs must be to suggest a serious problem with collinearity is not completely clear.
In general, any individual VIFj larger than 10 is considered as an indication of a potential
collinearity problem. (Note: Some statisticians suggest using the cutoff of 5 instead of 10.)
In the OmniPower sales data, the correlation between the two explanatory variables, price and
promotional expenditure, is -0.0968. Because there are only two explanatory variables in the model,
each r_j² is simply the square of this correlation, so
\[
VIF_1 = VIF_2 = \frac{1}{1 - (-0.0968)^2} = 1.009
\]
Since all VIFs (two in this example) are less than 10 (or, less than the more conservative value of 5),
you can conclude that there is no problem with collinearity for the OmniPower sales data.
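The three collinearity checks can be applied mechanically to the OmniPower correlations. The sketch below uses only the magnitudes of the correlations, since the sign does not matter for any of the three rules (the VIF depends on the squared correlation). Comparing absolute values in the second rule is an assumption about how that rule is meant to be read. The function and variable names are illustrative.

```python
def vif(r_squared_j):
    """Variance inflation factor 1 / (1 - r_j^2)."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("r_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


# Magnitudes of the correlations from the OmniPower correlation matrix.
R_PRICE_PROMO = 0.0968                                   # between the two x variables
R_WITH_BARS = {"Price": 0.7351, "Promotion": 0.5351}     # each x with y

# With only two explanatory variables, r_j^2 is the squared correlation between them.
vif_omni = vif(R_PRICE_PROMO ** 2)

rule1_flag = R_PRICE_PROMO > 0.7                          # pairwise correlation between x's
rule2_flag = R_PRICE_PROMO > max(R_WITH_BARS.values())    # x-x versus largest x-y correlation
rule3_flag = vif_omni > 10                                # variance inflation factor

print(round(vif_omni, 3), rule1_flag, rule2_flag, rule3_flag)
print(round(vif(0.9), 6), round(vif(0.99), 6))
```

None of the three rules flags a problem for these data, and the last line reproduces the VIF values of 10 and 100 quoted for r_j² = 0.9 and 0.99.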
One solution to the collinearity problem is to delete the variable with the largest VIF value.
The reduced model is often free of collinearity problems.
Another solution is to redefine some of the variables so that each x variable has a clear, unique role in
explaining y. For example, if x1 and x2 are collinear, you might try using x1 and the ratio x2/x1 instead.
If possible, every attempt should be made to avoid including explanatory variables that are
highly correlated. In practice, however, strict adherence to this policy is rarely achievable.