MULTIPLE LINEAR REGRESSION
MULTIPLE REGRESSION
Multiple Regression Model
Least Squares Method
Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation for
Estimation and Prediction
Categorical Independent Variables
MULTIPLE LINEAR REGRESSION
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable.
Multiple regression model
y = β0 + β1x1 + β2x2 + … + βnxn + ε
where
n is the number of independent (explanatory) variables
y is the dependent variable
x1, x2, …, xn are the explanatory variables
β0 is the y-intercept (constant term)
β1, β2, …, βn are the slope coefficients for each explanatory variable
ε is the error term (residual)
Multiple regression equation
E(y) = β0 + β1x1 + β2x2 + … + βnxn
where
n is the number of independent (explanatory) variables
E(y) is the expected value of the dependent variable
x1, x2, …, xn are the explanatory variables
β0 is the y-intercept (constant term)
β1, β2, …, βn are the slope coefficients for each explanatory variable
Estimated multiple regression equation
ŷ = b0 + b1x1 + b2x2 + … + bnxn
where
n is the number of independent (explanatory) variables
ŷ is the predicted value of the dependent variable
x1, x2, …, xn are the explanatory variables
b0 is the estimated y-intercept (constant term)
b1, b2, …, bn are the estimated slope coefficients for each explanatory variable
ESTIMATION PROCESS
Multiple Regression Model: y = β0 + β1x1 + β2x2 + … + βpxp + ε
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + … + βpxp
The unknown parameters are β0, β1, β2, …, βp.
Sample Data: observed values of x1, x2, …, xp and y
Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + … + bpxp
The sample statistics b0, b1, b2, …, bp provide estimates of β0, β1, β2, …, βp.
LEAST SQUARES METHOD
• Least Squares Criterion: min Σ(yi − ŷi)²
Computation of Coefficient Values
The formulas for the regression coefficients
b0, b1, b2, . . . bp involve the use of matrix algebra.
We will rely on computer software packages to
perform the calculations.
LEAST SQUARES METHOD
The least squares method is a procedure for using sample data to find the estimated regression equation.
Predicted values of the dependent variable are computed using the estimated multiple regression equation
ŷ = b0 + b1x1 + b2x2 + … + bnxn
If 2 independent variables are used, the equation is
ŷ = b0 + b1x1 + b2x2
where
b0 is the intercept
b1 is the slope for the first variable
b2 is the slope for the second variable
Example – Butler Trucking Company
Assignment Miles (x) Time (y)
1 100 9.3
2 50 4.8
3 100 8.9
4 100 6.5
5 50 4.2
6 80 6.2
7 75 7.4
8 65 6.0
9 90 7.6
10 90 6.1
Linear equation = ?
Predict the value of y when x = 110.
Example – Butler Trucking Company
Assignment Miles (x1) Deliveries (x2) Time (y)
1 100 4 9.3
2 50 3 4.8
3 100 4 8.9
4 100 2 6.5
5 50 2 4.2
6 80 2 6.2
7 75 3 7.4
8 65 4 6.0
9 90 3 7.6
10 90 2 6.1
Linear equation: y = ?
R square = ?
Multiple R = ?
Adjusted R square = ?
Predict the value of y for given values of x1 and x2.
Example – Butler Trucking Company
Linear Equation is y = -0.8687 + 0.0611 x1 + 0.9234 x2
Predict the value of y for given values of x1 and x2.
R Square = 0.9038
Multiple R = 0.9506 (SQRT of R square)
Adjusted R square = 0.8763
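To verify these numbers, here is a minimal sketch (assuming numpy is available; this is not the original course software) that fits the same two-predictor model by least squares:

```python
import numpy as np

# Butler Trucking data from the table above.
miles      = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90])
deliveries = np.array([4, 3, 4, 2, 2, 2, 3, 4, 3, 2])
time       = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(len(time)), miles, deliveries])
b, *_ = np.linalg.lstsq(X, time, rcond=None)
print(b)  # approx. [-0.8687, 0.0611, 0.9234]

# R square = SSR / SST = 1 - SSE / SST
y_hat = X @ b
sse = np.sum((time - y_hat) ** 2)
sst = np.sum((time - time.mean()) ** 2)
print(1 - sse / sst)  # approx. 0.9038
```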
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination (R square)
The coefficient of determination is the measure of the variance in the response variable y that can be predicted using the predictor variables x1, x2, …, xn. It indicates how well the regression line fits the data.
It measures the goodness of fit.
The value of the coefficient of determination varies from 0 to 1: 0 means there is no linear relationship between the predictor variables x1, x2, …, xn and the response variable y, and 1 means there is a perfect linear relationship between input and output.
R2 is calculated using Sum of Square (SS)
R2 = SSR/SST
Types of Sum of Squares
1. Regression Sum of Squares (SSR)
2. Residual (Error) Sum of Squares (SSE)
3. Total Sum of Squares (SST)
Relationship:
SST = SSR + SSE
SSR = SSR(x1) + SSR(x2) + … + SSR(xn)
Adjusted Multiple Coefficient
of Determination
Adding independent variables, even ones that are
not statistically significant, causes the prediction
errors to become smaller, thus reducing the sum of
squares due to error, SSE.
Because SSR = SST – SSE, when SSE becomes smaller,
SSR becomes larger, causing R2 = SSR/SST to
increase.
The adjusted multiple coefficient of determination
compensates for the number of independent
variables in the model.
Adjusted Multiple Coefficient
of Determination
Ra² = 1 − (1 − R²) × (n − 1) / (n − p − 1)
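As a quick sanity check, here is a small sketch of this formula in Python (a hypothetical helper, not from the course materials), verified against the Butler Trucking values above:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-square: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Butler Trucking: R2 = 0.9038 with n = 10 observations, p = 2 predictors.
print(adjusted_r2(0.9038, n=10, p=2))  # approx. 0.8763
```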
Example – Butler Trucking Company – ANOVA Summary
Relationship: SST = SSR + SSE, where SSR = SSR(x1) + SSR(x2) + … + SSR(xn)
SSR = 15.871 + 5.729 = 21.6
SSE = 2.299
SST = 15.871 + 5.729 + 2.299 = 23.899
MSR = SSR / k, where the degree of freedom for each SSR component is k = 1:
MSR(x1) = 15.871 / 1 = 15.871 (miles)
MSR(x2) = 5.729 / 1 = 5.729 (deliveries)
MSE = SSE / (n − k − 1) = 2.299 / 7 = 0.328, with n = 10 (total number of observations) and k = 2 (total number of independent variables)
F value for x1 = 15.871 / 0.328 = 48.3
F value for x2 = 5.729 / 0.328 = 17.4
Correlation Coefficient (Multiple R)
Correlation coefficient = (sign of slope m) × √(coefficient of determination)
= (sign of slope m) × √(R²)
= +√0.9038 = +0.9506
R² = SSR / SST = 21.6 / 23.899 = 0.9038
Adjusted R² = 1 − (1 − 0.9038) × (10 − 1) / (10 − 2 − 1) = 0.8763, where n = 10 and k = 2 (number of independent variables).
Multiple Regression Model
Example: Programmer Salary Survey
A software firm collected data for a sample of 20
computer programmers. A suggestion was made that
regression analysis could be used to determine if
salary was related to the years of experience and the
score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
Multiple Regression Model
Exper. Test Salary Exper. Test Salary
(Yrs.) Score ($000s) (Yrs.) Score ($000s)
4 78 24.0 9 88 38.0
7 100 43.0 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29.0
10 84 38.0 8 87 34.0
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30.0 3 70 28.2
6 91 33.0 3 89 30.0
Multiple Regression Model
Suppose we believe that salary (y) is related to
the years of experience (x1) and the score on the
programmer aptitude test (x2) by the following
regression model:
y = β0 + β1x1 + β2x2 + ε
where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of β0, β1, β2
Least Squares: the input data (the observed values of x1, x2, and y) are fed into a computer package for solving multiple regression problems, and the output is the estimates b0, b1, b2, along with R², etc.
Solving for the Estimates of b0, b1, b2
Regression Equation Output
Predictor Coef SE Coef T p
Constant 3.17394 6.15607 0.5156 0.61279
Experience 1.4039 0.19857 7.0702 1.9E-06
Test Score 0.25089 0.07735 3.2433 0.00478
Estimated Regression Equation
SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: The predicted salary will be in thousands of dollars.
Now predict salary for given experience = 5 years
and Test Score value = 70
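A minimal sketch of this prediction, plugging the given values into the estimated equation above (plain Python, nothing package-specific):

```python
# Coefficients from the estimated regression equation on the slide above.
b0, b1, b2 = 3.174, 1.404, 0.251

exper, score = 5, 70
salary = b0 + b1 * exper + b2 * score   # predicted salary in $1000s
print(salary)  # approx. 27.76, i.e. about $27,764
```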
Interpreting the Coefficients
In multiple regression analysis, we interpret each
regression coefficient as follows:
bi represents an estimate of the change in y
corresponding to a 1-unit increase in xi when all
other independent variables are held constant.
Interpreting the Coefficients
b1 = 1.404
Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on the programmer aptitude test is held constant).
Interpreting the Coefficients
b2 = 0.251
Salary is expected to increase by $251 for each
additional point scored on the programmer aptitude
test (when the variable years of experience is held
constant).
MODEL FITTING
Model fitting is a measure of how well a machine learning model generalizes to data similar to the data on which it was trained. A model that is well fitted produces more accurate outcomes.
Model fitting is the essence of machine learning: if your model doesn't fit your data correctly, the outcomes it produces will not be accurate enough to be useful for practical decision-making.
A measure of model fit tells us how well our regression line captures the underlying
data.
Testing for Significance
In simple linear regression, the F and t tests provide
the same conclusion.
In multiple regression, the F and t tests have different
purposes.
Testing for Significance: F Test
The F test is used to determine whether a significant
relationship exists between the dependent variable
and the set of all the independent variables.
The F test is referred to as the test for overall
significance.
Testing for Significance: t Test
If the F test shows an overall significance, the t test is
used to determine whether each of the individual
independent variables is significant.
A separate t test is conducted for each of the
independent variables in the model.
We refer to each of these t tests as a test for individual
significance.
Testing for Significance: F Test
Hypotheses: H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.
Test Statistic: F = MSR/MSE
Rejection Rule: Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n − p − 1 d.f. in the denominator.
F Test for Overall Significance
Hypotheses: H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.
Rejection Rule: For α = .05 and d.f. = 2, 17: F.05 = 3.59. Reject H0 if p-value < .05 or F > 3.59.
TESTING FOR SIGNIFICANCE
In a multiple regression equation, the mean or expected value of y is a linear function of the x's:
E(y) = β0 + β1x1 + β2x2 + … + βnxn
If β1 = β2 = … = βn = 0, then E(y) = β0 + (0)x = β0, and there is no linear relationship.
Hypothesis
H0: β1 = β2 = … = βn = 0
Ha: One or more parameters are not equal to zero.
Two tests are commonly used. Both require an estimate of the variance of ε in the regression model:
t-test
F-test
F TEST
F = MSR / MSE = 10.8 / 0.328 = 32.92
SSR for miles and deliveries = 15.871 + 5.729 = 21.6
MSR for miles and deliveries = SSR / p = 21.6 / 2 = 10.8
MSE = SSE / (n − p − 1) = 2.299 / 7 = 0.328
The critical value F.01 = 9.55 for 2 degrees of freedom in the numerator and 7 in the denominator.
Since F = 32.92 is much greater than 9.55, we reject H0: β1 = β2 = 0.
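A short sketch of this test, assuming scipy is available (not part of the original slides), that reproduces the F statistic, the critical value, and the p-value:

```python
from scipy import stats

msr, mse = 10.8, 0.328
f_stat = msr / mse                         # approx. 32.9

p, n = 2, 10                               # predictors, observations
f_crit = stats.f.ppf(0.99, p, n - p - 1)   # F.01 with (2, 7) d.f., approx. 9.55
p_value = stats.f.sf(f_stat, p, n - p - 1)

print(f_stat, f_crit, p_value)             # 32.9 > 9.55, so reject H0
```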
Testing for Significance: t Test
Hypotheses: H0: βi = 0
Ha: βi ≠ 0
Test Statistic: t = bi / sbi
Rejection Rule: Reject H0 if p-value < α, or if t < −tα/2 or t > tα/2, where tα/2 is based on a t distribution with n − p − 1 degrees of freedom.
t Test for Significance
of Individual Parameters
Hypotheses: H0: βi = 0
Ha: βi ≠ 0
Rejection Rule: For α = .05 and d.f. = 17: t.025 = 2.11. Reject H0 if p-value < .05, or if t < −2.11 or t > 2.11.
t Test – for each parameter
t = b1 / sb1 = 0.06113 / 0.00989 = 6.18
t = b2 / sb2 = 0.923 / 0.221 = 4.18
The critical value t.005 = 3.499 for 7 degrees of freedom.
For b1: 6.18 > 3.499. Also, for b2: 4.18 > 3.499.
We reject H0 for both input parameters.
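The same comparison in a brief scipy sketch (an illustration, not from the slides):

```python
from scipy import stats

t1 = 0.06113 / 0.00989   # approx. 6.18 (miles)
t2 = 0.923 / 0.221       # approx. 4.18 (deliveries)

# Two-tailed critical value at alpha = .01 with n - p - 1 = 7 d.f.
t_crit = stats.t.ppf(1 - 0.005, df=7)
print(t1, t2, t_crit)    # both t values exceed 3.499, so reject H0 for both
```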
ANOVA Table
R OUTPUT FOR PRODUCTION DATA
t = b1 / sb1, so sb1 = b1 / t and b1 = t × sb1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1209.433 8066.345 0.150 0.882
Month -3.286 78.110 -0.042 0.967
MachineHours 44.707 4.237 10.551 1.71e-10 ***
ProductionRuns 931.803 107.050 8.704 6.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4222 on 24 degrees of freedom
Multiple R-squared: 0.8639, Adjusted R-squared: 0.8469
F-statistic: 50.79 on 3 and 24 DF, p-value: 1.519e-10
If the number of observations is not given, it can be recovered from the residual degrees of freedom: 24 = n − k − 1 = n − 3 − 1, so n = 28.
Categorical Independent Variables
In many situations we must work with categorical
independent variables such as gender (male, female),
method of payment (cash, check, credit card), etc.
For example, x2 might represent gender where x2 = 0
indicates male and x2 = 1 indicates female.
In this case, x2 is called a dummy or indicator variable.
CATEGORICAL INDEPENDENT VARIABLES IN REGRESSION
ANALYSIS
A categorical variable (also called a qualitative variable) refers to a characteristic that can't be quantified.
Categorical variables can be either nominal or ordinal.
Risk  Age  Pressure  Smoker
12    57   150       Yes
23    48   169       Yes
13    60   157       No
55    52   144       Yes
28    71   154       No
17    66   162       No
19    69   143       Yes
In this example, "Smoker" is a categorical variable (nominal data).

ID  Gender  Age  Income   Education
1   Male    25   $50,000  High School
2   Female  30   $60,000  College
3   Male    28   $55,000  College
4   Female  35   $70,000  Graduate
In this example, "Gender" and "Education" are categorical variables. "Gender" is nominal, while "Education" is ordinal.
Categorical variables must be converted into numerical variables before they can be used for regression.
REGRESSION ANALYSIS WITH DUMMY VARIABLES
Linear regression uses quantitative variables, referred to as "numeric" variables; these are variables that represent a measurable quantity.
Examples include:
•Number of square feet in a house
•The population size of a city
•Age of an individual
But we may wish to use categorical variables as predictor variables. These are variables that take on names or labels and can fit into categories. Examples include:
•Gender (e.g. "male", "female")
•Marital status (e.g. "married", "single", "divorced")
The solution is to use dummy variables.
Dummy variables: numeric variables used in regression analysis to represent categorical data; each can take on only one of two values, zero or one.
The number of dummy variables we must create equals k − 1, where k is the number of different values the categorical variable can take on.
Example 1: Create a Dummy Variable with Only Two Values
The categorical variable Gender can
take on two different values (“Male” or
“Female”),
we only need to create k-1 = 2-1 = 1
dummy variable.
To create this dummy variable, we can
choose one of the values (“Male” or
“Female”) to represent 0 and the other
to represent 1.
Example 2: Create a Dummy Variable with Multiple Values
Suppose we have the following dataset and we would like to use marital status and age to
predict income:
Since it is currently a categorical
variable that can take on three
different values (“Single”, “Married”,
or “Divorced”), we need to create k-1
= 3-1 = 2 dummy variables.
Let “Single” be our baseline value
since it occurs most often.
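As a sketch of this k − 1 coding in pandas (column names and data here are illustrative, since the original dataset is not reproduced):

```python
import pandas as pd

df = pd.DataFrame({
    "marital": ["Single", "Married", "Divorced", "Single", "Married"],
    "age":     [25, 40, 38, 30, 45],
})

# k = 3 levels, so create k - 1 = 2 dummies; dropping the "Single"
# column makes "Single" the baseline level.
dummies = pd.get_dummies(df["marital"], dtype=int).drop(columns="Single")
df = pd.concat([df.drop(columns="marital"), dummies], axis=1)
print(df)   # columns: age, Divorced, Married (each 0 or 1)
```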
CATEGORICAL DATASET – EXAMPLE (CONVERT INTO NUMERICAL)
Fund Name                        Fund Type  Net Asset Value ($)  5 Year Average Return (%)  Expense Ratio (%)  Morningstar Rank
Amer Cent Inc & Growth Inv       DE         28.88                12.39                      0.67               2-Star
American Century Intl. Disc      IE         14.37                30.53                      1.41               3-Star
American Century Tax-Free Bond   FI         10.73                3.34                       0.49               4-Star
American Century Ultra           DE         24.94                10.88                      0.99               3-Star
Ariel                            DE         46.39                11.32                      1.03               2-Star
Artisan Intl Val                 IE         25.52                24.95                      1.23               3-Star
Artisan Small Cap                DE         16.92                15.67                      1.18               3-Star
Baron Asset                      FI         50.67                16.77                      1.31               5-Star
Brandywine                       DE         36.58                18.14                      1.08               4-Star
Brown Cap Small                  IE         35.73                15.85                      1.20               4-Star
Fund Name                        Fund Type  Net Asset Value ($)  5 Year Average Return (%)  Expense Ratio (%)  Morningstar Rank
Amer Cent Inc & Growth Inv       1          28.88                12.39                      0.67               2
American Century Intl. Disc      2          14.37                30.53                      1.41               3
American Century Tax-Free Bond   3          10.73                3.34                       0.49               4
American Century Ultra           1          24.94                10.88                      0.99               3
Ariel                            1          46.39                11.32                      1.03               2
Artisan Intl Val                 2          25.52                24.95                      1.23               3
Artisan Small Cap                1          16.92                15.67                      1.18               3
Baron Asset                      3          50.67                16.77                      1.31               5
Brandywine                       1          36.58                18.14                      1.08               4
Brown Cap Small                  2          35.73                15.85                      1.20               4
ASSUMPTIONS OF MULTIPLE LINEAR REGRESSION
Assumption 1: Linear Relationship
Multiple linear regression assumes that there is a linear relationship between each predictor
variable and the response variable.
How to Determine if this Assumption is Met
The easiest way to determine whether this assumption is met is to create a scatter plot of each predictor variable against the response variable, or to check the correlation coefficient between each predictor and the response.
Assumption 2. No Multicollinearity: None of the predictor or input variables are highly correlated with
each other.
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated,
making it difficult to determine the individual contribution of each variable to the dependent variable.
When one or more predictor variables are highly correlated, the regression model suffers
from multicollinearity, which causes the coefficient estimates in the model to become unreliable.
How to Determine if this Assumption is Met
Create a correlation matrix,
OR
calculate the VIF (variance inflation factor) value for each predictor variable.
VIF values start at 1 and have no upper limit. A VIF of 1 indicates no correlation.
•VIF < 5: Low multicollinearity.
•5 < VIF < 10: Moderate multicollinearity.
•VIF > 10: High multicollinearity.
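A minimal statsmodels sketch of the VIF calculation (the data here is synthetic and deliberately collinear, purely for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)   # highly correlated with x1

X = sm.add_constant(np.column_stack([x1, x2]))
for i in (1, 2):                  # skip index 0, the constant column
    print(variance_inflation_factor(X, i))

# Equivalently, VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors.
```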
Assumption 3: Independence or No Autocorrelation in the Residuals
Multiple linear regression assumes that each observation in the dataset is independent.
In the residuals, the value of residual i is independent of residual i + 1.
There is no correlation between consecutive residuals; it is assumed that the residuals are independent.
How to Determine if this Assumption is Met
One way to determine if this assumption is met is to perform a Durbin-Watson test, which is
used to detect the presence of autocorrelation in the residuals of a regression.
The Durbin-Watson test is a statistical test used to detect the presence of autocorrelation in the
residuals of a regression analysis.
In general, if the test value is less than 1.5 (positive autocorrelation) or greater than 2.5 (negative autocorrelation), there is potentially a serious autocorrelation problem.
If the test value is between 1.5 and 2.5, autocorrelation is likely not a cause for concern.
The DW statistic always produces
a value between 0 and 4.
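A brief sketch of the statistic with statsmodels (synthetic data, just to show the call; note that statsmodels returns only the statistic, without a p-value):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))   # near 2 suggests no autocorrelation
```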
Correlation vs. Autocorrelation
Correlation measures the relationship between two variables, whereas autocorrelation measures the
relationship of a variable with lagged values of itself.
Assumption 4: The variance of the residuals is constant (Homoscedasticity)
Multiple linear regression assumes that the residuals have constant variance at every point in
the linear model.
The amount of error in the residuals is similar at each point of the linear model.
How to Determine if this Assumption is Met
The simplest way to determine if this assumption is met is to create a plot of standardized
residuals versus predicted values.
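A quick matplotlib sketch of this diagnostic plot, reusing the fitted model from the Durbin-Watson sketch above:

```python
import matplotlib.pyplot as plt

resid_std = model.resid / model.resid.std()   # standardized residuals
plt.scatter(model.fittedvalues, resid_std)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Standardized residuals")
plt.show()   # an even band around zero supports constant variance
```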
Assumption 5: Multivariate Normality
Multiple linear regression assumes that the residuals of the model are normally distributed.
How to Determine if this Assumption is Met
There are two common ways to check whether this assumption is met:
1. Compare the residuals against their expected Z scores (a normal probability plot).
2. Check the assumption visually using a histogram of the residuals.
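Both checks in a short scipy/matplotlib sketch, again reusing the residuals of the fitted model above:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(model.resid, dist="norm", plot=ax1)  # residuals vs. normal Z scores
ax2.hist(model.resid, bins=10)                      # roughly bell-shaped if normal
plt.show()
```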
NONLINEAR REGRESSION
Nonlinear regression is a mathematical model that fits an equation to data using a generated curve.
Whereas linear regression uses a straight-line equation, nonlinear regression relates the two variables through a nonlinear (curved) relationship.
(Of the three plots shown, the first is linear and the other two are nonlinear.)
COEFFICIENTS (exercise; refers to regression output not reproduced here)
1. Find the t value for the variable Time.
2. Find the standard error for the variable Share.
3. Find the coefficient for the variable Work.
4. Comment on the VIF value for the variable Rating.
5. Calculate the VIF value for a given R-square value.
             Df   Sum Sq    Mean Sq   F value   Pr(>F)
SquareFeet    1   2803636   _______   _______   2.24e-13 ***
Bedrooms      1   799206    _______   _______   2.38e-05 ***
Bathrooms     1   427256    _______   _______   0.00168 **
Residuals   124   5138423   _______
1. Find the value for Multiple R,
2. R square,
3. Adjusted R square and
4. standard error.
5. F value
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.4083 172.0038 -0.328 0.743504
SquareFeet 0.3564 _____ 3.341 0.001102 **
Bedrooms 104.5993 29.1233 ______ 0.000472 ***
Bathrooms _______ 42.1867 3.211 0.001685 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 203.6 on 124 degrees of freedom
Multiple R-squared: 0.4396, Adjusted R-squared: ____________
F-statistic: 32.42 on 3 and 124 DF, p-value: 1.535e-15
1. Fill in all the blanks in the table.
2. Create a least squares line (equation) for multiple linear regression for the given analysis.
3. Predict the value for price if the area is 2000 sq ft, the number of bedrooms is 3, and the number of bathrooms is 2.
Q - Below is the regression output related to the annual post-college earning (USD) which is based on the college
annual education cost (USD), its graduation rate (%), and debt which is the percentage of students paying loans (%).
1. Interpret the value of the coefficient of
determination.
2. Comment on the significance of predictors
mentioning their hypotheses and specific p-value
at a 5% level of significance.
3. Predict the earnings if the cost = 25000,
grad = 60, and debt = 80.
4. Interpret the values of VIF for Grad.
VIF Value for Grad: 22.56
EXAMPLE -
VIF Values –
Miles: 1.026963
Deliveries: 1.026963
1. Interpret the significance of each input based on the p-value.
2. Interpret based on R2 value.
3. Write the equation.
4. Find output where the value of miles is 10 and deliveries is 30.
VIF Values –
Month: 1.045953, MachineHours: 1.063087, ProductionRuns: 1.102359
Durbin-Watson test –
Autocorrelation: 1.6143758, D-W Statistic: 1.304288, p-value: 0.028
1. Interpret the significance of each input based on the p-value.
2. Interpret based on the VIF value of each predictor.
3. Interpret based on the Durbin-Watson value.