NOTE
Slides marked as Extra study are not as a part of syllabus.
Those are provided for add-on knowledge.
1
Disclaimer
This presentation is purely for academic purpose and does not carry
any commercial value.
All non-academic images used in this presentation are property of
respective image holder(s). Images are used only for indicative
purpose and does not carry any other meaning.
2
Please follow this…
3
Regression Analysis
Data Analysis with Excel
www.pibm.in
4
TEXT BOOK
Hector Guerrero, “Excel Data Analysis -Modeling and Simulation”,
Springer-Verlog
Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Willians,
“Essentials of Business Analytics”, CenageLearning
REFERENCE LINK FOR EXCEL
https://www.excel-easy.com/
https://support.office.com/en-us/article/excel-for-windows-
training-9bc05390-e94c-46af-a5b3-
d7c22f6990bb?wt.mc_id=otc_home&ui=en-US&rs=en-US&ad=US
5
Table of Content
1 Linear regression Toy store : Case Study
2 Modeling
3 Estimation
4 Inference
5 Prediction
6 Example 2
6
Data Set
Toy-Sale2.xlsx
QuantitySales.xlsx
7
Linear Regression
What is Linear Regression?
Linear regression attempts to fit a linear relation between a variable of
interest and a set of variables that may be related to the variable of
interest.
8
Linear Regression for Business Statistics
Example
There is a Sales manager of a toys retail company which sells
various kinds of toys in the local market. This sales manager
needs to make some kind of projections about the number of
monthly units that the retail company will be able to sell of
this particular toy in the coming half year. In the past she has
been making such projections based on her gut feeling and
now wishes to be a little more scientific about the whole
process.
9
Linear Regression
Overview of Regression
1.Modeling - Developing a regression model
2.Estimation - Using software to estimate the model
3.Inference - Interpreting the estimated regression model
4.Prediction - Making predictions about the variable of interest
10
1. Modeling
Developing a regression model
To develop regression model implies representing your knowledge
and beliefs about process and equation
Example (continued…) Based on her experience, the manager figures
out that the monthly unit sales depend on three important variables,
the price at which the toy is sold, the monthly amount that the
company spends on advertising the toy and the monthly amount
spent on promotions for the toy.
11
1. Modeling
Question that comes to our mind is, are these only factor on which sales
depends?
Certainly not.
The level of excitement amongst consumers about this toy,
The month of year for which sales are being considered,
Perhaps some micro factors such as countries, GDP and so
on. …….
There are lots of factors that might impact sales.
12
1. Modeling
Sales is variable of interest or Y variable or dependent variable or L.H.S.
variable.
The R.H.S. Variables will be used to explain sales. These are Price of toy in
particular month, Advertise Expenditure & promotional Expenditure.
Β0, β1, β2, β3 are coefficients of equation
Sales depends on Advertise Expenditure & promotional Expenditure
The Regression Line is
13
1. Modeling
14
Toy-sales2.xlsx
Month Unit Sales Price ($) Adexp ('000$) Promexp ('000$)
1 73959 8.75 50.04 61.13
2 71544 8.99 50.74 60.19
3 78587 7.50 50.14 59.16
4 80364 7.25 50.27 60.38
5 78771 7.40 51.25 59.71
6 71986 8.50 50.65 59.88
7 74885 8.40 50.87 60.14
8 73345 7.90 50.15 60.08
9 76659 7.25 48.24 59.90
10 71880 8.70 50.19 59.68
11 73598 8.40 51.11 59.83
12 74893 8.10 51.49 59.77
15
Data Analysis Toolpak - Excel
1. On the Data tab, in the Analysis group, click
Data Analysis
2. Select Regression and click OK.
3. Select the Y Range (A1:A25). This is the
predictor variable (also called dependent
variable).
4. Select the X Range(B1:E25). These are the
explanatory variables (also called independent
variables). These columns must be adjacent to
each other.
5. Check Labels.
6. Click in the Output Range box and select cell
H1.
7. Check Residuals.
8. Click OK.
16
Data Analysis Toolpak - Excel
SUMMARY OUTPUT Coefficients Standard Error t Stat P-value
Regression Statistics Intercept -25096.8329 24859.61131 -1.0095425 0.324773
Multiple R 0.926739
R Square 0.858845 Price ($) -5055.26987 526.3995537 -9.6034843 6.22E-09
Adjusted R Adexp
Square 0.837671 ('000$) 648.6121403 209.0048787 3.10333493 0.005602
Standard
Error 1274.936 Promexp
Observations 24 ('000$) 1802.610956 392.8485427 4.5885647 0.000178
17
2. Estimation
Using software to estimate the model
18
2. Estimation
β0 = -25096.83
β1 = -5055.27
β2 = 648.61
β3 = 1802.61
β1 represents extent of impact of price has on sales
β2 represents extent of impact of Advertisement Expenditure has on sales
β3 represents extent of impact of promotional Expenditure has on sales
19
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.926738719
R Square 0.858844653
Adjusted R
Square 0.83767135
Standard Error 1274.93553
Observations 24
ANOVA
df SS MS F Significance F
Regression 3 197798832.8 65932944.3 40.56262 1.085E-08
Residual 20 32509212.11 1625460.61
Total 23 230308045
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -25096.8329 24859.61131 -1.0095425 0.324773 -76953.073 26759.4074 -76953.0733 26759.40743
Price ($) -5055.26987 526.3995537 -9.6034843 6.22E-09 -6153.3201 -3957.2196 -6153.32009 -3957.21964
Adexp ('000$) 648.6121403 209.0048787 3.10333493 0.005602 212.6356 1084.58868 212.6356042 1084.588676
20
2. Estimation
R2 Statistic is measure of goodness of
fit
SUMMARY OUTPUT
0 ≤ R2 ≤ 1
Regression Statistics That is, The Value of R2 always lie between 0 and
Multiple R 0.926738719 1
R Square 0.858844653 R2 is proportion of variation in Y variable
Adjusted R Square 0.83767135 explained by regression model
Standard Error 1274.93553
Observations 24
R2 = 0.86 %
This model is able to explain about 0.86 %
variation of changes in unit sales of toy.
The R2 closer to 1, the better the regression line fits the data
21
Significance F and P-values
To check if your results are reliable (statistically significant), look at Significance F .
F = 1.085E-08 = 0.0000 < 0.05
1) If this value F < 0.05, then model is OK.
If Significance F > 0.05, 1.it's probably better to stop using this set of independent variables.
2. Delete a variable with a high P-value (greater than 0.05) and rerun the regression until Significance
F drops below 0.05.
2) Most or all P-values should be bellow 0.05 ( P-value for intercept > 0.05, P-values for independent are 0.000,
0.005 )
ANOVA
df SS MS F Significance F
Regression 3 197798832.8 65932944.3 40.56262 1.085E-08
Residual 20 32509212.11 1625460.61
Total 23 230308045
Coefficients Standard Error t Stat P-value Lower 95%
Intercept -25096.8329 24859.61131 -1.0095425 0.324773 -76953.073
Price ($) -5055.26987 526.3995537 -9.6034843 6.22E-09 -6153.3201
Adexp ('000$) 648.6121403 209.0048787 3.10333493 0.005602 212.6356
22
3. Inference
Interpreting the estimated regression model
When the variable X1 increases by one unit, the Y variable increases by β1
units all other variables in Model are being kept at the same level.
Similarly When the variable X2 increases by one unit, the Y variable
increases by β2 units all other variables in Model are being kept at the same
level. ……..
23
3. Inference
Interpreting the estimated regression model
24
4. Prediction
Making predictions about the variable of interest
For the coming six months, company management is considering three
alternative scenarios for selling this particular toy.
Scenario 1: Price = 9.10$, AdExp = 52,000$, PromExp = 61,000$
Scenario 2: Price = 7.10$, AdExp = 48,000$, PromExp = 57,000$
Scenario 3: Price = 8.10$, AdExp = 50,000$, PromExp = 60,000$
Which scenario to implement to maximize unit sales?
25
H I J K L M N
16 Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
17 Intercept -25096.8329 24859.61131 -1.0095425 0.324773 -76953.073 26759.4074
18 Price ($) -5055.26987 526.3995537 -9.6034843 6.22E-09 -6153.3201 -3957.2196
19 Adexp 648.6121403 209.0048787 3.10333493 0.005602 212.6356 1084.58868
('000$)
20 Promexp 1802.610956 392.8485427 4.5885647 0.000178 983.14326 2622.07865
('000$)
21
22 Price AdExp PromExp predicted
=I$17+I$18*J23+I$19*K23+I$20
23 scenario 1 9.1 52 61 72587 *L23
=I$17+I$18*J24+I$19*K24+I$20
24 2 7.1 48 57 72893 *L24
=I$17+I$18*J25+I$19*K25+I$20
25 3 8.1 50 60 74543 *L25
26
4. Predictions
Price AdExp PromExp predicted
scenario 1 9.1 52 61 72587
2 7.1 48 57 72893
3 8.1 50 60 74543
27
4. Prediction
Making predictions about the variable of interest
Scenario 1: Price = 9.10$, AdExp = 52,000$, PromExp = 61,000$
Predicted Sales = 72587.31
Scenario 2: Price = 7.10$, AdExp = 48,000$, PromExp = 57,000$
Predicted Sales = 72892.96
Scenario 3: Price = 8.10$, AdExp = 50,000$, PromExp = 60,000$
Predicted Sales = 74542.75
28
Residuals
The residuals show you how far away the actual data points are from the predicted data points
(using the equation).
For example, the first data point equals 73959
Using the equation, the predicted for first month is
-25096.83 -5055.27 * 8.75 648.61* 50.04 + 1802.61 *61.13 = 73319.71
residual = 73959 - 73319.71 = 639.28.
Month Unit Sales Price ($) Adexp ('000$) Promexp ('000$) prediction residuals
1 73959 8.75 50.04 61.13 73319.71 639.2850022
2 71544 8.99 50.74 60.19 70866.02 677.9755706
3 78587 7.50 50.14 59.16 76152.52 2434.480039
4 80364 7.25 50.27 60.38 79699.84 664.1576281
5 78771 7.40 51.25 59.71 78369.44 401.5575512
29
Ex 2.
We have data from Company ELCAB
produces 7 items. The data on Quantity
Sold, price of item, Advertising cost for A B C D
Item is given 1 Quantity Sold Price Advertising
The question is: is there a relation between 2 8500 $2 $2,800
Quantity Sold (Output) and Price and 3 4700 $5 $200
Advertising cost (Input). 4 5800 $3 $400
In other words: can we predict Quantity 5 7400 $2 $500
Sold if we know Price and Advertising? 6 6200 $5 $3,200
7 7300 $3 $1,800
Here Price, Advertising cost are Input
variables. Quantity Sold is dependent 8 5600 $4 $900
variable. QuantitySales.XLSX
30
1. Modeling
The regression line is:
Quantity Sold = β0 + β1 * Price + β2 * Advertising cost
Quantity Sold is dependent variable
Price & Advertising cost are Independent Variables
β 0, β1, β2 are coefficients of equation
31
Data Analysis Toolpak - Excel
1. On the Data tab, in the Analysis group, click
Data Analysis
2. Select Regression and click OK.
3. Select the Y Range (A1:A8). This is the predictor
variable (also called dependent variable).
4. Select the X Range(B1:C8). These are the
explanatory variables (also called independent
variables). These columns must be adjacent to
each other.
5. Check Labels.
6. Click in the Output Range box and select cell E1.
7. Check Residuals.
8. Click OK.
32
2. Estimation
Quantity Sold = β0 + β1 * Price + β2 * Advertising cost
Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising cost
β0 = 8536.214
β1 = -835.722
β2 = 0.592
33
2. Estimation
R2 = 0.962, which is a very good
SUMMARY OUTPUT fit.
Regression Statistics 96% of the variation in Quantity
Multiple R 0.980681 Sold is explained by the
independent variables Price and
R Square 0.961736 Advertising.
Adjusted R Square 0.942604
.
Standard Error 310.5239
Observations 7
The R2 closer to 1, the better the regression line fits the data
34
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.980681
R Square 0.961736
Adjusted R
Square 0.942604
Standard Error 310.5239
Observations 7
ANOVA
df SS MS F Significance F
Regression 2 9694300 4847150 50.26854 0.001464
Residual 4 385700.4 96425.11
Total 6 10080000
Upper
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% 95.0%
Intercept 8536.214 386.9117 22.06243 2.5E-05 7461.975 9610.453 7461.975 9610.453
Price -835.722 99.65304 -8.38632 0.001106 -1112.4 -559.041 -1112.4 -559.041
Advertising 0.592228 0.104347 5.675579 0.004755 0.302515 0.881942 0.302515 0.881942
35
Significance F and P-values
To check the results are reliable or not (statistically significant),
F = 0.001. Since F is less than 0.05, Model is OK.
Most or all P-values should be below 0.05. In this example it is (0.000,
0.001 and 0.005).
Note : If F > 0.05, Donot use this set of independent variables.
1. Delete a variable with a high P-value (greater than 0.05)
2. Rerun the regression until Significance F drops below 0.05.
36
3. Inference
When the variable Price increases by one unit, the Quantity Sold decreases
by - 835.722 units all other variables in Model are being kept at the same
level.
Similarly When the Advertising cost increases by one unit, the Quantity
Sold increases by 0.592 units all other variables in Model are being kept at
the same level. ……..
Quantity Sold = β0 + β1 * Price + β2 * Advertising cost
Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising cost
37
4. Predictions
if price equals $4 and Advertising cost equals $3000, you might be want to
predict a Quantity Sold
if price equals $3 and Advertising cost equals $2000, you might be want to
predict a Quantity Sold
if price equals $2 and Advertising cost equals $1500, you might be want to
predict a Quantity Sold
Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising cost
38
4. Predictions
B C D E F G H I
15 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0%
16 Intercept 8536.214 386.9117 22.06243 2.5E-05 7461.974654 9610.453111 7461.975
17 Price -835.722 99.65304 -8.38632 0.001106 -1112.40356 -559.041143 -1112.4
18 Advertising 0.592228 0.104347 5.675579 0.004755 0.302515325 0.881941666 0.302515
19
Advertising Predicted
Price
cost Quantity
$4 $3,000 6970.01 =C17+C18*F22+C19*G22
$3 $2,000 7213.50 =C17+C18*F23+C19*G23
$2 $1,500 7753.11 =C17+C18*F24+C19*G24
39
4. Predictions
Predicted
Price Advertising cost
Quantity Sold
1 $4 $3,000 6970
2 $3 $2,000 7214
3 $2 $1,500 7753
40
Residuals
Quantity Advertising predicted Quantity
Price Residual
Sold Cost Sold
8500 $2 $2,800 $8,523.01 ($23.01)
4700 $5 $200 $4,476.05 $223.95
5800 $3 $400 $6,265.94 ($465.94)
7400 $2 $500 $7,160.88 $239.12
6200 $5 $3,200 $6,252.73 ($52.73)
7300 $3 $1,800 $7,095.06 $204.94
5600 $4 $900 $5,726.33 ($126.33)
41
HOMEWORK PROBLEMS
42
Home Work
Month Temperature Jackets Sold
Nov 15 100
Dec 10 150
Jan 8 250
Feb 12 100
Mar 16 50
Apr 22 10
Data of winter season jacket sold data with temperature in each month.
Temperature is the independent variable because one cannot control the
temperature.
Jackets Sold is the dependent variable because based on the temperature
increase and decreases jacket sale
43
Home Work
Fit Regression line using Data Analysis Toolpak, Write equation of
Regression line with proper notations.
Estimate the parameters.
Find R2, F, P-Values.
Find prediction for Jackets Sold when temperature of month June, Sep, Oct
when is 25, 13, 14
Find predictions and residual for month Nov, Dec, Jan, Feb, Mar, Apr
44
PGDM
Happy Analyzing !!!!!! ….!!!
45