Project 2
Submitted by: Sumit Sinha
Program & Group: PGPBABIOnline May19_A
Perform exploratory data analysis on the dataset. Showcase some charts, graphs.
Check for outliers and missing values
After importing the “Factor-Hair-Revised.csv” file into R, we started by checking
some basic descriptive statistics. At the very beginning the “ID” column was
dropped, as it carries no informational value for the subsequent analysis.
To understand the contents of the data we used the “str” function, and then took
the help of the “describe” function from the “psych” library to get a complete
view of the descriptive statistics of the data.
Code:
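A minimal sketch of these steps in R; the data-frame name hair and the use of
read.csv are our own assumptions:

library(psych)

hair <- read.csv("Factor-Hair-Revised.csv", header = TRUE)  # import the data
hair <- hair[ , -1]   # drop the "ID" column (first column)

str(hair)             # structure and types of all variables
describe(hair)        # descriptive statistics from the psych package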
Output:
Continuing our effort to understand the data, we created a boxplot for each of
the variables.
Also, with the help of the “ggpairs” function in R, we created a chart that
summarizes the distributions of all variables, together with scatter plots and
correlation coefficients, to understand the relationships between the dependent
and independent variables in a single view.
Next we checked the missing-value count for each of the variables. In the
present case the data does not have any missing values for any of the
variables.
Code:
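A minimal sketch of the boxplot, the “ggpairs” chart and the missing-value
check; GGally is assumed as the package providing ggpairs:

library(GGally)

boxplot(hair, las = 2)   # one boxplot per variable in a single frame
ggpairs(hair)            # distributions, scatter plots and correlations
colSums(is.na(hair))     # missing-value count per variable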
Output:
From the boxplot chart we infer that, although a few variables show evidence of
statistical outliers, the ratings are well within the limit of 1-10 prescribed
by the survey. So there is no requirement for outlier treatment in this case.
From the “ggpairs” chart we see that:
a) The dependent variable “Satisfaction” has good correlation with the
CompRes, ProdLine, OrdBilling and DelSpeed variables
b) OrdBilling and DelSpeed are highly correlated
c) WartyClaim and TechSupport are highly correlated
d) CompRes and DelSpeed are highly correlated
e) OrdBilling and CompRes are highly correlated
f) Ecom and SalesFImage are highly correlated
Is there evidence of multicollinearity? Showcase your analysis
The evidence of multicollinearity was already discussed in the previous problem
with the help of the “ggpairs” chart. Next we analyze the issue with a
correlation chart.
Code:
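A minimal sketch of the correlation chart; the use of the corrplot package is
our assumption:

library(corrplot)

cor_mat <- cor(hair)                   # correlation matrix of all variables
corrplot(cor_mat, method = "number")   # annotated correlation chart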
Output:
Visually it is also evident that multicollinearity exists between the
independent variables, which can be briefly summarized as follows:
a) OrdBilling and DelSpeed are highly correlated
b) WartyClaim and TechSupport are highly correlated
c) CompRes and DelSpeed are highly correlated
d) OrdBilling and CompRes are highly correlated
e) Ecom and SalesFImage are highly correlated
Perform simple linear regression for the dependent variable with every
independent variable
Before proceeding to perform simple linear regression, we state the assumptions
of multiple linear regression and where we stand in the present case with
respect to these assumptions.
1) Linearity of the dependent variable and the independent variables: from the
“ggpairs” chart we have seen that the dependent variable Satisfaction has a
linear relationship with the independent variables.
2) Homoscedasticity of the error term: we will check this in the model
diagnostics.
3) Errors are normally distributed: we will check this in the model
diagnostics.
4) Independent variables do not show evidence of multicollinearity: we have
seen that there is definite multicollinearity in our data, and this could be
the principal reason behind the non-acceptance of this model; ultimately we
proceed to factor analysis and use the factors as independent variables in the
model.
Code:
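A minimal sketch: one simple regression per independent variable as asked,
followed by the full model whose output is discussed below (the object name
full_model is our own):

# One simple linear regression per independent variable
predictors <- setdiff(names(hair), "Satisfaction")
for (p in predictors) {
  print(summary(lm(as.formula(paste("Satisfaction ~", p)), data = hair)))
}

# Full model with all independent variables together
full_model <- lm(Satisfaction ~ ., data = hair)
summary(full_model)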
Output:
From the model output we can say that the R-squared of the regression, which
measures how much of the variation in the outcome can be explained by variation
in the independent variables, is high at 0.8021. If we check the p-value
associated with each of the independent variables, we find that many of the
model variables (TechSup, CompRes, Advertising, etc.) are not significant in
the model (at alpha = 5%).
Overall, the p-value associated with the F-statistic is less than 0.05,
suggesting the model is significant; in other words, at least one of the
independent variables is significant.
Model Diagnostics check:
Code:
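A minimal sketch of the standard diagnostic plots for the fitted model:

par(mfrow = c(2, 2))   # four diagnostic plots in one frame
plot(full_model)       # Residuals vs Fitted, Normal Q-Q, Scale-Location, Leverage
par(mfrow = c(1, 1))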
Output:
The Residuals vs. Fitted chart shows the residuals scattered with no systematic
pattern, indicating that the variables are linearly related.
The Normal Q-Q plot shows that most of the points lie on the line, indicating
that the residuals follow a normal distribution.
The Scale-Location chart shows points randomly distributed around the
horizontal line, indicating that the model meets the constant-variance
assumption.
Model Equation:
Below is the model equation generated from the estimates in the model output:
Satisfaction = -0.66 + 0.37*ProdQual - 0.44*Ecom + 0.033*TechSup +
0.16*CompRes - 0.02*Advertising + 0.14*ProdLine + 0.80*SalesFImage -
0.038*CompPricing - 0.10*WartyClaim + 0.14*OrdBilling + 0.16*DelSpeed
Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name
the Factors
To perform Principal Component Analysis (PCA) we modified the data that we had
been working with: we dropped the dependent variable “Satisfaction” from the
analysis. Below are the code for PCA and the associated output and inferences:
Code:
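A minimal sketch of the assumption tests described below; KMO and
cortest.bartlett come from the psych package already loaded, and the object
names hair_x and cor_x are our own:

hair_x <- as.matrix(hair[ , names(hair) != "Satisfaction"])  # drop Satisfaction
cor_x  <- cor(hair_x)                                        # correlation matrix

KMO(cor_x)                                  # sampling adequacy
cortest.bartlett(cor_x, n = nrow(hair_x))   # Bartlett's test of sphericity
det(cor_x)                                  # determinant of the correlation matrix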
Output and inferences:
We first do the assumption testing for PCA, which is equally valid for
progressing to Factor Analysis (FA). We convert the data into matrix form
before running the functions used for assumption testing.
In the KMO test we find a value of 0.65, which indicates that the sample size
is adequate to proceed with PCA/FA.
The Bartlett test value is 619.27 and the associated p-value is less than
alpha = 0.05. So Bartlett's test is statistically significant, which also
indicates that there exist sufficient correlations in the matrix to proceed
with PCA/FA.
The determinant value is positive, although it is very small.
So now, from the KMO test, Bartlett's test and the determinant of the
correlation matrix, we have satisfied all the assumptions for performing
PCA/FA.
Although we have been asked to perform PCA/factor analysis by extracting 4
factors, we first test how many principal components/factors could ideally be
considered in the given case. For that we plotted the scree plot as follows:
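A minimal sketch of the scree plot, built from the eigenvalues of the
correlation matrix above:

ev <- eigen(cor_x)$values   # eigenvalues of the correlation matrix
plot(ev, type = "b", xlab = "Component number", ylab = "Eigen value",
     main = "Scree plot")
abline(h = 1, lty = 2)      # Kaiser rule cut-off at eigenvalue = 1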
If we follow the Kaiser rule to choose the optimum number of principal
components based on eigenvalues, then 4 is the number, as the 4th point on the
scree plot is just above eigenvalue = 1.
We run the PCA model with the “principal” function in R on the correlation
matrix of the data, with “varimax” rotation and number of factors = 4.
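A minimal sketch of this call, assuming the correlation matrix cor_x from
above:

pca_model <- principal(cor_x, nfactors = 4, rotate = "varimax")
print(pca_model)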
Below is the detailed output from this model:
From the PCA model output we can say that the 4 principal components with
eigenvalue > 1 explain 80% of the variance, with a big reduction in dimension
(from 11 variables to only 4 components). Principal components 1, 2, 3 and 4
explain 26%, 20%, 17% and 16% of the variance respectively.
Next we do the factor analysis.
For FA, the assumptions and the tests that we conducted on the given data are
the same as for PCA. So we are good to proceed with the factor analysis.
We first run the “fa” function on the correlation matrix of the data.
Then we group the factors in a very similar way as we did in PCA.
Code:
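A minimal sketch, assuming principal-axis factoring (fm = "pa", which matches
the PA1-PA4 labels below) with varimax rotation:

fa_model <- fa(cor_x, nfactors = 4, rotate = "varimax", fm = "pa")
print(fa_model)
fa.diagram(fa_model)   # diagram grouping the variables under each factor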
Output is as below:
The 4 factors PA1, PA2, PA3 and PA4 explain 69% of the variance, PA1 being the
dominant contributor, as it alone explains 24% of the variance. So from the
initial 11 variables we are down to only 4 factors that explain 69% of the
variance.
The grouping of the variables with respect to the factors is very clear from
the diagram above. The grouped variables are represented by the factors, and we
name the factors according to their inherent characteristics:
PA1 = “Buy_Ease”
PA2 = “Marketing”
PA3 = “After_Sales”
PA4 = “Positioning”
Next we get the scores for the 4 factors and merge these scores with the
dependent variable “Satisfaction” from the initial data. We label the columns
of this data set with the factor names given above and the dependent variable:
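A minimal sketch of the scoring and merging step; the use of factor.scores
(from psych) and the mapping of PA1-PA4 to the names above are assumptions that
follow the report's naming:

# Factor scores need the raw data; project hair_x onto the factor solution
scores  <- factor.scores(hair_x, fa_model)$scores
hair_fa <- data.frame(scores, Satisfaction = hair$Satisfaction)
colnames(hair_fa) <- c("Buy_Ease", "Marketing", "After_Sales",
                       "Positioning", "Satisfaction")
head(hair_fa)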
Thus our data, with one dependent variable and 4 factors produced by the
dimensionality-reduction method of factor analysis, is ready for the final
regression analysis.
Perform multiple linear regression with customer satisfaction as dependent
variable and the four factors as independent variables. Comment on the
model output and validity. Your remarks should make it meaningful for
everybody.
Before proceeding with the modeling we split the data into training and
validation sets. We build the model on the training data and check its
performance on the validation data.
(Note: due to the small sample size, a very small change in the proportion of
training and validation data can give a different fit and different model
accuracy.)
Considering the “Satisfaction” variable as the dependent variable, we model the
data on the rest of the variables.
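A minimal sketch of the split and of model_a; the 70/30 proportion and the seed
are assumptions (see the note above on how the proportion affects the fit):

set.seed(123)   # assumed seed, for repeatability
idx   <- sample(seq_len(nrow(hair_fa)), size = round(0.7 * nrow(hair_fa)))
train <- hair_fa[idx, ]
valid <- hair_fa[-idx, ]

model_a <- lm(Satisfaction ~ Buy_Ease + Marketing + After_Sales + Positioning,
              data = train)
summary(model_a)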
Below is the output of the model_a:
Other than the “After_Sales” variable, all variables are significant in the
model. With the present set of variables the model-fit parameter R-squared is
0.745 and the adjusted R-squared is 0.73.
The p-value associated with the F-statistic signifies that this model is
significant in explaining the variance in the dependent variable
“Satisfaction”.
We will proceed to build a model without this variable (“After_Sales”) in a new
iteration and check how it performs, as sketched below.
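A minimal sketch of model_b, dropping the non-significant factor:

model_b <- lm(Satisfaction ~ Buy_Ease + Marketing + Positioning, data = train)
summary(model_b)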
Below is the output of the model_b:
In this model we find that all the variables are significant, the R-squared is
the same as in model_a, and the adjusted R-squared is marginally higher than in
model_a.
The p-value associated with the F-statistic signifies that this model (model_b)
is also significant in explaining the variance in the dependent variable
“Satisfaction”.
Model Diagnostics:
We will run the model diagnostics on model_b to be sure that it meets the
linear regression assumptions.
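A minimal sketch, reusing the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(model_b)          # same four diagnostic plots as before, now for model_b
par(mfrow = c(1, 1))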
Output:
The Residuals vs. Fitted chart shows the residuals scattered with no systematic
pattern, indicating that the variables are linearly related.
The Normal Q-Q plot shows that most of the points lie on the line, indicating
that the residuals follow a normal distribution.
The Scale-Location chart shows points randomly distributed around the
horizontal line, indicating that the model meets the constant-variance
assumption.
We also checked the multicollinearity in the model_b:
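A minimal sketch, assuming the vif function from the car package:

library(car)
vif(model_b)   # variance inflation factor for each independent variable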
Output:
The VIF for each of the variables is less than 2, showing that there is almost
no multicollinearity among the independent variables.
Performance Check:
We validate both models (model_a and model_b) on their predictive accuracy by
comparing the observed values of Satisfaction with the predicted values of
Satisfaction. We could generate several metrics for the linear regression
models and compare them, such as the correlation coefficient, Root Mean Square
Error (RMSE) and Mean Absolute Deviation (MAD). Here we compare using only the
correlation coefficient.
Code:
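A minimal sketch of the comparison on the validation data:

pred_a <- predict(model_a, newdata = valid)   # predictions from model_a
pred_b <- predict(model_b, newdata = valid)   # predictions from model_b

cor(valid$Satisfaction, pred_a)               # accuracy of model_a
cor(valid$Satisfaction, pred_b)               # accuracy of model_b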
Output:
We can infer from the result that model_b exhibits better accuracy than
model_a. model_b also uses fewer variables/factors to predict Satisfaction,
which is operationally advantageous. We could possibly improve the accuracy of
this model by including interaction effects of the independent variables.
We were able to predict Satisfaction with a good level of accuracy even after
facing multicollinearity in the original data, by proceeding to a
dimensionality-reduction method (PCA/FA) to manage the issue and finally
building a model from it.
Model Equation:
Below is the model equation generated from the estimates in the model_b output:
Satisfaction = 6.9488 + 0.6757*Buy_Ease + 0.5326*Marketing +
0.5816*Positioning