Project 2
Submitted by: Sumit Sinha
Program & Group: PGPBABIOnline May19_A
Perform exploratory data analysis on the dataset. Showcase some charts, graphs.
Check for outliers and missing values
After importing the “Factor-Hair-Revised.csv” file into R, we started by checking
some basic descriptive statistics. At the very beginning the “ID” column was
dropped, as it carries no informational value for the subsequent analysis.
To understand the contents of the data we used the “str” function, and then took
the help of the “describe” function from the “psych” library to get a complete
view of the descriptive statistics of the data.
Code:
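A minimal sketch of these steps in R; the data-frame name hair and the use of
read.csv are our own assumptions:

library(psych)

hair <- read.csv("Factor-Hair-Revised.csv", header = TRUE)  # import the data
hair <- hair[ , -1]   # drop the "ID" column (first column)

str(hair)             # structure and types of all variables
describe(hair)        # descriptive statistics from the psych package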
Output:
Continuing our effort to understand the data, we created a boxplot for each of
the variables.
Also, with the help of the “ggpairs” function in R, we created a chart that
summarizes the distributions of all variables, together with scatter plots and
correlation coefficients, to understand the relationships between the dependent
and independent variables in a single view.
Next we checked the missing-value count for each of the variables. In the
present case the data does not have any missing values for any of the
variables.
Code:
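A minimal sketch of the boxplot, the “ggpairs” chart and the missing-value
check; GGally is assumed as the package providing ggpairs:

library(GGally)

boxplot(hair, las = 2)   # one boxplot per variable in a single frame
ggpairs(hair)            # distributions, scatter plots and correlations
colSums(is.na(hair))     # missing-value count per variable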
Output:
From the boxplot chart we infer that, although a few variables show evidence of
statistical outliers, the ratings are well within the limit of 1-10 prescribed
by the survey. So there is no requirement for outlier treatment in this case.
From the “ggpairs” chart we see that:
a) The dependent variable “Satisfaction” has good correlation with the
CompRes, ProdLine, OrdBilling and DelSpeed variables
b) OrdBilling and DelSpeed are highly correlated
c) WartyClaim and TechSupport are highly correlated
d) CompRes and DelSpeed are highly correlated
e) OrdBilling and CompRes are highly correlated
f) Ecom and SalesFImage are highly correlated
Is there evidence of multicollinearity? Showcase your analysis
The evidence of multicollinearity was already discussed in the previous problem
with the help of the “ggpairs” chart. Next we analyze the issue with a
correlation chart.
Code:
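A minimal sketch of the correlation chart; the use of the corrplot package is
our assumption:

library(corrplot)

cor_mat <- cor(hair)                   # correlation matrix of all variables
corrplot(cor_mat, method = "number")   # annotated correlation chart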
Output:
Visually it is also evident that multicollinearity exists between the
independent variables, which can be briefly summarized as follows:
a) OrdBilling and DelSpeed are highly correlated
b) WartyClaim and TechSupport are highly correlated
c) CompRes and DelSpeed are highly correlated
d) OrdBilling and CompRes are highly correlated
e) Ecom and SalesFImage are highly correlated
Perform simple linear regression for the dependent variable with every
independent variable
Before proceeding to perform simple linear regression, we state the assumptions
of multiple linear regression and where we stand in the present case with
respect to these assumptions.
1) Linearity of the dependent variable and the independent variables: from the
“ggpairs” chart we have seen that the dependent variable Satisfaction has a
linear relationship with the independent variables.
2) Homoscedasticity of the error term: we will check this in the model
diagnostics.
3) Errors are normally distributed: we will check this in the model
diagnostics.
4) Independent variables do not show evidence of multicollinearity: we have
seen that there is definite multicollinearity in our data, and this could be
the principal reason behind the non-acceptance of this model; ultimately we
proceed to factor analysis and use the factors as independent variables in the
model.
Code:
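A minimal sketch: one simple regression per independent variable as asked,
followed by the full model whose output is discussed below (the object name
full_model is our own):

# One simple linear regression per independent variable
predictors <- setdiff(names(hair), "Satisfaction")
for (p in predictors) {
  print(summary(lm(as.formula(paste("Satisfaction ~", p)), data = hair)))
}

# Full model with all independent variables together
full_model <- lm(Satisfaction ~ ., data = hair)
summary(full_model)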
Output:
From the model output we can say that the R-squared of the regression, which
measures how much of the variation in the outcome can be explained by variation
in the independent variables, is high at 0.8021. If we check the p-value
associated with each of the independent variables, we find that many of the
model variables (TechSup, CompRes, Advertising, etc.) are not significant in
the model (at alpha = 5%).
Overall, the p-value associated with the F-statistic is less than 0.05,
suggesting the model is significant; in other words, at least one of the
independent variables is significant.
Model Diagnostics check:
Code:
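A minimal sketch of the standard diagnostic plots for the fitted model:

par(mfrow = c(2, 2))   # four diagnostic plots in one frame
plot(full_model)       # Residuals vs Fitted, Normal Q-Q, Scale-Location, Leverage
par(mfrow = c(1, 1))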
Output:
The Residuals vs. Fitted chart shows the residuals scattered with no systematic
pattern, indicating that the variables are linearly related.
The Normal Q-Q plot shows that most of the points lie on the line, indicating
that the residuals follow a normal distribution.
The Scale-Location chart shows points randomly distributed around the
horizontal line, indicating that the model meets the constant-variance
assumption.
Model Equation:
Below is the model equation generated from the estimates in the model output:
Satisfaction = -0.66 + 0.37*ProdQual - 0.44*Ecom + 0.033*TechSup +
0.16*CompRes - 0.02*Advertising + 0.14*ProdLine + 0.80*SalesFImage -
0.038*CompPricing - 0.10*WartyClaim + 0.14*OrdBilling + 0.16*DelSpeed
Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name
the Factors
To perform Principal Component Analysis (PCA) we modified the data that we had
been working with: we dropped the dependent variable “Satisfaction” from the
analysis. Below are the code for PCA and the associated output and inferences:
Code:
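A minimal sketch of the assumption tests described below; KMO and
cortest.bartlett come from the psych package already loaded, and the object
names hair_x and cor_x are our own:

hair_x <- as.matrix(hair[ , names(hair) != "Satisfaction"])  # drop Satisfaction
cor_x  <- cor(hair_x)                                        # correlation matrix

KMO(cor_x)                                  # sampling adequacy
cortest.bartlett(cor_x, n = nrow(hair_x))   # Bartlett's test of sphericity
det(cor_x)                                  # determinant of the correlation matrix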
Output and inferences:
We first do the assumption testing for PCA, which is equally valid for
progressing to Factor Analysis (FA). We convert the data into matrix form
before running the functions used for assumption testing.
In the KMO test we find a value of 0.65, which indicates that the sample size
is adequate to proceed with PCA/FA.
The Bartlett test value is 619.27 and the associated p-value is less than
alpha = 0.05. So Bartlett's test is statistically significant, which also
indicates that there exist sufficient correlations in the matrix to proceed
with PCA/FA.
The determinant value is positive, although it is very small.
So now, from the KMO test, Bartlett's test and the determinant of the
correlation matrix, we have satisfied all the assumptions for performing
PCA/FA.
Although we have been asked to perform PCA/factor analysis by extracting 4
factors, we first test how many principal components/factors could ideally be
considered in the given case. For that we plotted the scree plot as follows:
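A minimal sketch of the scree plot, built from the eigenvalues of the
correlation matrix above:

ev <- eigen(cor_x)$values   # eigenvalues of the correlation matrix
plot(ev, type = "b", xlab = "Component number", ylab = "Eigen value",
     main = "Scree plot")
abline(h = 1, lty = 2)      # Kaiser rule cut-off at eigenvalue = 1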
If we follow the Kaiser rule to choose the optimum number of principal
components based on eigenvalues, then 4 is the number, as the 4th point on the
scree plot is just above eigenvalue = 1.
We run the PCA model with the “principal” function in R on the correlation
matrix of the data, with “varimax” rotation and number of factors = 4.
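A minimal sketch of this call, assuming the correlation matrix cor_x from
above:

pca_model <- principal(cor_x, nfactors = 4, rotate = "varimax")
print(pca_model)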
Below is the detailed output from this model:
From the PCA model output we can say that the 4 principal components with
eigenvalue > 1 explain 80% of the variance, with a big reduction in dimension
(from 11 variables to only 4 components). Principal components 1, 2, 3 and 4
explain 26%, 20%, 17% and 16% of the variance respectively.
Next we do the factor analysis.
For FA, the assumptions and the tests that we conducted on the given data are
the same as for PCA. So we are good to proceed with the factor analysis.
We first run the “fa” function on the correlation matrix of the data.
Then we group the factors in a very similar way as we did in PCA.
Code:
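A minimal sketch, assuming principal-axis factoring (fm = "pa", which matches
the PA1-PA4 labels below) with varimax rotation:

fa_model <- fa(cor_x, nfactors = 4, rotate = "varimax", fm = "pa")
print(fa_model)
fa.diagram(fa_model)   # diagram grouping the variables under each factor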
Output is as below:
The 4 factors PA1, PA2, PA3 and PA4 explain 69% of the variance, PA1 being the
dominant contributor, as it alone explains 24% of the variance. So from the
initial 11 variables we are down to only 4 factors that explain 69% of the
variance.
The grouping of the variables with respect to the factors is very clear from
the diagram above. The grouped variables are represented by the factors, and we
name the factors according to their inherent characteristics:
PA1 = “Buy_Ease”
PA2 = “Marketing”
PA3 = “After_Sales”
PA4 = “Positioning”
Next we get the scores for the 4 factors and merge these scores with the
dependent variable “Satisfaction” from the initial data. We label the columns
of this data set with the factor names given above and the dependent variable:
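A minimal sketch of the scoring and merging step; the use of factor.scores
(from psych) and the mapping of PA1-PA4 to the names above are assumptions that
follow the report's naming:

# Factor scores need the raw data; project hair_x onto the factor solution
scores  <- factor.scores(hair_x, fa_model)$scores
hair_fa <- data.frame(scores, Satisfaction = hair$Satisfaction)
colnames(hair_fa) <- c("Buy_Ease", "Marketing", "After_Sales",
                       "Positioning", "Satisfaction")
head(hair_fa)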
Thus our data, with one dependent variable and 4 factors produced by the
dimensionality-reduction method of factor analysis, is ready for the final
regression analysis.
Perform multiple linear regression with customer satisfaction as dependent
variable and the four factors as independent variables. Comment on the
model output and validity. Your remarks should make it meaningful for
everybody.
Before proceeding with the modeling we split the data into training and
validation sets. We build the model on the training data and check its
performance on the validation data.
(Note: due to the small sample size, a very small change in the proportion of
training and validation data can give a different fit and different model
accuracy.)
Considering the “Satisfaction” variable as the dependent variable, we model the
data on the rest of the variables.
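A minimal sketch of the split and of model_a; the 70/30 proportion and the seed
are assumptions (see the note above on how the proportion affects the fit):

set.seed(123)   # assumed seed, for repeatability
idx   <- sample(seq_len(nrow(hair_fa)), size = round(0.7 * nrow(hair_fa)))
train <- hair_fa[idx, ]
valid <- hair_fa[-idx, ]

model_a <- lm(Satisfaction ~ Buy_Ease + Marketing + After_Sales + Positioning,
              data = train)
summary(model_a)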
Below is the output of the model_a:
Other than the “After_Sales” variable, all variables are significant in the
model. With the present set of variables the model-fit parameter R-squared is
0.745 and the adjusted R-squared is 0.73.
The p-value associated with the F-statistic signifies that this model is
significant in explaining the variance in the dependent variable
“Satisfaction”.
We will proceed to build a model without this variable (“After_Sales”) in a new
iteration and check how it performs, as sketched below.
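A minimal sketch of model_b, dropping the non-significant factor:

model_b <- lm(Satisfaction ~ Buy_Ease + Marketing + Positioning, data = train)
summary(model_b)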
Below is the output of the model_b:
In this model we find that all the variables are significant, the R-squared is
the same as in model_a, and the adjusted R-squared is marginally higher than in
model_a.
The p-value associated with the F-statistic signifies that this model (model_b)
is also significant in explaining the variance in the dependent variable
“Satisfaction”.
Model Diagnostics:
We will run the model diagnostics on model_b to be sure that it meets the
linear regression assumptions.
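A minimal sketch, reusing the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(model_b)          # same four diagnostic plots as before, now for model_b
par(mfrow = c(1, 1))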
Output:
The Residuals vs. Fitted chart shows the residuals scattered with no systematic
pattern, indicating that the variables are linearly related.
The Normal Q-Q plot shows that most of the points lie on the line, indicating
that the residuals follow a normal distribution.
The Scale-Location chart shows points randomly distributed around the
horizontal line, indicating that the model meets the constant-variance
assumption.
We also checked the multicollinearity in the model_b:
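A minimal sketch, assuming the vif function from the car package:

library(car)
vif(model_b)   # variance inflation factor for each independent variable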
Output:
The VIF for each of the variables is less than 2, showing that there is almost
no multicollinearity among the independent variables.
Performance Check:
We validate both models (model_a and model_b) on their predictive accuracy by
comparing the observed values of Satisfaction with the predicted values of
Satisfaction. We could generate several metrics for the linear regression
models and compare them, such as the correlation coefficient, Root Mean Square
Error (RMSE) and Mean Absolute Deviation (MAD). Here we compare using only the
correlation coefficient.
Code:
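A minimal sketch of the comparison on the validation data:

pred_a <- predict(model_a, newdata = valid)   # predictions from model_a
pred_b <- predict(model_b, newdata = valid)   # predictions from model_b

cor(valid$Satisfaction, pred_a)               # accuracy of model_a
cor(valid$Satisfaction, pred_b)               # accuracy of model_b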
Output:
We can infer from the result that model_b exhibits better accuracy than
model_a. model_b also uses fewer variables/factors to predict Satisfaction,
which is operationally advantageous. We could possibly improve the accuracy of
this model by including interaction effects of the independent variables.
We were able to predict Satisfaction with a good level of accuracy even after
facing multicollinearity in the original data, by proceeding to a
dimensionality-reduction method (PCA/FA) to manage the issue and finally
building a model from it.
Model Equation:
Below is the model equation generated from the estimates in the model_b output:
Satisfaction = 6.9488 + 0.6757*Buy_Ease + 0.5326*Marketing +
0.5816*Positioning