Project - Advanced Statistics
Regression Model
1. Project Objective:
The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum
regression model to predict the satisfaction levels associated with different factors. This exploration
report will consist of the following:
>Importing the dataset into R
>Running an analysis on the data and checking distribution patterns
>Graphical exploration
>Understanding the structure of the dataset
>Descriptive statistics
2. Data Analysis – a step-by-step data exploration consisting of the following steps:
1. Environment Set up and Data Import
2. Variable Identification
3. Segregate Data
4. Graphic Analysis
5. Perform exploratory data analysis
6. Run an analysis on the data, Check distribution patterns
7. Graphical exploration
8. Find multicollinearity and showcase analysis
9. Perform simple linear regression for the dependent variable with every independent variable
10. Perform PCA/Factor analysis by extracting 4 factors
11. Interpret the output and name the Factors
12. Perform multiple linear regression with customer satisfaction as the dependent variable and
the four factors as independent variables
13. Comment on the model output and validity, with remarks meaningful to a general audience
Feature Exploration
Environment Set up and Data Import
## Set working directory
setwd("C:/Users/satyam.sharma/Desktop/R programming/Advance stats")
## Install the packages needed for the analysis
install.packages("readr")
install.packages("nortest")
install.packages("foreign")
install.packages("MASS")
install.packages("lattice")
install.packages("corrplot")
install.packages("nFactors")
install.packages("psych")
## Load the library used for reading csv files
library(readr)
## Import the dataset
Hair = read.csv("Factor-Hair-Revised.csv", header = TRUE)
## Check the dimensions of the data
dim(Hair)
[1] 100 13
There are 13 variables and 100 observations in the data.
Check class of the dataset
class(Hair)
[1] "data.frame"
The data is in data.frame format and is fit for analysis.
Variable identification
### check structure of the data
str(Hair)
'data.frame': 100 obs. of 13 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 …
Only the first column, ID, is in integer format; all other variables are numeric.
The Satisfaction values are scores from 1 to 10.
In total there are 13 variables.
Since ID is just a serial number, we can remove it from the data:
Data1Hair = Hair[,-1]
## Now check whether there are any missing values in the dataset
any(is.na(Data1Hair))
[1] FALSE
There are no missing values, so the data is fit for the analysis.
## Use the DataExplorer library to double-check for missing values
library("DataExplorer")
plot_intro(Data1Hair)
This also confirms that there are no missing values.
## Now check for outliers, starting with a data summary
summary(Hair)
ID ProdQual Ecom TechSup CompRes Advertising
Min. : 1.00 Min. : 5.000 Min. :2.200 Min. :1.300 Min. :2.600 Min. :1.900
1st Qu.: 25.75 1st Qu.: 6.575 1st Qu.:3.275 1st Qu.:4.250 1st Qu.:4.600 1st Qu.:3.175
Median : 50.50 Median : 8.000 Median :3.600 Median :5.400 Median :5.450 Median :4.000
Mean : 50.50 Mean : 7.810 Mean :3.672 Mean :5.365 Mean :5.442 Mean :4.010
3rd Qu.: 75.25 3rd Qu.: 9.100 3rd Qu.:3.925 3rd Qu.:6.625 3rd Qu.:6.325 3rd Qu.:4.800
Max. :100.00 Max. :10.000 Max. :5.700 Max. :8.500 Max. :7.800 Max. :6.500
ProdLine SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
Min. :2.300 Min. :2.900 Min. :3.700 Min. :4.100 Min. :2.000 Min. :1.600
1st Qu.:4.700 1st Qu.:4.500 1st Qu.:5.875 1st Qu.:5.400 1st Qu.:3.700 1st Qu.:3.400
Median :5.750 Median :4.900 Median :7.100 Median :6.100 Median :4.400 Median :3.900
Mean :5.805 Mean :5.123 Mean :6.974 Mean :6.043 Mean :4.278 Mean :3.886
3rd Qu.:6.800 3rd Qu.:5.800 3rd Qu.:8.400 3rd Qu.:6.600 3rd Qu.:4.800 3rd Qu.:4.425
Max. :8.400 Max. :8.200 Max. :9.900 Max. :8.100 Max. :6.700 Max. :5.500
Satisfaction
Min. :4.700
1st Qu.:6.000
Median :7.050
Mean :6.918
3rd Qu.:7.625
Max. :9.900
At first glance there are no visible outliers in the data.
Let's do a more detailed analysis.
## Use density plots to check the distributions
plot_density(Data1Hair)
The density plots reveal that some variables are left-skewed, such as Delivery Speed and, to some extent, Tech Support, while Sales Force Image is right-skewed.
Most variables, however, are approximately normally distributed.
### Using box plots
boxplot(Data1Hair)
In the box plots we can see some outliers in Ecommerce, Sales Force Image and Order Billing.
### Now we move onto the factor analysis and find correlations in the data.
## For the correlation analysis we have to remove the dependent variable Satisfaction
Haircor = Data1Hair[,1:11]
dim(Haircor)
[1] 100 11
The last column, Satisfaction, has been successfully removed from the data.
Hair_correlationdata = cor(Haircor)
print(Hair_correlationdata, digits = 3)
We limit the correlation output to 3 decimal places for better visualisation and easier analysis.
Let's also plot the correlations to get the bigger picture.
library(corrplot)
corrplot(Hair_correlationdata, method = "number")
corrplot(Hair_correlationdata, method = "shade")
From both these graphs we can see high correlations between several variables: Ecom correlates
with SalesFImage; CompRes correlates with DelSpeed; and OrdBilling correlates highly with both
CompRes and DelSpeed.
2. Find out multicollinearity through linear regression:
After examining the correlations, we now check for multicollinearity before doing the PCA
or factor analysis.
To detect multicollinearity we use the Variance Inflation Factor (VIF). Any value
above 4 (Hair et al., 2010) suggests multicollinearity among the predictor variables.
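For reference, the VIF of predictor j is derived from the R² obtained by regressing that predictor on all the other predictors (this is the standard definition, not something specific to this dataset):

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}
```

so the cutoff of 4 corresponds to R_j² = 0.75, i.e. 75% of a predictor's variance being explained by the other predictors.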
library(car)  # provides the vif() function
Multicolinear = lm(Satisfaction ~ . , data = Data1Hair)
print(vif(Multicolinear), digits = 4)
Ans: We can clearly see multicollinearity among the independent variables: DelSpeed has a VIF
of 6.516 (greater than 4) and CompRes has a VIF of 4.730. This multicollinearity can affect our
regression model.
3. Perform simple linear regression for the dependent variable Satisfaction with every
independent variable:
lm.ProdQual = lm(Satisfaction ~ ProdQual, Data1Hair)
lm.ProdQual
So we get the regression model: Satisfaction = 3.6759 + 0.4151 * ProdQual
The intercept coefficient is 3.6759.
The coefficient of ProdQual is 0.4151.
Thus, for a one-unit increase in Product Quality, the Satisfaction rating improves by 0.4151,
holding everything else constant.
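As a quick illustration (a sketch using the reported coefficients rather than a refit of the model; the quality scores 7 and 8 are hypothetical), the fitted line can be used to predict satisfaction:

```r
# Prediction from the reported simple-regression coefficients
# (intercept 3.6759, slope 0.4151)
pred = function(prodqual) 3.6759 + 0.4151 * prodqual
pred(7)  # 6.5816
pred(8)  # 6.9967 -- exactly 0.4151 higher, the ProdQual coefficient
```

The gap between the two predictions equals the slope, which is what the one-unit interpretation above means in practice.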
lm.ecom = lm(Satisfaction ~ Ecom, Data1Hair)
lm.ecom
Ecom regression model: Satisfaction = 5.1516 + 0.4811 * Ecommerce
lm.TechSup = lm(Satisfaction ~ TechSup, Data1Hair)
lm.TechSup
Tech Support regression model: Satisfaction = 6.44757 + 0.08768 * TechSup
lm.CompRes = lm(Satisfaction ~ CompRes, Data1Hair)
lm.CompRes
CompRes regression model: Satisfaction = 3.680 + 0.595 * CompRes
lm.Advertising = lm(Satisfaction ~ Advertising, Data1Hair)
lm.Advertising
Advertising regression model: Satisfaction = 5.6259 + 0.3222 * Advertising
lm.ProdLine = lm(Satisfaction ~ ProdLine, Data1Hair)
lm.ProdLine
Product Line regression model: Satisfaction = 4.0220 + 0.4989 * ProdLine
lm.SalesFImage = lm(Satisfaction ~ SalesFImage, Data1Hair)
lm.SalesFImage
Sales Image regression model: Satisfaction = 4.070 + 0.556 * Sales Image
lm.ComPricing = lm(Satisfaction ~ ComPricing, Data1Hair)
lm.ComPricing
ComPricing regression model: Satisfaction = 8.0386 - 0.1607 * ComPricing
lm.WartyClaim = lm(Satisfaction ~ WartyClaim, Data1Hair)
lm.WartyClaim
WartyClaim regression model: Satisfaction = 5.3581 + 0.2581 * WartyClaim
lm.OrdBilling = lm(Satisfaction ~ OrdBilling, Data1Hair)
lm.OrdBilling
OrdBilling regression model: Satisfaction = 4.0541 + 0.6695 * OrdBilling
lm.DelSpeed = lm(Satisfaction ~ DelSpeed, Data1Hair)
lm.DelSpeed
DelSpeed regression model: Satisfaction = 3.2791 + 0.9364 * DelSpeed
4. PCA:
Before doing PCA we first conduct Bartlett's test of sphericity to check whether Principal
Component Analysis can be done. If the p-value is higher than alpha, the correlation matrix is
not significantly different from an identity matrix and we cannot conduct PCA on the data.
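For reference, Bartlett's sphericity statistic (the standard form of the test, which psych::cortest.bartlett implements) compares the sample correlation matrix R against an identity matrix:

```latex
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R \rvert,
\qquad df = \frac{p(p-1)}{2}
```

where n is the number of observations and p the number of variables; a large χ² (small p-value) rejects the hypothesis that the variables are uncorrelated.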
library(psych)  # provides cortest.bartlett and principal
cortest.bartlett(Hair_correlationdata, nrow(Hair))
The p-value of 6.93724e-97 is far below the significance level alpha = 0.001, so we can reject
the null hypothesis that the correlation matrix is an identity matrix; PCA can be conducted.
## To conduct the factor analysis we first have to find the eigenvalues
ev = eigen(cor(Data1Hair))
print(ev, digits = 4)
Eigenvalue = ev$values
print(Eigenvalue, digits = 4)
factor = c(1:12)
scree = data.frame(factor, Eigenvalue)
scree
plot(scree, main = "Scree Plot", col = "Blue", ylim = c(0,5))
lines(scree, col = "Red")
Here we can see four eigenvalues greater than 1, lying before the elbow of the plot. This means
we should extract 4 factors for the factor analysis.
Principal Components Analysis
PCA = principal(Haircor, nfactors = 4, rotate = "varimax")  # fit on the raw variables so that component scores are computed
print(PCA)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 RC4 h2 u2 com
ProdQual 0.00 -0.01 -0.03 0.88 0.77 0.232 1.0
Ecom 0.06 0.87 0.05 -0.12 0.78 0.223 1.1
TechSup 0.02 -0.02 0.94 0.10 0.89 0.107 1.0
CompRes 0.93 0.12 0.05 0.09 0.88 0.119 1.1
Advertising 0.14 0.74 -0.08 0.02 0.58 0.424 1.1
ProdLine 0.59 -0.06 0.15 0.64 0.79 0.213 2.1
SalesFImage 0.13 0.90 0.08 -0.16 0.86 0.140 1.1
ComPricing -0.09 0.23 -0.25 -0.72 0.64 0.360 1.5
WartyClaim 0.11 0.05 0.93 0.10 0.89 0.108 1.1
OrdBilling 0.86 0.11 0.08 0.04 0.77 0.234 1.1
DelSpeed 0.94 0.18 -0.01 0.05 0.91 0.086 1.1
RC1 RC2 RC3 RC4
SS loadings 2.89 2.23 1.86 1.77
Proportion Var 0.26 0.20 0.17 0.16
Cumulative Var 0.26 0.47 0.63 0.80
Proportion Explained 0.33 0.26 0.21 0.20
Cumulative Proportion 0.33 0.59 0.80 1.00
Mean item complexity = 1.2
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
Fit based upon off diagonal values = 0.97
This confirms the scree-plot finding that 4 components are appropriate. The root mean square of
the residuals (RMSR) is low at 0.06.
Further, the four rotated components (RCs) together explain 80% of the cumulative variance.
This becomes clearer in the diagram.
fa.diagram(PCA)
Now we know which component contains which variables.
## For the factor analysis we get the component scores
scores = round(PCA$scores, 2)
Based on this we can name our four factors:
Buying Experience: Complaint Resolution, Order & Billing and Delivery Speed
Branding: E-commerce, Sales Force Image and Advertising
After-Sales Support: Technical Support and Warranty Claims
Product Quality: product quality, product line and competitive pricing (the tangible aspects of the product)
We convert the scores to a data frame and rename the columns after the factors:
scores = as.data.frame(scores)
colnames(scores) = c("Experience", "Brand", "ASales", "Quality")
print(head(scores))
Experience Brand ASales Quality
[1,] 0.13 0.77 -1.88 0.37
[2,] 1.22 -1.65 -0.61 0.81
[3,] 0.62 0.58 0.00 1.57
[4,] -0.84 -0.27 1.27 -1.25
[5,] -0.32 -0.83 -0.01 0.45
[6,] -0.65 -1.07 -1.30 -1.05
Before performing the multiple linear regression we combine the Satisfaction scores back into
our new data frame and name it hair_new.
hair_new = cbind(Satisfaction = Data1Hair$Satisfaction, scores)
print(head(hair_new))
## Perform multiple linear regression with customer satisfaction as the dependent variable
m.linear.Model = lm(Satisfaction ~ Experience + Brand + ASales + Quality, hair_new)
summary(m.linear.Model)
Call:
lm(formula = Satisfaction ~ Experience + Brand + ASales + Quality,
data = hair_new)
Residuals:
Min 1Q Median 3Q Max
-1.6346 -0.5021 0.1368 0.4617 1.5235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91813 0.07087 97.617 < 2e-16 ***
Experience 0.61799 0.07122 8.677 1.11e-13 ***
Brand 0.50994 0.07123 7.159 1.71e-10 ***
ASales 0.06686 0.07120 0.939 0.35
Quality 0.54014 0.07124 7.582 2.27e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7087 on 95 degrees of freedom
Multiple R-squared: 0.6607, Adjusted R-squared: 0.6464
F-statistic: 46.25 on 4 and 95 DF, p-value: < 2.2e-16
Final Analysis:
The fitted model is: Satisfaction = 6.918 + 0.618 * Experience + 0.510 * Brand + 0.067 * ASales + 0.540 * Quality
The intercept is highly significant, and the predictors Experience, Brand and Quality all have
significant betas, implying that the response variable Satisfaction is associated with them.
ASales (after-sales support) is the only variable with a high p-value (0.35), implying that its
beta coefficient contributes little to the model and may in fact be zero.
The overall model p-value given by the F-statistic (< 2.2e-16) provides strong evidence against
the null hypothesis that all coefficients are zero, so the model is statistically significant.
Interpretation of the results:
Customer satisfaction depends largely on the buying experience, so the company should make
every effort to improve it: quick delivery, accurate order billing, fast resolution of customer
complaints, and more consumer-friendly products.
Apart from customer service, the company should give equal attention to its brand visibility and
recognition. Our model suggests that advertising plays a big role in that.