Machine Learning
Transport Choice of Employees
Senthil Kumar M
22.Sep.2019
Machine Learning (PGP-BABI)
by Great Learning
Table of Contents
INTRODUCTION
Observation
Step by step approach
Exploratory Data Analysis
EDA Summary:
Logistic Regression
KNN
Naive Bayes
REFERENCES
Great Learning PGP(BABI)
INTRODUCTION
This project aims to understand the determinants of the transport choices made by employees.
The given data contains employee information on their mode of transport as well as
personal and professional details such as age, salary and work experience. We need to predict
whether or not an employee will use a car as a mode of transport, and which variables
are significant predictors of this decision.
We will use multiple models and performance metrics to arrive at a model that best
describes the variables influencing an employee to use a car as a mode of transport. The
input variables include employee personal details such as Age, Salary and Work.exp. We are
going to use
Process Map
The structure of the input variables is tabulated below:
Given the dataset, we are required to perform the following tasks to complete this
project successfully:
1. EDA
2. Data Preparation
3. Modeling
4. Actionable Insights & Recommendations
Observation
Employees use a 2 wheeler, public transport or a car to commute to
their workplace. We have been given 418 rows of data with 9 variables. We may need
to clean up the dataset and convert variable types as required before processing it
for analysis.
The problem statement is to predict whether or not an employee will use a car
as a mode of transport, and which variables are significant predictors of that
decision.
Step by step approach
We will carry out the following steps to perform the analysis and conclude this project.
1. Exploratory Data Analysis
2. Logistic Regression
3. KNN
4. Naive Bayes
5. Performance Measurement
6. Conclusion
1. Exploratory Data Analysis
We begin the EDA process by converting the categorical variables to factors.
The following graph gives an overview of how the variables are distributed by volume of usage:
The structure of the dataset is printed for reference.
Note that the variable MBA has one missing value. There are several ways to treat missing
values, but since there is only 1 missing value we will remove the whole record.
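The record removal described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report; the field names follow the report, but the three records and the helper name `drop_incomplete` are invented for the example.

```python
# Hypothetical records: field names follow the report, values are invented.
records = [
    {"Age": 28, "MBA": 0, "Salary": 14.3},
    {"Age": 31, "MBA": None, "Salary": 20.1},  # stands in for the single missing MBA value
    {"Age": 25, "MBA": 1, "Salary": 9.8},
]

def drop_incomplete(rows):
    """Keep only the rows in which no field is missing (None)."""
    return [row for row in rows if all(value is not None for value in row.values())]

complete = drop_incomplete(records)
print(len(records), "->", len(complete))
```

Dropping the row is a reasonable choice here only because a single record is affected; with many missing values, imputation would usually be considered instead.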
There are several automated packages in ‘R’ for exploratory data analysis; we
use one such package, “dlookr”, in this project. The EDA report from the “dlookr” package
gives the count of distinct values in each variable along with normality tests,
correlation coefficients and other descriptive statistics, as shown below:
Normality Test of Numeric Variables:
The normality test statistics indicate that the Age & Distance variables are close to
normally distributed, while Work Exp & Salary are positively skewed. Each numeric variable
was tested individually for normality and skewness, and the QQ plots for each variable
are printed for reference.
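To make the notion of positive skew concrete, the sketch below computes the moment-based skewness coefficient for two small samples. It is written in Python for illustration and is not the normality test that “dlookr” runs; the function name `moment_skewness` and both samples are assumptions made for this example.

```python
def moment_skewness(values):
    """Moment-based skewness: third central moment divided by the
    second central moment raised to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [8, 9, 10, 11, 12, 15, 40]

print(moment_skewness(symmetric))
print(moment_skewness(right_tailed))
```

A value near zero suggests a symmetric distribution, while a clearly positive value corresponds to the long right tail described for Work Exp and Salary.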
Univariate Distribution: Histogram
Car usage ratio by numerical predictors:
We can see that employees with higher salary and age are more likely to use a car. There is
a clear indication that employees aged above 30, as well as those with a salary of 30k and
above, prefer a car as a mode of transport. Employees who commute more than 15 miles and
have a higher salary also tend to choose a car, which is evident in this dataset.
The above map shows that car usage among female employees is much lower than among male
employees, whereas qualification shows no association with car usage. For license, we can
assume that employees without a license use public transport.
Target based Analysis: (Categorical Variables)
Target based Analysis: (Numerical Variables)
AGE:
Work Exp:
Salary:
Distance:
Grouped Correlation Plot of Numerical Variables
EDA Summary:
1. There is 1 NA in the entire dataset
2. Correlated predictor variables were found and removed from the dataset
3. Several numeric variables were strongly positively correlated, hence
removing the variables Age & Work.Exp reduced the numeric predictors to only 2 to go
ahead with the model. We could have used other methods such as PCA to address this,
but since the correlation is about 90% we are retaining only Salary from the personal
details to train our model.
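The correlation check behind this summary can be illustrated with a short Pearson correlation sketch. It is written in Python for illustration, not taken from the report's R workflow; the `age` and `work_exp` values are invented, and the 0.9 cut-off simply mirrors the "about 90%" figure mentioned above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Invented values: work experience grows almost in step with age.
age = [24, 26, 28, 30, 33, 36, 40]
work_exp = [1, 3, 5, 8, 10, 14, 18]

r = pearson(age, work_exp)
too_correlated = abs(r) > 0.9  # assumed threshold for flagging a pair
print(round(r, 3), too_correlated)
```

When a pair is flagged like this, keeping one variable of the pair (or combining them, for example with PCA) avoids feeding near-duplicate information to the model.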
Data Preparation:
Our primary interest, as per the problem statement, is to understand the factors influencing
car usage. Hence we will create a new column for car usage. It will take the value 0 for
Public Transport & 2 Wheeler and 1 for car usage. We then look at the proportion of cars in
Transport Mode.
Only 8% of employees in the dataset use a car as a mode of transport, so the target classes are heavily imbalanced.
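The creation of the binary car usage column can be sketched as below. This is an illustrative Python sketch, not the report's R code; the category labels, the helper name `car_flag` and the 13 sample values are assumptions, so the share printed here is not the 8% reported for the real data.

```python
def car_flag(mode):
    """Return 1 for car users and 0 for every other transport mode."""
    return 1 if mode == "Car" else 0

# Invented sample: the label spellings are assumed, not taken from the data.
transport = ["Public Transport"] * 9 + ["2Wheeler"] * 3 + ["Car"] * 1

flags = [car_flag(mode) for mode in transport]
car_share = sum(flags) / len(flags)
print(f"car users: {sum(flags)} of {len(flags)} ({car_share:.1%})")
```

Because the flag is 0 or 1, its mean is directly the proportion of car users, which makes the class imbalance easy to read off.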
SMOTE the Data
Before SMOTE / After SMOTE
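For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class records by interpolating between a minority record and one of its nearest minority neighbours. The Python sketch below illustrates only that idea for numeric features; it is not the R implementation used for the report, and the function name `smote_sketch`, its parameters and the (Salary, Distance) points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each placed on the segment between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Invented (Salary, Distance) pairs standing in for the car users.
car_users = [(45.0, 12.0), (50.0, 15.0), (48.0, 18.0), (52.0, 20.0)]
new_points = smote_sketch(car_users, n_new=4)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the oversampled class stays within the range of the observed car users. Categorical variables need separate handling, which this sketch does not attempt.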
Model Building:
Improving the model
VIF scores are used to check for multicollinearity. The Work.Exp score above 10 confirms
that multicollinearity exists in the dataset.
After dropping the Age & Work.Exp variables, the VIF results are
significantly lower and we can conclude that the data is free from multicollinearity. We
can go ahead and train the model with the remaining variables.
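The VIF check can be sketched in a few lines. The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors, and this equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The Python sketch below uses that identity; it is an illustration, not the R function used for the report, and the helper names and all data values are invented.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    centred = []
    for col in columns:
        mean = sum(col) / n
        centred.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centred]
    size = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(size)] for i in range(size)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (intended for small matrices)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * piv for v, piv in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Invented values for the four numeric predictors named in the report.
age = [24, 26, 28, 30, 33, 36, 40]
work_exp = [1, 3, 5, 8, 10, 14, 18]
salary = [9, 11, 13, 16, 22, 30, 45]
distance = [12, 7, 15, 9, 14, 8, 13]

vif_all = vif([age, work_exp, salary, distance])
vif_kept = vif([salary, distance])  # after dropping Age and Work.Exp
print([round(v, 1) for v in vif_all])
print([round(v, 2) for v in vif_kept])
```

With these invented values the Work.Exp score is far above 10 in the full set, while the two retained predictors score close to 1, which mirrors the pattern described above.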