PREDICTIVE MODELING
PROJECT REPORT
Contents
Problem 1 (Linear Regression)_______________________________________
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis………………………8
1.2. Impute null values if present, also check for the values which are equal to zero. Do they have
any meaning or do we need to change them or drop them? Check for the possibility of combining
the sub levels of a ordinal variables and take actions accordingly. Explain why you are combining
these sub levels with appropriate reasoning………………………………………………………………………………………5
1.3. Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning……………………………………………………………………………………………………………..12
1.4. Inference: Basis on these predictions, what are the business insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present……………………………………………………………………5
Problem 2(Logistic Regression and LDA)_______________________________
2.1. Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis…….5
2.2. Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis)……..7
2.3. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare both the
models and write inference which model is best/optimized.
2.4. Inference: Basis on these predictions, what are the insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present……………………………………………………………………5
Problem 1: Linear Regression
INTRODUCTION
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a
diamond). The company earns different profits in different price slots. You have to help the company
predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish
between more profitable and less profitable stones and achieve a better profit share. Also,
provide them with the 5 attributes that are most important.
DATA DESCRIPTION
Variable Name | Description | Detail | Data Type
Carat | Weight of the cubic zirconia | Carat | Numeric
Cut | Cut quality of the cubic zirconia | Quality in increasing order: Fair, Good, Very Good, Premium, Ideal | Categorical (Ordinal)
Colour | Colour of the cubic zirconia | D being the best and J the worst | Categorical (Ordinal)
Clarity | Absence of inclusions and blemishes in the cubic zirconia | In order from best to worst (IF = flawless, I1 = level 1 inclusion): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1 | Categorical (Ordinal)
Depth | Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter | | Numeric
Table | Width of the cubic zirconia's table expressed as a percentage of its average diameter | | Numeric
Price | Price of the cubic zirconia | | Numeric
X | Length of the cubic zirconia | In mm | Numeric
Y | Width of the cubic zirconia | In mm | Numeric
Z | Height of the cubic zirconia | In mm | Numeric
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA, duplicate values). Perform Univariate and
Bivariate Analysis.
Solution.
Summary of the dataset
The data set contains 26967 rows and 11 columns. There are 2 integer-type
features, 6 float-type features and 3 object-type features. 'price' is the target
variable and all the others are predictor variables. The first column ("Unnamed: 0") is an index;
as it is only a serial number, we can remove it. Every column has a non-null count of
26967 except depth, which has missing values.
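The checks summarised above (shape, data types, null counts, duplicates) can be sketched with pandas as follows. This is only an illustration: the five rows are made up to stand in for the real file, and the column names other than "Unnamed: 0" are assumed.

```python
import io
import pandas as pd

# Made-up sample standing in for the real file (illustrative values only).
csv_text = """Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
1,0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984
3,0.90,Very Good,E,VVS2,,60.0,6.04,6.12,3.78,6289
4,0.42,Ideal,F,VS1,61.6,56.0,4.82,4.80,2.96,1082
5,0.42,Ideal,F,VS1,61.6,56.0,4.82,4.80,2.96,1082
"""
df = pd.read_csv(io.StringIO(csv_text))

# The serial-number column carries no information, so drop it first.
df = df.drop(columns=["Unnamed: 0"])

n_rows, n_cols = df.shape                  # shape of the data
null_counts = df.isnull().sum()            # missing values per column
n_duplicates = int(df.duplicated().sum())  # fully repeated rows

print(n_rows, n_cols)
print(df.dtypes)
print(null_counts[null_counts > 0])
print("duplicate rows:", n_duplicates)
```

Note that duplicates are counted after the index column is dropped; while a unique serial number is present, no two rows can be identical.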
EXPLORATORY DATA ANALYSIS
Step 1: Check and remove any duplicates in the dataset
Step 2: Check and treat any missing values in the dataset
Step 3: Outlier Treatment
Step 4: Univariate Analysis
Step 5: Bi-variate Analysis
Step 1: Check and remove any duplicates in the dataset
After checking for duplicate values in the dataset, it is confirmed that there are no
duplicates, hence no treatment is required.
Step 2: Check and treat any missing values in the dataset
Step 3: Outlier Treatment
Using the boxplot we confirm and visualise the presence of outliers in the dataset and then
proceed to treat them.
Below we see that the outliers have been treated accordingly.
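The report does not state which treatment was applied. A common choice that matches the boxplot view is to cap values at the whisker limits (Q1 - 1.5*IQR and Q3 + 1.5*IQR). The sketch below assumes that method; the helper name and the sample values are hypothetical.

```python
import numpy as np

def cap_outliers_iqr(values, k=1.5):
    """Clip values to the boxplot whisker range [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper), lower, upper

carat = [0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.1, 4.5]  # made-up values, 4.5 is an outlier
capped, lower, upper = cap_outliers_iqr(carat)
print(lower, upper)
print(capped)
```

Capping keeps every row in the data set, which is why it is often preferred to dropping outlier rows when the sample is needed for regression.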
Step 4: Univariate Analysis
The dataset shows a significant number of outliers in a few of the
variables. Skewness is measured for every attribute, and the
univariate analysis shows that the distributions of some quantitative features, such as
"Carat" and the target feature "Price", are heavily right-skewed.
Step 5: Bi-variate Analysis
➢ It involves the analysis of two variables (often denoted as X, Y), for the purpose of
determining the empirical relationship between them.
➢ It can be inferred that most features correlate with the price of Diamond. The notable
exception is "depth" which has a negligible correlation (<1%).
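A minimal sketch of this correlation check is given below. The data is synthetic and generated so that carat drives price while depth does not; it only illustrates the mechanics and does not reproduce the report's figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
carat = rng.uniform(0.2, 2.0, n)
depth = rng.normal(61.8, 1.4, n)              # generated independently of price
price = 8000 * carat + rng.normal(0, 500, n)  # price driven by carat only

df = pd.DataFrame({"carat": carat, "depth": depth, "price": price})

# Pearson correlation of every predictor with the target, strongest first.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price.round(3))
```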
OBSERVATIONS BASED ON EDA
The inferences drawn from the above Exploratory Data analysis:
Observation-1: 'Price' is the target variable while all others are the predictors. The data
set contains 26967 rows and 11 columns. There are 2 integer-type
features, 6 float-type features and 3 object-type features. The first column ("Unnamed: 0") is an
index; as it is only a serial number, we can remove it.
Observation-2: In the given data set the mean and median values do not differ much.
The minimum values of "x", "y" and "z" are zero, which indicates
faulty values, since dimensionless or 2-dimensional diamonds are not possible. So
we have to filter those out as they are clearly faulty data entries. There are three object-type
columns: 'cut', 'colour' and 'clarity'.
Observation-3: There are 697 missing values in the depth column. There
are some duplicate rows present (33 duplicate rows out of 26958), which is nearly 0.12%
of the total data. In this case we have dropped the duplicated rows.
Observation-4: There is a significant number of outliers in some variables, i.e.
data points that are far from the rest of the dataset and that will affect the outcome
of our regression model. So we have to treat the outliers. The distributions of
some quantitative features like "carat" and the target feature "price" are heavily
right-skewed.
Observation-5: Most features correlate with the price of Diamond. The
notable exception is "depth", which has a negligible correlation (<1%).
Observation on 'CUT': The Premium Cut on Diamonds is the most expensive, followed by the Very Good
Cut.
1.2 Impute null values if present, also check for the values which are equal to zero.
Do they have any meaning or do we need to change them or drop them? Check for
the possibility of combining the sub levels of a ordinal variables and take actions
accordingly. Explain why you are combining these sub levels with appropriate
reasoning.
Solution.
➢ We start by checking the dataset for null values. As seen in
Figure 8, there are a total of 697 null values in the depth column.
➢ The median is then computed for each attribute so that it can be used to
replace the null values present in the dataset.
➢ In Figure 9 below we can see that the null values have been replaced by the computed
median.
➢ After this cleaning the shape of the dataset becomes 26925 rows and 10
columns.
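The two cleaning steps discussed in this section, median imputation of depth and removal of rows with a zero dimension, can be sketched as follows. The small data frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "depth": [61.5, np.nan, 62.3, 60.9, np.nan],
    "x": [4.3, 5.1, 0.0, 6.2, 4.8],
    "y": [4.3, 5.2, 5.0, 6.1, 4.9],
    "z": [2.7, 3.2, 3.1, 0.0, 3.0],
})

# Replace missing depth values with the column median.
df["depth"] = df["depth"].fillna(df["depth"].median())

# A stone cannot have a zero length, width or height, so flag such rows as faulty.
zero_dim = (df[["x", "y", "z"]] == 0).any(axis=1)
clean = df.loc[~zero_dim].reset_index(drop=True)

print(int(zero_dim.sum()), "rows with a zero dimension")
print(clean)
```

The median is used rather than the mean because it is not pulled towards extreme values in a skewed column.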
Is scaling necessary in this case?
No, it is not necessary: ordinary least squares gives an equivalent solution whether or not we
apply linear scaling. Scaling is still recommended for regression techniques because it
helps gradient descent converge faster and reach the global minimum. When the
number of features becomes large, scaling helps the model run quickly; without it, the starting
point can be very far from the minimum.
For now we will build the model without scaling, and later we will compare its output
with the regression model built on scaled data.
1.3 Encode the data (having string values) for Modelling. Split the data into train and
test (70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and
check the performance of Predictions on Train and Test sets using Rsquare, RMSE &
Adj Rsquare. Compare these models and select the best one with appropriate
reasoning.
Solution.
Train-Test Split:
➢ Copy all the predictor variables into the X data frame and copy the target into the y data frame.
We then split X and y into a training set and a test
set.
➢ For this we use the sklearn package to split X and y in a 70:30 ratio, then
invoke the linear regression function and find the best-fit model on the training data.
➢ The intercept for our model is -3171.9504473076336.
➢ The intercept (often labelled the constant) is the expected mean value of Y when
all X = 0. If X can never be equal to zero, the intercept has no intrinsic meaning.
➢ In the present case, when the predictor variables (carat, cut, colour,
clarity, etc.) are all zero, C = -3172 in $Y = m_1 X_1 + m_2 X_2 + \dots + m_n X_n + C + e$, which
means that the price is -3172. This does not make any sense, so to deal with it we apply z-score
scaling, which brings the intercept close to zero.
R square on training data : 0.9311935886926559
R square on testing data : 0.931543712584074
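The split-and-fit steps above can be sketched with scikit-learn as follows. The predictors are synthetic stand-ins (one carat-like column and one encoded cut-like column), so the printed numbers will not match the values reported here; `random_state=1` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
# Synthetic stand-ins for two predictors: carat and an ordinal-encoded cut.
X = np.column_stack([rng.uniform(0.2, 2.0, n), rng.integers(0, 5, n)])
y = -3000 + 8000 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 300, n)

# 70:30 split, then fit on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R square train:", model.score(X_train, y_train))
print("R square test:", model.score(X_test, y_test))
```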
➢ R square is the percentage of the variation in the response variable that is explained by a
linear model. It is computed as:
R-square = Explained Variation / Total Variation
➢ It is always between 0 and 100%: 0% indicates that the model explains none of
the variability of the response data around its mean, and 100% indicates that the model
explains all the variability of the response data around its mean.
➢ In the regression model the R-square values on the training and test data are
0.9311935886926559 and 0.931543712584074 respectively.
➢ The RMSE on the training and test data is 907.1312415459143 and
911.8447345328437 respectively.
➢ The scatter plot shows a linear relationship with a very strong correlation
between the predicted y and the actual y.
➢ It also shows a lot of spread, which indicates some unexplained variance in
the output.
➢ As the training and test scores are almost in line, we can conclude that this model
is a right-fit model.
Metric | Training Data | Test Data
R-square | 0.9311935886926559 | 0.931543712584074
RMSE | 907.1312415459143 | 911.8447345328436
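For reference, the three metrics named in the question can be computed by hand as sketched below. R square is written as 1 - (residual variation / total variation), which equals explained / total variation for a least-squares fit with an intercept on the training data. The function name and the six sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return (R square, adjusted R square, RMSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R square penalises the number of predictors p: 1 - (1 - R2)(n - 1)/(n - p - 1)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = np.sqrt(ss_res / n)
    return r2, adj_r2, rmse

y_true = [500, 1000, 1500, 2000, 2500, 3000]  # made-up prices
y_pred = [600, 900, 1500, 2100, 2400, 3000]   # made-up predictions
r2, adj_r2, rmse = regression_metrics(y_true, y_pred, n_predictors=2)
print(round(r2, 4), round(adj_r2, 4), round(rmse, 2))
```

With tens of thousands of rows and only a handful of predictors, the adjustment factor is very close to 1, so R square and adjusted R square agree to three decimals.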
Applying z-score scaling
➢ We initiate the linear regression function and find the best-fit model on the training
data, and then explore the coefficients for each of the attributes.
➢ The intercept for our model is -5.879615251304736e-16 and the coefficient of
determination is 0.9315051288558229.
➢ By applying the z-score, the intercept has changed from
-3171.950447307667 to -5.879615251304736e-16. The coefficients have
changed and the bias has become nearly zero, but the overall accuracy is still the same.
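The effect described above can be checked on synthetic data: standardising both the predictors and the target moves the intercept to (numerically) zero and rescales the coefficients, while R square is unchanged. The sketch below uses a plain least-squares solve from NumPy rather than the report's own code; all names and values are illustrative.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients, R square)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ beta
    r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta[0], beta[1:], r2

def zscore(a):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(0.2, 2.0, 400), rng.uniform(3.5, 9.0, 400)])
y = -3000 + 8000 * X[:, 0] + 150 * X[:, 1] + rng.normal(0, 300, 400)

raw_intercept, raw_coef, raw_r2 = fit_ols(X, y)
z_intercept, z_coef, z_r2 = fit_ols(zscore(X), zscore(y))

print("raw:    intercept", raw_intercept, "R square", raw_r2)
print("scaled: intercept", z_intercept, "R square", z_r2)
```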
Check Multi-collinearity using VIF
• The VIF values show very strong multicollinearity in the data set; ideally VIF
should be within 1 to 5.
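The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A minimal NumPy sketch follows; it includes an intercept in each auxiliary regression, which is an assumption about how the VIF was computed here, and the three columns are synthetic.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x = rng.uniform(4, 8, 300)
y_dim = x + rng.normal(0, 0.05, 300)  # nearly a copy of x, like length and width of a stone
depth = rng.normal(61.8, 1.4, 300)    # unrelated to the other two

vifs = vif(np.column_stack([x, y_dim, depth]))
print(vifs.round(2))
```

Two columns that are almost copies of each other produce very large VIFs, while the unrelated column stays close to 1.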
Linear Regression using stats models
➢ We start by assuming the null hypothesis H0 is true, i.e. that in the population from which
the sample was drawn the coefficient of each variable shown above is zero.
➢ We then ask what the probability is of finding the observed coefficient in this sample if
in the real world the coefficient is zero. Here the overall p value is less than
alpha, so we reject H0 and accept Ha that at least one regression coefficient is not 0.
➢ The individual coefficients are then judged by their own p values. For example, the p value
for the 'depth' variable is 0.449, which is much
higher than 0.05. That means this dimension is of no use. So we can say that the
attributes with a p value greater than 0.05 are poor predictors of price.
Root Mean Squared Error (Training) ------RMSE: 907.1312415459133
Root Mean Squared Error (Test) ------------RMSE: 911.8447345328433
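To show where the per-coefficient p values come from, the sketch below computes them by hand: each coefficient is divided by its standard error and the result is compared against a reference distribution. It uses the standard normal as a large-sample approximation to the t distribution, which is an assumption made to keep the example dependency-free; the statsmodels summary reports the t-based values (available as `pvalues` on a fitted OLS result). The data is synthetic.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Coefficients (intercept first) and two-sided p values, normal approximation."""
    A = np.column_stack([np.ones(len(X)), X])
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = (resid @ resid) / (n - k)             # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    # P(|Z| > |t|) for a standard normal Z.
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p

rng = np.random.default_rng(4)
n = 2000
carat = rng.uniform(0.2, 2.0, n)
depth = rng.normal(61.8, 1.4, n)
price = 8000 * carat + rng.normal(0, 500, n)  # depth has no effect by construction

beta, p = ols_p_values(np.column_stack([carat, depth]), price)
for name, b, pv in zip(["const", "carat", "depth"], beta, p):
    print(name, round(b, 2), pv)
```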
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
Please explain and summarise the various steps performed in this project. There
should be proper business interpretation and actionable insights present.
Solution.
Inference:
We can see from the linear plot that there is a very strong correlation between the predicted y and
actual y. But there is a lot of spread, which indicates some noise in the data set, i.e. unexplained
variance in the output.
Linear regression Performance Metrics:
Intercept for the model: -3171.950447307667
R square on training data: 0.9311935886926559
R square on testing data: 0.931543712584074
RMSE on training data: 907.1312415459143
RMSE on testing data: 911.8447345328436
As the training and testing scores are almost in line, we can
conclude this model is a right-fit model.
Impact of scaling:
By applying the z-score the intercept became -5.87961525130473e-16; earlier it
was -3171.950447307667. The coefficients have changed and the bias became nearly zero, but the
overall accuracy is still the same.
Multi collinearity: There is very strong multicollinearity present in
the data set.
From statsmodels: R-squared: 0.931 and Adj. R-squared: 0.931 are the same.
The overall p value is less than alpha.
➢ Finally we can conclude that the 5 most important attributes for predicting the price are
'Carat', 'Cut', 'colour', 'clarity' and width, i.e. 'y'.
➢ When 'carat' increases by 1 unit, diamond price increases by 8901.94 units, keeping all
other predictors constant.
➢ When 'cut' increases by 1 unit, diamond price increases by 109.19 units, keeping all other
predictors constant.
➢ When 'colour' increases by 1 unit, diamond price increases by 272.92 units, keeping all
other predictors constant.
➢ When 'clarity' increases by 1 unit, diamond price increases by 436.44 units, keeping all
other predictors constant.
➢ When 'y' increases by 1 unit, diamond price increases by 1464.83 units, keeping all other
predictors constant.
➢ The p value for the depth variable is 0.449, which is much greater than 0.05.
That means this attribute is of no use.
➢ There are also some negative coefficients. 'x', i.e. the length of the cubic
zirconia in mm, has a negative coefficient of -1417.9089, and its p value is less than 0.05,
so we can conclude that the greater the length of the stone, the less profitable it is.
➢ Similarly, the 'z' variable has a negative coefficient of -711.23, and its p value is less
than 0.05, so we can conclude that the greater the 'z' of the stone, the less profitable
it is.
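As a worked example of reading the coefficients listed above, the change in predicted price for a given change in the predictors is the sum of coefficient times change. The helper below is hypothetical; the coefficient values are the ones quoted in this section, and because the predictors are strongly collinear the result should be read as a model output rather than a causal effect.

```python
# Coefficients as quoted in this section (price units per one-unit increase).
coefficients = {
    "carat": 8901.94,
    "cut": 109.19,
    "colour": 272.92,
    "clarity": 436.44,
    "y": 1464.83,
    "x": -1417.9089,
    "z": -711.23,
}

def price_change(deltas):
    """Change in predicted price for the given changes, other predictors held constant."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())

# Example: 0.1 carat heavier and one clarity level higher.
change = price_change({"carat": 0.1, "clarity": 1})
print(round(change, 3))
```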
Recommendations:
➢ The Gem Stones company should consider the features 'Carat', 'Cut', 'colour', 'clarity' and
width, i.e. 'y', as the most important for predicting the price, in order to distinguish between
more profitable and less profitable stones and so achieve a better profit share.
➢ As we can see from the model, the higher the width ('y') of the stone, the higher the price.
➢ So stones with a higher width ('y') should be considered among the more profitable stones. The
'Premium' cut diamonds are the most expensive, followed by the 'Very Good' cut; these
should also be considered among the more profitable stones.
➢ Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two categories should
also be considered among the more profitable stones.
➢ For 'x', i.e. the length of the stone, the higher the length, the lower the price.
➢ So the higher the length ('x') of the stone, the lower the profitability. Likewise, the higher
the 'z', i.e. the height of the stone, the lower the price. This is because if a diamond's height
is too large, the diamond will become dark in appearance, as it will no longer return an
attractive amount of light.
➢ That is why stones with a higher 'z' are also lower in profitability.
Problem 2: Logistic Regression and LDA
INTRODUCTION
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important
factors on the basis of which the company will focus on particular employees to sell their packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Solution.
Here we load all the necessary libraries for model building and read the head and tail of
the dataset to check whether the data has been loaded properly.
• We have no null values in the dataset.
• We have integer and object data.
The numeric data consists of integer and continuous values; the holiday package is our target
variable.
Salary, age, educ, number of young children, number of older children and whether the employee
has gone abroad (foreign) are the given attributes we have to examine to help the company predict
whether a person will opt for the holiday package or not.
NULL VALUES
There are no null values in the dataset
CHECK FOR DUPLICATES IN THE GIVEN DATASET
Number of duplicate rows = 0
Unique values for categorical variables
Percentage of employees that are interested in the holiday package: 45.9%
UNIVARIATE ANALYSIS
SKEWNESS
• We can see that most of the distributions are right-skewed, except for educ.
• The salary distribution has the maximum number of outliers.
• There are some outliers in educ, no. of young children and no. of older children.
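The skewness figures behind these observations can be obtained with pandas, as sketched below. The two columns are made-up samples with assumed names; a positive value indicates a right-skewed distribution and a value near zero indicates a roughly symmetric one.

```python
import pandas as pd

sample = pd.DataFrame({
    # Made-up values: a few large salaries create a long right tail.
    "Salary": [30000, 32000, 35000, 38000, 41000, 45000, 60000, 150000],
    # Made-up values: years of education spread evenly around the middle.
    "educ": [6, 8, 9, 9, 10, 10, 11, 13],
})

skewness = sample.skew()
print(skewness.round(2))
```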
CATEGORICAL UNIVARIATE ANALYSIS
Most of the employees don't prefer to go abroad.
The employees who prefer the holiday package are slightly fewer than those who don't.
• As we can observe, people with salaries below 150000 prefer the holiday package.
• Employees aged 50 to 60 seem not to take the holiday package,
whereas people aged 30 to 50 with a salary of less than 50000 have opted more for the
holiday package.
BIVARIATE ANALYSIS
DATA DISTRIBUTION
There is hardly any correlation between the variables, and the data seems to be normal. There is
no big difference in the data distribution between the holiday-package classes; I don't see two
clearly different distributions in the dataset provided.
CHECKING FOR CORRELATION
There is hardly any correlation between the variables, so there is no collinearity.
AFTER TREATING OUTLIERS DATA LOOKS LIKE THIS
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
Solution.
Encoding the data (having string variables)
Head of the dataset
Here we have done ONE HOT ENCODING to create dummy variables, and we can see that all values
for foreign_yes in the displayed head are 0.
The logistic regression model predicts better results if encoding is done.
Train/Test split
We will split the data in a 70/30 ratio.
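The encoding and split steps can be sketched as follows. The ten rows and the column names `holiday_package`, `Salary`, `age` and `foreign` are assumptions for illustration; only the dummy name `foreign_yes` is taken from the text above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "holiday_package": ["no", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "no"],
    "Salary": [48000, 37000, 58000, 66000, 41000, 52000, 30000, 75000, 45000, 39000],
    "age": [30, 45, 46, 31, 44, 52, 35, 58, 40, 33],
    "foreign": ["no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "no"],
})

# One-hot encode the string columns; drop_first keeps a single 0/1 column per variable.
encoded = pd.get_dummies(df, columns=["holiday_package", "foreign"], drop_first=True)

X = encoded.drop(columns=["holiday_package_yes"])
y = encoded["holiday_package_yes"].astype(int)

# 70/30 split; random_state=1 is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(encoded.head())
print(len(X_train), len(X_test))
```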
Applying Logistic Regression
Applying GridSearchCV for Logistic Regression
The grid search method is used for logistic regression to find the optimal solver and the
parameters for solving.
Using grid search we found the parameters penalty=l2, solver=liblinear,
tolerance=1e-06.
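A minimal sketch of such a grid search is shown below. The grid itself is an assumption, since the report lists only the selected values; here the penalty is left at scikit-learn's default (l2), the solver is fixed to liblinear, and the search runs over the tolerance and the regularisation strength `C` on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
# Synthetic binary target that depends on the first two columns.
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

param_grid = {
    "C": [0.1, 1.0, 10.0],   # assumed candidate values
    "tol": [1e-4, 1e-6],     # assumed candidate values
}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=3,
)
grid.fit(X, y)

best_model = grid.best_estimator_
print(grid.best_params_)
print(best_model.score(X, y))
```

`best_model` can then be used for the train and test predictions in the same way as in the lines that follow.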
Prediction on the training and test sets
ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)
Getting the probabilities on the test set
Performance Metrics will be discussed in 2.3
LDA (linear discriminant analysis)
DATASET HEAD
DATASET HEAD AFTER DATA PROCESSING
Build LDA Model
PROBABILITY PREDICTION
Performance Metrics will be discussed in 2.3
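The LDA steps listed above (build the model, then predict probabilities) can be sketched with scikit-learn as follows. The data is synthetic and the stratified split is an assumption; the report does not say whether stratification was used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Probability of class 1 for every test row, and the default 0.5-cutoff predictions.
test_prob = lda.predict_proba(X_test)[:, 1]
test_pred = lda.predict(X_test)
print(lda.score(X_train, y_train), lda.score(X_test, y_test))
```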
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which model
is best/optimized.
Solution.
PERFORMANCE METRICS FOR LOGISTIC REGRESSION
Confusion matrix on the training data
Here we see that for class 1 precision is 0.65, recall is 0.45 and f1 score is 0.53, with an accuracy of 0.63.
Confusion matrix on the test data
Here we see that for class 1 precision is 0.69, recall is 0.45 and f1 score is 0.55, with an accuracy of 0.66.
Accuracy - Training Data
0.6344262295081967
AUC and ROC for the training data
Accuracy - Test Data
0.6564885496183206
AUC and ROC for the testing data
Metrics for train data
lr_train_precision 0.65
lr_train_recall 0.45
lr_train_f1 0.53
Metrics for test data
lr_test_precision 0.69
lr_test_recall 0.45
lr_test_f1 0.55
PERFORMANCE METRICS FOR LDA(linear discriminant analysis)
MODEL SCORE
0.6327868852459017
CLASSIFICATION REPORT TRAIN DATA
Here we see that precision for 1 is 0.65 , recall is 0.44 accuracy is 0.63 and f1 score is 0.52
confusion_matrix for train data
array([[263, 66],
       [158, 123]])
Model score for test data
0.6564885496183206
Classification report for test data
Confusion matrix for test data
array([[118, 24],
       [ 66, 54]])
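The reported scores can be recomputed directly from the two LDA confusion matrices above. The sketch assumes the scikit-learn layout, with rows as actual classes and columns as predicted classes in the order 0, 1; the function name is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Class-1 precision, recall, F1 and overall accuracy from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

lda_train = metrics_from_confusion([[263, 66], [158, 123]])
lda_test = metrics_from_confusion([[118, 24], [66, 54]])

print({k: round(v, 2) for k, v in lda_train.items()})
print({k: round(v, 2) for k, v in lda_test.items()})
```

The recall of about 0.44 to 0.45 means that, at the default cutoff, more than half of the employees who actually opted for the package are predicted as not opting for it.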
CHANGING THE CUT-OFF VALUE TO CHECK OPTIMAL VALUE
THAT GIVES BETTER ACCURACY AND F1 SCORE
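A cutoff sweep of this kind can be sketched as follows: the predicted probability of class 1 is compared against each candidate cutoff from 0.1 to 0.9, and accuracy and F1 are recorded for each. The labels and probabilities are synthetic, and choosing the cutoff by F1 is an assumption about the selection rule.

```python
import numpy as np

def accuracy_and_f1(y_true, prob, cutoff):
    """Accuracy and class-1 F1 when rows with prob > cutoff are predicted as 1."""
    pred = (prob > cutoff).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 500)
# Synthetic probabilities: class-1 rows score somewhat higher on average.
prob = np.clip(0.35 + 0.2 * y_true + rng.normal(0, 0.15, 500), 0, 1)

cutoffs = [i / 10 for i in range(1, 10)]
results = {c: accuracy_and_f1(y_true, prob, c) for c in cutoffs}
best_cutoff = max(results, key=lambda c: results[c][1])

for c in cutoffs:
    print(c, round(results[c][0], 3), round(results[c][1], 3))
print("cutoff with the best F1:", best_cutoff)
```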
AUC and ROC for the training data
Comparing both these models, we find that their results are the same, but LDA works better when the
target variable is categorical.
The AUC/ROC results for both models are almost equivalent,
so it is very difficult to differentiate between the two. The scores are also almost at par with
each other, and both models perform equally well.
Since LDA works better with categorical values, we will pick it in this situation.
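For reference, the AUC used in this comparison equals the probability that a randomly chosen class-1 row receives a higher predicted score than a randomly chosen class-0 row, with ties counted as half. The pure-Python sketch below computes it that way on six made-up scores; the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.2, 0.4, 0.35, 0.8, 0.5, 0.7]  # made-up predicted probabilities
auc = roc_auc(y_true, scores)
print(round(auc, 4))
```

Because AUC depends only on how the rows are ranked, two models that order the employees almost identically will have almost the same AUC even if their probabilities differ.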
2.4 Inference: Basis on these predictions, what are the insights and
recommendations.
Please explain and summarise the various steps performed in this project. There
should be proper business interpretation and actionable insights present.
Solution.
We had been given a problem where we had to find out whether employees will opt for a
holiday package or not.
We examined the data using logistic regression and LDA.
We found that the results from both methods are the same. Predictions were made using both
models.
While doing EDA we found that:
• Most of the employees who are above 50 don't opt for holiday packages. It seems
they are not interested in holiday packages at all.
• Employees in the age group of 30 to 50 opt for holiday packages. It seems
younger people believe in spending on holiday packages, so age plays a very
important role in deciding whether they will opt for the package or not.
• People who have a salary of less than 50000 also opt for holiday packages, so salary
is also a deciding factor for the holiday package.
• Education also plays an important role in deciding on the holiday packages.
• To improve our customer base we need to look into these factors.
Recommendations
As we already have a customer base in the age group of 30 to 50, we need to look for
options to target older people and people who earn more than 150000.
• Most older people prefer to visit religious places, so it would
be better to target those places and offer packages that include visits to
religious places.
• We can also look into the family dynamics of the older people: if they
have older children, e.g. aged 30 to 40, those children can use the holiday packages, so the deal
should include a family package.
• People who earn more than 150000 don't spend much on the holiday packages; they tend
to go for lavish holidays, so we can offer them customized packages according to
their wishes, such as fancy hotels, longer vacations and personal cars during the holiday, to
attract such employees.
• We can also provide people who earn more than 150000 with extra facilities
according to their wishes at the moment.
In this project we started with EDA: descriptive statistics, a null value check,
and Univariate and Bivariate Analysis. We treated outliers and
then moved on to modelling. We encoded the data (having string values) for
modelling, split the data into train and test (70:30), and finally applied Logistic Regression
and LDA (linear discriminant analysis).