Predictive Modeling
A practical guide to data science and model building in under 60 minutes
Prashant Mudgal
Introduction
Predictive modeling and data science are said to be among the most attractive subjects, and the related jobs are said to be the hottest jobs of the twenty-first century. The same was noted in an article in Harvard Business Review a few months ago:
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
There are thousands of articles available online, and as many books, that teach data science and cover the theory of related topics such as linear algebra, probability, optimization, machine learning, and calculus. This brief work is aimed in the same direction, with a focus on implementation on a fairly sizable dataset. It covers cleaning the data, visualization, EDA, feature scaling, feature normalization, k-nearest neighbors, logistic regression, random forests, and cross validation without delving too deep into any of them, giving a new learner a starting point.

Problem Statement and Data
The bank is Banco de Portugal.
Website: https://www.bportugal.pt/en-US/Pages/inicio.aspx
Problem statement: predict, using mathematical methods, whether a customer subscribed to a term deposit.
Data has been taken from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Citation: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001
Available at: [pdf] http://dx.doi.org/10.1016/j.dss.2014.03.001
[bib] http://www3.dsi.uminho.pt/pcortez/bib/2014-dss.txt
1. Title: Bank Marketing (with social/economic context)
2. Sources: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
3. Past Usage: The full dataset (bank-additional-full.csv) was described and analyzed in: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
4. Relevant Information: This dataset is based on the "Bank Marketing" UCI dataset. The data is enriched by the addition of five new social and economic features/attributes (nationwide indicators from a country of roughly 10 million people), published by the Banco de Portugal and publicly available at https://www.bportugal.pt/estatisticasweb. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Language and Packages
Given the rise of Python and its ease of use, the models have been built in Python. One can replicate the same in R or SAS.
Packages and libraries: pandas, NumPy, scikit-learn, matplotlib
Breaking Predictive analytics down
It is an industry-wide standard to spend approximately 80% of the time on EDA and data preparation.
The split below applies only to the first model that is built.
Approximate split (from the chart): Descriptive Analysis ~40%, Data Treatment ~40%, Model Building ~15%, Performance Estimation ~5%.
Descriptive Analysis
Descriptive analysis deals with understanding the data we are using. It includes:
Identifying the predictor and the target variables, and dividing the dataset into training and test sets. Usually 25% of the dataset is kept as the test set and the model is built on the remaining 75%.
Univariate analysis for checking the spread of the variables. For numerical variables one uses measures of central tendency (mean, median, mode), measures of dispersion (IQR, range, variance, standard deviation), or visualization methods. For categorical/classification variables one can use frequency distributions and bar charts.
Bivariate analysis to check the association between variables. For continuous vs. continuous variables one can use scatter plots and then measure correlation coefficients:
-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
For categorical vs. continuous variables, the Z-test and t-test (the t-test for small samples) and ANOVA (analysis of variance) check whether groups in the dataset are statistically different.
For categorical vs. categorical variables, the chi-square test of independence is the prime test. It is a probability-based approach: a p-value close to 1 suggests the variables are independent, while a small p-value suggests they are associated. Chi-square statistics are also used to assess goodness of fit in regression models and multicollinearity.
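As a rough illustration, these bivariate checks can be run with pandas and SciPy. The column names ('age', 'duration', 'job', 'y') follow the bank marketing file, and the DataFrame name df is an assumption:

```python
# Minimal sketch of the bivariate checks described above, assuming the bank
# marketing data is already loaded into a pandas DataFrame called df.
import pandas as pd
from scipy import stats

# Continuous vs continuous: Pearson correlation coefficient (ranges -1 to +1)
corr = df['age'].corr(df['duration'])

# Continuous vs categorical: t-test comparing a numeric variable across the
# two target classes ('yes' / 'no' subscription)
yes = df.loc[df['y'] == 'yes', 'duration']
no = df.loc[df['y'] == 'no', 'duration']
t_stat, t_p = stats.ttest_ind(yes, no, equal_var=False)

# Categorical vs categorical: chi-square test of independence on a
# contingency table; a small p-value suggests the variables are associated
table = pd.crosstab(df['job'], df['y'])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
```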
Data Treatment
Data treatment includes:
Outlier treatment - e.g., a log transformation
Missing value treatment - imputing with the mean, median, or mode
Feature scaling - limiting the range of the variables
Normalization of features
Label encoding for categorical variables, as models cannot work directly on string variables
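A minimal sketch of these treatment steps, assuming a DataFrame df with bank marketing columns; the specific columns used here ('duration', 'job', 'age') are illustrative:

```python
# Sketch of the treatment steps listed above on a DataFrame df (illustrative columns).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Outlier treatment: a log transformation compresses long right tails
df['duration_log'] = np.log1p(df['duration'])

# Missing value treatment: mean for numeric columns, mode for categorical ones
df['duration'] = df['duration'].fillna(df['duration'].mean())
df['job'] = df['job'].fillna(df['job'].mode()[0])

# Feature scaling: limit each numeric variable to the [0, 1] range
df[['age', 'duration']] = MinMaxScaler().fit_transform(df[['age', 'duration']])

# Label encoding: convert string categories to integers for the models
df['job'] = LabelEncoder().fit_transform(df['job'])
```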
Model Building
Most of the problems that are solved using machine learning techniques are classification problems.
For numerical target variables - Linear Regression
For categorical target variables - Logistic Regression
Tree-based methods - Random Forest, Gradient Boosting
Performance Estimation and Improvements
The performance of the model can be checked using various methods:
ROC-AUC curve (Receiver Operating Characteristic - Area Under the Curve)
Accuracy score
Confusion matrix
Root mean squared error
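A hedged sketch of computing these metrics with scikit-learn; the names model, X_test, and y_test are illustrative and stand for a fitted classifier and a held-out test set:

```python
# Sketch of the evaluation metrics above (illustrative variable names).
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             confusion_matrix, mean_squared_error)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]     # probability of the positive class

auc = roc_auc_score(y_test, y_prob)            # area under the ROC curve
acc = accuracy_score(y_test, y_pred)           # fraction classified correctly
cm = confusion_matrix(y_test, y_pred)          # true/false positives and negatives
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # mainly for regression targets
```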
To reduce the model error, one can use k-fold cross validation. If there are a large number of predictors, one can select the best features using SelectKBest; this is a form of dimensionality reduction. Principal Component Analysis (PCA) is another powerful way to reduce the number of dimensions and deal with multicollinearity.
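For illustration, k-fold cross validation and PCA might look like this in scikit-learn; X and y stand for an already encoded feature matrix and target, and the 10 folds and 95% explained-variance threshold are assumptions:

```python
# Illustrative sketch of k-fold cross validation and dimensionality reduction.
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# k-fold cross validation: average accuracy over 10 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean())

# PCA: keep enough components to explain 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(X_reduced.shape)
```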
iPython Notebook https://github.com/Prashantmdgl9/Predictive-Analytics
Import the libraries and read the data. One should take care with the delimiter; though the data is in CSV format, it is semicolon delimited. Check the number of rows and decide how much you want to keep for training and test. For now, don't divide the data, as any cleansing that needs to be done should be done on the complete data frame.
Look at the data in detail; use the head and describe methods and the columns attribute in the pandas library to take a closer look.
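A minimal loading sketch; the file name bank-additional-full.csv and its location in the working directory are assumptions based on the UCI download:

```python
# Load the bank marketing data; note the semicolon delimiter.
import pandas as pd

df = pd.read_csv('bank-additional-full.csv', sep=';')

print(df.shape)        # number of rows and columns
print(df.head())       # first few records
print(df.describe())   # summary statistics of the numeric columns
print(df.columns)      # list of column names
```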
The data doesn't have an ID column, so let's add one and check the various data types. The data frame has integer and string type variables. Let's check whether any columns have missing data, and identify the ID and target variables. Also, separate the numeric and the categorical columns.
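One way these bookkeeping steps could look, reusing the DataFrame df from the previous step; the target column 'y' follows the UCI file, and 'ID' is the surrogate column added here:

```python
# Add an ID column, inspect dtypes and missing values, split column types.
df['ID'] = range(1, len(df) + 1)       # surrogate ID column

print(df.dtypes)                       # int/float vs object (string) columns
print(df.isnull().sum())               # missing values per column

id_col, target_col = 'ID', 'y'
numeric_cols = [c for c in df.select_dtypes(include='number').columns
                if c != id_col]
categorical_cols = df.select_dtypes(include='object').columns.tolist()
```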
For the purpose of illustration, we will deal with missing data by imputing numeric values with the mean and categorical values with -9999.
Next we scale the features; feature scaling is done to limit the range of the variables so they can be compared, and it applies to continuous variables. Before doing so, let's plot histograms of the numeric variables and check their range and distribution.
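A possible sketch of the imputation and the histogram check, reusing the numeric_cols and categorical_cols lists from the previous step; the -9999 placeholder follows the text:

```python
# Impute missing values and inspect the distributions of the numeric columns.
import matplotlib.pyplot as plt

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
df[categorical_cols] = df[categorical_cols].fillna('-9999')

df[numeric_cols].hist(figsize=(12, 10), bins=30)   # range and distribution
plt.tight_layout()
plt.show()
```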
We see that many variables have entirely different ranges, so scaling might help. Before proceeding we should encode our categorical variables: they are string objects, and we can't build models on string variables unless they are converted.
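One simple way to convert the string columns is scikit-learn's LabelEncoder (one-hot encoding would be a common alternative):

```python
# Encode every categorical column as integers so the models can consume it.
from sklearn.preprocessing import LabelEncoder

for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```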
Let's scale the data frame using the MinMax scaler and fit kNN; here we also compute the accuracy of the model. We can then normalize the data using the scale function in the scikit-learn library and perform a logistic regression on the normalized data.
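A sketch of these two baseline models; the 75/25 split, k = 5 neighbours, and the random seed are illustrative choices rather than the notebook's exact settings:

```python
# MinMax-scaled kNN and a logistic regression on normalized data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, scale
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X = df.drop(columns=['ID', 'y'])
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# MinMax-scaled features with k-nearest neighbours
scaler = MinMaxScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print('kNN accuracy:', knn.score(scaler.transform(X_test), y_test))

# Normalized features with logistic regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(scale(X_train), y_train)
print('Logistic regression accuracy:', logreg.score(scale(X_test), y_test))
```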
We can now proceed with further machine learning algorithms; random forest is our algorithm of choice because of its strong performance. To reduce the bias-variance error, we use k-fold cross validation, which helps alleviate the problem.
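A hedged sketch of the random forest with 10-fold cross validation on the training data; the number of trees is an illustrative choice:

```python
# Random forest evaluated with 10-fold cross validation, then on the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)
cv_scores = cross_val_score(rf, X_train, y_train, cv=10)
print('Mean CV accuracy:', cv_scores.mean())

rf.fit(X_train, y_train)
print('Test accuracy:', rf.score(X_test, y_test))
```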
Feature selection is one of the ways to reduce the number of predictors and keep only those that contribute most to the model's predictions. It can be achieved using Principal Component Analysis or the SelectKBest feature of scikit-learn.
Let's inspect the p-values of the features and form a shortlist of the best ones. We then form our new feature list from the best scores (lowest p-values), fit a random forest with cross validation on the training set with k = 10 folds, and evaluate on the test data.
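A sketch of this selection step with SelectKBest and the ANOVA F-test; keeping k = 10 features is an assumption:

```python
# Select the strongest features, then refit the random forest on them.
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
print(dict(zip(X_train.columns, selector.pvalues_)))   # inspect the p-values

X_train_best = selector.transform(X_train)
X_test_best = selector.transform(X_test)

cv_scores = cross_val_score(rf, X_train_best, y_train, cv=10)
print('Mean CV accuracy on selected features:', cv_scores.mean())

rf.fit(X_train_best, y_train)
print('Test accuracy on selected features:', rf.score(X_test_best, y_test))
```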
Conclusion
The steps above conclude our model building exercise. We started with data exploration and data preparation and went on to use methods such as kNN, cross validation, and random forests to arrive at the final model and results. Depending on the type of dataset, one may have to add or remove a few steps, but the gist remains the same: explore, treat, build, and improve. The steps above can be used to build a starting model on almost any dataset and should give decent accuracy.
* If you want to download the Python notebook of the project, visit https://github.com/Prashantmdgl9/Predictive-Analytics
