Project Report
RIYA KUKRETI
SHIVABHANUPRASAD
Roll No: 2301010131

Dr. Vineet Kumar Saini
Assistant Professor
CSE Department

Date:
Place: Dehradun

Examined by:
(signature)
Mr. Vineet Kumar Saini
Head of Department: Dr. Madhu Kirola
I would also like to tender my sincere thanks to Dr. Madhu Kirola (Head of the Computer Science & Engineering Department) for her co-operation and encouragement.
ABSTRACT

The number of service providers is increasing very rapidly in every business. These days, there is no shortage of options for customers in the banking sector when choosing where to keep their money safely. As a result, customer churn and customer loyalty have become major problems for most banks. In this project, a method to predict customer churn in a bank using machine learning techniques with an Artificial Neural Network (ANN), a branch of artificial intelligence, is proposed. The research promotes the exploration of the likelihood of churn by analyzing customer behavior and loyalty. Customer churn has become a major problem in all industries, including the banking industry, and banks have always tried to track customer interactions so that they can detect the customers who are likely to leave the bank.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF SYMBOLS
01 INTRODUCTION
02 LITERATURE SURVEY
03 METHODOLOGY
    3.1 OBJECTIVE
    3.2 LIST OF MODULES
    3.3 SYSTEM ARCHITECTURE
04 RESULT AND DISCUSSION / PERFORMANCE ANALYSIS
    4.1 FEATURES
    4.2 CODE
05 CONCLUSION AND FUTURE WORK
LIST OF FIGURES

01 SYSTEM ARCHITECTURE
02 WORKFLOW DIAGRAM
03 ER-DIAGRAM
04 MODULE DIAGRAM
1. INTRODUCTION
Churning means a customer leaving one company and transferring to another. It is not only a loss of income; it also has other negative effects on operations. Customer Relationship Management is therefore very important for banks: by establishing long-term relationships with customers, they can increase their customer base. The service provider's challenges lie in the behavior of customers and their expectations. The current generation of people is, on the whole, better educated than previous generations, so they expect more policies and have diverse demands for connectivity and innovation. This advanced knowledge is leading to changes in purchase behavior, and it is a big challenge for current service providers to think innovatively to meet these expectations.
Private-sector banks need to recognize their customers. Liu and Shih strengthen this argument in their paper by indicating that there are increasing pressures on companies to develop new and innovative marketing ideas to meet customer expectations and increase loyalty and retention. For customers, it is very easy to transfer their relationship from one bank to another. Some customers might keep their relationship status null, that is, keep their account inactive; an inactive account can be a sign that the customer is transferring the relationship to another bank. There are different types of customers in a bank. Farmers are among the major customers; they expect lower monthly charges as they are financially weaker. Businesspersons are also major and important customers, because most of the transactions with huge amounts are usually done by them; these customers expect better service quality. One of the most important categories is middle-class customers, who in most banks outnumber the other types. These people expect lower monthly charges, better service quality, and new policies.
So, maintaining different types of customers is not easy. Banks need to consider customers and their needs, and resolve these challenges by delivering reliable service on time and within budget. Maintaining a good working partnership with customers is another significant challenge; failing to resolve these challenges may cause churning. Recruiting a new customer is more expensive and harder than keeping an existing one; customer retention, on the other hand, is usually less expensive because the bank has already gained the confidence and loyalty of its present customers. So, a system that can predict customer churn effectively in its early stages is very important for any bank. This paper aims at a framework that can predict customer churn in the banking sector using machine learning algorithms with an ANN.
Existing System:
Disadvantages:
MACHINE LEARNING
Machine learning is about predicting the future from past data. Machine learning (ML) is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can change when exposed to new data. This section covers the basics of machine learning and the implementation of a simple machine learning algorithm using Python. The process of training and prediction involves the use of specialized algorithms: training data is fed to an algorithm, and the algorithm uses this training data to give predictions on new test data. Machine learning can be roughly separated into three categories: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, programs are given both the input data and the corresponding labels; the data has to be labeled by a human being beforehand. Unsupervised learning has no labels: only the unlabeled input is provided to the learning algorithm, which has to figure out the clustering of the input data. Finally, reinforcement learning dynamically interacts with its environment and receives positive or negative feedback to improve its performance.
Data scientists use many different kinds of machine learning algorithms in Python to discover patterns that lead to actionable insights. At a high level, these algorithms can be classified into two groups based on the way they "learn" about data to make predictions: supervised and unsupervised learning. Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification predictive modeling is the task of approximating a mapping function from input variables (X) to discrete output variables (y). In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. The data set may simply be bi-class (like identifying whether a person is male or female, or whether a mail is spam or non-spam) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, and document classification.
The majority of practical machine learning uses supervised learning. Supervised learning is where you have input variables (X) and an output variable (y) and use an algorithm to learn the mapping function from input to output, y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (y) for that data. Techniques of supervised machine learning include logistic regression, multi-class classification, decision trees and support vector machines. Supervised learning requires that the data used to train the algorithm is already labeled with correct answers. Supervised learning problems can be further grouped into regression and classification problems. Both have as their goal the construction of a succinct model that can predict the value of the dependent attribute from the attribute variables; the difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification. A classification model attempts to draw some conclusion from observed values: given one or more inputs, it tries to predict the value of one or more outcomes. A classification problem is when the output variable is a category, such as "red" or "blue".
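As a hedged illustration of such a supervised classifier (a sketch only; the file name and feature columns are assumptions, not the project's fixed choices):

# Minimal supervised-classification sketch (illustrative assumptions only).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("Churn_Modelling.csv")        # hypothetical file name
X = df[["CreditScore", "Age", "Balance"]]      # assumed feature columns
y = df["Exited"]                               # 1 = Exit, 0 = Not Exit

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict(X_test[:5]))                 # predicted classes for new data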
This dataset contains 10000 records of features extracted from Bank Customer
Data, which were then classified into 2 classes:
● Exit
● Not Exit
Proposed System:
The proposed method is to build a bank customer churn prediction model using machine learning techniques. We are going to develop an AI-based model, and we need data to train it; we can use a bank customer dataset to train the model. To use this dataset, we need to understand the intents we are going to train on. An intent is the intention of the user interacting with a predictive model, or the intention behind each data record that the model receives from a particular user. Depending on the domain for which you are developing an AI solution, these intents may vary from one solution to another. The strategy is to define different intents, make training samples for those intents, and train your AI model with those training samples as model training data and the intents as model training categories. The model is built using a process of vectorisation, where vectors are made so the model can understand the data. By trying different algorithms we can get a better AI model with the best accuracy. After building a model, we evaluate it using different metrics such as the confusion matrix, precision, recall, sensitivity and F1 score.
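Since the report proposes machine learning with an ANN, a minimal sketch of such a model (using Keras; the layer sizes, feature columns and file name are assumptions, not the report's final architecture) might look like this:

# Minimal ANN sketch for binary churn classification (assumed architecture).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

df = pd.read_csv("Churn_Modelling.csv")              # hypothetical file name
X = df[["CreditScore", "Age", "Balance", "Tenure"]]  # assumed feature columns
y = df["Exited"]                                     # 1 = Exit, 0 = Not Exit

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)              # scale features for the ANN
X_test = scaler.transform(X_test)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),     # outputs churn probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))     # [loss, accuracy]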
2. LITERATURE SURVEY
A literature review is a body of text that aims to review the critical points of current knowledge on, and/or methodological approaches to, a particular topic. It is a secondary source: it discusses published information in a particular subject area, and sometimes information in a particular subject area within a certain time period. Its ultimate goal is to bring the reader up to date with the current literature on a topic, and it forms the basis for another goal, such as identifying future research that may be needed in the area; it precedes a research proposal and may be just a simple summary of sources. Usually, it has an organizational pattern and combines both summary and synthesis.

A summary is a recap of the important information in the source, while a synthesis is a reorganization, a reshuffling, of that information. It might give a new interpretation of old material, combine new with old interpretations, or trace the intellectual progression of the field, including major debates. Depending on the situation, the literature review may evaluate the sources and advise the reader on the most pertinent or relevant of them.
Firms try to improve their perceived quality by way of giving timely and quality service to their customers.
Customer churn has become one of the primary challenges that many firms face nowadays. Several churn prediction models and techniques have previously been proposed in the literature to predict customer churn in areas such as finance, telecom and banking. Researchers are also working on customer churn prediction in e-commerce using data mining and machine learning techniques. In this paper, a comprehensive review of various models to predict customer churn in e-commerce using data mining and machine learning techniques is presented. A critical review of recent research papers in the field of customer churn prediction in e-commerce using data mining has been done. Thereafter, important inferences and research gaps identified while studying the literature are presented. Finally, the research significance and concluding remarks are described.
Bank Customer Retention Prediction and Customer Ranking Based on Deep Neural Networks, Dr. A. P. Jagadeesan, Ph.D., 2020. Retention of customers is a major concern in any industry, and customer churn is an important metric that gives the hard truth about the retention percentage of customers. A detailed study of the existing models for predicting customer churn is made, and a new model based on an Artificial Neural Network is proposed to find customer churn in the banking domain. The proposed model is compared with existing machine learning models: logistic regression, decision tree and random forest mechanisms are the baseline models used for comparison, and the performance metrics compared are accuracy, precision, recall and F1 score. It has been observed that the artificial neural network model performs better than the logistic regression and decision tree models, but when the results are compared with the random forest model, no considerable difference is noted. The proposed model differs from the existing models in that it can rank the customers in the order in which they would leave the organization.
3. METHODOLOGY
This section explains the various works that have been done in order to predict customer churn using machine learning models. In addition to the conventional data used for predicting customer churn, some authors have added data from various other sources, including customers' phone conversations, the websites and products the customer has viewed, interactive voice data and other financial data. A binary classification model is used for predicting customer churn. Though a good improvement is noticed with this model, the data used is not commonly available at all times. Churn prediction is a binary classification problem; the authors specify that, from the studies, it has been observed that there is no proper means of measuring the certainty of the classifier employed for churn prediction. It has also been observed that the accuracy of the classifiers differs for different zones of the dataset.
Project Goals
● Based on the best accuracy
Objectives
The goal is to develop a machine learning model for bank churn prediction, to potentially replace the updatable supervised machine learning classification models, by comparing supervised algorithms and reporting results in the form of the best accuracy.
Feasibility study:
Data Wrangling
In this section of the report we will load in the data, check for cleanliness, and then trim and clean the given dataset for analysis. The cleaning steps are documented carefully and each cleaning decision is justified.
Data collection

The data set collected for predicting the given data is split into a training set and a test set. Generally, a 7:3 ratio is applied to split the training set and test set. The data model is created by applying different algorithms on the training set, and, based on the test result accuracy, prediction on the test set is done.
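A minimal sketch of this 7:3 split with scikit-learn (the file name, label column and fixed random seed are assumptions):

# 70/30 train-test split, as described above (sketch).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Churn_Modelling.csv")    # hypothetical dataset file
X = df.drop(columns=["Exited"])            # features
y = df["Exited"]                           # churn label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42   # 7:3 split; fixed seed (assumption)
)
print(len(X_train), len(X_test))           # expect 7000 / 3000 for 10000 rows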
Preprocessing
The data which was collected might contain missing values that may lead to inconsistency. To gain better results, the data needs to be preprocessed so as to improve its quality and the accuracy of the model.
Functional requirements:
This section describes the requirements for the software product. It is the first step in the requirements analysis process and lists the requirements of a particular software system. The implementation relies on special libraries such as scikit-learn, pandas, NumPy, Matplotlib and seaborn.
Non-Functional Requirements:
1. Define the problem
2. Prepare the data
3. Evaluate algorithms
4. Improve results
5. Predict the result
Environmental Requirements:
1. Software Requirements :
Workflow diagram
Use case diagrams are considered for high-level requirement analysis of a system: when the requirements of a system are analyzed, its functionalities are captured in use cases. Use cases are nothing but the system functionalities written in an organized manner.
Class Diagram
A class diagram is basically a graphical representation of the static view of the system and represents different aspects of the application, so a collection of class diagrams represents the whole system. The name of the class diagram should be meaningful and describe the aspect of the system. Each element and its relationships should be identified in advance, and the responsibility (attributes and methods) of each class should be clearly identified. For each class, the minimum number of properties should be specified, because unnecessary properties will make the diagram complicated. Use notes whenever required to describe some aspect of the diagram; at the end of the drawing, it should be understandable to the developer/coder. Finally, before making the final version, the diagram should be drawn on plain paper and reworked as many times as possible to make it correct.
Even after the relational database is rolled out, an ERD can still serve as a reference point, should any debugging or business process re-engineering be needed later.
Data Pre-processing
Validation techniques in machine learning are used to get the error rate of the machine learning (ML) model, which can be considered close to the true error rate on the population. If the data volume is large enough to be representative of the population, you may not need the validation techniques. However, in real-world scenarios we work with samples of data that may not be a true representative of the population of the given dataset, so validation matters. Preprocessing also includes finding missing values and duplicate values and describing each data type, whether it is a float variable or an integer. The validation set is a sample of data used to provide an unbiased evaluation of a model fit on the training dataset.

The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration. The validation set is used to evaluate a given model, but this is for frequent evaluation; machine learning engineers use this data to fine-tune the model hyperparameters. Data collection, data analysis, and the process of addressing data content, quality, and structure can add up to a time-consuming to-do list. During the process of data identification, it helps to understand your data and its properties; this knowledge will help you choose which algorithm to use to build your model.
This section walks through a number of different data cleaning tasks using Python's pandas library, focusing on probably the biggest data cleaning task: missing values. Handling them well lets you clean data more quickly, spend less time cleaning, and spend more time exploring and modeling.
Some missing values are just simple random mistakes; other times, there is a deeper reason why data is missing. It is important to understand these different types of missing data from a statistics point of view, since the type of missing data will influence how to fill in the missing values. We need to detect missing values, do some basic imputation, and take a detailed statistical approach for dealing with missing data. Before jumping into code, it is important to understand the sources of missing data. Here are some typical reasons why data is missing:
● Data was lost while transferring manually from a legacy database.
● Users chose not to fill out a field tied to their beliefs about how the results would be used
or interpreted.
We first import libraries for data access and functional purposes, read the given dataset, and show its columns.
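A sketch of that step (the file name is hypothetical; the column layout is assumed from the 10,000-record bank customer dataset described earlier):

# Import libraries, read the dataset, and inspect it (sketch).
import numpy as np
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")   # hypothetical file name
print(df.columns)                         # show columns
print(df.dtypes)                          # float / integer description
print(df.isnull().sum())                  # count missing values per column
print(df.duplicated().sum())              # count duplicate rows

# Basic imputation: fill numeric gaps with each column's median (one simple choice)
df = df.fillna(df.select_dtypes(include=np.number).median())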
Exploratory data analysis and visualization

Visualization can be helpful when exploring and getting to know a dataset, and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts that are more visceral to stakeholders than measures of association or significance. Data visualization and exploratory data analysis are whole fields in themselves, and a deeper dive into some of the books mentioned at the end is recommended.
Sometimes data does not make sense until you look at it in visual form, such as with charts and plots. Being able to quickly visualize data samples is an important skill both in applied statistics and in applied machine learning. This section covers the many types of plots that you will need to know when visualizing data in Python and how to use them to better understand your own data, including how to chart time series data with line plots and categorical quantities with bar charts.
MODULE DIAGRAM
visualized data
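A small sketch of those two plot types with Matplotlib (the column names are assumptions about the bank dataset):

# Bar chart and line plot examples (sketch).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")      # hypothetical file name

# Bar chart: churn counts per class ("Exit" vs "Not Exit")
df["Exited"].value_counts().plot(kind="bar")
plt.title("Customer churn distribution")
plt.xlabel("Exited")
plt.ylabel("Number of customers")
plt.show()

# Line plot: an ordered quantity, e.g. average balance by age
df.groupby("Age")["Balance"].mean().plot(kind="line")
plt.title("Average balance by age")
plt.show()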
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. Data preprocessing is a technique used to convert raw data into a clean data set; whenever data is gathered from different sources, it is collected in a raw format which is not feasible for analysis. To achieve better results from the applied machine learning model, the data has to be in a proper format. Some machine learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, so to execute the random forest algorithm, null values have to be managed in the original raw data set. Another consideration is that the data set should be formatted in such a way that more than one machine learning or deep learning algorithm can be executed on it.
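One such formatting step is encoding categorical columns as numbers, since scikit-learn models such as Random Forest require numeric, null-free input (the Geography and Gender columns are assumptions about the bank dataset):

# Encode categorical features so tree-based models can consume them (sketch).
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")                    # hypothetical file name
df = df.dropna()                                           # Random Forest cannot take nulls
df = pd.get_dummies(df, columns=["Geography", "Gender"])   # one-hot encoding
print(df.head())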
The next section shows exactly how to do that in Python with scikit-learn. The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data; this can be achieved by forcing each algorithm to be evaluated on a consistent test harness.
Logistic Regression
Random Forest
Naive Bayes
The K-fold cross-validation procedure is used to evaluate each algorithm, importantly configured with the same random seed, to ensure that the same splits of the training data are performed and that each algorithm is evaluated in precisely the same way. Before comparing algorithms, we build machine learning models using scikit-learn libraries: preprocessing, a linear model with the logistic regression method, cross-validation with the KFold method, an ensemble with the random forest method, and a tree with the decision tree classifier. Additionally, we split the train set and test set, and we predict the result by comparing accuracy.
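A sketch of such a consistent test harness, evaluating each model on identical folds with the same seed (X and y are assumed to be the prepared features and churn labels):

# Compare several classifiers on identical K-fold splits (sketch).
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)  # same splits for all
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")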
Accuracy: the proportion of the total number of predictions that are correct; in other words, how often the model correctly predicts churners and non-churners.

Accuracy calculation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is best. Yes, accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations: Precision = TP / (TP + FP). The question this metric answers is: of all customers labeled as churned, how many actually churned? High precision relates to a low false positive rate. We have got 0.788 precision, which is pretty good.
Recall: the proportion of positive observed values correctly predicted (the proportion of actual churners that the model will correctly predict).

Recall (sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class: Recall = TP / (TP + FN).
F1 score is the weighted average of precision and recall; this score therefore takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost; if the costs of false positives and false negatives are very different, it is better to look at both precision and recall.

General formula:

F1 = 2TP / (2TP + FP + FN)

Score formula:

F1 Score = 2 · (Precision · Recall) / (Precision + Recall)
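A sketch of computing these metrics with scikit-learn (a fitted model and the earlier test split are assumed):

# Evaluate a fitted classifier with the metrics discussed above (sketch).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_pred = model.predict(X_test)               # model fitted earlier (assumed)
print(confusion_matrix(y_test, y_pred))      # TN / FP / FN / TP counts
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))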
Algorithm Explanation
sklearn:
● In Python, sklearn is a machine learning package which includes a lot of ML algorithms.
● Here, we are using some of its modules like train_test_split, DecisionTreeClassifier, LogisticRegression and accuracy_score.
NumPy:
● It is a numeric Python module which provides fast maths functions for calculations.
● It is used to read data into NumPy arrays and for manipulation purposes.
Pandas:
● Used to read and write different files.
● Data manipulation can be done easily with data frames.
Matplotlib:
● Data visualization is a useful way to help with identifying the patterns from a
given dataset.
Logistic Regression
Logistic regression is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (the dependent variable, i.e., the response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
Logistic regression assumptions:

● For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
● The independent variables should be independent of each other; that is, the model should have little or no multicollinearity.
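A minimal logistic regression sketch for the binary churn label (assuming the train/test split prepared earlier):

# Logistic regression on the churn features (sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)         # raised max_iter to aid convergence
logreg.fit(X_train, y_train)
churn_proba = logreg.predict_proba(X_test)[:, 1]   # probability of class 1 (churn)
print("Accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
print("First probabilities:", churn_proba[:5])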
The following are the basic steps involved in performing the random forest algorithm:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
In the case of a regression problem, for a new record, each tree in the forest predicts a value for Y (the output), and the final value is calculated by taking the average of the values predicted by all the trees in the forest. In the case of a classification problem, each tree in the forest predicts the category to which the new record belongs, and the record is finally assigned to the category that wins the majority vote.
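A sketch of this majority-vote ensemble with scikit-learn (n_estimators is the "number of trees" chosen above; the seed is an assumption; X/y splits as prepared earlier):

# Random forest: an ensemble of decision trees with majority voting (sketch).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,    # number of trees (steps 1 and 2 repeated 100 times)
    random_state=42,     # assumed seed for reproducibility
)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))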
MODULE DIAGRAM
getting accuracy
Decision Tree Classifier
It is one of the most powerful and popular algorithms. Decision-tree algorithms fall under the category of supervised learning algorithms and work for both continuous and categorical output variables.

Assumptions of the decision tree: for information gain, attributes are assumed to be categorical; for the Gini index, attributes are assumed to be continuous.
A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data. A decision tree utilizes an if-then rule set which is mutually exclusive and exhaustive for classification. The rules are learned sequentially, using the training data one rule at a time; each time a rule is learned, the tuples covered by the rule are removed. This process continues on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner. All the attributes should be categorical; otherwise, they should be discretized in advance. Attributes at the top of the tree have more impact on the classification, and they are identified using the information gain concept. A decision tree can easily be over-fitted, generating too many branches that may reflect anomalies due to noise or outliers.
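A sketch of fitting such a tree on the churn data (max_depth is an assumed cap added to limit the over-fitting mentioned above; X/y splits as prepared earlier):

# Decision tree classifier with a depth cap to curb over-fitting (sketch).
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",  # split on information gain, as described above
    max_depth=5,          # assumed cap to limit over-fitting
    random_state=42,
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))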
MODULE DIAGRAM
Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries: it has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions. However, Flask supports extensions that can add application features as if they were implemented in Flask itself.
When Ronacher and Georg Brandl created a bulletin board system written in Python, the Pocoo projects Werkzeug and Jinja were developed. In April 2016, the Pocoo team was disbanded and development of Flask and related libraries passed to the newly formed Pallets project.
Flask has become popular among Python enthusiasts. As of October 2020, it had the second most stars on GitHub among Python web-development frameworks, only slightly behind Django.
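To show how Flask could expose the trained churn model, here is a minimal hedged sketch (the route name, input format and pickled model file are assumptions, not the report's actual app):

# Minimal Flask app serving churn predictions (illustrative sketch).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("churn_model.pkl", "rb") as f:    # hypothetical pickled model file
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()            # e.g. {"features": [619, 42, 0.0]}
    prediction = model.predict([payload["features"]])[0]
    return jsonify({"exited": int(prediction)})

if __name__ == "__main__":
    app.run(debug=True)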