
Submitted to:

Dr. Riya Kukreti
Dr. Vineet Kumar Saini

Assistant Professor

CSE Department

UIT, Uttaranchal University


DECLARATION

I, SHIVA BHANU PRASAD ARLAPUDI, hereby declare that the Project
Report entitled "PREDICTION OF BANK CUSTOMER CHURN USING
MACHINE LEARNING TECHNIQUE", done by me under guidance, is
submitted in partial fulfillment of the requirements for the award of the
Bachelor of Engineering degree in Computer Science and Engineering.

DATE: SHIVA BHANU PRASAD
Place: Dehradun ROLL NO: 2301010131

Examined by:
(signature)
Mr. Vineet Kumar Saini
Dr. Madhu Kirola (Head of Department)

Next, I would like to tender my sincere thanks to Dr. Madhu Kirola (Head of the Computer
Science & Engineering Department) for her co-operation and encouragement.

A.SHIVA BHANU PRASAD

UNI ROLL NO:2301010131


Abstract

Nowadays there are many service providers in every business, and customers face
no shortage of options. In the banking sector in particular, people who want to keep
their money safe have many choices. As a result, customer churn and customer
loyalty have become a major problem for most banks. In this paper, a method that
predicts customer churn in banking using machine learning together with an ANN is
proposed. This research explores the likelihood of churning through customer loyalty.

The logistic regression, random forest, decision tree and naive Bayes machine
learning algorithms are used in this study. Keras and TensorFlow are also used to
build the ANN. The study is carried out on a dataset called Churn Modelling,
collected from Kaggle. The results are compared to find the most appropriate model,
i.e. the one with the highest accuracy. The Random Forest algorithm achieved the
highest accuracy, at nearly 87%. The lowest accuracy, 78.59%, was achieved by the
Decision Tree algorithm.

The number of service providers is increasing very rapidly in every business. These
days, there is no shortage of options for customers in the banking sector when
choosing where to put their money. As a result, customer churn and engagement
have become one of the top issues for most banks. In this project, a method to
predict customer churn in a bank using machine learning techniques, a branch of
artificial intelligence, is proposed. The research promotes the exploration of the
likelihood of churn by analyzing customer behavior. Customer churn has become a
major problem in all industries, including banking, and banks have always tried to
track customer interactions so that they can detect the customers who are likely to
leave the bank.

TABLE OF CONTENTS

CHAPTER NO.  TITLE                                   PAGE NO.
             ABSTRACT                                5
             TABLE OF CONTENTS                       6
             LIST OF FIGURES                         7
             LIST OF SYMBOLS                         7
01           INTRODUCTION                            11
02           LITERATURE SURVEY                       15
03           METHODOLOGY                             18
             3.1 OBJECTIVE                           20
             3.2 LIST OF MODULES                     21
             3.3 SYSTEM ARCHITECTURE                 23
04           RESULT AND DISCUSSION,                  25
             PERFORMANCE ANALYSIS
             4.1 FEATURES                            48
             4.2 CODE                                61
05           CONCLUSION AND FUTURE WORK              85

LIST OF FIGURES

S.NO  TITLE                 PAGE NO.
01    SYSTEM ARCHITECTURE   22
02    WORKFLOW DIAGRAM      23
03    ER-DIAGRAM            25
04    MODULE DIAGRAM        29

1. INTRODUCTION

Churn means a customer leaving one company and transferring to another. It causes
not only a loss of income but also other negative effects on operations. Customer
relationship management is therefore very important for banks: by establishing
long-term relationships with customers, they can grow their customer base. The
service provider's challenges lie in customer behaviour and expectations. The current
generation is better educated than previous ones, so customers expect more policies
and have diverse demands for connectivity and innovation. This advanced knowledge
is changing purchase behaviour, and it is a big challenge for today's service providers
to think innovatively enough to meet these expectations.

Private-sector banks need to recognise their customers. Liu and Shih strengthen this
argument by indicating that companies face increasing pressure to develop new and
innovative marketing ideas in order to meet customer expectations and increase
loyalty and retention. It is very easy for customers to transfer their relationship from
one bank to another. Some customers keep their relationship effectively null, leaving
their account inactive; an inactive account may mean the customer is moving their
relationship to another bank. Banks serve different types of customers. Farmers are
among the major customers and expect low monthly charges, as they are financially
constrained. Business people are also major and important customers, because most
large transactions are usually made by them, and they expect better service quality.
One of the most important categories is middle-class customers, who outnumber the
other types in most banks; they expect lower monthly charges, better service quality,
and new policies.

So, serving these different types of customers is not easy. Banks need to understand
customers and their needs, delivering reliable service on time and within budget,
while maintaining a good working relationship with them is another significant
challenge. If they fail to resolve these challenges, churn may follow. Recruiting a new
customer is more expensive and harder than keeping an existing one; retention, on
the other hand, is usually cheaper because the bank has already gained the
confidence and loyalty of present customers. A system that can predict customer
churn effectively at an early stage is therefore very important for any bank. This
paper proposes a framework that can predict customer churn in the banking sector
using machine learning algorithms together with an ANN.

Existing System:

Predicting player behaviour and customer churn is one of the central and most
common challenges in game analytics. A crucial stage in developing a customer
churn prediction model is feature engineering. In the mobile gaming field, features
are commonly constructed from raw behavioural telemetry data, which leads to
challenges in establishing meaningful features and comprehensible feature
frameworks. This research proposes an extended Recency, Frequency, and Monetary
value (RFM) feature framework for churn prediction in the mobile gaming field by
incorporating features related to user Lifetime, Intensity and Rewards (RFMLIR).
The proposed framework is verified by exploring behavioural differences between
churners and non-churners within the established framework for different churn
definitions and definition groups, applying robust exploratory methods and
developing univariate and multivariate churn prediction models. Although feature
importance varies among churn definitions, the long-term frequency feature stands
out as the most important. The top five most important features distinguished by the
multivariate churn prediction models include the long- and short-term frequency
features, monetary value, intensity and lifetime.

Disadvantages:

● Only feature engineering and analysis were done; no practical model was developed.
● No predictive AI model was built.

MACHINE LEARNING

Machine learning is used to predict the future from past data. Machine learning
(ML) is a type of artificial intelligence (AI) that gives computers the ability to learn
without being explicitly programmed. It focuses on the development of computer
programs that can change when exposed to new data; this report covers the basics of
machine learning and the implementation of a simple machine learning algorithm
using Python. The process of training and prediction involves specialized algorithms:
the training data are fed to an algorithm, which uses them to make predictions on
new test data. Machine learning can be roughly separated into three categories:
supervised learning, unsupervised learning and reinforcement learning. Supervised
learning programs are given both the input data and the corresponding labels, so the
data has to be labeled by a human being beforehand. In unsupervised learning there
are no labels; only the input data is provided to the learning algorithm, which has to
figure out the clustering of the input data on its own. Finally, reinforcement learning
dynamically interacts with its environment and receives positive or negative feedback
to improve its performance.
Data scientists use many different kinds of machine learning algorithms to
discover patterns in data that lead to actionable insights. At a high level, these
algorithms can be classified into two groups based on the way they "learn" about
data to make predictions: supervised and unsupervised learning. Classification is the
process of predicting the class of given data points; classes are sometimes called
targets, labels or categories. Classification predictive modeling is the task of
approximating a mapping function from input variables (X) to discrete output
variables (y). In machine learning and statistics, classification is a supervised learning
approach in which the computer program learns from the data input given to it and
then uses this learning to classify new observations. The data set may simply be
bi-class (like identifying whether a person is male or female, or whether a mail is
spam or not) or it may be multi-class. Some examples of classification problems are
speech recognition, handwriting recognition, biometric identification and document
classification.

The majority of practical machine learning uses supervised learning. In
supervised learning you have input variables (X) and an output variable (y), and
you use an algorithm to learn the mapping function from input to output,
y = f(X). The goal is to approximate the mapping function so well that when you
have new input data (X) you can predict the output variable (y) for that data.
Supervised machine learning techniques include logistic regression, multi-class
classification, decision trees and support vector machines. Supervised learning
requires that the data used to train the algorithm is already labeled with correct
answers. Supervised learning problems can be further grouped into classification
and regression problems; the goal is the construction of a succinct model that can
predict the value of the dependent attribute from the attribute variables. The
difference between the two tasks is that the dependent attribute is numerical for
regression and categorical for classification. A classification model attempts to
draw a conclusion from observed values: given one or more inputs, it tries to
predict the value of one or more outcomes. A classification problem is one where
the output variable is a category, such as "red" or "blue".
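As a minimal sketch of this y = f(X) idea, the snippet below trains a classifier on synthetic labeled data (a stand-in, not the project's churn dataset) and predicts labels for held-out inputs:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for labeled data: X holds features, y the class labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)  # learn the mapping y = f(X)
print(model.predict(X_test[:5]))                    # predicted y for new inputs X
print(model.score(X_test, y_test))                  # accuracy on held-out data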

Preparing the Dataset :

This dataset contains 10000 records of features extracted from Bank Customer
Data, which were then classified into 2 classes:

● Exit

● Not Exit
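A minimal loading sketch is shown below; the file name Churn_Modelling.csv and the Exited target column are assumptions based on the common Kaggle release of this dataset and may differ in other copies.

import pandas as pd

# File and column names assume the Kaggle churn-modelling release.
df = pd.read_csv("Churn_Modelling.csv")
X = df.drop(columns=["Exited"])   # customer features
y = df["Exited"]                  # 1 = Exit, 0 = Not Exit
print(df.shape)                   # expected (10000, 14) for the full dataset
print(y.value_counts())           # balance between the two classes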
Proposed System:

The proposed method builds a bank customer churn prediction system using
machine learning techniques. To develop an AI-based model we need data to train
it; the Bank Customer dataset is used for training. To use this dataset, we need to
understand the intents we are going to train on. An intent is the intention of the user
interacting with a predictive model, i.e. the intention behind each data item the
model receives from a particular user. Depending on the domain for which you are
developing an AI solution, these intents may vary from one solution to another. The
strategy is to define different intents, create training samples for those intents, and
train the AI model with those samples as training data and the intents as training
categories. The model is built using vectorisation, where the data is converted into
vectors the model can understand. By trying different algorithms we can obtain a
better AI model with the best accuracy. After building a model, we evaluate it using
different metrics such as the confusion matrix, precision, recall, sensitivity and F1
score.
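As an illustrative sketch of the vectorisation step, the snippet below converts categorical customer attributes into numeric vectors; the Gender and Geography column names are assumptions from the Kaggle churn-modelling dataset, not taken from this report.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("Churn_Modelling.csv")
# "Gender" and "Geography" are assumed Kaggle column names.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])        # Female/Male -> 0/1
df = pd.get_dummies(df, columns=["Geography"], drop_first=True)  # one-hot vectors
print(df.head())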

Architecture of Proposed model

2. LITERATURE SURVEY

A literature review is a body of text that aims to review the critical points of
current knowledge on, and/or methodological approaches to, a particular topic. It is
a secondary source: it discusses published information in a particular subject area,
and sometimes information in a particular subject area within a certain time period.
Its ultimate goal is to bring the reader up to date with the current literature on a
topic; it forms the basis for another goal, such as identifying future research that may
be needed in the area, and it often precedes a research proposal. It may be just a
simple summary of sources, but usually it has an organizational pattern and
combines both summary and synthesis.
A summary is a recap of the important information in a source, while a
synthesis is a reorganization, a reshuffling, of that information. It might give a new
interpretation of old material or combine new with old interpretations, or it might
trace the intellectual progression of the field, including major debates. Depending on
the situation, the literature review may evaluate the sources and advise the reader on
the most pertinent or relevant of them.

A comparison of machine learning techniques for customer churn prediction
(Praveen Asthana, 2018). We present a comparative study on the most popular
machine learning methods applied to the challenging problem of customer churn
prediction in the telecommunications industry. In the first phase of our experiments,
all models were applied and evaluated using cross-validation on a popular
public-domain dataset. In the second phase, the performance improvement offered
by boosting was studied. In order to determine the most efficient parameter
combinations, we performed a series of Monte Carlo simulations for each method
over a wide range of parameters. Our results demonstrate a clear superiority of the
boosted versions of the models over the plain (non-boosted) versions. The best
overall classifier was SVM-POLY using AdaBoost, with accuracy of almost 97% and
an F-measure over 84%.

Customer Churn Analysis in Banking Sector (G. Jignesh Chowdary, Suganya G.,
Premalatha M., 2019). The role of ICT in the banking sector is a crucial part of the
development of nations, and the development of the banking sector mostly depends
on its valuable customers. So, customer churn ... their perceived quality by way of
giving timely and quality service to their customers. Customer churn has become one
of the primary challenges that many firms face nowadays. Several churn prediction
models and techniques have been proposed in the literature to predict customer
churn in areas such as finance, telecom and banking. Researchers are also working
on customer churn prediction in e-commerce using data mining and machine
learning techniques. In this paper, a comprehensive review of various models to
predict customer churn in e-commerce using data mining and machine learning
techniques is presented. A critical review of recent research papers in the field of
customer churn prediction in e-commerce using data mining has been done.
Thereafter, important inferences and research gaps identified after studying the
literature are presented. Finally, the research significance and concluding remarks
are described at the end.

Bank customer retention prediction and customer ranking based on deep neural
networks (Dr. A. P. Jagadeesan, Ph.D., 2020). Retention of customers is a major
concern in any industry, and customer churn is an important metric that gives the
hard truth about the retention percentage of customers. A detailed study of the
existing models for predicting customer churn is made, and a new model based on
an artificial neural network is proposed to find customer churn in the banking
domain. The proposed model is compared with existing machine learning models:
logistic regression, decision tree and random forest are the baseline models used for
comparison, and the performance metrics compared are accuracy, precision, recall
and F1 score. It has been observed that the artificial neural network model performs
better than the logistic regression and decision tree models, but when the results are
compared with the random forest model, no considerable difference is noted. The
proposed model differs from the existing models in that it can rank customers in the
order in which they would leave the organization.

3. METHODOLOGY
This section explains the various works that have been done in order to
predict customer churn using machine learning models. In addition to the
conventional data used for predicting customer churn, the authors have added data
from various sources: customers' phone conversations, the websites and products
the customer has viewed, interactive voice data and other financial data. A binary
classification model is used for predicting customer churn. Although a good
improvement is achieved with this model, the data it uses is not commonly available
at all times. Churn prediction is a binary classification problem; the authors note
that, from the studies, there is no proper means of measuring the certainty of the
classifier employed for churn prediction, and the accuracy of the classifiers differs
for different zones of the dataset.

Project Goals

Exploration data analysis of variable identification

● Loading the given dataset


● Import required library packages
● Analyze the general properties
● Find duplicate and missing values
● Checking unique and count values

Uni-variate data analysis

● Rename, add data and drop the data


● To specify data type

Exploratory data analysis of bi-variate and multivariate

● Plot pairplot, heatmap, bar chart and histogram diagrams


Method of Outlier detection with feature engineering

● Pre-processing the given dataset


● Splitting the test and training dataset
● Comparing the decision tree, logistic regression and random forest models,
among others

Comparing algorithms to predict the result

● Based on the best accuracy

Objectives
The goal is to develop a machine learning model for bank churn prediction that
could potentially replace updatable supervised machine learning classification
models, predicting the result in the form of the best accuracy by comparing
supervised algorithms.

Scope of the Project


Here the scope of the project is that the integration of bank support with
computer-based records could enhance bank safety, decrease customer churn, and
improve bank customer support. This suggestion is promising, as data modeling and
analysis tools, e.g., data mining, have the potential to generate a knowledge-rich
environment which can help to significantly improve the quality of bank support.

Feasibility study:
Data Wrangling
In this section of the report we load the data, check it for cleanliness, and
then trim and clean the given dataset for analysis. We make sure to document the
steps carefully and justify our cleaning decisions.

Data collection: The data set collected for predicting the given data is split into a
training set and a test set. Generally, a 7:3 ratio is applied to split them. The data
model is created by applying different algorithms on the training set, and, based on
the resulting accuracy, prediction on the test set is done.
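A sketch of the 7:3 split with scikit-learn, assuming X and y are the feature matrix and churn labels prepared earlier:

from sklearn.model_selection import train_test_split

# test_size=0.3 gives the 7:3 train/test ratio described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # roughly 7000 and 3000 of 10000 records

Fixing random_state makes the split reproducible across runs, which keeps later model comparisons fair.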

Preprocessing
The data which was collected might contain missing values that may lead to
inconsistency. To gain better results, the data needs to be preprocessed so as to
improve the efficiency of the algorithm.

Functional requirements:

The software requirements specification is a technical specification of
requirements for the software product. It is the first step in the requirements analysis
process and lists the requirements of a particular software system. The
implementation relies on special libraries such as scikit-learn, pandas, NumPy,
Matplotlib and seaborn.

Non-Functional Requirements:

Process of functional steps:

1. Problem definition
2. Preparing data
3. Evaluating algorithms
4. Improving results
5. Predicting the result

Environmental Requirements:

1. Software Requirements :

Operating System : Windows

Tool : Anaconda with Jupyter Notebook


SYSTEM ARCHITECTURE

Workflow diagram

Use case diagrams are considered for high-level requirement analysis of a system:
when the requirements of a system are analyzed, the functionalities are captured in
use cases. So it can be said that use cases are nothing but the system functionalities
written in an organized manner.
Class Diagram

A class diagram is basically a graphical representation of the static view of the
system and represents different aspects of the application; a collection of class
diagrams represents the whole system. The name of the class diagram should be
meaningful and describe the aspect of the system. Each element and their
relationships should be identified in advance. The responsibility (attributes and
methods) of each class should be clearly identified, and for each class the minimum
number of properties should be specified, because unnecessary properties will make
the diagram complicated. Use notes whenever required to describe some aspect of
the diagram, and at the end of the drawing it should be understandable to the
developer/coder. Finally, before making the final version, the diagram should be
drawn on plain paper and reworked as many times as possible to make it correct.

Entity Relationship Diagram (ERD)

An entity relationship diagram (ERD), also known as an entity relationship
model, is a graphical representation of an information system that depicts the
relationships among people, objects, places, concepts or events within that system.
An ERD is a data modeling technique that can help define business processes and
can be used as the foundation for a relational database. Entity relationship diagrams
provide a visual starting point for database design and can also be used to help
determine information system requirements throughout an organization. After a
relational database is rolled out, an ERD can still serve as a reference point, should
any debugging or business process re-engineering be needed later.

4. RESULT AND DISCUSSION

Data Pre-processing

Validation techniques in machine learning are used to estimate the error rate of the
machine learning (ML) model, which can be considered close to the true error rate on the
underlying population. If the data volume is large enough to be representative of the
population, validation techniques may not be needed. However, in real-world scenarios we
work with samples of data that may not be truly representative of the population, so we
check for missing values, duplicate values and the data type of each field, whether float or
integer. A validation sample of the data is used to provide an unbiased evaluation of a
model fit on the training dataset while tuning model hyperparameters.

The evaluation becomes more biased as skill on the validation dataset is incorporated
into the model configuration. The validation set is used to evaluate a given model frequently,
and machine learning engineers use this data to fine-tune the model hyperparameters. Data
collection, data analysis, and the process of addressing data content, quality, and structure
can add up to a time-consuming to-do list. During the process of data identification, it helps
to understand your data and its properties; this knowledge will help you choose which
algorithm to use to build your model.

A number of different data cleaning tasks can be done using Python's pandas library;
here we focus on probably the biggest data cleaning task, missing values, so that we are able
to clean data more quickly. We want to spend less time cleaning data and more time
exploring and modeling.

Some of these sources are just simple random mistakes; other times there can be a
deeper reason why data is missing. It is important to understand these different types of
missing data from a statistics point of view: the type of missing data will influence how to
fill in the missing values, how to detect them, and which basic imputation or more detailed
statistical approach to use for dealing with them. Before jumping into code, it is important
to understand the sources of missing data. Here are some typical reasons why data is
missing:

● User forgot to fill in a field.

● Data was lost while transferring manually from a legacy database.

● There was a programming error.

● Users chose not to fill out a field tied to their beliefs about how the results would be used
or interpreted.
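A basic imputation sketch for such cases, assuming df is the churn dataframe loaded earlier: numeric gaps are filled with the column median, categorical gaps with the most frequent value.

# df is assumed to be the churn dataframe loaded earlier.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])   # categorical: use the mode
    else:
        df[col] = df[col].fillna(df[col].median())    # numeric: use the median
print(df.isnull().sum().sum())  # should now be 0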

Variable identification with Uni-variate, Bi-variate and Multi-variate analysis:

Import libraries for access and functional purposes and read the given dataset

General Properties of Analyzing the given dataset

Display the given dataset in the form of data frame

show columns

shape of the data frame

To describe the data frame

Checking data type and information about dataset

Checking for duplicate data

Checking Missing values of data frame

Checking unique values of data frame
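These inspection steps map directly onto pandas calls; a sketch is given below, with the file name assumed from the Kaggle release:

import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")  # read the given dataset

print(df.head())              # display the data frame
print(df.columns)             # show columns
print(df.shape)               # shape of the data frame
print(df.describe())          # describe the data frame
df.info()                     # data types and information about the dataset
print(df.duplicated().sum())  # checking for duplicate data
print(df.isnull().sum())      # checking missing values
print(df.nunique())           # checking unique values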

Exploration data analysis of visualization

Data visualization is an important skill in applied statistics and machine learning.
Statistics does indeed focus on quantitative descriptions and estimations of data; data
visualization provides an important suite of tools for gaining a qualitative understanding.

This can be helpful when exploring and getting to know a dataset, and can help with
identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge,
data visualizations can be used to express and demonstrate key relationships in plots and
charts that are more visceral to yourself and to stakeholders than measures of association or
significance. Data visualization and exploratory data analysis are whole fields in themselves,
and a deeper dive into some of the books mentioned at the end is recommended.

Sometimes data does not make sense until it can be looked at in a visual form, such
as charts and plots. Being able to quickly visualize data samples is an important skill both in
applied statistics and in applied machine learning. This section covers the many types of
plots that you will need to know when visualizing data in Python and how to use them to
better understand your own data: how to chart time series data with line plots and
categorical quantities with bar charts, and how to summarize data distributions with
histograms and box plots.
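A sketch of these plot types on the churn dataframe df; the Age, Balance, Exited and Geography columns are assumptions from the Kaggle dataset:

import matplotlib.pyplot as plt
import seaborn as sns

# Column names are assumed from the Kaggle churn-modelling dataset.
df["Age"].plot(kind="hist", bins=30, title="Age distribution")   # histogram
plt.show()

sns.boxplot(x="Exited", y="Balance", data=df)   # box plot per churn class
plt.show()

df["Geography"].value_counts().plot(kind="bar", title="Customers per country")
plt.show()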

MODULE DIAGRAM

GIVEN INPUT AND EXPECTED OUTPUT

input: data
output: visualized data

Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique used to convert raw data into a clean data set.
In other words, whenever data is gathered from different sources it is collected in a raw
format which is not feasible for analysis. To achieve better results from the applied model in
machine learning, the data has to be in a proper format. Some machine learning models
need information in a specified format; for example, the random forest algorithm does not
support null values, so to execute the random forest algorithm null values have to be
managed in the original raw data set. Another aspect is that the data set should be formatted
in such a way that more than one machine learning and deep learning algorithm can be
executed on it.

In the next section you will discover exactly how you can do that in Python with scikit-learn.
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm
is evaluated in the same way on the same data; this can be achieved by forcing each
algorithm to be evaluated on a consistent test harness.

In the example below 4 different algorithms are compared:

Logistic Regression

Random Forest

Decision Tree Classifier

Naive Bayes

The K-fold cross-validation procedure is used to evaluate each algorithm, importantly
configured with the same random seed, to ensure that the same splits of the training data
are performed and that each algorithm is evaluated in precisely the same way. Before
comparing algorithms, we build machine learning models using the scikit-learn library:
preprocessing, a linear model with the logistic regression method, cross-validation with the
KFold method, an ensemble with the random forest method, and a tree with the decision
tree classifier. Additionally, we split the train and test sets and predict the result by
comparing accuracy.
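A sketch of this comparison, assuming X_train and y_train come from the earlier split and all features are already numeric:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier()),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Naive Bayes", GaussianNB()),
]
# Same folds and same seed for every algorithm, so the comparison is fair.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models:
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")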

Prediction result by accuracy:


The logistic regression algorithm also uses a linear equation with independent predictors
to predict a value. The predicted value can be anywhere between negative infinity and
positive infinity, but we need the output of the algorithm to be a class variable. The model
predicting the result with the highest accuracy is identified by comparing accuracies; in this
section the logistic regression model is evaluated.

True Positive Rate(TPR) = TP / (TP + FN)

False Positive rate(FPR) = FP / (FP + TN)

Accuracy: The proportion of the total number of predictions that are correct; in other
words, how often the model correctly predicts defaulters and non-defaulters overall.

Accuracy calculation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly
predicted observations to the total observations. One may think that if we have high
accuracy then our model is the best. Accuracy is a great measure, but only when you have
symmetric datasets where the numbers of false positives and false negatives are almost
the same.

Precision: The proportion of positive predictions that are actually correct.

Precision = TP / (TP + FP)

Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations. The question this metric answers is: of all customers labeled as
churned, how many actually churned? High precision relates to a low false positive rate.
We got 0.788 precision, which is pretty good.

Recall: The proportion of positive observed values correctly predicted. (The proportion of
actual defaulters that the model will correctly predict)

Recall = TP / (TP + FN)

Recall (sensitivity) is the ratio of correctly predicted positive observations to all
observations in the actual positive class.

F1 Score is the weighted average of Precision and Recall; therefore, this score takes both
false positives and false negatives into account. Intuitively it is not as easy to understand as
accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives have similar cost; if
the costs of false positives and false negatives are very different, it is better to look at both
Precision and Recall.

General Formula:

F-Measure = 2TP / (2TP + FP + FN)

F1 Score Formula:

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
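These metrics can be computed with scikit-learn; the sketch below assumes the earlier train/test split and uses a random forest as the fitted classifier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]] for binary labels
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))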


ALGORITHM AND TECHNIQUES

Algorithm Explanation

In machine learning and statistics, classification is a supervised learning approach in
which the computer program learns from the data input given to it and then uses this
learning to classify new observations. The data set may simply be bi-class (like identifying
whether a person is male or female, or whether a mail is spam or not) or it may be
multi-class. Some examples of classification problems are speech recognition, handwriting
recognition, biometric identification and document classification. In supervised learning,
algorithms learn from labeled data; after understanding the data, the algorithm determines
which label should be given to new data by associating the learned patterns with the
unlabeled new data.

Used Python Packages:

sklearn:
● In Python, sklearn is a machine learning package which includes a lot of ML
algorithms.
● Here, we are using some of its modules like train_test_split,
DecisionTreeClassifier, LogisticRegression and accuracy_score.

NumPy:
● It is a numeric Python module which provides fast maths functions for
calculations.
● It is used to read data in NumPy arrays and for manipulation purposes.

Pandas:
● Used to read and write different files.
● Data manipulation can be done easily with data frames.

Matplotlib:
● Data visualization is a useful way to help with identifying patterns from a
given dataset.
Logistic Regression

It is a statistical method for analysing a data set in which there are one or more
independent variables that determine an outcome. The outcome is measured with a
dichotomous variable (in which there are only two possible outcomes). The goal of logistic
regression is to find the best fitting model to describe the relationship between the
dichotomous characteristic of interest (dependent variable = response or outcome variable)
and a set of independent (predictor or explanatory) variables. Logistic regression is a Machine
Learning classification algorithm that is used to predict the probability of a categorical
dependent variable. In logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

In other words, the logistic regression model predicts P(Y=1) as a function of X.

Logistic regression assumptions:

● Binary logistic regression requires the dependent variable to be binary.

● For a binary regression, the factor level 1 of the dependent variable should represent
the desired outcome.

● Only the meaningful variables should be included.

● The independent variables should be independent of each other; that is, the model
should have little or no multicollinearity.

● The independent variables are linearly related to the log odds.

● Logistic regression requires quite large sample sizes.
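A fitting sketch, assuming the earlier train/test split; predict_proba exposes the P(Y=1) described above:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = logreg.predict_proba(X_test)[:, 1]  # P(Y=1 | X), the churn probability
pred = (proba >= 0.5).astype(int)           # threshold into the two classes
print(proba[:5], pred[:5])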

The following are the basic steps involved in performing the random forest algorithm:

Pick N random records from the dataset.

Build a decision tree based on these N records.

Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

In the case of a regression problem, for a new record, each tree in the forest predicts a value
for Y (output), and the final value is calculated by taking the average of all the values
predicted by all the trees in the forest. In the case of a classification problem, each tree in
the forest predicts the category to which the new record belongs, and the new record is
finally assigned to the category that wins the majority vote.
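A sketch with scikit-learn's implementation, assuming the earlier train/test split; the parameter values are illustrative:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample; the classification output is
# the majority vote across trees, as described above.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy of the majority vote
print(rf.feature_importances_)   # relative importance of each feature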

MODULE DIAGRAM

GIVEN INPUT AND EXPECTED OUTPUT

input: data
output: accuracy

Decision Tree Classifier
It is one of the most powerful and popular algorithms. Decision-tree algorithms fall under
the category of supervised learning algorithms, and they work for both continuous and
categorical output variables. Assumptions of the decision tree:

● At the beginning, we consider the whole training set as the root.

● For information gain, attributes are assumed to be categorical; for the Gini index,
attributes are assumed to be continuous.

● On the basis of attribute values, records are distributed recursively.

● We use statistical methods for ordering attributes as the root or an internal node.

Decision tree builds classification or regression models in the form of a tree structure.
It breaks down a data set into smaller and smaller subsets while, at the same time, an
associated decision tree is incrementally developed. A decision node has two or more
branches, and a leaf node represents a classification or decision. The topmost decision node
in a tree, which corresponds to the best predictor, is called the root node. Decision trees can
handle both categorical and numerical data. A decision tree utilizes an if-then rule set which
is mutually exclusive and exhaustive for classification. The rules are learned sequentially
using the training data, one at a time; each time a rule is learned, the tuples covered by the
rule are removed. This process continues on the training set until a termination condition is
met. The tree is constructed in a top-down, recursive, divide-and-conquer manner. All the
attributes should be categorical; otherwise, they should be discretized in advance. Attributes
at the top of the tree have more impact on the classification, and they are identified using
the information gain concept. A decision tree can easily be over-fitted, generating too many
branches, and may reflect anomalies due to noise or outliers.
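A sketch with scikit-learn's implementation, again assuming the earlier train/test split; limiting the depth is one simple guard against the over-fitting mentioned above:

from sklearn.tree import DecisionTreeClassifier

# max_depth caps the number of branches the tree can grow.
dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print(dt.score(X_test, y_test))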

MODULE DIAGRAM

Flask is a micro web framework written in Python. It is classified as a microframework
because it does not require particular tools or libraries.

It has no database abstraction layer, form validation, or any other components where
pre-existing third-party libraries provide common functions.

However, Flask supports extensions that can add application features as if they were
implemented in Flask itself.

Extensions exist for object-relational mappers, form validation, upload handling,


various open authentication technologies and several common framework related tools.
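As a minimal sketch of how a Flask application could serve the fitted churn model, the snippet below is illustrative only: the /predict route, the "features" JSON field and the churn_model.pkl path are assumptions, not taken from this report.

from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("churn_model.pkl")  # assumed path to a saved classifier

@app.route("/predict", methods=["POST"])
def predict():
    # "features" is an illustrative field name: a numeric feature vector
    # in the same order the model was trained on.
    features = request.get_json()["features"]
    prediction = int(model.predict([features])[0])
    return jsonify({"churn": prediction})

if __name__ == "__main__":
    app.run(debug=True)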

Flask was created by Armin Ronacher of Pocoo, an international group of Python


enthusiasts formed in 2004. According to Ronacher, the idea was originally an April Fool’s
joke that was popular enough to make into a serious application. The name is a play on the
earlier Bottle framework.

When Ronacher and Georg Brandl created a bulletin board system written in Python,
the Pocoo projects Werkzeug and Jinja were developed.

In April 2016, the Pocoo team was disbanded and development of Flask and related
libraries passed to the newly formed Pallets project.

Flask has become popular among Python enthusiasts. As of October 2020, it had the
second-most stars on GitHub among Python web-development frameworks, only slightly
behind Django.