Heart Disease Prediction System using ensemble
of Machine Learning Algorithms
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
Submitted to:
Dr Meenu Gupta
Submitted By:
Manish Raj
18BCS2216
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Chandigarh University, Gharuan
Abstract: Cardiovascular disease causes nearly 17.5 million deaths worldwide each year.
India currently has more than 30 million heart patients. Careless attitudes towards health
can lead to a variety of illnesses, some of them life threatening. The healthcare industry
generates large amounts of data, but this data is often not used effectively. The generated
images, sound, text, and files contain hidden patterns and relationships, yet tools that
extract knowledge from these databases for clinical diagnosis or other purposes are still
uncommon. A system that assesses a person’s risk from their medical history and alerts
them would therefore be helpful. This study uses machine learning (ML) algorithms to
predict a person’s risk of heart disease from several characteristics of their medical
history. Input features such as gender, cholesterol, blood pressure, TTH, and stress are
used to predict the patient’s risk of heart disease. Data mining (DM) techniques such as
Naive Bayes, decision trees, support vector machines, and logistic regression are applied
to a heart disease dataset, and the accuracy of the algorithms is measured and compared.
The output of the analysis is binary: 0 indicates that the individual is not at risk and 1
indicates that the individual is at risk. The system is served as a website built with Django.
Development of Overall Model
During this activity of feature-driven development, a software requirements specification
document was prepared to capture the requirements, and an ER diagram was designed. To
complete the activity, a domain object model was prepared along with the overall application
architecture.
Functional Specifications
This section lists the functional and non-functional requirements of the system, along with
the use cases and wireframes.
Functional Requirements:
The system allows users to predict heart disease.
The system allows users to create an account and login.
The system allows the users to update their profile and password.
The system provides login for admin.
The system allows the administrator to monitor and remove inappropriate datasets and
code.
Non-functional Requirements:
The website should be responsive and consistent across different screen sizes and
resolutions.
The website should give the user information about the different values used during the
prediction.
1.1 Problem statement:
Heart disease can be managed effectively with a combination of lifestyle changes, medicine and,
in some cases, surgery. With the right treatment, the symptoms of heart disease can be reduced
and the functioning of the heart improved. Predicted results can be used for prevention, which
reduces the cost of surgical treatment and other expensive procedures.
The overall objective of this work is to predict the presence of heart disease accurately using
few tests and attributes. The attributes considered form the primary basis for the tests and give
reasonably accurate results. Many more input attributes could be used, but the goal is to predict
the risk of heart disease quickly and with few attributes. Clinical decisions are often made
based on doctors’ intuition and experience rather than on the knowledge-rich data hidden in
datasets and databases. This practice leads to unwanted biases, errors, and excessive medical
costs, which affect the quality of service provided to patients.
Data mining holds great potential for the healthcare industry: it enables health systems to
systematically use data and analytics to identify inefficiencies and best practices that improve
care and reduce costs. According to Wurz & Takala (2006), the opportunities to improve care
and reduce costs concurrently could apply to as much as 30% of overall healthcare spending. The
successful application of data mining in highly visible fields like e-business, marketing, and
retail has led to its application in other industries and sectors. Healthcare is among the sectors
only now discovering it. The healthcare environment is still 'information rich' but 'knowledge
poor'. There is a wealth of data available within healthcare systems. However, there is a lack of
effective analysis tools to discover hidden relationships and trends in the data.
2. Literature Survey:
Waveform analysis, time-frequency analysis, neuro-fuzzy RBF ANNs, and Total Least
Squares-based Prony modeling are some of the techniques used in the literature to identify
heart disease. However, in a study by Marshall et al. (1991), classification accuracy with
such techniques was modest (up to 79%), leaving considerable room for improvement in
selecting an appropriate model. They also demonstrated the efficiency of neural networks in
diagnosing heart attacks (acute myocardial infarction) by comparing multiple neural
network classifiers, namely the multilayer perceptron and the Boltzmann perceptron
classifier. Most of these approaches relate to diagnosis, not to the understanding of
fundamental knowledge.
3. Methodology Used:
The dataset used in this project is the Heart Disease UCI dataset from Kaggle. To analyze
the dataset and fit models to it, we first import a few Python libraries.
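# Numerical and data-handling libraries
import numpy as np
import pandas as pd
# Plotting libraries
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
# Jupyter notebook magic that renders plots inline (remove when running as a script)
%matplotlib inline
import warnings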
In this project we experiment with three algorithms: KNeighborsClassifier,
DecisionTreeClassifier, and RandomForestClassifier.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
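The snippets in this report refer to a DataFrame named `df` whose loading step is not shown. The following is a minimal sketch of that step, not the project's actual code. The helper name `load_heart_data` is hypothetical, the column names are taken from the snippets in this report, and the two rows are made-up placeholders used only so the example runs without the real file.

```python
import io

import pandas as pd


def load_heart_data(source):
    """Read the heart disease CSV from a file path or a file-like object."""
    df = pd.read_csv(source)
    # The report's snippets predict a binary 'target' column, so check it exists.
    if "target" not in df.columns:
        raise ValueError("expected a 'target' column in the dataset")
    return df


# Made-up placeholder rows (NOT real patient data), using the report's column names.
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "50,1,0,120,200,0,1,150,0,1.0,1,0,2,1\n"
    "60,0,2,140,250,1,0,130,1,2.0,2,1,3,0\n"
)
df_sample = load_heart_data(sample_csv)
print(df_sample.shape)
print(df_sample["target"].value_counts())
```

With the real dataset, the same helper would be called with the path of the downloaded CSV file, and the result assigned to `df`.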
Truncation errors are committed when an iterative method is terminated or a mathematical
procedure is approximated, and the approximate solution differs from the exact solution.
Similarly, discretization induces a discretization error because the solution of the discrete
problem does not coincide with the solution of the continuous problem. For instance, if an
iterative root-finding method is stopped after 10 or so iterations and gives a root of roughly
1.99 when the exact root is 2, there is a truncation error of 0.01.
Once an error is generated, it will generally propagate through the calculation. For instance,
addition on a calculator (or a computer) is already inexact, so a calculation that chains
several additions is even more inexact.
The truncation error is created when a mathematical procedure is approximated. To
integrate a function exactly, one must sum infinitely many trapezoids, but numerically only
a finite number of trapezoids can be summed, hence the approximation of the mathematical
procedure. Similarly, to differentiate a function, the differential element approaches zero,
but numerically only a finite value of the differential element can be chosen.
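The trapezoid argument above can be illustrated with a short sketch. The integrand (sin x on [0, π], whose exact integral is 2) and the function name `trapezoid` are choices made here for illustration only. The truncation error shrinks as more trapezoids are used, but it never reaches zero for a finite number of them.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule using n equal-width trapezoids on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


exact = 2.0  # exact value of the integral of sin(x) from 0 to pi
errors = {}
for n in (4, 16, 64, 256):
    approx = trapezoid(math.sin, 0.0, math.pi, n)
    errors[n] = abs(exact - approx)
    print(f"n={n:4d}  approx={approx:.6f}  truncation error={errors[n]:.2e}")
```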
Experimental result analysis:
Using different classifiers
The following section describes the results obtained by applying different classifiers to the
heart disease dataset, using 10-fold cross-validation in WEKA version 3.8.4 (The University
of Waikato, Hamilton, New Zealand). The experiments were run on Windows 10 Pro with an
Intel® Core™ i5 CPU, 4 GB RAM, and a 64-bit operating system. The classifiers use the
software's default parameters unless otherwise specified in the sensitivity analysis section
of this paper.
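The results in this section were produced in WEKA. For readers working in Python, the same 10-fold cross-validation protocol can be sketched with scikit-learn as follows. This is an assumption-laden illustration: the data is synthetic (`make_classification`), the classifiers are scikit-learn counterparts with their own defaults (for example, `DecisionTreeClassifier` is CART, not J48), and the scores it prints are not the results of this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (13 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    # Scaling sits inside the pipeline so it is fitted on each training fold only.
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# 10 stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name:20s} mean accuracy = {cv_accuracy[name]:.3f}")
```

Keeping the scaler inside the pipeline is a deliberate choice in this sketch: it avoids using test-fold statistics when standardizing the training folds.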
Parameter’s sensitivity
We examine parameter sensitivity in two ways. First, we vary the pruning confidence factor
of the Decision tree J48 classifier, where a smaller value gives more pruning, and study the
resulting changes in accuracy, kappa statistic, mean absolute error (MAE), and relative
absolute error (RAE). Decision tree J48 was used for this analysis because it had the highest
accuracy of all the classifiers. Second, we vary the training set size for the Naïve Bayes
classifier and observe how its classification accuracy changes with the proportion of training
samples relative to the total samples. Naïve Bayes was selected as an example of a
low-accuracy classifier, to see how its performance changes with the training sample size.
In each case, the parameter starts at its default value and is then changed to study the effect
on classifier performance.
Decision tree J48 pruning confidence factor (PCF)
Pruning is one of the characteristics of the Decision tree J48 classifier, and the Pruning
Confidence Factor (PCF) is one of its parameters: a smaller value means more pruning. The
value used for the classifier comparison in the previous section was PCF = 0.25.
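scikit-learn has no J48 classifier and no confidence-factor parameter, so the sketch below swaps in a different technique: cost-complexity pruning of a CART tree through `ccp_alpha`. Note that the direction is reversed relative to PCF, since a larger `ccp_alpha` means more pruning. The data is synthetic, the alpha values are arbitrary, and the MAE is computed here as the mean absolute difference between the true label and the predicted probability of class 1, which is an assumption about how to mirror the WEKA metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (13 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

pruning_results = []
for alpha in (0.0, 0.005, 0.01, 0.02, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    # Out-of-fold class probabilities from 10-fold cross-validation.
    proba = cross_val_predict(tree, X_demo, y_demo, cv=cv, method="predict_proba")
    y_pred = proba.argmax(axis=1)  # classes are 0 and 1, so the index is the label
    pruning_results.append(
        {
            "ccp_alpha": alpha,
            "accuracy": accuracy_score(y_demo, y_pred),
            "kappa": cohen_kappa_score(y_demo, y_pred),
            "mae": float(np.mean(np.abs(y_demo - proba[:, 1]))),
            # Tree size when fitted on all the data: shrinks as pruning increases.
            "leaves": tree.fit(X_demo, y_demo).get_n_leaves(),
        }
    )

for row in pruning_results:
    print(row)
```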
Naïve Bayes
In this section, we use a percentage split into training and test sets for the Naïve Bayes
classifier, instead of 10-fold cross-validation, and change the percentage of training
samples to study the changes in classifier accuracy. Table 4 shows the result of these changes.
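A training-size sensitivity study of this kind can be sketched in scikit-learn as shown below. This is only an illustration under stated assumptions: it uses `GaussianNB` as the Naïve Bayes variant, synthetic data, and training fractions chosen here for the example. It does not reproduce the values in Table 4.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the heart disease data (13 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

split_accuracy = {}
for train_fraction in (0.5, 0.6, 0.7, 0.8, 0.9):
    # Stratified percentage split: train_fraction for training, the rest for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, train_size=train_fraction, stratify=y_demo, random_state=0
    )
    model = GaussianNB().fit(X_train, y_train)
    split_accuracy[train_fraction] = model.score(X_test, y_test)
    print(f"train={train_fraction:.0%}  test accuracy={split_accuracy[train_fraction]:.3f}")
```

Because the test set shrinks as the training fraction grows, accuracy measured at the largest fractions rests on few samples and can fluctuate noticeably.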
Feature extraction
Feature extraction was performed using the Classifier Subset Evaluator, which uses a
classifier on the training data to estimate the accuracy of candidate attribute subsets. It was
applied for all the classifiers used on the HD dataset, to measure the quality of the generated
subsets and to evaluate classification performance after selecting the relevant attributes for
each classification algorithm. The results are shown in Table 5, and a visual representation
is shown in Fig. 10.
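scikit-learn does not provide WEKA's Classifier Subset Evaluator. A related, classifier-driven approach is `SequentialFeatureSelector`, which greedily adds the feature that most improves the cross-validated score of a chosen classifier. The sketch below uses it as a stand-in, with synthetic data, `GaussianNB` as the classifier, and a subset size of 5 picked arbitrarily for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the heart disease data (13 features, binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Greedy forward selection scored by the classifier's cross-validated accuracy.
selector = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=5, direction="forward",
    scoring="accuracy", cv=5,
)
selector.fit(X_demo, y_demo)
selected = selector.get_support(indices=True)
X_subset = selector.transform(X_demo)

full_score = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10).mean()
subset_score = cross_val_score(GaussianNB(), X_subset, y_demo, cv=10).mean()
print("selected feature indices:", selected)
print(f"all features: {full_score:.3f}  selected subset: {subset_score:.3f}")
```

The subset score here is measured on the same data that guided the selection, so it is optimistic. A held-out test set would give a fairer comparison.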
import seaborn as sns
# Compute the pairwise correlations between all features in the dataset
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# Plot the correlation matrix as an annotated heat map
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
sns.set_style('whitegrid')
sns.countplot(x='target',data=df,palette='RdBu_r')
dataset = pd.get_dummies(df, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
dataset.head()
y = dataset['target']
X = dataset.drop(['target'], axis = 1)
from sklearn.model_selection import cross_val_score
# Assumption: knn_scores holds the mean 10-fold cross-validation accuracy for K = 1..20
knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_scores.append(cross_val_score(knn_classifier, X, y, cv=10).mean())
# Plot the score for each K and label every point with its (K, score) pair
plt.plot(list(range(1, 21)), knn_scores, color='red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks(list(range(1, 21)))
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')
Random Forest Classifier
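from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Mean 10-fold cross-validation accuracy of a random forest with 10 trees
randomforest_classifier = RandomForestClassifier(n_estimators=10)
score = cross_val_score(randomforest_classifier, X, y, cv=10)
print(score.mean())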
Conclusion & Future Scope:
Heart disease, when aggravated, spirals beyond control. It is complicated and takes many
lives every year. When the early symptoms are ignored, the patient may face drastic
consequences within a short span of time. Sedentary lifestyles and excessive stress have
worsened the situation. If the disease is detected early, it can be kept under control. In any
case, it is advisable to exercise daily and give up unhealthy habits as early as possible.
Tobacco consumption and unhealthy diets increase the chances of stroke and heart disease.
Eating at least 5 helpings of fruits and vegetables a day is a good practice. For heart disease
patients, it is advisable to restrict the intake of salt to one teaspoon per day.
One of the major drawbacks of existing works is that their main focus has been on applying
classification techniques to heart disease prediction, rather than on studying the data
cleaning and pruning techniques that prepare a dataset and make it suitable for mining. It
has been observed that a properly cleaned and pruned dataset gives much better accuracy
than an unclean one with missing values. Selecting suitable data cleaning techniques along
with proper classification algorithms will lead to prediction systems with enhanced
accuracy.
In future, an intelligent system may be developed that helps select the proper treatment
method for a patient diagnosed with heart disease. A lot of work has already been done on
models that predict whether a patient is likely to develop heart disease. There are several
treatment methods for a patient once diagnosed with a particular form of heart disease.
Data mining can be of great help in deciding the line of treatment to follow, by extracting
knowledge from suitable databases.