5.2 SAMPLE CODE
Importing the dependencies
# loading the dataset
import pandas as pd
import numpy as np
# visualisation
import matplotlib.pyplot as plt
%matplotlib inline
# %matplotlib inline is an IPython magic command that renders plots inline in the notebook interface.
import seaborn as sns
# seaborn is a library for making statistical graphics
# EDA
from collections import Counter
import pandas_profiling as pp
# data processing
from sklearn.preprocessing import StandardScaler
# StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance.
# Unit variance means dividing all the values by the standard deviation.
# data splitting
from sklearn.model_selection import train_test_split
# data modeling
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
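To make the StandardScaler comment above concrete, here is a minimal sketch on a made-up toy column (not part of the heart dataset) showing that the transform is the same as subtracting the mean and dividing by the standard deviation:
# toy example: standardize a single feature column
toy = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler_demo = StandardScaler()
print(scaler_demo.fit_transform(toy).ravel())    # result from scikit-learn
print(((toy - toy.mean()) / toy.std()).ravel())  # manual z-score, identical values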
Data collection and processing
heart_data = pd.read_csv('/content/heart.csv')
data = heart_data  # alias used by the exploratory plots below
heart_data.head()
Output:
heart_data.tail()
Output:
heart_data.shape
Output:
heart_data.info()
Output:
Missing Value Detection
heart_data.isnull().sum()
Output:
Descriptive statistics
heart_data.describe()
Output:
heart_data['target'].value_counts()
Output:
1----> Defective heart
0----> Healthy heart
It is always good to check the correlation between the features so that we can see which features are positively correlated and which are negatively correlated with each other. Let’s check the correlation between the various features.
plt.figure(figsize=(20,12))
sns.set_context('notebook', font_scale=1.3)
sns.heatmap(data.corr(), annot=True, linewidths=2)
plt.tight_layout()
Output:
So far we have checked the correlation between the features, but it is also good practice to check the correlation of each feature with the target variable.
sns.set_context('notebook', font_scale=2.3)
data.drop('target', axis=1).corrwith(data.target).plot(kind='bar', grid=True, figsize=(20, 10), title="Correlation with the target feature")
plt.tight_layout()
Inference: Insights from the above graph are:
• Four features (“cp”, “restecg”, “thalach”, “slope”) are positively correlated with the target feature.
• Other features are negatively correlated with the target feature.
We have done enough collective analysis; now let’s move on to analyzing the individual features, which comprises both univariate and bivariate analysis.
Age(“age”) Analysis
Here we will check the 10 most frequent ages and their counts.
plt.figure(figsize=(25,12))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=data.age.value_counts()[:10].index, y=data.age.value_counts()[:10].values)
plt.tight_layout()
Output:
Inference: Here we can see that age 58 has the highest frequency.
Let’s check the range of age in the
dataset.
minAge = min(data.age)
maxAge = max(data.age)
meanAge = data.age.mean()
print('Min Age :',minAge)
print('Max Age :',maxAge)
print('Mean Age :',meanAge)
Output:
Min Age : 29 Max Age : 77 Mean Age : 54.366336633663366
We should divide the Age feature into three parts – “Young”, “Middle”
and “Elder”
Young = data[(data.age>=29)&(data.age<40)]
Middle = data[(data.age>=40)&(data.age<55)]
Elder = data[(data.age>55)]
plt.figure(figsize=(23,10))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=['young ages','middle ages','elderly ages'], y=[len(Young),len(Middle),len(Elder)])
plt.tight_layout()
Output:
Inference: Here we can see that elder people are the most affected by
heart disease and young ones are the least affected.
To prove the above inference we will plot the pie chart.
colors = ['blue','green','yellow']
explode = [0,0,0.1]
plt.figure(figsize=(10,10))
sns.set_context('notebook', font_scale=1.2)
plt.pie([len(Young),len(Middle),len(Elder)], labels=['young ages','middle ages','elderly ages'], explode=explode, colors=colors, autopct='%1.1f%%')
plt.tight_layout()
Output:
Sex (“sex”) Feature Analysis
plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['sex'])
plt.tight_layout()
Output:
Inference: Here it is clearly visible that the ratio of males to females is approximately 2:1.
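A quick check of the raw counts (a minimal sketch, assuming the same data frame used above) confirms this ratio:
sex_counts = data['sex'].value_counts()
print(sex_counts)
print('Male to Female ratio:', round(sex_counts[1] / sex_counts[0], 2))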
Now let’s plot the relation between sex and
slope.
plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(data['sex'],hue=data["slope"])
plt.tight_layout()
Output:
Inference: Here it is clearly visible that the slope value is higher in the case
of males(1).
Chest Pain Type (“cp”) Analysis
plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['cp'])
plt.tight_layout()
Output:
Inference: As seen, there are 4 types of chest pain, roughly in increasing order of severity:
1. least distressed
2. slightly distressed
3. moderately distressed
4. severely distressed
Analyzing cp vs target column
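The plotting code for this step is not shown in the listing; a minimal sketch following the same pattern as the other count plots (an assumption, not the author's exact code) would be:
plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['cp'], hue=data['target'])
plt.tight_layout()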
Inference: From the above graph we can make some inferences,
• People having the least chest pain are not likely to have heart
disease.
• People having severe chest pain are likely to have heart disease.
• Elderly people are more likely to have chest pain.
Thal Analysis
plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(data['thal'])
plt.tight_layout()
Output:
Target
plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['target'])
plt.tight_layout()
Output:
Inference: The ratio between classes 1 and 0 is well below 1.5, which indicates that the target feature is not imbalanced. Since the dataset is balanced, we can use accuracy_score as the evaluation metric for our models.
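As a quick check (a minimal sketch using the value counts computed earlier), the class ratio can be printed directly:
counts = data['target'].value_counts()
print('Class ratio (majority / minority):', round(counts.max() / counts.min(), 2))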
Feature Engineering
Now we will look at the unique values of every column and separate the features into categorical and continuous ones.
categorical_val = []
continous_val = []
for column in data.columns:
    print("--------------------")
    print(f"{column} : {data[column].unique()}")
    if len(data[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)
Output:
First we remove the target column from the set of categorical features, then we encode the remaining categorical variables with the get_dummies method, which creates a separate column for each category. For example, if a variable X contains 2 unique values, get_dummies will create 2 different columns for X.
categorical_val.remove('target')
dfs = pd.get_dummies(data, columns = categorical_val)
dfs.head(6)
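To make the get_dummies behaviour concrete, here is a minimal sketch on a made-up two-category column (not part of the heart dataset):
toy_df = pd.DataFrame({'X': ['a', 'b', 'a', 'b']})
print(pd.get_dummies(toy_df, columns=['X']))
# the X column is replaced by two indicator columns, X_a and X_b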
Now we will use StandardScaler to scale the continuous features so that variables measured on large scales do not dominate the others; a dataset scaled to comparable units generally leads to better accuracy.
sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dfs[col_to_scale] = sc.fit_transform(dfs[col_to_scale])
dfs.head(6)
Output:
Splitting the Features and Target
X=heart_data.drop(columns='target',axis=1)
Y=heart_data['target']
print(X)
print(Y)
Splitting the Data into Training data and Test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
print(X.shape,X_train.shape,X_test.shape)
Output:
MODEL TRAINING
Y = heart_data["target"]
X = heart_data.drop('target', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Before applying any algorithm we should check whether the classes are split evenly, because a heavily skewed split would cause a class-imbalance problem.
print(Y_test.unique())
Counter(Y_train)
Output:
ML MODELS:
Here I train several machine learning algorithms and try to find the one that predicts most accurately.
1. Logistic Regression
2. Naive Bayes
3. Random Forest Classifier
4. Extreme Gradient Boost
5. K-Nearest Neighbour
6. Decision Tree
7. Support Vector Machine
1.Logistic Regression
model = LogisticRegression()
model.fit(X_train,Y_train)
m1 = 'Logistic Regression'
lr = LogisticRegression()
model1 = lr.fit(X_train, Y_train)
lr_predict = lr.predict(X_test)
lr_conf_matrix = confusion_matrix(Y_test, lr_predict)
lr_acc_score = accuracy_score(Y_test, lr_predict)
print("confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:", lr_acc_score*100, '\n')
print(classification_report(Y_test, lr_predict))
Output:
2.Naive Bayes
m2 = 'Naive Bayes'
nb = GaussianNB()
model2 = nb.fit(X_train, Y_train)
nb_predict = nb.predict(X_test)
nb_conf_matrix = confusion_matrix(Y_test, nb_predict)
nb_acc_score = accuracy_score(Y_test, nb_predict)
print("confusion matrix")
print(nb_conf_matrix)
print("\n")
print("Accuracy of Naive Bayes model:", nb_acc_score*100, '\n')
print(classification_report(Y_test, nb_predict))
Output:
3.Random Forest Classifier
m3 = 'Random Forest Classifier'
rf = RandomForestClassifier(n_estimators=20, random_state=2, max_depth=5)
model3 = rf.fit(X_train, Y_train)
rf_predict = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(Y_test, rf_predict)
rf_acc_score = accuracy_score(Y_test, rf_predict)
print("confusion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:", rf_acc_score*100, '\n')
print(classification_report(Y_test, rf_predict))
Output:
4.Extreme Gradient Boost
m4 = 'Extreme Gradient Boost'
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15, gamma=0.6,
                    subsample=0.52, colsample_bytree=0.6, seed=27, reg_lambda=2,
                    booster='dart', colsample_bylevel=0.6, colsample_bynode=0.5)
model4 = xgb.fit(X_train, Y_train)
xgb_predict = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(Y_test, xgb_predict)
xgb_acc_score = accuracy_score(Y_test, xgb_predict)
print("confusion matrix")
print(xgb_conf_matrix)
print("\n")
print("Accuracy of Extreme Gradient Boost:", xgb_acc_score*100, '\n')
print(classification_report(Y_test, xgb_predict))
Output:
5.K-Nearest Neighbour
m5 = 'K-NeighborsClassifier'
knn = KNeighborsClassifier(n_neighbors=10)
model5 = knn.fit(X_train, Y_train)
knn_predict = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(Y_test, knn_predict)
knn_acc_score = accuracy_score(Y_test, knn_predict)
print("confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:", knn_acc_score*100, '\n')
print(classification_report(Y_test, knn_predict))
Output:
6.Decision Tree
m6 = 'DecisionTreeClassifier'
dt = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
model6 = dt.fit(X_train, Y_train)
dt_predict = dt.predict(X_test)
dt_conf_matrix = confusion_matrix(Y_test, dt_predict)
dt_acc_score = accuracy_score(Y_test, dt_predict)
print("confusion matrix")
print(dt_conf_matrix)
print("\n")
print("Accuracy of DecisionTreeClassifier:", dt_acc_score*100, '\n')
print(classification_report(Y_test, dt_predict))
Output:
7.Support Vector Machine
m7 = 'Support Vector Classifier'
svc = SVC(kernel='rbf', C=2)  # kernel and C are the predefined hyperparameters
model7 = svc.fit(X_train, Y_train)
svc_predict = svc.predict(X_test)
svc_conf_matrix = confusion_matrix(Y_test, svc_predict)
svc_acc_score = accuracy_score(Y_test, svc_predict)
print("confusion matrix")
print(svc_conf_matrix)
print("\n")
print("Accuracy of Support Vector Classifier:", svc_acc_score*100, '\n')
print(classification_report(Y_test, svc_predict))
Output:
lr_false_positive_rate, lr_true_positive_rate, lr_threshold = roc_curve(Y_test, lr_predict)
nb_false_positive_rate, nb_true_positive_rate, nb_threshold = roc_curve(Y_test, nb_predict)
rf_false_positive_rate, rf_true_positive_rate, rf_threshold = roc_curve(Y_test, rf_predict)
xgb_false_positive_rate, xgb_true_positive_rate, xgb_threshold = roc_curve(Y_test, xgb_predict)
knn_false_positive_rate, knn_true_positive_rate, knn_threshold = roc_curve(Y_test, knn_predict)
dt_false_positive_rate, dt_true_positive_rate, dt_threshold = roc_curve(Y_test, dt_predict)
svc_false_positive_rate, svc_true_positive_rate, svc_threshold = roc_curve(Y_test, svc_predict)
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
plt.title('Receiver Operating Characteristic Curve')
plt.plot(lr_false_positive_rate, lr_true_positive_rate, label='Logistic Regression')
plt.plot(nb_false_positive_rate, nb_true_positive_rate, label='Naive Bayes')
plt.plot(rf_false_positive_rate, rf_true_positive_rate, label='Random Forest')
plt.plot(xgb_false_positive_rate, xgb_true_positive_rate, label='Extreme Gradient Boost')
plt.plot(knn_false_positive_rate, knn_true_positive_rate, label='K-Nearest Neighbor')
plt.plot(dt_false_positive_rate, dt_true_positive_rate, label='Decision Tree')
plt.plot(svc_false_positive_rate, svc_true_positive_rate, label='Support Vector Classifier')
plt.plot([0,1], ls='--')
plt.plot([0,0], [1,0], c='.5')
plt.plot([1,1], c='.5')
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend()
plt.show()
Output:
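Note that roc_curve is given the hard 0/1 predictions above, so each curve has only a few points. A minimal sketch (an alternative, not part of the original listing) using predicted probabilities gives a smoother curve for models that expose predict_proba, for example logistic regression:
lr_probs = lr.predict_proba(X_test)[:, 1]   # probability of the positive class
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)
plt.plot(lr_fpr, lr_tpr, label='Logistic Regression (probabilities)')
plt.plot([0, 1], [0, 1], ls='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()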
Model Evaluation
Accuracy Score
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Accuracy on Training data: ', training_data_accuracy)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction,Y_test)
print('Accuracy on Test data: ', test_data_accuracy)
Output:
model_ev = pd.DataFrame({'Model': ['Logistic Regression','Naive Bayes','Random Forest','Extreme Gradient Boost','K-Nearest Neighbour','Decision Tree','Support Vector Machine'],
                         'Accuracy': [lr_acc_score*100, nb_acc_score*100, rf_acc_score*100, xgb_acc_score*100, knn_acc_score*100, dt_acc_score*100, svc_acc_score*100]})
model_ev
Output:
colors = ['red','green','blue','gold','silver','yellow','orange']
plt.figure(figsize=(12,5))
plt.title("Barplot representing accuracy of different models")
plt.xlabel("Algorithms")
plt.ylabel("Accuracy %")
plt.bar(model_ev['Model'], model_ev['Accuracy'], color=colors)
plt.show()
Output:
BUILDING A PREDICTIVE SYSTEM
input_data = (41,0,1,130,204,0,0,172,0,1.4,2,0,2)
# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
# the model was trained on scaled features, so apply the same scaler to the input
prediction = model.predict(scaler.transform(input_data_reshaped))
print(prediction)
if(prediction[0]==0):
print('The person does not have a Heart Disease')
else:
print('The person has Heart Disease')
input_data = (63,1,3,145,233,1,0,150,0,2.3,0,0,1)
# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
# apply the same scaling used on the training data before predicting
prediction = model.predict(scaler.transform(input_data_reshaped))
print(prediction)
if(prediction[0]==0):
print('The person does not have a Heart Disease')
else:
print('The person has Heart Disease')
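The two cells above can also be wrapped in a small helper (a sketch of one possible refactor, not part of the original code; the name predict_heart_disease is hypothetical). It applies the fitted scaler before predicting, matching how the model was trained:
def predict_heart_disease(sample, clf=model, sc=scaler):
    # scale one raw sample the same way as the training data, then predict
    sample_array = np.asarray(sample).reshape(1, -1)
    label = clf.predict(sc.transform(sample_array))[0]
    return 'The person has Heart Disease' if label == 1 else 'The person does not have a Heart Disease'
print(predict_heart_disease((41,0,1,130,204,0,0,172,0,1.4,2,0,2)))
print(predict_heart_disease((63,1,3,145,233,1,0,150,0,2.3,0,0,1)))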