
5.2 SAMPLE CODE
Importing the dependencies

# loading dataset
import pandas as pd
import numpy as np

# visualisation
import matplotlib.pyplot as plt
%matplotlib inline
# %matplotlib inline is a magic command for IPython that allows you to add plots to the browser interface.
import seaborn as sns
# seaborn is a library for making statistical graphics

# EDA
from collections import Counter
import pandas_profiling as pp

# data processing
from sklearn.preprocessing import StandardScaler
# StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance.
# Unit variance means dividing all the values by the standard deviation.

# data splitting
from sklearn.model_selection import train_test_split

# data modeling
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
Data collection and processing
heart_data = pd.read_csv('/content/heart.csv')

heart_data.head()

Output:

heart_data.tail()

Output:

heart_data.shape

Output:

heart_data.info()

Output:
Missing Value Detection

heart_data.isnull().sum()

Output:

Descriptive statistics
heart_data.describe()

Output:
heart_data['target'].value_counts()

Output:

1 ----> Defective heart
0 ----> Healthy heart

It is always better to check the correlation between the features so that we can see which features are negatively correlated and which are positively correlated. Let's check the correlation between the various features.

data = heart_data  # the EDA below refers to the dataframe as `data`
plt.figure(figsize=(20,12))
sns.set_context('notebook', font_scale=1.3)
sns.heatmap(data.corr(), annot=True, linewidth=2)
plt.tight_layout()

Output:
So far we have checked the correlation between the features, but it is also good practice to check the correlation of each feature with the target variable.

sns.set_context('notebook', font_scale=2.3)
data.drop('target', axis=1).corrwith(data.target).plot(
    kind='bar', grid=True, figsize=(20, 10),
    title="Correlation with the target feature")
plt.tight_layout()
Inference: Insights from the above graph are:

• Four features ("cp", "restecg", "thalach", "slope") are positively correlated with the target feature.
• The other features are negatively correlated with the target feature.
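The same ordering can also be checked numerically; a small sketch (using the same `data` frame as above):

# print each feature's correlation with the target, most positive first
print(data.drop('target', axis=1).corrwith(data['target']).sort_values(ascending=False))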

Now that we have done enough collective analysis, let's move on to the analysis of the individual features, which comprises both univariate and bivariate analysis.

Age(“age”) Analysis
Here we will check the 10 most frequent ages and their counts.
plt.figure(figsize=(25,12))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=data.age.value_counts()[:10].index,
            y=data.age.value_counts()[:10].values)
plt.tight_layout()

Output:
Inference: Here we can see that age 58 has the highest frequency.

Let’s check the range of age in the


dataset.
minAge = min(data.age)
maxAge = max(data.age)
meanAge = data.age.mean()
print('Min Age :',minAge)
print('Max Age :',maxAge)
print('Mean Age :',meanAge)

Output:

Min Age : 29 Max Age : 77 Mean Age : 54.366336633663366

We should divide the Age feature into three parts – “Young”, “Middle”
and “Elder”

Young = data[(data.age >= 29) & (data.age < 40)]
Middle = data[(data.age >= 40) & (data.age < 55)]
Elder = data[data.age > 55]

plt.figure(figsize=(23,10))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=['young ages','middle ages','elderly ages'],
            y=[len(Young), len(Middle), len(Elder)])
plt.tight_layout()
Output:

Inference: Here we can see that elderly people are the most affected by heart disease and young people are the least affected.

To prove the above inference we will plot the pie chart.

colors = ['blue','green','yellow']
explode = [0, 0, 0.1]

plt.figure(figsize=(10,10))
sns.set_context('notebook', font_scale=1.2)
plt.pie([len(Young), len(Middle), len(Elder)],
        labels=['young ages','middle ages','elderly ages'],
        explode=explode, colors=colors, autopct='%1.1f%%')
plt.tight_layout()

Output:
Sex("sex") Feature Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['sex'])
plt.tight_layout()

Output:
Inference: Here it is clearly visible that the ratio of males to females is approximately 2:1.
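The approximate 2:1 ratio can be confirmed with a quick count (a small sketch on the same `data` frame; 1 = male, 0 = female in this dataset):

counts = data['sex'].value_counts()
print(counts)
print('male to female ratio:', counts[1] / counts[0])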

Now let's plot the relation between sex and slope.
plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(data['sex'],hue=data["slope"])
plt.tight_layout()

Output:
Inference: Here it is clearly visible that the slope value is higher in the case
of males(1).

Chest Pain Type("cp") Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['cp'])
plt.tight_layout()

Output:

Inference: As seen, there are 4 types of chest pain:

1. least severe
2. slightly distressing
3. medium problem
4. severe

Analyzing cp vs target column
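A minimal sketch of the cp-versus-target count plot that the inference below refers to (assuming the same `data` frame and seaborn setup used above):

plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['cp'], hue=data['target'])  # chest pain type split by target class
plt.tight_layout()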


Inference: From the above graph we can make some inferences:

• People having the least severe chest pain are not likely to have heart disease.
• People having severe chest pain are likely to have heart disease.
• Elderly people are more likely to have chest pain (a quick check of this follows below).
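A sketch of that check, reusing the Young, Middle and Elder subsets defined earlier:

for name, group in [('young', Young), ('middle', Middle), ('elderly', Elder)]:
    print(name, group['cp'].value_counts().to_dict())  # chest pain type counts per age group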

Thal Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(data['thal'])
plt.tight_layout()

Output:
Target("target") Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['target'])
plt.tight_layout()

Output:
Inference: The ratio between class 1 and class 0 is much less than 1.5, which indicates that the target feature is not imbalanced. Since the dataset is balanced, we can use accuracy_score as the evaluation metric for our models.
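A quick sketch of that ratio check (same `data` frame):

counts = data['target'].value_counts()
print(counts)
print('class ratio:', counts.max() / counts.min())  # stays well below 1.5 for a balanced target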

Feature Engineering
Now we will see the complete description of the continuous data as well
as the categorical data
categorical_val = []
continous_val = []
for column in data.columns:
    print("--------------------")
    print(f"{column} : {data[column].unique()}")
    if len(data[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)

Output:
First we remove the target column from the list of categorical features so that it is not encoded. Then we encode the categorical variables using the get_dummies method, which creates a separate column for each category; for example, if a variable X contains 2 unique values, get_dummies creates 2 different columns for X.

categorical_val.remove('target')
dfs = pd.get_dummies(data, columns = categorical_val)
dfs.head(6)
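To make the "2 unique values give 2 columns" behaviour concrete, here is a toy example with a hypothetical column X:

toy = pd.DataFrame({'X': ['a', 'b', 'a', 'b']})
print(pd.get_dummies(toy, columns=['X']))  # produces two indicator columns, X_a and X_b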

Now we use the StandardScaler method to scale the continuous columns so that extreme values do not dominate; a dataset scaled to common units also tends to give better accuracy.

sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dfs[col_to_scale] = sc.fit_transform(dfs[col_to_scale])
dfs.head(6)

Output:
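To confirm the effect of the scaling, the scaled columns can be checked for mean ≈ 0 and standard deviation ≈ 1 (a quick sketch):

print(dfs[col_to_scale].describe().loc[['mean', 'std']].round(2))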
Splitting the Features and Target
X=heart_data.drop(columns='target',axis=1)
Y=heart_data['target']

print(X)

print(Y)
Splitting the Data into Training data and Test data

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)

print(X.shape,X_train.shape,X_test.shape)

Output:

MODEL TRAINING
Y = heart_data["target"]
X = heart_data.drop('target', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Before applying any algorithm we should check whether the classes are evenly split, because an uneven split would cause a class-imbalance problem.

print(Y_test.unique())
Counter(Y_train)

Output:
ML MODELS:
Here I take different machine learning algorithms and try to find the one that predicts most accurately.

1. Logistic Regression
2. Naive Bayes
3. Random Forest Classifier
4. Extreme Gradient Boost
5. K-Nearest Neighbour
6. Decision Tree
7. Support Vector Machine

1. Logistic Regression
model = LogisticRegression()
model.fit(X_train,Y_train)

m1 = 'Logistic Regression'
lr = LogisticRegression()
model1 = lr.fit(X_train, Y_train)
lr_predict = lr.predict(X_test)
lr_conf_matrix = confusion_matrix(Y_test, lr_predict)
lr_acc_score = accuracy_score(Y_test, lr_predict)
print("confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:", lr_acc_score*100, '\n')
print(classification_report(Y_test, lr_predict))

Output:
2.Naive Bayes

m2 = 'Naive Bayes'
nb = GaussianNB()
model2 = nb.fit(X_train, Y_train)
nb_predict = nb.predict(X_test)
nb_conf_matrix = confusion_matrix(Y_test, nb_predict)
nb_acc_score = accuracy_score(Y_test, nb_predict)
print("confusion matrix")
print(nb_conf_matrix)
print("\n")
print("Accuracy of Naive Bayes model:", nb_acc_score*100, '\n')
print(classification_report(Y_test, nb_predict))

Output:

3.Random Forest Classifier

m3 = 'Random Forest Classifier'
rf = RandomForestClassifier(n_estimators=20, random_state=2, max_depth=5)
model3 = rf.fit(X_train, Y_train)
rf_predict = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(Y_test, rf_predict)
rf_acc_score = accuracy_score(Y_test, rf_predict)
print("confusion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:", rf_acc_score*100, '\n')
print(classification_report(Y_test, rf_predict))

Output:
4.Extreme Gradient Boost

m4 = 'Extreme Gradient Boost'
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15,
                    gamma=0.6, subsample=0.52, colsample_bytree=0.6,
                    seed=27, reg_lambda=2, booster='dart',
                    colsample_bylevel=0.6, colsample_bynode=0.5)
model4 = xgb.fit(X_train, Y_train)
xgb_predict = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(Y_test, xgb_predict)
xgb_acc_score = accuracy_score(Y_test, xgb_predict)
print("confusion matrix")
print(xgb_conf_matrix)
print("\n")
print("Accuracy of Extreme Gradient Boost:", xgb_acc_score*100, '\n')
print(classification_report(Y_test, xgb_predict))

Output:
5.K-Nearest Neighbour

m5 = 'K-NeighborsClassifier'
knn = KNeighborsClassifier(n_neighbors=10)
model5 = knn.fit(X_train, Y_train)
knn_predict = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(Y_test, knn_predict)
knn_acc_score = accuracy_score(Y_test, knn_predict)
print("confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:", knn_acc_score*100, '\n')
print(classification_report(Y_test, knn_predict))

Output:
6. Decision Tree

m6 = 'DecisionTreeClassifier'
dt = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
model6 = dt.fit(X_train, Y_train)
dt_predict = dt.predict(X_test)
dt_conf_matrix = confusion_matrix(Y_test, dt_predict)
dt_acc_score = accuracy_score(Y_test, dt_predict)
print("confusion matrix")
print(dt_conf_matrix)
print("\n")
print("Accuracy of DecisionTreeClassifier:", dt_acc_score*100, '\n')
print(classification_report(Y_test, dt_predict))
Output:
7.Support Vector Machine

m7 = 'Support Vector Classifier'
svc = SVC(kernel='rbf', C=2)  # the kernel and C values are predefined parameters
model7 = svc.fit(X_train, Y_train)
svc_predict = svc.predict(X_test)
svc_conf_matrix = confusion_matrix(Y_test, svc_predict)
svc_acc_score = accuracy_score(Y_test, svc_predict)
print("confusion matrix")
print(svc_conf_matrix)
print("\n")
print("Accuracy of Support Vector Classifier:", svc_acc_score*100, '\n')
print(classification_report(Y_test, svc_predict))

Output:
# ROC curves computed from the hard class predictions of each model
lr_false_positive_rate, lr_true_positive_rate, lr_threshold = roc_curve(Y_test, lr_predict)
nb_false_positive_rate, nb_true_positive_rate, nb_threshold = roc_curve(Y_test, nb_predict)
rf_false_positive_rate, rf_true_positive_rate, rf_threshold = roc_curve(Y_test, rf_predict)
xgb_false_positive_rate, xgb_true_positive_rate, xgb_threshold = roc_curve(Y_test, xgb_predict)
knn_false_positive_rate, knn_true_positive_rate, knn_threshold = roc_curve(Y_test, knn_predict)
dt_false_positive_rate, dt_true_positive_rate, dt_threshold = roc_curve(Y_test, dt_predict)
svc_false_positive_rate, svc_true_positive_rate, svc_threshold = roc_curve(Y_test, svc_predict)

sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
plt.title('Receiver Operating Characteristic Curve')
plt.plot(lr_false_positive_rate, lr_true_positive_rate, label='Logistic Regression')
plt.plot(nb_false_positive_rate, nb_true_positive_rate, label='Naive Bayes')
plt.plot(rf_false_positive_rate, rf_true_positive_rate, label='Random Forest')
plt.plot(xgb_false_positive_rate, xgb_true_positive_rate, label='Extreme Gradient Boost')
plt.plot(knn_false_positive_rate, knn_true_positive_rate, label='K-Nearest Neighbor')
plt.plot(dt_false_positive_rate, dt_true_positive_rate, label='Decision Tree')
plt.plot(svc_false_positive_rate, svc_true_positive_rate, label='Support Vector Classifier')
plt.plot([0,1], ls='--')
plt.plot([0,0], [1,0], c='.5')
plt.plot([1,1], c='.5')
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.legend()
plt.show()
Output:
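Because the curves above are built from hard class labels, each one has only a single operating point. For a smoother curve one would normally pass predicted probabilities to roc_curve; a sketch for the logistic regression model (assuming the fitted lr model from above):

lr_probs = lr.predict_proba(X_test)[:, 1]  # probability of class 1 (defective heart)
lr_fpr, lr_tpr, _ = roc_curve(Y_test, lr_probs)
plt.plot(lr_fpr, lr_tpr, label='Logistic Regression (probabilities)')
plt.plot([0, 1], [0, 1], ls='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()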

Model Evaluation
Accuracy Score

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)

print('Accuracy on Training data: ', training_data_accuracy)

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction,Y_test)

print('Accuracy on Test data: ', test_data_accuracy)

Output:
model_ev = pd.DataFrame({
    'Model': ['Logistic Regression', 'Naive Bayes', 'Random Forest',
              'Extreme Gradient Boost', 'K-Nearest Neighbour',
              'Decision Tree', 'Support Vector Machine'],
    'Accuracy': [lr_acc_score*100, nb_acc_score*100, rf_acc_score*100,
                 xgb_acc_score*100, knn_acc_score*100, dt_acc_score*100,
                 svc_acc_score*100]})
model_ev

Output:
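To read off the best-performing model programmatically (a small sketch on the model_ev frame defined above):

best = model_ev.sort_values('Accuracy', ascending=False).iloc[0]
print('Best model:', best['Model'], 'with accuracy', round(best['Accuracy'], 2), '%')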

colors = ['red','green','blue','gold','silver','yellow','orange']
plt.figure(figsize=(12,5))
plt.title("Barplot representing accuracy of different models")
plt.xlabel("Algorithms")
plt.ylabel("Accuracy %")
plt.bar(model_ev['Model'], model_ev['Accuracy'], color=colors)
plt.show()

Output:
BUILDING A PREDICTIVE SYSTEM
input_data = (41,0,1,130,204,0,0,172,0,1.4,2,0,2)

# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if(prediction[0]==0):
print('The person does not have a Heart Disease')
else:
print('The person has Heart Disease')
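Note that model was fitted on features standardized in the MODEL TRAINING section, so a more consistent sketch would pass the input through the same fitted scaler before predicting (assuming the scaler object from above):

scaled_input = scaler.transform(input_data_reshaped)  # apply the scaler fitted on the training data
prediction = model.predict(scaled_input)
print(prediction)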

input_data = (63,1,3,145,233,1,0,150,0,2.3,0,0,1)

# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if(prediction[0]==0):
print('The person does not have a Heart Disease')
else:
print('The person has Heart Disease')
