3/15/2020 EDA Assignment
Haberman's Survival Data Set
Survival of patients who had undergone surgery for breast cancer.
About the Data Set: The dataset contains cases from a study that was conducted between 1958 and 1970
at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.
Database Information:
Source of Data Set: https://www.kaggle.com/gilsousa/habermans-survival-data-set
Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's
age, year of operation, and number of positive axillary lymph nodes.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
haberman = pd.read_csv('haberman.csv', header = 0, names = ['age', 'year_of_operation',
'positive_axillary_nodes', 'survival_status'])
In [2]:
# 1 = positive, 2 = negative
haberman["survival_status"]=haberman["survival_status"].map({1:'positive',2:'negative'
})
haberman.head()
Out[2]:
age year_of_operation positive_axillary_nodes survival_status
0 30 64 1 positive
1 30 62 3 positive
2 30 65 0 positive
3 31 59 2 positive
4 31 65 4 positive
file:///C:/Users/Shiladitya/Desktop/EDA Assignment.html 1/15
In [3]:
haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year_of_operation 306 non-null int64
positive_axillary_nodes 306 non-null int64
survival_status 306 non-null object
dtypes: int64(3), object(1)
memory usage: 9.6+ KB
Observations:
The DataFrame has 306 entries, indexed from 0 to 305.
There are 4 columns: 3 features and 1 output (the class label).
Each column has 306 non-null values.
The output column, survival_status, takes only two values, 1 or 2, which were mapped to 'positive' and
'negative'.
Conclusions:
There are no missing values in this dataset.
Predicting survival_status is a binary classification problem.
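The checks above can be tried on a handful of rows. The sketch below rebuilds the five rows shown by haberman.head(), with survival_status still in its original numeric coding, and re-applies the mapping. The name `sample` is only illustrative. Because `Series.map` returns NaN for any code that is missing from the dictionary, it is worth repeating the null check after the mapping.

```python
import pandas as pd

# First five rows of the dataset, as shown by haberman.head(),
# with survival_status still in its original numeric coding (1 or 2).
sample = pd.DataFrame({
    "age": [30, 30, 30, 31, 31],
    "year_of_operation": [64, 62, 65, 59, 65],
    "positive_axillary_nodes": [1, 3, 0, 2, 4],
    "survival_status": [1, 1, 1, 1, 1],
})

# Map the numeric codes to labels; a code absent from the dict would become NaN.
sample["survival_status"] = sample["survival_status"].map({1: "positive", 2: "negative"})

# Count missing values per column after the mapping.
missing_per_column = sample.isnull().sum()
labels_seen = set(sample["survival_status"].unique())

print(missing_per_column)
print(labels_seen)
```

On the full DataFrame the same two calls would be `haberman.isnull().sum()` and `haberman["survival_status"].unique()`.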
In [4]:
print (haberman.shape)
(306, 4)
Number of Rows: 306
Number of Columns: 4
Attribute Information:
Age of patient at time of operation (numerical)
Patient's year of operation (year - 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within
5 years
Missing Attribute Values: None
Positive axillary nodes: https://en.wikipedia.org/wiki/Positive_axillary_lymph_node
In [5]:
haberman["survival_status"].value_counts()
Out[5]:
positive 225
negative 81
Name: survival_status, dtype: int64
Observation: The dataset is imbalanced: 225 patients survived 5 years or longer, and 81 patients died within
5 years.
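To put the imbalance in numbers, the class counts can be turned into proportions. The sketch below rebuilds only the label column from the two counts printed above (the Series `status` is a stand-in for `haberman["survival_status"]`), so it runs without the CSV file.

```python
import pandas as pd

# Rebuild the label column from the counts reported by value_counts().
status = pd.Series(["positive"] * 225 + ["negative"] * 81, name="survival_status")

# normalize=True returns class proportions instead of raw counts.
class_share = status.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 73.5% of the patients are in the 'positive' class, so any classifier built on this data should be compared against that majority-class baseline rather than against 50%.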
In [6]:
print (haberman.describe())
age year_of_operation positive_axillary_nodes
count 306.000000 306.000000 306.000000
mean 52.457516 62.852941 4.026144
std 10.803452 3.249405 7.189654
min 30.000000 58.000000 0.000000
25% 44.000000 60.000000 0.000000
50% 52.000000 63.000000 1.000000
75% 60.750000 65.750000 4.000000
max 83.000000 69.000000 52.000000
Observation:
The maximum age of the patients is 83 years and the minimum is 30 years, with a mean of
52.46 years.
The median age of the patients is 52 years.
All operations in the data were performed between 1958 and 1969.
The maximum number of positive axillary nodes found in a patient is 52 and the minimum is 0, with a
mean of 4.03 and a median of 1.
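The year_of_operation column is stored as year - 1900, so the calendar years quoted above come from adding 1900 back. A minimal sketch using the minimum and maximum from the describe() output (the Series `encoded_years` is illustrative):

```python
import pandas as pd

# Minimum and maximum of year_of_operation, taken from describe() above.
encoded_years = pd.Series([58, 69], index=["min", "max"])

# The column is encoded as (year - 1900), so add 1900 to get calendar years.
calendar_years = encoded_years + 1900
print(calendar_years)
```

On the full DataFrame the same conversion would be `haberman["year_of_operation"] + 1900`.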
In [7]:
haberman_positive = haberman.loc[haberman.survival_status == 'positive']
haberman_negative = haberman.loc[haberman.survival_status == 'negative']
In [8]:
print (haberman_positive.describe())
age year_of_operation positive_axillary_nodes
count 225.000000 225.000000 225.000000
mean 52.017778 62.862222 2.791111
std 11.012154 3.222915 5.870318
min 30.000000 58.000000 0.000000
25% 43.000000 60.000000 0.000000
50% 52.000000 63.000000 0.000000
75% 60.000000 66.000000 3.000000
max 77.000000 69.000000 46.000000
Observation: Among patients who survived 5 years or longer, the mean number of positive axillary nodes is 2.79.
Conclusion: A lower number of positive nodes may indicate a lower health risk.
In [9]:
print (haberman_negative.describe())
age year_of_operation positive_axillary_nodes
count 81.000000 81.000000 81.000000
mean 53.679012 62.827160 7.456790
std 10.167137 3.342118 9.185654
min 34.000000 58.000000 0.000000
25% 46.000000 59.000000 1.000000
50% 53.000000 63.000000 4.000000
75% 61.000000 65.000000 11.000000
max 83.000000 69.000000 52.000000
Observation: Among patients who died within 5 years, the mean number of positive axillary nodes is 7.46.
Conclusion: A higher number of positive nodes may indicate a higher health risk.
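As a quick consistency check on the two describe() tables, the per-class means weighted by class size should reproduce the overall mean of positive_axillary_nodes (4.026144). The sketch below uses only the counts and means printed above; the variable names are illustrative.

```python
# Counts and means of positive_axillary_nodes from the per-class describe() output.
n_pos, mean_pos = 225, 2.791111   # survived 5 years or longer
n_neg, mean_neg = 81, 7.456790    # died within 5 years

# Weighted average of the class means gives the overall mean.
overall_mean = (n_pos * mean_pos + n_neg * mean_neg) / (n_pos + n_neg)

# How many times larger the mean node count is in the 'negative' class.
ratio = mean_neg / mean_pos

print(f"overall mean = {overall_mean:.6f}")
print(f"negative/positive mean ratio = {ratio:.2f}")
```

The medians tell the same story with less influence from extreme values: 0 nodes for the 'positive' class versus 4 for the 'negative' class.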
2-D Scatter plot
In [10]:
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "positive_axillary_nodes", "age") \
.add_legend()
plt.show()
Observation: Most of the patients with a positive survival status have few positive axillary nodes.
Conclusion: The chance of survival appears greater when fewer positive axillary nodes are found.
In [11]:
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "year_of_operation", "age") \
.add_legend()
plt.show()
Conclusion: Most patients were between 40 and 70 years old at the time of operation.
In [12]:
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "positive_axillary_nodes", "year_of_operation") \
.add_legend()
plt.show()
Conclusion: No significant conclusion can be drawn, because the two classes overlap heavily.
Pairplots
With 3 features there are 3C2 = 3 unique feature pairs, so we get 3 distinct pair plots.
The plots on the principal diagonal of the grid are not considered here.
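The count of 3 pair plots can be confirmed by enumerating the unordered feature pairs. This is a small sketch using the column names defined in cell In [1]:

```python
from itertools import combinations

features = ["age", "year_of_operation", "positive_axillary_nodes"]

# Every unordered pair of distinct features corresponds to one pair plot.
feature_pairs = list(combinations(features, 2))

for x_name, y_name in feature_pairs:
    print(f"{x_name} vs {y_name}")

print(f"number of pair plots: {len(feature_pairs)}")
```

The pair plot grid draws each of these pairs twice (once above and once below the diagonal, with the axes swapped), which is why only 3 of the 6 off-diagonal panels carry distinct information.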
In [13]:
sns.set_style('whitegrid')
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.pairplot(haberman, hue='survival_status',
vars=['age', 'year_of_operation', 'positive_axillary_nodes'], height=4)
plt.show()
Conclusion:
No conclusion can be drawn from the plots of year of operation vs age and year of operation vs
positive axillary nodes, because the two classes mostly overlap.
The plot of positive axillary nodes vs age, however, suggests that the chance of survival is greater
when fewer positive axillary nodes are found.
1D scatter
In [14]:
# Patient Age
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(sns.distplot, "age") \
.add_legend()
plt.show()
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observation: Patients aged 30-40 show the highest proportion of survivors.
In [15]:
# Year of Operation
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(sns.distplot, "year_of_operation") \
.add_legend()
plt.show()
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observation:
Operations performed in 1963-1966 show the highest proportion of patients who died within 5 years.
Operations performed in 1960-1962 show the highest proportion of patients who survived.
In [16]:
# Positive axillary nodes
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(sns.distplot, "positive_axillary_nodes") \
.add_legend()
plt.show()
C:\Users\Shiladitya\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6
462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced
by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observation: Patients with 0 positive axillary nodes have the highest survival rate.
Conclusion: positive_axillary_nodes is the most informative of the three features, because its two class
distributions overlap the least.
PDF & CDF
In [17]:
counts, bin_edges = np.histogram(haberman_positive['positive_axillary_nodes'], bins=20,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()
[0.73333333 0.10222222 0.02666667 0.05333333 0.01333333 0.00888889
0.02222222 0.00444444 0.00888889 0.00888889 0.00444444 0.
0.00444444 0.00444444 0. 0. 0. 0.
0. 0.00444444]
[ 0. 2.3 4.6 6.9 9.2 11.5 13.8 16.1 18.4 20.7 23. 25.3 27.6 29.9
32.2 34.5 36.8 39.1 41.4 43.7 46. ]
Observation: Among patients who survived 5 years or longer, about 84% had at most 4 positive axillary nodes
(first two bins, up to 4.6) and about 92% had at most 9 (first four bins, up to 9.2).
In [18]:
counts, bin_edges = np.histogram(haberman_negative['positive_axillary_nodes'], bins=20,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('positive_axillary_nodes')
plt.show()
[0.39506173 0.17283951 0.0617284 0.08641975 0.04938272 0.08641975
0.01234568 0.03703704 0.0617284 0.01234568 0. 0.
0. 0.01234568 0. 0. 0. 0.
0. 0.01234568]
[ 0. 2.6 5.2 7.8 10.4 13. 15.6 18.2 20.8 23.4 26. 28.6 31.2 33.8
36.4 39. 41.6 44.2 46.8 49.4 52. ]
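Observation: Among patients who died within 5 years, about 85% had at most 15 positive axillary nodes (first
six bins, up to 15.6) and about 90% had at most 20 (first eight bins, up to 20.8).
Conclusion: The number of positive axillary nodes is related to the chance of survival: patients who died
within 5 years tend to have more positive nodes than patients who survived.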
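The cumulative fractions can be read directly from the printed PDF values. The sketch below copies the two arrays printed in cells In [17] and In [18], rebuilds the bin edges with `np.linspace` (20 equal bins from 0 to the class maximum), and reports the first bin edge at which the CDF reaches a target fraction. `right_edge_reaching` is a hypothetical helper written for this illustration, not part of the original notebook.

```python
import numpy as np

# Binned PDF values printed above for each class (20 equal-width bins).
pdf_survived = np.array([
    0.73333333, 0.10222222, 0.02666667, 0.05333333, 0.01333333,
    0.00888889, 0.02222222, 0.00444444, 0.00888889, 0.00888889,
    0.00444444, 0.0, 0.00444444, 0.00444444, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.00444444,
])
pdf_died = np.array([
    0.39506173, 0.17283951, 0.0617284, 0.08641975, 0.04938272,
    0.08641975, 0.01234568, 0.03703704, 0.0617284, 0.01234568,
    0.0, 0.0, 0.0, 0.01234568, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.01234568,
])

# Bin edges as printed: 21 evenly spaced values from 0 to the class maximum.
edges_survived = np.linspace(0, 46, 21)
edges_died = np.linspace(0, 52, 21)


def right_edge_reaching(pdf, bin_edges, target):
    """Return (right bin edge, CDF value) of the first bin where the CDF >= target."""
    cdf = np.cumsum(pdf)
    # searchsorted works because the CDF is non-decreasing.
    idx = min(int(np.searchsorted(cdf, target)), len(pdf) - 1)
    return bin_edges[idx + 1], cdf[idx]


edge_s, cdf_s = right_edge_reaching(pdf_survived, edges_survived, 0.85)
edge_d, cdf_d = right_edge_reaching(pdf_died, edges_died, 0.85)

print(f"survived: CDF reaches {cdf_s:.3f} at nodes < {edge_s:.1f}")
print(f"died:     CDF reaches {cdf_d:.3f} at nodes < {edge_d:.1f}")
```

Because the values are binned, the thresholds are only as fine as the bin width (2.3 and 2.6 nodes). An unbinned empirical CDF computed from the sorted raw node counts would give exact thresholds.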
Box Plot & Whiskers
In [19]:
sns.boxplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.boxplot(x='survival_status',y='age', data=haberman)
plt.show()
In [20]:
sns.violinplot(x='survival_status',y='positive_axillary_nodes', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='year_of_operation', data=haberman)
plt.show()
sns.violinplot(x='survival_status',y='age', data=haberman)
plt.show()
Observation: Of the 3 features, positive_axillary_nodes shows the most clearly separated distributions
between the two classes. From the plots above we can conclude only that the higher the number of positive
axillary nodes, the higher the chance of death within 5 years. The age of the patient does not appear to be
related to survival status.
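The box plots of positive_axillary_nodes can be cross-checked against the quartiles in the per-class describe() tables. The sketch below computes the interquartile range and the upper whisker limit Q3 + 1.5 * IQR for each class, assuming the default `whis=1.5` of `sns.boxplot`. `upper_fence` is a hypothetical helper for illustration.

```python
def upper_fence(q1, q3, whis=1.5):
    """Upper limit for the box plot whisker: Q3 + whis * IQR."""
    return q3 + whis * (q3 - q1)


# (25%, 75%) quartiles of positive_axillary_nodes from the describe() output.
quartiles = {
    "positive": (0.0, 3.0),
    "negative": (1.0, 11.0),
}

fences = {}
for label, (q1, q3) in quartiles.items():
    fences[label] = upper_fence(q1, q3)
    print(f"{label}: IQR = {q3 - q1:.1f}, upper whisker limit = {fences[label]:.1f}")
```

The whisker itself stops at the largest observation that does not exceed this limit. Since the class maxima (46 and 52) lie well above both limits, each box plot draws the highest node counts as individual outlier points.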
Final Conclusion:
From the scatter plot of year_of_operation vs age, the majority of operations were performed on
patients aged between 40 and 70.
A large number of operations were performed between 1960 and 1965 (from the box plot of
year_of_operation vs survival_status).
Patients with 0 positive axillary nodes are more likely to survive, irrespective of their age
(positive_axillary_nodes vs age).
Patients aged 30-40 show the highest proportion of survivors.
From the box plot, the more positive axillary nodes a patient has, the higher the chance that the
patient dies within 5 years.