9/11/25, 11:51 PM Data Mining Lab.ipynb - Colab
Name: Syed Muhammad Hassan
Reg No: B23F1000DS056
Course: Data Mining Lab
Program: Data Science
Import libraries. NumPy is used for numerical computation. pandas builds on NumPy and adds labeled, tabular data structures (DataFrames)
that make loading and manipulating data easy. Matplotlib and seaborn are used for visualization.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Diabates_data= pd.read_csv('/content/diabetes.csv')
Diabates_data.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
This dataset contains diagnostic measurements for female patients, such as glucose level, blood pressure, and skin thickness (measured at the triceps).
head() prints the first 5 rows; to see the other end of the data, tail() shows the last 5 rows.
Diabates_data.tail()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0
Diabates_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
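describe() shows summary statistics for each column: count, mean, standard deviation (std), minimum, the 25%, 50% and 75% percentiles (the 50% value is the median), and maximum. It does not report the mode.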
Diabates_data.describe()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A
https://colab.research.google.com/drive/1X9x2ZWrgatjB2JRmefzQfPKE44U3HGML#scrollTo=g7eZgr6pHBlc&printMode=true 1/7
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.0000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.2408
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.7602
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.0000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.0000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.0000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.0000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.0000
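Since the mode is not part of this table, it has to be computed separately. The following is a minimal sketch, assuming a small hand-built frame named `sample` (a hypothetical stand-in that holds two columns of the five rows printed by head(), not the full CSV):

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: two columns of the five head() rows.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
})

# Median of every column (the same statistic as the 50% row of describe()).
median_values = sample.median()

# Mode of one column; mode() returns every value tied for most frequent.
mode_skin = sample["SkinThickness"].mode().tolist()

print(median_values)
print(mode_skin)
```

On the real data the same calls would be made on Diabates_data instead of `sample`.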
Diabates_data.shape
(768, 9)
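shape returns the dimensions of the DataFrame as (rows, columns): here there are 768 rows and 9 columns.
The dataset has no null values: isnull().sum() counts the null entries in each column, and every count is 0.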
Diabates_data.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
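isnull() only detects NaN entries. The describe() output reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, and a zero is not a plausible measurement for these features, so zeros may be standing in for missing readings. That interpretation is an assumption and is not stated in the notebook. A minimal sketch of counting zeros per column, using a hypothetical frame `sample` built from the five head() rows:

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: the five rows printed by head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Compare every cell with 0, then count the True values in each column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```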
value_counts() counts how many rows carry each Outcome label (0 and 1).
Diabates_data['Outcome'].value_counts()
count
Outcome
0 500
1 268
dtype: int64
0 means non-diabetic and 1 means diabetic, so 500 patients are non-diabetic and 268 are diabetic.
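The same counts can be expressed as shares of the whole. A minimal sketch, assuming a Series named `outcome` that is rebuilt from the two counts above (a hypothetical stand-in for Diabates_data['Outcome']):

```python
import pandas as pd

# Hypothetical stand-in: 500 zeros and 268 ones, matching the counts above.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# normalize=True returns fractions of the total instead of raw counts.
shares = outcome.value_counts(normalize=True)
print((shares * 100).round(1))
```

The classes are unbalanced, roughly 65% to 35%, which is worth keeping in mind when judging any classifier trained on this data.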
Grouping by Outcome gives the mean of every feature separately for the non-diabetic (0) and diabetic (1) patients.
The clearest difference is in Glucose (about 110 vs 141). Insulin and BMI are also higher for diabetic patients, while BloodPressure differs only slightly (about 68 vs 71).
Diabates_data.groupby('Outcome').mean()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A
Outcome
0 3.298000 109.980000 68.184000 19.664000 68.792000 30.304200 0.429734 31.1900
1 4.865672 141.257463 70.824627 22.164179 100.335821 35.142537 0.550500 37.0671
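To judge which differences are large, the gap between the two group means can be expressed relative to the non-diabetic mean. A small sketch using values copied from the table above (the dictionary `group_means` and the variable names are only illustrative):

```python
# (mean for Outcome 0, mean for Outcome 1), copied from the groupby output.
group_means = {
    "Glucose": (109.980000, 141.257463),
    "BloodPressure": (68.184000, 70.824627),
    "Insulin": (68.792000, 100.335821),
    "BMI": (30.304200, 35.142537),
}

# Relative difference in percent, measured against the non-diabetic mean.
relative_diff = {
    name: (diabetic - non_diabetic) / non_diabetic * 100
    for name, (non_diabetic, diabetic) in group_means.items()
}

for name, pct in relative_diff.items():
    print(f"{name}: {pct:.1f}% higher for Outcome 1")
```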
A negative value in a correlation matrix heatmap indicates a negative correlation, that is, an inverse relationship between the two variables it
connects. A value of -1 is a perfect negative correlation: the variables move in perfect opposition.
A positive value means the two variables tend to increase together, and +1 is a perfect positive correlation. Correlation measures the linear
association between two features; it does not compare actual and predicted values. In this dataset SkinThickness and Insulin have a
comparatively strong positive relationship, while Age and SkinThickness are only weakly related.
plt.figure(figsize=(10,8))
sns.heatmap(Diabates_data.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Diabetes Dataset')
plt.show()
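Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a sketch of what that number means, the helper below (`pearson_r`, a name chosen here for illustration) computes it by hand for two small made-up lists:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up data: y falls as x rises, then y rises as x rises.
r_opposite = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
r_together = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r_opposite, r_together)
```

The first pair moves in perfect opposition and the second in perfect step, so the results sit at the two ends of the scale.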
The histograms show the distribution of each feature: where the values are concentrated (the peaks) and where they are rare.
Diabates_data.hist(figsize=(15, 10))
plt.tight_layout()
plt.show()
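In the box plot, the dots beyond the whiskers are outliers: individual patients whose value for that feature lies far from the bulk of the data.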
plt.figure(figsize=(15, 10))
Diabates_data.boxplot()
plt.title('Boxplot of Diabetes Dataset Features')
plt.ylabel('Value')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
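By default Matplotlib lets the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box and draws anything further out as a dot. A minimal sketch of that rule, applied to the Insulin quartiles from the describe() table (the function name `iqr_bounds` is only illustrative):

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Insulin: 25% = 0.0 and 75% = 127.25 in the describe() output.
insulin_low, insulin_high = iqr_bounds(0.0, 127.25)
print(insulin_low, insulin_high)

# The maximum Insulin value in the table is 846, well above the upper limit.
max_is_outlier = 846.0 > insulin_high
print(max_is_outlier)
```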
The pie chart gives a clear picture of the ratio of diabetic to non-diabetic patients.
outcome_counts = Diabates_data['Outcome'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(outcome_counts, labels=['Non-Diabetic (0)', 'Diabetic (1)'], autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightc
plt.title('Distribution of Diabetes Outcome')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
The bar chart lets us compare the mean of each feature, such as blood pressure and skin thickness, between diabetic and non-diabetic patients.
Diabates_data.groupby('Outcome').mean().plot(kind='bar', figsize=(12, 6))
plt.title('Mean of Features by Diabetes Outcome')
plt.ylabel('Mean Value')
plt.xticks(rotation=0)
plt.show()
sns.countplot(x='Outcome', data=Diabates_data)
plt.title('Count of Diabetes Outcome')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.xticks([0, 1], ['Non-Diabetic', 'Diabetic'])
plt.show()
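A scatter plot shows how the data points are spread across the plane. A clustering method such as k-means could be applied to divide the points into clusters (KNN, k-nearest neighbors, is a classification method rather than a clustering one).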
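As an illustration of clustering points on a plane, here is a hedged sketch using scikit-learn's KMeans. It is not part of the original notebook: the data is synthetic (two made-up blobs on Glucose-like and BMI-like axes), and the choice of two clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data, not the diabetes dataset: two well-separated blobs.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[90.0, 25.0], scale=3.0, size=(50, 2))
group_b = rng.normal(loc=[160.0, 40.0], scale=3.0, size=(50, 2))
points = np.vstack([group_a, group_b])

# Ask k-means for two clusters; n_init and random_state make the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(kmeans.cluster_centers_)
```

k-means groups points without looking at the Outcome label, so on the real data its clusters would not necessarily match the diabetic and non-diabetic groups.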
This code scales all the feature columns (like Glucose, BMI) of the diabetes dataset to a range between 0 and 1, which is a common
preprocessing step for machine learning algorithms, especially those based on distances between points.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(Diabates_data.drop('Outcome', axis=1))
scaled_df = pd.DataFrame(scaled_data, columns=Diabates_data.drop('Outcome', axis=1).columns)
scaled_df['Outcome'] = Diabates_data['Outcome']
display(scaled_df.head())
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 0.352941 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 0.483333 1
1 0.058824 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 0.166667 0
2 0.470588 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 0.183333 1
3 0.058824 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 0.000000 0
4 0.000000 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 0.200000 1
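Min-max scaling maps each value with x' = (x - min) / (max - min), where min and max are taken per column. The sketch below checks that formula against the first scaled row, using the column minimum and maximum from the describe() table (the function name `min_max_scale` is only illustrative):

```python
def min_max_scale(x, x_min, x_max):
    """Rescale x linearly so that x_min maps to 0 and x_max maps to 1."""
    return (x - x_min) / (x_max - x_min)

# Row 0 of the original data: Pregnancies = 6, Glucose = 148, Age = 50.
scaled_pregnancies = round(min_max_scale(6, 0, 17), 6)    # column range 0 to 17
scaled_glucose = round(min_max_scale(148, 0, 199), 6)     # column range 0 to 199
scaled_age = round(min_max_scale(50, 21, 81), 6)          # column range 21 to 81

print(scaled_pregnancies, scaled_glucose, scaled_age)
```

The three results agree with the Pregnancies, Glucose and Age entries of row 0 in the scaled table.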
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=Diabates_data)
plt.title('Scatter Plot of Glucose vs. BMI by Outcome')
plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.show()
Summary: Age and Pregnancies have the highest correlation (0.54). The dataset contains only female patients, so it is biased toward that
group and the findings may not generalize. Diabetic patients have a higher mean number of pregnancies (about 4.9 vs 3.3), which suggests an
association between pregnancies and diabetes, although correlation alone does not show cause. SkinThickness and Insulin are also positively
correlated: patients with higher skin thickness tend to have higher insulin values. This is an association between two measurements, not
evidence that one causes the other.
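Rather than reading the strongest pair off the heatmap by eye, it can be looked up from the correlation matrix. The following is a sketch under stated assumptions: the helper `strongest_pair` and the frame `demo` are hypothetical, the data is made up, and all columns are assumed to be numeric and non-constant.

```python
import numpy as np
import pandas as pd

def strongest_pair(df):
    """Return the two column names with the largest absolute correlation, and that value."""
    corr = df.corr().abs().to_numpy()
    upper = np.triu(corr, k=1)   # keep each pair once and zero out the diagonal
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    return df.columns[i], df.columns[j], float(upper[i, j])

# Made-up data: b follows a closely, c does not.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

first, second, value = strongest_pair(demo)
print(first, second, round(value, 3))
```

On the real data the call would be strongest_pair(Diabates_data.drop('Outcome', axis=1)) to look only at feature pairs.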
import random
sm, la = [], []
while len(sm) < 2 or len(la) < 2:
    num = random.randint(0, 1000)