Data Mining Lab - Ipynb - Colab

The document is a Jupyter notebook detailing a Data Mining Lab project focused on analyzing diabetes data using Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn. It includes data importation, exploratory data analysis, statistical summaries, and visualizations to understand the relationships between various health metrics and diabetes outcomes. Key findings indicate correlations between features like skin thickness and insulin levels, and the dataset consists of 768 entries with no missing values.

9/11/25, 11:51 PM Data Mining Lab.ipynb - Colab

Name : Syed Muhammad Hassan Reg No : B23F1000DS056 Course: Data Mining Lab Program : Data Science

Import libraries. NumPy is used for numerical computation. pandas builds on NumPy and adds labeled, tabular data structures that make loading and analyzing data easier. Matplotlib and seaborn are used for visualization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Diabates_data= pd.read_csv('/content/diabetes.csv')

Diabates_data.head()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1


This dataset describes female patients. It records glucose level, blood pressure, skin thickness (measured at the triceps), and several other health metrics.

head() prints the first 5 rows; tail() does the reverse and shows the last 5 rows.

Diabates_data.tail()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

763 10 101 76 48 180 32.9 0.171 63 0

764 2 122 70 27 0 36.8 0.340 27 0

765 5 121 72 23 112 26.2 0.245 30 0

766 1 126 60 0 0 30.1 0.349 47 1

767 1 93 70 31 0 30.4 0.315 23 0

Diabates_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

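The describe() function shows summary statistics for each column: count, mean, standard deviation (std), minimum, the 25%, 50% and 75% quartiles (the 50% row is the median), and maximum. It does not report the mode.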

Diabates_data.describe()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A

count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.0000

mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.2408

std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.7602

min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.0000

25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.0000

50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.0000

75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.0000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.0000
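The statistics in this table can be reproduced by hand, and the mode, which describe() leaves out, can be computed separately. The following is a minimal sketch using Python's standard `statistics` module on a small made-up list; the values are illustrative assumptions, not taken from the dataset.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [1, 3, 3, 6, 8, 3, 0, 10]

mean = statistics.mean(values)      # arithmetic mean
std = statistics.stdev(values)      # sample standard deviation (divides by n - 1)
median = statistics.median(values)  # corresponds to the 50% row of describe()
mode = statistics.mode(values)      # most frequent value; not part of describe()

print(mean, round(std, 3), median, mode)
```

On the full DataFrame, `Diabates_data.median()` and `Diabates_data.mode()` return the same two statistics per column.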


Diabates_data.shape

(768, 9)

shape returns the number of rows and columns as a tuple: here there are 768 rows and 9 columns.

The dataset has no null values: every column has a null count of 0.

Diabates_data.isnull().sum()

Pregnancies 0

Glucose 0

BloodPressure 0

SkinThickness 0

Insulin 0

BMI 0

DiabetesPedigreeFunction 0

Age 0

Outcome 0

dtype: int64
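isnull() only detects values stored as NaN. The describe() output shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, and a zero in those columns may be a placeholder for a missing measurement; that interpretation is an assumption, since the notebook does not say how missing readings were recorded. A minimal sketch that counts zeros per column, using only the first five rows shown by head():

```python
# First five rows from the head() output, for the columns where a value
# of 0 is physiologically implausible. Treating 0 as "missing" is an
# assumption, not something the dataset documents here.
head_rows = {
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
}

# Count how many entries in each column are exactly zero.
zero_counts = {
    col: sum(1 for v in vals if v == 0) for col, vals in head_rows.items()
}
print(zero_counts)
```

On the full DataFrame the same count can be obtained with `(Diabates_data[cols] == 0).sum()`, where `cols` is a list of the column names above.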

value_counts() counts how many records carry each label (0 and 1).

Diabates_data['Outcome'].value_counts()

count

Outcome

0 500

1 268

dtype: int64

0 means non-diabetic and 1 means diabetic, so 500 patients are non-diabetic and 268 are diabetic.
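The counts translate directly into class proportions, which show how imbalanced the two labels are. A short sketch of the arithmetic, using the two counts from the output above:

```python
# Label counts taken from the value_counts() output.
counts = {0: 500, 1: 268}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"Outcome {label}: {share:.1%}")
```

With pandas, `Diabates_data['Outcome'].value_counts(normalize=True)` returns the same fractions directly.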

Grouping by Outcome gives the mean of every feature separately for non-diabetic (0) and diabetic (1) patients. The clearest difference is in glucose (about 110 vs. 141), while mean blood pressure differs only slightly (about 68.2 vs. 70.8).

Diabates_data.groupby('Outcome').mean()


Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A

Outcome

0 3.298000 109.980000 68.184000 19.664000 68.792000 30.304200 0.429734 31.1900

1 4.865672 141.257463 70.824627 22.164179 100.335821 35.142537 0.550500 37.0671
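Because the features are on different scales, a relative difference makes the group means easier to compare. The sketch below takes a few of the means from the table above and expresses the diabetic mean as a fractional increase over the non-diabetic mean; the choice of this particular measure is an assumption made for illustration.

```python
# Group means copied from the groupby('Outcome').mean() output:
# (mean for Outcome 0, mean for Outcome 1).
group_means = {
    "Pregnancies": (3.298000, 4.865672),
    "Glucose": (109.980000, 141.257463),
    "BloodPressure": (68.184000, 70.824627),
    "Insulin": (68.792000, 100.335821),
    "BMI": (30.304200, 35.142537),
}

# Fractional increase of the diabetic mean over the non-diabetic mean.
rel_diff = {
    name: (diabetic - healthy) / healthy
    for name, (healthy, diabetic) in group_means.items()
}

# Print features from largest to smallest relative difference.
for name, diff in sorted(rel_diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {diff:+.1%}")
```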

A negative value in a correlation heatmap indicates a negative correlation, that is, an inverse relationship between the two variables it connects. A value of -1 is a perfect negative correlation: the variables move in exact opposition.

A positive value indicates that the two variables tend to increase together, and the closer the value is to +1, the stronger the relationship. In this diabetes dataset, SkinThickness and Insulin have a relatively strong positive relationship, while Age and SkinThickness are only weakly related, meaning age tells us little about skin thickness.

plt.figure(figsize=(10,8))
sns.heatmap(Diabates_data.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Diabetes Dataset')
plt.show()
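Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. A minimal sketch of that formula in plain Python follows; the helper name `pearson_r` and the sample lists are hypothetical, chosen only to show the +1, -1 and in-between cases.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves exactly with x
falling = [10, 8, 6, 4, 2]   # moves exactly against x
noisy = [3, 1, 4, 1, 5]      # only loosely related to x

print(pearson_r(x, rising), pearson_r(x, falling), round(pearson_r(x, noisy), 3))
```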

The histograms show the distribution of each feature: where values peak and where they are sparse.

Diabates_data.hist(figsize=(15, 10))
plt.tight_layout()


plt.show()

In the box plot, the dots mark outliers: individual patient values that fall beyond the whiskers of a feature's box.

plt.figure(figsize=(15, 10))
Diabates_data.boxplot()
plt.title('Boxplot of Diabetes Dataset Features')
plt.ylabel('Value')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
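By default, Matplotlib draws the whiskers out to 1.5 times the interquartile range (IQR) beyond the box, and any point outside that range appears as a separate dot. A minimal sketch of that rule with the standard `statistics` module, on a made-up list of values (the numbers are assumptions for illustration, not dataset rows):

```python
import statistics

# Hypothetical measurements for one feature.
values = [0, 14, 18, 23, 27, 29, 31, 35, 40, 99]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(q1, q3, lower, upper, outliers)
```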


The pie chart gives a clear picture of the ratio of diabetic to non-diabetic patients.

outcome_counts = Diabates_data['Outcome'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(outcome_counts, labels=['Non-Diabetic (0)', 'Diabetic (1)'], autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightc
plt.title('Distribution of Diabetes Outcome')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

We can compare the mean blood pressure, skin thickness, and other features of diabetic and non-diabetic patients.

Diabates_data.groupby('Outcome').mean().plot(kind='bar', figsize=(12, 6))


plt.title('Mean of Features by Diabetes Outcome')


plt.ylabel('Mean Value')
plt.xticks(rotation=0)
plt.show()

sns.countplot(x='Outcome', data=Diabates_data)
plt.title('Count of Diabetes Outcome')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.xticks([0, 1], ['Non-Diabetic', 'Diabetic'])
plt.show()

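A scatter plot shows how the data points are spread across the plane. We could apply a clustering algorithm such as k-means to divide the points into clusters; note that k-nearest neighbors (KNN) is a classification method, not a clustering one.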

This code scales all the feature columns (such as Glucose and BMI) of the diabetes dataset to the range 0 to 1, a common preprocessing step for machine learning algorithms that are sensitive to feature scale. The Outcome label is left unscaled.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(Diabates_data.drop('Outcome', axis=1))

scaled_df = pd.DataFrame(scaled_data, columns=Diabates_data.drop('Outcome', axis=1).columns)

scaled_df['Outcome'] = Diabates_data['Outcome']

display(scaled_df.head())

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome

0 0.352941 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 0.483333 1

1 0.058824 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 0.166667 0

2 0.470588 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 0.183333 1

3 0.058824 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 0.000000 0

4 0.000000 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 0.200000 1
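Min-max scaling maps each value to (value - min) / (max - min) using that column's own minimum and maximum. The sketch below checks this formula against row 0 of the scaled table, taking the raw values from head() and the column minima and maxima from describe(); the helper name `min_max_scale` is hypothetical.

```python
def min_max_scale(value, col_min, col_max):
    """Map value into [0, 1] relative to the column's min and max."""
    return (value - col_min) / (col_max - col_min)

# (raw value in row 0, column min, column max) from head() and describe().
row0 = {
    "Pregnancies": (6, 0, 17),
    "Glucose": (148, 0, 199),
    "BMI": (33.6, 0.0, 67.1),
    "Age": (50, 21, 81),
}

scaled = {
    name: round(min_max_scale(value, lo, hi), 6)
    for name, (value, lo, hi) in row0.items()
}
print(scaled)
# {'Pregnancies': 0.352941, 'Glucose': 0.743719, 'BMI': 0.500745, 'Age': 0.483333}
```

The results match the first row of the scaled output above.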

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=Diabates_data)
plt.title('Scatter Plot of Glucose vs. BMI by Outcome')
plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.show()

Summary: Age and Pregnancies have the highest correlation (0.54). The dataset contains only female patients, so it is biased and its findings may not generalize. The correlation between age and pregnancies does not by itself show that pregnancy raises the chance of diabetes, although the diabetic group does have a higher mean number of pregnancies (about 4.87 vs. 3.30). SkinThickness and Insulin are also positively correlated, meaning higher values of one tend to accompany higher values of the other. This is an association between two features and does not by itself establish a cause of diabetes.

import random

sm, la = [], []
while len(sm) < 2 or len(la) < 2:
    num = random.randint(0, 1000)
