
Data Mining Lab - Ipynb - Colab

The document is a Jupyter notebook detailing a Data Mining Lab project focused on analyzing diabetes data using Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn. It includes data importation, exploratory data analysis, statistical summaries, and visualizations to understand the relationships between various health metrics and diabetes outcomes. Key findings indicate correlations between features like skin thickness and insulin levels, and the dataset consists of 768 entries with no missing values.

Uploaded by

Syed Mazhar Hussain Jaffery


9/11/25, 11:51 PM Data Mining Lab.ipynb - Colab

Name: Syed Muhammad Hassan
Reg No: B23F1000DS056
Course: Data Mining Lab
Program: Data Science

Import libraries. NumPy is used for numerical computation. pandas builds on NumPy and adds easy-to-use tabular data structures (DataFrames) for loading, cleaning, and summarizing data. Matplotlib and Seaborn are used for visualization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Diabates_data = pd.read_csv('/content/diabetes.csv')

Diabates_data.head()

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
  0            6      148             72             35        0  33.6                     0.627   50        1
  1            1       85             66             29        0  26.6                     0.351   31        0
  2            8      183             64              0        0  23.3                     0.672   32        1
  3            1       89             66             23       94  28.1                     0.167   21        0
  4            0      137             40             35      168  43.1                     2.288   33        1


The dataset records health measurements for female patients, including glucose level, blood pressure, and skin thickness (measured at the triceps).

head() prints the first 5 rows; to see the other end of the table, tail() prints the last 5 rows.

Diabates_data.tail()

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

Diabates_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

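The describe() function shows summary statistics for each column: count, mean, standard deviation, minimum, the quartiles (the 50% row is the median), and maximum. It does not report the mode.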

Diabates_data.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         A
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.0000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.2408
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.7602
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.0000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.0000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.0000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.0000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.0000
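
Since describe() reports the median only as the 50% row and does not report the mode, both can be computed directly. The following is a minimal sketch rather than part of the original notebook; it assumes a small illustrative DataFrame named `sample` built from three columns of the five head() rows shown earlier.

```python
import pandas as pd

# Illustrative data: three columns from the five rows printed by head().
sample = pd.DataFrame({
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})

# Median per column; this matches the "50%" row of describe().
medians = sample.median()

# Mode per column; mode() returns a DataFrame, so take its first row.
modes = sample.mode().iloc[0]

print(medians)
print(modes)
```

On the full dataset the same two calls can be applied to Diabates_data.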


Diabates_data.shape

(768, 9)

The shape attribute gives the number of rows and columns: here 768 rows and 9 columns.

The dataset has no null (NaN) values:

Diabates_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
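
isnull() only detects values stored as NaN. The describe() output above shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is implausible for most of these measurements, so zeros may be standing in for missing readings. The sketch below counts and recodes them under that assumption; it is not part of the original notebook, and `sample` and `suspect_cols` are illustrative names built from the five head() rows.

```python
import pandas as pd

# Illustrative data: four columns from the five rows printed by head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
suspect_cols = ["Glucose", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column.
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)

# Recode zeros as NaN so that isnull() and mean() treat them as missing.
cleaned = sample[suspect_cols].mask(sample[suspect_cols] == 0)
print(cleaned.isnull().sum())
```

Whether to recode zeros this way is a modelling decision; the same check can be run on Diabates_data before deciding.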

value_counts() counts how many rows carry each Outcome label (0 and 1).

Diabates_data['Outcome'].value_counts()

         count
Outcome
0          500
1          268
dtype: int64

Label 0 means non-diabetic and label 1 means diabetic, so 500 patients are non-diabetic and 268 are diabetic.
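
The counts above can also be read as class proportions, which shows how imbalanced the labels are. This is a small sketch using only the two counts reported above; the variable names are illustrative.

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({0: 500, 1: 268}, name="count")

# Share of each class in the dataset.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the DataFrame itself, Diabates_data['Outcome'].value_counts(normalize=True) gives the same proportions. The baseline is a useful reference point: a classifier should beat it to be worth using.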

Grouping by Outcome gives the mean of every feature separately for non-diabetic (0) and diabetic (1) patients. The clearest difference is in glucose (about 110 vs. 141), while blood pressure differs only slightly (about 68 vs. 71).

Diabates_data.groupby('Outcome').mean()


         Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction        A
Outcome
0           3.298000  109.980000      68.184000      19.664000   68.792000  30.304200                  0.429734  31.1900
1           4.865672  141.257463      70.824627      22.164179  100.335821  35.142537                  0.550500  37.0671
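
Because the features are on different scales, the raw gap between group means can be hard to compare across columns. One option, sketched here as an illustration rather than as the notebook's method, is the relative difference of the diabetic mean with respect to the non-diabetic mean. The values are copied from the table above; the last column is left out because it is truncated in the printout.

```python
import pandas as pd

# Group means copied from the groupby output above.
means = pd.DataFrame(
    {
        "Pregnancies": [3.298000, 4.865672],
        "Glucose": [109.980000, 141.257463],
        "BloodPressure": [68.184000, 70.824627],
        "SkinThickness": [19.664000, 22.164179],
        "Insulin": [68.792000, 100.335821],
        "BMI": [30.304200, 35.142537],
        "DiabetesPedigreeFunction": [0.429734, 0.550500],
    },
    index=pd.Index([0, 1], name="Outcome"),
)

# (diabetic mean - non-diabetic mean) / non-diabetic mean, per feature.
relative_diff = (means.loc[1] - means.loc[0]) / means.loc[0]
print(relative_diff.sort_values(ascending=False).round(3))
```

A relative difference is only a rough descriptive measure; it ignores the spread within each group, and means of columns such as Insulin are affected by the many zero entries.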

A negative value in a correlation matrix heatmap indicates an inverse relationship between the two variables it connects: as one increases, the other tends to decrease. A value of -1 is a perfect negative correlation, where the variables move in exact opposition.

A positive value means the two variables tend to increase together, and the closer it is to +1 the stronger that linear relationship. Correlation describes the relationship between two features; it is not related to true positives or to actual versus predicted values. In this dataset, SkinThickness and Insulin have a comparatively strong positive correlation, while Age and SkinThickness are only weakly related, meaning age tells us little about skin thickness.

plt.figure(figsize=(10,8))
sns.heatmap(Diabates_data.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Diabetes Dataset')
plt.show()
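
Each cell of the heatmap is a Pearson correlation coefficient. To make that concrete, the sketch below computes one coefficient by hand and compares it with DataFrame.corr(). It is an illustration, not part of the original notebook: `sample` holds only the ten head() and tail() rows printed earlier, so its coefficient will differ from the value on the full dataset, and `pearson` is a helper name introduced here.

```python
import numpy as np
import pandas as pd

# Illustrative data: Pregnancies and Age from the head() and tail() rows.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 10, 2, 5, 1, 1],
    "Age": [50, 31, 32, 21, 33, 63, 27, 30, 47, 23],
})

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual = pearson(sample["Pregnancies"], sample["Age"])
from_pandas = sample.corr().loc["Pregnancies", "Age"]

# The hand-computed value should agree with pandas.
print(manual, from_pandas)
```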

The histograms show the distribution of each feature: where the values peak and which ranges contain few observations.

Diabates_data.hist(figsize=(15, 10))
plt.tight_layout()

plt.show()

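In the boxplot below, each box shows the quartiles of a feature, and the dots beyond the whiskers are outliers: individual patient values that lie far from the bulk of that feature's distribution.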

plt.figure(figsize=(15, 10))
Diabates_data.boxplot()
plt.title('Boxplot of Diabetes Dataset Features')
plt.ylabel('Value')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
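
With the default settings, a boxplot draws a point as a separate dot when it lies more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below works through that rule for Insulin, using the quartiles and maximum from the describe() output above; the variable names are illustrative.

```python
# Quartiles and maximum of Insulin taken from the describe() output.
q1, q3 = 0.0, 127.25
observed_max = 846.0

# Interquartile range and the default 1.5 * IQR whisker limits.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
print("max is drawn as an outlier:", observed_max > upper_fence)
```

Because all features share one y-axis in this plot, wide-ranging columns such as Insulin dominate it; plotting scaled data or one feature per panel may be easier to read.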


The pie chart gives a clear picture of the ratio of diabetic to non-diabetic patients.

outcome_counts = Diabates_data['Outcome'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(outcome_counts, labels=['Non-Diabetic (0)', 'Diabetic (1)'], autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightc
plt.title('Distribution of Diabetes Outcome')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

The bar chart lets us compare the mean of each feature, such as blood pressure and skin thickness, between diabetic and non-diabetic patients.

Diabates_data.groupby('Outcome').mean().plot(kind='bar', figsize=(12, 6))

plt.title('Mean of Features by Diabetes Outcome')

plt.ylabel('Mean Value')
plt.xticks(rotation=0)
plt.show()

sns.countplot(x='Outcome', data=Diabates_data)
plt.title('Count of Diabetes Outcome')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.xticks([0, 1], ['Non-Diabetic', 'Diabetic'])
plt.show()

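The scatter plot shows how the data points are spread across the plane. Note that k-nearest neighbors (KNN) is a classification method, not a clustering one: KNN could be trained on these points to predict the outcome, while a clustering algorithm such as k-means could divide them into groups.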
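
To illustrate how a k-nearest-neighbors classifier would use points like these, here is a minimal NumPy sketch. It is not part of the original notebook: the training set is only the ten head() and tail() rows printed earlier (Glucose, BMI, Outcome), `knn_predict` is a helper introduced here, and the two query patients are hypothetical.

```python
import numpy as np

# Illustrative training data: (Glucose, BMI) from the head() and tail() rows.
X = np.array([
    [148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1],
    [101, 32.9], [122, 36.8], [121, 26.2], [126, 30.1], [93, 30.4],
])
# Matching Outcome labels (1 = diabetic, 0 = non-diabetic).
y = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

def knn_predict(X_train, y_train, query, k=3):
    """Return the majority label among the k nearest training points."""
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = np.bincount(y_train[nearest], minlength=2)   # votes per label
    return int(votes.argmax())

# Hypothetical patients, not taken from the dataset.
high_glucose = knn_predict(X, y, [150, 35.0])
low_glucose = knn_predict(X, y, [90, 28.0])
print(high_glucose, low_glucose)
```

In practice the features would be scaled first so that Glucose does not dominate the distance, and the classifier would be fitted on the full dataset with a held-out test split.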

This code scales all the feature columns (like Glucose, BMI) of the diabetes dataset to a range between 0 and 1, which is a common
requirement for machine learning algorithms.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(Diabates_data.drop('Outcome', axis=1))

scaled_df = pd.DataFrame(scaled_data, columns=Diabates_data.drop('Outcome', axis=1).columns)

scaled_df['Outcome'] = Diabates_data['Outcome']

display(scaled_df.head())

   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age  Outcome
0     0.352941  0.743719       0.590164       0.353535  0.000000  0.500745                  0.234415  0.483333        1
1     0.058824  0.427136       0.540984       0.292929  0.000000  0.396423                  0.116567  0.166667        0
2     0.470588  0.919598       0.524590       0.000000  0.000000  0.347243                  0.253629  0.183333        1
3     0.058824  0.447236       0.540984       0.232323  0.111111  0.418778                  0.038002  0.000000        0
4     0.000000  0.688442       0.327869       0.353535  0.198582  0.642325                  0.943638  0.200000        1
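
Min-max scaling maps each value to (x - min) / (max - min) for its column. The sketch below reproduces three of the scaled values in row 0 by hand, using the raw values from head() and the column minima and maxima from describe(); `min_max_scale` is a helper name introduced here for illustration.

```python
def min_max_scale(value, col_min, col_max):
    """Map value from the range [col_min, col_max] onto [0, 1]."""
    return (value - col_min) / (col_max - col_min)

# Row 0 raw values, with min and max taken from the describe() output.
glucose = min_max_scale(148, 0.0, 199.0)
bmi = min_max_scale(33.6, 0.0, 67.1)
pregnancies = min_max_scale(6, 0.0, 17.0)

# Rounded to six decimals for comparison with row 0 of the scaled table.
print(round(glucose, 6), round(bmi, 6), round(pregnancies, 6))
```

The results can be compared against row 0 of the scaled table above, which is a quick way to confirm what the scaler did to each column.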

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=Diabates_data)
plt.title('Scatter Plot of Glucose vs. BMI by Outcome')
plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.show()

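Summary: Age and Pregnancies have the highest correlation (0.54), which is expected because older women have had more time to have pregnancies. The dataset contains only female patients, so it is biased and its conclusions may not generalize to other populations. Diabetic patients have a higher mean number of pregnancies (4.87 vs. 3.30), which suggests an association between pregnancies and diabetes, although correlation alone does not show that pregnancy raises the risk. SkinThickness and Insulin are also positively correlated, meaning patients with higher skin thickness tend to have higher insulin values; this is a statistical relationship between two features and does not by itself say anything about treatment or about the cause of diabetes.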

import random

sm, la = [], []
while len(sm) < 2 or len(la) < 2:
    num = random.randint(0, 1000)

