Diabetes Prediction with Logistic Regression

The document discusses building a logistic regression model to predict diabetes using health parameter data. It defines the problem, imports necessary libraries, builds the dataset by loading data, exploring it, and splitting it into training and test sets. Finally, it trains a logistic regression model on the training set.

Uploaded by

Yagnesh Vyas

logisticregression-1

June 8, 2023

1 Health Parameter Analysis and Diabetes Prediction using Logistic Regression

Logistic regression is a useful model for predicting binary outcomes, where there are only two
possible classes. It is commonly used because it gives interpretable coefficients, estimates class
probabilities, and is computationally efficient. It makes fewer distributional assumptions than many
classical alternatives, but it models the log-odds of the outcome as a linear function of the
features. Non-linear relationships are therefore captured only when transformed or interaction
features are added, and extreme outliers can still distort the fitted coefficients. It also supports
regularization techniques to reduce overfitting. Logistic regression is a well-studied and established
model in statistics and machine learning. However, it may not be suitable for all classification
problems, especially those with highly nonlinear relationships.
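
As a brief illustration of how the model turns inputs into probabilities, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights, intercept, and feature values are made-up numbers chosen only for illustration; they are not coefficients fitted to the diabetes data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and inputs, for illustration only
weights = [0.03, 0.08]      # assumed coefficients for two features
intercept = -6.0            # assumed intercept
features = [148, 33.6]      # e.g. a glucose value and a BMI value

p = predict_probability(features, weights, intercept)
label = 1 if p >= 0.5 else 0    # 0.5 is the usual default threshold
print(f"probability = {p:.3f}, predicted class = {label}")
```

The class label is obtained by thresholding the probability, which is why the same model can report both a probability and a 0/1 prediction.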

1.1 Define the Problem :


The problem is to create a logistic regression model based on the provided dataset that can
predict the outcome of diabetes based on health parameters. The dataset contains various columns,
including the outcome column, which represents whether a person has diabetes or not. The goal is
to train the model using the training set and evaluate its performance on the testing set using the
confusion matrix and accuracy score.

1.2 Importing necessary libraries :


[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

1.3 Build the dataset:


To build the dataset, you need to perform the following steps:

a) Load the dataset using pandas: Use the pandas library to load the dataset from the
'diabetes.csv' file.

[2]: data = pd.read_csv("diabetes.csv")
data.head(5)

[2]:    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
     0            6      148             72             35        0  33.6
     1            1       85             66             29        0  26.6
     2            8      183             64              0        0  23.3
     3            1       89             66             23       94  28.1
     4            0      137             40             35      168  43.1

        DiabetesPedigreeFunction  Age  Outcome
     0                     0.627   50        1
     1                     0.351   31        0
     2                     0.672   32        1
     3                     0.167   21        0
     4                     2.288   33        1

b) Explore the dataset: Exploring the dataset is crucial for understanding its structure,
identifying missing values, data distribution, correlations, and outliers, as well as for making
informed decisions regarding data preprocessing and feature selection. It provides insights that
guide data analysis and model building.
[3]: data.shape # Checking the number of rows and columns in the dataset

[3]: (768, 9)

[4]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data.info() - Checking the information about the dataset, including the number of entries, the
number of columns, column names, data types of each column, and any missing values. It will help
in understanding the structure and properties of the dataset.

[5]: data.describe()

[5]:        Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
     count   768.000000  768.000000     768.000000     768.000000  768.000000
     mean      3.845052  120.894531      69.105469      20.536458   79.799479
     std       3.369578   31.972618      19.355807      15.952218  115.244002
     min       0.000000    0.000000       0.000000       0.000000    0.000000
     25%       1.000000   99.000000      62.000000       0.000000    0.000000
     50%       3.000000  117.000000      72.000000      23.000000   30.500000
     75%       6.000000  140.250000      80.000000      32.000000  127.250000
     max      17.000000  199.000000     122.000000      99.000000  846.000000

                   BMI  DiabetesPedigreeFunction         Age     Outcome
     count  768.000000                768.000000  768.000000  768.000000
     mean    31.992578                  0.471876   33.240885    0.348958
     std      7.884160                  0.331329   11.760232    0.476951
     min      0.000000                  0.078000   21.000000    0.000000
     25%     27.300000                  0.243750   24.000000    0.000000
     50%     32.000000                  0.372500   29.000000    0.000000
     75%     36.600000                  0.626250   41.000000    1.000000
     max     67.100000                  2.420000   81.000000    1.000000

data.describe() - Checks the descriptive statistics for the numerical columns in the dataset. The
statistics include count, mean, standard deviation, minimum value, 25th percentile (Q1), median
(50th percentile or Q2), 75th percentile (Q3), and maximum value. It provides a summary of the
central tendency, spread, and distribution of the numerical data.
[6]: data.isnull().sum() # Checking if the dataset contains any null values.

[6]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
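
Note that isnull() only detects values stored as NaN. The data.describe() output shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and such zeros may be placeholders for missing measurements rather than real readings (an assumption worth checking against the dataset's documentation). A minimal sketch of counting them, using only the five rows printed by data.head(5) as a stand-in for the full file:

```python
import pandas as pd

# Stand-in for the full dataset: the five rows shown by data.head(5)
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count zero entries per column
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the real data, the same expression applied to the corresponding columns of data would show how many rows are affected before deciding whether any preprocessing is needed.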

[7]: num_duplicates = data.duplicated().sum() # Checking if the dataset contains any duplicate rows

if num_duplicates > 0:
    print(f"The dataset contains {num_duplicates} duplicate values")
    # drop_duplicates must be called; without the parentheses, data would become a method object
    data = data.drop_duplicates()
    # Recount after dropping instead of printing the stale count
    print("Number of duplicate values after dropping:", data.duplicated().sum())
else:
    print("The dataset doesn't contain any duplicate values.")

The dataset doesn't contain any duplicate values.

[8]: data.hist(figsize=(10, 8)) # Checking Data Distribution


plt.tight_layout()
plt.show()

c) Extract data from every column except the outcome column as a variable named
X: Extract the data from all columns except the 'Outcome' column and assign them to a variable
called X.
[9]: X = data.iloc[:,:-1]
X.head(5)

[9]:    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
     0            6      148             72             35        0  33.6
     1            1       85             66             29        0  26.6
     2            8      183             64              0        0  23.3
     3            1       89             66             23       94  28.1
     4            0      137             40             35      168  43.1

        DiabetesPedigreeFunction  Age
     0                     0.627   50
     1                     0.351   31
     2                     0.672   32
     3                     0.167   21
     4                     2.288   33

d) Extract data from the outcome column as a variable named Y: Extract the values
from the 'Outcome' column and assign them to a variable called Y.
[10]: Y = data.iloc[:,-1]
Y.head(5)

[10]: 0 1
1 0
2 1
3 0
4 1
Name: Outcome, dtype: int64

e) Divide the dataset into two parts for training and testing: Split the dataset into a
training set and a testing set in a 70% - 30% proportion. This will be used to train the model on
the training set and evaluate its performance on the testing set.
[11]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=51)

[12]: print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(537, 8)
(231, 8)
(537,)
(231,)
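
These shapes follow from the split arithmetic. When test_size is a fraction, scikit-learn rounds the test-set size up and gives the remaining rows to the training set. A quick check in plain Python, using the row count reported by data.shape:

```python
import math

n_samples = 768          # rows in the dataset, from data.shape
test_size = 0.30         # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # about 230.4, rounded up
n_train = n_samples - n_test                # remaining rows

print(n_train, n_test)
```

The split is therefore slightly more than 30% test data (231 of 768 rows) rather than an exact 70/30 division.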

1.4 Train the model:


[13]: logistic = LogisticRegression()
logistic.fit(X_train, Y_train)

[13]: LogisticRegression()

1.5 Evaluate the model :
[14]: Y_predict = logistic.predict(X_test)
print("Y_predict:\n",Y_predict)

Y_predict:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1
0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0
1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 1 1 1 0
0 0 0 0 1 0 1 0 0]

[15]: print("Y_test:\n",Y_test)

Y_test:
737 0
505 0
296 1
711 0
329 0
..
405 0
315 0
131 1
364 0
322 1
Name: Outcome, Length: 231, dtype: int64

[16]: score = accuracy_score(Y_test, Y_predict)


print("Accuracy Score: ",score * 100)

Accuracy Score: 79.22077922077922

[17]: cm = confusion_matrix(Y_test, Y_predict) # separate name so the imported function is not overwritten

print("Confusion Matrix : \n", cm)

Confusion Matrix :
[[131 11]
[ 37 52]]
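
The confusion matrix can be read cell by cell. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so for labels 0 and 1 the matrix is [[TN, FP], [FN, TP]]. The sketch below recomputes accuracy from those four counts and adds precision and recall, which the notebook itself does not report:

```python
# Counts taken from the confusion matrix printed by the notebook
tn, fp = 131, 11   # actual class 0: correctly rejected, falsely flagged
fn, tp = 37, 52    # actual class 1: missed, correctly detected

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The accuracy agrees with the reported score of about 79.22%. The recall makes visible that 37 of the 89 actual diabetes cases in the test set were predicted as negative, which accuracy alone does not show.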

1.6 Visually Understanding the performance of the model


[18]: # Create a figure and axis
fig, ax = plt.subplots()

# Plot the actual outcomes
ax.scatter(range(len(Y_test)), Y_test, color='blue', label='Actual Outcome')

# Plot the predicted outcomes
ax.scatter(range(len(Y_predict)), Y_predict, color='red', label='Predicted Outcome')

# Set axis labels and title
ax.set_xlabel('Data Point')
ax.set_ylabel('Outcome')
ax.set_title('Actual vs. Predicted Outcomes')

# Add a legend
ax.legend()

# Show the plot
plt.show()

1.7 Use the model:
Once the model is trained and evaluated, you can use it to make predictions on new, unseen data.
This can be done by providing new input values to the model and using the predict function to
obtain the predicted outcome.
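
A minimal sketch of that step follows. To stay self-contained it fits a small model on the five rows printed by data.head(5) instead of the real training set, and the new patient's values are invented for illustration. In the notebook itself you would skip the fitting and call predict on the already trained logistic object.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Toy stand-in for X_train / Y_train: the five rows shown by data.head(5)
X_toy = pd.DataFrame([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
], columns=columns)
Y_toy = pd.Series([1, 0, 1, 0, 1], name="Outcome")

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, Y_toy)

# Hypothetical new patient; same columns in the same order as the training data
new_patient = pd.DataFrame([[2, 120, 70, 25, 80, 30.0, 0.4, 35]], columns=columns)

predicted_class = model.predict(new_patient)[0]            # 0 or 1
predicted_proba = model.predict_proba(new_patient)[0, 1]   # probability of class 1
print("Predicted outcome:", predicted_class)
print("Estimated probability of diabetes:", round(predicted_proba, 3))
```

Passing the new values as a DataFrame with the training column names keeps the feature order explicit; predict_proba is useful when a probability is wanted rather than only the 0/1 label.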
