
Python ML for Healthcare Data

Uploaded by

umadataengg

Here’s an alternative Python code example that showcases a different approach to
data handling, preprocessing, and visualization for a healthcare-related machine
learning task. This example uses a synthetic dataset, applies different
preprocessing techniques, and implements a logistic regression model.

Alternative Python Code Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import seaborn as sns

# Sample dataset generation
np.random.seed(42)
data_size = 1000

# Creating a DataFrame with synthetic data
data = {
    'age': np.random.randint(20, 80, data_size),
    'gender': np.random.choice(['Male', 'Female'], data_size),
    'blood_pressure': np.random.randint(80, 180, data_size),
    'cholesterol': np.random.randint(150, 300, data_size),
    'outcome': np.random.choice([0, 1], data_size)  # 0: No Disease, 1: Disease
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
print("Dataset Preview:")
print(df.head())

# Data Preprocessing
# Convert categorical variable 'gender' to numeric
label_encoder = LabelEncoder()
# LabelEncoder assigns codes in alphabetical order, so Female: 0, Male: 1
df['gender'] = label_encoder.fit_transform(df['gender'])

# Handling missing values (if any); the synthetic data has none, so this is a no-op here
df.fillna(df.mean(), inplace=True)

# Splitting the data into features and target variable
X = df[['age', 'gender', 'blood_pressure', 'cholesterol']]
y = df['outcome']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Development: Logistic Regression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)

# Performance Metrics
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
class_names = ['No Disease', 'Disease']
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Visualizing the dataset distribution
# Note: seaborn 0.13+ emits a FutureWarning when palette is passed without hue;
# on those versions, pass hue='outcome', legend=False to get the same plot.
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='outcome', palette='pastel')
plt.title('Distribution of Disease Outcome')
plt.xlabel('Disease Outcome (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()

# Additional visualization: Age distribution by outcome
plt.figure(figsize=(10, 6))
sns.boxplot(x='outcome', y='age', data=df, palette='Set2')
plt.title('Age Distribution by Disease Outcome')
plt.xlabel('Disease Outcome (0: No, 1: Yes)')
plt.ylabel('Age')
plt.show()

Explanation of the Code


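Dataset Generation: Similar to the previous example, a synthetic healthcare dataset
is generated with features such as age, gender, blood_pressure, and cholesterol,
along with a binary outcome. Every column, including outcome, is sampled
independently at random, so the features carry no real predictive signal.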
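
Because the outcome in the code above is sampled independently of the features, a classifier has nothing to learn from it. The following is a minimal sketch of one way to build signal into the synthetic data. The helper name make_outcome, the weights, and the centering constants are illustrative assumptions, not values from the original example or from clinical evidence.

```python
import numpy as np

def make_outcome(age, blood_pressure, cholesterol, rng):
    """Draw a binary outcome whose probability rises with the risk factors.

    The weights below are hypothetical and chosen only for illustration.
    """
    # Linear score on roughly centered features (assumed weights)
    score = (0.04 * (age - 50)
             + 0.02 * (blood_pressure - 130)
             + 0.01 * (cholesterol - 225))
    # The logistic (sigmoid) function maps the score to a probability in (0, 1)
    prob = 1.0 / (1.0 + np.exp(-score))
    # One Bernoulli draw per patient
    return (rng.random(len(age)) < prob).astype(int)

rng = np.random.default_rng(42)
n = 1000
age = rng.integers(20, 80, n)
blood_pressure = rng.integers(80, 180, n)
cholesterol = rng.integers(150, 300, n)
outcome = make_outcome(age, blood_pressure, cholesterol, rng)

# Older patients should now show a higher disease rate than younger ones
print("Disease rate, age < 50: ", outcome[age < 50].mean())
print("Disease rate, age >= 50:", outcome[age >= 50].mean())
```

With an outcome built this way, the logistic regression model has a pattern to recover, and its accuracy should rise above chance.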

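Data Preprocessing:

The categorical variable gender is converted into numeric format using Label
Encoding. LabelEncoder assigns codes in alphabetical order, so Female becomes 0
and Male becomes 1.
Missing values are handled by filling them with the mean of the respective columns.
The synthetic dataset contains no missing values, so this step changes nothing here.

Splitting the Data: The dataset is divided into features (X) and target variable
(y), then further split into training (80%) and testing (20%) sets.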
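
The encoding and splitting steps can be checked on a small frame. This sketch prints the mapping that LabelEncoder actually learned, then repeats the split with stratify=y. The stratify argument is not used in the original code; it is shown here as an optional variation that keeps the class ratio nearly equal in both splits. The 200-row frame is made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Small made-up frame, used only for this illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], 200),
    "age": rng.integers(20, 80, 200),
    "outcome": rng.choice([0, 1], 200),
})

# LabelEncoder sorts the class labels before assigning integer codes
encoder = LabelEncoder()
df["gender"] = encoder.fit_transform(df["gender"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}

X = df[["age", "gender"]]
y = df["outcome"]

# stratify=y preserves the proportion of each class in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train disease rate:", y_train.mean())
print("Test disease rate: ", y_test.mean())
```

Reading the mapping from encoder.classes_ avoids relying on a hand-written comment about which label received which code.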

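Feature Scaling: The features are standardized using StandardScaler, which
rescales each feature to zero mean and unit variance. The scaler is fitted on the
training set only and then applied to the test set, so no test-set statistics
leak into training.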
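
To make the scaling step concrete, the sketch below standardizes two made-up features and confirms that the test set is transformed with the training-set mean and standard deviation. The arrays are random stand-ins for age and cholesterol, not the data from the example above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two features on different scales (stand-ins for age and cholesterol)
X_train = np.column_stack(
    [rng.integers(20, 80, 800), rng.integers(150, 300, 800)]
).astype(float)
X_test = np.column_stack(
    [rng.integers(20, 80, 200), rng.integers(150, 300, 200)]
).astype(float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print("Train mean per feature:", X_train_scaled.mean(axis=0))  # approximately 0
print("Train std per feature: ", X_train_scaled.std(axis=0))   # approximately 1
print("Test mean per feature: ", X_test_scaled.mean(axis=0))   # near 0, not exactly 0

# The same transformation written out by hand
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print("Matches manual formula:", np.allclose(manual, X_test_scaled))
```

The test-set means are only close to zero because the statistics come from the training set, which is the intended behavior.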

Model Development: A Logistic Regression model is trained on the training set.
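
A fitted logistic regression can be inspected through its coefficients and predicted probabilities. The sketch below uses made-up data in which only the first feature influences the label; that ground truth is an assumption for the illustration and is not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))  # two already-standardized features

# Hypothetical ground truth: only the first feature affects the label
p = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
y = (rng.random(500) < p).astype(int)

model = LogisticRegression(random_state=42)
model.fit(X, y)

print("Coefficients:", model.coef_[0])      # first should dominate the second
print("Intercept:   ", model.intercept_[0])

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X[:5])
pred = model.predict(X[:5])

# For a binary model, P(class 1) is the sigmoid of the linear score
z = X[:5] @ model.coef_[0] + model.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z))
print("P(class 1):  ", proba[:, 1])
print("Predictions: ", pred)
```

On standardized inputs, the size of each coefficient gives a rough indication of how strongly that feature moves the predicted probability.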

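Model Evaluation: The model's performance is evaluated using accuracy score,
classification report, and confusion matrix. On this synthetic data an accuracy
close to 0.5 is the expected result, since the labels are random.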
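
The metrics in the classification report all derive from the four cells of the confusion matrix. The sketch below works this out on ten hand-written labels, which are invented for the illustration, and compares the result with scikit-learn's own functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Ten hand-written labels, used only for this illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# For binary labels 0 and 1, ravel() yields the counts as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 4

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print(accuracy, precision, recall)  # 0.8 0.8 0.8

# The hand-computed values agree with scikit-learn
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))
```

In a screening setting, recall for the disease class is often the number to watch, since a false negative is a missed case.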

Visualization:
A heatmap of the confusion matrix shows how predictions split into correct and
incorrect counts for each class.
A count plot shows the distribution of disease outcomes.
A boxplot visualizes the age distribution for each disease outcome. On real data
this can hint at a relationship between age and health status; on this random
data the two boxes should look nearly identical.

Required Libraries
Ensure you have the necessary libraries installed. Use the following command to
install them if you haven’t already:

pip install numpy pandas matplotlib seaborn scikit-learn

Note
This alternative code showcases a different machine learning approach and
preprocessing techniques. You can further customize it based on your specific
research focus or dataset.
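
One possible customization is to bundle preprocessing and the model into a single scikit-learn Pipeline and score it with cross-validation. This is a sketch of an alternative arrangement, not the method used in the code above: Pipeline, ColumnTransformer, OneHotEncoder, and cross_val_score do not appear in the original example, and OneHotEncoder replaces LabelEncoder for the gender column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same synthetic columns as the main example
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "age": np.random.randint(20, 80, n),
    "gender": np.random.choice(["Male", "Female"], n),
    "blood_pressure": np.random.randint(80, 180, n),
    "cholesterol": np.random.randint(150, 300, n),
    "outcome": np.random.choice([0, 1], n),
})

numeric_cols = ["age", "blood_pressure", "cholesterol"]
categorical_cols = ["gender"]

# Scale numeric columns; encode the binary category as a single 0/1 column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="if_binary"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(random_state=42)),
])

X = df[numeric_cols + categorical_cols]
y = df["outcome"]

# The preprocessing is refitted inside every fold, so no fold sees
# statistics computed from its own held-out rows
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```

Keeping the scaler inside the pipeline means the same fitted object can later be applied to new patient records with a single clf.predict call.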
