Here’s an alternative Python code example that showcases a different approach to
data handling, preprocessing, and visualization for a healthcare-related machine
learning task. This example uses a synthetic dataset, applies different
preprocessing techniques, and implements a logistic regression model.
Alternative Python Code Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import seaborn as sns
# Sample dataset generation
np.random.seed(42)
data_size = 1000
# Creating a DataFrame with synthetic data
data = {
'age': np.random.randint(20, 80, data_size),
'gender': np.random.choice(['Male', 'Female'], data_size),
'blood_pressure': np.random.randint(80, 180, data_size),
'cholesterol': np.random.randint(150, 300, data_size),
'outcome': np.random.choice([0, 1], data_size) # 0: No Disease, 1: Disease
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
print("Dataset Preview:")
print(df.head())
# Data Preprocessing
# Convert categorical variable 'gender' to numeric
label_encoder = LabelEncoder()
df['gender'] = label_encoder.fit_transform(df['gender']) # Female: 0, Male: 1 (labels are sorted alphabetically)
# Handling missing values (if any)
df.fillna(df.mean(), inplace=True)
# Splitting the data into features and target variable
X = df[['age', 'gender', 'blood_pressure', 'cholesterol']]
y = df['outcome']
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Model Development: Logistic Regression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Model Evaluation
y_pred = model.predict(X_test)
# Performance Metrics
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens',
xticklabels=['No Disease', 'Disease'], yticklabels=['No Disease', 'Disease'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Visualizing the dataset distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='outcome', palette='pastel')
plt.title('Distribution of Disease Outcome')
plt.xlabel('Disease Outcome (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()
# Additional visualization: Age distribution by outcome
plt.figure(figsize=(10, 6))
sns.boxplot(x='outcome', y='age', data=df, palette='Set2')
plt.title('Age Distribution by Disease Outcome')
plt.xlabel('Disease Outcome (0: No, 1: Yes)')
plt.ylabel('Age')
plt.show()
Explanation of the Code
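Dataset Generation: Similar to the previous example, a synthetic healthcare dataset
is generated with the features age, gender, blood_pressure, and cholesterol, along
with a binary outcome. Every column is drawn at random and independently of the
others, so the features carry no real signal and the model should be expected to
score close to chance (about 50% accuracy).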
Data Preprocessing:
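The categorical variable gender is converted to numeric form with LabelEncoder,
which assigns codes in sorted label order (Female: 0, Male: 1).
Missing values are filled with the mean of the respective column; the synthetic
data contains none, so this step changes nothing here.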
Splitting the Data: The dataset is divided into features (X) and target variable
(y), then further split into training and testing sets.
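Feature Scaling: The features are standardized with StandardScaler to zero mean and
unit variance. The scaler is fitted on the training set only and then applied to
the test set, so no test-set statistics leak into training.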
Model Development: A Logistic Regression model is trained on the training set.
Model Evaluation: The model's performance is evaluated using accuracy score,
classification report, and confusion matrix.
Visualization:
A heatmap of the confusion matrix shows how many test samples of each actual class
were assigned to each predicted class.
A count plot shows the distribution of disease outcomes.
A boxplot shows the age distribution for each disease outcome. With real data this
can hint at a relationship between age and health status; with this random
synthetic data the two boxes should look nearly identical.
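Two of the preprocessing steps described above are easy to check directly. The following is a minimal sketch on a tiny hand-made frame rather than the dataset from the example; the values and the names toy, encoder, mapping, and train_scaled are illustrative assumptions. It prints the codes that LabelEncoder assigns and confirms that a StandardScaler fitted on the training rows leaves those rows with zero mean and unit variance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny illustrative frame (hypothetical values, not the dataset above)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [25, 40, 55, 70],
})

# LabelEncoder sorts the class labels, so 'Female' comes before 'Male'
encoder = LabelEncoder()
toy["gender"] = encoder.fit_transform(toy["gender"])
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print("Gender mapping:", mapping)

# Fit the scaler on the "training" rows only, then reuse it on the held-out row
train, held_out = toy.iloc[:3], toy.iloc[3:]
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["age"]])
held_out_scaled = scaler.transform(held_out[["age"]])

print("Train mean after scaling:", train_scaled.mean())
print("Train std after scaling:", train_scaled.std())
print("Held-out value after scaling:", held_out_scaled.ravel())
```

The mapping comes out as Female: 0 and Male: 1. The held-out value is scaled with the training mean and standard deviation, which is why it need not fall inside the range of the scaled training values.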
Required Libraries
Ensure you have the necessary libraries installed. Use the following command to
install them if you haven’t already:
pip install numpy pandas matplotlib seaborn scikit-learn
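To confirm that the installation worked, a short check like the one below may help. It is only a sketch: the helper name installed_versions is hypothetical, and it relies on the standard-library importlib.metadata module (Python 3.8 or later) to look up each package by the same name that is passed to pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the command above
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Hypothetical helper: map each package name to its installed version, or None."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment
            found[name] = None
    return found


versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```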
Note
This alternative code showcases a different machine learning approach and
preprocessing techniques. You can further customize it based on your specific
research focus or dataset.
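As one possible customization, and only as a sketch under stated assumptions, the preprocessing and the model can be bundled into a single scikit-learn Pipeline and scored with cross-validation instead of a single train/test split. The use of OneHotEncoder for gender, the five folds, NumPy's default_rng for data generation, and the helper name build_pipeline are illustrative choices that are not part of the original example. The data is again purely random, so accuracy near 0.5 is expected.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data in the same spirit as the example (random, so no real signal)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 80, n),
    "gender": rng.choice(["Male", "Female"], n),
    "blood_pressure": rng.integers(80, 180, n),
    "cholesterol": rng.integers(150, 300, n),
    "outcome": rng.choice([0, 1], n),
})

numeric_cols = ["age", "blood_pressure", "cholesterol"]
categorical_cols = ["gender"]


def build_pipeline():
    """Hypothetical helper: scaling, encoding, and logistic regression in one object."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        # drop="if_binary" keeps a single 0/1 column for the two-valued gender feature
        ("cat", OneHotEncoder(drop="if_binary"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(random_state=42)),
    ])


X = df[numeric_cols + categorical_cols]
y = df["outcome"]

# The pipeline refits the scaler and encoder inside each fold,
# so no statistics from the validation fold leak into training
scores = cross_val_score(build_pipeline(), X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Keeping the preprocessing inside the pipeline means the same fitted object can later be applied to new records with a single predict call, without repeating the encoding and scaling steps by hand.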