Disease Detection Machine Learning Model
• Nouman Nazir — F2023393005
• Komal Shehzadi — F2023436009
Problem Statement and Background
Problem Statement
• Develop and evaluate ML models to predict disease outcomes
• Leverage patient symptoms and profile data for accurate diagnosis
• Enable early and personalized healthcare
Background
• Healthcare data explosion enables ML-powered diagnostics
• Aim to automate disease prediction to support doctors
• Dataset sourced from Kaggle: Comprehensive Disease Symptom and Patient Profile
• Focus: Relationship between patient traits and disease patterns
Methods – Data Preparation
Dataset
• Disease_symptom_and_patient_profile_dataset.csv (Kaggle)
Initial Renaming
• Difficulty Breathing → DB
• Blood Pressure → BP
• Cholesterol Level → CL
• Outcome → Results
Preprocessing Steps
• Label Encoding: Convert categorical features (Yes/No → 1/0)
• Train-Test Split: 80% training, 20% testing
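The renaming and preprocessing steps above can be sketched as follows. This is a minimal illustration using a tiny toy frame in place of the Kaggle CSV (real code would start from `pd.read_csv`), and it assumes the Yes/No and Positive/Negative encodings shown on the slide:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in for Disease_symptom_and_patient_profile_dataset.csv.
df = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "Difficulty Breathing": ["No", "Yes", "No", "Yes", "No",
                             "Yes", "No", "No", "Yes", "No"],
    "Outcome": ["Positive", "Negative", "Positive", "Negative", "Positive",
                "Negative", "Positive", "Negative", "Positive", "Negative"],
})

# Initial renaming, as listed above.
df = df.rename(columns={"Difficulty Breathing": "DB", "Outcome": "Results"})

# Label encoding: categorical values -> integers (No=0, Yes=1; Negative=0, Positive=1).
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 80% training / 20% testing split.
X, y = df.drop(columns=["Results"]), df["Results"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```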
Feature Engineering & Scaling
Feature Engineering
• Combined Symptoms:
• Fever_and_Cough, Fever_and_Fatigue, etc.
• Age Grouping: Child, Adult, Elderly
• Derived Features:
• Risk Score = f(Age, CL)
• Age Squared for non-linear modeling
• Disease Frequency counts
Encoding & Scaling
• One-Hot Encoding: For Age_Group
• Min-Max Scaling: Age, Risk_Score, Disease_Frequency, Age_Squared
Model Selection & Training
Models Used
• Logistic Regression (LR)
• K-Nearest Neighbors (KNN)
• Decision Tree Classifier (CART)
• Random Forest (RF)
Training Phases
• Phase 1: Basic label-encoded data
• Phase 2: With feature-engineered data
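The four models can be trained and compared in a single loop. This sketch uses synthetic data in place of either phase's feature set; the hyperparameters are scikit-learn defaults, not necessarily those used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-outcome data standing in for the prepared dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
}

# Fit each model and record its test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```

Running this loop once per phase (basic vs. feature-engineered data) gives the two accuracy columns compared later in the deck.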
Tools and Techniques
Language & Environment
• Python with Jupyter / Google Colab
Libraries
• pandas, numpy, seaborn, matplotlib
• scikit-learn (sklearn):
• Preprocessing: LabelEncoder, MinMaxScaler
• Model selection & evaluation: train_test_split, accuracy_score, classification_report
Techniques
• Supervised Learning
• Ensemble Learning (Random Forest)
• Feature Importance metrics
Evaluation Metrics
Metrics Used
• Accuracy Score
• Precision, Recall, F1-Score
• Support
• Confusion Matrix (TP, TN, FP, FN breakdown)
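All four metrics come straight from scikit-learn. A minimal sketch on hypothetical binary predictions (the labels are illustrative, not project results):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical ground truth and predictions for a binary outcome.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)           # fraction of correct predictions
report = classification_report(y_true, y_pred) # precision, recall, F1, support per class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # TN, FP, FN, TP

print(acc)     # 0.8
print(report)
```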
Performance Comparison
Before Feature Engineering
• Initial accuracy range: e.g., 65%–75%
• Classification metrics showed limited precision
After Feature Engineering
• Improved accuracy: e.g., 80%–90%
• Random Forest showed the largest improvement
• Better F1-scores and recall values
Feature Importance & Conclusion
Key Features Identified
• Age, Risk_Score, Fever, Fatigue, DB
• Visualization via bar plots (Decision Tree & RF)
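A sketch of how the tree-based importances behind those bar plots are extracted. The feature names mirror the slide's key features, but the data here is synthetic, so the resulting ranking is illustrative only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; column names borrowed from the slide's key features.
names = ["Age", "Risk_Score", "Fever", "Fatigue", "DB"]
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_,
                        index=names).sort_values(ascending=False)

# importances.plot.bar()  # bar plot as on the slide (requires matplotlib)
print(importances)
```

The same `feature_importances_` attribute exists on `DecisionTreeClassifier`, so one plot per model covers both cases mentioned above.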
Conclusion
• Feature engineering improves ML performance
• Random Forest shows robustness
• Models can support real-world diagnostic decision-making
• Thank You