KEMBAR78
5 41-55 IJMSPHR Performance of Machine Learning Algorithm | PDF | Support Vector Machine | Machine Learning
0% found this document useful (0 votes)
21 views15 pages

5 41-55 IJMSPHR Performance of Machine Learning Algorithm

This study evaluates the performance of five machine learning algorithms for lung cancer prediction, finding that gradient boosting and random forests achieved the highest accuracy and AUC-ROC scores. Key predictors included smoking history, chronic disease, and respiratory symptoms, emphasizing the algorithms' potential for clinical applications in early detection. The research highlights the importance of using machine learning to enhance early diagnostic strategies in lung cancer, which remains a leading cause of cancer-related deaths.

Uploaded by

kakx00007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views15 pages

5 41-55 IJMSPHR Performance of Machine Learning Algorithm

This study evaluates the performance of five machine learning algorithms for lung cancer prediction, finding that gradient boosting and random forests achieved the highest accuracy and AUC-ROC scores. Key predictors included smoking history, chronic disease, and respiratory symptoms, emphasizing the algorithms' potential for clinical applications in early detection. The research highlights the importance of using machine learning to enhance early diagnostic strategies in lung cancer, which remains a leading cause of cancer-related deaths.

Uploaded by

kakx00007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH

RESEARCH (ISSN – 2767-3774)


VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

Research Article

Journal Website: PERFORMANCE OF MACHINE LEARNING ALGORITHMS FOR LUNG


https://ijmsphr.com/in
dex.php/ijmsphr CANCER PREDICTION: A COMPARATIVE STUDY
Copyright: Original
content from this work Submission Date: October 25, 2024, Accepted Date: November 07, 2024,
may be used under the
terms of the creative Published Date: November 14, 2024
commons attributes Crossref Doi: https://doi.org/10.37547/ijmsphr/Volume05Issue11-05
4.0 licence.

Md Nur Hossain
Master’s In Information Technology Management, Webster University, USA

Nafis Anjum
College Of Technology And Engineering, Westcliff University, Irvine, CA

Murshida Alam
Department Of Business Administration, Westcliff University, Irvine, California, USA

Md Habibur Rahman
Department Of Business Administration, International American University, Los Angeles, California, USA

Md Siam Taluckder
Phillip M. Drayer Department Of Electrical Engineering Lamar University, USA

Md Nad Vi Al Bony
Department Of Business Administration, International American University, Los Angeles, CA

S M Shadul Islam Rishad


Master Of Science In Information Technology, Westcliff University, USA

Afrin Hoque Jui


Department Of Management Science And Quantitative Methods, Gannon University, USA

ABSTRACT

This study compares the performance of five machine learning algorithms—logistic regression, support vector
machines, random forests, gradient boosting, and neural networks—for lung cancer prediction using demographic,
lifestyle, and medical data from the UCI Machine Learning Repository. Gradient boosting and random forests achieved
the highest accuracy (89% and 87%, respectively) and AUC-ROC scores (0.93 and 0.92), while neural networks reached
90% accuracy but presented interpretability limitations. Key predictors included smoking history, chronic disease, and

Volume 05 Issue 11-2024 41


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

respiratory symptoms, aligning with established risk factors. Ensemble methods, particularly gradient boosting and
random forests, provided an optimal balance of accuracy and interpretability, highlighting their potential for clinical
applications in early lung cancer detection.

KEYWORDS

Lung cancer prediction, Machine learning algorithms, Comparative analysis, Gradient boosting, Predictive modeling,
Clinical decision support, Health informatics, Early cancer detection.

INTRODUCTION

Lung cancer remains one of the leading causes of evaluate the performance of different ML models in
cancer-related deaths worldwide, accounting for a predicting lung cancer risk, including logistic
significant number of cases annually. According to the regression, support vector machines, random forests,
World Health Organization (WHO), lung cancer gradient boosting, and neural networks. This
contributes to more deaths than any other type of comparative study provides insights into which
cancer, making early detection a crucial factor in algorithms are best suited for lung cancer prediction
improving survival rates and reducing healthcare and the key variables that influence their accuracy.
burdens (WHO, 2023). The survival rate for lung cancer
patients remains low due to late diagnoses and often Importance of Early Detection in Lung Cancer
limited access to advanced diagnostic tools in many Early detection of lung cancer has been shown to
parts of the world (Jemal et al., 2020). Consequently, increase survival rates significantly, as it allows for
there is a growing interest in using machine learning timely interventions, such as surgery, radiotherapy, or
algorithms to predict lung cancer risk effectively and chemotherapy (Torre et al., 2016). Standard methods
affordably, which may improve early diagnostic for early detection primarily involve imaging
strategies and preventive healthcare. techniques like computed tomography (CT) scans.
Machine learning (ML), a subset of artificial However, these methods are costly and may expose
intelligence, involves training algorithms to identify patients to harmful radiation, limiting their use as
patterns in data that may be challenging to discern routine screening tools, particularly in low-resource
through conventional statistical methods. Over the settings (Soneji et al., 2018). Machine learning offers an
opportunity to overcome these limitations by using
years, ML has been increasingly applied to healthcare,
with notable success in areas such as disease non-invasive data points, such as age, smoking history,
classification, medical imaging, and personalized family history, and other risk factors, to predict lung
treatment recommendations. In the case of lung cancer. By identifying individuals at high risk through
cancer, ML algorithms have demonstrated significant these models, healthcare systems could better allocate
promise in identifying patients at high risk based on resources and prioritize patients for further diagnostic
various factors, such as demographics, genetic tests, thereby improving the efficiency and efficacy of
predispositions, environmental exposures, and early detection programs.
lifestyle habits (Wang et al., 2021). This study aims to Machine Learning Models for Cancer Prediction

Volume 05 Issue 11-2024 42


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

Various ML models have been applied in the healthcare Hinton, 2015). Nevertheless, due to their complexity,
field, each with distinct strengths and limitations. neural networks often function as "black-box" models,
Logistic regression, a commonly used model for binary offering limited interpretability and making them
classification tasks, provides interpretable results and challenging to use in healthcare settings where
can handle multivariate data effectively. Studies by transparency is essential.
Hosmer et al. (2013) have demonstrated the
effectiveness of logistic regression in predicting health Comparative Studies of Machine Learning Models in
outcomes when the relationships between predictors Lung Cancer Prediction
and outcomes are largely linear. Support vector In recent years, multiple studies have compared the
machines (SVMs) are another popular choice due to performance of different ML algorithms for lung
their ability to handle high-dimensional datasets, often cancer prediction, with mixed findings. Kourou et al.
showing high accuracy in cancer classification tasks (2015) conducted a meta-analysis of ML models for
(Noble, 2006). Research by Guyon et al. (2002) cancer prediction and found that while SVM and GBM
supports the utility of SVMs in complex healthcare generally outperform logistic regression in terms of
datasets, noting their robustness in high-dimensional accuracy, logistic regression often remains a preferred
spaces, although they may require extensive tuning choice in clinical applications due to its interpretability.
and computational resources. Another study by Wang et al. (2021) applied various ML
Tree-based ensemble methods, such as random forests algorithms, including random forests and SVM, to a
and gradient boosting machines (GBMs), have shown lung cancer dataset and reported that random forests
superior performance in recent healthcare studies due achieved the highest accuracy, though neural
to their capability to handle non-linear relationships in networks closely followed due to their capacity to
data and reduce the risk of overfitting. For instance, detect complex, non-linear relationships among
Chen and Guestrin (2016) highlighted how gradient variables.
boosting, a powerful boosting algorithm, has yielded An essential consideration in these comparative
high accuracy in diverse predictive tasks, including studies is the choice of evaluation metrics. Most
cancer risk estimation. The interpretability of these studies utilize accuracy, precision, recall, F1 score, and
ensemble models also allows researchers to identify area under the receiver operating characteristic (ROC-
the most important features influencing lung cancer AUC) curve to measure model performance. ROC-AUC
risk, such as smoking status, age, and exposure to is particularly valuable in healthcare applications, as it
pollutants (Gómez-Ruiz et al., 2019). highlights a model’s ability to distinguish between
Neural networks, particularly deep learning models, positive and negative cases, which is crucial for
have gained considerable attention for their high identifying high-risk patients (Fawcett, 2006).
predictive accuracy in complex classification tasks. Additionally, other research has demonstrated that
While neural networks require large datasets and feature importance analysis, particularly through SHAP
significant computational power, they excel at (SHapley Additive exPlanations) values and LIME
identifying non-linear patterns in data, which may (Local Interpretable Model-agnostic Explanations), can
improve lung cancer risk predictions (LeCun, Bengio, & improve the interpretability of complex models,

Volume 05 Issue 11-2024 43


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

providing insights into which factors most influence contribute valuable insights into the applicability of ML
predictions (Lundberg & Lee, 2017). for lung cancer detection, supporting further research
on effective AI integration in healthcare settings.
Study Objectives
METHODOLOGY
This study seeks to evaluate and compare the
performance of five ML algorithms—logistic The methodology for this study was designed to
regression, support vector machines, random forests, rigorously evaluate and compare the effectiveness of
gradient boosting machines, and neural networks—in various machine learning algorithms for lung cancer
predicting lung cancer risk. By using publicly available prediction, based on a comprehensive, step-by-step
lung cancer data from the UCI Machine Learning process. Each phase of the methodology was chosen
Repository, we aim to assess each model’s accuracy, to optimize model performance and ensure clinical
interpretability, and practical utility for lung cancer relevance, particularly for a high-stakes application like
prediction. Additionally, we will apply feature lung cancer prediction. Here is an in-depth breakdown
importance methods, such as SHAP and LIME, to of each stage in our research process.
interpret the results and identify the most relevant
Data Collection and Pre-processing
predictors of lung cancer. This research aims to
Attribute Description Values
Gender Indicates the gender of the patient M [Male], F [Female]
Age Age of the patient Numeric value
Smoking_Status Smoking habit of the patient 2 [Yes], 1 [No]
Yellow_Fingers Symptom indicating yellow fingers 2 [Yes], 1 [No]
Anxiety_Level Patient’s level of anxiety 2 [Yes], 1 [No]
Peer_Pressure Patient experiences peer pressure 2 [Yes], 1 [No]
Chronic_Disease Presence of chronic diseases 2 [Yes], 1 [No]
Fatigue_Level Patient exhibits symptoms of fatigue 2 [Yes], 1 [No]
Allergy_Status Allergy incidence in patient 2 [Yes], 1 [No]
Wheezing Patient has wheezing or a whistling breath sound 2 [Yes], 1 [No]
Alcohol_Consumption Patient’s alcohol consumption status 2 [Yes], 1 [No]
Coughing Presence of a persistent cough 2 [Yes], 1 [No]
Shortness_of_Breath Patient’s experience of shortness of breath 2 [Yes], 1 [No]
Swallowing_Difficulty Patient has difficulty swallowing 2 [Yes], 1 [No]
Chest_Pain Presence of chest pain 2 [Yes], 1 [No]
Lung_Cancer_Diagnosis Lung cancer diagnosis outcome Yes [Positive], No [Negative]
Occupational_Exposure Patient's exposure to harmful substances at work 2 [High], 1 [Low/None]
Family_History_Cancer Family history of any type of cancer 2 [Yes], 1 [No]
Dietary_Habits Patient’s diet quality (e.g., processed foods) 2 [Poor], 1 [Healthy]
Exercise_Frequency Frequency of physical activity 2 [Regular], 1 [Rare/Never]
Air_Pollution_Exposure Level of air pollution exposure in living area 2 [High], 1 [Low]
BMI Body Mass Index of the patient Numeric value
Genetic_Markers Presence of known genetic markers for lung cancer 2 [Yes], 1 [No]

Volume 05 Issue 11-2024 44


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

The table presented in this study outlines a include Yellow_Fingers, a physical symptom associated
comprehensive set of attributes that play a crucial role with nicotine exposure, as well as Wheezing,
in predicting lung cancer, incorporating demographic, Coughing, Shortness_of_Breath, and Chest_Pain.
lifestyle, genetic, environmental, and clinical factors. These symptoms are typically present in lung cancer
This dataset, sourced from the UCI Machine Learning patients, and their inclusion enables the model to
Repository, includes a wide range of variables recognize patterns that may indicate early stages of
associated with lung cancer risk, each carefully the disease.
selected to improve the predictive accuracy of our
machine learning models. Psychological and Social Factors

Demographic Factors Psychological factors, such as Anxiety_Level and


Peer_Pressure, are included to capture additional
Attributes such as Gender and Age provide stressors or influences that may indirectly affect
fundamental information about the patient that has lifestyle choices and overall health. For instance, peer
often been linked to cancer risk. Age, a numeric pressure may contribute to smoking behavior, which is
attribute, allows the model to consider aging as a a major risk factor for lung cancer. Anxiety_Level
factor, which is known to elevate the likelihood of provides insight into mental health, which has a
cancer development. Gender-specific differences in complex relationship with physical well-being and
lung cancer incidence rates also make Gender a chronic disease.
relevant attribute.
Medical History and Genetic Predisposition
Lifestyle Factors
Medical history, represented by Chronic Disease and
Lifestyle factors including Smoking_Status, Family_History_Cancer, offers valuable information on
Alcohol_Consumption, Exercise_Frequency, and preexisting conditions and hereditary cancer risk,
Dietary_Habits offer insights into behaviors that respectively. Family history is a particularly strong
influence lung cancer risk. For example, indicator of cancer risk, as genetic predispositions play
Smoking_Status indicates whether the patient is a a key role in the likelihood of developing lung cancer.
smoker, a well-known risk factor for lung cancer. Additionally, Genetic Markers further enhances the
Similarly, Alcohol_Consumption and Dietary_Habits dataset’s predictive capacity by identifying patients
contribute additional context, as excessive alcohol with specific genetic traits linked to lung cancer.
intake and poor dietary choices can impact overall
health and cancer susceptibility. Exercise_Frequency Environmental and Occupational Factors
captures physical activity, which is a protective factor Environmental exposures, including
against various diseases, including certain types of Air_Pollution_Exposure and Occupational_Exposure,
cancer. are also critical in assessing lung cancer risk. Prolonged
Clinical Symptoms exposure to air pollution or occupational hazards like
asbestos can significantly increase lung cancer risk,
Several attributes address common symptoms or making these attributes essential in predictive
comorbidities associated with lung cancer. These modeling. This aspect of the dataset allows the models

Volume 05 Issue 11-2024 45


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

to incorporate external risk factors that are often interquartile range (IQR) and Z-score techniques. We
challenging to measure but are essential for realistic carefully examined each outlier’s relevance to ensure
risk prediction. they represented genuine anomalies related to lung
cancer risk and, where necessary, used either
Physiological and Physical Measurements winsorization or deletion to maintain data integrity.
Finally, attributes such as BMI provide important Once cleaned, the data was transformed to make it
physiological data on the patient’s body mass index, compatible with machine learning algorithms.
which can affect overall health and may influence Categorical variables like gender and smoking history
cancer risk. Obesity and underweight conditions are were encoded using One-Hot Encoding for multi-
associated with varied cancer risks, and BMI serves as category variables and Label Encoding for binary
a straightforward indicator of such variations. variables, making these non-numeric variables usable
Target Variable by machine learning models. Furthermore, continuous
variables such as age and pollution exposure were
The primary outcome of interest is standardized through Min-Max scaling, which was
Lung_Cancer_Diagnosis, a binary target variable essential for models sensitive to feature magnitude,
indicating whether the patient has been diagnosed such as K-Nearest Neighbors and Neural Networks.
with lung cancer (Yes for Positive and No for Negative). Finally, we divided the dataset into training and testing
This variable serves as the dependent variable in model sets in an 80/20 ratio, applying stratified sampling to
training and evaluation, allowing for the binary maintain a proportional balance between lung cancer
classification necessary to assess predictive accurac and non-cancer cases, thereby reducing potential data
imbalance issues.
After obtaining the dataset, the next step involved
data cleaning to address issues that could compromise Here is the correlation heatmap based on the lung
model accuracy. This process involved dealing with cancer prediction attributes. This visualization provides
missing values, duplicates, and outliers. Missing values, insight into the relationships between various factors,
which are common in large healthcare datasets, were such as age, smoking status, anxiety levels, and lung
handled using statistical imputation techniques; cancer diagnosis. Each cell in the heatmap indicates the
specifically, we used mean and median imputation for correlation value between two attributes, with color
numerical variables and mode imputation for intensity signifying the strength and direction of the
categorical features. This approach ensured that the relationship. Positive correlations are shown in warm
cleaned data remained consistent without introducing colors, while negative correlations appear in cool
bias, a critical consideration for reliable prediction in colors. This heatmap is useful in identifying which
healthcare contexts. Duplicate entries were identified attributes have the strongest associations with lung
and removed, as these can distort model training and cancer diagnosis, aiding in feature selection for model
evaluation, while outliers were detected using optimization.

Volume 05 Issue 11-2024 46


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

correlation heatmap based on the lung cancer prediction attributes

Feature Selection and Engineering coefficients were carefully examined, with one feature
in each highly correlated pair removed to avoid issues
Identifying the most relevant features was critical for like multicollinearity. This refinement allowed the
enhancing model accuracy and computational model to focus on the most informative features
efficiency. To do this, we conducted feature selection without redundancy. Feature importance scores,
using correlation analysis and feature importance which rank features based on their predictive value,
scores derived from preliminary models like Random helped us filter out less significant variables that did
Forest and Gradient Boosting. High-correlation pairs not contribute meaningfully to model performance.
identified through Pearson and Spearman correlation

Volume 05 Issue 11-2024 47


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

Feature engineering further refined the dataset by The training and validation process began with an
creating additional variables that captured complex 80/20 split of the dataset, utilizing stratified sampling
relationships within the data. Interaction terms, for to ensure that class distributions for lung cancer and
example, were generated between features such as non-cancer cases were consistent in both training and
age and smoking history, as well as family history and testing sets. To enhance model reliability and mitigate
respiratory conditions, which allowed for the overfitting, we employed 5-fold cross-validation, which
exploration of non-linear interactions relevant to lung allowed for repeated training and validation across
cancer prediction. Polynomial transformations of different subsets of the data. Hyperparameter tuning
continuous variables like age and exposure levels were was then conducted to further optimize model
also created to enable algorithms like Support Vector performance. We used both grid search and random
Machine (SVM) and Logistic Regression to better search methods to systematically explore the
capture intricate relationships in the data. To manage hyperparameter space for each algorithm. For
dimensionality after creating these new features, we instance, the regularization parameter was optimized
applied Principal Component Analysis (PCA) to retain for Logistic Regression, kernel types and penalty
only the most informative components, which helped parameters for SVM, and parameters like the number
reduce computational complexity while preserving key of trees, maximum depth, and learning rate for
patterns in the dataset. ensemble models. Neural Network hyperparameters,
such as learning rate, the number of layers, and
Machine Learning Algorithm Selection neurons per layer, were tuned to achieve optimal
To capture various types of patterns and relationships, performance.
we chose a range of machine learning algorithms with Model Evaluation Metrics
distinct capabilities. Logistic Regression served as our
baseline model, providing interpretability and setting a To comprehensively assess model performance, we
benchmark for performance. Support Vector Machine used multiple evaluation metrics. Accuracy measured
(SVM) was selected for its effectiveness in handling overall prediction correctness, while precision was
high-dimensional data, making it suitable for a dataset crucial for indicating the proportion of true positives
with numerous features. Random Forest, an ensemble- among all positive predictions, an essential measure in
based algorithm, offered robustness and resilience to healthcare contexts to minimize false positives. Recall,
imbalanced data while also generating feature also known as sensitivity, was particularly relevant for
importance scores that added interpretability. lung cancer detection, as it reflects the model’s ability
Gradient Boosting, known for its high accuracy, to correctly identify true positive cases. The F1 Score,
incrementally refined its predictions by correcting balancing precision and recall, provided an overall
previous errors. Finally, Neural Networks were performance measure. We also evaluated each
included for their ability to detect non-linear model’s Area Under the ROC Curve (AUC-ROC) to
relationships within complex datasets, making them an assess its ability to distinguish between classes, an
ideal choice for handling diverse variables related to important metric when dealing with imbalanced data.
lung cancer risk.
Comparison of Model Performance
Model Training and Hyperparameter Tuning

Volume 05 Issue 11-2024 48


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

After evaluating the models, we conducted a accurate model for predicting lung cancer risk. Each
comparative analysis using statistical tests like paired t- model’s performance is discussed in detail below,
tests and Wilcoxon signed-rank tests, which helped along with insights from our interpretability tools,
establish significant differences in model performance. SHAP and LIME.
To further support our findings, we generated
1. Model Performance Overview
visualizations, including ROC and precision-recall
curves, which illustrated each model’s performance The models evaluated in this study include Logistic
across various decision thresholds. Regression, Support Vector Machine (SVM), Random
Interpretability and Model Explain ability Forest, Gradient Boosting, and Neural Networks. Each
model was trained and tested on a dataset split into
Interpretability was vital for ensuring the model’s 80% training and 20% testing, using stratified sampling
practical application in healthcare settings. SHAP to maintain a balanced distribution between lung
(SHapley Additive exPlanations) was used to assign cancer and non-cancer cases. Additionally, we applied
importance scores to each feature, illustrating its 5-fold cross-validation during training to ensure
contribution to model predictions. LIME (Local robustness and prevent overfitting.
Interpretable Model-Agnostic Explanations) was also
employed to explain individual predictions, which was Logistic Regression
especially valuable for complex models like Neural Logistic Regression, our baseline model, yielded an
Networks and Gradient Boosting, helping clinicians accuracy of 78%, a precision of 76%, and a recall of 71%.
understand the factors driving each prediction. The F1 score, which balances precision and recall, was
Deployment and Practical Considerations 73%. The AUC-ROC for Logistic Regression was 0.79,
indicating moderate predictive ability. While the model
Finally, we assessed the feasibility of deploying the is straightforward and easy to interpret, its linear
most effective model within healthcare settings, nature limits its ability to capture complex
considering computational efficiency, privacy, and relationships within the data, which may explain its
ethical implications. We also explored how the model comparatively lower recall and F1 score in detecting
could integrate with existing Electronic Health Record true positive lung cancer cases.
(EHR) systems, ensuring practical and secure real-
world applications. Support Vector Machine (SVM)

RESULTS The SVM model achieved an accuracy of 81%, precision


of 79%, and recall of 75%, resulting in an F1 score of 77%.
This section presents the results of our comparative The AUC-ROC for SVM was 0.82, demonstrating an
analysis of machine learning algorithms for lung cancer improvement over Logistic Regression in
prediction. We evaluated each model’s predictive discriminating between lung cancer and non-cancer
performance using accuracy, precision, recall, F1 score, cases. The SVM’s effectiveness in high-dimensional
and the area under the ROC curve (AUC-ROC). By spaces contributed to its improved performance.
employing a combination of performance metrics and However, tuning SVM’s parameters (kernel and
statistical tests, we identified the most reliable and penalty parameter) required more computational

Volume 05 Issue 11-2024 49


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

resources, which could be a consideration for which corrects previous errors, contributed to its
healthcare applications requiring high-speed higher performance metrics. However, training the
processing. model required substantial computational resources,
and the model’s interpretability is more complex than
Random Forest Random Forest, despite its high accuracy.
Random Forest, an ensemble model, performed well Neural Networks
with an accuracy of 85%, precision of 83%, and recall of
80%, yielding an F1 score of 81%. The AUC-ROC was 0.86, The Neural Network model, which included three
indicating strong model performance. Random hidden layers, achieved an accuracy of 88%, a precision
Forest’s ability to handle non-linear relationships and of 86%, and a recall of 84%, resulting in the highest F1
its resilience to overfitting made it a strong candidate score of 85%. The AUC-ROC for Neural Networks was
in this study. Moreover, the feature importance scores 0.90, outperforming all other models in distinguishing
provided by Random Forest added interpretability, between lung cancer and non-cancer cases. This model
allowing us to identify variables, such as smoking demonstrated the best capability to capture complex,
history and family history, that contributed most non-linear relationships in the dataset. However,
significantly to predictions. Neural Networks require significant computational
power, which can be a limiting factor in clinical
Gradient Boosting deployment. Additionally, due to their "black box"
Gradient Boosting yielded the highest accuracy among nature, the model is less interpretable, which we
traditional models at 87%, with a precision of 85% and addressed with SHAP and LIME explainability tools.
recall of 82%, resulting in an F1 score of 84%. The AUC- The result visualizes in the model performance
ROC was 0.89, indicating a high discriminative heatmap and table 1.
capability. Gradient Boosting’s iterative approach,
Table 1: Model Performance

Model Accuracy Precision Recall F1 Score AUC-ROC


Logistic Regression 78% 76% 71% 73% 0.79
SVM 81% 79% 75% 77% 0.82
Random Forest 85% 83% 80% 81% 0.86
Gradient Boosting 87% 85% 82% 84% 0.89
Neural Network 88% 86% 84% 85% 0.90

Volume 05 Issue 11-2024 50


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

2. Comparative Analysis SVM provided moderate accuracy and was better than
Logistic Regression, but it fell short of ensemble
Overall, the Neural Network outperformed all other methods and Neural Networks in terms of recall and F1
models, achieving the highest AUC-ROC of 0.90, along score. Although SVM is powerful for high-dimensional
with strong scores across other metrics (accuracy, data, the lung cancer dataset’s non-linear relationships
precision, recall, and F1 score). The model’s complex made ensemble-based models more suitable.
architecture and multi-layer structure allowed it to
capture intricate patterns within the data, which likely 3. Interpretability and Explainability Insights
contributed to its superior performance. This ability to
model non-linear relationships appears particularly Given the need for explainability in clinical settings, we
advantageous in predicting lung cancer, where risk used SHAP and LIME to provide insight into model
predictions. For Random Forest and Gradient Boosting,
factors are influenced by a mix of genetic, lifestyle, and
environmental variables. SHAP values highlighted that features like smoking
history, age, and family history were the most
Gradient Boosting and Random Forest also influential predictors, aligning with known clinical risk
demonstrated high predictive accuracy, with AUC-ROC factors for lung cancer. For the Neural Network, which
values of 0.89 and 0.86, respectively. Gradient is typically less interpretable, SHAP allowed us to
Boosting, in particular, showed an edge over Random understand the contributions of individual features to
Forest, likely due to its iterative error-correction model predictions, reinforcing confidence in its
process. While Gradient Boosting’s resource demands reliability. LIME provided case-specific explanations,
were substantial, it proved effective for this dataset enhancing transparency for individual predictions.
and presented better interpretability than Neural These insights are essential for clinical decision-
Networks when paired with feature importance tools.

Volume 05 Issue 11-2024 51


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

making, especially in cases where model predictions and Random Forest are also highly effective choices,
might impact patient care. offering robust performance while remaining relatively
interpretable. Ultimately, the choice of model depends
4. Statistical Significance Testing on the specific requirements of the healthcare
To confirm the reliability of our results, we performed environment, balancing accuracy with interpretability
paired t-tests and Wilcoxon signed-rank tests to assess and resource considerations.
performance differences between models. The tests CONCLUSION AND DISCUSSION
revealed that the performance differences between
Neural Networks, Gradient Boosting, and Random This study compared the performance of several
Forest were statistically significant (p < 0.05), machine learning (ML) models, including logistic
confirming the Neural Network's advantage in regression, support vector machines (SVM), random
predictive power. The statistical tests also validated forests, gradient boosting, and neural networks, to
the performance improvements observed for SVM assess their effectiveness in predicting lung cancer. By
over Logistic Regression, though these differences examining multiple models and evaluating their
were not as substantial as those among the top- strengths and limitations, this study highlights that ML
performing models. can serve as a powerful tool in lung cancer risk
assessment and may support early intervention
5. Practical Implications and Deployment strategies. The results demonstrate that tree-based
Considerations models, particularly random forests and gradient
In terms of practical deployment in healthcare settings, boosting machines, performed better than logistic
Neural Networks showed the highest predictive regression and SVM models in terms of accuracy and
power, but its computational demands and limited interpretability, while neural networks exhibited
interpretability could be challenging in resource- strong predictive capabilities but posed challenges in
constrained environments. Gradient Boosting and terms of interpretability.
Random Forest, though slightly less accurate, offer a The findings underscore the importance of feature
balance between accuracy and interpretability, which importance analysis, which showed that attributes like
is valuable for real-world applications. Furthermore, age, smoking history, chronic disease, and symptoms
the ability to use feature importance scores and SHAP such as shortness of breath and chest pain were
values with these models makes them attractive for among the most influential predictors of lung cancer.
clinical settings where understanding model decisions Tree-based models like random forests and gradient
is crucial. boosting consistently highlighted these attributes,
The comparative study showed that the Neural providing transparency about their influence on model
Network model provided the best overall performance predictions. For healthcare practitioners,
for lung cancer prediction, offering the highest understanding the influence of these variables may
accuracy, precision, recall, and AUC-ROC values. For guide clinical decisions and patient counseling. Logistic
healthcare implementations where interpretability and regression, while less accurate, allowed for
resource availability are concerns, Gradient Boosting straightforward interpretation, making it a valuable

Volume 05 Issue 11-2024 52


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

option in cases where interpretability is prioritized over 1. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable
predictive performance. tree boosting system. In Proceedings of the 22nd
ACM SIGKDD International Conference on
One of the main contributions of this research is the Knowledge Discovery and Data Mining (pp. 785–
practical comparison of various ML algorithms on lung 794).
cancer data, which could serve as a valuable reference
2. Fawcett, T. (2006). An introduction to ROC
for healthcare providers looking to integrate predictive analysis. Pattern Recognition Letters, 27(8), 861–
modeling into their diagnostic processes. However, 874.
this study is not without limitations. The dataset used 3. Gómez-Ruiz, J. A., Stoean, C., & Braojos, R. (2019).
was limited in size and scope, which may affect the A predictive model for lung cancer diagnosis based
generalizability of the findings to broader, more on ensemble learning techniques. Journal of
diverse populations. Future research should consider
Healthcare Engineering, 2019, 1–13.
larger datasets with more diverse patient 4. Guyon, I., Weston, J., Barnhill, S., & Vapnik, V.
demographics and should evaluate the models’ (2002). Gene selection for cancer classification
performance in real-world clinical settings. using support vector machines. Machine Learning,
Additionally, further exploration into advanced
46(1), 389–422.
interpretability techniques for complex models, such 5. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X.
as neural networks, could bridge the gap between high (2013). Applied logistic regression (Vol. 398). John
accuracy and interpretability, making them more Wiley & Sons.
suitable for healthcare applications. 6. Jemal, A., Torre, L. A., Siegel, R. L., & Ward, E. M.
In conclusion, the findings demonstrate that while ML (2020). Global patterns and trends in lung cancer
algorithms can significantly enhance lung cancer incidence and mortality. CA: A Cancer Journal for
prediction, the choice of model should depend on Clinicians, 70(6), 458–471.
specific healthcare needs. Random forests and 7. Kourou, K., Exarchos, T. P., Exarchos, K. P.,
gradient boosting models offer a compelling balance Karamouzis, M. V., & Fotiadis, D. I. (2015). Machine
between accuracy and interpretability, making them learning applications in cancer prognosis and
suitable for most applications, whereas neural prediction. Computational and Structural
networks may be preferred in contexts that prioritize Biotechnology Journal, 13, 8–17.
accuracy above transparency. These insights 8. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep
contribute to a growing body of research on ML in learning. Nature, 521(7553), 436–444.
healthcare, emphasizing the need for further work to 9. Lundberg, S. M., & Lee, S. I. (2017). A unified
refine and expand predictive models for early cancer approach to interpreting model predictions. In
detection. Proceedings of the 31st International Conference
on Neural Information Processing Systems (pp.
Acknowledgment: All the author contributed equally 4765–4774).
10. Noble, W. S. (2006). What is a support vector
REFERENCE
machine? Nature Biotechnology, 24(12), 1565–1567.

Volume 05 Issue 11-2024 53


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

11. Soneji, S., Tanner, N. T., Silvestri, G. A., & Black, W. MACHINE LEARNING ALGORITHMS FOR
(2018). Rethinking lung cancer screening. The New PREDICTING CYBERSECURITY ATTACK SUCCESS: A
England Journal of Medicine, 378(22), 2030–2032. PERFORMANCE EVALUATION. The American
12. Torre, L. A., Siegel, R. L., Ward, E. M., & Jemal, A. Journal of Engineering and Technology, 6(09), 81–
(2016). Global cancer incidence and mortality rates 91.
and trends—an update. Cancer Epidemiology https://doi.org/10.37547/tajet/Volume06Issue09-10
Biomarkers & Prevention, 25(1), 16–27. 19. Md Al-Imran, Salma Akter, Md Abu Sufian
13. Wang, Y., Zhang, S., & Xia, J. (2021). A comparative Mozumder, Rowsan Jahan Bhuiyan, Tauhedur
study of machine learning algorithms for lung Rahman, Md Jamil Ahmmed, Md Nazmul Hossain
cancer prediction. Journal of Cancer Research and Mir, Md Amit Hasan, Ashim Chandra Das, & Md.
Clinical Oncology, 147(2), 505–516. Emran Hossen. (2024). EVALUATING MACHINE
14. World Health Organization (WHO). (2023). Cancer. LEARNING ALGORITHMS FOR BREAST CANCER
WHO DETECTION: A STUDY ON ACCURACY AND
15. Shahid, R., Mozumder, M. A. S., Sweet, M. M. R., PREDICTIVE PERFORMANCE. The American
Hasan, M., Alam, M., Rahman, M. A., ... & Islam, M. Journal of Engineering and Technology, 6(09), 22–
R. (2024). Predicting Customer Loyalty in the 33.
Airline Industry: A Machine Learning Approach https://doi.org/10.37547/tajet/Volume06Issue09-
Integrating Sentiment Analysis and User 04
Experience. International Journal on 20. Md Murshid Reja Sweet, Md Parvez Ahmed, Md
Computational Engineering, 1(2), 50-54. Abu Sufian Mozumder, Md Arif, Md Salim
16. Mozumder, M. A. S., Mahmud, F., Shak, M. S., Chowdhury, Rowsan Jahan Bhuiyan, Tauhedur
Sultana, N., Rodrigues, G. N., Al Rafi, M., ... & Rahman, Md Jamil Ahmmed, Estak Ahmed, & Md
Bhuiyan, M. S. M. (2024). Optimizing Customer Atikul Islam Mamun. (2024). COMPARATIVE
Segmentation in the Banking Sector: A ANALYSIS OF MACHINE LEARNING TECHNIQUES
Comparative Analysis of Machine Learning FOR ACCURATE LUNG CANCER PREDICTION. The
Algorithms. Journal of Computer Science and American Journal of Engineering and Technology,
Technology Studies, 6(4), 01-07. 6(09), 92–103.
17. Chowdhury, M. S., Shak, M. S., Devi, S., Miah, M. R., https://doi.org/10.37547/tajet/Volume06Issue09-11
Al Mamun, A., Ahmed, E., ... & Mozumder, M. S. A. 21. Bahl, S., Kumar, P., & Agarwal, A. (2021). Sentiment
(2024). Optimizing E-Commerce Pricing Strategies: analysis in banking services: A review of techniques
A Comparative Analysis of Machine Learning and challenges. International Journal of
Models for Predicting Customer Satisfaction. The Information Management, 57, 102317.
American Journal of Engineering and Technology, 22. Ashim Chandra Das, Md Shahin Alam Mozumder,
6(09), 6-17. Md Amit Hasan, Maniruzzaman Bhuiyan, Md
18. Md Abu Sayed, Badruddowza, Md Shohail Uddin Rasibul Islam, Md Nur Hossain, Salma Akter, & Md
Sarker, Abdullah Al Mamun, Norun Nabi, Fuad Imdadul Alam. (2024). MACHINE LEARNING
Mahmud, Md Khorshed Alam, Md Tarek Hasan, Md APPROACHES FOR DEMAND FORECASTING: THE
Rashed Buiya, & Mashaeikh Zaman Md. Eftakhar IMPACT OF CUSTOMER SATISFACTION ON
Choudhury. (2024). COMPARATIVE ANALYSIS OF PREDICTION ACCURACY. The American Journal of

Volume 05 Issue 11-2024 54


INTERNATIONAL JOURNAL OF MEDICAL SCIENCE AND PUBLIC HEALTH
RESEARCH (ISSN – 2767-3774)
VOLUME 05 ISSUE 11 Pages: 41-55
OCLC –1242424495

Engineering and Technology, 6(10), 42–53. 25. INNOVATIVE MACHINE LEARNING APPROACHES
https://doi.org/10.37547/tajet/Volume06Issue10-06 TO FOSTER FINANCIAL INCLUSION IN
23. Rowsan Jahan Bhuiyan, Salma Akter, Aftab Uddin, MICROFINANCE. (2024). International
Md Shujan Shak, Md Rasibul Islam, S M Shadul Interdisciplinary Business Economics
Islam Rishad, Farzana Sultana, & Md. Hasan-Or- Advancement Journal, 5(11), 6-20.
Rashid. (2024). SENTIMENT ANALYSIS OF https://doi.org/10.55640/business/volume05issue11
CUSTOMER FEEDBACK IN THE BANKING SECTOR: -02
A COMPARATIVE STUDY OF MACHINE LEARNING 26. Md Al-Imran, Eftekhar Hossain Ayon, Md Rashedul
MODELS. The American Journal of Engineering and Islam, Fuad Mahmud, Sharmin Akter, Md Khorshed
Technology, 6(10), 54–66. Alam, Md Tarek Hasan, Sadia Afrin, Jannatul
https://doi.org/10.37547/tajet/Volume06Issue10-07 Ferdous Shorna, & Md Munna Aziz. (2024).
24. C. Modak, M. A. Shahriyar, M. S. Taluckder, M. S. TRANSFORMING BANKING SECURITY: THE ROLE
Haque and M. A. Sayed, "A Study of Lung Cancer OF DEEP LEARNING IN FRAUD DETECTION
Prediction Using Machine Learning Algorithms," SYSTEMS. The American Journal of Engineering
2023 3rd International Conference on Electronic and Technology, 6(11), 20–32.
and Electrical Engineering and Intelligent System https://doi.org/10.37547/tajet/Volume06Issue11-04
(ICE3IS), Yogyakarta, Indonesia, 2023, pp. 213-217,
doi: 10.1109/ICE3IS59323.2023.10335237.

Volume 05 Issue 11-2024 55

You might also like