
A

PROJECT REPORT
ON

“DEEPLUNG: PREDICTIVE MODELLING FOR
LUNG CANCER RISK ASSESSMENT”
(Submitted in partial fulfillment for the award of Degree of Bachelor of Technology)
IN
Computer Science Engineering

SUBMITTED BY
PRIYANSHI HEMRAJANI |NAMAN POKHARNA
MAYANK MEHRANIYA| PIYUSH SONI| DAKSH SONI
UNDER GUIDANCE
OF
PROF. DR. R.K. SOMANI
DEAN
SCHOOL OF ENGINEERING AND TECHNOLOGY

SESSION: (2023-24)

Sangam University, NH-79, Bhilwara Chittor By-pass,
Chittor Road, Bhilwara-311001

Lung Cancer Prediction Model i|Page


SANGAM UNIVERSITY

AUTHOR’S DECLARATION

I hereby declare that the work presented in the Project Report entitled
“Deeplung: Predictive Modelling for Lung Cancer Risk Assessment”, submitted to the
Department of Computer Science Engineering, Sangam University, in partial fulfillment
of the requirements for the award of the Degree of “Bachelor of Technology” in Computer
Science Engineering, is a record of my own investigations carried out under the guidance of
Prof. Dr. R.K. Somani, Dean of the School of Engineering and Technology, Sangam University,
Bhilwara, Rajasthan, India.

I have not submitted the matter presented in this dissertation anywhere for any other degree.

PRIYANSHI HEMRAJANI | NAMAN POKHARNA
MAYANK MEHRANIYA | PIYUSH SONI | DAKSH SONI
Computer Science & Engineering
Enrollment No.:2020BTCS032
Sangam University, Bhilwara (Raj.)

Counter Signed by
Prof. Dr. R.K. Somani
Dean
Department of Computer Science & Engineering
School of Engineering & Technology
Sangam University, Bhilwara (Raj.)




CERTIFICATE

I feel great pleasure in certifying that the project entitled “Deeplung: Predictive Modelling
for Lung Cancer Risk Assessment” was carried out by Priyanshi Hemrajani | Naman Pokharna |
Mayank Mehraniya | Piyush Soni | Daksh Soni under the supervision of Prof. Dr. R.K.
Somani. I recommend the submission of the project.

Date: ………………….

Sign ………………………………………
(Dr. Vikas Somani)
Head of Department of Computer Science & Engineering,
School of Engineering & Technology,
Sangam University, Bhilwara




ACKNOWLEDGEMENT

This dissertation would not have been successful without the guidance and support of a large
number of individuals.
I thank Prof. Dr. R.K. Somani, Dean of the School of Engineering and Technology and my
dissertation supervisor, who has believed in me since the initial stages of my dissertation work.
For a long time, he provided insightful commentary during our regular meetings, and he was
consistently supportive of my proposed research directions. I am honored to be his first graduate
student. His consistent support and intellectual guidance inspired me to pursue innovative new
ideas. I am glad to have worked under his supervision.
I am grateful to Dr. Vikas Somani, Head, Department of Computer Science Engineering, for
his excellent support during my dissertation work. Dr. Awanit Kumar, Bachelor of
Engineering and Technology Computer Science Coordinator, spent many hours listening
to my concerns, working with me to navigate the bureaucracy, and assisting me with my most
important decisions.
I thank all my friends and classmates for their love and support. I have enjoyed their company
so much during my stay at Sangam University. I would like to thank all those who made my stay
in college an unforgettable and rewarding experience.
Last but not least, I would like to thank my parents for supporting me in every way to complete
my degree.

Priyanshi Hemrajani | Naman Pokharna | Mayank Mehraniya
Piyush Soni | Daksh Soni
Enrollment No.: 2020BTCS032




ABSTRACT

Lung cancer remains a significant global health challenge, often diagnosed at advanced stages
with limited treatment options, leading to high morbidity and mortality rates. This project
focuses on developing a machine learning-based predictive model to assess an individual's risk
of developing lung cancer. Leveraging diverse input factors including smoking habits,
environmental pollutants, genetic predisposition, occupational hazards, and health parameters,
the aim is to create a robust and accurate predictive tool.

The project objectives encompass comprehensive data collection, preprocessing, and feature
engineering to extract crucial insights from datasets sourced from reliable medical records and
research databases. Various machine learning algorithms, including logistic regression,
decision trees, random forests, and neural networks, are employed to build predictive models.
Hyperparameter tuning and ensemble methods are utilized to enhance model performance and
robustness.

The models are rigorously evaluated using cross-validation techniques and diverse evaluation
metrics to ensure reliability and generalizability. Interpretability techniques are applied to
explain model predictions, facilitating user trust and understanding, particularly among
healthcare professionals. Ethical considerations regarding patient data privacy and compliance
with regulations are strictly adhered to throughout the project lifecycle.

The ultimate goal is to create a user-friendly predictive tool that aids in early detection,
personalized risk assessment, and targeted interventions for lung cancer. This model has the
potential to significantly impact public health initiatives by informing preventive measures,
policy changes, and resource allocation strategies to mitigate the burden of lung cancer. The
project aims to contribute to advancements in predictive analytics applied to healthcare while
striving to improve patient outcomes and reduce the societal and economic impact of this
devastating disease.



ABBREVIATIONS

ML-LCRA: Machine Learning for Lung Cancer Risk Assessment


LC-PMA: Lung Cancer Predictive Modeling Approach
LMRC: Lung Cancer Risk Classifier
LDAP: Lung Disease Assessment Project
CARL: Cancer Assessment via Risk Learning
MIRA-LC: Machine Intelligence for Risk Assessment in Lung Cancer
LPRAM: Lung Cancer Prediction and Risk Assessment Model
LEARC: Lung Evaluation and Risk Classification
PREDICAN: Predictive Analysis for Lung Cancer
LCAI: Lung Cancer AI-Assisted Identification
L-CARE: Lung Cancer Assessment via Risk Estimation
AIR-LCR: AI for Lung Cancer Risk
RISK-LC: Risk Identification for Lung Cancer
LUNAR-M: Lung Cancer Risk Modeling
SMART-LC: System for Modeling and Assessing Lung Cancer Risk
LCAID: Lung Cancer AI Diagnosis
L-RISK: Lung Cancer Risk Intelligence System
PROLUCAR: Predictive Lung Cancer Assessment and Risk
LC-PREDICT: Lung Cancer Prediction Tool
LUNGPROF: Lung Cancer Risk Profiler



CONTENTS

Author’s Declaration
Certificate
Acknowledgements
Abstract
Abbreviations
Contents
List of figures

1. Introduction to Predictive Modelling for Lung Cancer
1.1. Objectives
1.1.1. Scope of the Study
1.1.2. Motivation behind the Research
1.1.3. Limitations and Constraints
1.1.4. Proposed Solution
1.2. Problem Statement
1.2.1. Problem Overview and Context

2. Review of Related Literature
2.1. Existing Research and Studies
2.2. Research Gap

3. Proposed System and Methodology
3.1. System Architecture Design
3.2. System Flow and Use case
3.3. System Algorithm

4. Result and Discussion
4.1. Model Performance and Graphs

5. Conclusion and Recommendations
5.1. Summary and Concluding Remarks
5.2. Practical Uses and Implications
5.3. Future Work and Enhancements

6. References

7. Appendix
7.1. Technical Details and Additional Graphs/Charts
7.2. Supplementary Information



LIST OF FIGURES

Fig 1: Lung Cancer Prediction Architecture


Fig 2: Architectural Model of LSTM
Fig 3: GRU’s accuracy Comparison
Fig 4: Use Case Diagram of the System
Fig 5: UML Sequence Diagram
Fig 6: Flowchart of the methodology for Cancer Detection
Fig 7: Decision Tree
Fig 8: ROC curves for risk prediction models in the MOLTEST BIS cohort.
ROC, receiver operating characteristic curve; LLP, Liverpool Lung Project;
AUC, area under the receiver operating characteristic curve.
Fig 9: Graphs
Fig 10: Input Data
Fig 11: Axes Input Plot
Fig 12: Dataset Details
Fig 13: Correlation Matrix
Fig 14: Lung Cancer due to Air Pollution
Fig 15: Level Vs Count
Fig 16: Label Graph



1. Introduction to Predictive Modelling for Lung Cancer

1.1. Objectives

Data Collection and Preparation:

• Gather diverse datasets encompassing information on smoking habits,
  environmental exposures, genetic factors, occupational history, health parameters,
  and demographics from reliable sources and medical records.
• Perform data preprocessing tasks, including handling missing values, outlier
  detection, data normalization, and ensuring data consistency and quality.

Feature Selection and Engineering:

• Conduct thorough exploratory data analysis (EDA) to identify relevant features
  associated with lung cancer risk.
• Apply feature selection techniques to choose the most influential and discriminative
  features.
• Perform feature engineering to create new features or transformations that may
  enhance the predictive power of the model.

Model Development:

• Implement various machine learning algorithms (e.g., logistic regression, decision
  trees, random forests, support vector machines, neural networks) for building
  predictive models.
• Train multiple models using the prepared dataset, employing appropriate
  hyperparameter tuning and model optimization techniques to enhance performance.
• Explore ensemble methods to combine the strengths of multiple models for
  improved prediction accuracy.

Model Evaluation and Validation:

• Assess the performance of developed models using cross-validation techniques to
  ensure robustness and generalizability.
• Utilize appropriate evaluation metrics (such as accuracy, precision, recall, F1-score,
  ROC-AUC) to measure model performance.
• Validate the model on independent datasets or through external validation to
  confirm its reliability.

Interpretability and Explainability:

• Enhance the interpretability of the model by employing techniques such as feature
  importance analysis, SHAP (SHapley Additive exPlanations), or LIME (Local
  Interpretable Model-agnostic Explanations).
• Provide explanations for model predictions to facilitate understanding and trust
  among users, particularly healthcare professionals.

Ethical Considerations and Data Privacy:

• Ensure compliance with ethical guidelines and data privacy regulations in handling
  sensitive health-related information.
• Implement appropriate data anonymization techniques and robust security measures
  to protect patient confidentiality.

User Interface Development (Optional):

• Develop a user-friendly interface or dashboard to facilitate easy interaction with the
  predictive model for healthcare professionals or end-users.
• Design the interface to visualize predictions, risk factors, and recommendations
  based on individual profiles.

Documentation and Reporting:

• Create comprehensive documentation detailing the methodologies, algorithms
  used, data sources, preprocessing steps, model development, and evaluation
  processes.
• Prepare a detailed report summarizing the findings, model performance, limitations,
  and recommendations for further enhancements or applications.

Deployment and Integration:

• Deploy the finalized model in a suitable environment, making it accessible for
  real-time predictions or integration within healthcare systems if applicable.
• Collaborate with healthcare institutions or relevant stakeholders for potential
  integration into clinical practice or public health initiatives.

By addressing these objectives, the project aims to develop a reliable and accurate
lung cancer prediction model that supports early detection, personalized risk
assessment, and proactive interventions, contributing to improved healthcare
outcomes and public health initiatives.

1.1.1. Scope of the Study


The scope of a study involving the development of a machine learning-based predictive
model for lung cancer risk assessment is comprehensive and multidimensional. Here's
a detailed breakdown of the scope:

1. Data Collection and Preprocessing:

• Identifying Relevant Data Sources: Gathering data from diverse sources such
  as medical records, surveys, research papers, and public databases to acquire
  information on:
  ▪ Smoking habits: Quantity, duration, type of tobacco, cessation
    attempts, etc.
  ▪ Environmental pollutants: Air quality indices, exposure levels to
    toxins, geographical data, etc.
  ▪ Genetic predisposition: Genetic markers, family history, genotypic
    data.
  ▪ Occupational hazards: Exposure to carcinogens in specific industries
    or occupations.
  ▪ Other relevant parameters: Demographics, lifestyle factors, medical
    history, etc.
• Data Preprocessing: Cleaning and formatting data, handling missing values,
  encoding categorical variables, and ensuring data consistency and quality.
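
As an illustration of the preprocessing step, the sketch below fills missing values and one-hot encodes a categorical column using only the Python standard library. The records, the field names (`age`, `smoking_years`, `gender`) and the choice of median and mode imputation are assumptions made for this example; the report does not specify which imputation strategy or schema the project uses.

```python
from collections import Counter
from statistics import median

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"age": 61, "smoking_years": 30, "gender": "M"},
    {"age": None, "smoking_years": 12, "gender": "F"},
    {"age": 47, "smoking_years": None, "gender": None},
    {"age": 55, "smoking_years": 20, "gender": "F"},
]

NUMERIC = ["age", "smoking_years"]
CATEGORICAL = ["gender"]


def impute(rows):
    """Fill numeric gaps with the column median and categorical gaps with the mode."""
    filled = [dict(r) for r in rows]
    for col in NUMERIC:
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    for col in CATEGORICAL:
        mode = Counter(r[col] for r in rows if r[col] is not None).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = mode
    return filled


def one_hot(rows):
    """Replace each categorical column with one 0/1 indicator column per category."""
    categories = {c: sorted({r[c] for r in rows}) for c in CATEGORICAL}
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in CATEGORICAL}
        for col, cats in categories.items():
            for cat in cats:
                out[f"{col}_{cat}"] = 1 if r[col] == cat else 0
        encoded.append(out)
    return encoded


clean = one_hot(impute(records))
```

In a real pipeline the fill values would normally be computed on the training split only and then reused for validation and test data, so that no information leaks from held-out records.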

2. Feature Engineering:

• Feature Selection: Identifying the most relevant features that significantly
  contribute to lung cancer risk using techniques like correlation analysis, feature
  importance ranking, etc.
• Feature Transformation: Normalizing, scaling, or transforming features to
  ensure uniformity and enhance model performance.
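
The two most common forms of feature transformation, min-max normalization and z-score standardization, can be sketched as follows. The `ages` list is invented sample data, and nothing in the report states which of the two the project applies; libraries such as scikit-learn offer equivalent transformers (`MinMaxScaler`, `StandardScaler`).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [40, 50, 60, 70, 80]  # hypothetical ages in years
scaled = min_max_scale(ages)      # [0.0, 0.25, 0.5, 0.75, 1.0]
z_scores = standardize(ages)      # symmetric around 0
```

Min-max scaling keeps the original shape of the distribution but is sensitive to outliers, while standardization is usually preferred for gradient-based models such as logistic regression and neural networks.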

3. Model Development:

• Machine Learning Algorithms: Exploring various algorithms like logistic
  regression, decision trees, random forests, support vector machines, neural
  networks, etc., to build and compare predictive models.
• Model Training: Using a portion of the data to train the models, tuning
  hyperparameters, and evaluating model performance using cross-validation
  techniques.
• Ensemble Methods: Employing ensemble techniques (e.g., stacking, boosting)
  to enhance model robustness and accuracy.
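
To make the model training step concrete, here is a minimal sketch of logistic regression, the simplest of the algorithms listed, trained by batch gradient descent in plain Python. The two features, the labels, the learning rate and the epoch count are all assumptions chosen for illustration; this is not the project's actual training code, and a production model would more likely rely on an established library.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability of the positive (high-risk) class for one sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical, already-scaled features: [smoking intensity, pollution exposure].
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # 1 = high risk (made-up labels)

w, b = train_logistic(X, y)
low = predict_proba([0.15, 0.15], w, b)   # profile resembling the low-risk group
high = predict_proba([0.85, 0.85], w, b)  # profile resembling the high-risk group
```

Because the output is a probability rather than a hard label, the same model can support both risk ranking and thresholded classification, which suits a risk assessment tool.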

4. Evaluation and Validation:

• Performance Metrics: Assessing the model's performance using appropriate
  metrics like accuracy, precision, recall, F1-score, ROC-AUC, etc.
• Validation: Conducting rigorous validation on separate test datasets to ensure
  the generalizability and reliability of the developed model.
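
The metrics named above all derive from the confusion matrix or from the ranking of predicted scores. The sketch below computes them from scratch on invented labels and scores; the data is illustrative only and does not represent results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up ground truth, hard predictions and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

For a screening-oriented model, recall deserves particular attention: a false negative (a missed high-risk individual) is usually costlier than a false positive, which accuracy alone does not reveal.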

5. Interpretability and Explainability:

• Model Interpretation: Explaining the relationships between input factors and
  the model's predictions, facilitating understanding for medical professionals and
  end-users.
• Visualizations: Generating visual aids (e.g., feature importance plots, decision
  boundaries) to enhance interpretability.

6. Ethical Considerations and Privacy:

• Ethical Guidelines: Ensuring adherence to ethical standards and regulations
  concerning patient data, consent, and confidentiality.
• Privacy Protection: Implementing measures to safeguard sensitive information
  and anonymizing data where necessary.
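
One simple privacy measure is to drop direct identifiers and replace the patient ID with a keyed hash before the data reaches the modelling stage. The sketch below uses HMAC-SHA256 from the Python standard library. The key, the field names and the list of identifiers are hypothetical, and strictly speaking this is pseudonymization rather than full anonymization, since quasi-identifiers such as age remain in the record.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored outside the dataset.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record: dict) -> dict:
    """Drop direct identifiers and replace the patient ID with its pseudonym."""
    direct_identifiers = {"name", "phone", "address"}  # assumed field names
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["patient_id"] = pseudonymize(str(record["patient_id"]))
    return cleaned


record = {"patient_id": "P-0001", "name": "Jane Doe", "phone": "000", "age": 58}
safe = strip_identifiers(record)
```

Using a keyed hash instead of a plain hash means that someone without the key cannot rebuild the mapping by hashing a list of known patient IDs, while records belonging to the same patient can still be linked.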

7. Deployment and Recommendations:

• Implementation: Developing a user-friendly interface or integrating the model
  into existing healthcare systems for practical use.
• Recommendations: Providing personalized risk assessments and actionable
  recommendations for individuals based on their assessed lung cancer risk.

8. Continuous Improvement:

• Model Updating: Establishing a framework for continuous model
  improvement with new data and emerging research to enhance accuracy and
  relevance over time.
• Feedback Mechanism: Creating a mechanism to receive feedback from
  healthcare professionals and users for ongoing refinement.

Conclusion:
The scope of this study encompasses a comprehensive and interdisciplinary approach
involving data collection, preprocessing, model development, evaluation, ethical
considerations, deployment, and continuous improvement. The ultimate goal is to
create a reliable, accurate, and user-friendly predictive tool for assessing an individual's
risk of developing lung cancer and providing personalized interventions for early
detection and prevention.

1.1.2. Motivation behind the Research


The motivation behind developing a lung cancer prediction model using machine
learning techniques is multifaceted and rooted in addressing several critical aspects:

• Early Detection and Prevention: Lung cancer is often diagnosed at advanced
  stages when treatment options are limited and the prognosis is poor. By creating
  a predictive model, the primary motivation is to enable early detection of the
  disease. Early identification of individuals at high risk can prompt timely
  screenings, leading to earlier diagnosis and potentially more effective treatment
  strategies, thus improving survival rates.

• Personalized Healthcare: Each individual's risk factors for lung cancer can
  vary significantly. By considering diverse input factors such as smoking habits,
  environmental exposures, genetic predisposition, and health history, the model
  aims to provide personalized risk assessments. This personalized approach
  allows for tailored interventions and recommendations specific to an
  individual's risk profile, enhancing the effectiveness of preventive measures.

• Public Health Impact: Lung cancer remains a significant public health
  challenge globally. Developing a predictive model contributes to public health
  initiatives by providing insights into risk factors and prevalence. This
  information can aid policymakers, healthcare providers, and public health
  authorities in formulating targeted interventions, implementing smoking
  cessation programs, improving environmental regulations, and allocating
  resources more effectively to combat lung cancer at a population level.

• Research Advancement: The project fosters advancements in the field of
  predictive analytics and machine learning applied to healthcare. Developing a
  robust predictive model involves data collection, preprocessing, feature
  engineering, and model evaluation, contributing to methodological
  advancements in analyzing complex health-related data. This can potentially
  pave the way for similar predictive models for other types of cancers or diseases.

• Improving Patient Outcomes: Ultimately, the goal is to improve patient
  outcomes and quality of life. By accurately identifying individuals at higher
  risk, the model can empower healthcare providers to offer timely interventions,
  including counseling, lifestyle modifications, early screenings, and appropriate
  medical care. This proactive approach has the potential to reduce the incidence
  of lung cancer and its associated morbidity and mortality.

• Reducing Healthcare Costs: Early detection and prevention strategies can
  significantly reduce the economic burden associated with treating
  advanced-stage lung cancer. By focusing on preventive measures and early
  interventions, healthcare costs related to extensive treatments and
  hospitalizations for advanced stages of the disease can be curtailed.

In summary, the motivation behind creating a lung cancer prediction model lies in its
potential to revolutionize early detection, personalize healthcare interventions,
positively impact public health policies, advance research methodologies, enhance
patient outcomes, and alleviate the societal and economic burdens associated with lung
cancer.

1.1.3. Limitations and Constraints


Developing a machine learning-based predictive model for lung cancer risk assessment
involves several limitations and constraints that should be acknowledged and
considered:

1. Data Availability and Quality:

• Limited or Incomplete Data: Availability of comprehensive data on all
  relevant factors (genetic, environmental, occupational, etc.) might be restricted.
• Data Quality: Inaccuracies, missing values, or biases within the dataset can
  affect model performance and reliability.

2. Ethical and Privacy Concerns:

• Data Privacy and Confidentiality: Adhering to strict privacy regulations (such
  as HIPAA) might restrict access to certain sensitive patient information,
  impacting the comprehensiveness of the dataset.
• Ethical Considerations: Balancing the need for data access with ethical
  considerations regarding patient consent, confidentiality, and fair use of data.

3. Model Development Challenges:

• Complexity of Lung Cancer Development: Lung cancer is influenced by
  multifaceted factors, and capturing this complexity within a model might be
  challenging.
• Overfitting or Underfitting: The model must strike a balance between capturing
  intricate patterns in the training data and generalizing well to new data.

4. Interpretability and Explainability:

• Complexity of Machine Learning Models: Certain models like neural
  networks might lack interpretability, making it difficult to explain the model's
  predictions, especially in a medical context.
• Communication to Stakeholders: Explaining model predictions and
  recommendations to healthcare professionals and individuals in a
  comprehensible manner might be challenging.

5. Deployment and Practical Application:

• Integration with Healthcare Systems: Compatibility issues or resistance to
  adopting new technologies within existing healthcare systems.
• User Acceptance: Ensuring that healthcare professionals and individuals trust
  and understand the model's predictions and recommendations.

6. Continual Improvement and Maintenance:

• Dynamic Nature of Data: Continuous updates and additions to the dataset and
  staying updated with the latest research might be resource-intensive.
• Model Drift: Ensuring that the model maintains accuracy over time as the
  underlying patterns in data change.

7. External Factors and Generalizability:

• Geographical and Population Differences: Models developed using specific
  datasets might not generalize well to diverse populations or different
  geographical regions.
• External Influences: New environmental factors, changes in lifestyle, or
  healthcare advancements might affect the model's relevance and accuracy.

8. Resource Constraints:

• Computational Resources: Availability of computational power and
  infrastructure required for processing large datasets and training complex
  models.
• Budget and Time Constraints: Limitations in funding and time could affect
  the extent of data collection, model development, and validation processes.

Understanding and addressing these limitations and constraints are crucial for
managing expectations, ensuring ethical compliance, and developing a model that is
both effective and practical for real-world application.

1.1.4. Proposed Solution


Solution Proposal: Lung Cancer Prediction Model

1. Data Acquisition and Preprocessing:

• Data Collection: Gather diverse datasets from reputable sources, including
  medical records, research databases, surveys, and relevant literature,
  encompassing information on smoking habits, environmental exposures,
  genetic factors, occupational history, health parameters, and demographics.
• Data Preprocessing: Perform data cleaning to handle missing values, outliers,
  and inconsistencies. Normalize or scale numerical features and encode
  categorical variables for compatibility with machine learning algorithms.

2. Feature Engineering and Selection:

• Exploratory Data Analysis (EDA): Conduct comprehensive EDA to
  understand relationships between features and lung cancer incidence. Identify
  correlations, distributions, and patterns in the data.
• Feature Selection: Employ techniques like correlation analysis, mutual
  information, or feature importance ranking to select the most relevant features
  that significantly contribute to lung cancer risk prediction.
• Feature Engineering: Create new features or transformations that capture
  complex relationships or interactions between variables, enhancing the
  predictive power of the model.
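
Correlation analysis, the first of the feature selection techniques named above, can be sketched as a ranking of features by the absolute Pearson correlation with the target. The column names and values below are invented for illustration. Pearson correlation only captures linear association, which is one reason the text also lists mutual information and model-based importance as alternatives.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical columns; values are invented for illustration.
columns = {
    "smoking_years": [0, 5, 10, 20, 30, 40],
    "air_pollution": [3, 1, 4, 2, 6, 5],
    "shoe_size": [8, 9, 8, 9, 8, 9],
}
target = [0, 0, 0, 1, 1, 1]  # 1 = diagnosed (made-up labels)

ranking = rank_features(columns, target)
top_feature = ranking[0][0]
```

On this toy data the smoking-related column ranks first and the deliberately irrelevant column ranks last. With only six rows, even an irrelevant feature shows a non-zero correlation, so rankings on small samples should be read with caution.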

3. Model Development and Optimization:

• Algorithm Selection: Experiment with various machine learning algorithms
  (e.g., logistic regression, decision trees, random forests, support vector
  machines, neural networks) to build predictive models.
• Hyperparameter Tuning: Use techniques like grid search or random search to
  optimize hyperparameters for each model, improving their performance.
• Ensemble Methods: Explore ensemble methods such as bagging, boosting, or
  stacking to combine multiple models for increased predictive accuracy and
  robustness.
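
Grid search simply evaluates every combination of candidate hyperparameter values and keeps the best one. The sketch below tunes a k-nearest-neighbours classifier, used here only because it is short to write and is not one of the algorithms listed above. It searches over the neighbour count `k` and the Minkowski exponent `p`, scoring each combination with leave-one-out accuracy. The data, including two deliberately mislabelled points, and the grid values are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def distance(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, x, k, p):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], x, p))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(X, y, k, p):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k, p) == y[i]
    return hits / len(X)


# Hypothetical scaled features; the last two points carry noisy labels.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.4, 0.3],
     [0.6, 0.7], [0.7, 0.6], [0.8, 0.9], [0.9, 0.8],
     [0.35, 0.35], [0.65, 0.65]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

grid = {"k": [1, 3, 5], "p": [1, 2]}
results = {}
for k, p in product(grid["k"], grid["p"]):
    results[(k, p)] = loo_accuracy(X, y, k, p)

best_params = max(results, key=results.get)
best_score = results[best_params]
```

With noisy labels, `k=1` follows the noise and scores worse than larger neighbourhoods, which is the kind of trade-off hyperparameter tuning is meant to expose. scikit-learn's `GridSearchCV` automates the same loop and combines it with k-fold cross-validation.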

4. Model Evaluation and Validation:

• Cross-validation: Employ k-fold cross-validation to assess model performance
  on different subsets of the dataset, ensuring generalizability.
• Performance Metrics: Measure model performance using appropriate
  evaluation metrics like accuracy, precision, recall, F1-score, ROC-AUC, and
  confusion matrices.
• External Validation: Validate the final model on independent datasets or with
  real-world data to confirm its reliability and applicability.
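
In k-fold cross-validation the data is split into k disjoint folds, and each fold serves exactly once as the test set while the remaining folds form the training set. The following sketch generates the index splits and averages a score over them. The `evaluate` callback is a hypothetical placeholder for whatever train-and-score routine the project uses, and the fold count and seed are arbitrary.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(evaluate, n_samples, k=5):
    """Average a user-supplied evaluate(train, test) score over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


folds = list(k_fold_indices(10, 3))

# Placeholder evaluation that returns a constant score; a real callback would
# fit a model on the training indices and score it on the test indices.
mean_score, fold_scores = cross_validate(lambda train, test: 1.0, n_samples=10, k=5)
```

For an imbalanced outcome such as a cancer diagnosis, a stratified variant that preserves the class ratio in every fold (for example scikit-learn's `StratifiedKFold`) is usually the safer choice.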

5. Model Interpretability and Explainability:

• Feature Importance Analysis: Use techniques such as SHAP values,
  permutation importance, or LIME to explain the importance of features in
  predicting lung cancer risk.
• Visualizations: Generate visual explanations or plots that illustrate how
  different factors contribute to an individual's risk, aiding in model interpretation
  and user understanding.
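
Of the techniques listed, permutation importance is the easiest to show from scratch: shuffle one feature column at a time and measure how much the model's accuracy drops. In the sketch below, `toy_model` is a hand-written stand-in for a trained classifier, and the feature meanings, data and repeat count are assumptions made for the example.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Stand-in for a trained classifier: flags high risk when the first feature
# (hypothetically, pack-years) exceeds 20 and ignores the second feature.
def toy_model(row):
    return 1 if row[0] > 20 else 0


X = [[5, 1], [10, 0], [15, 1], [25, 0], [30, 1], [40, 0], [8, 0], [35, 1]]
y = [toy_model(row) for row in X]  # labels the toy model predicts perfectly

importances = permutation_importance(toy_model, X, y)
```

The feature the model depends on receives a positive importance, while the ignored feature scores exactly zero. scikit-learn provides the same idea as `sklearn.inspection.permutation_importance`; SHAP and LIME go further by attributing individual predictions rather than overall accuracy.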

6. Ethical Considerations and Deployment:

• Data Privacy and Ethics: Ensure compliance with ethical standards, patient
  confidentiality, and data protection regulations throughout the project.
• Model Deployment: Deploy the finalized model in a suitable environment,
  considering integration into healthcare systems or making it accessible through
  a user-friendly interface for healthcare professionals.

7. Documentation and Reporting:

• Comprehensive Documentation: Create detailed documentation outlining the
  methodologies, algorithms utilized, data sources, preprocessing steps, model
  development, evaluation outcomes, and limitations.
• Report Generation: Prepare a comprehensive report summarizing the project
  findings, model performance, recommendations for healthcare practices, and
  potential future enhancements.

By executing these steps and implementing the proposed solution, the aim is to develop
a robust and accurate lung cancer prediction model. This model can assist healthcare
professionals in assessing individual risks, enabling early interventions, and
contributing to personalized healthcare strategies aimed at reducing the burden of lung
cancer. Additionally, this solution contributes to advancing predictive analytics in
healthcare, fostering research, and potentially impacting public health policies to
combat lung cancer more effectively.



1.2. Problem Statement
"Developing a machine learning-based predictive model for lung cancer risk assessment
leveraging diverse input factors such as smoking habits, exposure to environmental
pollutants, genetic predisposition, occupational hazards, and other relevant parameters. The
objective is to create a robust and accurate predictive tool that identifies and evaluates the
likelihood of an individual developing lung cancer, thereby facilitating early intervention
and personalized preventive measures."

1.2.1. Problem Overview and Context


Problem Overview: Developing a Lung Cancer Prediction Model
Lung cancer remains one of the most prevalent and fatal types of cancer worldwide,
often diagnosed at advanced stages when treatment options are limited. The aim of this
project is to create a machine learning-based predictive model that assesses an
individual's risk of developing lung cancer. This model will leverage various input
factors, including but not limited to:

• Smoking Habits: Smoking is a well-established primary risk factor for lung
  cancer. The model will consider different aspects such as duration, intensity, and
  cessation of smoking habits.

• Environmental Pollutants: Exposure to air pollution, industrial emissions,
  second-hand smoke, radon, asbestos, and other environmental toxins
  significantly contributes to lung cancer risk. Data related to exposure levels and
  duration will be integrated into the model.

• Genetic Predisposition: Certain genetic factors and family history play a role
  in predisposing individuals to lung cancer. Genetic markers and family history
  data will be considered to assess genetic susceptibility.

• Occupational Hazards: Certain occupations involve exposure to carcinogens
  (e.g., asbestos in construction work). Occupational history and exposure data
  will be incorporated into the model.

• Health Parameters: Additional health-related information such as pre-existing
  respiratory conditions, history of chronic diseases, age, gender, and
  demographic factors will also be taken into account.
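
Smoking duration and intensity are commonly combined into a single exposure measure, pack-years, defined as packs smoked per day (one pack being 20 cigarettes) multiplied by the number of years smoked. The sketch below shows how the smoking inputs could be turned into model features. The report does not state that the project uses pack-years, so the function and field names are assumptions for illustration.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = (cigarettes per day / 20) * years smoked; one pack = 20 cigarettes."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return cigarettes_per_day / 20.0 * years_smoked


def smoking_features(cigarettes_per_day, years_smoked, years_since_quit=None):
    """Bundle duration, intensity and cessation into one feature dictionary."""
    return {
        "pack_years": pack_years(cigarettes_per_day, years_smoked),
        # None means the person has not quit; encode that with a flag plus 0.0.
        "is_former_smoker": years_since_quit is not None,
        "years_since_quit": years_since_quit if years_since_quit is not None else 0.0,
    }


# Example: 30 cigarettes a day for 20 years, quit 5 years ago -> 30 pack-years.
features = smoking_features(cigarettes_per_day=30, years_smoked=20, years_since_quit=5)
```

Keeping cessation as a separate flag and duration, rather than folding it into pack-years, lets the model distinguish a current smoker from a former smoker with the same cumulative exposure.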



Objectives:

• Model Development: Construct a robust predictive model utilizing machine
  learning algorithms (e.g., logistic regression, decision trees, random forests,
  neural networks) to analyze the relationships between the input factors and the
  likelihood of developing lung cancer.

• Feature Selection and Engineering: Identify the most influential features
  contributing to lung cancer risk. Perform feature engineering to enhance the
  model's accuracy and interpretability.

• Data Collection and Preprocessing: Collect diverse datasets from reliable
  sources (medical records, surveys, research databases) and preprocess the data
  to handle missing values and outliers and to ensure compatibility for model training.

• Model Evaluation and Validation: Assess the model's performance using
  appropriate evaluation metrics (e.g., accuracy, precision, recall, ROC-AUC)
  through cross-validation techniques to ensure its reliability and generalizability.

• Ethical Considerations: Ensure the ethical use of sensitive health-related data,
  maintaining patient privacy and confidentiality throughout the project lifecycle.

Outcome:

The ultimate goal is to create a user-friendly predictive tool that healthcare
professionals can utilize for early detection, personalized risk assessment, and targeted
intervention strategies. This model could assist in proactive measures such as smoking
cessation programs, environmental policy changes, and personalized healthcare
interventions, potentially reducing the burden of lung cancer and improving patient
outcomes.



2. Review of Related Literature

2.1. Existing Research and Studies

“An evaluation of machine learning classifiers and ensembles for early stage prediction of lung
cancer” (M.I. Faisal): This research paper delves into the realm of predictive modeling using
statistical and machine learning techniques, emphasizing their significance across various
domains like software fault prediction, spam detection, disease diagnosis, and financial fraud
identification. Recognizing the critical role of predicting lung cancer susceptibility in guiding
effective treatments, the study aims to assess different predictors' effectiveness in enhancing
lung cancer detection efficiency based on symptomatic data. Multiple classifiers—such as
Support Vector Machine (SVM), C4.5 Decision Tree, Multi-Layer Perceptron, Neural
Network, and Naïve Bayes (NB)—are rigorously evaluated using a benchmark dataset sourced
from the UCI repository.[1]

"Lung cancer classification tool using microarray data and support vector machines" (G.
Salano): This study introduces an innovative system that harnesses gene expression data from
oligonucleotide microarrays. Its primary goal is threefold: predict the presence or absence of
lung cancer, identify the specific subtype if present, and pinpoint marker genes linked to the
particular lung cancer type. The proposed system serves as a promising tool for expedited
diagnosis and complements existing lung cancer classification methods.[2]

S. H. Liu, "Prediction of lung cancer based on serum biomarkers by gene expression
programming methods": The swift differentiation between small cell lung cancer (SCLC) and
non-small cell lung cancer (NSCLC) tumors holds pivotal significance in lung cancer
diagnosis. This research study focused on five serum markers, namely lactate dehydrogenase
(LDH), C-reactive protein (CRP), carcino-embryonic antigen (CEA), neuron-specific enolase
(NSE), and Cyfra21-1, as indicators reflecting distinct lung cancer characteristics. The study
conducted classification of lung tumors based on these biomarkers, involving 120 NSCLC and
60 SCLC patients. It aimed to establish an optimal joint utilization of biomarkers for accurate
classification, enhancing the ability to differentiate between SCLC and NSCLC tumors.[3]

Y. Choi, “Early-stage lung cancer diagnosis by deep learning-based spectroscopic analysis of
circulating exosomes”: The approach involves exploring cell exosome features via deep
learning and identifying similarities in human plasma exosomes without extensive human data
learning. The deep learning model, trained on SERS signals from exosomes of normal and lung
cancer cell lines, achieved a 95% accuracy in classifying them. In a study involving 43 patients,
including stage I and II cancer patients, the model predicted that 90.7% of the patients' plasma
exosomes had higher similarity to lung cancer cell exosomes compared to healthy controls,
correlating with cancer progression. [4]

S.J. Lee “A machine-learning approach using PET-based radiomics to predict the histological
subtypes of lung cancer”: The research focused on utilizing machine learning techniques and
PET-based radiomic features to predict histological subtypes in lung cancer. It involved 396
patients (210 ADCs, 186 squamous cell carcinomas) who underwent FDG PET/CT scans
before treatment. Key clinical factors (age, sex, tumor size, smoking status) and 40 radiomic
features extracted from PET images were studied. The study identified the most significant
features associated with lung cancer subtypes using Gini coefficient scores. [5]

S. Jondhale “Lung cancer detection using image processing and machine learning healthcare”:
Lung cancer remains a leading cause of mortality in India, necessitating advanced diagnosis
and detection methods. With the elusive nature of its causes, early detection becomes
paramount for successful treatment. This research focuses on a lung cancer detection system
employing image processing and machine learning techniques to classify the presence of lung
cancer in CT images and blood samples. CT scan images, known for their efficacy compared
to Mammography, are used to classify patients' images as normal or abnormal. [6]

M. A. Yousuf, "Detection of Lung cancer from CT image using Image Processing and Neural
network": Lung cancer detection in its premature stages is a focal point of research due to its
critical impact on patient outcomes. The proposed system is designed as a two-stage process
aimed at detecting lung cancer in its early phases, employing a series of steps encompassing
image acquisition, preprocessing, binarization, thresholding, segmentation, feature extraction,
and neural network-based detection. The system begins by inputting lung CT images,
subsequently undergoing preprocessing via various image processing techniques. In the first
stage, a Binarization technique is applied to convert the image into a binary format, followed
by comparison with a predefined threshold value to identify potential lung cancer regions. The
second stage involves segmentation to isolate the lung CT image, and a robust feature
extraction method is employed to capture critical features from the segmented images. [7]

Viergever, "Computer-aided diagnosis in chest radiography: a survey": Chest radiographs
continue to hold a prominent place in clinical practice, despite the inherent complexity in their
interpretation. Consequently, there is ongoing interest in computer-aided diagnosis (CAD)
systems to aid in the analysis of chest images. This survey aims to categorize and provide
concise reviews of over 150 papers spanning the last three decades, focusing on the
computer-based analysis of chest images. The literature review encompasses a wide array of
techniques and methodologies utilized in computer analysis for chest radiography. Various
approaches and advancements in CAD systems are summarized, highlighting their strengths
and limitations. [8]



Nice, Jr., "Digital computer determination of a medical diagnostic index directly from chest X-
ray images": This pioneering research employed digital technology to record chest X-ray
images on magnetic tape via a flying spot scanner and analog-to-digital converter.
Subsequently, a digital computer system processed these taped images utilizing a stored
program. The computer's automated analysis focused on measuring the maximum transverse
diameter of the heart shadow and rib cage shadow from the X-ray images. The calculated ratio
between these two measurements yielded the cardiothoracic ratio, a standard diagnostic index
extensively used by physicians to detect cardiac pathology, particularly heart enlargement.
Notably, this research marks the first successful determination of this diagnostic index directly
from unaltered X-ray films through the innovative use of a digital computer. [9]

H. M. Joseph, "Image processing": This research explores the visualization of a scalar function
of two independent variables as an image, enabling the conception of all mathematical
operations as modifications or processing of the original image. Specifically, the study focuses
on a class of modifying operators achieved through specialized scanning techniques,
eliminating the need for rapid access memory storage devices. The investigation identifies two
significant operators: contour enhancement and contour outlining. Contour enhancement
exhibits effects similar to deblurring, akin to aperture correction, and crispening observed in
television practices. [10]

J. M. Hollywood, "A new technique for improving the sharpness of pictures": This research
focuses on a technique known as "crispening" designed to enhance the apparent picture
definition in the CBS color-television system. The method utilizes nonlinear circuitry to modify
the apparent rise time of an isolated step input applied to a bandwidth-limited system. The
principle behind crispening involves adding a second waveform, representing the difference
between the desired and original waveforms, to a slow transition waveform. This addition aims
to create a narrower "spike" shape, superimposed on the original waveform, effectively
reducing the rise time by about half.[11]

Fredendall, "Analysis synthesis and evaluation of the transient response of television
apparatus": This research delves into the relationship between the sharpness of detail in
television pictures and the transmitter's capacity to transmit abrupt changes in picture half-tone.
The study focuses on the utilization of square waves, particularly a square-wave test signal
with a sufficiently long period, as a suitable method for evaluating subjective sharpness in
transmitted pictures. The paper deduces rules for evaluating the expected subjective sharpness
based on the square-wave response of the transmitting apparatus. It introduces rapid chart
methods for analyzing a square-wave output into sine-wave amplitude and phase
responses.[12]

N. Ayache, "Medical image analysis: Progress over two decades and the challenges ahead":
The paper explores the evolution of medical image analysis within the pattern analysis and
machine intelligence (PAMI) community, tracing its trajectory from initial applications of
pattern analysis and computer vision techniques to medical datasets to its emergence as a
distinct and significant discipline. Over the past two to three decades, the field has undergone
significant transformation due to the unique challenges posed by medical image analysis.
Notable aspects include the distinct types of image information obtained, the complex and fully
three-dimensional nature of medical image data, the nonrigid motion and deformation of
objects, and the statistical variation present in both normal and abnormal image ground
truths.[13]

R.P.A. Grzeszczuk, "Clinical Applications of Three-Dimensional Rendering of Medical Data
Sets": This paper focuses on highlighting the diverse clinical applications of volumetric
rendering techniques in medical imaging, propelled by advancements in high-resolution
imaging modalities like MRI and CT, alongside progress in computer technology. It aims to
provide a comprehensive overview for those seeking a general understanding of the clinical 3D
rendering process and its applications. The research identifies and outlines various clinical
applications that demonstrate potential for utilizing volumetric rendering of medical images.
These applications span different stages of medical practice, including diagnostics,
preoperative planning, intraoperative navigation, surgical robotics, postoperative validation,
training, and telesurgery.[14]

S. Tsuji, "A Plan-Guided Analysis of Cineangiograms for Measurement of Dynamic Behavior
of the Heart Wall": This research paper presents a system tailored for processing noisy dynamic
images, focusing on cineangiograms, which are X-ray motion pictures that capture the beating
heart through the injection of X-ray opaque dye via a catheter. The system's primary task involves
detecting both the internal and external surfaces of the left ventricular chamber and measuring
the spatial and temporal changes in heart wall thickness, crucial for diagnosing various heart
diseases.[15]

Yu et al. (2016) obtained histopathology whole-slide images of lung adenocarcinoma and
squamous cell carcinoma that had been stained with hematoxylin and eosin. The images were
taken from TCGA (The Cancer Genome Atlas), plus an additional 294 images from the Stanford
TMA (Tissue Microarray) Database. Even a careful assessment by human pathologists cannot
reliably predict a patient's prognosis. A total of 9,879 quantitative image features were
extracted, and machine learning algorithms were used to select the most important features and
to differentiate between shorter-term and longer-term survivors among patients diagnosed
with stage I adenocarcinoma or squamous cell carcinoma. The researchers used the TMA
cohort to validate the survival prediction framework (P < 0.036 for tumor type).
According to the findings of this study, automatically derived image features may
be able to forecast the prognosis of a lung cancer patient and, as a consequence, may help in
the development of personalized medicine. The methodologies outlined can also be
applied to the analysis of histopathology images of other organs [16].

Pol Cirueda and his colleagues used an aggregation of texture features that preserves the
spatial covariances across features. Local responses of texture operators are traditionally
combined with aggregation functions such as the average; the covariance-based aggregation
is intended to avoid the limitations of that traditional approach. Pretreatment computed
tomography (CT) scans were used to help predict NSCLC nodule recurrence before treatment.
The proposed method was then used to estimate the type of NSCLC nodule recurrence with a
manifold regularized sparse classifier. These findings, which open new study possibilities on
how to use morphological tissue traits to evaluate cancer invasion, need to be confirmed and
investigated further. To model orthogonal information, the authors focused on the textural
characteristics of nodular tissue and combined them with other variables such as the size and
shape of the tumor [17].

Manasee Kurkure and Anuradha Thakare (2016) proposed a method for the early detection and
accurate diagnosis of lung cancer that makes use of CT, PET, and X-ray images, and the work
has attracted considerable attention. A genetic algorithm is used to optimize the results and
permit the early identification of lung cancer nodules. Naive Bayes classification was combined
with the genetic algorithm in order to classify the various stages of cancer images accurately
and quickly while circumventing the intricacy of the generation process. The classification
accuracy is reported to be up to 80% [18].

Sangamithraa and Govindaraju [19] used a preprocessing strategy based on median and Wiener
filters to eliminate unwanted noise and improve the quality of the data. The CT images are
segmented with the K-means method, and EK-mean clustering is used for clustering. Fuzzy
EK-mean segmentation is used to extract contrast, homogeneity, area, correlation, and entropy
features from the images. A back propagation neural network performs the
classification [20].

Ashwini Kumar Saini et al. (2016) summarize the types of noise that can affect lung cancer
images and the strategies for removing them. Because lung cancer is considered one of the
most life-threatening kinds of cancer, with high incidence and mortality rates, it is essential
that it be detected in its earlier stages. The quality of digital X-ray image analysis must be
significantly improved for such detection to be successful. A clinical pathology diagnosis
continues to be the gold standard for detecting lung cancer, even though reducing image noise
is currently one of the primary focuses of research. Chest X-rays, cytological examinations of
sputum samples, fiber-optic investigations of the bronchial airways, and CT and MRI scans are
the diagnostic tools used most frequently in the detection of lung malignancies. Despite the
availability in many parts of the world of screening methods like CT and MRI that are more
sensitive and accurate, chest radiography continues to be the primary and most prevalent
imaging procedure. It is routine practice to test for lung cancer in its early stages using chest
X-rays and CT scans; however, these scans suffer from weak sensitivity and specificity [19].



Neural ensemble-based detection (NED) is the automated method of illness diagnosis that was
suggested in Kureshi et al.'s research [21]. The suggested approach has three main
components: feature extraction, classification, and diagnosis. The experiment used chest
X-ray films taken at Bayi Hospital. The method is recommended because it has a high
identification rate for needle biopsies together with a reduced number of false negative
identifications, which improves accuracy and can help save lives [22].

Kulkarni and Panditrao [23] created an image processing algorithm for early-stage cancer
identification that they report to be more accurate than previous methods. Processing time is
one of the factors considered while looking for anomalies in the target images. The position
of the tumor can be seen clearly in the original image. To improve the results, Gabor filtering
and watershed segmentation are applied at the preprocessing stage. Three features are
extracted from the region of interest to help recognize the stage of lung cancer: eccentricity,
area, and perimeter. The tumors were found to vary in size, and the proposed method is
capable of providing precise measurements of tumor size at an early stage [21].

Westaway et al. [24] used a radiomic approach to extract three-dimensional features from
CT images of lung cancer in order to provide prognostic information. Classifiers were built
to estimate patient survival time. The CT scans used in the experiment were obtained from
the Moffitt Cancer Center in Tampa, Florida. Because the properties of CT images may
suggest phenotypes, analysis based on them may be able to generate more accurate
predictions. When a decision tree was used to make the survival predictions, it achieved an
accuracy of 75% [23].

Chaudhary and Singh [25] described a lung cancer detection method that uses image
processing to categorize CT (computed tomography) images of lung cancer. Several steps,
including preprocessing, segmentation, and the extraction of features, are investigated. The
authors treat segmentation, augmentation, and feature extraction each in its own section. In
Stages I, II, and III, the cancer is contained inside the chest and manifests as larger, more
invasive tumors. By Stage IV, however, the cancer has spread to other parts of the body [24].



2.2. Research Gap

From the provided research summaries, several potential research gaps or areas for
further exploration might be identified:
Noise Reduction and Image Enhancement Techniques: While the reviewed studies touch upon
noise reduction in medical imaging, there is room to investigate advanced noise
reduction and image enhancement techniques specifically tailored for dynamic
medical images like cineangiograms. Investigating more robust algorithms could lead
to better image quality and more accurate boundary detection.
Automated Boundary Detection: Despite the sophisticated edge detection methods
mentioned, there could be scope for developing more automated and efficient
algorithms to detect boundaries accurately, particularly in cases of low-contrast regions
or images affected by noise. This could involve exploring machine learning or deep
learning techniques for improved segmentation and boundary detection.
Real-time Processing and Analysis: Expanding research on real-time processing of
dynamic medical images, such as cineangiograms, might be valuable. Developing
systems that can process and analyze images in near-real-time during medical
procedures could aid clinicians by providing immediate feedback and guidance.
Clinical Validation and Standardization: While the mentioned research shows
promising results compared to radiologist-detected boundaries, further clinical
validation across a larger and more diverse dataset could be beneficial. Additionally,
establishing standardized protocols and benchmarks for evaluating the accuracy and
reliability of image processing systems in clinical settings could enhance their adoption.
Integration of Multiple Imaging Modalities: Exploring the integration of data from
various imaging modalities (e.g., MRI, CT scans) alongside cineangiograms could
provide a more comprehensive understanding of cardiac structures and functions. This
integration might offer richer diagnostic insights and improve the accuracy of disease
detection.
User Interface and Clinical Adoption: Investigating user-friendly interfaces and system
integration into clinical workflows could bridge the gap between research and practical
clinical application. Ensuring ease of use and seamless integration of these systems into
existing medical practices is crucial for their widespread adoption.
Addressing these potential research gaps could contribute to advancements in medical
imaging technology, enhancing diagnostic accuracy, clinical decision-making, and
ultimately improving patient care in the field of cardiology and beyond.
Research on Lung Cancer Detection using Image Processing and Machine Learning:
Research Summary: This study focuses on lung cancer detection using image
processing and machine learning techniques, highlighting the importance of early-stage
detection for favorable prognosis.



Potential Research Gap: While the research outlines the use of SVM and image
processing for lung cancer detection, further exploration into hybrid models integrating
diverse machine learning algorithms might improve accuracy. Additionally,
investigating the integration of multiple imaging modalities (like CT scans and
histopathological images) for more comprehensive detection could be valuable.
Research on Computer-Aided Diagnosis in Chest Radiography:
Research Summary: The paper reviews computer-aided diagnosis in chest radiography,
emphasizing the challenges and advancements in this domain.
Potential Research Gap: The research identifies challenges but does not delve into
specific methods to overcome them. Further exploration could involve proposing novel
algorithms or approaches to tackle the challenges posed by interpreting chest
radiographs, thereby enhancing accuracy and efficiency.
Research on Cardiac Diagnosis via Digital Computer System:
Research Summary: This pioneering research introduces a digital computer system for
cardiac diagnosis via chest X-ray films, aiming to enhance diagnostic accuracy.
Potential Research Gap: While the study successfully establishes a method for cardiac
diagnosis, future research could explore the application of this system to a wider range
of cardiac conditions. Moreover, validating the system's accuracy across diverse patient
populations could enhance its reliability and practical utility.
Research on 3D Volumetric Rendering in Medical Imaging:
Research Summary: This paper discusses clinical applications and implementation of
volumetric rendering in medical imaging, emphasizing potential uses in diagnostics,
preoperative planning, etc.
Potential Research Gap: The research provides an overview but lacks detailed insights
into specific volumetric rendering techniques or implementation challenges. Future
studies could focus on evaluating and comparing different rendering methods,
considering their efficacy, limitations, and practical feasibility in clinical settings.
Research on Square Wave Analysis for Television Picture Sharpness:
Research Summary: This study explores the analysis of square waves for evaluating
television picture sharpness, focusing on the relationship between transmitter responses
and image quality.
Potential Research Gap: While the research covers square wave analysis, further
exploration into advanced techniques for enhancing image sharpness could be
beneficial. Investigating modern image processing methods and their impact on image
quality in television could be an area of interest.
Research on Heart Wall Surface Detection in Cineangiograms:
Research Summary: This research presents a plan-guided analysis system for
cineangiograms, aimed at detecting heart wall surfaces and measuring wall thickness.



Potential Research Gap: While the study demonstrates effective boundary detection,
future research might focus on real-time implementation and validation across a broader
dataset. Exploring automated segmentation techniques and their robustness in noisy
dynamic images could further improve accuracy.
These summaries suggest potential areas for future research, including advancements
in machine learning algorithms, novel image processing techniques, validation across
diverse datasets, and real-time implementation for practical clinical applications.
Addressing these gaps could lead to more accurate and reliable diagnostic tools in
medical imaging and television picture processing.



3. Proposed System and Methodology

3.1 System Architecture and Design

Fig 1: Lung Cancer Prediction Architecture [25]

Designing the system architecture for a machine learning-based predictive model for lung
cancer risk assessment involves several components and considerations. Here's a high-level
overview of the system architecture and design model for such a project:

System Architecture:
1. Data Collection and Preprocessing:
• Data Sources: Gather data from diverse sources such as medical records, surveys,
public databases, and research studies containing information on smoking habits,
environmental pollutants, genetic predisposition, occupational hazards, and other
relevant parameters.
• Data Preprocessing Pipeline: Develop a robust pipeline for cleaning, formatting,
encoding, and standardizing data. This includes handling missing values, outlier
detection, and feature scaling.
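
The preprocessing steps above can be illustrated with a minimal sketch. It assumes a single numeric column in which missing values are stored as None; the function names and the sample ages are hypothetical, and outlier handling and categorical encoding are left out for brevity.

```python
from statistics import mean, median, pstdev


def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Scale a numeric column to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def preprocess(column):
    """Impute first, then scale, so the fill value does not skew the scaling step order."""
    return standardize(impute_median(column))


# Hypothetical patient ages with one missing value.
ages = [54, None, 61, 47, 70]
scaled_ages = preprocess(ages)
```

In a full pipeline the imputation value and the scaling statistics would be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.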



2. Feature Engineering and Selection:
• Feature Engineering: Implement techniques to extract, transform, and create
relevant features that contribute significantly to lung cancer risk assessment. This
might include normalization, dimensionality reduction, and feature scaling.
• Feature Selection: Employ methods to identify the most impactful features for
building the predictive model, such as correlation analysis, feature importance
ranking, and domain knowledge-based selection.
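
As an illustration of correlation-based feature selection, the sketch below ranks features by the absolute Pearson correlation between each feature and a binary label. The feature names and values are hypothetical, and correlation is only one of the selection methods listed above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_features(features, labels):
    """Return feature names sorted by |correlation with the label|, plus the scores."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical toy data: six patients, three candidate features.
features = {
    "pack_years": [0, 5, 20, 35, 40, 2],
    "age": [40, 62, 55, 47, 70, 66],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
labels = [0, 0, 1, 1, 1, 0]
ranking, scores = rank_features(features, labels)
```

A linear correlation score misses non-linear and interaction effects, which is why it would normally be combined with model-based importance rankings and domain knowledge.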

3. Model Development and Training:


• Machine Learning Model Selection: Experiment with various machine learning
algorithms (e.g., logistic regression, decision trees, random forests, neural
networks) to develop the predictive model.
• Model Training and Validation: Utilize a portion of the dataset for training,
perform hyperparameter tuning, and validate the model using cross-validation
techniques to ensure robustness and generalizability.
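
The validation step can be sketched as follows: a k-fold index generator and a helper that computes accuracy, precision, and recall from binary predictions. Both helpers are hypothetical illustrations; the folds here are contiguous, whereas a real experiment would usually shuffle and stratify the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


folds = list(k_fold_indices(10, 3))
metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In a cross-validation run, the model would be refit on each training index set and the metrics averaged over the folds.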

4. Interpretability and Explainability:


• Model Interpretation: Implement methods to enhance model interpretability, such as
feature importance visualization, SHAP (SHapley Additive exPlanations), LIME (Local
Interpretable Model-Agnostic Explanations), or other explainable AI techniques.
• Visualizations: Generate visual aids to explain model predictions and help healthcare
professionals understand the rationale behind the risk assessments.
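
One simple model-agnostic way to estimate feature importance is permutation importance: shuffle one feature column and measure how much the accuracy drops. The sketch below uses this technique in place of SHAP or LIME, which need external libraries; the toy model, the threshold of 20, and the data are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(1 for r, y in zip(rows, labels) if model(r) == y) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled_rows, labels))
    return importances


def toy_model(row):
    # Hypothetical rule: flag high risk when the first feature (pack-years) exceeds 20.
    return 1 if row[0] > 20 else 0


# Each row is [pack_years, age]; labels are generated by the toy rule itself.
rows = [[5, 61], [30, 45], [25, 70], [0, 52], [40, 66], [10, 49]]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels, n_features=2)
```

Because the toy model ignores the second feature, shuffling it cannot change any prediction, so its importance is zero, while the first feature can only lose accuracy when shuffled.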

5. Deployment and Integration:


• System Integration: Design an interface or platform to integrate the predictive
model into existing healthcare systems or as a standalone tool for easy access by
healthcare professionals.
• Scalability and Performance: Ensure the system's scalability and efficiency to
handle large volumes of data and provide real-time predictions.

Design Model:
1. Sequential Model:
The process flow might follow a sequential pattern, starting from data collection,
preprocessing, feature engineering, model development, validation, interpretation, and
finally, deployment.

2. Modular Design:
Modularize different components of the system architecture for easier maintenance and
scalability. Modules might include data ingestion, preprocessing, feature engineering,
model training, validation, and deployment.

3. Feedback Loop:
Implement a feedback loop mechanism to continuously improve the model by
incorporating new data, feedback from healthcare professionals, and advancements in
research.

4. Security and Privacy:


Incorporate robust security measures to protect sensitive patient data and ensure
compliance with privacy regulations (e.g., encryption, access controls, anonymization
techniques).
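
As one possible building block for these measures, patient identifiers can be replaced by keyed hashes before the data reaches the modelling pipeline. The sketch below uses HMAC-SHA256 from the Python standard library; the function name, the identifiers, and the key are hypothetical. Note that this is pseudonymization rather than full anonymization, since whoever holds the key can re-link records.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed pseudonym (hex digest) for a patient identifier."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be stored in a secrets manager, not in code.
key = b"replace-with-a-secret-key"
token_a = pseudonymize("PATIENT-0001", key)
token_b = pseudonymize("PATIENT-0002", key)
```

The same identifier always maps to the same pseudonym under a given key, so records can still be joined across tables without exposing the original identifier.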

5. Documentation and Monitoring:


Document each stage of the system architecture and model development for transparency
and reproducibility. Implement monitoring tools to track model performance and data drift.
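
A very simple form of data drift monitoring compares the mean of a feature in newly collected data with its mean in the reference (training) data, measured in units of the reference standard deviation. The sketch below assumes this check; the function names, the sample values, and the threshold of 0.5 are hypothetical, and production monitoring would typically use more robust statistical tests.

```python
from statistics import mean, pstdev


def mean_shift(reference, current):
    """Absolute difference of means, in units of the reference standard deviation."""
    sigma = pstdev(reference)
    diff = abs(mean(current) - mean(reference))
    if sigma == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / sigma


def has_drifted(reference, current, threshold=0.5):
    """Flag drift when the standardized mean shift exceeds the (assumed) threshold."""
    return mean_shift(reference, current) > threshold


# Hypothetical patient ages: training data versus two batches of new data.
reference_ages = [50, 55, 60, 65, 70]
similar_batch = [52, 58, 61, 64, 69]
older_batch = [70, 75, 80, 85, 90]
```

A drift flag would then trigger a review of the incoming data and, if needed, retraining of the model.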

6. Collaboration and Interdisciplinary Approach:


Encourage collaboration between data scientists, healthcare professionals, domain experts,
and ethicists throughout the project to ensure the model's accuracy, relevance, and ethical
compliance.

Conclusion:
The system architecture and design model for a machine learning-based predictive model
for lung cancer risk assessment should emphasize data quality, model performance,
interpretability, scalability, security, and ethical considerations. It should be flexible enough
to adapt to evolving data and healthcare needs while delivering accurate risk assessments
and actionable insights for early intervention and personalized preventive measures.

Prediction Models
The prediction problem is formulated as binary classification. The hospitalization when
cancer occurred was used as a class label. If diagnosed with cancer, we assigned a patient
to the positive class ('1'). Otherwise, we put the patient into the negative class ('0'). We
experimented with two different RNN models. These models are advantageous for the
sequence data, especially when one data point is dependent on the preceding data point,
like in our case. The reason is that they have a memory to store the states or information of
previous inputs in order to construct the sequence's subsequent output. This mechanism is
also known as a hidden state. The following equations explain the learning process:

To calculate the hidden state for the next step, we use the input weights and the hidden
units' weights together with the input from the current time step and the bias from the
recurrent layer. At the end of the calculation, a nonlinear transformation (ReLU) is
applied. Furthermore, to predict the output, we multiply the newly learned hidden state
with the weights from the output layer. We also add the bias of all neurons in the network.
Finally, everything is passed through a sigmoid function.
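
The symbols and equations for this passage did not survive in the text. A minimal reconstruction from the verbal description, assuming conventional notation, is given below. The symbol names (hidden state h, input x, input weights W_x, hidden weights W_h, recurrent bias b_h, output weights W_y, output bias b_y, prediction y-hat) and the time indexing are assumptions, not the original notation.

```latex
\begin{align}
h_t &= \mathrm{ReLU}\left(W_x x_t + W_h h_{t-1} + b_h\right) \\
\hat{y}_t &= \sigma\left(W_y h_t + b_y\right)
\end{align}
```

Here the sigmoid function maps the output to a value between 0 and 1, which can be read as the predicted probability of the positive class.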

The first model contains layers with LSTM units capable of learning long-term
dependencies in sequential data. Remembering information for long periods is practically
their default behavior. The second model has layers with GRUs. Unlike the LSTM unit, the
GRU has gating units that modulate the flow of information without a separate memory cell
[38]. This structure allows the model to adaptively capture dependencies in long data sequences
without discarding information from earlier parts of the sequence.
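For reference, one common way to write the GRU update is shown below. The notation is an assumption chosen for illustration and is not taken from [38]; some texts swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

The gates act directly on the hidden state $h_t$, so no separate memory cell is needed, in contrast to the LSTM, which maintains a cell state alongside the hidden state.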

The architectures of both models are identical, with one hidden layer of 64 neurons (Fig.
2). Empirical evaluation of RNN models showed that both the LSTM and GRU
demonstrated superiority over traditional ML models [39]. Since LSTM and GRU
architectures have shown surpassing results in various applications, we compared both in
our experiments.

Fig 2: Architectural Model of LSTM [26]

SVD and embedding layer were tested separately with both RNN methods. The output layer
contains only one neuron with the sigmoid activation function. The adaptive learning rate
optimization algorithm ADAM was used to train the RNN models [40].

A potential problem when training neural networks is the choice of the number of epochs. Too many
epochs can lead to overfitting, whereas too few may
result in an underfit model. For this reason, the sequential learning models in our application
used early stopping, which monitors the model's performance during training. The
objective of the method is to stop training when the validation loss (binary cross-entropy
loss) starts to increase consistently. As a result, both RNN models were trained for at most 20
epochs, unless stopped earlier by early stopping.
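The stopping rule can be sketched as follows. This is an illustrative, framework-free version: the list of validation losses stands in for the loss measured after each real training epoch, and the patience of 3 epochs is an assumed value, since the report does not state one.

```python
def train_with_early_stopping(val_losses, max_epochs=20, patience=3):
    """Return (epochs_run, best_epoch) for a stream of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or after `max_epochs`, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Made-up validation losses: improvement until epoch 4, then a steady rise.
losses = [0.69, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.60]
epochs_run, best_epoch = train_with_early_stopping(losses)
```

In practice the weights from `best_epoch` would be restored, so that the model kept is the one with the lowest validation loss rather than the last one trained.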

We used a batch size of 64 because it keeps the memory required during training low.
A small batch size was also chosen because it is reported across many
applications that small batches give stable training and improved
generalization performance [41].

To compare the performance of the proposed sequence learning models, we also trained
four standard machine learning models: a decision tree (DT), a multilayer perceptron (MLP), a
random forest (RF), and k-nearest neighbors (KNN). Only the default settings
provided in the scikit-learn Python library were used for DT and MLP, without parameter
tuning [42]. For RF and KNN, we used the standard implementation with basic settings (for
RF the maximum depth was set to 10 and the number of trees to 100, while for KNN the
number of nearest neighbors was 3). All prediction models were run separately for each of
the four studied cancers. We trained the models on 80% of patients selected entirely at
random, and the remaining 20% we used for testing, while 25% of the training set was used
for training validation. All models were run on balanced datasets, and we measured test
accuracy, Area Under the Receiver Operating Characteristic curve (AUROC), sensitivity
(recall), specificity, precision, and F1 score. Prediction accuracy was chosen as the primary
metric since there are equal numbers of patients in both classes for each cancer. However, we also
reported the AUROC score for a more comprehensive evaluation of the models. The
difference between these two metrics lies in the decision threshold, i.e. the class
probability threshold. In binary classification, the threshold is the value above which a
sample is assigned to class one. AUROC evaluates a binary classifier's
output over decision thresholds varying between 0 and 1, whereas accuracy indicates
how well a classifier performs at the default threshold of 0.5. High accuracy and high
AUROC indicate that the classifier performs well at the default threshold and
similarly well at many other threshold values. Additionally, an accurate classifier
should have high sensitivity and specificity. Since the AUROC score summarizes the
model's efficacy in terms of sensitivity and specificity for various decision thresholds, we
calculated those two metrics only for the 0.5 threshold.
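To make the distinction concrete, the sketch below computes accuracy, sensitivity, and specificity at a single threshold, and AUROC as the fraction of (positive, negative) pairs that the classifier ranks correctly, which needs no threshold at all. The labels and scores are made-up values for illustration, and the function names are our own.

```python
def threshold_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity and specificity at one decision threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auroc(y_true, y_score):
    """Probability that a random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Balanced toy data: three positives and three negatives.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

metrics = threshold_metrics(y_true, y_score)   # fixed threshold of 0.5
auc = auroc(y_true, y_score)                   # threshold-free ranking quality
```

Here two of the six samples fall on the wrong side of 0.5, yet only one of the nine positive-negative pairs is ranked in the wrong order, so AUROC is higher than accuracy.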

Fig 3: GRU’s accuracy Comparison [27]

3.2. System Flow and Use Case

A use case diagram for the machine learning-based predictive model for lung
cancer risk assessment identifies the primary actors interacting with the system
and illustrates their interactions. A simplified representation of the use case
diagram follows.

Use Case Diagram for Lung Cancer Risk Assessment System:


Actors:
• Healthcare Professional: Interacts with the system to access risk assessments and
recommendations for patients.
• System: Represents the machine learning-based predictive model for lung cancer
risk assessment.

Use Cases:
Collect Data:
• Description: The system collects diverse data sources related to patients' smoking
habits, environmental exposure, genetics, etc.
• Actors: System
Preprocess Data:
• Description: The system cleans, preprocesses, and prepares the collected data for
model training.
• Actors: System
Train Model:
• Description: The system utilizes machine learning algorithms to train the predictive
model based on the preprocessed data.
• Actors: System
Validate Model:
• Description: The system evaluates the trained model's performance using validation
techniques.
• Actors: System
Provide Risk Assessment:
• Description: Healthcare professionals interact with the system to obtain
personalized lung cancer risk assessments for patients.
• Actors: Healthcare Professional, System
Present Recommendations:
• Description: The system provides actionable recommendations based on the risk
assessment for early intervention and personalized preventive measures.
• Actors: Healthcare Professional, System

Fig 4: Use Case Diagram of the System [28]

Relationships:
• Healthcare Professional --> Provide Risk Assessment --> System: Initiates the
request for patient-specific risk assessment.
• Healthcare Professional --> Present Recommendations --> System: Receives
personalized recommendations based on the risk assessment.
• System --> Collect Data --> System: Collects diverse data sources for model
training.
• System --> Preprocess Data --> System: Cleans and prepares collected data for
training.
• System --> Train Model --> System: Utilizes data to train the predictive model.
• System --> Validate Model --> System: Assesses the model's performance through
validation.

This use case diagram outlines the primary interactions between the actors (healthcare
professionals and the system) and the key functionalities involved in the development and
utilization of the predictive model for lung cancer risk assessment.

A use case is a representation of interactions between an actor (an external entity, which
can be a user or another system) and a system. It describes the functionality or behavior of
a system from the perspective of its users. Each use case represents a specific goal or action
that an actor wants to achieve when interacting with the system.

Components of a Use Case:

• Use Case Name: Describes the action or goal that an actor wants to accomplish.
• Actors: Represent entities interacting with the system. They can be users, external
systems, or any other role that engages with the system to achieve specific tasks.
• Description: Details the specific functionality or behavior associated with the use
case.
• Trigger: Describes the event or condition that initiates the use case.
• Preconditions: Specifies any conditions that must be true for the use case to start.
• Postconditions: States the expected outcome or state of the system after the use
case is completed successfully.
• Flow of Events: Describes the sequence of steps or actions that occur when the use
case is executed. It typically includes the main flow (basic course of actions) and
alternative flows (exceptions or variations).
• Exceptions: Covers exceptional scenarios or error conditions that might occur
during the execution of the use case.

Example: Use Case - Provide Risk Assessment

Use Case Name: Provide Risk Assessment


Actors: Healthcare Professional, System
Description: This use case involves a healthcare professional interacting with the system
to obtain personalized risk assessments for patients regarding their likelihood of developing
lung cancer.
Trigger: The healthcare professional requires a risk assessment for a specific patient or
group of patients.

Preconditions:
• The system has collected and preprocessed relevant patient data.
• The machine learning model for lung cancer risk assessment is trained and
validated.

Postconditions:
• The healthcare professional receives the personalized risk assessment for the
patient(s).
• The system maintains the confidentiality and security of patient data.

Flow of Events:
• Healthcare Professional requests risk assessment: The healthcare professional
logs into the system and provides patient-specific information required for the risk
assessment.
• System processes the request: The system utilizes the trained predictive model to
analyze the provided data and generates a personalized risk assessment.
• System presents risk assessment: The system displays the risk assessment results
to the healthcare professional, providing insights into the patient's likelihood of
developing lung cancer.
• Healthcare Professional reviews and interprets the assessment: The healthcare
professional interprets the risk assessment and uses it to inform further medical
decisions or interventions.

Exceptions:
If the system encounters errors in data processing or model failure, it notifies the healthcare
professional and prompts appropriate actions or troubleshooting steps.
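The flow of events and the exception path for this use case can be sketched in code as follows. Everything here is hypothetical and for illustration only: the class and function names, the patient identifier, and the stub model, which returns made-up scores in place of the trained model.

```python
from dataclasses import dataclass


@dataclass
class RiskAssessment:
    patient_id: str
    risk: float        # predicted probability in [0, 1]
    message: str       # text shown to the healthcare professional


class AssessmentError(Exception):
    """Raised when a request cannot be processed (bad data or model failure)."""


def provide_risk_assessment(patient_id, features, model):
    """Run one request through the flow: check the input, score it, report the result."""
    if not features:
        raise AssessmentError("no patient data supplied")
    try:
        risk = model(features)
    except Exception as exc:
        raise AssessmentError(f"model failure: {exc}") from exc
    if not 0.0 <= risk <= 1.0:
        raise AssessmentError("model returned an invalid probability")
    return RiskAssessment(patient_id, risk, f"Estimated risk: {risk:.0%}")


def stub_model(features):
    """Placeholder for the trained model; the returned scores are made up."""
    return 0.8 if features.get("smoker") else 0.1


result = provide_risk_assessment("P-001", {"smoker": True, "age": 61}, stub_model)
```

Raising a dedicated error type keeps the exception path explicit: the calling interface can catch `AssessmentError` and notify the healthcare professional instead of displaying a misleading score.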

Fig 5: UML Sequence Diagram

A Unified Modeling Language (UML) description of the machine learning-based predictive
model for lung cancer risk assessment can involve several diagram types, such as class diagrams,
activity diagrams, and sequence diagrams. For the purposes of this project, a
high-level UML description outlining the main components and their interactions is given below:

UML Diagram for Lung Cancer Risk Assessment System:


Class Diagram:
A class diagram showcases the system's classes, their attributes, methods, and relationships.

Classes:
• Data Collector: Responsible for collecting diverse data sources.
• Data Preprocessor: Handles data cleaning, formatting, and preprocessing tasks.
• Model Trainer: Utilizes machine learning algorithms to train the predictive model.
• Model Validator: Evaluates the trained model's performance using validation
techniques.
• Healthcare Professional: Represents the user interacting with the system.
• Predictive Model: Encapsulates the machine learning model for lung cancer risk
assessment.

Activity Diagram:
An activity diagram illustrates the flow of activities or processes within the system.

Activities:
• Collect Data: Data Collector gathers data from various sources.
• Preprocess Data: Data Preprocessor cleans and prepares the collected data.
• Train Model: Model Trainer uses data to train the predictive model.
• Validate Model: Model Validator assesses the model's performance.
• Provide Risk Assessment: Interaction between Healthcare Professional and
Predictive Model to obtain risk assessments.
• Present Recommendations: Predictive Model presents actionable
recommendations based on risk assessments.

Sequence Diagram:
A sequence diagram shows the interactions between objects in a specific scenario or use
case.

Sequence:
• Healthcare Professional -> Provide Risk Assessment -> Predictive Model:
Healthcare Professional initiates a request for risk assessment.
• Predictive Model -> Provide Risk Assessment -> Healthcare Professional:
Predictive Model generates and provides risk assessment to Healthcare
Professional.
• Healthcare Professional -> Present Recommendations -> Predictive Model:
Healthcare Professional receives and interprets the recommendations.

This UML diagram provides a high-level overview of the system's components (classes),
their relationships, and the flow of activities (activity and sequence diagrams) involved in
the development and utilization of the predictive model for lung cancer risk assessment. It
serves as a visual representation to understand the system's structure and behavior at a
conceptual level.

Fig 6: Flowchart of the methodology for Cancer Detection

SAM (State-Action-Model) is an architectural pattern used for structuring front-end
applications. It does not apply directly to a machine learning-based predictive
model for lung cancer risk assessment, which typically involves data processing, model
development, and deployment in a backend or server-side environment. Nonetheless, the
SAM principles can be adapted to the development and
deployment phases of the predictive model system as follows:

State:
In the context of the lung cancer risk assessment model:

• Data State: Represents the diverse data collected from various sources (smoking
habits, environmental pollutants, genetic predisposition, occupational hazards,
etc.).
• Preprocessed Data State: Indicates the cleaned, formatted, and preprocessed data
ready for model training.
• Trained Model State: Signifies the machine learning model trained on the
preprocessed data.
• Validation State: Denotes the state where the model is validated for its accuracy and
performance.
• Prediction State: Represents the system's ability to predict lung cancer risk for a
specific individual based on input data.

Action:
Actions refer to the transformation of the system's state. In this context:

• Collect Data Action: Involves the collection of diverse data sources related to lung
cancer risk factors.
• Preprocess Data Action: Cleansing, formatting, and preparing the collected data for
model training.
• Train Model Action: Utilizes the preprocessed data to train the machine learning
model.
• Validate Model Action: Evaluates and validates the trained model's performance
using cross-validation or other techniques.
• Predict Risk Action: Involves using the trained and validated model to predict lung
cancer risk for individuals.

Model:
The model here represents the machine learning model itself, developed to predict lung
cancer risk based on various input factors.

• Machine Learning Model: Includes the algorithms, parameters, and trained weights
resulting from the model training process.
• Model Evaluation Metrics: Indicate the performance metrics (accuracy, precision,
recall, etc.) obtained during model validation.
• Deployment Model: The model in its deployable form integrated into a system or
application for real-time risk assessment.
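One way to read the states and actions above is as a small state machine in which each action is only permitted in certain states. The sketch below is an interpretation, not part of the SAM pattern itself; the initial `EMPTY` state and the rule that prediction may be repeated are assumptions added for illustration.

```python
from enum import Enum


class State(Enum):
    EMPTY = "no data yet"            # assumed starting point, not listed in the text
    DATA = "data collected"
    PREPROCESSED = "data preprocessed"
    TRAINED = "model trained"
    VALIDATED = "model validated"
    PREDICTION = "serving predictions"


# action -> (states in which the action is allowed, resulting state)
TRANSITIONS = {
    "collect_data": ({State.EMPTY}, State.DATA),
    "preprocess_data": ({State.DATA}, State.PREPROCESSED),
    "train_model": ({State.PREPROCESSED}, State.TRAINED),
    "validate_model": ({State.TRAINED}, State.VALIDATED),
    # Prediction can be requested repeatedly once the model is validated.
    "predict_risk": ({State.VALIDATED, State.PREDICTION}, State.PREDICTION),
}


def apply_action(state, action):
    """Return the next state, or raise ValueError if the action is not allowed."""
    allowed, result = TRANSITIONS[action]
    if state not in allowed:
        raise ValueError(f"{action} is not allowed in state {state.name}")
    return result


state = State.EMPTY
for action in ("collect_data", "preprocess_data", "train_model",
               "validate_model", "predict_risk", "predict_risk"):
    state = apply_action(state, action)
```

Encoding the allowed transitions in one table makes ordering errors, such as training on data that has not been preprocessed, fail loudly instead of silently.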

3.3. System Algorithm
Developing a machine learning-based predictive model for lung cancer risk assessment
involves several algorithms and techniques that contribute to different stages of the system.
The key algorithms and methodologies involved in the system's
workflow are outlined below:

1. Data Collection and Preprocessing:

• Data Collection Algorithm: Depending on the sources, different algorithms might
be employed to retrieve data from various repositories or sources.
• Data Cleaning Algorithm: Techniques such as outlier detection, handling missing
values, and normalization methods to ensure data quality and consistency.
• Feature Engineering Techniques: Algorithms like principal component analysis
(PCA), feature scaling, or selection algorithms to derive relevant features from raw
data.

2. Model Development and Training:

• Supervised Learning Algorithms: Utilizing supervised learning algorithms to
train the predictive model:
  • Logistic Regression: For binary classification predicting lung cancer risk.
  • Decision Trees: Capturing non-linear relationships between features.
  • Random Forests: Ensemble technique for improved accuracy and robustness.
  • Support Vector Machines (SVM): Separating data into classes using hyperplanes.
  • Neural Networks: Deep learning models for complex pattern recognition.
• Hyperparameter Tuning Algorithms: Grid Search, Random Search, or Bayesian
Optimization to fine-tune model hyperparameters for better performance.
• Cross-Validation Algorithms: K-fold cross-validation or stratified
cross-validation to assess model generalizability.
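As an illustration of the K-fold idea named in the list above, the sketch below generates the index splits by hand. It is a simplified version under stated assumptions: samples are taken in order, whereas in practice the data would usually be shuffled first (and stratified by class for the stratified variant). The function name is our own.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))   # 10 samples, 3 folds -> test sizes 4, 3, 3
```

A model is then trained and scored once per fold, and the scores are averaged to estimate how well it generalizes.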

3. Model Evaluation and Validation:

Evaluation Metrics: Algorithms to compute various performance metrics:

• Accuracy, Precision, Recall, F1-score: Assessing the model's overall performance.
• ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Evaluating
binary classification performance.
• Confusion Matrix Analysis: Understanding model prediction and misclassification.

Validation Techniques: Bootstrapping, Monte Carlo cross-validation, or holdout
validation for robust model evaluation.

4. Interpretability and Explainability:

• Feature Importance Algorithms: Methods like SHAP (SHapley Additive
exPlanations), LIME (Local Interpretable Model-Agnostic Explanations), or
permutation importance to understand feature contributions.
• Visualization Algorithms: Algorithms for generating visual aids like feature
importance plots, decision boundaries, or partial dependence plots for
interpretation.
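Of the methods listed, permutation importance is the simplest to show end to end: shuffle one feature column at a time and measure how much accuracy drops. The sketch below is a plain-Python illustration; the toy data, the column meanings, and the stand-in classifier are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 = pack-years of smoking, column 1 = an unrelated value.
X = [[40, 7], [35, 2], [30, 9], [25, 4], [5, 8], [2, 1], [0, 6], [1, 3]]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in classifier that only looks at column 0."""
    return 1 if row[0] > 10 else 0


importances = permutation_importance(toy_model, X, y)
```

Shuffling the column the model ignores leaves accuracy unchanged, so its importance is zero, while shuffling the column it relies on lowers accuracy.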

5. Deployment and Integration:

• Scalable Algorithms: Ensuring the chosen algorithms are scalable and efficient for
real-time prediction in a production environment.
• Integration Techniques: API integration, containerization (e.g., Docker), or
deployment on cloud platforms for streamlined delivery.

Conclusion:
The system algorithm for developing a machine learning-based predictive model for lung
cancer risk assessment involves a diverse set of algorithms encompassing data collection,
preprocessing, model development, evaluation, interpretability, and deployment. The
choice of algorithms depends on factors like data characteristics, model complexity,
interpretability requirements, and deployment environments, among others. These
algorithms collectively contribute to creating a robust and accurate predictive tool for lung
cancer risk assessment.

Decision Tree
Role in Lung Cancer Risk Assessment:
Feature Importance:
Decision trees help identify the most crucial features influencing lung cancer risk by
assessing feature importance. Attributes like smoking habits, environmental pollutants,
genetic predisposition, etc., are ranked based on their contribution to classification.

Interpretability:
Decision trees offer high interpretability, making it easier for healthcare professionals to
understand and explain the model's predictions. The decision path in the tree can be
visualized and easily comprehended.

Handling Non-Linear Relationships:
Decision trees can capture non-linear relationships between input factors and lung cancer
risk, which might be crucial as certain risk factors might not have a linear impact on the
risk.

Fig 7: Decision Tree [29]

Model Building Process:


Tree Construction:
The tree construction starts with the entire dataset and recursively splits it into subsets based
on features to create decision nodes.

The splitting occurs based on metrics like Gini impurity or information gain to maximize
the homogeneity of subsets concerning the target variable (lung cancer risk).
Pruning:
Techniques like pre-pruning (limiting tree depth, setting a minimum number of samples for
a split) or post-pruning (pruning nodes after tree construction) are used to prevent
overfitting.
Prediction:
Once the tree is built, predictions for lung cancer risk are made by traversing the tree from
the root node to leaf nodes, where the final prediction or class label resides.
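The splitting step can be illustrated with Gini impurity on a single numeric feature. This is a minimal sketch of how one decision node might be chosen, not a full tree builder; the feature name and the data values are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try a split between each pair of adjacent distinct values.

    Returns (threshold, weighted_impurity) for the split with the lowest
    size-weighted Gini impurity, or (None, impurity) if no split helps.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


pack_years = [0, 2, 5, 20, 30, 45]   # hypothetical feature values
has_cancer = [0, 0, 0, 1, 1, 1]      # hypothetical class labels
threshold, impurity = best_threshold(pack_years, has_cancer)
```

A tree builder repeats this search over every feature at every node and recurses on the two resulting subsets until a stopping rule, such as a maximum depth, is reached.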

4. Result and Discussions

4.1. Model Performance and Graphs


The latency and performance of a machine learning-based predictive model for lung cancer
risk assessment can vary based on several factors, including data size, model complexity,
chosen algorithms, hardware infrastructure, and real-time deployment requirements. These
factors are broken down below.

Latency:
Training Latency:
Training a machine learning model involves processing the collected data, feature
engineering, algorithm execution, hyperparameter tuning, and model validation. The
duration can range from minutes to several hours or even days, depending on the dataset
size, algorithm complexity, and available computational resources.
Prediction Latency:
Once the model is trained and deployed, the time taken to predict lung cancer risk for an
individual depends on:
• Model Complexity: Simple models like logistic regression might have lower
prediction times compared to complex models like deep neural networks.
• Size of Input Data: Larger input data or higher dimensionality may increase
prediction time.
• Hardware and Software Infrastructure: Utilization of powerful hardware
(GPUs/TPUs) and optimized software frameworks can reduce prediction latency.
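Prediction latency is straightforward to measure empirically. The sketch below times repeated calls using only the standard library; `stub_predict` is a placeholder standing in for a deployed model's prediction call, and the number of runs is an arbitrary choice.

```python
import statistics
import time


def measure_latency(predict, sample, n_runs=100):
    """Return (median, worst) wall-clock seconds per call of predict(sample)."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)


def stub_predict(features):
    """Placeholder for a real model; it just averages the inputs."""
    return sum(features) / len(features)


median_s, worst_s = measure_latency(stub_predict, [0.2, 0.5, 0.9])
```

Reporting the median alongside the worst case is useful because a single slow call, for example the first one after start-up, can distort a simple average.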

Performance:
Model Performance Metrics:
• Accuracy: The ability of the model to correctly predict lung cancer risk.
• Precision: Proportion of correctly predicted positive instances (lung cancer cases)
among all instances predicted as positive.
• Recall: Proportion of correctly predicted positive instances among all actual
positive instances.
• F1-score: Harmonic mean of precision and recall, balancing both metrics.
• ROC-AUC: Area under the Receiver Operating Characteristic curve, assessing the
model's ability to distinguish between classes.
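In terms of confusion-matrix counts (true positives $TP$, false positives $FP$, true negatives $TN$, false negatives $FN$), the threshold-based metrics in this list can be written as:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```

As a worked example with made-up counts $TP = 40$, $FP = 10$, $FN = 20$, $TN = 30$: accuracy is $70/100 = 0.70$, precision is $40/50 = 0.80$, recall is $40/60 \approx 0.67$, and $F_1 = 80/110 \approx 0.73$.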
Validation and Testing:
• The model's performance is evaluated using validation techniques (e.g.,
cross-validation) on separate datasets to ensure its generalizability and reliability.
Scalability and Resource Utilization:
• The model's ability to handle increased data sizes, maintain consistent performance,
and efficiently utilize available computational resources (CPU, memory, GPUs) is
crucial for scalability.

Factors Affecting Latency and Performance:


Dataset Size and Complexity:
Larger datasets with more features can increase both training and prediction latency.
Model Complexity and Algorithms:
Complex models like ensemble methods (e.g., Random Forests) or deep learning
architectures might have longer training times but potentially higher performance.
Hardware Infrastructure:
Utilization of GPUs or specialized hardware accelerators can significantly reduce
computation time for training and predictions.
Optimization Techniques:
Optimizing algorithms, feature engineering, and utilizing parallel processing or distributed
computing can enhance performance.

Background: Lung cancer is the second most common cancer in incidence and the leading cause of
cancer deaths worldwide. Meanwhile, lung cancer screening with low-dose CT can reduce mortality.
The UK National Screening Committee recommended targeted lung cancer screening on Sept 29, 2022,
and asked for more modelling work to be done to help refine the recommendation. This study aims
to develop and validate a risk prediction model, the CanPredict (lung) model, for lung cancer
screening in the UK and compare the model performance against seven other risk prediction models.

Methods: For this retrospective, population-based, cohort study, we used linked electronic health
records from two English primary care databases: QResearch (Jan 1, 2005–March 31, 2020) and
Clinical Practice Research Datalink (CPRD) Gold (Jan 1, 2004–Jan 1, 2015). The primary study
outcome was an incident diagnosis of lung cancer. We used a Cox proportional-hazards model in the
derivation cohort (12·99 million individuals aged 25–84 years from the QResearch database) to
develop the CanPredict (lung) model in men and women. We used discrimination measures (Harrell's
C statistic, D statistic, and the explained variation in time to diagnosis of lung cancer
[$R^2_D$]) and calibration plots to evaluate model performance by sex and ethnicity, using data
from QResearch (4·14 million people for internal validation) and CPRD (2·54 million for external
validation). Seven models for predicting lung cancer risk (Liverpool Lung Project [LLP]v2, LLPv3,
Lung Cancer Risk Assessment Tool [LCRAT], Prostate, Lung, Colorectal, and Ovarian [PLCO]M2012,
PLCOM2014, Pittsburgh, and Bach) were selected to compare their model performance with the
CanPredict (lung) model using two approaches: (1) in ever-smokers aged 55–74 years (the
population recommended for lung cancer screening in the UK), and (2) in the populations for each
model determined by that model's eligibility criteria.

Findings: There were 73380 incident lung cancer cases in the QResearch derivation cohort, 22838
cases in the QResearch internal validation cohort, and 16145 cases in the CPRD external
validation cohort during follow-up. The predictors in the final model included sociodemographic
characteristics (age, sex, ethnicity, Townsend score), lifestyle factors (BMI, smoking and
alcohol status), comorbidities, family history of lung cancer, and personal history of other
cancers. Some predictors were different between the models for women and men, but model
performance was similar between sexes. The CanPredict (lung) model showed excellent
discrimination and calibration in both internal and external validation of the full model, by sex
and ethnicity. The model explained 65% of the variation in time to diagnosis of lung cancer in
both sexes in the QResearch validation cohort and 59% of the $R^2_D$ in both sexes in the CPRD
validation cohort. Harrell's C statistics were 0·90 in the QResearch (validation) cohort and 0·87
in the CPRD cohort, and the D statistics were 2·8 in the QResearch (validation) cohort and 2·4 in
the CPRD cohort. Compared with seven other lung cancer prediction models, the CanPredict (lung)
model had the best performance in discrimination, calibration, and net benefit across three
prediction horizons (5, 6, and 10 years) in the two approaches. The CanPredict (lung) model also
had higher sensitivity than the current UK recommended models (LLPv2 and PLCOM2012), as it
identified more lung cancer cases than those models by screening the same number of individuals
at high risk.

Fig 8: ROC curves for risk prediction models in the MOLTEST BIS cohort.
ROC, receiver operating characteristic curve; LLP, Liverpool Lung Project;
AUC, area under the receiver operating characteristic curve. [30]

Interpretation: The CanPredict (lung) model was developed, and internally and externally
validated, using data from 19·67 million people from two English primary care databases. Our
model has potential utility for risk stratification of the UK primary care population and
selection of individuals at high risk of lung cancer for targeted screening. If our model is
recommended to be implemented in primary care, each individual's risk can be calculated using
information in the primary care electronic health records, and people at high risk can be
identified for the lung cancer screening programme.

Graphs

Fig 9: Graphs

5. Conclusion and Recommendations

5.1. Summary and Concluding Remark


The project revolves around the creation of a machine learning-based predictive model
tailored for lung cancer risk assessment. The primary aim is to develop a robust tool capable
of evaluating an individual's probability of developing lung cancer by considering a wide
array of influential factors such as smoking habits, exposure to environmental pollutants,
genetic predisposition, occupational hazards, and other pertinent parameters. The
comprehensive model development process involves multiple stages: initial data collection
from diverse sources, meticulous data preprocessing, including cleaning and feature
engineering, model development using various algorithms like decision trees, logistic
regression, random forests, and neural networks, followed by model validation through
performance evaluation metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
Additionally, the project emphasizes interpretability and explainability, incorporating
methods for feature importance and model visualization to ensure healthcare professionals
can comprehend and trust the model's predictions. Latency in training and prediction, along
with performance metrics, is crucial for assessing the model's efficiency. Balancing model
complexity, hardware infrastructure, optimization techniques, and continuous updates
based on evolving data and technology are pivotal for maintaining accuracy, scalability,
and relevance over time. Ultimately, the envisioned outcome is a reliable predictive tool
facilitating early intervention and personalized preventive measures in the realm of lung
cancer assessment within healthcare.

The primary objective of this project is to develop a robust machine learning-based
predictive model specifically designed for assessing an individual's risk of developing lung
cancer. The model aims to leverage an extensive range of critical input factors, including
smoking habits, exposure to environmental pollutants, genetic predisposition, occupational
hazards, and other relevant parameters. This comprehensive model development process
entails several stages, starting from the collection of diverse data sources to meticulous
preprocessing, feature engineering, and the utilization of various algorithms such as
decision trees, logistic regression, random forests, and neural networks. Furthermore,
model validation using performance evaluation metrics like accuracy, precision, recall, F1-
score, and ROC-AUC is integral to ensuring the model's reliability and effectiveness.

Moreover, emphasis has been placed on interpretability and explainability, incorporating
methods for understanding feature importance and visualizing the model's decision-making
process. This strategic approach aims to enhance the model's transparency and facilitate the
comprehension of predictions by healthcare professionals. Latency in both training and
prediction phases, coupled with the model's performance metrics, stands as critical
evaluation criteria for assessing the model's efficacy.

In conclusion, this project represents a multifaceted endeavor to develop an accurate,
interpretable, and reliable predictive tool for lung cancer risk assessment. Striking a balance
between model complexity, hardware optimization, and continuous updates based on
evolving data trends and technological advancements will be pivotal for maintaining the
model's accuracy, scalability, and practical relevance within healthcare settings. Ultimately,
the envisioned outcome is a cutting-edge predictive model that not only identifies
individuals at risk of developing lung cancer but also aids in enabling early interventions
and personalized preventive measures in the realm of healthcare.

Lung cancer is the major cause of cancer-related death in this generation, and it is expected
to remain so for the foreseeable future. It is feasible to treat lung cancer if the symptoms of
the disease are detected early. Using current developments in computational intelligence, it is
possible to construct a sustainable prototype model for the treatment of lung cancer
without negatively impacting the environment. Because such a model reduces wasted
resources and the amount of manual work required,
it saves both time and money. To optimise the process of detection from the lung cancer
dataset, a machine learning model based on support vector machines (SVMs) was used.
The SVM classifier classifies lung cancer patients based on their symptoms, and
the model is implemented in the Python programming language.
The effectiveness of our SVM model was evaluated against several
different criteria. Several cancer datasets from the University of California, Irvine (UCI)
repository were used to evaluate the model. As a result of the favourable findings of this
research, smart cities will be able to deliver better healthcare to their citizens. Patients with
lung cancer can obtain real-time treatment in a cost-effective manner with the least amount
of effort and latency from any location and at any time. The proposed model was compared
with the existing SVM and SMOTE methods. In this comparison, the proposed method achieves
an accuracy of 98.8%.

The data comes from the UCI Machine Learning Repository. The dataset contains 32 examples,
each with 57 features, and all predictive attributes are nominal with values in the range 0-3.
Nominal attribute and class label data are translated into binary form, which makes data
analysis easier to perform; converting nominal data to binary form is a widely used,
standard preprocessing step. The label has three levels of severity: high, medium, and low.
The dataset has a significant number of missing values, which affects the performance of
the algorithm, so the data is prepared by replacing each missing value with the value that
occurs most frequently in its column. The processed data is then analysed using a Python
tool, and classifiers are applied once the data is in a form suitable for classification. The
classifier is evaluated with ten-fold cross-validation: the data is split into ten parts, and the
model is trained and tested ten times, each time holding out a different part, which gives a
more reliable accuracy estimate than a single train/test split. Classification accuracy is
defined as the number of correct predictions divided by the total number of predictions. It
is computed from four counts whose values depend on the outcome of the experiment: true
positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
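
The preprocessing and evaluation steps described above can be sketched as follows. This is an illustrative sketch only: the data is randomly generated (90 samples with six nominal attributes in the range 0-3, larger than the 32-example UCI dataset so that every fold contains all three classes), and the pipeline layout, the RBF kernel, and the helper `accuracy` are assumptions rather than the exact setup used in the study.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 severity classes, nominal attributes coded 0..3.
y = np.repeat([0, 1, 2], 30)
X = rng.integers(0, 4, size=(90, 6)).astype(float)
X[:, 0] = y                              # one informative attribute
X[rng.random(X.shape) < 0.05] = np.nan   # remove about 5% of the values

# Missing values are replaced by the most frequent value in each column,
# then an SVM is trained on the completed data.
pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         SVC(kernel="rbf"))

# Ten-fold cross-validation: ten train/test rounds, one held-out fold each.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("mean accuracy over 10 folds:", scores.mean())


def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 85 correct out of 100 -> 0.85
```

Placing the imputer inside the pipeline means the most frequent value is learned from the training folds only, which avoids leaking information from the held-out fold.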

The method proposed is the most efficient method. This is because of the computations that
exist in this system. That is, after the given data is included, many of the data in the fifth
text are compared with its various formats and analyzed. These analysis methods compute
its structure and dimensions when comparing the given data with the many data present in
the other datasets attached to it. The various data available in such calculations will define
its boundaries. The changes in its boundaries when small cooks are attached to each other
help to calculate it more accurately when analyzing its various shape models. Thus, its
accuracy is high.

As the evaluation findings show, SVM with SMOTE resampling (Figures 3–8) applied over
two iterations on the lung cancer dataset produced the best performance. Compared with
earlier methods, this method achieves the highest value for all of the parameters that were
investigated. The lung cancer dataset contains two minority classes, so after two rounds of
SMOTE the classes are equally distributed. A third round of SMOTE generates synthetic
samples for class B, which had been the majority class in the previous steps, but this does
not improve classification performance. SVM with SMOTE therefore works best on this
dataset when SMOTE is applied twice.
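
To make the two-round resampling concrete, the sketch below implements a simplified SMOTE-style oversampler in NumPy and applies it twice before fitting an SVM. It is not the implementation used in the study: the class sizes (10, 30, 12), the synthetic features, and the helper names `smote_like` and `oversample_smallest` are assumptions made for illustration. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally be preferred.

```python
from collections import Counter

import numpy as np
from sklearn.svm import SVC


def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def oversample_smallest(X, y, seed=0):
    """One round: grow the smallest class to the size of the largest."""
    counts = Counter(y.tolist())
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]
    X_new = smote_like(X[y == minority], n_new, seed=seed)
    y_new = np.full(n_new, minority)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Hypothetical imbalanced data: B is the majority, A and C are minorities.
rng = np.random.default_rng(42)
y = np.repeat(["A", "B", "C"], [10, 30, 12])
X = rng.normal(size=(len(y), 4))

X_bal, y_bal = X, y
for round_id in range(2):                    # two rounds, one per minority
    X_bal, y_bal = oversample_smallest(X_bal, y_bal, seed=round_id)
    print(f"after round {round_id + 1}:", dict(Counter(y_bal.tolist())))

clf = SVC(kernel="rbf").fit(X_bal, y_bal)    # SVM trained on balanced data
```

After the second round all three classes have the same size, so a third round has no minority class left to enlarge.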

5.2. Practical Uses and Implications

The development of a machine learning-based predictive model for lung cancer risk
assessment holds several practical uses and significant implications within healthcare and
beyond:

Practical Uses:
Early Intervention and Preventive Measures:
Identification of individuals at a higher risk of developing lung cancer enables healthcare
professionals to implement targeted preventive measures and interventions. This could
include personalized counseling, regular screenings, lifestyle modifications, and cessation
programs for smoking or reducing exposure to environmental pollutants.

Improved Patient Care and Management:
Healthcare providers can tailor patient care plans based on individualized risk assessments,
optimizing resource allocation and prioritizing care for high-risk individuals. This leads to
more efficient and effective healthcare delivery.
Healthcare Resource Allocation:
Targeted risk assessment assists in efficient allocation of healthcare resources by focusing
on high-risk groups or individuals, optimizing screening programs, and allocating
interventions where they are most needed.
Public Health Policies and Awareness Campaigns:
Insights from the predictive model could inform public health policies aimed at reducing
lung cancer risk factors on a larger scale. It can also support the development of public
awareness campaigns for smoking cessation, environmental regulations, and occupational
safety measures.

Implications:
Early Detection and Improved Outcomes:
Early identification of individuals at risk may lead to early detection of lung cancer,
potentially improving treatment outcomes by enabling timely intervention and
management.
Ethical Considerations:
Handling sensitive health-related data and making predictions about an individual's health
condition raises ethical concerns regarding patient privacy, data security, informed consent,
and fair use of predictive analytics in healthcare.
Health Equity and Accessibility:
Ensuring equitable access to risk assessment tools and interventions is crucial to prevent
exacerbating health disparities among different socioeconomic groups or regions.
Continuous Improvement and Validation:
Ongoing validation, refinement, and improvement of the model are critical to maintaining
accuracy, especially considering the evolving nature of medical data and healthcare
practices.

Overall Impact:
The successful implementation of a predictive model for lung cancer risk assessment has
the potential to significantly impact public health strategies, patient care, resource
allocation, and individual health outcomes. By enabling early identification of at-risk
individuals and facilitating targeted interventions, such a model can contribute to reducing
the burden of lung cancer and improving overall healthcare effectiveness and efficiency.
However, careful consideration of ethical, legal, and social implications is essential to
ensure responsible and equitable use of predictive analytics in healthcare.

Roadblocks
Developing a machine learning-based predictive model for lung cancer risk assessment
involves several challenges and roadblocks that can hinder the project's progress. Some of
the key roadblocks include:

Data Quality and Availability:

• Data Accessibility: Accessing comprehensive and diverse datasets encompassing
  various risk factors like smoking habits, environmental exposure, genetic
  predisposition, and occupational hazards can be challenging due to data silos or
  limited availability.
• Data Quality Issues: Incomplete, inconsistent, or biased data can impact the model's
  accuracy and reliability. Handling missing values and outliers and ensuring data
  consistency pose significant challenges.

Model Development and Performance:

• Model Complexity and Overfitting: Complex models might lead to overfitting,
  reducing the model's generalizability. Balancing model complexity with
  interpretability and performance is challenging.
• Algorithm Selection and Tuning: Choosing the right algorithms and
  hyperparameters, along with optimizing model performance without compromising
  accuracy, is a complex task.

Interpretability and Explainability:

• Interpretability of Model: Ensuring the model's predictions are interpretable and
  explainable to healthcare professionals is crucial for its acceptance and trust.
  Black-box models might lack interpretability.

Ethical and Regulatory Challenges:

• Privacy and Confidentiality: Dealing with sensitive patient health data requires
  strict adherence to privacy regulations (such as HIPAA in the United States) and
  ensuring patient confidentiality.
• Ethical Considerations: Making predictions about an individual's health condition
  raises ethical concerns regarding consent, fairness, bias, and the responsible use of
  predictive analytics in healthcare.

Deployment and Integration:

• Scalability and Deployment: Deploying the model in real-world healthcare settings
  while ensuring scalability, efficiency, and compatibility with existing systems can
  be a complex task.
• Continual Validation and Improvement: Continuous validation and improvement of
  the model to adapt to evolving data and healthcare practices require ongoing
  resources and efforts.

Feasibility Analysis
A feasibility analysis for a machine learning-based predictive model for lung cancer risk
assessment involves evaluating various aspects to determine the project's viability, including
technical, economic, operational, and scheduling feasibility.

Technical Feasibility:

• Data Availability and Quality: Assess the availability of diverse data sources containing
  relevant factors like smoking habits, environmental exposure, genetic predisposition,
  etc. Evaluate data quality, considering completeness, consistency, and potential biases.
• Technology and Tools: Determine the feasibility of employing suitable technologies,
  algorithms, and tools for data preprocessing, model development, validation, and
  deployment. Consider hardware and software requirements for computational
  resources.
• Model Complexity and Interpretability: Assess the feasibility of developing a model
  that balances complexity with interpretability, ensuring healthcare professionals can
  comprehend and trust the model's predictions.

Economic Feasibility:

• Cost Estimation: Evaluate the costs associated with data acquisition, data
  preprocessing, model development, validation, infrastructure, deployment,
  maintenance, and personnel (data scientists, healthcare experts, IT professionals).
• Return on Investment (ROI): Estimate potential benefits in terms of improved
  healthcare outcomes, reduced healthcare costs through early intervention, and resource
  optimization against the incurred costs.

Operational Feasibility:

• Resource Availability: Assess the availability of skilled personnel, domain experts
  (healthcare professionals), and IT infrastructure needed for model development,
  implementation, and ongoing maintenance.
• Integration with Healthcare Systems: Determine the feasibility of integrating the
  predictive model into existing healthcare systems or workflows while ensuring
  compatibility and acceptance by healthcare professionals.

Scheduling Feasibility:

• Timeline and Milestones: Evaluate the feasibility of meeting project deadlines,
  considering the complexities involved in data collection, preprocessing, model
  development, validation, and deployment.
• Risk Assessment and Mitigation: Identify potential risks (e.g., data quality issues,
  model performance limitations, regulatory hurdles) and develop mitigation strategies
  to address them.

5.3. Future Work and Enhancement


Future work and enhancements for the machine learning-based predictive model for lung
cancer risk assessment encompass a range of possibilities aimed at advancing its accuracy,
interpretability, scalability, and applicability within healthcare settings. Some prospective
areas for further development include:

Enhanced Data Integration and Quality Improvement:

• Incorporation of Additional Data Sources: Integrating more comprehensive and diverse
  datasets, including longitudinal data, genetic markers, and environmental exposure records,
  to enhance the model's predictive capabilities.
• Advanced Data Preprocessing Techniques: Implementing more sophisticated methods for
  handling missing data, outlier detection, and feature engineering to improve data quality
  and ensure the model's robustness.

Model Development and Interpretability:

• Ensemble Methods and Advanced Algorithms: Exploring ensemble learning techniques or
  advanced algorithms to enhance predictive accuracy while maintaining model
  interpretability for healthcare professionals.
• Explainable AI (XAI) Techniques: Implementing state-of-the-art explainable AI methods
  to improve the model's interpretability, providing clear insights into the factors influencing
  the risk assessment predictions.

Validation and Continuous Improvement:

• Longitudinal Studies and Real-Time Validation: Conducting longitudinal studies to validate
  the model's performance over time and considering real-time validation methods to adapt
  the model to evolving data trends.
• Feedback Mechanisms and Iterative Updates: Implementing feedback loops from
  healthcare practitioners to continuously refine and update the model, ensuring it stays
  relevant and aligned with clinical practices.

Ethical Considerations and Regulatory Compliance:

• Ethical Framework and Fairness Assessments: Developing an ethical framework for the
  model's usage, including fairness assessments to mitigate biases and ensure equitable
  predictions across diverse populations.
• Regulatory Compliance and Data Privacy: Continuously aligning with evolving healthcare
  regulations, ensuring compliance with data privacy laws, and adopting secure data handling
  practices.

Integration and Deployment:

• Scalability and Real-World Deployment: Enhancing the model's scalability for large-scale
  deployment in diverse healthcare settings, ensuring seamless integration with existing
  healthcare systems, and optimizing deployment for real-time risk assessment.
• Collaborative Partnerships and Knowledge Sharing: Establishing collaborative
  partnerships with healthcare institutions for broader data access, domain expertise, and
  knowledge sharing to drive continuous improvements.

The future work and enhancements for the machine learning-based predictive model for
lung cancer risk assessment aim to improve its accuracy, interpretability, regulatory
compliance, and practicality within healthcare. Continual advancements in data integration,
model development, ethical considerations, and deployment strategies will be instrumental
in fostering a reliable and effective predictive tool that aids in early intervention and
personalized preventive measures, ultimately improving outcomes in lung cancer
assessment and healthcare delivery.

In our research, we leveraged 45,856 de-identified chest CT screening cases (some in which
cancer was found) from NIH’s research dataset from the National Lung Screening Trial
study and Northwestern University. We validated the results with a second dataset and also
compared our results against 6 U.S. board-certified radiologists.

When using a single CT scan for diagnosis, our model performed on par or better than the
six radiologists. We detected five percent more cancer cases while reducing false-positive
exams by more than 11 percent compared to unassisted radiologists in our study. Our
approach achieved an AUC of 94.4 percent (AUC is a common metric used in machine
learning and provides an aggregate measure for classification performance).
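
As a brief illustration of what AUC measures, the sketch below computes it for four hypothetical cases, once with scikit-learn's `roc_auc_score` and once by hand as the fraction of (positive, negative) pairs in which the positive case receives the higher score. The labels and scores are made-up values, not results from the study described above.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = cancer found) and model scores for four scans.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Equivalent pairwise view: how often does a positive outrank a negative?
positives = [s for s, t in zip(y_score, y_true) if t == 1]
negatives = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = list(product(positives, negatives))
pairwise_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, pairwise_auc)  # both are 0.75: 3 of the 4 pairs are ordered correctly
```

An AUC of 1.0 would mean every positive case is scored above every negative case, while 0.5 corresponds to random ordering.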

Despite the value of lung cancer screenings, only 2-4 percent of eligible patients in the U.S.
are screened today. This work demonstrates the potential for AI to increase both accuracy
and consistency, which could help accelerate adoption of lung cancer screening worldwide.
These initial results are encouraging, but further studies will assess the impact and utility
in clinical practice. We’re collaborating with the Google Cloud Healthcare and Life Sciences
team to serve this model through the Cloud Healthcare API and are in early conversations
with partners around the world to continue additional clinical validation research and
deployment.

6. References

[1] M.I. Faisal, S. Bashir, Z.S. Khan, F.H. Khan, “An evaluation of machine learning
classifiers and ensembles for early-stage prediction of lung cancer” December 2018 3rd
International Conference on Emerging Trends in Engineering, Sciences and Technology
(ICEEST), IEEE (2018), pp. 1-4
[2] J. Cabrera, A. Dionisio and G. Solano, "Lung cancer classification tool using microarray
data and support vector machines", Information Intelligence Systems and Applications
(IISA), 2015, July, 2015.
[3] Z. Yu, X. Z. Chen, L. H. Cui, H. Z. Si, H. J. Lu and S. H. Liu, "Prediction of lung cancer
based on serum biomarkers by gene expression programming methods", Asian Pacific Journal
of Cancer Prevention, vol. 15, no. 21, pp. 9367-9373, 2014.
[4] H. Shin, S. Oh, S. Hong, M. Kang, D. Kang, Y.G. Ji, Y. Choi “Early-stage lung cancer
diagnosis by deep learning-based spectroscopic analysis of circulating exosomes” ACS Nano,
14 (5) (2020), pp. 5435-5444
[5] S.H. Hyun, M.S. Ahn, Y.W. Koh, S.J. Lee “A machine-learning approach using PET-based
radiomics to predict the histological subtypes of lung cancer” Clin. Nucl. Med., 44 (12) (2019),
pp. 956-960
[6] W. Rahane, H. Dalvi, Y. Magar, A. Kalane, S. Jondhale “Lung cancer detection using image
processing and machine learning healthcare” 2018, March International Conference on
Current Trends towards Converging Technologies (ICCTCT), IEEE (2018), pp. 1-5
[7] B. A. Miah and M. A. Yousuf, "Detection of Lung cancer from CT image using Image
Processing and Neural network", 2nd International Conference on Electrical Engineering and
Information and Communication Technology (ICEEICT), May 2015.
[8] B.V. Ginneken, B. M. Romeny and M. A. Viergever, "Computer-aided diagnosis in chest
radiography: a survey", IEEE transactions on medical imaging, vol. 20, no. 12, 2001.
[9] H. Becker, W. Nettleton, P. Meyers, J. Sweeney and C. Nice, Jr., "Digital computer
determination of a medical diagnostic index directly from chest X-ray images", IEEE Trans.
Biomed. Eng., vol. BME-11, pp. 67-72, 1964.
[10] L. S. Kovasznay and H. M. Joseph, "Image processing", Proc. IRE, vol. 43, pp. 560-570,
May 1955.
[11] P. C. Goldmark and J. M. Hollywood, "A new technique for improving the sharpness of
pictures", Proc. I.R.E., vol. 39, pp. 1314, October 1951.
[12] Bedford and Fredendall, "Analysis synthesis and evaluation of the transient response of
television apparatus", Proc. I.R.E., vol. 30, pp. 453-455, October 1942.
[13] J. Duncan and N. Ayache, "Medical image analysis: Progress over two decades and the
challenges ahead", IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 85-106, Jan. 2000.

[14] R. Shahidi, R. Tombropoulos and R.P.A. Grzeszczuk, "Clinical Applications of Three-
Dimensional Rendering of Medical Data Sets", Proc. IEEE, vol. 86, no. 3, pp. 555-568, Mar.
1998.
[15] M. Yachida, M. Ykeda and S. Tsuji, "A Plan-Guided Analysis of Cineangiograms for
Measurement of Dynamic Behavior of the Heart Wall", IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 2, pp. 537-543, 1980.
[16] Talukdar J., Sarma P. A survey on lung cancer detection in CT scans images using image
processing techniques. International Journal of Current Trends in Science and Technology.
2018;8(3):20181–20186.
[17] Yu K. H., Zhang C., Berry G. J., et al. Predicting non-small cell lung cancer prognosis by
fully automated microscopic pathology image features. Nature Communications. 2016;7(1):p.
12474. doi: 10.1038/ncomms12474.
[18] Cirujeda P., Cid Y. D., Muller H., et al. A 3-D Riesz-covariance texture model for
prediction of nodule recurrence in lung CT. IEEE Transactions on Medical Imaging.
2016;35(12):2620–2630. doi: 10.1109/TMI.2016.2591921.
[19] Sangamithraa P. B., Govindaraju S. Lung tumour detection and classification using EK-
mean clustering. Proceedings of the 2016 IEEE International Conference on Wireless
Communications, Signal Processing and Networking, WiSPNET; 2016; Chennai, India.
[20] Kurkure M., Thakare A. Lung cancer detection using genetic approach. Proceedings - 2nd
International Conference on Computing, Communication, Control and Automation,
ICCUBEA; 2017; Pune, India.
[21] Kureshi N., Abidi S. S. R., Blouin C. A predictive model for personalized therapeutic
interventions in non-small cell lung cancer. IEEE Journal of Biomedical and Health Informatics.
2016;20(1):424–431. doi: 10.1109/JBHI.2014.2377517.
[22] Kumar A., Gautam B., Dubey C., Tripathi P. K. A review: role of doxorubicin in treatment
of cancer. International Journal of Pharmaceutical Sciences and Research. 2014;5(10):4117–
4128.
[23] Kulkarni A., Panditrao A. Classification of lung cancer stages on CT scan images using
image processing. IEEE International Conference on Advanced Communication, Control and
Computing Technologies, ICACCCT; 2014; Ramanathapuram, India. 2014. pp. 1384–1388.
[24] Westaway D. D., Toon C. W., Farzin M., et al. The International Association for the Study
of Lung Cancer/American Thoracic Society/European Respiratory Society grading system has
limited prognostic significance in advanced resected pulmonary adenocarcinoma. Pathology.
2013;45(6):553–558. doi: 10.1097/PAT.0b013e32836532ae.
[25] Automatic detection of lung cancer from biomedical data set using discrete AdaBoost
optimized ensemble learning generalized neural networks.
[26] Bitzel Cortez, An architecture for emergency event prediction using LSTM recurrent
neural networks.

[27] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In IEEE International
Conference on Intelligent Robots and Systems, pages 2219–2225, 2006. ISBN
142440259X. doi: 10.1109/IROS.2006.282564.
[28] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization.
arXiv:1406.2572 [cs, math, stat], June 2014. URL http://arxiv.org/abs/1406.2572.
[29] https://www.kaggle.com/code/subhajeetdas/lung-cancer-prediction/notebook
[30] M. Firmino, A. H. Morais, R. M. Mendoa, M. R. Dantas, H. R. Hekis, and R. Valentim.
Computer-aided detection system for lung cancer in computed tomography scans: Review and
future prospects. BioMedical Engineering OnLine, 13:41, Apr. 2014. ISSN 1475-925X. doi:
10.1186/1475-925X-13-41.

7. Appendix

7.1. Technical Details and Additional Graphs/Charts

Import Libraries
!pip install dtreeviz

import os
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (precision_score, recall_score, accuracy_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay,
                             classification_report)

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

/kaggle/input/cancer-patients-and-air-pollution-a-new-link/cancer patient data sets.csv

Load Data
df = pd.read_csv("/kaggle/input/cancer-patients-and-air-pollution-a-new-link/cancer patient data sets.csv")
df

Data Cleaning & Visualization


df.isnull().sum()

Fig 10: Input Data

sns.heatmap(df.isnull(), cmap = 'viridis')

Fig 11: Axes Input Plot

df.drop(columns=['index', 'Patient Id'], axis=1, inplace=True)


df
df.size

df.dtypes

df.iloc[:, 1:24].plot(title="Dataset Details")
# 'Level' is still a text column at this point, so correlate numeric columns only
df_corr = df.corr(numeric_only=True)
df_corr

Fig 12: Dataset Details

plt.title("Correlation Matrix")
sns.heatmap(df_corr, cmap='viridis')
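sea = sns.FacetGrid(df, col = "Level", height = 4)
# histplot replaces the deprecated distplot; density scale with a KDE overlay
sea.map(sns.histplot, "Age", kde = True, stat = "density")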

Fig 13: Correlation Matrix

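sea = sns.FacetGrid(df, col = "Level", height = 4)
# histplot replaces the deprecated distplot; density scale with a KDE overlay
sea.map(sns.histplot, "Gender", kde = True, stat = "density")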
x = df.iloc[:, 0:23]
x

# Encode the severity label as integers: Low = 0, Medium = 1, High = 2
df['Level'] = df['Level'].map({'Low': 0, 'Medium': 1, 'High': 2})

df['Level'].value_counts()

plt.figure(figsize = (20, 27))

# One histogram per column: the 23 features plus the encoded 'Level' label
for i in range(24):
    plt.subplot(8, 3, i+1)
    sns.histplot(df.iloc[:, i], color = 'red', kde = True, stat = 'density')
    plt.grid()

plt.figure(figsize = (11, 9))


plt.title("Lung Cancer Chances Due to Air Pollution")
# value_counts() orders levels by frequency, so these labels assume High > Medium > Low
plt.pie(df['Level'].value_counts(), explode = (0.1, 0.02, 0.02),
        labels = ['High', 'Medium', 'Low'], autopct = "%1.2f%%", shadow = True)
plt.legend(title = "Lung Cancer Chances", loc = "lower left")

Fig 14: Lung Cancer due to Air Pollution

sns.displot(df['Level'], kde=True)

Fig 15: Level Vs Count

y = df.Level.values
y

Train & Test Splitting the Data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

Function to Measure Performance

def perform(y_pred):
    # Micro-averaged scores pool all classes before computing each metric
    print("Precision : ", precision_score(y_test, y_pred, average = 'micro'))
    print("Recall : ", recall_score(y_test, y_pred, average = 'micro'))
    print("Accuracy : ", accuracy_score(y_test, y_pred))
    print("F1 Score : ", f1_score(y_test, y_pred, average = 'micro'))
    cm = confusion_matrix(y_test, y_pred)
    print("\n", cm)
    print("\n")
    print("**"*27 + "\n" + " "* 16 + "Classification Report\n" + "**"*27)
    print(classification_report(y_test, y_pred))
    print("**"*27+"\n")

    # Plot the confusion matrix with class names (0 = Low, 1 = Medium, 2 = High)
    cm = ConfusionMatrixDisplay(confusion_matrix = cm,
                                display_labels=['Low', 'Medium', 'High'])
    cm.plot()
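
A note on the averaging choice in the function above: for single-label multiclass problems, micro-averaged precision, recall, and F1 all coincide with accuracy, so the four printed numbers carry the same information. The sketch below, using made-up labels rather than this project's predictions, shows the effect and contrasts it with macro averaging, which weights each class equally.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Hypothetical true and predicted levels (0 = Low, 1 = Medium, 2 = High).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

acc = accuracy_score(y_true, y_pred)

# Micro averaging pools every decision, so all three match accuracy.
micro = {
    "precision": precision_score(y_true, y_pred, average="micro"),
    "recall": recall_score(y_true, y_pred, average="micro"),
    "f1": f1_score(y_true, y_pred, average="micro"),
}

# Macro averaging computes the metric per class, then takes the plain mean.
macro = {
    "precision": precision_score(y_true, y_pred, average="macro"),
    "recall": recall_score(y_true, y_pred, average="macro"),
    "f1": f1_score(y_true, y_pred, average="macro"),
}

print("accuracy:", acc)
print("micro:", micro)
print("macro:", macro)
```

When the classes are imbalanced, reporting macro-averaged or per-class scores alongside accuracy gives a clearer picture of how each risk level is handled.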

Random Forest
model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)
y_pred_rf = model_rf.predict(x_test)
perform(y_pred_rf)

Fig 16: Label Graph

import dtreeviz

viz_model = dtreeviz.model(model_dt,
                           X_train=x_train, y_train=y_train,
                           feature_names=feature_names,
                           target_name='Lung Cancer',
                           class_names=['Low', 'Medium', 'High'])

v = viz_model.view()       # render as SVG into internal object
v.save("Lung Cancer.svg")  # save as svg

viz_model.view()
