A Hybrid Technique to Predict the Aggravations in Asthma

Dissertation IV Report
Submitted in partial fulfilment of the requirements for the award of degree of

INTEGRATED (BTech + MTech)


COMPUTER SCIENCE AND ENGINEERING

Submitted to

LOVELY PROFESSIONAL UNIVERSITY


PHAGWARA, PUNJAB

SUBMITTED BY

Name of Student: Rajat Rana
Registration Number of Student: 11901295
Signature of Student:

Name of Supervisor: Tarun
UID of Supervisor: 24044
Signature of Supervisor:


TOPIC APPROVAL PERFORMA

School of Computer Science and Engineering (SCSE)

Program : P192-ND::Integrated B.Tech. - M.Tech. (Computer Science & Engineering)

COURSE CODE : CSE588 REGULAR/BACKLOG : Regular GROUP NUMBER : CSERGD0579

Supervisor Name : Tarun UID : 24044 Designation : Assistant Professor

Qualification : Research Experience :

SR.NO. NAME OF STUDENT Prov. Regd. No. BATCH SECTION CONTACT NUMBER

1 Rajat Rana 11901295 2019 K19MT 9805518019

SPECIALIZATION AREA : Program Methodology and Design Supervisor Signature:

PROPOSED TOPIC : A hybrid technique to predict asthmatic patients during aggravations in asthma

Qualitative Assessment of Proposed Topic by PAC


| Sr.No. | Parameter | Rating (out of 10) |
|---|---|---|
| 1 | Project Novelty: Potential of the project to create new knowledge | 6.13 |
| 2 | Project Feasibility: Project can be timely carried out in-house with low-cost and available resources in the University by the students. | 6.13 |
| 3 | Project Academic Inputs: Project topic is relevant and makes extensive use of academic inputs in UG program and serves as a culminating effort for core study area of the degree program. | 6.25 |
| 4 | Project Supervision: Project supervisor is technically competent to guide students, resolve any issues, and impart necessary skills. | 7.00 |
| 5 | Social Applicability: Project work intends to solve a practical problem. | 6.13 |
| 6 | Future Scope: Project has potential to become basis of future research work, publication or patent. | 6.13 |

PAC Committee Members

PAC Member (HOD/Chairperson) Name: Janpreet Singh UID: 11266 Recommended (Y/N): Yes
PAC Member (Allied) Name: Dr.Gurpreet Singh UID: 17671 Recommended (Y/N): Yes

PAC Member 3 Name: Pradeep Kumar UID: 16473 Recommended (Y/N): Yes

Final Topic Approved by PAC: A hybrid technique to predict asthmatic patients during aggravations in asthma

Overall Remarks: Approved

PAC CHAIRPERSON Name: 13714::Dr. Prateek Agrawal Approval Date: 06 May 2023

Student Declaration

To whomsoever it may concern

I, Rajat Rana, 11901295, do hereby declare that the work done by me on “Dissertation IV” under the
supervision of Tarun, Assistant Professor, Lovely Professional University, Phagwara, Punjab, is a
record of original work for the partial fulfilment of the requirements for the award of the Integrated
CSE(Btech+Mtech).

Name of the Student (Registration Number)


Rajat Rana (11901295)

Signature of the Student

Dated:03/05/2024

Declaration by the Supervisor

To whomsoever it may concern

This is to certify that Rajat Rana, 11901295 of Lovely Professional University, Phagwara, Punjab, has
worked on “Dissertation IV” under my supervision from 01/02/2024 to 03/05/2024. It is further stated
that the work carried out by the student is a record of original work to the best of my knowledge for
the partial fulfilment of the requirements for the award of the Integrated CSE(Btech+Mtech).

Name of Supervisor

Tarun

UID of Supervisor
24044

Signature of Supervisor

ACKNOWLEDGEMENT

A project serves as a conduit between theoretical knowledge and practical application, and with this
mindset, I dedicated myself to the project, ensuring its success with the timely support and efforts of
my mentor. I express my gratitude to my teacher and mentor, Tarun, who provided unwavering
support, clarified my doubts, and to my parents, who played a significant role in the finalization of my
project file. I take this moment to acknowledge their invaluable support, and I hope for their continued
encouragement in the future. Throughout the preparation of this project file, the diverse information I
discovered greatly contributed to the project's completion. I am pleased to have successfully finished
the project and gained a deeper understanding of many concepts. The meticulous preparation of this
project was an immense learning experience, fostering the development of personal qualities such as
responsibility, punctuality, confidence, and more.

In conclusion, I extend my thanks to my classmates and friends for their encouragement and assistance
in designing and enhancing the creativity of my project. It was through their support that I was able to
craft an enjoyable and successful project experience.

Abstract
This paper introduces a hybrid approach for predicting asthma exacerbations with machine learning
models. It underscores the significance of personalized medicine and the role of Big Data analytics in
healthcare. The objective is to present a model capable of effectively identifying key features
associated with asthma onset, using a specific dataset as an illustration. The model integrates multiple
machine learning algorithms, namely K-Nearest Neighbors (KNN), XGBoost, Decision Tree, and
Support Vector Classifier (SVC), aiming to enhance predictive accuracy and robustness. Additionally,
the study reviews prior research on machine learning methodologies for asthma prediction,
emphasizing the need for improved generalization capability and practicality. The proposed hybrid
technique is intended to contribute to the progression of predictive healthcare analytics. Notably, the
hybrid model achieves accuracy rates of 96% and 98% with stacking and voting, respectively.

List of Tables

Content Page No.

Survey of Literature review on the various research 15-17


Description of meta data analysis 18-19
Gini index and Information gain of DT 27-28
Confusion metrics of All methods 35
Accuracy of all models 35
Comparison of accuracies 40-41
Our model Evaluation metrics 41

List of Figures

Content Page No.

Visualization of GOOD and BAD lung for respiration 10


Flow diagram of hybrid model 13
Performing the Categorization of the dataset 20
Processed data after data cleaning 20
Code for the subcategory of the dataset 21
Final cleaned data for the predictive models 21
Working of KNN 23
Architecture of XGB 29
Working of SVC 30
Final Architecture of Hybrid model 32
Precision Graph 36
Recall Graph 36
F1-Score values 37
X & Y graph of all methods for accuracy and ROC 37
Learning curve of DT 38
Learning curve of KNN 38
Learning curve of SVC 39
Learning curve of Stacking 39
Learning curve of Voting 39
Learning curve of XGB 39
Combined AUCROC curve of all techniques 39

List of Equations & Algorithms

Content Page No.

Class labels 25
Average value for nearest data points 26
DT for further nodes 27
SVC mathematical equation 31

List of Abbreviations

| SNO. | Abbreviation | Definition |
|---|---|---|
| 1 | KNN | K-Nearest Neighbors |
| 2 | DT | Decision Tree |
| 3 | SVC | Support Vector Classifier |
| 4 | COPD | Chronic Obstructive Pulmonary Disease |
| 5 | BRFSS | Behavioral Risk Factor Surveillance System |
| 6 | ROC AUC | Receiver Operating Characteristic Area Under the Curve |
| 7 | WHO | World Health Organization |
| 8 | XGBoost | Extreme Gradient Boosting |

Chapter-1
I. Introduction

Machine learning is a broad set of algorithmic models and statistical approaches designed to solve
problems without the need for task-specific programming [31]. Some machine learning models,
particularly single-layered ones, involve extensive feature extraction and data processing before the
data is input into the algorithm[1], [2]. Proper data preprocessing is crucial to ensure accurate
predictions and avoid issues such as overfitting or underfitting the training dataset. Deep learning, a
more advanced division of machine learning that employs hierarchical artificial neural networks, can
achieve higher accuracy and precision, though this may come at the cost of reduced interpretability[3].
In deep learning, neural networks comprise multiple layers that connect artificial neurons or units,
enabling complex data processing. These networks can autonomously learn, recognize patterns, and
extract insights from data through these layered connections until they achieve desired results[4], [5].
Personalized medicine tailors medical decisions, treatments, and technologies to each individual
patient based on their predicted response or disease risk [6]. This approach has gained traction in
recent years owing to advancements in diagnostic techniques and informatics. Big Data analytics,
leveraging various machine learning methods, plays a pivotal role in establishing the analytical
foundation of personalized medicine. The growing use of computerized algorithms for real-time
estimation of clinical outcomes, supported by Big Data analytics, aims to enhance patient care and
reduce costs. Moreover, the expanding accessibility of electronic health data is driving the swift
expansion of predictive analytics applications within the healthcare sector.

1.1. Overview of Asthma as health concern

Fig. 1.1. Visualization of GOOD and BAD lung for respiration

Asthma is a significant global health issue, affecting about 300 million people and resulting in
approximately 250,000 deaths annually[7]. A major challenge of asthma is the constriction of the
airways, which worsens as the condition becomes more severe and can be expensive to treat.
Evaluating this airway narrowing is essential for diagnosing asthma, monitoring its progression, and
assessing the effectiveness of treatments[8]. Commonly, doctors rely on tests such as spirometry and
body plethysmography, but these require patients to fully cooperate and exert maximum effort. This
can be challenging for older adults, individuals who may struggle to follow instructions, or those with
other serious health conditions. Asthma is one of the most prevalent and serious non-communicable
conditions worldwide[9]. It is a chronic lung condition that affects the airways, leading to notable
changes in lung function. Recent data from the World Health Organization (WHO) indicate that
around 334 million people worldwide are affected by asthma, and in 2016 alone it claimed the lives of
over 417,918 people globally[10], [11], [12]. Asthma is thought to be caused by a combination of
genetic and environmental factors, including allergies, smoking, weather, air pollution, and exposure
to specific chemicals [13]. Symptoms differ from one individual to another, but the most prevalent
ones include shortness of breath, chest tightness or discomfort, disrupted sleep patterns, persistent
coughing, difficulty speaking, sensations of anxiety or panic, and fatigue [5].

Medical professionals emphasize the importance of accurate diagnosis and prompt detection of
life-threatening illnesses, as these factors can significantly improve a patient's chances of survival and
expedite their recovery[13]. Recently, artificial intelligence (AI) has gained widespread recognition as
a valuable tool in disease detection. It has shown remarkable success in diagnosing various health
issues using machine learning and deep learning approaches [14], [15]. Numerous research studies are
presently exploring the capabilities of machine learning and deep learning algorithms for detecting
diseases such as asthma and pneumonia [16].

The asthmatic lung is characterized by persistent inflammation and increased sensitivity of the
airways, resulting in recurrent episodes of wheezing, breathlessness, chest tightness, and coughing
[17]. Asthma is a prevalent chronic respiratory condition that can affect individuals of any age,
although it often begins in childhood. Triggers that may lead to episodes include exposure to allergens
(such as pollen, dust mites, or pet dander), respiratory infections, physical activity, and exposure to
cold air, smoke, or air pollution, all of which can induce inflammation and narrowing of the airways in
people with asthma. This leads to the typical symptoms, which can vary in severity and frequency[9],
[18].

Inflammation in asthmatic lungs involves a complex interaction of immune cells, such as eosinophils,
mast cells, and T lymphocytes, as well as various inflammatory mediators such as histamine,
leukotrienes, and cytokines. This inflammatory response contributes to airway constriction, excessive
mucus production, and heightened airway responsiveness.

Managing asthma aims to control symptoms, prevent flare-ups, and enhance lung function. Treatment
often includes bronchodilators to alleviate acute symptoms and anti-inflammatory medications to
reduce airway inflammation and prevent exacerbations. Alongside medication, avoiding triggers and
adopting a healthy lifestyle are crucial for effective asthma management. Regular monitoring of
symptoms and lung function, along with personalized asthma action plans developed with healthcare
providers, can empower individuals with asthma to effectively manage their condition and lead active
lives despite its challenges[19], [20].

In recent years, the global community has grappled with the COVID-19 challenge.[17] One of its
concerning effects is the development of severe pneumonia, often leading to fatalities, particularly
when diagnosed late. Chronic lung ailments have imposed significant strains on healthcare systems.
The period from 2019 to 2021 witnessed a surge in chronic obstructive pulmonary disease (COPD)
cases worldwide due to the COVID-19 outbreak. Timely diagnosis holds the key to mitigating this
issue, as early intervention can effectively manage the condition[17].

1.2. Importance of early prediction and intervention

Home-based telemonitoring is increasingly used to monitor and manage chronic health conditions
outside of hospitals. The goal is to optimize the management of these diseases and prevent worsening.
This technology has shown promise in a variety of conditions, including asthma, hypertension,
inflammatory bowel disease, congestive heart failure (CHF), multiple sclerosis, COPD, and
depression. Recent studies, however, have revealed limits in the effectiveness of home telemonitoring
strategies for chronic health disorders [3], [21]. These limitations are often attributed to the absence of
reliable early predictors and the simplistic implementation of traditional algorithms for identifying
worsening symptoms. Current algorithms typically estimate the overall risk of exacerbations occurring
within a specific time frame, such as one month or one year[22]. They rely on a combination of
clinical and billing records, but they do not account for changes in disease severity over time or
day-to-day variations in symptoms[6]. Novel methods that anticipate impending exacerbations from
individual disease patterns, and so detect potential worsening before it happens, could significantly
boost the effectiveness of home-based telemonitoring systems. Such an advance has the potential to
raise the standard of care delivered to patients and reduce healthcare expenditure.

1.3. Introduction to the proposed Hybrid Technique

In the field of medical prediction, hybrid models are increasingly used to enhance accuracy and
reliability, particularly in forecasting events such as asthma exacerbations. These models integrate
multiple algorithms or techniques to capitalize on the strengths of each while mitigating individual
weaknesses[8]. The blending of different approaches often leads to superior predictive performance
compared to using any single model alone. Moreover, hybrid models demonstrate greater resilience to
variations in data and environmental factors, making them adaptable to diverse real-world scenarios.
Additionally, these models offer enhanced interpretability, a critical aspect in healthcare
decision-making, by providing insights into the underlying reasoning behind predictions. By
incorporating various feature engineering and selection methods, hybrid models effectively utilize
pertinent data features while reducing noise, resulting in more precise predictions of asthma
exacerbations. The ability to accurately predict asthma exacerbations facilitates timely intervention by
healthcare professionals, potentially preventing severe episodes and enhancing patient outcomes.

Fig. 1.2. Flow diagram of hybrid model

The availability of large volumes of data and technological improvements have significantly changed
the healthcare sector. Advancements in this field have provided the groundwork for the development
of sophisticated predictive models that can effectively diagnose diseases. Early disease prediction not
only improves patient outcomes but also leads to more effective treatments and reduced healthcare
costs[12]. The objective of this research is to investigate the effectiveness of a hybrid algorithm that
integrates elements from four powerful machine learning models: K-nearest neighbors (KNN),
XGBoost, Decision Tree, and Support Vector Classifier (SVC). The KNN algorithm is a simple yet
powerful method that classifies new instances based on their similarity to the nearest instances in the
training dataset[22]. XGBoost, an efficient and flexible implementation of the gradient boosting
framework, has become a preferred option for large-scale machine learning tasks. This study compares
the ability of these four algorithms to forecast the course of disease[14]. By utilizing a hybrid
algorithm, we aim to leverage the unique strengths of each model to improve the accuracy and
reliability of disease prediction. The research covers the methodology of each model, its application in
the context of disease prediction, and a comprehensive analysis of its performance using various
assessment metrics[4]. This comparative analysis will provide valuable insights into the most effective
machine learning techniques for disease prediction, contributing to the advancement of predictive
healthcare analytics.

1.4. The Research Objective

The aim of this study is to propose a hybrid model that can effectively pinpoint the key features
associated with the onset of asthma in patients. This model will offer predictive capabilities applicable
to diverse datasets, with the dataset used in this study serving as a demonstrative example.
Additionally, the study will illustrate how such data-driven insights can inform the development of
intervention strategies and early medical interventions for asthmatic patients. The evaluation
encompasses four algorithms: KNN, SVC, XGBoost, and Decision Tree.

Chapter-2

II. Literature Review:

The literature review begins by examining previous research on predictive modeling and machine
learning approaches in the context of chronic health conditions, summarized in Table 2.1 below:

Table 2.1. Survey of Literature review on the various research

| Authors | Data Samples | Methods | Research Type | Drawbacks | Future Scope |
|---|---|---|---|---|---|
| [5] | 7001 | Naive Bayesian classifier, adaptive Bayesian network, support vector machines | Predictive | Overestimation of risk, False Positive, Calibration Issues | Subsequent research should concentrate on enhancing the generalization capability and practicality of these models. |
| [23] | 8,802 | DT, RF, NB, Multilayer-Perceptron, K-Nearest Neighbours | Predictive | Missing values, Selection Bias | The tool needs to be tested in other populations to evaluate its predictive performance and generalizability. |
| [15] | 1456 | CAPE and CAPP | Predictive | The study was restricted because both model development and validation were primarily conducted using predominantly Caucasian populations. Another unavoidable limitation in epidemiological studies is the lack of a clear definition for asthma. | Continued research into using machine learning techniques to predict different types of asthma may provide deeper predictive insights and enhance clinical utility. |
| [20] | 150 asthma and 52 healthy control blood sera samples | ANN, SVM, RF | Comparative study on Predictive models | The study did not explore the possible challenges in diagnosing asthma in infants or differentiating it from other lung conditions. | The researchers plan to use the optimized classifier developed in this study for real-time asthma risk prediction and potentially for other diseases as well. |
| [11] | 16000 | Naïve Bayes, J48, Random Forest, and Random Tree | Predictive | | |
| [24] | 5,875 patients, including 13,614 weekly surveys and 75,795 daily surveys | Decision trees, logistic regression, naïve Bayes, and support vector machine (SVM) | Predictive | Identifying an unstable event using a weekly survey with a resolution of one week may impact the precision of the study. | The study expects further improvement in performance if objective measures can be used as feature inputs for prediction. |
| [25] | The study used 10 endogenous variables: 6 atmospheric, 3 meteorological, and asthma occurrence data from Seoul, South Korea. | Vector autoregressive model (VAR), DNN | | Over-dispersion can affect consistency of coefficient estimation. | Further research could explore robust variance estimators for QML Poisson. |
| [16] | OASIS dataset has a dimension of 373 rows x 15 columns | SVM, DT, RF, LR | | The study found that the random forest classifier, a more complex model, experienced overfitting. | |
Chapter-3

III. Methodology

3.1. Data Collection

To improve the accuracy of asthma prediction, we have incorporated several machine learning
algorithms (SVC, KNN, XGBoost, and Decision Tree) into our model. Furthermore, we have curated
a novel disease dataset, which holds significant potential as a cornerstone for future researchers and
healthcare professionals. The subsequent sections provide concise insights into the research materials
and methodologies employed in the study. The Behavioral Risk Factor Surveillance System (BRFSS)
has emerged as a potent resource for tailoring and advancing health promotion efforts through the
gathering of behavioral health risk information at both state and local levels. Consequently, there has
been a growing call from BRFSS users for expanded datasets and additional survey questions to better
address their needs. The features (columns) used are listed in Table 3.1 below:

Table 3.1. Description of meta data analysis

| Fieldname | Question/Description | Data Type | Value Range |
|---|---|---|---|
| VETERAN3 | Have you ever been on active duty in the United States Armed Forces, whether in the regular military, National Guard, or military reserve unit? This excludes reserve or National Guard training but includes deployments, such as those during the Persian Gulf War. | Numerical | 1-9 |
| ALCDAY5 | How often did you drink an alcoholic beverage in the past month, including beer, wine, malt beverages, or liquor? Please provide the number of days per week or per month. | Numerical | 101-999 |
| SLEPTIM1 | How many hours of sleep do you typically get per day, on average? | Numerical | 1-99 |
| ASTHMA3 | Have you ever been informed that you have asthma? | Numerical | 1-9 |
| Smoke100 | In the past year, have you abstained from smoking for a day or more to quit? | Numerical | 1-9 |
| SMOKDAY2 | Do you currently smoke cigarettes daily, occasionally, or not at all? | Numerical | 1-9 |
| Sex | Indicate sex of respondent. | Numerical | 1-2 |
| MARITAL | Are you: (marital status) | Numerical | 1-9 |
| GENHLTH | Overall, would you describe your health as: | Numerical | |
| HLTHPLN1 | Do you have any type of healthcare coverage, such as health insurance, prepaid plans like HMOs, or government plans like Medicare or Indian Health Service? | Numerical | 1-9 |
| EDUCA | What is the highest level of education you have achieved? | Numerical | 1-9 |
| INCOME2 | What is the total income of your household from all sources on an annual basis? If you prefer not to disclose, please indicate that you refuse to answer. | Numerical | 1-99 |
| EXERANY2 | Have you participated in any physical activities or exercises, such as running, calisthenics, golf, gardening, or walking for exercise, aside from your usual work duties, in the past month? | Numerical | 1-9 |

3.2. Dataset Description:

The comprehensive BRFSS dataset, comprising both landline and cell phone data, is compiled from
submissions for the year 2014. It encompasses information from all 50 states, the District of
Columbia, Guam, and Puerto Rico. There are 464,664 records in the dataset.

3.3. Data Preprocessing:

Preparing the 2014 BRFSS survey data for analysis involves several key preprocessing steps. First, the
raw survey data, which is provided as fixed-width column ASCII files on the CDC website, is
downloaded and extracted. The structure of the data, including column positions and group fields, is
defined in a layout format. To facilitate parsing, a data frame is constructed from this layout so that
group fields are removed and the widths of individual fields are calculated accurately. Subsequently,
the data is categorized based on specific parameters, with the aim of segmenting it into meaningful
groups. This process may entail combining multiple values to create new parameters that enhance the
interpretability of the data. Finally, data cleaning procedures are applied to address missing values,
outliers, and inconsistencies. Handling these anomalies improves the quality and reliability of the data
and ensures its suitability for subsequent analysis.
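As a rough illustration of the parsing and cleaning steps described above, the sketch below slices fixed-width ASCII records using a layout of (field name, starting column, width) and then drops records with blank fields. The column positions and the sample records are hypothetical placeholders and do not reproduce the actual 2014 BRFSS layout; the helper names are likewise illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: (field name, 1-based starting column, field width).
# The real positions must be taken from the official BRFSS layout file.
LAYOUT: List[Tuple[str, int, int]] = [
    ("SEX", 1, 1),
    ("SLEPTIM1", 2, 2),
    ("ALCDAY5", 4, 3),
]


def parse_record(line: str) -> Dict[str, Optional[int]]:
    """Slice one fixed-width record into integer fields (None if blank)."""
    record: Dict[str, Optional[int]] = {}
    for name, start, width in LAYOUT:
        raw = line[start - 1 : start - 1 + width].strip()
        record[name] = int(raw) if raw else None
    return record


def parse_lines(lines: List[str]) -> List[Dict[str, Optional[int]]]:
    """Parse every non-empty line of a fixed-width file."""
    return [parse_record(line) for line in lines if line.strip()]


def drop_incomplete(records):
    """Simple cleaning step: keep only records without missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]


# Made-up sample records; the second one has a blank SLEPTIM1 field.
sample = ["107888", "2  101", "108205"]
rows = parse_lines(sample)
clean = drop_incomplete(rows)
print(len(rows), "parsed,", len(clean), "kept after cleaning")
```

In practice the same layout could be passed to a data-frame library instead of being sliced by hand; the manual version is shown only to make the role of the column positions and widths explicit.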

Fig. 3.1. Performing the Categorization of the dataset

As shown in Fig. 3.1, the data is divided into subcategories according to its metadata. For example,
the drinking field is split into four categories, including non-drinking, drink monthly, and drink
weekly, and these categories are combined with the asthma4 column to separate the drinker categories
in the dataset.
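A minimal sketch of this kind of categorization is given below. It assumes the usual BRFSS coding for ALCDAY5 (values 101-199 for days per week, 201-299 for days in the past month, 888 for no drinks), which should be checked against the official codebook; everything else is grouped as "unknown" here, which is an assumption and not necessarily the fourth category used in this work. The sample records and the asthma flag coding (1 = yes, 2 = no) are illustrative only.

```python
from collections import Counter


def drinking_category(alcday5):
    """Map a raw ALCDAY5 code to a coarse drinking category (assumed coding)."""
    if alcday5 is None:
        return "unknown"
    if alcday5 == 888:
        return "non-drinking"
    if 101 <= alcday5 <= 199:
        return "drink weekly"
    if 201 <= alcday5 <= 299:
        return "drink monthly"
    return "unknown"  # e.g. "don't know" or "refused" codes


# Hypothetical records: (ALCDAY5 value, asthma flag where 1 = yes, 2 = no).
records = [(888, 2), (103, 1), (215, 2), (777, 2), (101, 2)]

# Count respondents per (drinking category, asthma flag) pair.
counts = Counter((drinking_category(a), asthma) for a, asthma in records)
for key, n in sorted(counts.items()):
    print(key, n)
```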

Fig. 3.2. Processed data after data cleaning

After removing null and noisy data, the cleaned dataset contains 464,664 entries.

Fig.3.3. Code for the subcategory of the dataset.

This code cleans the data and separates it into groups according to the names given in the code. It
generates a CSV file named Cleaned_data.csv containing 30,410 entries, which is used as input to the
machine learning models. The figure shows the column names and the data they contain, which is
either numerical or categorical.

Fig.3.4. Final cleaned data for the predictive models

Features such as VETERAN3, ALCDAY5, and SLEPTIM1 are selected from the figure. The required
information about the features is given in Table 3.1.

3.4. Hybrid Technique Implementation:

A hybrid machine learning model is a combination of multiple machine learning models, each
contributing to the final prediction. In this work, we consider K-Nearest Neighbors (KNN), Decision
Tree, XGBoost, and SVC. A high-level overview of the approach is as follows:

1. Individual Model Training: Each of the four models (KNN, Decision Tree, XGBoost, and
SVC) is trained separately on the training data. In this phase, we tune the hyperparameters of
each model to achieve the best performance.

2. Model Evaluation: Once training is complete, we assess each model's performance on the
validation set using suitable measures such as accuracy, precision, recall, and F1-score. This
makes it possible to evaluate each model's performance separately.

3. Hybrid Model Creation: In this step, we develop a hybrid model that capitalizes on the
strengths of the individual models. There are several ways to do this:
• Voting: Each model makes a prediction for each instance, and the final prediction is
decided by majority vote[8]. This method works well when the models are largely
uncorrelated.
• Stacking: In this approach, another machine learning model takes the predictions
(class probabilities) of the individual models as input and learns to make the final
prediction. This "meta-model" can be any model, but it is often a simple one such as
linear or logistic regression.

4. Hybrid Model Assessment: Finally, we evaluate the performance of the hybrid model using
the same metrics as before. If the hybrid model is effective, it should perform better than any
individual model.
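The steps above can be sketched with scikit-learn as follows. This is only an illustrative outline under several assumptions: synthetic data from `make_classification` stands in for the cleaned BRFSS data, `GradientBoostingClassifier` is used as a stand-in for XGBoost so that the example needs no extra dependency, hyperparameters are left at their defaults, and logistic regression is chosen as the meta-model. It is not the exact configuration used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def base_models():
    """Return fresh copies of the four base learners."""
    return [
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("dt", DecisionTreeClassifier(random_state=0)),
        # Stand-in for XGBoost (assumption): scikit-learn gradient boosting.
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ]


# Voting: the final class is decided by majority vote of the base models.
voting = VotingClassifier(estimators=base_models(), voting="hard")

# Stacking: a meta-model learns from the base models' cross-validated outputs.
stacking = StackingClassifier(
    estimators=base_models(), final_estimator=LogisticRegression(), cv=5
)

scores = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

The accuracies printed for the synthetic data say nothing about the results reported in this work; the sketch only shows how the two combination strategies are wired together.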

3.4.1. Insights of Individual Machine learning Models:

KNN:

The K-Nearest Neighbors (KNN) algorithm is a widely recognized and easily understood machine
learning method that can be applied to asthma prediction. It is a versatile approach that relies on the
concept of similarity: KNN classifies a data point by determining the majority class among its closest
neighbors in the feature space [4]. In the context of asthma prediction, KNN analyzes various
attributes and patterns within the dataset to identify similarities between instances, aiding in the
classification of asthma likelihood.
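The majority-vote idea can be illustrated with a small, self-contained sketch. The toy feature vectors and labels below are invented for illustration (1 standing for "asthma", 0 for "no asthma"), and the function name `knn_predict` is hypothetical; Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and take the most common one.
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]


# Invented two-feature toy data forming two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))
print(knn_predict(train_X, train_y, (3.0, 3.0), k=3))
```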

By utilizing historical data containing factors such as demographic details, environmental exposures,
genetic predispositions, and medical history, KNN effectively identifies individuals at risk of
developing asthma[26]. The K-Nearest Neighbors (KNN) algorithm is highly valuable due to its
straightforwardness and clarity. It does not depend on assumptions about the distribution of the data
and can easily adapt to diverse datasets with various types of features.

Fig. 3.5. Working of KNN

Moreover, the flexibility of KNN allows it to accommodate dynamic changes in asthma risk factors
over time, making it a robust contender for real-time prediction and monitoring applications[27]. Its
non-parametric nature also enables it to capture complex relationships and nonlinear interactions
among predictors, ensuring comprehensive and accurate predictions.

However, like any machine learning method, KNN presents certain considerations. The computational
overhead associated with evaluating distances between data points can be significant, especially in
large datasets. Additionally, appropriate feature scaling and careful selection of the K parameter
(number of neighbors) are crucial to optimize performance and minimize potential biases[17], [28].
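To see why feature scaling matters for a distance-based method, consider the hedged sketch below. The two features (an income-like value and hours of sleep) and all numbers are invented; the point is only that a feature with a large numeric range can dominate the Euclidean distance until the features are standardized.

```python
import math
from statistics import mean, pstdev


def standardize(rows, reference):
    """Z-score each column of rows using the mean and std of reference."""
    cols = list(zip(*reference))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [
        tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows
    ]


def nearest_index(train, query):
    """Index of the training point closest to query (Euclidean distance)."""
    return min(range(len(train)), key=lambda i: math.dist(train[i], query))


# Invented data: (income-like value, hours of sleep).
train = [(30000, 8), (30500, 4), (60000, 8), (60500, 4)]
query = (30400, 8)

# Without scaling, the large income values dominate the distance.
nearest_unscaled = nearest_index(train, query)

# After standardization, the difference in sleep hours counts as well.
train_scaled = standardize(train, train)
query_scaled = standardize([query], train)[0]
nearest_scaled = nearest_index(train_scaled, query_scaled)

print(nearest_unscaled, nearest_scaled)
```

Before scaling, the query is matched to the respondent with a similar income but very different sleep duration; after scaling, it is matched to the respondent who is close on both features.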

Despite these challenges, the KNN algorithm remains a valuable tool in the arsenal of asthma
prediction models, offering a straightforward yet effective approach to identifying individuals at risk
and informing targeted intervention strategies. Through its reliance on proximity-based classification,
KNN contributes to advancing personalized medicine initiatives by facilitating early detection and
proactive management of asthma, ultimately enhancing patient outcomes and quality of life.

Implementation of KNN:

KNN (K-Nearest Neighbors) is one of the base models used for classification. The following is a
breakdown of how KNN is used in the model-building process:

1. Initialization: The KNN classifier is initialized without specifying any hyperparameters. By
default, it uses 5 nearest neighbors (`n_neighbors=5`), among other default parameters.

2. Model Training: Once the KNN classifier is initialized, it is trained on the scaled training data
(X_train_scale, y_train), alongside the other base models.

3. Prediction: Once trained, the KNN classifier predicts the target variable for the scaled test
data (X_test_scale) using the `predict` method. Additionally, it calculates the probability of
belonging to each class using the `predict_proba` method. This is important for computing the
ROC AUC score.
4. Evaluation Metrics:
• Accuracy: The accuracy of the KNN model's predictions on the test set is printed.
• ROC AUC Score: The ROC AUC score is calculated. This measure evaluates the
capability of the classifier to distinguish between classes and is based on the true
positive rate and false positive rate.
• Confusion Matrix: The program displays a confusion matrix indicating the TP, TN,
FP, and FN predictions.
• Classification Report: A detailed classification report with precision, recall, F1-score,
and support is produced.
5. Learning Curve Plot: A learning curve is used to show how well the KNN model performs and
to determine whether overfitting or underfitting is present. The curve displays how the training
and cross-validation scores change as the size of the training dataset changes.

Being a non-parametric technique, the K-Nearest Neighbours (KNN) classifier does not rely on any
predetermined assumptions about the data distribution. KNN adopts a statistical technique wherein it
utilizes distance functions to categorize unknown samples based on their proximity to a set of known k
samples[4]. KNN works by identifying the closest k neighbors to the input values provided and their
respective classifications. It then determines the most common classification among these neighbors as
the final output for the given input. In our training process using the dataset, we have set the value of
the k parameter to 5. The K nearest neighbors (KNN) algorithm can be expressed mathematically as
follows:

1. We calculate the distances between each input point and each data point in the dataset using a
chosen distance metric, such as the Euclidean distance, in order to identify the K closest neighbours for
a given input point.

2. By sorting the distances in ascending order, we select the K data points with the smallest
distances as the nearest neighbors.

3. In classification tasks, the predicted class label for an input point is determined by selecting the
majority class label among its nearest neighbors. Conversely, in regression tasks, the predicted target
value for an input point is calculated by averaging the target values of its nearest neighbors.

The mathematical representation of the KNN algorithm is as follows:

Given:

- n: number of features in the dataset

- m: number of data points in the dataset

- X: input point with n features

- D: dataset with m data points and n features

- D_i: the i-th data point in the dataset

- y_i: class label or target value associated with the i-th data point

- d(X, D_i): the distance between the input point X and the i-th data point D_i

For classification tasks:

1. Calculate the distances between the input point X and each data point D_i in the dataset:

d(X, D_1), d(X, D_2), ..., d(X, D_m)

2. Sort the distances in ascending order and select the K closest data points:

K_nearest = {D_i1, D_i2, ..., D_iK}, where d(X, D_i1) ≤ d(X, D_i2) ≤ ... ≤ d(X, D_iK)

3. Determine the most common class label among the K closest data points:

\[ y_{pred} = \arg\max_{c} \sum_{D_i \in K_{nearest}} [y_i = c] \]

Equation 3.1. Class labels

where c ranges over the unique class labels and [y_i = c] equals 1 when y_i = c and 0 otherwise.
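The three classification steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in this work: the helper names (`euclidean`, `knn_classify`) and the toy data are assumptions introduced here, and the Euclidean metric is just one possible choice of d.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance d(X, D_i) between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x, data, labels, k=5):
    """Predict the class of x by majority vote among its k nearest neighbours."""
    # Step 1: distance from x to every data point D_i
    distances = [(euclidean(x, d_i), y_i) for d_i, y_i in zip(data, labels)]
    # Step 2: sort ascending and keep the k closest points
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: argmax over classes c of the number of neighbours with y_i = c
    votes = Counter(y_i for _, y_i in k_nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per patient, label 1 = asthma-positive
toy_data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
toy_labels = [0, 0, 0, 1, 1, 1]
print(knn_classify((1.1, 1.0), toy_data, toy_labels, k=3))  # 0
print(knn_classify((3.0, 3.0), toy_data, toy_labels, k=3))  # 1
```

Because every prediction scans the whole dataset, the cost grows with the number of stored points, which is the computational overhead noted earlier for large datasets.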

For regression tasks:

1. Calculate the distances between the input point X and each data point D_i in the dataset:

d(X, D_1), d(X, D_2), ..., d(X, D_m)

2. Arrange the distances in increasing order and choose the K nearest data points:

K_nearest = {D_i1, D_i2, ..., D_iK}, where d(X, D_i1) ≤ d(X, D_i2) ≤ ... ≤ d(X, D_iK)

3. Calculate the average target value among the K nearest data points:

\[ y_{pred} = \frac{1}{K} \sum_{D_i \in K_{nearest}} y_i \]

Equation 3.2. Average value for nearest data points
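The regression variant differs only in the last step, where the neighbours' target values are averaged. The following is a small sketch under the same assumptions as before (Euclidean distance, hypothetical function name and toy data):

```python
import math

def knn_regress(x, data, targets, k=5):
    """Predict a numeric target as the mean over the k nearest neighbours of x."""
    # math.dist is the Euclidean distance between two points (Python 3.8+)
    ranked = sorted(zip(data, targets), key=lambda pair: math.dist(x, pair[0]))
    nearest_targets = [y_i for _, y_i in ranked[:k]]
    # Divide by the number of neighbours actually found (equals k when k <= m)
    return sum(nearest_targets) / len(nearest_targets)

# Hypothetical one-feature dataset with one distant outlier
toy_data = [(0.0,), (1.0,), (2.0,), (10.0,)]
toy_targets = [1.0, 2.0, 3.0, 50.0]
print(knn_regress((1.0,), toy_data, toy_targets, k=3))  # 2.0
```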

3.4.2. Decision Tree:

Decision trees, which are built by widely used algorithms such as ID3 and CART, are a non-parametric
supervised learning technique primarily employed for classification tasks. The fundamental structure
consists of a root node, internal nodes, and leaf nodes. During classification, the algorithm begins at the root
and traverses to a leaf node. Decision trees find applications in decision analysis research, with
implementations showing promise in accurately predicting asthma disease. The core objective involves
constructing a model that predicts a target variable's value based on learned data features using
straightforward decision rules[2], [29].

Once the decision tree is built, it can be used to forecast asthma status for new patients by applying a
set of criteria from the root node to the leaf nodes [1], [24]. At each internal node, the decision tree
evaluates the patient's feature values and proceeds down the appropriate branch based on the decision
rules. Ultimately, the leaf node reached predicts the patient's asthma status: asthma-positive or asthma-
negative[29].

The prediction accuracy of decision trees can vary depending on considerations such as the dataset's
quality and completeness, the selection of features, and the decision tree's complexity. However,
decision trees offer several advantages in asthma prediction[14], [30]. They are easy to interpret, as the
decision rules are represented graphically and in a human-readable format. They can handle mixed
data types and missing values, making them applicable to real-world medical datasets. Decision trees
can also account for interactions between features, allowing for more accurate predictions.

The mathematical formula for a decision tree is as follows:

The decision tree algorithm is formulated to create a set of rules by segmenting the data using input
features to classify the target variable. In our scenario, we're addressing a binary classification
problem, where the target variable can take on two values: 0 for asthma-negative and 1 for asthma-
positive. The decision tree model can be depicted as a sequence of if-else statements. Each internal
node of the tree signifies a split on a feature, while each leaf node signifies a prediction (0 or 1). Below
is a typical representation of the decision tree formula:

if (feature1 <= threshold1) {
    if (feature2 <= threshold2) {
        // Predict class value for Leaf Node 1
    } else {
        // Predict class value for Leaf Node 2
    }
} else {
    // Predict class value for Leaf Node 3
}

Condition 3.1. DT for further nodes

In Condition 3.1, "feature1" and "feature2" represent features from the input data, and "threshold1"
and "threshold2" represent thresholds to split the data on those features. The if-else statements evaluate
the feature values and decide which branch (left or right) to take based on the thresholds. At each leaf
node, a prediction is made for the target variable (0 or 1). The actual formula may be more complex
and may involve multiple features and splits depending on the specific decision tree being used.
Additionally, decision trees can be used for multi-class classification or regression problems,
which would require different mathematical formulations.
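The nested if-else structure can be written as a small runnable function. This is a sketch only: the function name, the thresholds, and the class assigned to each leaf are hypothetical and are not taken from the trained tree.

```python
def predict_asthma(feature1, feature2, threshold1, threshold2):
    """Walk a two-level decision tree; return 0 (asthma-negative) or 1 (asthma-positive)."""
    if feature1 <= threshold1:
        if feature2 <= threshold2:
            return 0  # Leaf Node 1 (class chosen for illustration only)
        return 1      # Leaf Node 2
    return 1          # Leaf Node 3

# Hypothetical patient and thresholds
print(predict_asthma(2, 5, threshold1=3, threshold2=4))  # 1 (left branch, then right)
```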

It's crucial to mention that the decision tree algorithm utilizes impurity-based criteria, such as the Gini
Index or Information Gain (the reduction in entropy), to identify the optimal feature and threshold for
splitting the data at each node. These measures help in finding the splits that maximize the separation
between different classes, that is, splits that minimize the impurity within each resulting child node.
Overall, the mathematical formula for a decision tree involves evaluating feature values and splitting
the data based on rules defined by thresholds, eventually leading to predictions at each leaf node.
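To make the two criteria concrete, the sketch below computes the Gini impurity of a label set and the information gain of one threshold split. The function names and the toy feature are assumptions for illustration; the values reported for the asthma dataset were not produced by this code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum over classes of p_c squared."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: minus the sum over classes of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Hypothetical feature that separates the two classes perfectly at 3
toy_values = [1, 2, 3, 4, 5, 6]
toy_labels = [0, 0, 0, 1, 1, 1]
print(gini(toy_labels))                             # 0.5
print(information_gain(toy_values, toy_labels, 3))  # 1.0
```

A balanced binary label set has Gini impurity 0.5 and entropy 1 bit, so a perfect split recovers the full 1 bit as information gain.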

The Gini Index is 0.301 for the dataset on which the decision tree model is trained.

Table 3.2. Gini index and Information gain of DT

Feature      Information Gain
ALCDAY5      0.028650825128447473
SLEPTIM1     0.013936882614491629
X_AGE_G      0.006847519607620646
SMOKDAY2     0.006960442110794853
SEX          0.0004739918775396071
X_HISPANC    0.0025293500968855574
X_MRACE1     0.008850538074078308
MARITAL      0.014924287194574969
GENHLTH      0.014629738787207778
HLTHPLN1     6.570035720693993e-05
EDUCA        0.012295452134480305
INCOME2      0.02102649276657655
X_BMI5CAT    0.011768902179923593
EXERANY2     0.007876172653369179
ALCGRP       0.002625711646314724
DRKWEEKLY    0.0011139398458448837
ASTHMA4      0.8340980145040057
AGE2         0.0013888986102831005
AGE3         0.00034468189068495765
AGE4         0.001742967422482746
AGE5         0.004107403649051549
AGE6         0.0037420868481351094

3.4.3. XGBoost:

Extreme Gradient Boosting (XGBoost) is a machine learning algorithm that combines the principles of
gradient boosting and decision trees [8]. It is extensively used for various classification and
regression tasks, including detecting medical conditions like asthma in patients.

The XGBoost algorithm constructs a sequence of weak decision tree models, each designed to rectify
errors made by previous trees and enhance the overall predictive accuracy. It places special emphasis
on problematic examples that were difficult to classify accurately.

1. Feature selection: XGBoost supports feature selection by scoring the significance of each
feature through its contribution to reducing the loss function. Measures such as gain or
cover are used to evaluate this contribution [10], [31]. It can thereby identify the most
relevant features related to asthma diagnosis and use them effectively.
2. Handling missing data: XGBoost can handle missing values in the input data, which are
common in medical datasets [19]. It learns a default branch direction for missing values at
each split, incorporating them into the decision-making process.
3. Handling imbalanced data: Medical datasets can often be imbalanced, where the number of
negative cases (non-asthma) significantly outweighs the positive cases (asthma)[20]. XGBoost
provides options for handling imbalanced data, such as using class weights or setting different
thresholds during model training.
4. Boosting: XGBoost uses gradient boosting, which means it trains new trees that explicitly
target and correct the mistakes made by previous trees during the training process. Using
XGBoost for asthma diagnosis can enhance the model's capability to detect intricate patterns
and increase sensitivity in predicting asthma.
5. Regularization: XGBoost includes regularization techniques to prevent overfitting, which is
important when dealing with medical data. Regularization helps generalize the model, reducing
the chances of it memorizing noise or idiosyncrasies in the training data.
6. Model evaluation and tuning: XGBoost offers evaluation metrics like accuracy, precision,
recall, and AUC-ROC to evaluate the model's performance in identifying asthma in patients.
Various hyperparameters, such as the learning rate, tree depth, and number of trees, can be
adjusted to enhance the model's performance.

Fig. 3.6. Architecture of XGB

XGBoost, an ensemble tree method, employs a gradient boosting framework to enhance the
performance of weak learners. To mitigate overfitting, XGBoost incorporates a regularization term into
its objective function, thereby shrinking the learned weights. The model's output, denoted as y-hat, is
computed as the sum of the outputs of the individual trees. XGBoost's objective function combines a
loss function, which quantifies the difference between predicted and actual values, with a
regularization function that manages model complexity and prevents overfitting.

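The formulas for XGBoost:

1. Objective Function:
Objective Function = Loss Function + Regularization Term
2. Loss Function (binary log loss, with true label y_i and predicted probability \hat{p}_i):
\[ L = -\sum_{i} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right] \]
3. Regularization Term (with leaf weights w, L1 coefficient \alpha and L2 coefficient \lambda):
\[ \Omega = \alpha \lVert w \rVert_1 + 0.5 \, \lambda \lVert w \rVert_2^2 \]
4. Prediction Function:
\[ F(X) = \sum_{m} F_m(X) \]
5. Gradient:
\[ \text{Gradient} = -\frac{\partial L(y, F)}{\partial F} \]
6. Ensemble Prediction (with learning rate \eta):
\[ \text{Final Prediction} = \sum_{m} \eta \, F_m(X) \]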
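The loss, gradient, and ensemble prediction listed above can be illustrated numerically. The sketch below assumes binary log loss taken with a leading minus sign (so that lower is better) and a sigmoid link between the raw score F and the probability; the function names, the tree outputs, and the value of eta are hypothetical, and none of this calls the XGBoost library itself.

```python
import math

def sigmoid(z):
    """Map a raw boosted score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred):
    """Binary log loss: -sum(y*log(p) + (1-y)*log(1-p)); needs 0 < p < 1."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))

def negative_gradient(y, raw_score):
    """-dL/dF for log loss with p = sigmoid(F); this simplifies to y - p."""
    return y - sigmoid(raw_score)

def boosted_score(tree_outputs, eta=0.3):
    """Final raw score: sum over trees of eta * F_m(X)."""
    return sum(eta * f_m for f_m in tree_outputs)

# Hypothetical outputs of three trees for one patient
score = boosted_score([0.8, 0.5, 0.2], eta=0.3)
print(score, sigmoid(score))
print(negative_gradient(1, score))  # residual the next tree would try to fit
```

The negative gradient is the quantity each new tree is fitted to, which is how later trees correct the errors of earlier ones.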

3.4.4. SVC:

Fig. 3.7. Working of SVC

The Support Vector Classifier (SVC) is a supervised machine learning technique utilized for
data classification. It belongs to the Support Vector Machine (SVM) family and is well-known for its
ability to find the best hyperplane in a high-dimensional feature space to distinguish between classes
[15], [22], [32]. The primary goal of SVC is to discover the decision boundary that maximises the
separation of classes. This is achieved by maximising the margin, that is, the distance between the
decision boundary and the support vectors, which are the data points closest to the boundary [32]. To
manage both linearly and non-linearly separable data, SVC employs kernels such as linear, polynomial,
or radial basis functions to map the data into higher-dimensional spaces. SVC can be used to predict
asthma in patients by training the algorithm on a dataset that includes features related to asthma and
corresponding labels indicating whether a patient has asthma or not [33].

After the completion of training, the model's efficacy is assessed by employing the testing set.
Utilizing the patients' characteristics, the model predicts whether they have asthma. The predicted
labels are subsequently compared to the actual labels to compute various performance metrics, such as
accuracy, precision, recall, and F1 score [24], [34].

The performance metrics enable us to gauge the accuracy of the SVC model in predicting asthma
among patients. If the model performs well, it can be applied to make predictions on new patient data
that it hasn't seen before. However, it is crucial to consider the quality and relevance of the data used to
train the model, as it significantly influences the accuracy and effectiveness of predictions. To improve
the model's performance, feature selection and data preprocessing techniques may be required [4].

The mathematical formulation of the SVM algorithm, particularly for the SVC, revolves around
optimizing a hyperplane capable of categorizing data points into their respective classes. With a
labeled training dataset containing input features (𝑋) and corresponding labels (𝑦), SVC aims to
discover the hyperplane characterized by a vector (𝑤) and bias (𝑏) that maximizes the margin between
the classes. The mathematical formula for the SVC's decision function is:

\[ f(x) = \operatorname{sign}(w^{T} x + b) \]

Equation 3.3. SVC mathematical equation

where:

• f(x) is the predicted class of a given input data point x.
• w is the weight vector perpendicular to the hyperplane; its direction determines the orientation
of the decision boundary.
• x is the input feature vector.
• b is the bias term that adjusts the position of the decision boundary.
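For the linear case, the decision function can be evaluated directly. The following is a minimal sketch in which the weight vector and bias are hypothetical values, not parameters learned from the asthma dataset, and a score of exactly zero is mapped to +1 by convention.

```python
def svc_decision(w, x, b):
    """Return +1 or -1 according to sign(w^T x + b) for a linear SVC."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical weights and bias
w = [2.0, -1.0]
b = -0.5
print(svc_decision(w, [1.0, 0.5], b))  # 2.0 - 0.5 - 0.5 = 1.0, so 1
print(svc_decision(w, [0.0, 1.0], b))  # -1.0 - 0.5 = -1.5, so -1
```

With a non-linear kernel the same sign rule applies, but the score is computed from kernel evaluations against the support vectors instead of a single dot product.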

3.4.5. Hybrid Approach:

A hybrid machine learning strategy integrates multiple machine learning algorithms or approaches to
address a particular problem. Instead of relying on a single algorithm, a hybrid approach leverages the
strengths of different algorithms to improve overall performance, accuracy, or efficiency.

There are various ways to create a hybrid approach in machine learning. Common techniques include
ensemble methods, feature combination, and model stacking [8], [22], [35]. Computer
vision, natural language processing, recommendation systems, and anomaly detection are all examples
of fields where hybrid machine learning approaches may be applied. By combining different
algorithms or techniques, hybrid approaches can often achieve better performance, handle complex
problems, and provide more robust solutions. However, designing and implementing a hybrid
approach requires careful consideration of the specific problem, data, algorithms, and their
interactions [22].

The hybrid approach we are using, called stacking and voting, combines the predictions of different
machine learning algorithms, including KNN, Decision Tree, XGBoost, and SVC.

Stacking is an ensemble technique where multiple base models are trained on the same dataset. The
predictions made by these models are then used as input features for a final meta-model. In this case,
the base models include KNN, Decision Tree, XGBoost, and SVC classifiers. Each of these models
learns to make predictions from the input data and generates its own set of predictions.

Fig. 3.8. Final Architecture of Hybrid model

After training the base models, their predictions are merged using a meta-model. In our scenario, the
meta-model can be a Voting Classifier. A Voting Classifier is an Ensemble model that aggregates the
predictions of multiple models through voting to determine the final prediction. Various voting
methods can be employed, including majority voting, where the class with the highest number of
votes is chosen, or weighted voting, where each model's prediction is weighted based on its
performance or certainty level. The Voting Classifier integrates the predictions from the KNN,
Decision Tree, XGBoost, and SVC classifiers to generate a conclusive prediction.
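The two voting schemes described above can be sketched without any library. The function names, the example predictions, and the weights below are hypothetical; in scikit-learn the corresponding behaviour is provided by `VotingClassifier` with `voting='hard'` or `voting='soft'`.

```python
from collections import Counter

def majority_vote(predictions):
    """Hard voting: return the class predicted by the most base models."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(probabilities, weights):
    """Soft voting: weighted average of each model's P(class = 1), cut at 0.5."""
    total = sum(weights)
    avg = sum(p * w for p, w in zip(probabilities, weights)) / total
    return 1 if avg >= 0.5 else 0

# Hypothetical outputs of KNN, Decision Tree, XGBoost and SVC for one patient
hard_preds = [1, 0, 1, 1]
soft_probs = [0.6, 0.3, 0.9, 0.7]
print(majority_vote(hard_preds))                # 1
print(weighted_vote(soft_probs, [1, 1, 2, 2]))  # (0.6 + 0.3 + 1.8 + 1.4) / 6 is about 0.68, so 1
```

Weighting lets stronger base models count for more, at the cost of having to choose the weights, for example from validation performance.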

By combining the predictions of different classifiers, the stacking and voting approach leverages the
strengths of each model. For example, KNN is a lazy learner that makes predictions based on the
closest neighbors in the training data. Decision Tree is a non-parametric model that can capture
complex relationships between features. XGBoost is a gradient boosting algorithm that creates a strong
predictive model by combining weak models. SVC is a supervised learning method that separates
classes using hyperplanes in a high-dimensional space.

The stacking and voting hybrid approach shown in Fig. 3.8 can be beneficial in improving the overall
prediction accuracy, handling diverse or complex datasets, and handling different types of problems. However, it
is crucial to carefully select and tune the base models, meta-model, and voting strategy to ensure
compatibility and optimal performance. It is also important to handle potential issues such as model
complexity, overfitting, and class imbalances in the dataset. Overall, the stacking and voting approach
allows you to leverage the strengths of different machine learning algorithms to obtain robust
predictions for your specific problem.

Chapter-4

IV. Results

4.1. Report the results obtained from the experiments:

We assessed the performance of the models using test data to ascertain their accuracies. This included
showing the training and testing accuracies for all models. We also assessed each model's AUC Score,
Specificity, and Sensitivity, which were obtained using the confusion matrix.

The experimental outcomes demonstrate the efficacy of different machine learning algorithms in
predicting asthma prevalence using the provided dataset. Four main algorithms were evaluated: KNN,
SVC, Decision Tree, and XGBoost. Additionally, ensemble methods such as Voting and Stacking were
utilized to harness the combined strengths of these base models.

Across all models, the accuracy scores were notably high, with SVC, XGBoost, and Voting achieving
accuracies above 98%. This indicates a robust predictive capability of the models in identifying asthma
cases. Moreover, the area under the ROC curve values ranged from approximately 0.86 to 0.95, further
affirming the models' effectiveness in distinguishing between asthma and non-asthma instances. In
terms of individual model performances, SVC and XGBoost emerged as top performers, consistently
exhibiting high accuracy and ROC AUC scores [36]. These models demonstrated precision and recall
rates exceeding 0.90 for both asthma and non-asthma classes, indicative of their ability to correctly
classify instances across various metrics. The Decision Tree model, while slightly lower in accuracy
compared to SVC and XGBoost, still provided a respectable performance, achieving an accuracy of
around 96% and an ROC AUC score of approximately 0.90. However, it displayed relatively lower
precision and recall rates for the minority class (asthma instances) compared to the majority class.

Ensemble methods, namely Voting and Stacking, showcased competitive performance, with accuracy
and ROC AUC scores similar to or slightly lower than those of individual base models. This suggests that
while ensemble methods can offer enhanced predictive capabilities through model aggregation, they
may not always outperform individual strong learners.

The confusion matrices for the algorithms are given in Table 4.1:

Table 4.1. Confusion matrices of all methods

Model      Confusion Matrix
KNN        513     80
            43   5448
SVC        532     61
            43   5447
DT         482    111
           120   5370
XGB        528     65
            42   5448
Voting     527     66
            43   5447
Stacking   484    109
           125   5365

4.2. Evaluation Metrics:

The following evaluation metrics were computed; the accuracy of each model is given in Table 4.2:

1. Accuracy: Accuracy is the fraction of correctly predicted instances among all instances in a
dataset.
• Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Table 4.2. Accuracy of all models

Model          Accuracy
KNN            0.979944
SVC            0.982903
Decision Tree  0.961203
XGBoost        0.98241
Voting         0.982081
Stacking       0.960217

2. Precision: Precision measures the ratio of true positive predictions to all positive predictions
made by the model.
• Formula: Precision = TP / (TP + FP)

Fig. 4.1. Precision Graph (bar chart of precision by model: KNN 97.96%, SVC 98.27%, Decision Tree 96.20%, XGBoost 98.22%, Voting 98.18%, Stacking 96.33%)

3. Recall (Sensitivity): Recall measures the ratio of true positive predictions to all actual positive
cases in the dataset.
• Formula: Recall = TP / (TP + FN)

Fig. 4.2. Recall Graph (bar chart of recall by model: KNN 97.99%, SVC 98.29%, Decision Tree 96.17%, XGBoost 98.24%, Voting 98.21%, Stacking 96.28%)

4. F1 Score: The F1 Score is the harmonic mean of precision (P) and recall (R), offering a balanced
assessment.
• Formula: F1-score = 2 * (P * R) / (P + R)

Fig. 4.3. F1-Score values (bar chart of F1-score by model: KNN 97.96%, SVC 98.28%, Decision Tree 96.19%, XGBoost 98.23%, Voting 98.19%, Stacking 96.30%)

5. ROC AUC curve: The ROC curve plots the true positive rate against the false positive rate at
different thresholds, while the AUC quantifies the area under this curve.
• Formula: AUC is computed by integrating the area under the ROC curve.
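The first four formulas can be checked against a confusion matrix. The sketch below uses the SVC counts reported in this chapter (532, 61, 43, 5447) and assumes, purely for illustration, that the matrix is laid out as [[TP, FN], [FP, TN]]; that layout is not stated in the source, although accuracy does not depend on it. The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# SVC confusion matrix; the TP/FN/FP/TN assignment is an assumption
m = classification_metrics(tp=532, fn=61, fp=43, tn=5447)
print(round(m["accuracy"], 6))  # 0.982903, matching the reported SVC accuracy
print(m["precision"], m["recall"], m["f1"])
```

Precision, recall and F1 computed for a single class in this way need not equal the figures charted in this chapter, which may be averaged over both classes.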

Fig.4.4. X & Y graph of all methods for accuracy and ROC

Fig. 4.5. Learning Curve of DT

We trained and tested KNN, SVC, DT, XGBoost, and ensemble models (stacking and voting) on a
dataset containing patient information. Metrics such as accuracy, ROC AUC, precision, recall, F1-
score, mean squared error (MSE), and the confusion matrix were computed for every model and are
presented in the tables and figures. The learning curves for the ensemble models demonstrated that
both stacking and voting approaches improved performance as the number of training samples
increased. The voting ensemble model showed a higher training score and a smaller gap between
training and cross-validation scores than the stacking ensemble, indicating better generalization.

Fig. 4.6. Learning Curve of KNN

Fig. 4.7. Learning Curve of SVC
Fig. 4.8. Learning Curve of Stacking

Fig. 4.9. Learning Curve of Voting
Fig. 4.10. learning Curve of

Fig. 4.11. Combined AUCROC curve of all techniques

The ROC AUC curves were compared across all models. According to the AUCROC values
in Table 4.4, XGBoost attained the highest ROC AUC score (0.95161), followed by the
voting ensemble (0.94929), KNN (0.94141), SVC (0.93900), Decision Tree (0.89499), and
the stacking ensemble (0.86657).

4.3. Discussion:

Selecting the most suitable classification techniques for a health dataset necessitates
preprocessing and a deep understanding of the data. Numerous classifiers exist for analyzing
healthcare data, with the choice depending on the intended analysis. Cleaning large amounts
of healthcare data may take a long time and be expensive. The models employed in this study
are applicable to individual healthcare clinics or can be scaled up for broader use. There is a
scarcity of prior research utilizing machine learning to examine paediatric health data
concerning asthma development. Timely identification of asthma in children is imperative for
early implementation of interventions for this chronic respiratory condition.

4.3.1. Comparison Of the Hybrid Technique with Other Techniques:

Several research studies have explored the utilization of machine learning (ML) in asthma
identification. These studies assessed various techniques using metrics like accuracy, recall,
and AUCROC. The main aim is to identify the ML method that offers the most accurate
predictions and effectively detects genuine asthma cases (recall). AUCROC summarises
the model's performance across all classification thresholds.
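As a brief illustration of what AUCROC measures, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. This pairwise formulation is used here in place of numerical integration of the ROC curve; the two give the same value. The function name and the scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for four patients
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of any single decision threshold.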

Table 4.3. Comparison of accuracies

Authors (ref.)  Models/Algorithms  Accuracy       Recall         AUCROC
[5]             CAPE               Not available  Not available  0.71
                CAPP               Not available  Not available  0.82
[6]             Neural Network     0.92           0.94           0.96
                SVM                0.94           0.98           0.90
                RF                 0.92           0.97           0.95
[7]             DT                 Not available  0.68           0.71
                LR                 Not available  0.81           0.87
                NB                 Not available  0.88           0.87
                SVM                Not available  0.59           0.63
[9]             DT                 Not available  0.86           0.86
                RF                 Not available  0.84           0.88
                KNN                Not available  0.84           0.88
                ANN                Not available  0.87           0.90
[14]            CNN                0.98           Not available  Not available
[17]            SVM                0.94           Not available  Not available
                ANN                0.92           Not available  Not available
                RF                 0.92           Not available  Not available
[33]            Neural Network     0.68           Not available  Not available
                XGB                0.728          Not available  Not available
                LGBM               0.721          Not available  Not available
                KNN                0.70           Not available  Not available
[36]            KNN                0.66           0.62           0.93
                DT                 0.86           0.86           0.91
                XGB                0.93           0.92           0.98
                SVC                1              1              1
[39]            SVM                0.88           Not available  Not available
                GBM                0.90           Not available  Not available
                XGB                0.86           Not available  Not available

4.3.2. Our Model Scoring parameters:

Table 4.4 lists the evaluation metrics obtained for each of our models:

Table 4.4. Our model Evaluation metrics

Algorithm   Accuracy   F1-score   SN        AUCROC
KNN         0.979944   0.97964    0.97994   0.94141
DT          0.961203   0.96185    0.96169   0.89499
SVC         0.982903   0.98278    0.98290   0.93900
XGB         0.98241    0.98225    0.98240   0.95161
Ensemble Techniques
Voting      0.982081   0.98192    0.98208   0.94929
Stacking    0.960217   0.96304    0.96284   0.86657

Table 4.4 outlines the performance metrics of the machine learning algorithms and
ensemble techniques on the dataset:

1. Algorithm: This column lists the machine learning algorithms and ensemble
methods used for the classification task. These include KNN, DT, SVC, and
XGB, together with the ensemble techniques Voting and Stacking.
2. Accuracy: SVC achieves the highest accuracy of 0.982903, closely followed by XGB
with an accuracy of 0.98241. The accuracy metric evaluates the proportion of
correctly classified instances out of the total instances, and higher values signify
better performance.
3. F1-score: This metric, the harmonic mean of precision and recall, offers a balanced
evaluation. It ranges from zero to one, with higher scores indicating a better
balance of precision and recall. SVC has the highest F1-score of 0.98278,
followed by XGB (0.98225) and Voting (0.98192).
4. Sensitivity (SN): Sensitivity, also known as recall, measures the proportion of actual
positive cases correctly identified by the model. All algorithms and ensemble
techniques exhibit high sensitivity values, with SVC leading at 0.98290.
5. AUCROC: The AUCROC, a performance statistic for binary classification tasks,
measures the model's ability to distinguish between positive and negative classes at
different threshold levels. Values range from 0 to 1, with values closer to 1
denoting superior discrimination. XGB has the highest AUCROC of 0.95161,
indicative of strong discriminatory capacity.
6. Ensemble Techniques: Ensemble techniques combine predictions from multiple
models to enhance performance. The table evaluates two such techniques: Voting and
Stacking, yielding accuracies of 0.982081 and 0.960217, respectively.

Overall, the findings indicate that SVC and XGB lead on accuracy, F1-score, and
sensitivity, and that XGB attains the highest AUCROC, with KNN close behind. The
Voting ensemble is competitive with the best individual algorithms, whereas Stacking
scores lower, with an accuracy close to that of the Decision Tree. These results furnish
valuable insights for selecting the most appropriate algorithm or ensemble approach
tailored to the specific classification task.

Chapter-5

V. Future Direction

Using machine learning techniques shows promise for predicting asthma exacerbations with
high accuracy, but there's still a need to explore more methods in the future. While several
models have been developed, only a few have been put into practical use. Hence, it's crucial
to enhance the adaptability of prediction models across various large datasets. Practicality is
key here. Simplifying models with just a handful of predictors, using easily accessible data,
could make them more practical. Additionally, integrating machine learning algorithms into
user-friendly software or systems would facilitate their transition from research to practical
applications. Furthermore, randomised controlled trials are needed to determine whether these
models truly assist asthma patients by preventing exacerbations. There's potential in creating
models capable of predicting asthma exacerbations during a child's clinical visit, offering
real-time insights for clinicians. It would be highly beneficial if such models could provide
predictions for children under the age of two, enabling early intervention with appropriate
medications. Moreover, exploring various classifiers like artificial neural networks,
multilayer perceptrons, and linear regression could provide valuable insights into prediction
probabilities.

5.1. Machine learning in Health sector:

The next generation of healthcare providers must acquire a solid understanding of the
fundamentals, potentials, and terminology related to machine learning (ML) due to its rapid
advancements and growing use in healthcare. Understanding ML algorithms and related
terms will empower them to comprehend and analyze relevant literature, as well as engage in
research utilizing ML techniques[6], [11]. It's essential to educate professionals across
various healthcare domains, including public health, epidemiology, clinical practice,
pathology, and radiology, about ML terms. Given the interconnectedness of data science and
epidemiology, it's equally crucial to train public health professionals who possess a strong
grasp of epidemiological concepts. Incorporating certain ML and data science concepts into
the medical curriculum in the long run is recommended to ensure that future healthcare
professionals are well-equipped to leverage the potential of ML in improving healthcare
outcomes[5].

5.2. Conclusion:

The early identification of asthma patients at risk of experiencing aggravations is crucial for
providing timely intervention and close monitoring. This study highlights the effectiveness
of machine learning in predicting asthma exacerbations. However, it is important for future
research to focus on improving the applicability and generalizability of these models, making
them more suitable for integration into clinical practice. Asthma is a complex condition with
numerous risk factors, making it particularly challenging to diagnose in children under the
age of six. Machine learning presents a promising approach to developing predictive models
for childhood asthma by utilizing large datasets and outperforming traditional regression
methods. Previous studies, however, differed in their definitions of asthma, study
populations, predictors considered, age of prediction, feature selection methods, and ML
algorithms employed, thus introducing a potential risk of bias. Although these studies
achieved great precision, there were indications of overfitting due to small sample sizes.
Additionally, none of the studies externally validated their models, further undermining their
reliability. These limitations highlight the necessity for future research to focus on enhancing
the accuracy and applicability of machine learning models for predicting childhood asthma.
By addressing these challenges, machine learning has the potential to significantly improve
the early detection and management of asthma in children. Compared to alternative
classifiers, the suggested model demonstrates significant enhancements in recall, precision,
and accuracy metrics, especially for the poorly-controlled class. Moreover, the findings
suggest that ensemble learning, and similar machine learning approaches hold significant
promise for integrating prognostic systems in identifying asthma control levels, especially
when supplemented by medical expertise.
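
The per-class comparison mentioned above can be illustrated with a minimal sketch. It is not
the evaluation code used in this work: the function name per_class_metrics, the two class
labels, and the ten example predictions are all hypothetical and chosen only to show why
overall accuracy can hide weak recall on the poorly-controlled class.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a {label: (precision, recall)} mapping."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    true_count = Counter(y_true)   # how often each label actually occurs
    pred_count = Counter(y_pred)   # how often each label is predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / len(y_true)
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        recall = correct[label] / true_count[label] if true_count[label] else 0.0
        metrics[label] = (precision, recall)
    return accuracy, metrics


# Hypothetical labels for ten patients (illustration only, not study data).
WELL, POOR = "well-controlled", "poorly-controlled"
y_true = [WELL] * 6 + [POOR] * 4
y_pred = [WELL] * 5 + [POOR] * 3 + [WELL] * 2

accuracy, metrics = per_class_metrics(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

In this made-up example the classifier misses half of the poorly-controlled patients even
though most predictions are correct, which is why the class-wise figures are reported
alongside accuracy.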

References:

[1] D. Shi, C. DiStefano, H. L. McDaniel, and Z. Jiang, “Examining Chi-Square Test
Statistics Under Conditions of Large Model Size and Ordinal Data,” Structural
Equation Modeling, vol. 25, no. 6, pp. 924–945, Nov. 2018, doi:
10.1080/10705511.2018.1449653.

[2] S. Ambekar and R. Phalnikar, “Disease risk prediction by using convolutional neural
network,” in 2018 Fourth International Conference on Computing Communication
Control and Automation (ICCUBEA), IEEE, Aug. 2018, pp. 1–5.

[3] M. Lovrić, I. Banić, E. Lacić, K. Pavlović, R. Kern, and M. Turkalj, “Predicting
treatment outcomes using explainable machine learning in children with asthma,”
Children, vol. 8, no. 5, May 2021, doi: 10.3390/children8050376.

[4] A. Alanazi, “Using machine learning for healthcare challenges and opportunities,”
Informatics in Medicine Unlocked, vol. 30. Elsevier Ltd, Jan. 01, 2022. doi:
10.1016/j.imu.2022.100924.

[5] J. Finkelstein and I. cheol Jeong, “Machine learning approaches to personalize early
prediction of asthma exacerbations,” Ann N Y Acad Sci, vol. 1387, no. 1, pp. 153–165,
Jan. 2017, doi: 10.1111/nyas.13218.

[6] A. Yahyaoui and N. Yumusak, “Deep and Machine Learning towards Pneumonia and
Asthma Detection,” in 2021 International Conference on Innovation and Intelligence
for Informatics, Computing, and Technologies, 3ICT 2021, Institute of Electrical and
Electronics Engineers Inc., Sep. 2021, pp. 494–497. doi:
10.1109/3ICT53449.2021.9581963.

[7] G. Zhen, L. Yingying, X. Weifang, and D. Jingcheng, “A bibliometric and scientific
knowledge map study of the drug therapies for asthma-related study from 1982 to
2021,” Frontiers in Pharmacology, vol. 13. Frontiers Media S.A., Oct. 03, 2022. doi:
10.3389/fphar.2022.916871.

[8] M. S. Kim, J. H. Lee, Y. J. Jang, C. H. Lee, J. H. Choi, and T. E. Sung, “Hybrid deep
learning algorithm with open innovation perspective: A prediction model of asthmatic
occurrence,” Sustainability (Switzerland), vol. 12, no. 15, Aug. 2020, doi:
10.3390/su12156143.

[9] L. S. Becirovic, A. Deumic, L. G. Pokvic, and A. Badnjevic, “Aritificial Inteligence
Challenges in COPD management: A review,” in BIBE 2021 - 21st IEEE International
Conference on BioInformatics and BioEngineering, Proceedings, Institute of Electrical
and Electronics Engineers Inc., 2021. doi: 10.1109/BIBE52308.2021.9635374.

[10] W. K. D. Jayamini, F. Mirza, M. A. Naeem, and A. H. Y. Chan, “State of
Asthma-Related Hospital Admissions in New Zealand and Predicting Length of Stay Using
Machine Learning,” Applied Sciences (Switzerland), vol. 12, no. 19, Oct. 2022, doi:
10.3390/app12199890.

[11] W. Akbar, W. P. Wu, M. Faheem, M. A. Saleem, N. A. Golilarz, and A. U. Haq,
“Machine learning classifiers for asthma disease prediction: a practical illustration,” in
2019 16th International Computer Conference on Wavelet Active Media Technology
and Information Processing, IEEE, Dec. 2019, pp. 143–148.

[12] J. L. Harvey and S. A. Kumar, “Machine learning for predicting development of
asthma in children,” in 2019 IEEE Symposium Series on Computational Intelligence
(SSCI), IEEE, Dec. 2019, pp. 596–603.

[13] A. Hussain et al., “Forecast the Exacerbation in Patients of Chronic Obstructive
Pulmonary Disease with Clinical Indicators Using Machine Learning Techniques,”
Diagnostics, 2021, doi: 10.3390/diagnostics.

[14] D. Patel, G. L. Hall, D. Broadhurst, A. Smith, A. Schultz, and R. E. Foong, “Does
machine learning have a role in the prediction of asthma in children?,” Paediatric
Respiratory Reviews, vol. 41. W.B. Saunders Ltd, pp. 51–60, Mar. 01, 2022. doi:
10.1016/j.prrv.2021.06.002.

[15] D. M. Kothalawala et al., “Development of childhood asthma prediction models using
machine learning approaches,” Clin Transl Allergy, vol. 11, no. 9, Nov. 2021, doi:
10.1002/clt2.12076.

[16] M. Bari Antor et al., “A Comparative Analysis of Machine Learning Algorithms to
Predict Alzheimer’s Disease,” J Healthc Eng, vol. 2021, 2021, doi:
10.1155/2021/9917919.

[17] P. D. Terry, R. E. Heidel, and R. Dhand, “Asthma in adult patients with COVID-19:
prevalence and risk of severe disease,” American Journal of Respiratory and Critical
Care Medicine, vol. 203, no. 7. American Thoracic Society, pp. 893–905, Apr. 01,
2021. doi: 10.1164/rccm.202008-3266OC.

[18] A. L. Yadav, K. Soni, and S. Khare, “Heart Diseases Prediction using Machine
Learning,” in 2023 14th International Conference on Computing Communication and
Networking Technologies, ICCCNT 2023, Institute of Electrical and Electronics
Engineers Inc., 2023. doi: 10.1109/ICCCNT56998.2023.10306469.

[19] S. Xiong, W. Chen, X. Jia, Y. Jia, and C. Liu, “Machine learning for prediction of
asthma exacerbations among asthmatic patients: a systematic review and
meta-analysis,” BMC Pulm Med, vol. 23, no. 1, Dec. 2023, doi:
10.1186/s12890-023-02570-w.

[20] R. Ullah, S. Khan, H. Ali, I. I. Chaudhary, M. Bilal, and I. Ahmad, “A comparative
study of machine learning classifiers for risk prediction of asthma disease,”
Photodiagnosis Photodyn Ther, vol. 28, pp. 292–296, Dec. 2019, doi:
10.1016/j.pdpdt.2019.10.011.

[21] J. G. Zein, C. P. Wu, A. H. Attaway, P. Zhang, and A. Nazha, “Novel Machine
Learning Can Predict Acute Asthma Exacerbation,” Chest, vol. 159, no. 5, pp. 1747–
1757, May 2021, doi: 10.1016/j.chest.2020.12.051.

[22] B. S. Agnikula Kshatriya et al., “Identification of asthma control factor in clinical
notes using a hybrid deep learning model,” BMC Med Inform Decis Mak, vol. 21, Nov.
2021, doi: 10.1186/s12911-021-01633-4.

[23] A. M. Pescatore et al., “A simple asthma prediction tool for preschool children with
wheeze or cough,” Journal of Allergy and Clinical Immunology, vol. 133, no. 1, 2014,
doi: 10.1016/j.jaci.2013.06.002.

[24] K. C. Tsang, H. Pinnock, A. M. Wilson, and S. A. Shah, “Application of machine
learning to support self-management of asthma with mHealth,” in 2020 42nd Annual
International Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC), IEEE, Jul. 2020, pp. 5673–5677.

[25] M. S. Kim, J. H. Lee, Y. J. Jang, C. H. Lee, J. H. Choi, and T. E. Sung, “Hybrid deep
learning algorithm with open innovation perspective: A prediction model of asthmatic
occurrence,” Sustainability (Switzerland), vol. 12, no. 15, Aug. 2020, doi:
10.3390/su12156143.

[26] J. Finkelstein and I. cheol Jeong, “Machine learning approaches to personalize early
prediction of asthma exacerbations,” Ann N Y Acad Sci, vol. 1387, no. 1, pp. 153–165,
Jan. 2017, doi: 10.1111/nyas.13218.

[27] B. Mali, S. Dhal, and A. K. Das, “Diagnosis of Asthma in Children Based on
Symptoms: A Machine Learning Approach,” in IEEE Region 10 Annual International
Conference, Proceedings/TENCON, Institute of Electrical and Electronics Engineers
Inc., 2021, pp. 782–787. doi: 10.1109/TENCON54134.2021.9707283.

[28] B. S. Agnikula Kshatriya et al., “Identification of asthma control factor in clinical
notes using a hybrid deep learning model,” BMC Med Inform Decis Mak, vol. 21, Nov.
2021, doi: 10.1186/s12911-021-01633-4.

[29] B. D. C. N. Prasad, P. E. S. N. Krishna Prasad, and Y. Sagar, “A Comparative Study of
Machine Learning Algorithms as Expert Systems in Medical Diagnosis (Asthma).”

[30] A. Badnjevic, L. Gurbeta, and E. Custovic, “An Expert Diagnostic System to
Automatically Identify Asthma and Chronic Obstructive Pulmonary Disease in
Clinical Settings,” Sci Rep, vol. 8, no. 1, Dec. 2018, doi: 10.1038/s41598-018-30116-2.

[31] M. A. Awal et al., “An Early Detection of Asthma Using BOMLA Detector,” IEEE
Access, vol. 9, pp. 58403–58420, 2021, doi: 10.1109/ACCESS.2021.3073086.

[32] M. Payal, T. Ananth Kumar, and S. A. Ajagbe, “Support Vector Machines (SVMS)
Based Advanced Healthcare System Using Machine Learning Techniques,” International
Journal of Innovative Research in Computer and Communication Engineering, 2022,
doi: 10.15680/IJIRCCE.2022.1005020.

[33] K. Taunk, S. De, S. Verma, and A. Swetapadma, “A brief review of nearest neighbor
algorithm for learning and classification,” in 2019 International Conference on
Intelligent Computing and Control Systems (ICCS), IEEE, May 2019, pp. 1255–1260.

[34] M. Pyingkodi et al., “Asthma Disease Risk Prediction Using Machine Learning
Techniques,” in 2023 International Conference on Computer Communication and
Informatics, ICCCI 2023, Institute of Electrical and Electronics Engineers Inc., 2023.
doi: 10.1109/ICCCI56745.2023.10128635.

[35] S. Bharati, P. Podder, and M. R. H. Mondal, “Hybrid deep learning for detecting lung
diseases from X-ray images,” Inform Med Unlocked, vol. 20, Jan. 2020, doi:
10.1016/j.imu.2020.100391.

[36] R. Khasha, M. M. Sepehri, and S. A. Mahdaviani, “An ensemble learning method for
asthma control level detection with leveraging medical knowledge-based classifier and
supervised learning,” J Med Syst, vol. 43, no. 6, Jun. 2019, doi:
10.1007/s10916-019-1259-8.

Publication Details

1. The patent "Arduino-Based Facemask with Sensors for Predicting Asthma"
(Application No. 202311068112) has been published and describes a new
technology for respiratory health. It analyses breathing patterns and air quality in real
time by incorporating sensors into a facemask, allowing it to predict asthma attacks
early. It uses Arduino technology and offers proactive asthma care for a better
quality of life.
2. The paper "Harnessing Supervised Learning & Genetic Algorithms for Asthma
Prediction" (Paper Id: 1234) by Rajat Rana, Tarun Jangwal, Varsha Sahni, and Vijay
Kumar Garg, presented at the conference ICCS-2023 hosted at LPU, describes
approaches for accurately predicting asthma in patients. This study combines
supervised learning and genetic algorithms, representing a step forward in
personalised healthcare solutions.

Plagiarism report4
ORIGINALITY REPORT

SIMILARITY INDEX: 9%
INTERNET SOURCES: 6%
PUBLICATIONS: 5%
STUDENT PAPERS: 3%

PRIMARY SOURCES

1. Submitted to Wollega University (Student Paper): 1%
2. www.coursehero.com (Internet Source): 1%
3. Submitted to University of Westminster (Student Paper): <1%
4. Justin T McDaniel, D L Albright, K Laha-Walsh, H Henson, S McIntosh. "Alcohol
screening and brief intervention among military service members and veterans:
rural–urban disparities", BMJ Military Health, 2020 (Publication): <1%
5. www.geeksforgeeks.org (Internet Source): <1%
6. www.v7labs.com (Internet Source): <1%
7. Sudi Murindanyi, Margaret Nagwovuma, Barbara Nansamba, Ggaliwango Marvin.
"Explainable Ensemble Learning and Trustworthy Open AI for Customer ...: <1%
