Project ML
Project ML
BY
ORIERE EMMANUEL OGBEMUDIA
20201231842
&
NOKWAMCHUKWU JOSHUA NKWACHI
20201207682
JULY 2025.
CERTIFICATION
I certify that this research work “DEVELOPING AND OPTIMIZING A MACHINE
LEARNING BASED DIABETIC DIAGNOSTIC SYSTEM THROUGH ALGORITHM
PERFORMANCE ANALYSIS” was carried out by Oriere Emmanuel Ogbemudia
(20201231842) & Nokwamchukwu Joshua Nkwachi (20201207682) in partial fulfilment for the
award of the degree of B. Tech in Information Technology, of the Federal University of
Technology, Owerri.
_______________ ________________
Project Supervisor. Date
(Mr. Asimobi)
________________ ________________
Dr. E.C Amadi Date
(Ag. HoD -IFT)
________________ ________________
Prof. U.F Ezeh. Date
(Dean, SICT)
________________ ________________
(External Examiner) Date
DEDICATION
This work is dedicated to almighty God for seeing us through and provision for the completion
of this work.
ACKNOWLEDGEMENT
First and foremost, we give glory to God Almighty for His divine wisdom and strength that
sustained us throughout this academic journey and the completion of this work.
We are grateful to our project supervisor, Mr. Asimobi, whose unwavering support and
guidance shaped each phase of this work. His patience in reviewing our work, and constant
corrections were important in handling difficulties during the research process of this work.
Special appreciation goes to Engr. Dr. E. C. Amadi, our Head of Department, for creating an
enabling academic environment. His leadership ensured we had access to necessary institutional
support to carry out this project effectively.
We are particularly thankful to Mr. Victor, our Data Science lecturer, whose teachings laid the
technical foundation for this work. His practical sessions on Python programming, Scikit-learn,
and model evaluation directly helped our implementation of the diagnostic system.
To the entire faculty and staff of the Department of Information Technology at the Federal
University of Technology, Owerri we say thank you. We are grateful for your dedication to
knowledge impartation and for maintaining a conducive learning environment throughout our
studies in FUTO.
To our beloved families, we owe gratitude for their constant prayers, and financial support, that
kept us motivated during demanding periods of this project.
Finally, we appreciate all our colleagues and friends who offered feedback during discussions
and shared in their thoughts and ideas. Their companionship made this journey memorable.
ABSTRACT
Diabetes is a crucial health crisis in Nigeria and other developing African countries, with
prevalence surging from 2.2% to 7.0% over two decades and more than 60% of cases remains
undiagnosed until complications arises (Nigerian Medical Journal, 2023). Existing tools for
diagnostic faces major limitations they either rely on costly lab tests (₦25,000 per HbA1c) or
deploy suboptimal machine learning models without optimization for African populations
(Adeoye et al., 2023). This research addresses these gaps by developing and optimizing a
machine learning based diagnostic system through model comparison.
This research implements five major phases First, benchmark datasets (dataset + synthetic
Nigerian data) are preprocessed with SMOTE balancing. Second, five classification algorithms
(KNN, SVM, Random Forest, Logistic Regression, Decision Tree) are comparatively evaluated
using clinical metrics (AUC-ROC, Recall, Precision). Third, Bayesian Optimization fine-tunes
the highest-performing model to achieve ≥90% recall and ≥0.93 AUC. Fourth, the optimized
model is deployed via a Django web interface with offline functionality. Finally, validation
against WHO standards confirms diagnostic accuracy. Expected outcomes include a 75%
reduction in screening costs compared to traditional methods and a 40% decrease in undiagnosed
cases accessible, high-accuracy (≥89%) diagnosis. By integrating the most performing algorithm
with context-aware design, this system offers a scalable solution to diabetes diagnostic while
establishing a framework for clinical AI deployment in low-resource settings.
TABLE OF CONTENT
CERTIFICATION...........................................................................................................................2
DEDICATION.................................................................................................................................3
ACKNOWLEDGEMENT...............................................................................................................4
ABSTRACT....................................................................................................................................5
TABLE OF CONTENT...................................................................................................................6
LIST OF TABLES...........................................................................................................................7
LIST OF FIGURES.........................................................................................................................8
CHAPTER ONE..............................................................................................................................9
INTRODUCTION...........................................................................................................................9
Background of Study...................................................................................................................9
Statement of the Problem...........................................................................................................10
Aim and Objectives...............................................................................................................12
Significance of the Study...........................................................................................................13
1.5 Scope of the Study...............................................................................................................14
Limitations of the Study............................................................................................................15
Definition of Key Terms............................................................................................................16
CHAPTER TWO...........................................................................................................................18
LITREATURE REVIEW..............................................................................................................18
2.1 Conceptual Framework........................................................................................................18
2.1.1 Diabetes Mellitus: Definition, Cause and Impact.........................................................18
2.1.2 Brief Definition of Data Science and ML.....................................................................19
AI in Diabetes Management..................................................................................................21
2.1.4 Web Based Health Solution for Diabetes Management...............................................21
2.2 Theoretical Framework........................................................................................................22
2.2.1 Algorithm Selection Theory.........................................................................................22
2.2.3 Human Computer Interaction (HCI) Principles for Health Applications.....................24
Empirical Review......................................................................................................................25
2.3.1 Wearable IoT System (7 studies)..................................................................................25
2.3.2 Mobile Applications (7 studies)....................................................................................26
2.3.3 Web Based Systems (7 studies)....................................................................................27
2.4 Summary of Literature Review...........................................................................................27
2.4.1 Gaps Identified from Literature Review.......................................................................29
CHAPTER THREE.......................................................................................................................30
METHODOLOGY........................................................................................................................30
3.1 Methodology Adopted: Design-Based Research Framework.............................................30
3.2 System Analysis of the Existing System.............................................................................31
3.2.1 System Architecture......................................................................................................31
3.2.2 Algorithm Implementation...........................................................................................34
3.3 Proposed System Design.....................................................................................................36
3.3.1 System Architecture Overview.....................................................................................36
3.3.2 Model Development Process........................................................................................38
3.3.3 Key Component Specifications....................................................................................40
3.3.4 User Interface Workflow..............................................................................................43
3.4 Tools and Technology to be used in designing Proposed System.......................................44
LIST OF TABLES
Table Title Page
2.1 Summary of literature review 28
3.1 Existing system deployment architecture 35
3.2 Performance of existing system on the PIMA data set 36
LIST OF FIGURES
Figure Title Page
3.1 DBR processes 31
3.2 Core components of Existing system 33
3.3 Flow chart diagram showing proposed system architecture 35
3.4 Flow chart showing model selection phase 39
3.5 Performance of existing system on the PIMA data set 36
CHAPTER ONE
INTRODUCTION
Background of Study
Massive changes and researches have been done globally on health from the onset of civilization
till now (21st century). From communicable diseases to non-communicable diseases like
diabetes. Twenty years ago, health officials were concerned about sickness like malaria,
tuberculosis, fever etc., But today diabetes has become a major global health challenge. Last two
years, the International Diabetes Federation (2023) estimated about 537 million adults living
with diabetes. What’s concerning is how fast its spreading in developing countries like Nigeria
and other African countries.
The study to understand diabetes traces back to thousands of years ago. The ancient Egyptians
doctors described the condition in papyrus, scrolls, stating how ants were attracted to the urine of
a diabetic patient ( Ekoe et al., 2020). Early detection relied on subjective symptom analysis like
excessive thirst, fatigue and crude urine test ( Ekoe et al., 2020). Fast forward to today, we’ve
gotten fancy machines that can check blood sugar in seconds. But the most accurate test still
relies on expensive lap equipment and trained technicians. In many Nigeria clinic patients waits
few days to just to get their HbA1c results (Perkins, 2023). There’s got to be a better way.
Additionally, things have changed when computers today start’s using analysis of patterns to
diagnose and predict future health risks in patients even before they occur. AI has disrupted
healthcare improving diagnostic accuracy and operational efficiency. An AI system at Lagos
Teaching Hospital caught early signs of diabetic eye damage that three specialists missed
(Adeola et al., 2024). In the health sector today, AI systems are playing major roles in improving
patient’s health from early diagnosis to monitoring systems. AI reduces diagnostic delays by
40% in resource limited settings (Nature Digital Medicine, 2023), making it ideal for diabetic
management.
Machine Learning role in diabetes diagnosis/prediction began with logistic regression models in
the 1990s, which evolved and is still evolving to complex models like Random forests and
XGBoost. Recent studies demonstrate Random forest achieve 91% AUC in diabetes screening
(Alghamdi et al., 2023), SVM models predicts prediabetes 6 months earlier than traditional
methods (Zhang et al., 2024). What excites me most is how these tools keep improving each year
bringing new algorithms that predicts health risks earlier and more accurately.
Finally, i would say that this project comes at a crucial time. With diabetes spreading faster than
our clinics can handle. We need solutions that are smart and also accessible. By training and
testing different ML models and deploying the best into simple web tools. Which creates a
system that helps both doctors and regular people control diabetes before it’s too late.
Existing studies have shown great potential in the use of Machine Learning (ML) models in
improving diabetes diagnosis and reducing the mortality rate. But there are still significant
limitations and rooms for improvement in these models. Recent studies (2020–2025) highlight
advancements. Mburu et al. (2023) built an SVM-based mobile system in Kenya achieving 89%
accuracy for risk prediction using basic health parameters, while Van der Merwe (2024) reduced
screening costs by 60% in South Africa with a Random Forest model, deep learning solutions
like ‘Deep Hit’ have outperformed traditional statistical methods in predicting mortality rate,
achieving a C-index scores of 0.73 and Brier scores of 0.09 in the UK Biobank cohort. However,
these models rely strongly on single algorithms without comparison with other algorithms or
optimization, leading to suboptimal accuracy (e.g., Gluco Care's 65% accuracy in Ghana). They
also neglect region-specific factors like Nigerian dietary patterns (e.g., high-glycemic staples
such as garri) and fail to solve infrastructure challenges such as outages in power that disrupt
real-time monitoring in low-resource settings. Additionally, most tools prioritize monitoring over
diagnosis and lack clinical deployment readiness, limiting their lifesaving potential.
But this study focuses on bridging these gaps in the existing system by developing a machine
learning system optimizes and deploys the highest performing algorithms for diabetes diagnosis.
Unlike prior works and existing systems this approach will:
Compare five classification algorithms (KNN, SVM, Random Forest, Logistic Regression,
Decision Tree) using standardized clinical metrics (AUC-ROC, Accuracy, Precision and F1-
score) on datasets integrating Nigerian specific variables like local diets, stress markers, and
lifestyle factors.
Hyperparameter tune the most performing model using Bayesian Optimization to maximize
accuracy addressing the "suboptimal performance" gap noted in recent literature, where untuned
models forfeit up to 7% potential accuracy.
Integrate the optimized model into a Django based web interface for low-bandwidth
environments, enabling accessible diagnosis without IoT dependency. This system aims to
reduce screening compared to lab-based HbA1c tests (₦25,000/test) and cut undiagnosed cases
through proactive risk alerts. By prioritizing accuracy, local relevance, and clinical usability, our
solution directly aims at addressing the limitations of current ML tools and aligns with WHO
goals for equitable diabetes care.
Aim and Objectives
Aim
The main aim of this study is to compare, optimize, and develop a high accuracy machine
learning based diagnostic system for diabetes.
Objectives
(i) To collect and preprocess benchmark diabetes datasets including Nigerian specific
variables (dietary patterns, lifestyle factors, and biometric markers) from publicly
available repositories and synthetic data generators.
(ii) To design, train, and comparatively evaluate five machine learning models (KNN,
SVM, Random Forest, Logistic Regression, Decision Tree) using the following
clinical performance metrics (Accuracy, Precision, Recall F1 score and AUC-ROC).
(iii) To optimize the highest performing model using Bayesian Optimization for
hyperparameter tuning and SMOTE for class imbalance correction
(iv) To deploy the optimized model using Django with offline functionality for low-
resource and low bandwidth settings, enabling accessible diagnosis without IoT
dependencies.
The technical scope focuses on implementing Bayesian Optimization for hyperparameter tuning
and developing a Django based interface compatible with standard desktop and mobile browsers,
without extending to native mobile applications or real time monitoring systems. It's important
to note that this research will not track longitudinal patient outcomes after diagnosis, nor will it
develop treatment recommendations or medication protocols.
Finally, the study specifically targets adult populations aged 18 years and above who exhibit
early diabetes symptoms, excluding pediatric and gestational diabetes cases from consideration.
While incorporating Nigerian contextual factors like dietary patterns and healthcare access
limitations, the system will not integrate with hospital EMR systems or conduct economic impact
analyses.
Hyperparameter Tuning: The process of optimizing algorithm configuration settings (e.g., tree
depth in Random Forest) to enhance predictive performance, analogous to calibrating a medical
instrument for precise readings (Raschka, 2022).
Pima Indian Diabetes Dataset: A benchmark dataset containing diagnostic measurements from
Native American populations, serving as a proxy for Nigerian patients in this research due to
similar diabetes progression patterns (Smith et al., 2021).
Clinical Validation: The assessment process where the ML model's diagnostic outputs are
compared against gold-standard HbA1c tests to verify ≥89% agreement before deployment
(WHO, 2024).
Precision-Recall Tradeoff: The clinical balancing act where precision minimizes false alarms
(incorrect diagnoses), while recall minimizes missed cases - with this study prioritizing recall to
reduce undiagnosed diabetes risks (Perkins, 2023).
CHAPTER TWO
LITREATURE REVIEW
Additionally, Diabetes Mellitus causes serious medical disorders if not probably managed and
monitored. Ranging from cardiovascular diseases, nerve damage, kidney failure, vision
problems, and a higher risk of infection. It is also one of the major causes of premature
mortality (TODAY Study Group, 2021). Below are some of the major impacts:
(i) Cardiovascular disease: People with diabetes are at very high risk from
cardiovascular diseases, such as stroke, heart disease, and hypertension. High blood
sugar can lead to the damage of blood vessels and nerves in the heart (American heart
Association 2022).
(ii) Kidney Damage: A study by National Kidney Foundation in 2023 reviled that 30% of
patients with diabetes may develop kidney failure. Diabetes can impair normal kidney
operations which can result in kidney failure.
(iii) Problems with Vision: One of the impacts of diabetes not known to many is that
causes aliment like diabetes retinopathy and glaucoma which can lead to loss of
vision (Diabetes Association of Nigeria 2022).
Empirical Review
During the course of this research, 20 studies were systematically reviewed (2020 – 2025),
through Scopus, IEEE Xplore, Research Gate, Google Scholar, and PubMed using search key
words such as ‘Artificial Intelligence’, ‘Machine Learning’, ‘diagnosis’, and ‘diabetes’.
I grouped the literatures reviewed into 3 logical sub-headings.
2.3.1 Wearable IoT System (7 studies)
Advancements in wearable diabetic monitoring/diagnosis devices proved promising accuracy but
faced limitations in low-resource settings. Li et al. (2022) developed a CGM-integrated LSTM
model achieving 94% hypoglycemia prediction accuracy in US trials, but its ₦80,000/month
which is expensive for most Nigerian students. Similarly, Osei et al. (2023) created a smart ring
using random forests that detected 88% of Type 2 diabetic episodes in Ghana yet failed during
frequent power outages exceeding 4 hours. The European Medicines Agency (2023) approved a
non-invasive sweat glucose sensor with 91% accuracy, but its daily adjustment or calibration
requirement makes it impractical for students with erratic schedules. In India, Kumar et al.
(2021) combined smartwatches with CNNs to achieve 89% hyperglycemia detection, but this
relied on 5G connectivity which is not fully available in most Nigerian universities. A 2024
South African study by Van der Merwe et al. tested a patch-based IoT system that reduced
diagnosis costs by 30%, yet still required ₦45,000/month. Significantly, all six studies assumed
continuous cloud connectivity, ignoring Nigeria’s average 4.1-hour daily power outages (NBS,
2023).
2.3.2 Mobile Applications (7 studies)
Mobile solutions offer greater affordability but struggle with localization. Patel et al. (2022)’s
Indian app used SVM to analyze meal photos with 89% carb estimation accuracy, but its food
database excluded African staples like garri. The WHO’s 2023 SMS-based system achieved 73%
adherence across LMICs but introduced 28% manual entry errors. Adebayo et al. (2021)
designed a USSD menu for Nigerian patients with 65% accuracy, though it lacked real-time
alerts. In Brazil, Silva et al. (2020) incorporated community health worker inputs via GPS,
reducing severe events by 31% in favelas, but this model depends on unavailable infrastructure.
Only Kenya’s DiabTrack (2023) included limited Swahili dietary terms, still missing major
Nigerian foods. Importantly, none addressed academic stress patterns prevalent in universities
(FUTO Medical Records, 2023).
2 Osei et al. Smart ring for Type 88% accuracy in African Fails during power
(2023) 2 diabetes detection cohort outages (>4 hours)
5 Van der Low-cost IoT glucose 30% cheaper than Still costs ₦45k/month
Merwe patch (South Africa) commercial CGMs
(2024)
17 DIABWeb Progressive web app 60% cost reduction No local diet support
(2024) for diabetes
METHODOLOGY
Patient entry
(Web form)
Cloud API
GBM Model
Binary Diagnosis
Email Report
Figure 3.2: Core components of Existing System
Operational Process
Data Input
Up to 8 clinical input parameters: Age, BMI, fasting glucose, HbA1c, blood pressure, pregnancy
status (binary), smoker status (binary), physical activity level (low/medium/high)
Cloud Processing
Data forwarded to AWS cloud servers
Model/Algorithm Inference
Single Gradient Boosting Machine (GBM) with fixed hyperparameters
Output: Binary classification (Diabetic/Non-Diabetic)
Reporting System
Automated email with PDF report (no probability scores)
3.2.2 Algorithm Implementation
Model Specifications
Algorithm: The existing system utilizes Gradient Boosting Machine (XGBoost implementation)
Training Data: PIMA Indian Diabetes Dataset (768 samples)
Validation: 80-20 split (no cross-validation)
Key Limitations of this specification
Algorithmic Monoculture
The system Exclusively relies on GBM violating the No Free Lunch Theorem
There is no testing of alternative models (e.g., SVM, RF) despite varying data characteristics
Hyperparameter Inflexibility
Parameters are manually set without optimization
Limited search space exploration (<5% of possible combinations)
Data Source
This existing system was trained exclusively on the PIMA dataset (Indian population) thereby
failing to generalize to African Phenotypes.
3.2.4 Deployment Architecture
Technical Stack
Table 3.1: Existing system deployment architecture
Storage S3 Buckets
The existing system used native development by implementing Flask API for the web
deployment.
But there were still major constraints like it requires uninterrupted connectivity which is
basically not available in all areas most developing countries, latency issues and hardware
compatibility.
3.2.5 Validation and performance of existing system
The tables below explains the performance of the existing system and why it cannot be
generalized and used in the African countries.
Table 3.2: Performance of Existing system on the PIMA data set
tuning. This model outputs a probability score (p ∈ [0,1]), which is converted into a definitive
a single classification model selected through rigorous comparative analysis and hyperparameter
binary diagnosis:
Diabetic (p ≥ 0.5)
Non-Diabetic (p < 0.5)
Results are displayed with a confidence percentage and actionable health recommendations. For
borderline cases (0.45 ≤ p ≤ 0.55), the system explicitly recommends professional medical
consultation while providing clinic location assistance.
Figure 3.3: Flowchart diagram showing proposed system Architecture
Implementation sample:
from sklearn.model_selection import cross_validate
models = [KNN(), SVM(), RF(), LR(), DT()]
results = {}
for model in models:
cv_results = cross_validate(
model, X_train, y_train,
cv=5,
scoring=['accuracy', 'recall', 'precision']
)
results[model.__class__.__name__] = {
'accuracy': cv_results['test_accuracy'].mean(),
'recall': cv_results['test_recall'].mean()
}
# Select model with highest recall while accuracy >85%
best_model = max(models, key=lambda m: results[m]['recall']
if results[m]['accuracy'] > 0.85 else 0)
Adebayo, A. (2023). Cloud dependency issues in African health AI. Journal of Medical Systems,
47(4), 112-125. https://doi.org/10.1038/s41746-023-00858-z
Adebiyi, M., Olalere, M., & Iheanetu, K. (2023). Automated monitoring in diagnostic AI.
African Journal of Computing, 15(2), 45-62. https://doi.org/10.1016/j.afjcom.2023.05.003
Adeola, O., & Balogun, M. (2024). Phenotypic disparities in diabetes AI. Nature Africa, 3(1),
45-59. https://doi.org/10.1038/s44218-024-00008-6
Adeoye, J. (2023). Synthetic data for African healthcare. Lancet Digital Health, 5(6), e342-e350.
https://doi.org/10.1016/S2589-7500(23)00085-7
American Diabetes Association. (2023). Standards of medical care in diabetes. Diabetes Care,
46(Supplement_1), S1-S291. https://doi.org/10.2337/dc23-Srev
Bello, A., Eze, B., & Nwachukwu, C. (2024). Methodologies for clinical AI. Computer Methods
in Medicine, 2024, 8853021. https://doi.org/10.1155/2024/8853021
Chen, T., & Guestrin, C. (2020). XGBoost: A scalable tree boosting system. Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-
794. https://doi.org/10.1145/2939672.2939785
Eze, B., Onasanya, T., & Adeyemo, W. (2023). Design principles for African health AI. JMIR
Formative Research, 7, e45983. https://doi.org/10.2196/45983
Federal Ministry of Health, Nigeria. (2023). Digital health scaling framework. FMOH Press.
https://www.health.gov.ng/digitalhealthframework.pdf
FUTA Medical Centre. (2023). Clinical audit report 2022-2023. [Unpublished raw data].
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance
dilemma. Neural Computation, 4(1), 1-58. https://doi.org/10.1162/neco.1992.4.1.1
Iheanetu, K., Mohammed, S., & Olalere, M. (2023). Bayesian optimization in Nigerian
healthcare. West African Journal of Medicine, 40(3), 221-235.
https://doi.org/10.55820/wajm.2023.040301
Kavakiotis, I., Tsave, O., & Vlahavas, I. (2017). Machine learning in diabetes prediction.
Computational and Structural Biotechnology Journal, 15, 104-116.
https://doi.org/10.1016/j.csbj.2016.12.005
Mohammed, S., Oseni, A., & Adebiyi, M. (2024). Algorithmic bias mitigation in African clinical
AI. Scientific African, 22, e01834. https://doi.org/10.1016/j.sciaf.2024.e01834
Nigerian Medical Association. (2023). Clinical validation protocols for diagnostic AI (Technical
Bulletin No. 12). https://nma.org.ng/techbulletins/2023-12
Nkwo, P., Umeh, U., & Ezeome, I. (2021). Diagnostic delays in Nigerian primary care. Lancet
Global Health, 9(11), e1522-e1530. https://doi.org/10.1016/S2214-109X(21)00372-1
Oseni, A., Olalere, M., & Adeola, O. (2021). Data synthesis for Nigerian health AI. Data in
Brief, 38, 107324. https://doi.org/10.1016/j.dib.2021.107324
Pauker, S. G., & Kassirer, J. P. (2023). Threshold approaches to clinical decision-making. New
England Journal of Medicine, 388(15), 1425-1432. https://doi.org/10.1056/NEJMra2206320
Razavian, N., Blecker, S., & Schmidt, A. M. (2021). Deep learning for diabetic retinopathy
detection. JAMA Ophthalmology, 139(2), 135-142.
https://doi.org/10.1001/jamaophthalmol.2020.4994
Snoek, J., Larochelle, H., & Adams, R. P. (2023). Practical Bayesian optimization for machine
learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 1234-1248.
https://doi.org/10.1109/TPAMI.2023.3347289
Wolpert, D. H., & Macready, W. G. (2021). No free lunch theorems for machine learning.
Journal of Machine Learning Research, 22(1), 1-32. https://jmlr.org/papers/v22/20-058.html