Major Project
DR. V. SRINIVAS RAO
Professor, Dean
Department of Computer Science and Engineering
JUNE 2025
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BHARAT INSTITUTE OF ENGINEERING AND TECHNOLOGY
Accredited by NAAC, Accredited by NBA (UG Programmes: CSE, ECE, EEE & Mechanical)
Approved by AICTE, Affiliated to JNTUH Hyderabad
Ibrahimpatnam -501 510, Hyderabad, Telangana
Certificate
This is to certify that the Project work (Phase-2) entitled “DIABETES
PREDICTION USING MACHINE LEARNING TECHNIQUES” is the bonafide
work done
By
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of the task would be
incomplete without the mention of the people who made it possible, whose constant guidance
and encouragement crown all the efforts with success.
We avail this opportunity to express our deep sense of gratitude and hearty thanks to
Sri CH. Venugopal Reddy, Chairman & Secretary of BIET, for providing congenial
atmosphere and encouragement.
We would like to thank Prof. G. Kumaraswamy Rao, Former Director & O.S. of DLRL
Ministry of Defence, Sr. Director R&D, BIET, and Dr. V Srinivasa Rao, Dean CSE, for having
provided all the facilities and support.
We would like to thank our Department Incharge Dr. Deepak, for encouragement at
various levels of our Project.
We are thankful to our Project Coordinator Dr. Rama Prakasha Reddy.Ch, Assistant
Professor, Computer Science and Engineering for her support and cooperation throughout the
process of this project.
We are thankful to our guide Dr. V Srinivas Rao, Professor & Dean, Computer Science
and Engineering, for his sustained and inspiring guidance and cooperation throughout the process of
this project. His wise counsel and suggestions were invaluable.
We express our deep sense of gratitude and thanks to all the Teaching and Non-Teaching
Staff of our college who stood with us during the project and helped us to make it a successful
venture.
We place our highest regards to our Parents, Friends, and Well-wishers, who helped a lot in
making the report of this project.
E.Akhila 21E11A0512
G.Vinitha 21E11A0516
B.Suprathika 21E11A0505
M.Sravani 21E11A0523
Declaration
We hereby declare that this Project Work (phase-2) is titled Diabetes
1.
2.
3.
4.
ABSTRACT
TABLE OF CONTENTS
Contents
Chapter no. Title
2. Related Work
3. Motivation
4. Objectives
5. Problem Statement
6. Design Methodology
6.1 System Architecture
6.2 System Modules
6.3 Requirement Specification
6.4 UML Diagrams
7. Experimental Studies
6.5 Test Cases
6.6 Result Analysis
LIST OF TABLES
Table No. Caption Page No.
2024-2025
Symbol Description
ML Machine Learning
SVM Support Vector Machine
KNN K-Nearest Neighbors
DT Decision Tree
ROC Receiver Operating Characteristic
AUC Area Under Curve
GUI Graphical User Interface
CSV Comma Separated Values
BMI Body Mass Index
PIMA Pima Indian Diabetes Dataset
PCA Principal Component Analysis
EDA Exploratory Data Analysis
1. INTRODUCTION
Diabetes mellitus is a chronic metabolic disorder that has become a
significant global health concern due to its rapidly increasing prevalence. It is primarily
characterized by high blood sugar levels resulting from the body's inability to produce or
effectively use insulin. If left undiagnosed or poorly managed, diabetes can lead to severe
health complications such as heart disease, kidney failure, nerve damage, and vision loss.
Early diagnosis and timely treatment are crucial in reducing the risk of such complications
and improving patient outcomes. Traditional diagnostic methods often involve clinical tests
and expert medical evaluation, which can be time-consuming, costly, and sometimes
inaccessible, especially in rural or underdeveloped areas. With the exponential growth of
healthcare data and advances in computational technologies, machine learning (ML) has
emerged as a promising approach for developing intelligent systems capable of predicting
diseases based on historical data.
Machine learning techniques can identify complex patterns and relationships within large
datasets and use them to make accurate predictions, making them highly effective for medical
diagnosis tasks. This project focuses on building a diabetes prediction
system using various machine learning algorithms such as Support Vector Machine (SVM),
K-Nearest Neighbors (KNN) and Decision Tree. The system is developed using the Pima
Indians Diabetes Dataset, which contains medical records of female patients along with
features like glucose level, blood pressure, insulin level, body mass index (BMI), and age.
The dataset undergoes preprocessing steps including handling missing values, normalization,
and feature selection before being used to train the models. Each model is evaluated using
performance metrics like accuracy, precision, recall, F1-score, and the Area Under the ROC
Curve (AUC) to determine its effectiveness. The goal of this system is to provide an accurate,
efficient, and accessible tool that can aid healthcare professionals and individuals in the early
detection of diabetes, potentially reducing the burden of the disease on both patients and
healthcare systems.
1.2 Components
The diabetes prediction system consists of several key components that contribute to the
overall functionality. Each component plays a crucial role in the data preprocessing, model
training, evaluation, and prediction process. Below is an overview of the primary components:
1. Data Collection
Dataset: The Pima Indians Diabetes Dataset serves as the primary data source. This dataset
contains 768 instances, each with 8 input features and one binary target variable (indicating
the presence or absence of diabetes).
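Attributes: The dataset includes the following features:
Pregnancies: Number of times pregnant.
Glucose: Plasma glucose concentration from an oral glucose tolerance test.
Blood Pressure: Diastolic blood pressure (mm Hg).
Skin Thickness: Triceps skin fold thickness (mm).
Insulin: 2-hour serum insulin level.
BMI: Body mass index (weight in kg / (height in m)^2).
Diabetes Pedigree Function: A function which scores the likelihood of diabetes based on
family history.
Age: Age in years.
Target: The target variable indicates whether the individual has diabetes (1) or not (0).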
2. Data Preprocessing
Data Cleaning: The dataset includes some missing or zero values that were handled using
imputation techniques.
o For example, missing glucose levels were imputed using the median value.
o Zero values in features like glucose and insulin were treated as missing and imputed.
Normalization: Min-Max scaling was applied to rescale the input features to a common range.
Train-Test Split: The dataset was split into 80% training data and 20% testing data, using
stratified sampling to preserve the proportion of diabetic and non-diabetic cases in both
subsets.
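The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the small in-memory table, the column names, and the choice of which columns treat zero as missing are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Small synthetic stand-in for the dataset (values are illustrative only).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 115, 197, 125],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Assumption: a zero in these columns is not a plausible measurement,
# so it is treated as a missing value.
zero_as_missing = ["Glucose", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Stratified 80/20 split keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Median imputation: the medians are computed on the training split only.
medians = X_train.median()
X_train = X_train.fillna(medians)
X_test = X_test.fillna(medians)

# Min-Max scaling: fit on the training data, then apply to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Computing the medians and the scaling range on the training split alone is a design choice that keeps information from the test set out of the training process.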
3. Model Training
Decision Tree: A tree-based model that makes decisions based on feature values, resulting in
a flowchart-like structure.
Support Vector Machine (SVM): A classification algorithm that finds the hyperplane that
best separates the classes.
K-Nearest Neighbors (KNN): A non-parametric algorithm that classifies data based on the
closest points to the test instance.
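To make the K-Nearest Neighbors description concrete, the sketch below classifies a query point by majority vote among its closest training points. It is a from-scratch illustration, not the scikit-learn implementation used in the project; the two-feature points, the labels, and the value k=3 are assumptions chosen for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative, already scaled feature vectors: (glucose, bmi).
train_points = [(0.2, 0.3), (0.25, 0.35), (0.3, 0.2),
                (0.8, 0.7), (0.85, 0.9), (0.9, 0.75)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.82, 0.8)))   # prints 1
print(knn_predict(train_points, train_labels, (0.22, 0.28)))  # prints 0
```

Because the vote depends on distances, features on very different scales would dominate the result, which is one reason the inputs are scaled before training.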
4. Model Evaluation
Accuracy: The proportion of correct predictions out of the total number of predictions.
Precision: The proportion of true positive predictions among all positive predictions made by
the model.
Recall (Sensitivity): The proportion of true positive predictions among all actual positive
instances.
F1-Score: The harmonic mean of precision and recall, providing a balance between the two
metrics.
ROC-AUC: The Area Under the Receiver Operating Characteristic Curve, which represents
the ability of the model to discriminate between the two classes.
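The first four metrics can all be derived from the counts in a confusion matrix, as the short sketch below shows. The counts are hypothetical and are not results from this project; ROC-AUC is left out because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 40 true positives, 10 false positives,
# 14 false negatives, 90 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=14, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.844
# precision: 0.800
# recall: 0.741
# f1: 0.769
```

In a screening setting, recall is often watched closely, since a false negative means a diabetic patient is missed.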
5. Prediction and Deployment
Model Inference: Once the model is trained and evaluated, it can be used to predict whether
a new individual is diabetic or not based on their input features.
Web/Software Interface: A simple interface can be created to allow healthcare providers or
individuals to input their data and receive predictions regarding their diabetes status.
Integration: The trained model can be integrated into healthcare applications or deployed on
the cloud for scalable access.
DECISION TREE (DT):
A Decision Tree is a supervised learning algorithm that is especially useful for classification
tasks such as predicting diabetes. It models decisions and their possible consequences in the
form of a tree-like structure. Starting from the root, the algorithm splits the dataset into
branches based on feature values that result in the best separation of classes. Each internal
node represents a test on an attribute (e.g., "Is glucose level > 120?"), and each leaf node
represents the final decision (diabetic or non-diabetic). Decision Trees are easy to interpret
and visualize, making them suitable for medical applications where explainability is
important.
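How a tree chooses a split such as "Is glucose level > 120?" can be illustrated with a small search for the threshold that gives the purest branches. The sketch uses Gini impurity, which is one common splitting criterion; the glucose values and labels are invented for the example, and a real tree repeats this search over every feature and every node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_threshold(values, labels):
    """Return the (threshold, weighted_gini) that best separates the classes."""
    best = (None, float("inf"))
    unique = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Illustrative data: glucose readings and whether the patient is diabetic.
glucose = [85, 89, 99, 110, 125, 137, 148, 183]
diabetic = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(glucose, diabetic)
print(threshold, impurity)  # prints 117.5 0.0
```

An impurity of 0.0 means both branches contain a single class, so this toy split would become an internal node with two leaf nodes.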
2. RELATED WORK
Over the years, numerous studies have explored the use of machine learning techniques for
diabetes prediction, leveraging clinical datasets such as the Pima Indians Diabetes Dataset.
Researchers have applied a variety of algorithms including logistic regression, decision trees,
support vector machines (SVM), k-nearest neighbors (KNN), and ensemble methods like random
forest to classify individuals as diabetic or non-diabetic. In many studies, random forest and SVM
have demonstrated high accuracy and robustness, particularly when combined with proper
data preprocessing and feature selection. Some researchers have also experimented with deep
learning models such as artificial neural networks (ANN) to capture complex relationships
within the data. Additionally, optimization techniques like grid search and cross-validation
have been widely used to fine-tune model parameters and improve performance. Recent
work has also emphasized the importance of addressing imbalanced datasets and missing
values to enhance the reliability of predictions. Overall, the literature indicates that machine
learning offers promising results for diabetes prediction and continues to evolve with the
integration of hybrid models, real-time prediction systems, and explainable AI to support
clinical decision-making.
3. MOTIVATION
Diabetes mellitus has emerged as one of the most pressing global health concerns, affecting
millions of people and leading to severe health complications if not diagnosed and managed
in a timely manner. The chronic nature of the disease, coupled with its often
asymptomatic early stages, makes early detection critical for effective intervention and
long-term management. Traditional diagnostic approaches, such as blood tests and manual
analysis by healthcare professionals, can be time-consuming, expensive, and prone to
human error, especially in areas with limited access to medical expertise and infrastructure.
This scenario underscores the need for intelligent, automated systems that can assist in
predicting the likelihood of diabetes at an early stage using readily available medical data.
The rapid growth of healthcare data, advancements in computational power, and the
evolution of machine learning technologies present a significant opportunity to transform
diabetes diagnosis through predictive analytics. Machine learning algorithms are capable of
identifying complex, non-linear relationships in medical datasets, making them highly
effective for classifying patients based on risk factors. By training predictive models on
historical patient data, we can develop systems that provide fast, cost-effective, and
accurate predictions, thus enabling timely clinical decisions and reducing the burden on
healthcare systems. Additionally, such predictive models can be integrated into mobile
health applications and telemedicine platforms to reach underserved populations.
The motivation for this project stems from the potential impact of machine learning in
bridging the gap between early diagnosis and disease prevention. Implementing an
intelligent system for diabetes prediction can not only enhance patient outcomes but also
contribute to public health strategies aimed at controlling the global rise of diabetes through
data-driven solutions. Such a system can also complement mobile applications developed to
help people who are unintentionally nonadherent to their medication: because smartphones
have become very common, an application can remind people to take the correct dosage of
medicine on time.
4. OBJECTIVES
The project aims to preprocess the dataset through techniques such as handling missing
values, normalization, and feature selection to ensure high-quality input for model
training. Additionally, the project seeks to analyze the importance of different health
parameters in predicting diabetes and compare model performance using evaluation
metrics such as accuracy, precision, recall, F1-score, and ROC-AUC score. Another key
goal is to create a reliable, interpretable, and cost-effective system that can aid healthcare
professionals in making informed decisions, thereby improving early diagnosis and
patient outcomes.
FEASIBILITY STUDY:
Before initiating the development of the diabetes prediction system using machine learning,
a comprehensive feasibility study was conducted to assess the project's practicality and
ensure its successful implementation. The study evaluated three main aspects: technical,
operational, and economic feasibility.
TECHNICAL FEASIBILITY
OPERATIONAL FEASIBILITY
ECONOMIC FEASIBILITY
TECHNICAL FEASIBILITY:
Technical feasibility focuses on whether the required technology, tools, and resources are
available and suitable for building the system. In this case, the project is highly feasible
from a technical standpoint. The necessary tools, such as Python, Scikit-learn, Pandas, and
Notebook, are open-source and well-documented. The machine learning models being implemented,
such as K-Nearest Neighbors, Decision Tree, and SVM, are well-established and supported by
extensive libraries. Additionally, the Pima Indians Diabetes Dataset is publicly available
and suitable for the classification task. Since the system does not require any advanced
hardware or complex infrastructure, development and testing can be done on standard
computing systems. To determine whether the proposed system is technically feasible, we
took into consideration the technical issues involved in building it. The system uses web
technologies, which are widely employed worldwide today.
OPERATIONAL FEASIBILITY:
Operational feasibility assesses whether the proposed system can be used effectively in a
real-world setting. The system is designed to be user-friendly and can easily be integrated
into healthcare environments. Once developed, it can be used by healthcare professionals
with minimal training to input patient data and receive predictions. The system’s ability to
provide quick and accurate results makes it operationally efficient and useful in medical
decision-making, especially in primary care and remote healthcare settings.
ECONOMIC FEASIBILITY:
Economic feasibility examines the cost-effectiveness of the system. The project is highly
economical as it uses free tools and datasets, reducing development and deployment costs.
Moreover, by enabling early detection of diabetes, the system can help reduce long-term
healthcare costs for patients and medical institutions. The potential benefits, including early
intervention, reduced hospitalization, and improved health outcomes, make the system a
valuable and cost-effective solution for both public and private healthcare providers.
In conclusion, the feasibility study confirms that the project is viable and practical from
technical, operational, and economic perspectives. It supports the development of a reliable,
efficient, and affordable diabetes prediction system that can have a significant impact on
healthcare delivery.
5. PROBLEM STATEMENT
The proposed system aims to utilize machine learning techniques to build an intelligent and
automated solution for predicting the risk of diabetes based on patient health data. Unlike
traditional diagnostic methods that require manual interpretation and lab testing, this system
leverages historical data and trained models to provide fast and accurate predictions. The
system takes key input features such as glucose levels, BMI, age, insulin, blood pressure, and
others, and processes them through various machine learning algorithms like Decision Tree,
KNN (k-nearest neighbors) and Support Vector Machine (SVM). These models are trained on
well-known datasets, such as the Pima Indians Diabetes Dataset, to learn patterns associated
with diabetic and non-diabetic patients. The final model is selected based on performance
metrics such as accuracy, precision, recall, and F1-score. The system also includes steps to
handle missing data, normalize values, and select important features for optimal prediction.
6. DESIGN METHODOLOGY
The system architecture for the diabetes prediction model is structured into five key layers:
data acquisition, preprocessing, model training, prediction, and user interface. First,
healthcare data (e.g., from the Pima Indians Diabetes Dataset) is collected. The
preprocessing layer cleans and normalizes the data, handling missing values and selecting
important features. The processed data is then used in the model training layer, where
supervised machine learning algorithms such as Decision Tree and K-Nearest Neighbors
are applied. Once trained, the model moves to the prediction layer
to evaluate new patient data. Finally, a user-friendly interface allows users to input health
metrics and receive instant diabetes predictions. This architecture is intended to provide
accurate, scalable, and real-time predictions. The architecture, comprising all the modules
of the system, is shown below.
1. Start: The process begins when the system is initiated. This step marks the start of
the diabetes prediction process.
2. Input Patient Data: In this step, the system collects relevant health data from the user.
This typically includes key attributes such as blood glucose levels, BMI (Body Mass Index),
age, insulin, blood pressure, and other health metrics.
3. Data Preprocessing: Once the data is received, it undergoes preprocessing to
ensure that it is clean and ready for the model. This step involves handling any missing
values, normalizing or scaling the data to a standard range, and selecting the most
important features that contribute to predicting diabetes. Proper preprocessing ensures
the model performs effectively.
4. Load Trained ML Model: The preprocessed data is then passed to a trained machine
learning model, such as K-Nearest Neighbors, Support Vector Machine (SVM), or
Decision Tree, that has been trained on historical patient data. This step is where the
system uses the previously trained model to make predictions based on the new input
data.
5. Predict Diabetes Status: The trained model processes the input data and generates a
prediction of whether the patient is likely to have diabetes or not. The model uses the
patterns learned during training to classify the new data.
6. Display Result: The prediction is displayed to the user, informing them of the
likelihood of being diabetic. This result can help healthcare providers in making
informed decisions about further tests or treatment.
7. End: The process concludes after displaying the prediction.
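The flow from input to displayed result can be sketched end to end as below. This is an illustrative outline under stated assumptions, not the deployed code: the three-feature training data, the file name `diabetes_model.pkl`, the helper `predict_status`, and the use of a scikit-learn pipeline to bundle scaling with the classifier are all choices made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Training side (illustrative rows: glucose, bmi, age).
X_train = np.array([
    [85, 26.6, 31], [89, 28.1, 21], [116, 25.6, 30], [78, 31.0, 26],
    [148, 33.6, 50], [183, 23.3, 32], [137, 43.1, 33], [197, 30.5, 53],
])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Bundling the scaler with the classifier means new inputs are scaled
# exactly as the training data was.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# Hypothetical file name, written to a temporary directory for the sketch.
model_path = os.path.join(tempfile.mkdtemp(), "diabetes_model.pkl")
joblib.dump(model, model_path)

def predict_status(patient_features, path=model_path):
    """Steps 2 to 6: take input, load the trained model, predict, format."""
    loaded = joblib.load(path)
    label = int(loaded.predict(np.array([patient_features]))[0])
    return "Likely diabetic" if label == 1 else "Unlikely diabetic"

print(predict_status([190, 35.0, 48]))
print(predict_status([82, 24.0, 25]))
```

In a long-running web server the model would normally be loaded once at startup rather than on every call; loading inside the function here only keeps the sketch close to the numbered steps.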
The technical architecture of the diabetes prediction system includes data collection,
preprocessing, model training (using KNN, SVM, and Decision Tree), and evaluation. The best
model is deployed using Flask, enabling real-time predictions via a web interface. A backend
server processes user inputs, runs predictions, and returns results, ensuring accurate and efficient
diagnosis support.
# Imports used by this part of the script
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
# X_train_scaled, X_test_scaled, y_train, y_test, knn_model and svm_pred come from the
# earlier part of the script, which is not reproduced in this listing.
# Predictions from the already fitted KNN model
knn_pred = knn_model.predict(X_test_scaled)
# Train the Decision Tree and predict on the test set
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)
dt_pred = dt_model.predict(X_test_scaled)
# Accuracy of each model as a percentage
svm_acc = accuracy_score(y_test, svm_pred) * 100
knn_acc = accuracy_score(y_test, knn_pred) * 100
dt_acc = accuracy_score(y_test, dt_pred) * 100
print(f"SVM Accuracy: {svm_acc:.2f}%")
print(f"KNN Accuracy: {knn_acc:.2f}%")
print(f"Decision Tree Accuracy: {dt_acc:.2f}%")
# Precision, recall and F1-score per class
print("\n--- Classification Report: SVM ---")
print(classification_report(y_test, svm_pred))
print("\n--- Classification Report: KNN ---")
print(classification_report(y_test, knn_pred))
print("\n--- Classification Report: Decision Tree ---")
print(classification_report(y_test, dt_pred))
# Confusion matrices, one panel per model
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.heatmap(confusion_matrix(y_test, svm_pred), annot=True, fmt="d", cmap="Blues")
plt.title("SVM Confusion Matrix")
plt.subplot(1, 3, 2)
sns.heatmap(confusion_matrix(y_test, knn_pred), annot=True, fmt="d", cmap="Greens")
plt.title("KNN Confusion Matrix")
plt.subplot(1, 3, 3)
sns.heatmap(confusion_matrix(y_test, dt_pred), annot=True, fmt="d", cmap="Oranges")
plt.title("Decision Tree Confusion Matrix")
plt.tight_layout()
plt.show()
# Accuracy Bar Graph
models = ["SVM", "KNN", "Decision Tree"]
accuracy = [svm_acc, knn_acc, dt_acc]
plt.figure(figsize=(8, 5))
plt.bar(models, accuracy, color=["blue", "green", "orange"])
plt.xlabel("ML Algorithms")
plt.ylabel("Accuracy (%)")
plt.title("Comparison of ML Models for Diabetes Prediction")
plt.ylim(0, 100)
plt.show()
import matplotlib.pyplot as plt
# Accuracy values in percentage
knn_accuracy = 82.5
svm_accuracy = 85.0
dt_accuracy = 80.2
models = ['KNN', 'SVM', 'Decision Tree']
accuracies = [knn_accuracy, svm_accuracy, dt_accuracy]
colors = ['skyblue', 'lightgreen', 'salmon']
plt.figure(figsize=(10, 6))
bars = plt.bar(models, accuracies, color=colors)
# Add accuracy values on top of bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
             f'{acc:.1f}%', ha='center', va='bottom', fontsize=12)
# Graph details
plt.title('Accuracy Comparison of ML Algorithms for Diabetes Prediction', fontsize=14)
plt.xlabel('ML Algorithms', fontsize=12)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.ylim(0, 100)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
6.2 MODULES:
Backend (app.py):
# app.py
from flask import Flask, render_template, request, redirect, url_for, session
import joblib

app = Flask(__name__)
# Hard-coded for the demo; a deployed app should read the key from configuration.
app.secret_key = 'diabetes_secret'

# Load the trained models and the scaler fitted during training
knn = joblib.load("models/knn_model.pkl")
svm = joblib.load("models/svm_model.pkl")
dt = joblib.load("models/dt_model.pkl")
scaler = joblib.load("models/scaler.pkl")

# Dummy login credentials (demo only)
users = {'admin': 'admin123'}

@app.route('/')
def home():
    return render_template("home.html")

@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        uname = request.form['username']
        pwd = request.form['password']
        if uname in users and users[uname] == pwd:
            session['user'] = uname
            return redirect(url_for('predict'))
        else:
            return render_template('login.html', error="Invalid Credentials")
    return render_template('login.html')

@app.route('/predict', methods=['GET', 'POST'])
def predict():
    # Only logged-in users may reach the prediction page
    if 'user' not in session:
        return redirect(url_for('login'))
    if request.method == 'POST':
        model_type = request.form['model']
        # Every form field except the model selector is a numeric feature; the fields
        # must arrive in the same order as the columns used to fit the scaler.
        input_features = [float(value) for key, value in request.form.items()
                          if key != 'model']
        input_scaled = scaler.transform([input_features])
        if model_type == 'KNN':
            pred = knn.predict(input_scaled)[0]
        elif model_type == 'SVM':
            pred = svm.predict(input_scaled)[0]
        elif model_type == 'DT':
            pred = dt.predict(input_scaled)[0]
        else:
            pred = None
        return render_template('result.html', prediction=pred)
    return render_template('predict.html')

@app.route('/logout')
def logout():
    session.pop('user', None)
    return redirect(url_for('home'))

if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True)
The Home Page of the Diabetes Prediction Website serves as the welcoming interface and
the central entry point for users. It introduces the purpose of the application — providing a
convenient and accessible platform for predicting diabetes risk using machine learning
models. The design is clean and user-friendly, typically featuring a welcoming message and
a prominent call-to-action button (such as "Login to Predict") that directs users to the login
page.
The Login Page of the Diabetes Prediction Website is a secure entry point designed to
authenticate users before granting access to prediction and analysis features. It includes
fields for entering a username and password, ensuring that only authorized users can use
the system's functionalities. This adds a layer of privacy and security, especially when
dealing with sensitive health-related information. The page is designed to be simple and
intuitive, with clear labels and a clean layout. Upon submitting valid credentials, users
are redirected to the prediction page.
The Prediction Page is the core functionality of the Diabetes Prediction Website,
allowing users to input specific patient health data and receive a diabetes risk prediction.
The form on this page includes important medical attributes such as Pregnancies, Glucose
level, Blood Pressure, Skin Thickness, Insulin level, BMI (Body Mass Index), Diabetes
Pedigree Function, and Age. These are the features used by machine learning models to
assess whether a person is likely to have diabetes. Additionally, the user can select a
preferred prediction algorithm — KNN (K-Nearest Neighbours), SVM (Support Vector
Machine), or Decision Tree — from a dropdown menu. After entering all the details and
selecting the model, the user submits the form. The system then scales the input data
using a trained scaler and runs the prediction using the chosen model.
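Because the scaler and models expect the eight attributes in a fixed order, the form data is worth validating before it is scaled. The sketch below shows one possible way to do this; the form field names, the `parse_prediction_form` helper, and the rule that values cannot be negative are assumptions made for illustration and are not taken from the project's templates.

```python
# Order must match the column order the scaler and models were trained on.
# The field names below are hypothetical.
FEATURE_ORDER = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree", "age",
]
ALLOWED_MODELS = {"KNN", "SVM", "DT"}

def parse_prediction_form(form):
    """Return (features, model_name) or raise ValueError with a readable message."""
    model_name = form.get("model", "")
    if model_name not in ALLOWED_MODELS:
        raise ValueError(f"Unknown model: {model_name!r}")
    features = []
    for field in FEATURE_ORDER:
        raw = form.get(field, "").strip()
        try:
            value = float(raw)
        except ValueError:
            raise ValueError(f"{field} must be a number, got {raw!r}") from None
        if value < 0:
            raise ValueError(f"{field} cannot be negative")
        features.append(value)
    return features, model_name

# Usage with a plain dict standing in for the submitted form.
form = {
    "pregnancies": "2", "glucose": "130", "blood_pressure": "70",
    "skin_thickness": "25", "insulin": "100", "bmi": "28.5",
    "diabetes_pedigree": "0.45", "age": "41", "model": "SVM",
}
features, model_name = parse_prediction_form(form)
print(features, model_name)
```

Reading fields by name rather than by position means a reordered or incomplete form produces a clear error instead of a silently wrong prediction.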
</form>
<a href="/logout">Logout</a>
</div>
</body>
</html>
The Result Page displays the outcome of the diabetes prediction based on the data
provided by the user. After processing the input through the selected machine learning
model, the result is shown clearly—indicating whether the patient is likely or unlikely to
have diabetes. The result is presented in a visually distinct way using colors (e.g., red for
positive, green for negative), making it easy to understand. This page also includes an
option to go back and make another prediction, ensuring smooth navigation and
usability. It plays a crucial role in giving immediate and clear feedback to the user.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Prediction Result</title>
<style>
.positive { color: red; }
.negative { color: green; }
</style>
</head>
<body>
<div class="result-box">
<h2>Prediction Result</h2>
{% if prediction == 1 %}
<p class="positive">The patient is likely to have diabetes.</p>
{% else %}
<p class="negative">The patient is unlikely to have diabetes.</p>
{% endif %}
<a href="/predict" class="btn">Predict Again</a>
</div>
</body>
</html>
A use case diagram visually represents the interaction between users (or external systems)
and the various functionalities of the proposed diabetes prediction system. It typically
consists of actors (users like patients, doctors, and administrators) and use cases (specific
actions they perform).
In this system, key actors may include:
Patients: They input medical data and receive predictive results.
Doctors: They access detailed reports and provide medical insights.
Admin: Manages data and system configurations.
The diagram highlights interactions such as data entry, model
prediction, report generation, and feedback submission. These use cases are connected via
associations to relevant actors, showing how different components of the system function
together.
The class diagram for diabetes prediction using KNN, Decision Tree, and SVM represents
the key components and their relationships in the system. It includes Patient Data, which
holds attributes like glucose levels, blood pressure, BMI, and insulin levels. The
Preprocessing Module handles data cleaning and normalization before feeding it into the
Machine Learning Models. The system consists of three classifiers (KNN Classifier,
Decision Tree Classifier, and SVM Classifier), each implementing its respective prediction logic.
These classifiers interact with the Evaluation Module. The Prediction Interface serves as the
user-facing component, allowing input of patient data and returning diabetes risk predictions.
Finally, the Flask-based Web Application integrates all components, ensuring smooth real-
time interactions between users and the trained models.
The activity diagram for Diabetes Prediction Using Machine Learning Techniques outlines
the step-by-step flow of the system. It begins with the user inputting patient data, followed
by data preprocessing steps like cleaning and splitting the dataset. Next, machine learning
models are trained and evaluated based on performance metrics. The best-performing
model is then used to make predictions on new input data. Finally, the system displays
whether the patient is diabetic or non-diabetic, completing the process. This diagram
visually represents the logical and sequential flow of the prediction system.
The diagram for diabetes prediction using KNN, SVM, and Decision Tree outlines the structural
components of the system and their relationships. The Patient Data class stores attributes
such as glucose levels, blood pressure, BMI, and insulin readings, which are essential for
prediction. The Preprocessing Module ensures the data is cleaned, normalized, and
prepared before feeding it into the classifiers. The core of the system consists of three
Machine Learning Model Classes—KNN Classifier, SVM Classifier, and Decision Tree
Classifier—each responsible for applying its respective algorithm to determine diabetes
risk based on patient input.
The sequence diagram for Diabetes Prediction Using Machine Learning Techniques
illustrates the interaction between different components of the system—namely the User,
System, and Machine Learning Model. The process begins with the user providing input
data, such as medical parameters. This data is received by the system, which then processes
and formats it appropriately before passing it to the trained machine learning model. The
model analyzes the input and returns a prediction indicating whether the individual is
diabetic or not. Finally, the system communicates this result back to the user. This
sequence highlights how each component works together in real-time to deliver a
predictive diagnosis efficiently.
The state chart diagram for Diabetes Prediction Using Machine Learning Techniques
represents the different states the system transitions through during its operation. It begins
with the initial state where the system is idle or waiting for input. Once the user enters
patient data, the system transitions to the data preprocessing state, where it cleans and
prepares the data. From there, it moves to the model training or loading state, depending
on whether a new model is being trained or a pre-trained model is being used. After the
model is ready, the system enters the prediction state, where the input data is analyzed to
generate a result. Finally, it reaches the output state, where the prediction—whether the
patient is diabetic or non-diabetic—is displayed to the user. The process ends with the
system returning to the idle state, ready for new input. This diagram effectively shows
how the system's internal states change throughout the prediction process.
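The state transitions described above can be sketched as a small state machine. This is only an illustration: the names `State`, `TRANSITIONS`, `PredictionSession`, and `run_cycle` are hypothetical and are not taken from the project code, and the sketch models only the order of states, not the work done in each one.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()                       # waiting for patient input
    PREPROCESSING = auto()              # cleaning and preparing the data
    MODEL_TRAINING_OR_LOADING = auto()  # training a new model or loading a saved one
    PREDICTION = auto()                 # analyzing the input with the model
    OUTPUT = auto()                     # showing diabetic / non-diabetic to the user


# Each state has exactly one successor, in the order the state chart describes.
TRANSITIONS = {
    State.IDLE: State.PREPROCESSING,
    State.PREPROCESSING: State.MODEL_TRAINING_OR_LOADING,
    State.MODEL_TRAINING_OR_LOADING: State.PREDICTION,
    State.PREDICTION: State.OUTPUT,
    State.OUTPUT: State.IDLE,
}


class PredictionSession:
    """Tracks which state the system is in during one prediction request."""

    def __init__(self):
        self.state = State.IDLE
        self.history = [State.IDLE]

    def advance(self):
        """Move to the next state and record it."""
        self.state = TRANSITIONS[self.state]
        self.history.append(self.state)
        return self.state


def run_cycle(session):
    """Advance from idle through every state until the system is idle again."""
    session.advance()
    while session.state is not State.IDLE:
        session.advance()
    return session.history
```

Calling `run_cycle(PredictionSession())` returns the full list of visited states, which starts and ends with `State.IDLE`.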
7. Experimental Studies:
7.1 Testing Process:
Unit Testing:
Unit testing involves validating individual components of the system in isolation to ensure
they perform as expected. In our project, unit testing was applied in the model training
phase using Google Colab. We individually tested each machine learning algorithm (KNN,
SVM, and Decision Tree) to confirm that it functions correctly, can fit the training
data, and generates predictions on unseen data. This early testing helped isolate issues like
incorrect preprocessing or model configuration before moving on to integration testing.
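A unit check of this kind might look like the sketch below. It assumes scikit-learn as the modelling library and uses synthetic data from `make_classification` in place of the patient dataset; the helper names `make_models`, `check_fits_and_predicts`, and `test_all_models` are hypothetical, and the hyperparameters are library defaults rather than the project's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data: 8 numeric features, binary label.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def make_models():
    """Return one fresh, untrained instance of each algorithm under test."""
    return {
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
    }


def check_fits_and_predicts(model):
    """Fit on the training split and verify the shape and labels of the output."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    assert predictions.shape == y_test.shape, "expected one prediction per test row"
    assert set(np.unique(predictions)) <= {0, 1}, "expected only the two class labels"
    return predictions


def test_all_models():
    """Run the same check on every model, each in isolation."""
    for model in make_models().values():
        check_fits_and_predicts(model)
```

Because each model is created fresh and checked on its own, a failure points at a single algorithm or at the shared preprocessing rather than at the application as a whole.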
Accuracy Testing:
Accuracy testing is crucial for evaluating how well our machine learning models perform
in predicting diabetes. In Colab, we split the dataset into training and testing sets, then
calculated accuracy, precision, recall, and F1-score for each model. This allowed us to
compare the performance of the KNN, SVM, and Decision Tree models. The
accuracy scores were later visualized on the website through a Model Accuracy
Comparison page, helping users understand which algorithm performs best.
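The evaluation procedure described above can be sketched as follows, assuming scikit-learn. The data here is synthetic: 768 rows with a 20% test split are used only so that the test set has 154 rows, the total that each confusion matrix reported in this chapter sums to. The actual dataset, split ratio, scaling step, and hyperparameters are not stated in this section, so everything below is an assumption, and the `evaluate` helper is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (assumption, see the note above).
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


def evaluate(model):
    """Train on the training split and score on the held-out test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# KNN and SVM are distance based, so they are scaled first; the tree is not.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {name: evaluate(model) for name, model in models.items()}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Scoring every model on the same held-out split is what makes the four metrics comparable across algorithms.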
Integration Testing:
Integration testing ensures that various components of the application work together as a
complete system. In the web application, we tested whether the model inputs collected via
forms were correctly passed to the prediction functions, whether the prediction results
were processed properly, and whether navigation between pages (like from login to
predict) worked seamlessly. It helped ensure the Flask routes, templates, and backend logic
functioned as an integrated unit.
Functional Testing:
Functional testing verifies that each part of the application performs its intended function.
In this project, we conducted functional tests on the login system, prediction submission
form, model selection dropdown, and logout functionality. We ensured that all inputs are
accepted correctly, proper predictions are returned, and each button and link routes users
to the right page. This testing guaranteed that the application behaves as expected from a
user's perspective.
Usability Testing:
Usability testing evaluates how easy and intuitive the application is for end-users. We
assessed the layout, color scheme, and navigation flow of the website to ensure it is
visually appealing and simple to use. Feedback was considered to improve elements like
form structure, label clarity, and button positions. The consistent design and responsive UI,
tested on both desktops and mobile devices, made the application accessible and user-
friendly.
Manual Testing:
Manual testing involves using the application like a regular user to find bugs or usability
issues that automated tests might miss. In this project, we manually tested all features,
including login with correct and incorrect credentials, entering valid and invalid patient
data, and navigating through all pages such as Home, Predict, Results, Graphs, Suggestions,
and Reports. This helped identify issues early and validated that all functionality was
working properly before deployment.
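The valid and invalid patient data cases exercised during manual testing correspond to checks like the ones in this sketch. It is not the project's actual form handler: the field names in `FIELDS`, the `parse_patient_form` helper, and the rule that values must be finite and non-negative are all assumptions made for illustration.

```python
import math

# Hypothetical form field names, based on the health details the user enters.
FIELDS = (
    "pregnancies",
    "glucose",
    "blood_pressure",
    "skin_thickness",
    "insulin",
    "bmi",
    "age",
)


def parse_patient_form(form):
    """Convert raw form strings to floats in FIELDS order.

    Returns (values, errors). values is None if any field is invalid.
    """
    values, errors = [], []
    for name in FIELDS:
        raw = str(form.get(name, "")).strip()
        if raw == "":
            errors.append(f"{name}: missing")
            continue
        try:
            number = float(raw)
        except ValueError:
            errors.append(f"{name}: not a number")
            continue
        if not math.isfinite(number):
            errors.append(f"{name}: not a finite number")
            continue
        if number < 0:
            errors.append(f"{name}: must not be negative")
            continue
        values.append(number)
    return (values if not errors else None), errors


valid_form = {
    "pregnancies": "2", "glucose": "120", "blood_pressure": "70",
    "skin_thickness": "25", "insulin": "80", "bmi": "28.5", "age": "33",
}
invalid_form = dict(valid_form, glucose="abc", bmi="-3")
del invalid_form["age"]

print(parse_patient_form(valid_form))
print(parse_patient_form(invalid_form))
```

Collecting every error before returning, rather than stopping at the first, lets the page show all problems with a submission at once.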
Step 1: The user accesses the diabetes prediction website through a browser.
Step 2: The user logs in with their credentials, or registers for a new account if they are a
new user.
Step 3: After successful login, the user opens the “Diabetes Prediction” page from the
navigation menu.
Step 4: The user enters health details: Glucose level, Blood Pressure, Insulin, BMI, Age, Skin
Thickness, Pregnancies, etc.
Step 5: The user selects the desired machine learning model (KNN, SVM, or Decision Tree)
from a dropdown or radio button option.
Step 6: The user clicks the “Predict” or “Submit” button to send the data to the backend server.
Step 7: The backend processes the input and passes it to the selected model, which returns a
prediction.
Step 8: The result is displayed on the screen, indicating whether the user is Diabetic or
Non-Diabetic.
The bar graph illustrates the comparison of accuracy percentages among three machine
learning models—SVM (Support Vector Machine), KNN (K-Nearest Neighbors), and
Decision Tree—used for diabetes prediction. The y-axis represents the accuracy in
percentage, while the x-axis lists the machine learning algorithms. The graph shows that
the SVM model outperforms the others with the highest accuracy of around 76%, followed
by the Decision Tree with slightly lower accuracy, and KNN with the lowest accuracy,
approximately 70%. This comparison supports selecting a model for deployment based on
predictive performance, with SVM the most accurate of the three algorithms tested.
The given image displays the confusion matrices for three machine learning models—
SVM, KNN, and Decision Tree—used in the diabetes prediction project. Each matrix
provides a detailed summary of prediction results, comparing actual outcomes (rows) with
predicted outcomes (columns). The value at the top-left of each matrix shows the true
negatives (patients correctly identified as non-diabetic), while the bottom-right shows true
positives (patients correctly identified as diabetic).
SVM Confusion Matrix: Shows 81 true negatives and 36 true positives, indicating good
performance with relatively fewer false positives (18) and false negatives (19).
KNN Confusion Matrix: Has 79 true negatives and 28 true positives, but more errors with
20 false positives and 27 false negatives, reflecting lower accuracy.
Decision Tree Confusion Matrix: Reports 75 true negatives and the highest number of true
positives (40), with fewer false negatives (15) but more false positives (24) compared to
SVM.
This comparative analysis shows how each model handles misclassifications: SVM provides
balanced performance, the Decision Tree favors detecting positives (diabetes cases) at the
cost of more false positives, and KNN shows the weakest prediction strength overall.
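The standard metrics can be derived directly from the four counts in each confusion matrix. The sketch below uses the counts reported above; the names `MATRICES` and `metrics_from_counts` are illustrative and not part of the project code.

```python
# Counts as reported in the three confusion matrices above.
MATRICES = {
    "SVM": {"tn": 81, "fp": 18, "fn": 19, "tp": 36},
    "KNN": {"tn": 79, "fp": 20, "fn": 27, "tp": 28},
    "Decision Tree": {"tn": 75, "fp": 24, "fn": 15, "tp": 40},
}


def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the diabetic class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted diabetics who are diabetic
    recall = tp / (tp + fn)     # share of actual diabetics who were detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


summary = {name: metrics_from_counts(**counts) for name, counts in MATRICES.items()}

for name, scores in summary.items():
    print(name, {metric: round(value * 100, 1) for metric, value in scores.items()})
```

Each matrix sums to 154 test samples, and the resulting accuracies are about 76.0% for SVM, 74.7% for the Decision Tree, and 69.5% for KNN. The Decision Tree has the highest recall (about 72.7%), which is the numerical form of its tendency to favor detecting positive cases.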
The given bar graph illustrates the accuracy comparison of three machine learning
algorithms—KNN, SVM, and Decision Tree—used for diabetes prediction. The vertical
axis represents accuracy in percentage, while the horizontal axis lists the three models.
From the graph, it's evident that SVM (Support Vector Machine) performs the best with
the highest accuracy of 85.0%, followed by KNN (K-Nearest Neighbours) at 82.5%, and
Decision Tree at 80.2%.
This comparison clearly shows that SVM is the most reliable model for this dataset in
terms of predictive accuracy, making it a preferred choice for deployment in a real-time
diabetes prediction system. The visual representation effectively communicates the
performance differences among the algorithms, supporting data-driven decision-making in
model selection.
Testcase for Login Page:
Testcase for Prediction Result:
Testcase for Prediction Result:
NEGATIVE CASE: inputs Age, Glucose, BMI; output: Patient likely does not have diabetes
8. CONCLUSION AND FUTURE WORK:
In conclusion, machine learning offers a powerful approach to diabetes prediction by
leveraging patient health data to make informed classifications. Models such as Decision
Tree, Support Vector Machine (SVM), and k-Nearest Neighbors (KNN) analyze key features
like glucose levels, BMI, and blood pressure to assess diabetes risk. Graphical
representations, including confusion matrices and ROC curves, make the predictions easier
to interpret and the models easier to evaluate. While current ML techniques provide
promising results, further improvements can be achieved through hyperparameter tuning,
feature engineering, and ensemble methods. Deploying these models via web applications
makes them accessible and practical in real-world healthcare scenarios, contributing to
early diagnosis and preventive care strategies. Future developments could integrate deep
learning, which may yield higher accuracy.
Future work in diabetes prediction using machine learning can focus on improving
accuracy, interpretability, and real-world applicability. One potential direction is integrating
deep learning models, such as Convolutional Neural Networks (CNNs) or Recurrent Neural
Networks (RNNs), to capture more complex patterns in medical data. Expanding datasets to
include diverse patient demographics and real-time data from wearable devices can further
improve model generalization. Deploying the model as a user-friendly web application with
continuous updates will enhance accessibility for healthcare providers. Lastly, ensuring
robust model validation through software testing methods, including unit testing, functional
testing, and black-box testing, will contribute to the reliability of ML-based diabetes
prediction systems.
References: