JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, NOIDA
B. TECH 5th SEMESTER
Fundamentals of Machine Learning Project
TITLE OF PROJECT:
Cancer Detection Models
Submitted By:
Enrollment No. Name
22103124 Khushi Agarwal
22103148 Rishav Sachdeva
22103143 Soham Kukreti
22103151 Daksh Jain
Submitted To: Dr. Sherry Garg
PROJECT REPORT
PROBLEM STATEMENT
Cancer diagnosis is a critical area of healthcare that demands accurate and timely
predictions. Early and precise detection of cancer can significantly improve
treatment outcomes and patient survival rates. This project focuses on building
machine learning models that can classify cancer levels (malignant/benign) using
patient data.
OVERVIEW OF THE PROJECT
This project employs machine learning techniques to detect cancer based on
medical data. By leveraging algorithms like K-Nearest Neighbors (KNN), Logistic
Regression, Naive Bayes, and Support Vector Machines (SVM), the project aims
to compare their effectiveness in predicting cancer levels. This comparative study
helps identify the most accurate and efficient model, enabling advancements in
automated cancer detection.
OBJECTIVE OF THE PROJECT
● To develop machine learning-based cancer detection models capable of
classifying cancer as benign or malignant.
● To compare the accuracy and effectiveness of KNN, Logistic Regression,
Naive Bayes, and SVM algorithms.
● To provide a reliable and efficient tool to assist in the early diagnosis of
cancer.
ALGORITHM INSIGHTS
1. K-Nearest Neighbors (KNN)
○ A non-parametric algorithm that classifies data points based on the
majority class of their k-nearest neighbors.
○ Suitable for small datasets and intuitive to implement, though it can be
computationally expensive for large datasets.
2. Logistic Regression
○ A statistical method for binary classification based on a linear
relationship between the input features and the log-odds of the target
variable.
○ Simple to implement and interpret, making it a baseline model for
classification tasks.
3. Naive Bayes
○ A probabilistic algorithm based on Bayes’ theorem, assuming feature
independence.
○ Fast and efficient for high-dimensional datasets, though its
independence assumption may not hold for all features.
4. Support Vector Machine (SVM)
○ A robust algorithm that finds the hyperplane that best separates
classes in a dataset.
○ Effective for high-dimensional spaces and datasets with a clear
margin of separation.
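The first two descriptions above can be made concrete with a small sketch. It is only an illustration using the Python standard library, not the project's implementation: the helper names (knn_predict, logistic_probability), the toy points, and the weights are all made up for the example. It shows the majority vote among the k nearest neighbors and the mapping from linear log-odds to a probability.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(nearest_labels).most_common(1)[0][0]


def logistic_probability(weights, bias, features):
    """Map the linear log-odds w.x + b to a probability via the sigmoid."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy, made-up data: two features per sample, 0 = benign, 1 = malignant.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(points, labels, (1.1, 1.0), k=3)
pred_high = knn_predict(points, labels, (4.0, 4.0), k=3)

# Hypothetical weights chosen so the log-odds are exactly zero.
prob_at_zero = logistic_probability([0.5, -0.5], 0.0, [2.0, 2.0])
```

Because every prediction sorts all training points by distance, the cost of KNN grows with the size of the training set, which is the expense noted above. Log-odds of zero correspond to a probability of 0.5, the usual decision threshold for logistic regression.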
FLOWCHART
MODEL IMPLEMENTATION AND EVALUATION
1. Dataset Overview
The dataset contains cancer-related medical data and labels.
● Total Records: 1000
● Columns:
○ 23 Features
○ Features include various medical measurements such as Obesity,
GeneticRisk, BalancedDiet, WeightLost, ShortnessOfBreath, etc.
○ Target: Level (0 for benign, 1 for malignant).
2. Missing Data Handling
● Identified missing values and imputed them using the median value for
numerical columns.
3. Feature Engineering
● Encoded categorical variables (if any) using one-hot encoding.
● Standardized numerical features to improve model performance.
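The preprocessing steps above (median imputation, one-hot encoding, standardization) could be sketched with scikit-learn as follows. This is an assumption about tooling, since the report does not name the library used. The tiny data frame is made up, and the Gender column is a hypothetical categorical feature added only to show the encoding step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy frame; "Gender" is a hypothetical categorical column.
df = pd.DataFrame({
    "Obesity": [1.0, 3.0, np.nan, 7.0],
    "GeneticRisk": [2.0, np.nan, 4.0, 6.0],
    "Gender": ["M", "F", "F", "M"],
})
numeric_cols = ["Obesity", "GeneticRisk"]
categorical_cols = ["Gender"]

# Numeric columns: fill missing values with the column median, then
# rescale to zero mean and unit variance.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one indicator column per category.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Fitting the transformer on the training split only and reusing it on the test split keeps test-set statistics from leaking into the medians and scaling parameters.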
MODEL TRAINING AND EVALUATION
1. Algorithms/Models Implemented
● K-Nearest Neighbors (KNN): Tested for its simplicity and performance on
small datasets.
● Logistic Regression: Used as a baseline model for cancer detection.
● Naive Bayes: Implemented to leverage its probabilistic approach for
classification.
● Support Vector Machine (SVM): Included for its robustness in handling
complex decision boundaries.
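A comparison of the four models might be set up as in the sketch below. Everything specific here is an assumption rather than the project's actual configuration: the data is synthetic (generated to match only the 1000 rows and 23 features reported above), the 80/20 split, the Gaussian variant of Naive Bayes, the RBF kernel, and k = 5 are illustrative defaults, and scikit-learn is assumed as the library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the project data (assumption): 1000 rows, 23 features.
X, y = make_classification(n_samples=1000, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters below are illustrative defaults, not the report's settings.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    # Scale inside the pipeline so the scaler is fitted on training data only.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # test-set accuracy

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The accuracies printed by this sketch come from synthetic data and are not comparable to the figures reported for the project dataset.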
2. Evaluation Metrics
● Accuracy: Measures the proportion of correct predictions.
● Precision, Recall, F1 Score: Provide insights into the model’s ability to
classify positive and negative samples.
● Confusion Matrix: Visualizes true positives, false positives, true negatives,
and false negatives.
● Cross-validation: Repeats training and testing over several data splits to
check that the results are consistent.
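The metrics listed above all follow from the four confusion-matrix counts. The sketch below computes them in plain Python on made-up labels and also shows one simple way to form k-fold splits for cross-validation. The function names and the example labels are hypothetical and serve only as illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Made-up labels: 1 = malignant (positive class), 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)          # (3, 1, 3, 1)
metrics = classification_metrics(y_true, y_pred)   # every metric is 0.75 here
folds = list(k_fold_indices(10, 3))                # test folds of size 4, 3, 3
```

In a cancer-detection setting, recall on the malignant class deserves particular attention, because a false negative corresponds to a missed malignant case.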
Confusion Matrices for all 4 models:
1. Naive Bayes
Accuracy: 91%
2. Logistic Regression
Accuracy: 98.4%
3. Support Vector Machine (SVM)
Accuracy: 99.9%
4. K-Nearest Neighbors (KNN)
Accuracy: 96.5%
CONCLUSION
This project demonstrates the application of machine learning algorithms in cancer
detection. Among the four algorithms tested, the Support Vector Machine (SVM)
was the most effective, achieving the highest accuracy (99.9%) and F1 score,
ahead of Logistic Regression (98.4%), KNN (96.5%), and Naive Bayes (91%). The
SVM model can therefore serve as a supporting tool for early cancer
diagnosis. The study also highlights the strengths and weaknesses of KNN,
Logistic Regression, and Naive Bayes, providing a comprehensive understanding
of their applicability in medical data analysis.
REFERENCES
1. Hastie, Trevor, et al. "The Elements of Statistical Learning: Data Mining,
Inference, and Prediction." Springer, 2009.
2. Bishop, Christopher M. "Pattern Recognition and Machine Learning."
Springer, 2006.
3. Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine
Learning 20.3 (1995): 273-297.