
REPORT ON INTERNSHIP PROJECT

Applied Machine Learning Internship

Submitted in partial fulfillment of the requirement for the award of the


Degree of
BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE AND ENGINEERING
By
Mohammed Abdul Adil (1604-21-733-028)

Department of Computer Science and Engineering


Muffakham Jah College of Engineering and Technology
Hyderabad - 500 034
YEAR OF SUBMISSION: 2024-2025

CERTIFICATE

CERTIFICATE OF COMPLETION

This is to certify that the Internship Report on “Applied Machine Learning Internship”, submitted
by Mohammed Abdul Adil, bearing Roll Number 1604-21-733-028, in partial fulfillment of the
requirements for the award of the Degree of Bachelor of Engineering in Computer Science and
Engineering, is a record of his bonafide work carried out under my guidance and supervision.

Mrs. AFREEN SULTANA


(Associate Professor)

DECLARATION

This is to certify that the work reported in the summer internship entitled “Applied Machine
Learning Internship” is a record of work done by me in the Department of Computer
Science and Engineering, Muffakham Jah College of Engineering and Technology, Osmania
University. The report is based on the project work done entirely by me and not copied from
any other source.

Mr. Mohammed Abdul Adil

(1604-21-733-028)

ACKNOWLEDGEMENT

I would like to express my sincere gratitude and indebtedness to our project course coordinator
Mrs. Afreen Sultana, Associate Professor, Computer Science, Muffakham Jah College of
Engineering and Technology, for her valuable suggestions and interest throughout the course
of this project.
I am also thankful to the mentors of CodSoft for providing excellent courses
and a supportive environment for completing this project successfully.
Finally, I would like to take this opportunity to thank my family for their support throughout
this work.

ABSTRACT

The projects demonstrate the use of data science and machine learning to tackle challenges
across several fields: content categorization, financial security, customer retention, and spam
filtering. The approach involves data preprocessing, feature engineering, and predictive
modeling. The data is meticulously prepared, including cleaning, encoding, selection, and
transformation into numerical features. Model training and validation techniques such as cross-
validation and hyperparameter tuning are used to ensure robust models that generalize
effectively to unseen data, reducing overfitting and underfitting.

The four projects predict movie genres, detect credit card fraud, identify at-risk customers, and
classify SMS messages as spam or legitimate. The movie genre prediction project uses text
preprocessing and TF-IDF for feature extraction, while the fraud detection project applies
machine learning algorithms such as Logistic Regression and Decision Trees to identify
fraudulent patterns in credit card transactions. The customer churn prediction project builds on
these techniques to identify at-risk customers, and the SMS spam classification project uses
natural language processing (NLP) to classify messages as spam or legitimate. These
application-specific methods produce highly effective predictive models.

Despite their different areas of application, all four projects share a common goal: to develop
robust, reliable, and accurate predictive models that provide valuable insights. These models not
only automate essential processes but also contribute to practical solutions in their respective
fields. The results from these projects can improve content organization, enhance financial
security, foster customer loyalty, and protect messaging systems, demonstrating the impact of
data-driven solutions across a multitude of areas and the potential for transformative progress
in these vital domains.

INDEX

Contents

Certificate

Certificate of Completion

Declaration

Acknowledgement

Abstract

CHAPTER I: INTRODUCTION

CHAPTER II: LITERATURE SURVEY

CHAPTER III: SYSTEM ANALYSIS

3.1 Existing System

3.2 Proposed System

3.3 Software Requirements

3.4 Hardware Requirements

CHAPTER IV: SYSTEM DESIGN

4.1 Architecture

4.2 Scalability and Flexibility

CHAPTER V: IMPLEMENTATION

5.1 Customer Churn Prediction

5.2 SMS Spam Classification

5.3 Credit Card Fraud Detection

5.4 Movie Genre Prediction

CHAPTER VI: CONCLUSION AND FUTURE WORK

6.1 Conclusion

6.2 Future Work

References
INTRODUCTION

In an era defined by data-driven insights, the ability to extract meaning and make accurate
predictions from complex datasets has become paramount. These four projects, each focusing
on a distinct challenge, represent an exploration of the power of machine learning to
address real-world problems. From content categorization and fraud prevention to customer
retention and spam detection, these initiatives aim to harness the potential of data science to
drive positive impact.

The core objectives of these projects lie in the design and implementation of robust
machine-learning models. Each project leverages specific preprocessing techniques, feature
extraction methods, and model training algorithms suited to the data and the problem
at hand. By rigorously testing and validating the trained models, this collection of
projects focuses on developing reliable and accurate systems that can be deployed in
practical scenarios.

These endeavors explore diverse domains, each requiring a unique approach. The movie genre
prediction project employs text processing and TF-IDF to categorize films based on plot
summaries; the fraud detection project utilizes feature engineering and algorithms to identify
suspicious credit card transactions; the customer churn project focuses on predicting which
subscribers may cancel using various classification models; and finally, the spam detection
project tackles the issue of unwanted messages using similar text processing techniques.
Together, these projects showcase the versatility of machine learning and its capacity to
tackle a wide array of challenges. From streamlining content management and safeguarding
financial transactions to enhancing customer relationships and improving communication,
these projects demonstrate the transformative potential of data-driven solutions, paving the
way for more efficient and secure operations across various sectors.

LITERATURE SURVEY

1. Automated Movie Genre Categorization


This section explores the literature concerning automated movie genre categorization, focusing on
methodologies for representing textual data, handling multi-label classification, and the utility of
such systems for media analysis.
Text Representation and Feature Extraction: Research in this area explores various techniques
to convert textual descriptions into numerical features, which are essential for machine learning
algorithms to process text data. TF-IDF (Term Frequency-Inverse Document Frequency) is a
commonly used method that captures the importance of words based on their frequency within
a document and across a corpus. However, various research papers also explore word
embeddings (Word2Vec, GloVe, FastText, BERT, etc.) for richer semantic representation, and
these techniques aim to capture the context and relationships between words in a text
document. Surveys such as (6) Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh,
Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, Dan Roth, “Recent Advances in
Natural Language Processing via Large Pre-trained Language Models: A Survey” (2023),
document recent advances in transforming text data and are worth considering.
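As a concrete illustration of the TF-IDF weighting discussed above, the following minimal sketch uses scikit-learn's TfidfVectorizer on an invented three-document corpus (not data from the report):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of plot summaries (invented for illustration)
corpus = [
    "a detective investigates a murder in the city",
    "a young wizard attends a school of magic",
    "a detective and a wizard solve a magical murder",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)  # sparse (3 x vocabulary) matrix

# Terms frequent in one document but rare across the corpus receive high weights
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```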
Classification Techniques: Studies investigate a range of machine learning algorithms for
multi-label classification, with a focus on performance and scalability. Logistic Regression is a
simple yet effective baseline model, while more recent work focuses on neural networks,
attention models, and transformer-based methods. Research such as (7) D Khurana, A Koli,
K Khatter, S Singh, “Natural language processing: State of the art, current trends and
challenges” (2023) explores current challenges and provides context for the advancements in
these fields.
Multi-Label Classification: Research also investigates techniques for handling cases where a
single movie belongs to several genres, along with methods to improve prediction across all
possible genres.
Recommendation Systems and Media Analysis: Papers also discuss the broader implications of
movie genre classification for building recommendation systems and performing media analysis,
which often rely on accurate genre categorization to provide personalized and relevant results.


2. Credit Card Fraud Detection


This section reviews research on credit card fraud detection, focusing on techniques for anomaly
detection, handling imbalanced data, and the application of machine learning models.
Anomaly Detection: Many studies focus on detecting transactions that deviate significantly from
established patterns, and explore traditional statistical techniques alongside machine learning
methods such as isolation forests and autoencoders. Works such as (13) C. C. Aggarwal. "Outlier
Analysis". Springer International Publishing, 2017 provide overviews on outlier detection
techniques.
Imbalanced Data Handling: Credit card datasets are typically imbalanced, with legitimate
transactions far outnumbering fraudulent ones. Research addresses the issue of imbalanced
datasets by exploring techniques like oversampling, undersampling, cost-sensitive learning and
other techniques to provide a balanced dataset for training a machine learning model. Research
like (8) V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. "Feature selection for
imbalanced datasets: A review and new perspectives". Progress in Artificial Intelligence, vol. 5,
pp. 1-20, 2016, specifically addresses this.
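As a minimal sketch of the oversampling idea (this assumes the third-party imbalanced-learn package; the data here is synthetic, not the report's):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # third-party: imbalanced-learn

# Synthetic imbalanced data: 95 legitimate (0) vs 5 fraudulent (1) transactions
X = np.random.randn(100, 4)
y = np.array([0] * 95 + [1] * 5)

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

# Minority-class rows are duplicated until both classes have 95 samples
print(np.bincount(y_res))  # -> [95 95]
```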
Classification Algorithms: A variety of classification models have been investigated in the
literature for fraud detection, including Logistic Regression, Support Vector Machines (SVM),
Decision Trees, Random Forests, and neural networks. Research such as (11) H. Zhao, Z. Chen,
J. Hu, Z. Han, C. Yu. "A novel approach for credit card fraud detection using machine learning".
Applied Soft Computing, vol. 120, 2022 provides insights into modern detection methods.
Feature Engineering: Studies also delve into the importance of using domain knowledge to
extract valuable features. This can include creating new features based on transaction times,
transaction amounts, transaction frequencies and user demographics.


3. Customer Churn Prediction


This section explores literature related to predicting customer churn, focusing on identifying
influential factors and building predictive models that help businesses to improve their customer
retention.
Customer Behavior Analysis: Research explores the significance of various factors such as
demographics, customer activities, product usage, and past interactions in predicting churn.
Research like (12) M. H. D. Wijayawardhana, K. A. T. K. Wijayawardhana, H. A. P. Rathnayake.
"Machine learning for customer churn prediction: a case study in the telecommunication
industry". International Journal of Computer Science and Network Security, vol. 22, no. 10, pp.
161-170, 2022 explores a case study of churn prediction in the telecom industry.
Predictive Models: Many research studies use classification models, such as Logistic Regression,
Decision Trees, Random Forests, Gradient Boosting machines and neural networks for
predicting churn.
Retention Strategies: Papers examine the impact of proactive retention strategies that aim to
address the needs and concerns of customers before they end their subscription. Work such as
(3) “CRM at the Speed of Light: Capturing and Keeping Customers in Internet Real Time”
emphasizes the value of understanding customer behavior for better retention.


4. SMS Spam Classification


This section reviews literature related to the categorization of SMS messages as either spam or
legitimate, with a focus on natural language processing techniques, machine learning algorithms,
and dataset evaluation practices specific to spam detection.
Natural Language Processing (NLP) Techniques: Research in this area discusses the significance
of text preprocessing methods such as tokenization, stop word removal, stemming, and
lemmatization for transforming raw text data. These methods help to reduce the dimensionality
of the data and improve model performance by removing irrelevant noise. A good reference for
text processing methods is Chapter 3: Text Preprocessing from "Speech and Language
Processing" by Daniel Jurafsky and James H. Martin (3rd ed., 2023), which provides a thorough
overview of fundamental NLP techniques for text processing in machine learning tasks.
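To make these steps concrete, here is a small sketch using NLTK (the sample message is invented, and NLTK resource names can vary slightly between versions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stop word lists

text = "Congratulations! You have WON a free ticket, claim it now"
tokens = word_tokenize(text.lower())    # tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]  # stop word removal
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])  # stemming, e.g. 'congratul', 'ticket'
```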
Feature Extraction Methods: Various research papers explore techniques such as Bag-of-Words
(BoW), TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings
(Word2Vec, GloVe, FastText, etc.) to extract useful features that represent SMS messages as
numerical vectors. Research focuses on how these methods can help with classifying spam.
Machine Learning Algorithms: Studies investigate various machine learning algorithms used for
spam classification, such as Naive Bayes, Support Vector Machines (SVM), Decision Trees,
Random Forests, and Logistic Regression. Furthermore, many research projects have moved
towards using ensemble methods and deep learning models (e.g., Recurrent Neural Networks
(RNNs), Convolutional Neural Networks (CNNs)) to enhance spam classification accuracy. A
good paper that presents several classification techniques and also analyses dataset issues is:
M. Alsmirat, M. Hammad. "Classification of SMS Spam Messages Using Support Vector Machine
and Deep Learning Techniques," International Journal of Advanced Computer Science and
Applications, vol. 14, no. 1, 2023.
Datasets and Evaluation: Research studies examine open datasets used to benchmark spam
classification algorithms, analyzing performance metrics such as precision, recall, F1-score,
accuracy, and ROC AUC. These studies provide detailed model evaluation and discuss best
practices for model validation; one such reference is the 2022 paper by A. Hossain, J. A. Khan,
and M. M. Rahman.

SYSTEM ANALYSIS
3.1 EXISTING SYSTEM

The landscape of Applied Machine Learning has evolved over time, and the existing systems that
underpin this field face inherent challenges that hinder their effectiveness in today's dynamic
digital environment. The current analysis methodologies, rooted in older frameworks
and technologies, exhibit several shortcomings that necessitate a paradigm shift.

One of the primary issues with the existing analysis systems is their relatively slower
processing speed. As the volume of textual data generated online continues to surge, the
traditional methods struggle to keep pace with the demand for real-time analysis. This lag in
processing not only impedes timely insights but also limits the adaptability of these systems to
the rapidly changing nature of online discourse.

Accuracy is a critical concern when building predictive systems. Older or less sophisticated
models often struggle to capture the complexity and variability within real-world datasets.
Limitations in handling imbalanced datasets, noisy or sparse data, or nuanced patterns lead to
reduced reliability and precision in the models. For example, traditional methods may struggle in
capturing subtle differences between fraudulent and legitimate transactions or identifying at-risk
customers based on complex behavioral patterns, resulting in misclassifications and reduced
effectiveness in automated categorization and classification systems.
3.2 PROPOSED SYSTEM
The proposed system introduces a robust approach to address the limitations of
conventional predictive modeling methods by incorporating advanced Data Science and
Machine Learning technologies. This system aims to provide effective and accurate solutions
for prediction tasks across diverse domains, from content categorization to financial security
and customer management.
The core concept involves the design and implementation of various machine learning
models using the Python programming language. By utilizing sophisticated algorithms,
including Logistic Regression, along with feature engineering and relevant data
transformations, this system strives to provide a versatile framework capable of making
accurate predictions in various application areas.
A key feature of the system is the use of a variety of preprocessing, feature engineering,
model training, and evaluation techniques. For example, for text processing tasks, it employs
TF-IDF to transform raw text into informative features. This allows for efficient processing
and prediction, promising faster and more accurate results compared to systems that don't
use these methods. The system is built to handle a diverse array of tasks, data, and features,
ensuring adaptability and responsiveness.
The innovation extends beyond the immediate predictive objectives, envisioning applications
in diverse practical domains, such as content moderation, risk management, fraud
prevention, customer relations, and business strategy. It also has the potential to generate
key insights from patterns in the data, leading to tangible improvements in operational
effectiveness and performance.

3.3 SOFTWARE REQUIREMENTS
Operating System:
Windows 10 or higher
macOS (latest versions)
Linux (Ubuntu 18.04 LTS or higher, or similar distributions)
Python Version: Python 3.7 or higher
Python Libraries:
NLTK
scikit-learn
Pandas
NumPy
Integrated Development Environment (IDE):
Visual Studio Code
PyCharm
Jupyter Notebook / JupyterLab
Virtual Environment: Anaconda (optional but recommended for environment isolation)
venv (Python's built-in virtual environment module)
Web Browser: Google Chrome or Mozilla Firefox (optional, for viewing reports or
interactive visualizations)

3.4 HARDWARE REQUIREMENTS


Processor: Intel Core i3 (8th generation or newer) or equivalent AMD Ryzen processor
RAM: 8GB or higher (16GB recommended for larger datasets)
Storage: 256GB SSD or higher

SYSTEM DESIGN

4.1 Architecture

The system architecture is designed for modularity, efficiency, and scalability,
focusing on a clear separation of data handling, processing, and prediction. The architecture can
be conceptually divided into three main components: the Data Input Layer, the Processing Layer,
and the Output Layer.
Data Input Layer: This layer is responsible for acquiring and preparing data for processing.
Depending on the project, data might be loaded from CSV files, text documents, or extracted
from databases. This layer includes data loading, initial data exploration, and the first steps
of data cleaning, which could involve removing irrelevant data or identifying and handling
missing values.
Processing Layer: This layer is where the core machine learning operations occur. The first
step involves feature engineering and preprocessing, transforming the raw data into a format
suitable for the machine learning models. This can include text vectorization (TF-IDF),
numerical feature scaling (StandardScaler), one-hot encoding of categorical variables, and
handling of missing data using imputation techniques. The next stage is model training where
a suitable algorithm (e.g., Logistic Regression) is fitted to the training data. Model validation
occurs after training, in which performance metrics and plots are used to assess the quality
of the model.
Output Layer: In this layer, the trained machine learning model is used to perform predictions
based on the test or new input data. The layer also converts raw predictions into a usable
output which can be a category, boolean, or numerical result. Additionally, the output layer
handles the generation of performance reports and visualizations that showcase the quality
of the model and its performance in terms of accuracy, precision, recall, F1 score and other
metrics. The output can be used in downstream applications, or for analysis purposes.
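A schematic sketch of how the three layers could be wired together in Python follows; the function names and file path are placeholders for illustration, not the report's actual code:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Data Input Layer: acquire and minimally clean the raw data ---
def load_data(path):                               # 'path' is a placeholder
    return pd.read_csv(path).ffill()               # simple missing-value handling

# --- Processing Layer: feature engineering, training, validation ---
def train_model(df, target):
    X = pd.get_dummies(df.drop(columns=[target]))  # one-hot encode categoricals
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_tr, X_te = scaler.fit_transform(X_tr), scaler.transform(X_te)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model, X_te, y_te

# --- Output Layer: predictions and performance reporting ---
def report(model, X_te, y_te):
    print(classification_report(y_te, model.predict(X_te)))
```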
This architectural design ensures a streamlined and modular approach, where each step has
clear inputs, transformations, and outputs. It allows for efficient data flow and easy modification
of any component to improve the prediction outcomes or the system. Each component can be
independently modified or optimized to ensure good quality and performance.


4.2 Scalability and Flexibility

The system architecture is designed with scalability and flexibility as fundamental principles to
accommodate increasing data volumes and evolving project needs. As the projects mature and
the amount of data grows, the system can be adapted to efficiently process larger datasets and
handle more complex analytical tasks. The modular structure of the architecture facilitates easy
integration of new techniques, models, and data sources without requiring a major overhaul of
the entire system.
For instance, if a more computationally intensive model such as a deep learning model is needed
to improve prediction accuracy in the fraud detection, customer churn or movie genre projects,
the model training modules can be replaced or augmented without affecting other parts of the
system. Similarly, if it’s necessary to incorporate new data sources, new feature engineering
methods, or other data cleaning techniques, the data input and processing modules can be easily
modified, optimized, or expanded, with minimal disruption to the overall architecture. This
flexibility allows the system to adapt to changing data characteristics, evolving project goals, and
advancements in machine learning methodologies.
Furthermore, the design allows for both vertical and horizontal scaling. Vertically, individual
components can be optimized using more powerful hardware to increase the performance of the
system. Horizontally, additional computational resources can be allocated to the training,
validation and prediction stages to handle larger datasets, reduce processing times, and improve
model training, ensuring the system remains efficient even with substantial growth in data
volume. This ensures that the system remains robust and reliable as the complexity and data
requirements of the project grow.

IMPLEMENTATION
5.1 Customer Churn Prediction
5.1.1 Module Description

Purpose:
The primary goal of this module is to build a predictive model capable of identifying
customers who are likely to churn (discontinue their subscription or service). This
predictive capability enables businesses to proactively address at-risk customers,
implement retention strategies, and ultimately reduce customer attrition.
Core Processes:
Data Acquisition: The module starts by downloading and loading customer data from a
specified source (Kaggle in this case), typically a CSV file. This data contains various
attributes that could influence churn behavior, such as demographics, account activity,
and product usage patterns.
Exploratory Data Analysis (EDA): After loading, the module conducts EDA to understand
the dataset. This step includes examining data types, handling missing values, and getting
a statistical overview of numeric data. Initial data preparation tasks and pre-processing
such as handling categorical variables with One Hot Encoding are also included.
Feature Engineering and Selection: The module selects relevant features from the dataset
and scales numerical data to standard format using StandardScaler to prepare for model
training. The selection step aims to reduce noise and focus on the most impactful aspects
of customer data.
Model Training: The module employs a Logistic Regression algorithm to build the
predictive model. The dataset is divided into training and testing portions to allow the
model to learn from training data and be evaluated on unseen test data. The chosen
model is trained and fit with the training dataset and target variable.
Model Evaluation: Post-training, the module evaluates the performance of the model on
the test set by providing a detailed classification report and also computes an ROC curve
with the area under the curve to visualize overall performance of the model.
Data Handling:
Data Format: The module reads and manipulates data in tabular format using pandas
DataFrames.
Missing Value Handling: The module identifies and fills missing values using forward
filling.
Feature Scaling: The numerical features are scaled to ensure that the model's output is
not biased due to the scale of the numerical features.
Algorithms:
Logistic Regression: This model is used because of its simplicity and effectiveness in
binary classification tasks such as churn prediction. Logistic regression models the
probability of an outcome based on a linear combination of predictor variables.
Expected Outcome:
The module produces a trained machine learning model that can predict the likelihood of
a customer churning. It also includes key performance metrics like precision, recall, F1-
score, accuracy, and ROC AUC as well as an ROC curve visualization, which helps in
gauging the model's effectiveness and aids in strategic decision making.
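For reference, logistic regression models this probability as the sigmoid of a linear combination of the predictors:

$$ P(\text{Exited} = 1 \mid x) \;=\; \sigma(w^{\top}x + b) \;=\; \frac{1}{1 + e^{-(w^{\top}x + b)}} $$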

5.1.2 Performance and Output

Demonstrates the initial exploratory analysis of the dataset, helping to understand the distribution, scale, and
range of the numerical data. This is crucial for identifying potentially problematic features that might need
scaling and also aids in gaining an initial understanding of the data before any transformations are made.

Shows the table produced by df.describe(). This table includes descriptive statistics such as count, mean,
standard deviation, min/max, and quartiles for numerical columns of the dataset.

The image shows a summary of a dataset's numerical columns, generated using a statistical overview like the
describe() function in Python. Here’s a brief explanation:
Columns: The dataset includes attributes such as RowNumber, CustomerId, CreditScore, Age, Tenure,
Balance, NumOfProducts, HasCrCard, and IsActiveMember.
Count: All columns have 10,000 non-missing entries.
Mean and Std (Standard Deviation): The central tendency and variability of each column. For instance, the
average CreditScore is 650.53 with a standard deviation of 96.65.
Min and Max: The range of values for each column. For example, Age ranges from 18 to 92.
Quartiles (25%, 50%, 75%): Indicate the spread of data, dividing it into four equal parts. For instance, 50%
of Tenure values are 5 or below.

The dataset includes two columns, EstimatedSalary and Exited, with 10,000 entries each. Estimated salaries
range from 11.58 to 199,992.48, with an average of 100,090 and a standard deviation of 57,510, indicating a
wide salary distribution. The Exited column is binary, where approximately 20.37% of customers exited (value
1), while the rest remained (value 0). The 75th percentile of estimated salary is 149,388.


Shows the text-based output generated by sklearn.metrics.classification_report. It shows precision, recall, F1-
score, and support for each class (0 and 1 in this case), as well as the macro and weighted averages.

Quantifies the Logistic Regression model's performance in terms of the metrics that are essential for
assessing a classification model. This snapshot highlights the model's strengths and weaknesses for each
class, as well as its overall performance.

This code snippet focuses on evaluating the performance of a trained Logistic Regression model. It starts by
using the trained model to predict labels for the test data, X_test_scaled. Then, it generates a detailed
classification_report which provides key metrics like precision, recall, and F1-score for each class, along with
overall accuracy and weighted averages. The report reveals that the model performs significantly better in
predicting class 0 (with precision, recall, and F1 scores of 0.83, 0.92, and 0.87 respectively) than class 1 (0.40,
0.22, and 0.28), indicating a bias towards the majority class. While the overall accuracy of 78% seems decent,
the imbalanced performance between classes suggests the model struggles to predict the minority class,
and that further investigation and improvement are needed.

This shows a line graph produced using the code and libraries provided. It plots the True Positive Rate (y-axis)
against the False Positive Rate (x-axis) with the ROC curve, demonstrating how well the model can discriminate
between the two classes. The area under the curve is given, which is the ROC AUC score, along with a dashed
line which represents a random classifier.

Providing a visual evaluation of the classification performance which helps in assessing the model's ability to
distinguish churners from non-churners.

The ROC curve illustrates the performance of a binary classification model by plotting the True Positive Rate
(TPR) against the False Positive Rate (FPR) at various thresholds. The diagonal line represents random guessing
(AUC = 0.5), while the orange curve shows the model's performance.
In this example, the AUC (Area Under the Curve) is 0.62, indicating the model is slightly better than random
guessing but not highly effective at distinguishing between churners and non-churners. Improvements in
the model or features may be needed to enhance its performance.

5.1.3 Libraries and Functions Used
os: Used for interacting with the operating system (e.g., setting environment variables).
os.environ: Used to set the environment variable KAGGLE_CONFIG_DIR.
pandas (pd): For data manipulation and analysis.
pd.read_csv(): Reads a CSV file into a DataFrame.
df.head(): Displays the first few rows of the DataFrame.
df.info(): Provides information about the DataFrame, like data types and non-null values.
df.describe(): Generates descriptive statistics for numerical columns.
pd.get_dummies(): Used for one-hot encoding categorical variables.
numpy (np): For numerical operations.
sklearn (scikit-learn): For machine learning tasks.
sklearn.model_selection.train_test_split: Splits data into training and testing sets.
sklearn.preprocessing.StandardScaler: Standardizes numerical features.
sklearn.linear_model.LogisticRegression: Implements the Logistic Regression model.
sklearn.metrics.classification_report: Generates a classification report.
sklearn.metrics.confusion_matrix: Computes a confusion matrix.
sklearn.metrics.roc_auc_score: Computes the area under the ROC curve.
sklearn.metrics.roc_curve: Computes the receiver operating characteristic curve.
matplotlib.pyplot (plt): For data visualization.
plt.figure(): Creates a figure for plotting.
plt.plot(): Creates line graphs.
plt.xlim(): Sets the x-axis limits.
plt.ylim(): Sets the y-axis limits.
plt.xlabel(): Sets the x-axis label.
plt.ylabel(): Sets the y-axis label.
plt.title(): Sets the plot title.
plt.legend(): Creates a legend.
plt.show(): Displays the plot.
seaborn (sns): For visualization.

5.1.4 Step-by-Step Explanation

1. Environment Setup and Data Loading:


a. The notebook sets up the environment and downloads a dataset from Kaggle about customer
churn.
b. It reads the Churn_Modelling.csv file into a pandas DataFrame.
c. The first few rows of the dataset, data types, and descriptive statistics are printed.
2. Data Preprocessing:
a. Checks for and prints the total number of null values in each column.
b. Fills any existing null values in the dataframe using the ffill method.
c. One-hot encodes categorical features such as Geography and Gender.
d. The Exited column is dropped from the feature set X and used as the target variable y.
e. Splits the data into training and test sets with an 80/20 ratio.
3. Feature Scaling:
a. Standardizes numerical features using StandardScaler.
4. Model Training:
a. Initializes and trains a Logistic Regression model on the scaled training data.
5. Model Evaluation:
a. Predicts labels on the scaled test dataset.
b. Generates and prints a classification report and a confusion matrix.
c. Generates and plots an ROC curve, calculating the area under the curve.
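The steps above can be condensed into the following sketch. This is a hedged reconstruction, not the notebook's verbatim code; the identifier columns dropped here (RowNumber, CustomerId, Surname) are assumptions about the Kaggle Churn_Modelling.csv layout:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1-2. Load, forward-fill missing values, one-hot encode, and split 80/20
df = pd.read_csv("Churn_Modelling.csv").ffill()
df = pd.get_dummies(df, columns=["Geography", "Gender"], drop_first=True)
X = df.drop(columns=["Exited", "RowNumber", "CustomerId", "Surname"], errors="ignore")
y = df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Standardize numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train the Logistic Regression model
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# 5. Evaluate: classification report, confusion matrix, ROC curve and AUC
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc_score(y_test, y_prob):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```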

5.2 SMS Spam Classification
5.2.1 Module Description

Purpose:
The primary objective of this module is to develop a system that can automatically classify
SMS messages as either spam or legitimate. This is crucial for filtering out unwanted and
often malicious content, improving user experience, and enhancing message filtering
services.
Core Processes:
Data Loading and Exploration: Similar to the churn module, this module starts by loading
SMS message data, handling encoding to correctly read the file, and inspecting the data's
column names and structure.
Text Preprocessing: This is the module's central component, which includes cleaning and
transforming the raw text data. Key preprocessing tasks include lowercasing the text,
removing special characters, and standardizing the text.
Feature Extraction: Here, text data is converted into numerical form that machine learning
models can understand. This uses the TF-IDF method, which measures the importance of
each word in the message.
Model Training: The module utilizes Logistic Regression as the classification algorithm. It
is trained on a labeled dataset, learning the patterns associated with spam and legitimate
messages, based on TF-IDF representations of the messages.
Model Evaluation: The model's effectiveness is assessed using a variety of metrics, such
as precision, recall, accuracy, and F1 score. A confusion matrix is computed and
visualized to better understand the different types of errors produced by the model.
Prediction Function: A prediction function is implemented that takes any string of
text and predicts whether it should be classified as ham or spam.
Data Handling:
Text Data: The module focuses on handling textual data with varying lengths and
complexities.
Categorical Labels: Each SMS message is associated with a label of either 0 (ham) or 1
(spam).
Algorithms:
Logistic Regression: It's used for the same reasons as in the Customer Churn module,
which is its efficiency in binary classification and interpretability.
TF-IDF: This text feature extraction method transforms text into a numerical matrix that
captures the importance of terms in a document within a corpus.
Expected Outcome:
The module produces a trained model capable of accurately classifying new SMS
messages. It also delivers performance metrics and visualization of errors, as well as a
prediction function that can be used in future SMS classification tasks.

5.2.2 Performance and Output

This is the text output of the classification report showing the precision, recall, F1-score, and support for each
class, along with overall accuracy. It quantifies the Logistic Regression model's ability to classify SMS messages
as spam or legitimate, highlighting the model's performance.

Confusion Matrix

Shows a colored heatmap visualization generated using seaborn, displaying the confusion matrix. The axes are
labeled as 'Ham' and 'Spam', with the color intensity indicating the frequency of each prediction outcome.
Provides an intuitive visualization of where the model made correct and incorrect predictions, making it easy to
interpret model performance.
True Positive (TP): Correctly predicted Spam = 113
True Negative (TN): Correctly predicted Ham = 964
False Positive (FP): Predicted Spam but actually Ham = 1
False Negative (FN): Predicted Ham but actually Spam = 37

Accuracy: 96.6%
5.2.3 Libraries and Functions Used
os: As described in Section 5.1.
pandas (pd): As described in Section 5.1.
df.columns = [...]: Renames the columns of a DataFrame.
df['label'].map(): Used to map values in the label column to new values.
sklearn (scikit-learn): As described in Section 5.1.
sklearn.feature_extraction.text.TfidfVectorizer: Converts text to TF-IDF features.
matplotlib.pyplot (plt): As described in Section 5.1.
seaborn (sns): As described in Section 5.1.

5.2.4 Step-by-Step Explanation

1. Environment Setup and Data Loading:


a. The notebook sets up the environment and loads the SMS spam dataset from Kaggle.
b. It reads the spam.csv file into a pandas DataFrame using latin-1 encoding.
c. It keeps only the first two columns of the DataFrame, renaming them label and message.
d. It maps the values in the label column to 0 and 1, for ham and spam respectively.
2. Data Preprocessing
a. The label column is set as the target variable, with message as the feature set.
b. Splits the data into training and test sets.
3. Feature Extraction:
a. Converts SMS messages into numerical features using TfidfVectorizer.
b. Transforms both training and test datasets with the fitted Vectorizer.
4. Model Training:
a. Initializes and trains a Logistic Regression model on the extracted training features.
5. Model Evaluation:
a. Predicts labels on the extracted test features.
b. Generates and prints a classification report and confusion matrix.
c. Displays a heatmap of the confusion matrix.
6. Prediction Function:
a. Defines a predict function that takes a message, transforms it using the fitted vectorizer, and
makes a prediction with the trained model.
b. Demonstrates the usage of the predict function to classify two example messages.
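A minimal sketch of this workflow follows; the v1/v2 column names assume the common Kaggle spam.csv layout, and the example messages are invented:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# 1. Load, keep the first two columns, and relabel
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]
df["label"] = df["label"].map({"ham": 0, "spam": 1})

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, random_state=42)

# 3. TF-IDF features (fit on training data only)
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4. Train the Logistic Regression model
model = LogisticRegression(max_iter=1000).fit(X_train_tfidf, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 6. Prediction helper for arbitrary text
def predict(message):
    return "spam" if model.predict(vectorizer.transform([message]))[0] == 1 else "ham"

print(predict("WIN a FREE prize now!!!"))    # likely 'spam'
print(predict("See you at lunch tomorrow"))  # likely 'ham'
```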

5.3 Credit Card Fraud Detection
5.3.1 Module Description

Purpose:
The primary purpose of this module is to create a reliable fraud detection system for
credit card transactions. The goal is to identify potentially fraudulent transactions, which
is crucial for minimizing financial losses and protecting customers from unauthorized
charges.
Core Processes:
Data Acquisition: The module starts with loading transaction data (typically from CSV
files), each transaction containing various features such as timestamps, user details, and
transaction amounts.
Data Preprocessing: This step is crucial, as it includes removing unnecessary columns
and selecting numeric and categorical features for further processing.
Feature Engineering: For numeric and categorical feature processing, pipelines are
constructed which contain imputation techniques, scaling of numerical features and one-
hot encoding of categorical variables.
Model Training: The processed data is used to train a Logistic Regression model. The
training step aims to find patterns in the data that are associated with fraudulent activity.
Model Evaluation: The trained model is evaluated using metrics such as precision, recall,
F1-score, and accuracy, which are specifically important in fraud detection due to its
nature as an imbalanced problem.
Data Handling:
Tabular Data: The module handles credit card transaction data using pandas DataFrames,
which allows for easy data processing and manipulation.
Feature Selection: It includes selecting features relevant to fraud detection.
Data Imputation and Scaling: The module takes into account missing values using
imputation methods and scaling of numerical features.
Algorithms:
Logistic Regression: Selected for its efficiency in identifying patterns in binary outcome
variables, which is helpful for identifying fraudulent transactions.
Pipeline: The Pipeline API is used to chain multiple preprocessing and feature engineering
steps into a single unit.
ColumnTransformer: Used to transform specified columns of a DataFrame using specified
preprocessing pipelines.
Expected Outcome:
The module results in a trained Logistic Regression model that can accurately identify
potentially fraudulent credit card transactions. Performance metrics help in evaluating the
model's effectiveness.

5.3.2 Performance and Output

The text output of classification_report is displayed. This includes precision, recall, F1-score, and
support, and provides insight into how the logistic regression model performed on the test dataset,
particularly given the highly imbalanced nature of this dataset.
The classification report for a logistic regression model highlights significant class imbalance. For
class 0 (majority class with 553,574 samples), the model achieves perfect precision, recall, and
F1-score of 1.00. However, for class 1 (minority class with 2,145 samples), the precision, recall,
and F1-score are all 0.00, indicating the model fails to identify any instances of this class. The
overall accuracy is 1.00 due to the dominance of class 0, but the macro average (0.50 for
precision, recall, and F1-score) reflects the poor performance for class 1. This highlights the need
for handling the class imbalance, such as using oversampling, undersampling, or class weighting.
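As a sketch of the simplest of those remedies, class weighting can be enabled directly on the classifier. This is shown in isolation under the assumption that, in this module, it would sit at the end of the preprocessing pipeline:

```python
from sklearn.linear_model import LogisticRegression

# Weight classes inversely to their frequency, so the ~2,000 fraud cases
# carry as much total weight in the loss as the ~550,000 legitimate ones.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```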


The dataset shown appears to be a transaction dataset used for fraud detection purposes. Each
row represents an individual transaction and contains various details, including the transaction
category (e.g., grocery_pos, entertainment, personal_care), the transaction amount (amt), and
the gender of the account holder (gender, represented as M for male and F for female). Additional
attributes include the population of the city where the transaction occurred (city_pop), the
timestamp of the transaction in Unix time format (unix_time), and the merchant's location,
specified by latitude (merch_lat) and longitude (merch_long).
The dataset also includes a binary column called is_fraud, which indicates whether a transaction
is fraudulent (1) or not (0). The displayed portion of the dataset shows that all transactions are
non-fraudulent (is_fraud = 0). This type of dataset is typically used to train and evaluate machine
learning models designed to detect fraudulent activities based on transaction patterns and
associated metadata.

The dataset provides a comprehensive view of transaction details, combining demographic,


geographic, and transactional metadata. It enables a deeper analysis of transaction patterns,
such as identifying trends in specific categories or locations, understanding spending behavior
across different city populations, and detecting anomalies in transaction amounts or timestamps.
This rich metadata serves as valuable input for machine learning models, allowing for nuanced
fraud detection based on complex correlations and patterns within the data.

5.3.3 Libraries and Functions Used
os
pandas (pd)
sklearn (scikit-learn)
sklearn.preprocessing.OneHotEncoder: For one-hot encoding categorical features.
sklearn.compose.ColumnTransformer: Applies transformers to different columns of an array
or DataFrame.
sklearn.impute.SimpleImputer: Handles missing values using a strategy (e.g., mean, most
frequent).
sklearn.pipeline.Pipeline: Chains multiple processing steps together into a single unit.
matplotlib.pyplot (plt)
seaborn (sns)

5.3.4 Step-by-Step Explanation

1. Environment Setup and Data Loading:


a. The notebook sets up the environment and loads the fraud detection data from Kaggle.
b. It reads the fraudTrain.csv and fraudTest.csv into pandas DataFrames.
2. Data Cleaning:
a. Defines a list of columns to drop as they aren't needed for this particular problem.
b. Drops columns specified from the train and test sets.
c. The is_fraud column is dropped from the feature sets.
d. is_fraud is used as the target variable for both the train and test datasets.
3. Feature Transformation:
a. Selects numerical and categorical features based on their data types.
b. Creates pipelines for numerical (impute missing values with the mean, then scale features) and
categorical features (impute missing values with the most frequent, then perform one-hot
encoding).
c. Combines numerical and categorical transformers using ColumnTransformer.
4. Model Training:
a. Creates a pipeline with the preprocessor and a Logistic Regression classifier.
b. Trains the pipeline using the processed data.
5. Model Evaluation:
a. Predicts on the test set.
b. Prints a classification report for the model on the test set.
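The pipeline described above can be sketched as follows; the dropped-column list is illustrative, not the report's exact list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1-2. Load and separate features/target
train = pd.read_csv("fraudTrain.csv")
test = pd.read_csv("fraudTest.csv")
drop_cols = ["trans_num", "first", "last", "street"]  # illustrative, not the report's exact list
X_train = train.drop(columns=drop_cols + ["is_fraud"], errors="ignore")
y_train = train["is_fraud"]
X_test = test.drop(columns=drop_cols + ["is_fraud"], errors="ignore")
y_test = test["is_fraud"]

# 3. Per-type preprocessing pipelines, combined with ColumnTransformer
num_cols = X_train.select_dtypes(include="number").columns
cat_cols = X_train.select_dtypes(include="object").columns
numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([("num", numeric, num_cols),
                                  ("cat", categorical, cat_cols)])

# 4. Chain preprocessing and the classifier into a single pipeline, then train
clf = Pipeline([("preprocess", preprocessor),
                ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
print(classification_report(y_test, clf.predict(X_test)))
```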

5.4 Movie Genre Prediction
5.4.1 Module Description

Purpose:
The main goal of this module is to create a system that can accurately predict movie
genres based on their plot summaries. This system enables automated content
categorization which allows for efficient organization, tagging, and recommendation of
movie content based on descriptive text.
Core Processes:
Data Loading and Exploration: The module begins by loading movie data, which includes
movie titles, descriptions, and their associated genres from text data.
Text Preprocessing: This core step prepares the text data for the model by removing
unwanted characters and standardizing the text. This also includes the removal of stop
words, and stemming to reduce the amount of noise in the feature space.
Feature Extraction: The module converts movie descriptions into numerical
representations using TF-IDF vectorization. This method measures the importance of
words in each description.
Model Training: The module trains a Logistic Regression model using the TF-IDF features.
This is where the model learns the relationships between text patterns and movie genres.
Model Prediction: Once trained, the module tests the model on new descriptions and
generates a prediction on which genre the movie falls under.
Data Handling:
Text Data: The module primarily handles text-based data in the form of movie
descriptions.
Text Preprocessing: The module transforms text into cleaned, lowercase, and stemmed
words to prepare it for further feature extraction.
Algorithms:
Logistic Regression: Used for its efficiency in classifying data.
TF-IDF: This is used to transform the processed text into a numerical form, capturing the
importance of words in the movie descriptions, as was used in the SMS Spam Module.
Expected Outcome:
The module should give a model that can accurately predict movie genres from the given
descriptions with high accuracy. It also provides a demonstration on how to use the
model with custom descriptions to classify new movies.

5.4.2 Performance and Output

Table showing the output of test_df.head(): the first few rows of the test dataset, including the
original descriptions and the genres predicted by the machine learning model. This shows the
testing data after preprocessing and demonstrates the predictive power of the trained model on
the test data.

Shows an output from the code, showing that when a single movie description was provided to the model,
it was correctly classified. It also depicts the practical application of the trained machine learning model
by classifying a single custom string of text.

5.4.3 Libraries and Functions Used
os: As described in Section 5.1.
pandas (pd): As described in Section 5.1.
re: Used for regular expressions.
re.sub(): Used to remove non-alphabetic characters.
sklearn (scikit-learn): As described in Section 5.1.
sklearn.feature_extraction.text.TfidfVectorizer: As described in Section 5.2.
nltk: Used for natural language processing.
nltk.download('stopwords'): Downloads the stop word lists.
nltk.download('punkt'): Downloads the Punkt tokenizer.
nltk.corpus.stopwords: Used to get English stop words.
nltk.PorterStemmer(): Used for stemming.
matplotlib.pyplot (plt): As described in Section 5.1.
seaborn (sns): As described in Section 5.1.

5.4.4 Step-by-Step Explanation

1. Environment Setup, Library Import, and Data Loading:


a. The notebook sets up the environment, imports libraries, and downloads a dataset containing
movie descriptions and genres from Kaggle.
b. It then reads the first few lines from train_data.txt and prints them.
c. Additional libraries for NLTK data are downloaded.
2. Text Preprocessing:
a. Defines a function that uses a regex to remove non-alphabetic characters.
b. The text is converted to lowercase and split into words using the split function.
c. Stop word removal and stemming are applied to clean the text.
d. The cleaned words are then rejoined into a single string.
3. Data Loading and Preprocessing:
a. It loads and preprocesses the training data from train_data.txt into a list by splitting each line,
extracting the title, genre, and description, and preprocessing the text.
b. It does the same for the test data from test_data.txt.
c. The training and testing lists are converted into pandas DataFrames.
4. Feature Extraction:
a. Converts the text descriptions into numerical TF-IDF features.
b. The features are fitted and transformed using TfidfVectorizer.
5. Model Training:
a. A Logistic Regression model is trained on the extracted features and the specified target
variable.
6. Model Prediction:
a. A prediction is made on the test set and printed to display predicted genres.
7. Single Prediction:
a. A single example is specified and preprocessed, and a final prediction is printed to
demonstrate the model in action.
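A hedged reconstruction of this flow follows. It assumes ':::'-separated records (id ::: title ::: genre ::: description for training, without the genre field for testing), as in the common Kaggle genre-classification dataset; the report's exact parsing may differ, and the sample description is invented:

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# 2. Text preprocessing: keep letters, lowercase, drop stop words, stem, rejoin
def preprocess(text):
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)

# 3. Load ':::'-separated records (assumed format, see lead-in above)
def load(path, labeled=True):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = [p.strip() for p in line.split(":::")]
            if labeled:
                rows.append({"title": parts[1], "genre": parts[2],
                             "description": preprocess(parts[3])})
            else:
                rows.append({"title": parts[1],
                             "description": preprocess(parts[2])})
    return pd.DataFrame(rows)

train_df = load("train_data.txt")
test_df = load("test_data.txt", labeled=False)

# 4-5. TF-IDF features and model training
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df["description"])
model = LogisticRegression(max_iter=1000).fit(X_train, train_df["genre"])

# 6-7. Predict on the test set and on a single custom description
test_df["predicted_genre"] = model.predict(vectorizer.transform(test_df["description"]))
sample = preprocess("A detective hunts a serial killer through the city at night.")
print(model.predict(vectorizer.transform([sample]))[0])
```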

CONCLUSION AND FUTURE WORK

6.1 CONCLUSION
In conclusion, these four projects collectively represent a significant step forward in the
application of machine learning to address diverse real-world challenges. Each project, focusing
on distinct tasks such as predicting movie genres, detecting fraudulent transactions, forecasting
customer churn, and classifying spam messages, showcases the power of data-driven solutions.

Through the application of robust methodologies encompassing data preprocessing, feature
engineering, and model training, these projects have successfully created reliable and effective
predictive systems. They provide a practical demonstration of how machine learning can be
leveraged to enhance content management, ensure financial security, improve customer
retention, and filter unwanted communications.

The use of algorithms such as Logistic Regression, alongside text processing techniques like TF-
IDF, underscores the adaptability and effectiveness of machine learning in handling varied data
formats and problem sets.

6.2 FUTURE WORK

Several avenues exist for future exploration and refinement of these projects. In particular,
enhancing the predictive accuracy of each model through the integration of additional data and
the exploration of more sophisticated machine learning algorithms is a key area for further
development.

For the movie genre prediction module, exploring more advanced NLP techniques such as
transformer-based models could provide a richer understanding of movie descriptions. For the
fraud detection and customer churn models, researching ensemble methods and deep learning
approaches could enhance detection rates and improve predictive power.

The SMS spam model would also benefit from further analysis through deep learning methods.
Additionally, expanding the range of available preprocessing and feature engineering methods
would enable better handling of edge cases and outlier data. Expanding the practical
applications of each model is also crucial. Integration of the movie genre predictor with content
recommendation systems would provide significant value to movie distribution platforms, and the
fraud detection and customer churn models could be integrated into their respective platforms
to enable real-time decision-making. The SMS spam classifier can be integrated into messaging
platforms to provide more reliable spam detection services.


Further exploration could include working with experts in different fields to bring domain-
specific insights to data analysis and feature engineering, in order to further enhance
performance.

In summary, these projects, while individually impactful, provide a solid foundation for further
development and innovation in machine learning applications. Each one is a testament to the
possibilities that arise when data science is applied to address practical, real-world
problems, and sets the stage for a future where machine learning provides more efficient, more
robust, and more secure operational capabilities across a wide variety of domains.

REFERENCES

“Affective Computing: Recent Advances, Challenges, and Future Trends.” (2023). iComputing.

Aggarwal, C. C. (2017). Outlier Analysis. Springer International Publishing.

Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2016). “Feature selection for
imbalanced datasets: A review and new perspectives.” Progress in Artificial Intelligence, 5, 1-20.

Brownlee, J. (2022). “Logistic Regression for Machine Learning.” Machine Learning Mastery.

Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T.,
Aspuru-Guzik, A., & Adams, R. P. (2015). “Convolutional networks on graphs for learning
molecular fingerprints.” Advances in Neural Information Processing Systems (NIPS).

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation,
9(8), 1735-1780.

Kingma, D., & Ba, J. (2015). “Adam: A Method for Stochastic Optimization.” International
Conference on Learning Representations.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). “Deep learning.” Nature, 521(7553), 436-444.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
Polosukhin, I. (2017). “Attention is All You Need.” Advances in Neural Information
Processing Systems.

Wijayawardhana, M. H. D., Wijayawardhana, K. A. T. K., & Rathnayake, H. A. P. (2022).
“Machine learning for customer churn prediction: a case study in the telecommunication
industry.” International Journal of Computer Science and Network Security, 22(10), 161-170.

Zhao, H., Chen, Z., Hu, J., Han, Z., & Yu, C. (2022). “A novel approach for credit card fraud
detection using machine learning.” Applied Soft Computing, 120.
