Major Projectfinal

The document presents a major project report on an AI framework designed to identify anomalous network traffic related to Mirai and BASHLITE IoT botnet attacks, submitted by students for their Bachelor of Technology degree. It discusses the limitations of existing detection systems and proposes a machine learning approach that enhances detection accuracy and efficiency by analyzing network data in real-time. The report includes sections on system analysis, design, implementation, and testing, along with acknowledgments and a declaration of originality.

A

MAJOR-PROJECT REPORT
on
AI FRAMEWORK FOR IDENTIFYING ANOMALOUS NETWORK
TRAFFIC IN MIRAI AND BASHLITE IOT BOTNET ATTACKS

Submitted in partial fulfillment of the requirements for the award of the


degree of

BACHELOR OF TECHNOLOGY

in

Computer Science and Engineering


Submitted
by

B. Bhavani (21UP1A05D6)
T. Mamatha (21UP1A05I4)
V. Chandana (21UP1A05J0)

Under the Guidance

of

Dr. M. Shalima Sulthana

(Associate Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY
FOR WOMEN
(An Autonomous Institution)
(Affiliated to Jawaharlal Nehru Technological University Hyderabad, Accredited by NBA, NAAC
with A+)
Kondapur (Village), Ghatkesar (Mandal), Medchal (Dist.)
Telangana-501301
(2021-2025)
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that the project work entitled “AI FRAMEWORK FOR
IDENTIFYING ANOMALOUS NETWORK TRAFFIC IN MIRAI AND BASHLITE
IOT BOTNET ATTACKS” submitted by B. Bhavani (21UP1A05D6), T.
Mamatha (21UP1A05I4), V. Chandana (21UP1A05J0) in the partial
fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, Vignan’s Institute of
Management and Technology for Women is a record of bona fide work carried out
by them under my guidance and supervision. The results embodied in this project
report have not been submitted to any other University or institute for the award
of any degree.

PROJECT GUIDE                        HEAD OF THE DEPARTMENT

Dr. M. Shalima Sulthana              Mrs. M. Parimala
(Associate Professor)                (Associate Professor)

(External Examiner)

DECLARATION

We hereby declare that the project entitled “AI FRAMEWORK FOR IDENTIFYING
ANOMALOUS NETWORK TRAFFIC IN MIRAI AND BASHLITE IOT BOTNET ATTACKS”,
carried out by us during the year 2024-2025 in partial fulfillment of the
requirements for the award of Bachelor of Technology in Computer Science and
Engineering from Vignan’s Institute of Management and Technology for Women,
is an authentic record of our work under the guidance of
Dr. M. Shalima Sulthana. We have not submitted the same to any other institute or
university for the award of any other degree.

B. Bhavani (21UP1A05D6)
T. Mamatha (21UP1A05I4)
V. Chandana (21UP1A05J0)
ACKNOWLEDGEMENT

We would like to express sincere gratitude to Dr. G. APPARAO NAIDU, Principal,
Vignan’s Institute of Management and Technology for Women, for his
timely suggestions which helped us to complete the project in time.

We would also like to thank Mrs. M. Parimala, Head of the
Department and Associate Professor, Computer Science and Engineering,
for providing us with constant encouragement and resources which helped us to
complete the project in time.

We would also like to thank our project guide, Dr. M. Shalima Sulthana, Associate
Professor, Computer Science and Engineering, for providing us with
constant encouragement and resources which helped us to complete the project
in time with her valuable suggestions throughout the project. We are indebted to
her for the opportunity given to work under her guidance.

Our sincere thanks to all the teaching and non-teaching staff of Department
of Computer Science and Engineering for their support throughout our project
work.

B. Bhavani (21UP1A05D6)
T. Mamatha (21UP1A05I4)
V. Chandana (21UP1A05J0)
INDEX

Contents Page No
Abstract 1
1.INTRODUCTION 2-4
1.1 Objective 2
1.2 Existing System 2-3
1.2.1 Limitations of Existing System 3
1.3 Proposed System 4
1.3.1 Advantages of Proposed Systems 4
2.LITERATURE SURVEY 5
3.SYSTEM ANALYSIS 6-10
3.1 Purpose 6
3.2 Scope 6
3.3 Feasibility Study 6
3.3.1 Economic Feasibility 7
3.3.2 Technical Feasibility 7
3.3.3 Social Feasibility 8
3.4 Requirement Analysis 8
3.4.1 Functional Requirements 8
3.4.2 Non-Functional Requirements 9
3.5 Requirements Specifications 10
3.5.1 Hardware Requirements 10
3.5.2 Software Requirements 10
3.5.3 Language Specifications 10
4.SYSTEM DESIGN 11-17
4.1 System Architecture 11
4.2 Description 11
4.3 UML Diagrams 12-17
4.3.1 Use case diagrams 13
4.3.2 Activity Diagram 14-15
4.3.3 Class Diagram 15-16
4.3.4 Sequence Diagram 17
5.IMPLEMENTATION AND RESULTS 18-21
5.1 Methods/Algorithms 18
5.2 Sample Code 19-21
6.SYSTEM TESTING 22
7.SCREENSHOTS/OUTPUT 23-26
8.CONCLUSION 27
9.FUTURE SCOPE 28
10.BIBLIOGRAPHY 29
10.1 References and Websites 29
11.PAPER PUBLISH

LIST OF FIGURES

Fig. no. Figure name Page no.
4.1 System Architecture 11
4.2 Use Case Diagram 13
4.3 Activity Diagram 15
4.4 Class Diagram 16
4.5 Sequence Diagram 17
5.1 Input code 1 19
5.2 Input code 2 19

5.3 Input code 3 20


5.4 Input code 4 20
5.5 Input code 5 21
5.6 Input code 6 21

7.1 AI Framework 23
7.2 Upload Dataset 23

7.3 Count Plot Graph 24

7.4 Train Test Splitting Dataset 24


7.5 Existing Bernoulli NBC Confusion Matrix 25
7.6 Existing RFC Confusion Matrix 25
7.7 Prediction From Dataset 26

Abstract:

In recent years, the proliferation of Internet of Things (IoT) devices has
significantly increased the volume of network traffic and the complexity of
network environments. According to the Global IoT Security Market Report, the
number of IoT devices surged from approximately 8.4 billion in 2017 to an
estimated 30.9 billion by 2025. This exponential growth has led to a
corresponding rise in sophisticated network attacks, with Mirai and BASHLITE
botnets being prominent examples. The Mirai botnet, which first emerged in 2016,
exploited IoT vulnerabilities to launch large-scale Distributed Denial of Service
(DDoS) attacks, while BASHLITE, identified in 2014, has been known for its
effective use of compromised devices in various cyber-attacks. Traditional
methods for detecting anomalies in network traffic typically involve manual
inspection and rule-based systems. These approaches face several challenges,
including the high volume of data, the dynamic nature of network traffic, and the
evolving tactics of attackers. Manual methods are often labor-intensive, prone to
human error, and struggle to keep pace with sophisticated attack techniques.
Rule-based systems, while useful, are limited by their inability to adapt to new
and previously unseen attack patterns. Machine Learning (ML) offers a promising
alternative for addressing these limitations. By leveraging algorithms capable of
learning from historical data and adapting to new patterns, ML models can
identify anomalous traffic indicative of Mirai and BASHLITE attacks more
effectively. These models can analyze vast amounts of network data in real time,
detect subtle deviations from normal behavior, and improve detection accuracy
over time. This approach not only enhances the ability to identify emerging
threats but also reduces the dependency on manual intervention, leading to more
efficient and scalable network security solutions.

CHAPTER 1

1.INTRODUCTION:

1.1 Objective:

In the contemporary digital landscape, the proliferation of Internet of Things
(IoT) devices has revolutionized various industries, leading to unprecedented
growth in network traffic and complexity. According to the Global IoT Security
Market Report, the number of IoT devices surged from approximately 8.4 billion in
2017 to an estimated 30.9 billion by 2025. This exponential increase has
simultaneously escalated the risk and frequency of sophisticated network attacks,
with the Mirai and BASHLITE botnets being particularly notorious. The Mirai
botnet, first identified in 2016, exploited vulnerabilities in IoT devices to
orchestrate large-scale Distributed Denial of Service (DDoS) attacks, while the
BASHLITE botnet, discovered in 2014, has been associated with various
cyber-attacks leveraging compromised devices. Traditional anomaly detection methods
in network traffic, which often rely on manual inspection and rule-based systems,
are increasingly inadequate in this evolving threat landscape. These conventional
approaches face significant challenges due to the vast volume of data, the
dynamic nature of network environments, and the continuously evolving tactics of
attackers. Manual methods are labour-intensive, prone to human error, and
struggle to keep pace with the sophistication of modern cyber threats. Rule-based
systems, although useful, lack the flexibility to adapt to new and previously
unseen attack patterns. In response to these limitations, Machine Learning (ML)
presents a promising solution. ML algorithms, with their capacity to learn from
historical data and adapt to new patterns, offer enhanced capabilities for
identifying anomalous traffic indicative of Mirai and BASHLITE botnet activities. By
analysing extensive network data in real time, ML models can detect subtle
deviations from normal behaviour, thereby improving detection accuracy and
response times. This approach not only bolsters the ability to identify emerging
threats but also minimizes reliance on manual intervention, paving the way for
more efficient and scalable network security solutions.

1.2 Existing System:

The existing system for detecting anomalous network traffic associated with
Mirai and BASHLITE IoT botnet attacks predominantly relies on traditional
intrusion detection systems (IDS) that utilize signature-based and rule-based
techniques. These systems are integrated into network infrastructures and aim
to identify and mitigate malicious activities by matching incoming traffic
patterns against known attack signatures or predefined rules. However, these
methods face significant limitations when dealing with the complex and
evolving nature of modern IoT botnets like Mirai and BASHLITE.

1.2.1 Limitations of Existing Systems

• Static Signatures: Since these systems rely on static signatures, they can
only detect known attacks. Any new variant or previously unseen attack pattern
will bypass detection.
• Rule-Based Detection: Rule-based IDS relies on predefined rules that are
configured to trigger alerts when certain conditions are met.
• Lack of Accessibility: Traditional methods may not be accessible to everyone,
particularly those who are visually impaired, deaf, or have limited English
proficiency.
• Lack of Adaptability: Rule-based systems are rigid and unable to adapt to
evolving threats. As attackers modify their tactics, techniques, and procedures
(TTPs), existing rules may become obsolete.
• Complex Configuration: Setting up and maintaining an extensive rule set is
complex and requires continuous updates to stay effective, making it
resource-intensive.
• Network Flow Analysis: Some existing systems incorporate network flow
analysis, which examines metadata from network traffic (e.g., source/destination
IPs, port numbers, packet counts) to detect anomalies. This method can identify
traffic that deviates from normal patterns, such as unusually high volumes of
traffic or unexpected communication between devices.
• Scalability Issues: With the massive influx of IoT devices, network flow
analysis becomes more challenging due to the increased volume and diversity of
traffic.
• Limited Granularity: Network flow data may lack the granularity needed to
pinpoint specific malicious activities, especially when encrypted traffic is
involved.
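
The static-signature limitation can be illustrated with a toy rule matcher. This is only a sketch: the rule name, the flow fields (`dst_port`, `packets_per_sec`) and the threshold are invented for illustration and are not taken from any real IDS. The one factual element is that Mirai is known to scan Telnet ports 23 and 2323.

```python
# Toy rule-based detector. Rule contents are hypothetical, for illustration only.
RULES = [
    {"name": "telnet-scan", "dst_ports": {23, 2323}, "min_packets_per_sec": 100},
]


def match_rules(flow, rules=RULES):
    """Return the names of all rules whose conditions the flow satisfies."""
    return [
        rule["name"]
        for rule in rules
        if flow["dst_port"] in rule["dst_ports"]
        and flow["packets_per_sec"] >= rule["min_packets_per_sec"]
    ]


# A flow that matches the known pattern triggers the rule ...
known_attack = {"dst_port": 23, "packets_per_sec": 450}
# ... but the same behaviour on a different port goes unnoticed.
new_variant = {"dst_port": 5555, "packets_per_sec": 450}

print(match_rules(known_attack))
print(match_rules(new_variant))
```

The second flow is just as aggressive as the first, yet no rule fires because nobody wrote one for that port. That gap is what the learning-based approach in Section 1.3 is meant to narrow.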
1.3 Proposed System

We examine network traffic from IoT devices for unusual patterns. We clean
the traffic data, encode the attack labels as numbers, and use a correlation
heatmap to identify the most informative features. Subsequently, the
preprocessed data is split into 70% training and 30% testing sets, and
performance metrics including accuracy, precision, recall, F1 score, and
confusion matrix are defined. Machine learning models, namely Naive Bayes and
Random Forest, are trained and then evaluated using the testing data. The
Random Forest model, recognized for its reliability, is selected for predicting
attack types in new data, effectively detecting IoT botnet attacks like Mirai
and BASHLITE.

1. Library Imports: Essential libraries are imported for data handling
(numpy, pandas), visualization (matplotlib, seaborn), machine learning
(sklearn), and model persistence (joblib).

2. Data Loading and Exploration: The dataset is loaded from a CSV file.
Basic exploratory data analysis (EDA) is conducted, including displaying the
first and last few rows, summary statistics, and checking for unique values in
the 'Attack' column.

3. Data Preprocessing: Missing values are identified. A heatmap is
generated to visualize correlations among features. Categorical columns
like 'Device_Name' and 'Attack_subType' are dropped, and the 'Attack' labels
are encoded using LabelEncoder.

4. Data Visualization: A count plot visualizes the distribution of different
attack categories in the dataset.

5. Feature and Target Variable Separation: The feature set (x) is
separated from the target variable (y), which represents attack classes.

6. Data Splitting: The dataset is split into training and testing sets using
a 70/30 ratio for model evaluation.

7. Performance Metrics Function: A function (performance_metrics) is
defined to calculate and print accuracy, precision, recall, F1 score, and display
a confusion matrix.

8. Model Training: Bernoulli Naive Bayes Classifier: The model is trained
on the training data, and predictions are made on the test set. If a saved
model exists, it is loaded instead of retraining. Random Forest Classifier:
Similar logic applies. The model is either trained from scratch or loaded from a
saved state.

9. Model Evaluation: Predictions from both classifiers are evaluated
using the previously defined performance metrics function.

10. Prediction on New Data: A new test dataset is loaded and
preprocessed similarly. Predictions are made using the Random Forest model,
and results are printed, labeling each entry with the corresponding attack
type.
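
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic rather than loaded from the CSV, the feature names `feature_a` and `feature_b` and the model file names are invented, and macro averaging for precision, recall and F1 is one reasonable choice for a multi-class problem that the report does not specify.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder

# Step 2 stand-in: synthetic rows instead of the real CSV (assumption).
rng = np.random.default_rng(0)
n = 300
labels = rng.choice(["Normal", "Mirai", "BASHLITE"], size=n)
offset = np.select([labels == "Mirai", labels == "BASHLITE"], [3.0, -3.0], default=0.0)
df = pd.DataFrame({
    "feature_a": rng.normal(0.0, 1.0, n) + offset,  # hypothetical feature
    "feature_b": rng.normal(0.0, 1.0, n) - offset,  # hypothetical feature
    "Attack": labels,
})

# Steps 3 and 5: encode the labels, then separate features from the target.
encoder = LabelEncoder()
y = encoder.fit_transform(df["Attack"])
X = df.drop(columns=["Attack"])

# Step 6: 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)


# Step 7: one function that reports all metrics for a set of predictions.
def performance_metrics(name, y_true, y_pred):
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
    print(name, scores)
    print(confusion_matrix(y_true, y_pred))
    return scores


# Step 8: load a saved model if one exists, otherwise train and save it.
def train_or_load(model, path, X_fit, y_fit):
    if os.path.exists(path):
        return joblib.load(path)
    model.fit(X_fit, y_fit)
    joblib.dump(model, path)
    return model


model_dir = tempfile.mkdtemp()
rf_path = os.path.join(model_dir, "rf_model.joblib")  # hypothetical file name
nb = train_or_load(BernoulliNB(), os.path.join(model_dir, "nb_model.joblib"),
                   X_train, y_train)
rf = train_or_load(RandomForestClassifier(n_estimators=50, random_state=42),
                   rf_path, X_train, y_train)
rf_again = train_or_load(RandomForestClassifier(), rf_path, X_train, y_train)  # loaded, not retrained

# Step 9: evaluate both classifiers on the held-out 30%.
nb_scores = performance_metrics("Bernoulli Naive Bayes", y_test, nb.predict(X_test))
rf_scores = performance_metrics("Random Forest", y_test, rf.predict(X_test))

# Step 10: predict on "new" rows and map the codes back to attack names.
predicted_names = encoder.inverse_transform(rf.predict(X_test.head(5)))
print(list(predicted_names))
```

Caching the fitted model with joblib saves training time on repeated runs, but a stale file will be loaded silently if the dataset or preprocessing changes, so the saved model should be deleted whenever the training data is updated.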

Data Preprocessing and Splitting: The process begins with the collection and
loading of the dataset into the environment, specifically targeting traffic data
related to IoT devices under attack by Mirai and BASHLITE botnets. This data is
typically rich with various network parameters, including packet size,
transmission time, source and destination IPs, etc.

Data Preprocessing:
Null Value Removal: The dataset undergoes initial preprocessing where null or
missing values are identified and removed. Missing data can skew the results
or lead to inaccuracies in model predictions. Techniques like removing rows
with missing data or imputing values based on statistical measures may be
used.
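
The two options mentioned here, dropping rows and imputing values, can be compared on a toy table. The column names and values below are made up for illustration, and median imputation is just one of several possible statistical fills.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; column names are illustrative only.
df = pd.DataFrame({
    "packet_size": [60.0, np.nan, 1500.0, 64.0],
    "duration": [0.2, 0.4, np.nan, 0.1],
    "Attack": ["Normal", "Mirai", "BASHLITE", "Normal"],
})

# Identify missing values per column.
print(df.isnull().sum())

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill numeric gaps with the column median.
numeric_cols = df.select_dtypes(include="number").columns
imputed = df.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(len(dropped), len(imputed))
```

Dropping is simpler but discards labelled attack rows along with the gaps. Imputation keeps them at the cost of introducing estimated values.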

Label Encoding: After handling null values, categorical variables are converted
into numerical formats using label encoding. In this specific context, attack
labels (e.g., Normal, BASHLITE, Mirai) are encoded into numerical values to
make them interpretable by machine learning models.
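
As a small illustration of this step, scikit-learn's `LabelEncoder` assigns integer codes in sorted order of the class names. The label list below is a made-up sample.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Normal", "Mirai", "BASHLITE", "Mirai", "Normal"]  # sample labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Codes follow the sorted class names, so the mapping is reproducible.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted codes back into readable attack names.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same fitted encoder should be kept for the prediction stage, so that the codes the model outputs map back to the same attack names.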
Heatmap Visualization: A correlation heatmap is generated to visualize the
relationships between different features in the dataset. This step helps in
identifying which features contribute most significantly to the classification
task, guiding the selection of important variables for the model.
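
A correlation matrix of this kind can be computed with pandas and then drawn as a heatmap. The sketch below uses synthetic data with invented column names. Note that a correlation matrix shows linear relationships only: it can flag redundant features and features linearly related to an encoded label, but it is a rough guide rather than a full measure of feature importance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
target = rng.integers(0, 2, n)  # toy binary label: 0 = benign, 1 = attack

df = pd.DataFrame({
    "pkt_rate": target * 5.0 + rng.normal(0, 1, n),  # tied to the label
    "pkt_size": rng.normal(500, 50, n),              # unrelated noise
    "Attack": target,
})
df["pkt_rate_copy"] = df["pkt_rate"] * 2.0           # redundant duplicate

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank features by the strength of their correlation with the label.
ranking = corr["Attack"].drop("Attack").abs().sort_values(ascending=False)
print(ranking)


def plot_heatmap(matrix):
    """Draw the matrix as a heatmap (needs matplotlib and seaborn installed)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, cmap="coolwarm")
    plt.show()
```

Calling `plot_heatmap(corr)` renders the figure. The near-perfect correlation between `pkt_rate` and `pkt_rate_copy` is the kind of redundancy a heatmap makes easy to spot.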

Data Splitting: Training and Testing Set Creation: The preprocessed data is
split into training and testing sets, usually in a 70-30 ratio. The training set is
used to build the model, while the testing set is reserved for evaluating its
performance. This step is critical to ensure the model can generalize well to
new, unseen data.
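
A 70-30 split can be produced with scikit-learn's `train_test_split`. In this sketch the arrays are placeholders, and the `stratify` argument, which keeps class proportions similar in both sets, is an addition the report does not mention.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features, 3 imbalanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 30 + [2] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,    # 30% held out for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # assumption: preserve class ratios in both sets
)

print(len(X_train), len(X_test))
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters when some attack classes are rare, because a purely random split can leave too few examples of a class in the test set to evaluate it reliably.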

Random Forest Classifier Overview: Random Forest is an ensemble learning
method that constructs multiple decision trees during training and outputs the
mode of their predictions for classification (or average for regression). It
improves accuracy and controls overfitting by aggregating predictions from
several trees.
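
The aggregation rule described here is simple enough to show directly. The sketch below takes a list of per-tree predictions (invented values) and combines them by majority vote or by averaging. It illustrates the idea only: scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Classification: return the most common label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def mean_prediction(tree_outputs):
    """Regression: return the average of the trees' numeric outputs."""
    return sum(tree_outputs) / len(tree_outputs)


# Hypothetical predictions from five trees for one traffic record.
votes = ["Mirai", "Mirai", "Normal", "BASHLITE", "Mirai"]
print(majority_vote(votes))
print(mean_prediction([1.0, 2.0, 3.0]))
```

Because each tree is trained on a different bootstrap sample and feature subset, individual trees make different mistakes, and the vote tends to cancel them out.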

Key Characteristics:

• Ensemble Method: Combines the predictions of multiple trees, which
helps reduce overfitting compared to individual decision trees.

• Feature Importance: Can identify the importance of different features
in making predictions, aiding in model interpretability.

• Robustness: Handles non-linear relationships and interactions
between features effectively.
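
The feature-importance property can be inspected through the `feature_importances_` attribute of a fitted scikit-learn forest. The example below uses synthetic data with one informative and one random feature. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)                     # toy binary label
informative = y * 4.0 + rng.normal(0, 1, n)   # depends on the label
noise = rng.normal(0, 1, n)                   # independent of the label

X = np.column_stack([informative, noise])
feature_names = ["informative", "noise"]      # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
print(importances)
```

These impurity-based scores are quick to obtain but can be biased towards high-cardinality features, so they are best read as a ranking hint rather than an exact measure.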

Performance: Generally provides higher accuracy and robustness than Naive
Bayes, especially with complex datasets and when features are correlated.

1.3.1 Advantages of the Proposed System

1. Enhanced Detection Accuracy: Utilizing ensemble methods like Random Forest
improves accuracy in identifying complex attack patterns compared to simpler
models, leading to more reliable detection of IoT botnet activities.
2. Robustness to Noise and Outliers: The Random Forest classifier's inherent
ability to mitigate the impact of noise and outliers results in more stable
predictions, enhancing overall system reliability.
3. Real-Time Anomaly Detection: The system can be deployed for real-time
monitoring of network traffic, enabling immediate detection and response to
potential threats, thereby minimizing damage.
4. Feature Importance Analysis: The Random Forest model provides insights into
feature importance, allowing stakeholders to understand which attributes
contribute most significantly to attack detection, aiding in network security
strategy formulation.
5. Scalability: The proposed system can scale to accommodate larger datasets as
IoT environments expand, ensuring consistent performance and adaptability to
increasing data volumes.
6. Flexibility in Handling Diverse Data Types: The approach can handle a variety of
feature types (categorical, numerical) effectively, making it suitable for diverse IoT
network configurations.

CHAPTER 2
2.LITERATURE SURVEY

| Author | Year | Title | Objective | Methodology | Key Findings |
|---|---|---|---|---|---|
| Abomhara and Køien | 2015 | Cyber Security and the Internet of Things: Vulnerabilities, Threats, Intruders and Attacks | To examine the various cybersecurity challenges posed by IoT devices and highlight inherent vulnerabilities in IoT ecosystems. | Analysis of inadequate security measures in device design and deployment. | Inherent vulnerabilities stem from inadequate security. A holistic approach to IoT security is crucial. |
| Andrea et al. | 2015 | Internet of Things: Security Vulnerabilities and Challenges | To discuss the security vulnerabilities and challenges associated with IoT, outlining key security concerns. | Proposed a layered security model recommending encryption, secure bootstrapping, and continuous monitoring. | Key security concerns include data privacy, device authentication, and secure communication. |
| Deogirikar and Vidhate | 2017 | Security attacks in IoT: a survey | To explore various security attacks targeting IoT systems. | Categorized attacks into physical attacks, network attacks, and software attacks. | Machine learning and anomaly detection play a role in identifying and mitigating attacks. |


CHAPTER 3

3. SYSTEM ANALYSIS
3.1 Purpose

The primary purpose of this AI framework is to enhance the cybersecurity
posture of IoT networks by proactively and accurately detecting anomalous
network traffic indicative of Mirai and BASHLITE botnet attacks.

This aims to:

• Minimize the impact of IoT botnet attacks: By enabling early detection, the
framework can help prevent or mitigate Distributed Denial of Service (DDoS)
attacks, data exfiltration, and other malicious activities launched by compromised
IoT devices.
• Protect vulnerable IoT devices: Many IoT devices have inherent
security weaknesses (e.g., default credentials, unpatched vulnerabilities). The
framework acts as an external layer of defense to identify when these devices are
being exploited.
• Improve network resilience: By identifying and alerting about
botnet activities, network administrators can take timely action to isolate infected
devices and prevent further propagation.
• Automate threat detection: Leverage
AI/ML to move beyond signature-based detection, which can be ineffective against
new or polymorphic botnet variants, towards anomaly-based detection that can
identify novel attack patterns.

3.2 Scope

• Targeted Botnets: Specifically focuses on the detection of network traffic
anomalies related to Mirai and BASHLITE botnets due to their significant impact
and distinct attack patterns.
• Network Traffic Analysis: The framework will primarily
analyze network traffic data (e.g., flow data, packet headers) collected from
various points within an IoT network or at network gateways.
• Anomaly Detection:
The core functionality involves identifying deviations from normal network
behavior that signal malicious activity.
• IoT Device Focus: While the network traffic
is the primary input, the ultimate goal is to protect and identify compromised IoT
devices within the network.
• Machine Learning/Deep Learning Techniques: The
framework will utilize various AI/ML algorithms for classification, clustering, and
pattern recognition.
• Data Sources: Utilizes datasets containing both benign and
botnet-infected IoT network traffic (e.g., the N-BaIoT dataset).

3.3 Feasibility Study


A feasibility study assesses whether the project is achievable and
worthwhile. Three key considerations involved in the feasibility analysis are:
• ECONOMIC FEASIBILITY
• TECHNICAL FEASIBILITY
• SOCIAL FEASIBILITY

3.3.1 Economic Feasibility

Cost of Implementation:
• Hardware: Servers/cloud infrastructure for data processing, AI model training, and
deployment.
• Software: AI/ML libraries, data analytics tools, network monitoring
tools.
• Personnel: Data scientists, AI engineers, cybersecurity analysts.
• Data Acquisition: Cost of acquiring or generating relevant datasets for training and
testing.

Cost of Inaction:
• Financial losses from DDoS attacks: Business disruption, revenue
loss, recovery costs.
• Reputational damage: Loss of customer trust.
• Data breaches: Regulatory fines, intellectual property theft.
• Resource consumption: Increased
bandwidth usage, device performance degradation.

Potential Benefits:
• Reduced financial losses due to botnet attacks.
• Improved system uptime and reliability.
• Enhanced brand reputation.
• Compliance with security regulations.
• Potential for commercialization as a security product/service.

3.3.2 Technical Feasibility

Existing Technologies:
• Machine Learning/Deep Learning Libraries: TensorFlow, PyTorch, scikit-learn,
etc., are mature and well-documented.
• Network Monitoring Tools: Wireshark, tcpdump, and NetFlow collectors provide
the necessary data.
• Big Data Platforms: Apache Spark and Hadoop can handle large volumes of
network traffic data.
• Cloud Computing: Offers scalable infrastructure for processing and analysis.

Data Availability: Datasets like N-BaIoT, which specifically contain Mirai and
BASHLITE traffic, are publicly available and can be used for model training and
evaluation.

Algorithm Suitability: AI algorithms (e.g., SVM, Random Forests, Neural
Networks, Autoencoders, Isolation Forests, One-Class SVM) have proven effective
in anomaly detection and botnet classification in network traffic.

Integration Challenges: Integrating the framework with existing network
infrastructure and security systems (e.g., SIEM, IDS) is a technical challenge
that needs to be addressed, but is generally achievable.

Computational Resources: Training complex AI models requires significant
computational power, but this can be addressed through cloud computing or
specialized hardware (GPUs/TPUs).

3.3.3 Social Feasibility

• User Acceptance: Cybersecurity professionals and network administrators
are generally receptive to advanced tools that improve threat detection and
reduce manual effort.
• Privacy Concerns: Handling network traffic data raises
privacy concerns. The framework must be designed with data anonymization,
aggregation, and strict access controls to comply with privacy regulations (e.g.,
GDPR, CCPA).
• Ethical Implications of AI: Ensuring the AI models are fair,
transparent, and do not introduce bias in detection is important. Explainable AI
(XAI) techniques can contribute to trust and understanding.
• Skill Gap: A potential
challenge is the availability of personnel with expertise in both cybersecurity and
AI/ML. Training programs or expert consultation might be required.
• Public Perception: Addressing public concerns about AI surveillance and data
privacy in network monitoring is crucial for broader adoption.
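
One common way to reduce the privacy exposure of stored traffic records is to replace IP addresses with keyed hashes before analysis. The sketch below is an illustration of that idea only. The key, the function name and the 16-character token length are assumptions, and keyed hashing is pseudonymization rather than full anonymization, so it does not by itself guarantee compliance with any regulation.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize_ip(ip, key=SECRET_KEY):
    """Map an IP address to a stable token that hides the original value."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened token, still consistent per address


token_a = pseudonymize_ip("192.168.1.10")
token_b = pseudonymize_ip("192.168.1.11")
print(token_a, token_b)
```

Because the same address always maps to the same token, per-device traffic patterns remain usable as model features while the raw address is not stored.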

3.4 Requirement Analysis

This section outlines the essential characteristics and functionalities the AI
framework must possess.

3.4.1 Functional Requirements

Network Traffic Data Ingestion: The system shall be able to collect and ingest
network traffic data (e.g., NetFlow, IPFIX, packet captures) from various network
devices or sensors.

Data Preprocessing and Feature Extraction: The system shall preprocess raw
network traffic data, including cleaning, normalization, and extracting relevant
features (e.g., packet size, inter-arrival time, flow duration, protocol types, port
numbers).
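
As a rough illustration of feature extraction, the sketch below derives a few of the listed features from the packets of a single flow. The packet tuples and feature names are invented, and a real deployment would read them from NetFlow records or packet captures instead.

```python
from statistics import mean

# Hypothetical packets of one flow: (timestamp in seconds, size in bytes).
packets = [(0.00, 60), (0.05, 1500), (0.15, 1500), (0.30, 64)]


def flow_features(packets):
    """Summarize a time-ordered list of packets as per-flow features."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Gap between each packet and the one before it.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_packet_size": mean(sizes),
        "flow_duration": times[-1] - times[0],
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
    }


features = flow_features(packets)
print(features)
```

Each flow then becomes one numeric row, which is the tabular form the classifiers in this report expect.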

Anomaly Detection Model Training: The system shall support training AI/ML
models (e.g., classifiers, clustering algorithms, autoencoders) on historical
network traffic datasets containing both benign and known botnet attack patterns
(Mirai, BASHLITE).

Real-time Anomaly Detection: The system shall be able to analyze incoming live
network traffic streams and identify anomalous patterns in real-time or near
real-time.

Botnet Attack Classification: The system shall be able to classify detected
anomalies specifically as Mirai or BASHLITE botnet attacks (or general IoT botnet
attacks if specific classification is not feasible with high accuracy).

Alert Generation and Notification: Upon detecting a botnet attack, the system
shall generate alerts and send notifications to designated security personnel or
integrated security systems (e.g., SIEM, ticketing system).

Visualization and Reporting: The system shall provide a user interface or reporting
capabilities to visualize detected anomalies, attack trends, and key performance
metrics (e.g., detection rate, false positive rate).

Model Management: The system shall allow for managing, updating, and
retraining AI/ML models as new attack patterns emerge or network behavior
changes.

Data Storage: The system shall securely store network traffic data, extracted
features, and model training results for analysis and auditing.

Output Design
Outputs from computer systems are required primarily to communicate the
results of processing to users. They are also used to provide a permanent copy
of the results for later consultation. The various types of outputs in general
are:
• External outputs, whose destination is outside the organization.
• Internal outputs, whose destination is within the organization and which are
the user’s main interface with the computer.

Output Definition
The outputs should be defined in terms of the following points:
• Type of the output
• Content of the output
• Format of the output
• Location of the output
• Frequency of the output

• Volume of the output
• Sequence of the output

Input Design
Input design is a part of overall system design. The main objective during the
input design is as given below:
• To produce a cost-effective method of input.
• To achieve the highest possible level of accuracy.
• To ensure that the input is acceptable and understood by the user.

Input Types
It is necessary to determine the various types of inputs. Inputs can be
categorized as follows:
• External inputs, which are prime inputs for the system.
• Internal inputs, which are user communications with the system.
• Operational inputs, which are the computer department’s communications to the system.
• Interactive inputs, which are entered during a dialogue.

Input Media
At this stage a choice has to be made about the input media. In deciding on the
input media, consideration has to be given to:

• Type of input
• Flexibility of format
• Speed
• Accuracy
• Verification methods
• Rejection rates
• Ease of correction
• Storage and handling requirements
• Security
• Ease of use
• Portability

Keeping in view the above description of the input types and input media, it can
be said that most of the inputs are internal and interactive. As input data is
keyed in directly by the user, the keyboard can be considered the most suitable
input device.

Data Validation
Procedures are designed to detect errors in data at a lower level of detail. Data
validations have been included in the system in almost every area where there is
a possibility for the user to commit errors. The system will not accept invalid data.
Whenever invalid data is keyed in, the system immediately prompts the user,
who has to key in the data again, and the system accepts the data only
if it is correct. The system is designed to be user-friendly. In other words, it
has been designed to communicate effectively with the user, using popup menus.

Computer-Initiated Interfaces

The following computer-initiated interfaces were used:

• A menu system, in which the user is presented with a list of alternatives and
chooses one of them.
• A question-and-answer dialog system, in which the computer asks a question and
takes action based on the user's reply.

Right from the start, the system is menu driven: the opening menu
displays the available options. Choosing one option gives another popup menu
with more options. In this way every option leads the user to a data entry form
where the user can key in the data.

Performance is measured in terms of the output provided by the application.
Requirement specification plays an important part in the analysis of a system.
Only when the requirement specifications are properly given is it possible to
design a system that will fit into the required environment. It rests largely with
the users of the existing system to give the requirement specifications,
because they are the people who will finally use the system. The
requirements have to be known during the initial stages so that the system can
be designed according to them. It is very difficult to change the
system once it has been designed, and a system that does not cater to the
requirements of the user is of no use.

3.4.2 Non-Functional Requirements

Performance: Detection Latency: The system shall detect critical anomalies within
a specified time frame (e.g., milliseconds for real-time alerts). Throughput: The
system shall be able to process a high volume of network traffic (e.g., gigabits per
second) without significant performance degradation.

Accuracy: The system shall achieve a high detection rate (True Positive Rate) for
Mirai and Bashlite attacks (e.g., >95%) while maintaining a low false positive rate
(e.g., <5%).
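
As an illustration of how the accuracy requirement could be checked, the sketch below computes the True Positive Rate and False Positive Rate from actual and predicted labels and compares them with the example thresholds above. The function names and the "attack"/"benign" labels are assumptions made for this example; they do not come from the project code.

```python
def detection_rates(y_true, y_pred, positive="attack"):
    """Return (true_positive_rate, false_positive_rate) for a binary labelling."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # attack correctly detected
        elif actual == positive:
            fn += 1  # attack missed
        elif predicted == positive:
            fp += 1  # benign traffic flagged as attack
        else:
            tn += 1  # benign traffic correctly passed
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def meets_accuracy_targets(tpr, fpr, min_tpr=0.95, max_fpr=0.05):
    """Check the example thresholds from the requirement (>95% TPR, <5% FPR)."""
    return tpr > min_tpr and fpr < max_fpr


# Hypothetical labels: 4 attack records and 4 benign records.
y_true = ["attack"] * 4 + ["benign"] * 4
y_pred = ["attack", "attack", "attack", "benign",
          "benign", "benign", "benign", "attack"]
tpr, fpr = detection_rates(y_true, y_pred)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} ok={meets_accuracy_targets(tpr, fpr)}")
```

With these sample labels one attack is missed and one benign record is flagged, so the targets are not met.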

Scalability: The framework shall be scalable to handle increasing network traffic
volumes and a growing number of IoT devices.

Reliability: The system shall be robust and operate continuously without
significant downtime. Redundancy mechanisms should be considered.

Security: The system itself shall be secured against unauthorized access, data
tampering, and internal/external threats. Data privacy and anonymization
techniques must be implemented.

Maintainability: The framework's codebase and architecture shall be modular,
well-documented, and easy to maintain and update.

Usability: The user interface for configuration, monitoring, and reporting shall be
intuitive and user-friendly for cybersecurity analysts.

Interoperability: The framework shall be able to integrate with existing network
infrastructure, security tools, and data platforms using standard protocols and
APIs.

Resource Efficiency: The framework should optimize its use of computational
resources (CPU, memory, storage) while maintaining performance.

3.5 Requirements Specifications


3.5.1 Hardware Requirements

For Development and Testing:

• Processor: Multi-core processor (Intel i5/i7, AMD Ryzen 5, or better)
• RAM: At least 8GB (16GB or more recommended for handling larger
datasets)
• Storage: 256GB HDD or SSD (for datasets and software); SSD preferred for
faster data access

3.5.2 Software Requirements

• Operating System: Windows 10/11
• Python 3.7 (IDLE)

3.5.3 Language Specifications

The programming languages and technologies chosen for the project are listed
below.

Core Programming Language: Python, chosen for its extensive libraries for AI/ML
(TensorFlow, PyTorch, Scikit-learn, Keras), data manipulation (Pandas, NumPy),
and network programming.

Data Processing and Big Data: Python with Spark/Pandas, for large-scale data
ingestion, cleaning, and feature engineering.

CHAPTER 4

4. SYSTEM DESIGN

4.1 System Architecture

Figure 4.1: System Architecture

4.2 Description

The architecture is designed to identify malicious IoT network traffic using
machine learning models. It consists of multiple interconnected components:

1. User Device: The front-end or client interface. Users interact with the system
via a web-based dashboard to upload traffic data or view detection results.

2. Web Server: Acts as a communication bridge between the user and the
backend system. It sends input data (e.g., captured network traffic) from the user
to the Application Server, receives performance metrics or detection results
from the backend, and displays them to the user.

3. Database Server: Contains the Botnet Dataset, which includes both training
and testing data for known IoT attacks (Mirai, Bashlite). It provides this data
to the Application Server for model training and evaluation.

4. Application Server: The core processing unit of the framework. It includes:
• Data Preprocessing Module: Cleans, normalizes, and prepares raw network data
for analysis.
• Performance Metrics Module: Evaluates models based on accuracy, precision,
recall, etc.
It sends the processed data to the machine learning models and returns the
results.

5. Model Storage Server: Stores pre-trained ML models such as:
• Bernoulli Naive Bayes
• Random Forest

4.3 UML Diagrams

4.3.1 Use case diagram


This use case diagram illustrates the major interactions between system
users (actors) and the core functionalities of the AI framework used for
identifying anomalous IoT botnet traffic such as Mirai and Bashlite.

Actors:

1. Attacker: Represents malicious entities responsible for initiating or
simulating IoT botnet attacks. Engages with the system to generate botnet
activity.

2. Data Source: Refers to traffic-generating devices, logs, or tools that provide
network traffic data (both normal and malicious). Initiates multiple actions
within the system.

Use Cases (System Functions):

1. IoT Botnets: Represents the occurrence or simulation of IoT-based botnet
attacks (like Mirai and Bashlite). Attack traffic is captured and analyzed.

2. Data Collection: Involves gathering raw network traffic from various devices
or simulations. This includes both normal traffic and malicious botnet traffic.

3. Data Preprocessing: Raw traffic data is cleaned, filtered, and transformed.
This prepares the dataset for machine learning model training.

4. Random Forest Model Training: The preprocessed data is used to train a
Random Forest classifier. The model learns to differentiate between normal and
malicious traffic.

5. Performance Evaluation: The trained model is tested using metrics like
accuracy, precision, recall, and F1-score. This helps in validating model
effectiveness.

6. Results Display: The outcome of detection and evaluation is displayed to the
user or analyst. It shows whether the traffic was normal or anomalous.

Figure 4.2: Use Case diagram

4.3.2 Activity Diagram

• Start Node: The process begins with a solid black circle, indicating the start
of the activity.
• Load Dataset: The first activity fetches the data that will be used for
machine learning.
• Remove Null Values: A common data preprocessing task that handles missing
data.
• Encode Labels: Converts categorical labels into a numerical format suitable
for machine learning algorithms.
• Visualize Heatmap: A data exploration step, likely used to understand
correlations between features in the dataset.
• Split Data: The dataset is split, typically into training and testing sets.
• Decision Node (RandomForest Model Exists): A diamond shape representing a
decision point that checks whether a Random Forest model already exists. If yes,
the flow proceeds to "Load RandomForest Model"; if no, it proceeds to "Train
RandomForest Model."
• Merge Node: After either loading or training the model, the two paths merge
before proceeding to the next step.
• Predict RandomForest Model: The trained or loaded Random Forest model makes
predictions on new or unseen data (likely the test set).
• Evaluate RandomForest Performance: The predictions are compared against the
actual values using metrics like accuracy, precision, recall, etc.
• Complete Results: The evaluation outcomes are finalized or aggregated.
• Display Results: The performance metrics and other relevant outcomes are
presented to the user.
• End Node: The process concludes with a solid black circle surrounded by an
outer circle, representing the end of the activity.
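
The decision node in the activity flow (load a saved Random Forest model if one exists, otherwise train one) can be sketched as follows. This is a minimal illustration, not the project's code: the use of joblib for persistence, the file name random_forest.joblib, the helper name load_or_train, and the synthetic data standing in for the botnet dataset are all assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_or_train(model_path, x_train, y_train):
    """Load a saved Random Forest if one exists, otherwise train and save it."""
    if os.path.exists(model_path):
        return joblib.load(model_path), "loaded"
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(x_train, y_train)
    joblib.dump(model, model_path)  # persist so the next run can skip training
    return model, "trained"


# Synthetic stand-in for the botnet dataset (assumption, not the real data).
x, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "random_forest.joblib")  # hypothetical file name
    model, first_source = load_or_train(path, x_train, y_train)   # no file yet
    model, second_source = load_or_train(path, x_train, y_train)  # file exists
    accuracy = accuracy_score(y_test, model.predict(x_test))

print(first_source, second_source, round(accuracy, 3))
```

The first call takes the "train" branch and the second takes the "load" branch, after which both paths merge into the same predict and evaluate steps.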

Figure 4.3: Activity Diagram

4.3.3 Class Diagram

The class diagram is used to refine the use case diagram and define a detailed
design of the system. The class diagram classifies the actors defined in the use
case diagram into a set of interrelated classes. The relationship or association
between the classes can be either an “is-a” or “has-a” relationship. Each class in
the class diagram may be capable of providing certain functionalities. These
functionalities provided by the class are termed “methods” of the class. Apart
from this, each class may have certain “attributes” that uniquely identify the
class.

Data Preprocessing: Manages raw and cleaned traffic data, with methods for tasks
like removing null values, encoding labels, visualizing data (heatmap), and
splitting datasets. It feeds processed data to the classifiers.

BernoulliNBClassifier: Represents a Bernoulli Naive Bayes model, holding trained
features and labels. Its methods include loading, training, predicting, and
evaluating model performance. It sends evaluation data to performanceMetrics.

RandomForestClassifier: Represents a Random Forest model, similar to the
BernoulliNBClassifier in its attributes (trained features, labels) and methods
(load, train, predict, evaluate performance). It also sends evaluation data to
performanceMetrics.

performanceMetrics: This class is responsible for evaluating the classifiers. It
holds true and predicted labels and provides methods to calculate standard
metrics like accuracy, precision, and recall, and to generate a confusion matrix.
It receives input from both classifier classes.
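
To make the role of the performanceMetrics class concrete, here is a small pure-Python sketch of a class that holds true and predicted labels and derives a confusion matrix, accuracy, and per-class precision and recall. The method names, the dictionary-style confusion matrix, and the sample labels are assumptions for illustration; the class diagram does not specify them at this level of detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PerformanceMetrics:
    true_labels: list
    predicted_labels: list

    def confusion_matrix(self):
        """Map each (actual, predicted) pair to the number of times it occurs."""
        return Counter(zip(self.true_labels, self.predicted_labels))

    def accuracy(self):
        """Fraction of records whose predicted label equals the true label."""
        correct = sum(t == p for t, p in zip(self.true_labels, self.predicted_labels))
        return correct / len(self.true_labels)

    def precision(self, label):
        """Of the records predicted as `label`, the fraction that really are."""
        cm = self.confusion_matrix()
        predicted = sum(n for (_, p), n in cm.items() if p == label)
        return cm[(label, label)] / predicted if predicted else 0.0

    def recall(self, label):
        """Of the records that really are `label`, the fraction that were found."""
        cm = self.confusion_matrix()
        actual = sum(n for (t, _), n in cm.items() if t == label)
        return cm[(label, label)] / actual if actual else 0.0


# Hypothetical labels; either classifier could supply the predictions.
metrics = PerformanceMetrics(
    true_labels=["mirai", "mirai", "bashlite", "benign", "benign"],
    predicted_labels=["mirai", "bashlite", "bashlite", "benign", "mirai"],
)
print(metrics.accuracy(), metrics.precision("mirai"), metrics.recall("bashlite"))
```

Because the class only needs two label sequences, the same instance type can receive output from both classifier classes, which matches the association shown in the diagram.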

Figure 4.4: Class Diagram

4.3.4 Sequence Diagram
This UML Sequence Diagram illustrates the interactive flow for a machine
learning pipeline involving data processing, model training, prediction, and
evaluation. The process begins with a user initiating loadDataset() in the
DataPreprocessing component, which then sequentially performs
removeNullValues(), encodeLabels(), visualizeHeatmap(), and splitData().
Subsequently, the user triggers the workflow for two distinct classifiers: the
BernoulliNBClassifier and the RandomForestClassifier.

Figure 4.5: Sequence Diagram

CHAPTER 5

5. IMPLEMENTATION AND RESULTS

5.1 Methods/Algorithms

Data Loading and Exploration: The dataset is loaded from a CSV file and explored
by displaying the first and last few rows, printing summary statistics, and
checking the unique values in the 'Attack' column.
Data Preprocessing: Missing values are identified. A heatmap is generated to
visualize correlations among features. Categorical columns like 'Device_Name'
and 'Attack_subType' are dropped, and the 'Attack' labels are encoded using
LabelEncoder.
Data Visualization: A count plot visualizes the distribution of different attack
categories in the dataset.
Feature and Target Variable Separation: The feature set (x) is separated
from the target variable (y), which represents attack classes.
Data Splitting: The dataset is split into training and testing sets using a 70/30
ratio for model evaluation.
Performance Metrics Function: A function (performance_metrics) is defined to
calculate and print accuracy, precision, recall, and F1 score, and to display a
confusion matrix.
Model Training: Each classifier (Bernoulli Naive Bayes and Random Forest) is
either trained from scratch or loaded from a saved state.
Model Evaluation: Predictions from both classifiers are evaluated using the
previously defined performance metrics function.
Prediction on New Data: A new test dataset is loaded and preprocessed
similarly. Predictions are made using the Random Forest model, and results are
printed, labeling each entry with the corresponding attack type.
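
The steps listed in this section can be sketched end to end as below. This is a hedged approximation rather than the project's actual code: the column names 'Device_Name', 'Attack_subType', and 'Attack' and the function name performance_metrics come from the description, while the synthetic data, the feature_1 and feature_2 columns, the attack label values, macro averaging, and the hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the CSV dataset (assumption, not the real data).
rng = np.random.default_rng(0)
n = 300
attack = rng.choice(["benign", "mirai", "bashlite"], size=n)
shift = pd.Series(attack).map({"benign": 0.0, "mirai": 3.0, "bashlite": -3.0}).to_numpy()
df = pd.DataFrame({
    "Device_Name": rng.choice(["camera", "doorbell"], size=n),
    "Attack_subType": "n/a",
    "feature_1": rng.normal(size=n) + shift,
    "feature_2": rng.normal(size=n) - shift,
    "Attack": attack,
})

# Preprocessing: drop nulls and categorical columns, then encode the labels.
df = df.dropna().drop(columns=["Device_Name", "Attack_subType"])
encoder = LabelEncoder()
df["Attack"] = encoder.fit_transform(df["Attack"])

# Feature/target separation and the 70/30 split.
x = df.drop(columns=["Attack"])
y = df["Attack"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)


def performance_metrics(name, y_true, y_pred):
    """Print and return accuracy, precision, recall, and F1 (macro-averaged)."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
    print(name, scores)
    print(confusion_matrix(y_true, y_pred))
    return scores


# Train and evaluate both classifiers with the same metrics function.
results = {}
models = [("BernoulliNB", BernoulliNB()),
          ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=0))]
for name, model in models:
    model.fit(x_train, y_train)
    results[name] = performance_metrics(name, y_test, model.predict(x_test))

# Prediction on new data: map encoded predictions back to attack names.
# After the loop, `model` is the Random Forest classifier.
predicted_names = encoder.inverse_transform(model.predict(x_test.head()))
print(predicted_names)
```

The final step uses LabelEncoder.inverse_transform so that each prediction is reported with its attack name rather than its integer code.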

5.2 Sample Code

Fig 5.1 Input code 1

Fig 5.2 Input code 2

Fig 5.3 Input code 3

Fig 5.4 Input code 4

Fig 5.5 Input code 5

Fig 5.6 Input code 6

CHAPTER 6

6. SYSTEM TESTING
System testing for an "AI Framework for Identifying Anomalous Network
Traffic in Mirai and Bashlite IoT Botnet Attacks" is the final, comprehensive stage
of testing, ensuring the entire integrated solution meets its specified
requirements before deployment. Its primary goal is to validate the system's
ability to reliably and accurately detect the specific attack patterns of Mirai and
Bashlite botnets within diverse network traffic. This involves:

End-to-End Functional Validation: Verifying that data ingestion, preprocessing
(like null value handling and feature extraction), AI model execution (training,
prediction), anomaly detection, and alert generation all work seamlessly together.

Performance Evaluation: Assessing the framework's throughput (how much
traffic it can process), latency (how quickly it detects and alerts), and resource
consumption (CPU, memory) under realistic network loads, which is crucial for
real-time threat detection.

Accuracy and Robustness: Rigorously testing the AI models against both
known botnet signatures and variations, as well as a wide range of benign traffic,
to ensure high true positive rates and very low false positive rates.

Security Testing: Ensuring the framework itself is secure from attacks,
protecting the integrity of its data and models.

Reliability and Stability: Confirming the system can operate continuously
without crashes or degradation over extended periods.

Usability: Evaluating the clarity of alerts, dashboards, and reporting for
cybersecurity analysts.

In essence, system testing for this AI framework is about proving that the
entire automated threat identification system is fit for purpose, delivering
effective and dependable protection against sophisticated IoT botnet attacks.
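
As one possible way to carry out the performance evaluation described above, the sketch below times a prediction function over batches of records and reports throughput and worst-case batch latency. The helper measure_throughput and the stand-in dummy_predict are hypothetical; in a real test the trained model's predict method and realistic traffic batches would take their place.

```python
import time


def measure_throughput(predict, batches):
    """Time predict() over batches; return (records_per_second, worst_batch_latency_s)."""
    total_records = 0
    worst_latency = 0.0
    started = time.perf_counter()
    for batch in batches:
        batch_started = time.perf_counter()
        predict(batch)
        worst_latency = max(worst_latency, time.perf_counter() - batch_started)
        total_records += len(batch)
    elapsed = time.perf_counter() - started
    rate = total_records / elapsed if elapsed > 0 else float("inf")
    return rate, worst_latency


def dummy_predict(batch):
    # Hypothetical stand-in for a model: flags records whose first feature exceeds 0.5.
    return [row[0] > 0.5 for row in batch]


# 20 batches of 100 synthetic two-feature records each.
batches = [[(i / 100.0, 0.0) for i in range(100)] for _ in range(20)]
records_per_second, worst_latency = measure_throughput(dummy_predict, batches)
print(f"{records_per_second:.0f} records/s, worst batch latency {worst_latency * 1000:.3f} ms")
```

The measured figures depend on the hardware and the model, so they are only meaningful when compared against the latency and throughput targets set in the non-functional requirements.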

CHAPTER 7

7. SCREENSHOTS/OUTPUT

Fig 7.1 AI Framework

Fig 7.2 Upload Dataset

Fig 7.3 Count Plot Graph

Fig 7.4 Train Test Splitting Dataset

Fig 7.5 Existing Bernoulli NBC Confusion Matrix

Fig 7.6 Existing RFC Confusion Matrix

Fig 7.7 Prediction from Dataset

CHAPTER 8

8. CONCLUSION
This research successfully demonstrates the application of machine learning
techniques to enhance the detection of sophisticated IoT-based botnet attacks.
By leveraging the Bernoulli Naive Bayes and Random Forest classifiers, the
framework addresses the challenges posed by the increasing volume and
complexity of network traffic due to the proliferation of IoT devices. The
implementation focuses on two prevalent botnets, Mirai and BASHLITE, which
have historically exploited vulnerabilities in IoT devices to orchestrate large-scale
Distributed Denial of Service (DDoS) attacks and other cyber threats. The use of
machine learning enables the system to identify anomalous patterns in network
traffic, which traditional rule-based systems might overlook. The Random Forest
classifier, in particular, provides a high degree of accuracy and robustness,
making it a suitable choice for real-world deployment in dynamic network
environments.

This AI-driven approach offers significant improvements over manual inspection
and traditional anomaly detection methods. The ability to process and analyze
vast amounts of data in real time, coupled with the adaptability of machine
learning models to evolving attack patterns, helps the framework remain
effective even as cyber threats become more sophisticated. Additionally, the
integration of performance metrics like accuracy, precision, recall, and
F1-score provides a comprehensive evaluation of the models, helping to confirm
that they meet the necessary standards for deployment.

CHAPTER 9

9. FUTURE SCOPE
Even though the existing system is reliable, it can still be expanded and
improved. The following improvements could be investigated in further iterations:

Integration with Real-Time Systems: One of the primary future directions for
this project is to integrate the trained machine learning models into real-time
network monitoring systems. This would involve deploying the models within
network security infrastructure to continuously monitor and analyze incoming
traffic for anomalies, providing instant alerts and automated responses to
potential threats. Real-time deployment would also require optimizing the
models for speed and efficiency, ensuring minimal latency in detection. The
system could be made more scalable, and infrastructure expenses decreased, by
optimizing the model to run on edge devices such as the Raspberry Pi or Jetson
Nano.

Incorporating Advanced Machine Learning Techniques: While the Bernoulli
Naive Bayes and Random Forest classifiers provide a solid foundation, future
enhancements could explore more advanced machine learning techniques, such
as deep learning. Convolutional Neural Networks (CNNs) or Recurrent Neural
Networks (RNNs) could be employed to capture more complex patterns in
network traffic data, potentially improving detection rates for sophisticated or
novel attack vectors.

Expanding the Scope to Include Other Botnets: The current framework
focuses on detecting Mirai and BASHLITE botnets. Future work could expand the
scope to include other emerging IoT botnets, such as Hajime, Amnesia, or
Reaper. By incorporating a broader range of attack types, the framework could
be made more versatile and capable of handling a wider array of threats.

