Malware Detection Using Machine Learning
978-1-6654-2087-7/21/$31.00 ©2021 IEEE
2021 International Conference on Technological Advancements and Innovations (ICTAI)
Nearly 350,000 new types of malicious code harm various applications. They aim to survey the previously published papers and the existing work in the field of malware detection using machine learning.

The D parameter controls how strictly the system classifies a sample as benign or malware. The value of N was varied as 2, 4, 6 and 8. The best detection ratio achieved was 74.37%, with N = 4, K = 17 and D = 17.

According to [7], first-generation scanners use fundamental approaches to detect viruses. These methods scan for given sequences of bytes, known as strings. Scanners that support wildcards are allowed to skip bytes or byte ranges. Simple string-matching detection methods were no longer effective once virus mutator kits appeared, as these mutation kits made the virus appear very different from its true form.

Pramod Subramanyan, Zhixing Xu, Sayak Ray and Sharad Malik [2] proposed a different malware scenario in which one model is used for each application, separating legitimate executions from executions infected with malware. The algorithms used are logistic regression, SVM (support vector machine) and random forest. The histogram bin size needs to be chosen carefully.

In [6], approaches from machine learning and data mining, mainly text classification, have been used. The N-grams have also been extracted from different executables in the form of Boolean attributes.
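The idea of representing an executable by Boolean N-gram attributes can be illustrated with a minimal sketch. It is not the method of [6]; the function names, the sample bytes and the three-entry vocabulary are hypothetical, and a real system would first select the vocabulary from a large corpus of executables.

```python
def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct byte n-grams that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def boolean_features(data: bytes, vocabulary: list, n: int) -> list:
    """Mark each vocabulary n-gram as present (1) or absent (0) in data."""
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical sample: the first six bytes of an executable.
sample = bytes.fromhex("4d5a90000300")
# Hypothetical vocabulary of 2-grams chosen in advance.
vocabulary = [bytes.fromhex("4d5a"), bytes.fromhex("9000"), bytes.fromhex("ffff")]

features = boolean_features(sample, vocabulary, n=2)
print(features)
```

Each executable then becomes a fixed-length vector of 0/1 values that a classifier can consume, regardless of the size of the file.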
5 | 2008 | Learning and Classification of Malware Behavior | Support Vector Machine | Relies on single program execution of a malware binary.
6 | 2006 | Learning to Detect and Classify Malicious Executable in the Wild | Naive Bayes, Support Vector Machine | The relative performance of methods used in this paper was not as good as the previous one.
7 | 2008 | Metamorphic Virus: Analysis and Detection | Random decryption algorithm (RDA) | Some viruses cannot be detected even in an emulated environment.
8 | 2006 | Machine Learning for Computer Security | Adaptive statistical compression algorithms | The adversary can defeat the computer that learns how to extract signatures for detecting computer worms.
9 | 2019 | Malware Detection using Machine Learning and Deep Learning | Random forest and KNN Algorithm | Does not apply any recurrent neural networks for malware detection.
10 | 2017 | Malware detection using Machine Learning Algorithms | SVM, Decision Tree, Naive Bayes and Multi-Naive Bayes Algorithm | Use of signature-based method which is traditional.
11 | 2017 | Malware Detection and Evasion with Machine Learning Techniques: A Survey | Heuristic, Artificial Intelligence, Behavior, Signature Based Methods | Use of traditional methods for malware detection.
12 | 2012 | Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks | Decision Tree, Random Forest, Naive Bayes | Some methods of machine learning are not appropriate due to heavy processors.
Our system is divided into three major modules: the first is the user interface module, the second is the train module and the third is the malware test module.

The user interface module is the front-end module and contains the front-end architecture of the system. It provides an interface for the user to enter the file that is to be checked for malicious content.

The next module is the train module. This module is used to train as well as test the selected models. The model to be used is selected according to the accuracy of each. This module is the main module and is responsible for the final classification result. The classifier for the model is also generated in this module.

The third module is the malware test module. This module is used to extract the data from the file that the user has uploaded through the user interface. It is responsible for extracting and determining the data from the file, uploading it, and dividing the data into various sections or features.

The final step of the architecture is classification of the results. In the proposed approach, Random Forest, Decision Tree, Linear Regression and AdaBoost detect malware with high accuracy and improve efficiency.

IV. IMPLEMENTATION

The project has been implemented using machine learning technologies. The programming language used is Python 3. The back-end technology is machine learning, and the front-end technology is the tkinter GUI.

The implementation involves three major modules: two back-end modules and one front-end module. The back-end modules are malware test and train. The front-end module is the user interface module.

The implementation can be explained as a series of steps, which have been described in a flowchart to aid understanding.

It means the features which have the most impact on the database or the system. After this, the dataset is split into two sets:

Training dataset (80% of the dataset):
• This portion of the dataset is used for training. Using this dataset, the model learns.

Testing dataset (20% of the dataset):

• This portion of the dataset is used for testing. Using this dataset, the model is tested. The accuracy of the model is thus determined using the testing dataset.

The split is 80% for the training dataset and 20% for the testing dataset, so the test size is kept as 0.2.

According to our dataset, the algorithm with the maximum accuracy was Decision Tree. Hence, it was selected to be used in the system. After this, that model is trained using the dataset.

Then two files were generated:

• classifier.pkl
• features.pkl

After the classifier is ready, we select the testing sample. The testing sample is selected from the testing dataset.

Then the features are tested with the help of the classifier.

If the file is malicious, the output is displayed as malicious; otherwise, the output is displayed as legitimate.

This shows that the accuracy achieved by our system is about 99 percent, which is good accuracy for detecting malware. The results produced by our system are an accuracy of 99%, a false positive rate of 0.104% and a false negative rate of 0.154%.

A classification approach can additionally be implemented for the malware detection system presented, which would involve correctly identifying the type of malware that has attacked the file and could be used as a base for further research to identify the most commonly attacking malware. This presents an idea for future work on the project.

REFERENCES
[1] Sanjay K. Sahay, C. Rama Krishna, Sanjay Sharma, "Detection of Advanced Malware by Machine Learning Techniques", 2019
[2] Pramod Subramanyan, Zhixing Xu, Sayak Ray, Sharad Malik, "Malware Detection using Machine Learning Based Analysis of Virtual Memory Access Patterns", 2017
[3] R Mohanasundaram, P Harsha Latha, "Classification of Malware Detection using Machine Learning Algorithms", 2020
[4] Y. K. Penya, Santos, J. Devesa, P. G. Garcia, "N-Grams based file signatures for malware detection", 2009
[5] Thorsten Holz, Konrad Rieck, Carsten Willems, Patrick Düssel, Pavel Laskov, "Learning and Classification of Malware Behavior", 2008
[6] J. Zico Kolter, Marcus A. Maloof, "Learning to Detect and Classify Malicious Executable in the Wild", 2006
[7] Evgenios Konstantinou, "Metamorphic Virus: Analysis and Detection", 2008
[8] Philip K. Chan, Richard P. Lippmann, "Machine Learning for Computer Security", 2006
[9] Hemant Rathore, Swati Agarwal, Sanjay K. Sahay and Mohit Sewak, "Malware Detection using Machine Learning and Deep Learning", 2019
[10] Mohd Tanveer Shaikh, Rafia Ansari, Mahenoor Suriya, Sonalii Suryawanshi, "Malware detection using Machine Learning Algorithms", Mohammad Danish Khan, 2017
[11] Jhonattan J. Barriga A. and Sang Guun Yoo, "Malware Detection and Evasion with Machine Learning Techniques: A Survey", 2017
[12] Priyank Singhal, Nataasha Raul, "Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks", 2012
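As a supplementary illustration of the 80/20 hold-out split and of the accuracy, false positive rate and false negative rate reported above, a minimal sketch using only the Python standard library is given here. The paper does not state which library or exact metric definitions were used, so the function names, the fixed random seed and the standard confusion-matrix definitions (1 = malicious, 0 = legitimate) are assumptions.

```python
import random


def split_80_20(samples, test_size=0.2, seed=0):
    """Shuffle the samples and return (train, test) for the given test fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption, for repeatability
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def detection_metrics(y_true, y_pred):
    """Return (accuracy, false positive rate, false negative rate).

    Assumed label convention: 1 = malicious, 0 = legitimate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # legitimate files flagged as malicious
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # malicious files missed
    return accuracy, fpr, fnr


# Hypothetical data: 100 sample identifiers split with test size 0.2.
train, test = split_80_20(range(100), test_size=0.2)

# Hypothetical true labels and predictions for four test files.
accuracy, fpr, fnr = detection_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

In practice the predictions would come from the trained classifier applied to the held-out 20% of the dataset.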