Malware Detection using Machine Learning
By:
Shubham Dubey (14ucs114)
Malware overview
●Malicious software that tries to damage a system or gain unauthorized
access to it.
●Comes in different types: Virus | Trojan | Adware | Worm, etc.
●More than 100,000 (1 lakh) new samples are found by AV companies every day.
●Most of them are variants of each other or of older samples.
Current status of Detection
●Antivirus companies currently use signature-based detection.
●A signature can be anything from strings to assembly code snippets.
Problem with current method
●Polymorphic malware can change its code on every execution.
●Most malware can encrypt or pack itself using packers.
●Signature-based detection does not catch such malware all the time.
Solution using ML
●API sequence features can be used to detect whether a file is
malicious or not.
●API calls are a robust basis for analysis, as they cannot be altered easily.
●They outline everything happening in the operating system, including
operations on files, the registry, mutexes, processes and other objects.
●For example, OpenFile and CreateFile describe file operations, while
OpenMutex and CreateMutex describe mutexes being opened/created.
Our System Description
●Cuckoo Sandbox is used to run each sample and record all its API calls.
●The report for each file is saved as a JSON file.
●The calls are parsed and saved into a CSV file in matrix/vector format
(see the sketch after this list).
●Samples with fewer than 10 calls (or any other user-set threshold) are
ignored.
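A minimal sketch of this parsing step in R, assuming the standard Cuckoo
report layout (behavior → processes → calls → api); the file name
report.json is a placeholder, and 10 is the deck's default threshold:

library(jsonlite)

# Load one Cuckoo behavioural report (file name is a placeholder)
report <- fromJSON("report.json", simplifyVector = FALSE)

# Collect the API name of every call made by every monitored process
api_calls <- unlist(lapply(report$behavior$processes, function(proc) {
  vapply(proc$calls, function(call) call$api, character(1))
}))

# Ignore samples with fewer than 10 recorded calls (or any user-set value)
min_calls <- 10
if (length(api_calls) < min_calls) api_calls <- character(0)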
Feature extraction
●The frequency representation approach has been taken: each sample is
described by how many times it calls each API (a small sketch follows below).
●S1, S2, ... are the sample numbers; API1, API2, ... are the APIs being called.
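A small sketch of building that matrix, assuming each sample's calls are
already available as a character vector (function and object names are
illustrative): rows are samples S1, S2, ..., columns are the distinct APIs,
and each cell holds a call frequency.

# Build the sample x API frequency matrix from a list of per-sample call vectors
build_freq_matrix <- function(call_lists) {
  apis <- sort(unique(unlist(call_lists)))            # all API names seen
  freq <- t(sapply(call_lists, function(calls) {
    as.integer(table(factor(calls, levels = apis)))   # count each API per sample
  }))
  dimnames(freq) <- list(paste0("S", seq_along(call_lists)), apis)
  freq
}

# write.csv(build_freq_matrix(call_lists), "features.csv")  # save as CSV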
Redundant subsequence removal methods
●There are a large number of useless API call subsequences present.
●They can be removed using N-gram subsequence extraction, as sketched
below.
●If some API calling pattern is present in many samples, remove it
(works like a sliding window).
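A rough sketch of the sliding-window idea, reusing the per-sample call
vectors from the earlier step; the n-gram length of 3 and the 90%
commonality cut-off are assumptions, not values from the slides.

# Slide a window of length n over one API call sequence to get its n-grams
api_ngrams <- function(calls, n = 3) {
  if (length(calls) < n) return(character(0))
  vapply(seq_len(length(calls) - n + 1), function(i) {
    paste(calls[i:(i + n - 1)], collapse = "->")
  }, character(1))
}

# Drop n-grams present in more than `max_share` of all samples, since
# patterns common to almost everything carry no discriminative value
prune_common_ngrams <- function(call_lists, n = 3, max_share = 0.9) {
  grams <- lapply(call_lists, function(calls) unique(api_ngrams(calls, n)))
  doc_freq <- table(unlist(grams))
  common <- names(doc_freq)[doc_freq / length(call_lists) > max_share]
  lapply(grams, function(g) setdiff(g, common))
}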
Redundant subsequence removal methods
●Another method is to use information gain (a small sketch follows below).
●C is the class of a sample (malicious or benign) in the malware
detection system, and H(C) is its information entropy:
H(C) = -Σc p(c) log p(c)
●The information gain of a subsequence T with respect to class C is:
IG(T, C) = H(C) - p(ti) H(C | T present) - p(tj) H(C | T absent)
●p(ti) is the probability that the feature appears in a sample and p(tj)
the probability that it does not.
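A minimal sketch of that computation for one binary feature (an n-gram
being present or absent in a sample); the function names are illustrative.

# Shannon entropy of a vector of class labels (malicious / benign)
entropy <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain IG(T, C) = H(C) - H(C | T) for a logical feature vector
information_gain <- function(feature_present, labels) {
  p_t <- mean(feature_present)                        # p(ti): feature appears
  h_cond <- p_t * entropy(labels[feature_present]) +  # H(C | T present)
    (1 - p_t) * entropy(labels[!feature_present])     # H(C | T absent)
  entropy(labels) - h_cond
}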
Using machine learning methods
●After the features are extracted and selected, we can apply machine
learning methods to the resulting data (a short sketch follows the list).
●The R packages used to implement the algorithms are:
Random Forest – randomForest
K-Nearest Neighbours – class
Support Vector Machines – kernlab
J48 Decision Tree – RWeka
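A condensed sketch of applying these four packages to the frequency matrix;
the 70/30 split, the RBF kernel, k = 1 and the object names are assumptions
for illustration, not details taken from the slides.

library(randomForest)   # Random Forest
library(class)          # k-Nearest Neighbours (knn)
library(kernlab)        # Support Vector Machines (ksvm)
library(RWeka)          # J48 decision tree

# `features` is the sample x API frequency matrix, `labels` a factor of classes
set.seed(1)
idx     <- sample(nrow(features), 0.7 * nrow(features))
train_x <- features[idx, ];  train_y <- labels[idx]
test_x  <- features[-idx, ]; test_y  <- labels[-idx]

rf_pred  <- predict(randomForest(x = train_x, y = train_y), test_x)
knn_pred <- knn(train_x, test_x, cl = train_y, k = 1)
svm_pred <- predict(ksvm(x = train_x, y = train_y, kernel = "rbfdot"), test_x)
j48      <- J48(label ~ ., data = data.frame(train_x, label = train_y))
j48_pred <- predict(j48, newdata = data.frame(test_x))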
Comparison method
●The Cuckoo analysis score is an indication of how malicious an
analyzed file is.
●In total there are three levels of severity, each with its own score:
1 for low, 2 for medium and 3 for high.
●It is hard to measure detection accuracy from this score alone, since
there is no threshold value indicating whether a sample is malicious or
not.
●This score can be compared with the results obtained by the ML
algorithms.
Results
●The accuracy of detection is measured as the
percentage of correctly identified instances:
Accuracy = (correctly classified samples / total samples) × 100%
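For example, with the predictions from the earlier sketch (rf_pred against
the held-out test_y), the accuracy follows directly from the confusion matrix:

# Accuracy = correctly classified instances / all instances, in percent
conf <- table(predicted = rf_pred, actual = test_y)
accuracy <- 100 * sum(diag(conf)) / sum(conf)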
Support Vector Machines Results
●The overall accuracy achieved was 87.6% for multi-class
classification and 94.6% for binary classification.
Random Forest Results
●The algorithm resulted in good prediction accuracy:
95.69% for multi-class classification and 96.8% for binary
classification.
KNN Results
●As can be seen, the best accuracy was achieved with
k=1. The algorithm resulted in a good accuracy of 87%
for multi-class classification and 94.6% for two-class
classification.
Conclusion
Experiments show that the integrated machine learning
classifier performs better than separate signature-based
detection.
Conclusion
In classification problems, different models gave different
results. The lowest accuracy was achieved by Naive Bayes
(72.34% and 55%), followed by k-Nearest-Neighbours (87% and
94.6%) and Support Vector Machines (87.6% and 94.6%). The
highest accuracy was achieved with the J48 and Random Forest
models: 93.3% and 95.69% for multi-class classification and
94.6% and 96.8% for binary classification, respectively.
Thank You
