Hybrid ML-DL Approach for Android Malware Detection

Harsh Kumar*
Student, Akal University, Talwandi Sabo, Bathinda, Punjab
Abstract: The proliferation of Android malware has posed a major threat to mobile security in recent years, so robust
detection solutions are needed. This research uses app permissions as the main signal for detecting Android malware:
information about the permissions that Android applications request is the primary feature in the dataset
for distinguishing between malicious and benign applications. Machine learning models trained on the connection
between specific permissions and malware activity categorize apps as either safe or possibly dangerous. By using
supervised learning approaches to evaluate the predictive potential of permission patterns, the proposed method
offers a portable and efficient malware detection solution. Experimental results show that the models can detect
malware accurately, providing a potential avenue to enhance permission-based Android security
solutions.

Key Words: Android Security, Android Malware Detection, Machine Learning, Deep Learning
1. Introduction

The rapid development of mobile technology has revolutionized the way individuals interact with digital
information, resulting in the widespread use of smartphones and mobile apps throughout the world.
However, the surge in mobile app use has been accompanied by a worrying increase in mobile malware
(Peiravian & Zhu, 2013), which puts users' security and privacy at considerable risk. Traditional methods
of identifying malware, which mostly rely on signature analysis, are becoming less effective because malware
changes continually and frequently uses obfuscation to evade detection.

To get around these problems, machine learning (ML) has come to light as a potentially effective technique for
enhancing the identification of mobile malware. Machine learning models may be trained on large datasets to
identify complex patterns and relationships, which makes them valuable for identifying malware versions that
have not yet been identified(Bulut & Yavuz, 2017). In this research, we examine the application of four distinct
machine learning techniques—Random Forest, Logistic Regression, Artificial Neural Networks (ANN), and
Classification Trees—for the detection of mobile malware.

Random Forest, an ensemble learning method, is well known for its robustness and its ability to handle large
datasets with high-dimensional features. By constructing multiple decision trees and combining
their predictions, Random Forest offers a powerful technique for spotting subtle patterns that may indicate
malware.

A well-liked statistical model that is easy to understand and apply is logistic regression. Despite its linear nature,
Logistic Regression can reliably determine if an app is benign or malicious when combined with well-chosen
features.

Inspired by the neural architecture of the human brain, artificial neural networks, or ANNs, have shown
impressive promise in a variety of applications related to pattern detection and categorization. Because ANNs
are able to extract complex, non-linear correlations from data(Rathore et al., 2021), they are particularly useful
for identifying sophisticated malware that may defy traditional detection techniques.

Another decision tree-based method is the Classification Tree model, which creates a tree-like decision model
by dividing the dataset into subsets according to feature values. By showing how features affect the final
classification choice, this strategy works well for classification tasks and yields findings that are easy to
understand.

This study examines different machine learning models based on features extracted from mobile applications
to determine their efficacy in identifying mobile malware(Duan et al., 2024). By utilizing the advantages of
each model, we want to develop a scalable and trustworthy mobile malware detection solution that will
ultimately increase the security of mobile ecosystems.

2. Literature Review
Globally, mobile malware has grown to be a serious threat to mobile users' privacy and security. The number
of malware variants targeting mobile platforms has increased as a result of the growing use of smartphones and
mobile applications. The demand for reliable and effective detection methods grows as these dangerous
programs change.
Traditional Detection Methods
Traditionally, a large portion of mobile malware detection has been dependent on signature-based techniques,
in which known malware is recognized by distinctive patterns or signatures. These techniques work well against
known threats, but they are not as strong against zero-day attacks. This is addressed by heuristic-based
detection, which looks for possible dangers by examining code structures and behaviors. Nevertheless, this
method's static nature frequently leads to significant false-positive rates. While behavior-based detection, which
keeps an eye on an application's runtime activity, provides greater detection rates, it is less appropriate for
mobile devices due to its higher computational resource requirements.
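
The weakness of signature-based detection against unseen variants can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "signature" is a SHA-256 digest of the application bytes; the names KNOWN_MALWARE and is_known_malware and the sample payloads are hypothetical and do not come from this study.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious payloads.
KNOWN_MALWARE = {hashlib.sha256(b"malicious payload v1").hexdigest()}


def is_known_malware(app_bytes: bytes) -> bool:
    """Flag an app only if its digest exactly matches a stored signature."""
    return hashlib.sha256(app_bytes).hexdigest() in KNOWN_MALWARE


original = b"malicious payload v1"
repacked = b"malicious payload v1 "  # one extra byte; behavior assumed unchanged

detected_original = is_known_malware(original)   # exact match with the database
detected_repacked = is_known_malware(repacked)   # digest differs, so no match

print("original flagged:", detected_original)
print("repacked flagged:", detected_repacked)
```

Because any change to the bytes changes the digest, the repacked sample is not matched. Real signature engines use richer patterns than a whole-file hash, but they share the limitation that only previously catalogued samples are recognized.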
Machine Learning Approaches
Malware detection has changed dramatically with the introduction of machine learning (ML) (Sahs & Khan,
2012). Supervised learning methods (Yerima et al., 2014), in which models are trained on labeled datasets, have
demonstrated potential for detecting malware that has never been seen before. Methods such as random forests,
decision trees, and support vector machines have been applied extensively. Unsupervised and
semi-supervised learning approaches are becoming more popular, especially in situations where labeled data
is scarce. In recent times, deep learning models, such as recurrent neural networks (RNNs) and
convolutional neural networks (CNNs), have proven to be more effective in identifying intricate and obfuscated
malware due to their ability to learn complex patterns straight from raw data.
Among the first to advocate for using machine learning in mobile malware detection were Peiravian & Zhu
(2013). To categorize mobile applications as benign or malicious, they investigated a
variety of machine learning (ML) algorithms, such as decision trees, support vector machines (SVM), and
k-nearest neighbors (KNN). Their research showed that machine learning models may reach high detection
accuracy when trained on features including network traffic, permissions, and API calls. Nevertheless, the
study also brought attention to issues like feature selection and the requirement for sizable,
labeled datasets (Khalifa et al., 2024) in order to develop reliable models.
Building on these findings, Yerima, Sezer, and Muttik (2014) carried out an extensive
assessment of machine learning methods for the detection of malware on Android devices. They explored
various classifiers such as Random Forests, Support Vector Machines, and Naive Bayes, and highlighted the
significance of feature engineering in enhancing detection efficiency. According to their research, ensemble
techniques, Random Forests in particular, offered better detection accuracy and robustness against various
kinds of malware. They also looked into the application of static and dynamic analysis methods, coming to the
conclusion that a hybrid strategy combining the two could improve detection rates even further.
Hybrid Methods
Combining techniques from static and dynamic analysis has become a common tactic to take advantage of each
approach's strengths. By examining both code and behavior, hybrid models enhance detection rates for
previously unseen malware in addition to successfully identifying known threats. Ensemble
approaches, which blend several machine learning models, have substantially improved accuracy, making them an
effective weapon against mobile malware.
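
One common way to blend several models is soft voting, in which the predicted malware probabilities of the individual models are averaged before a threshold is applied. The sketch below is illustrative only: the function soft_vote, the three probability lists, and the 0.5 threshold are assumptions, and the study does not state that it used this scheme.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-app malware probabilities across models, then threshold."""
    n_models = len(prob_lists)
    averaged = [sum(probs) / n_models for probs in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical malware probabilities from three models for the same four apps.
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.8, 0.1, 0.3, 0.7]
model_c = [0.7, 0.4, 0.5, 0.6]

ensemble_labels = soft_vote([model_a, model_b, model_c])
print(ensemble_labels)  # 1 = flagged as malicious, 0 = benign
```

Averaging lets a confident model outweigh an uncertain one, which is the usual argument for soft voting over simple majority voting.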
Recent Advances
Deep learning, and specifically the application of CNNs and RNNs for malware detection, has been the focus
of recent research. These models are quite good at picking out tiny trends that point to malicious activity (Sahs
& Khan, 2012) from huge datasets. Another method that has shown promise is federated learning, which allows
models to be trained across devices while maintaining user privacy. Moreover, adversarial learning—where
models are trained to oppose probable malware evasion techniques—is receiving notice as a means of boosting
detection systems' resilience.
A more recent work by Liu et al. (2018) (Zhao et al., 2018) proposed a secure transfer learning method for
identifying malware in Internet of Things environments, a task closely related to mobile malware detection. Their
methodology entailed fine-tuning a pre-trained model on a smaller, domain-specific dataset after it had been
trained on a larger, generic dataset. The problem of data scarcity, which is prevalent in applications connected
to security, is addressed by this technique. In order to analyze data closer to the source, minimize latency, and
enhance real-time detection capabilities, they also included edge computing. Furthermore, the increasing worry
about protecting sensitive data in ML-based detection systems is highlighted by their focus on security and
privacy through encryption and secure data transmission.
Challenges in Mobile Malware Detection
Mobile malware detection still faces a number of difficulties despite this progress (Bulut & Yavuz, 2017). To evade
detection, malware authors frequently use tactics like polymorphism and code obfuscation. Another
major issue is the restricted computational capacity of mobile devices, since many sophisticated detection
techniques demand high processing power. The variety of mobile malware and the quick development of
attack vectors also make a universally applicable solution difficult.
3. Problem Statement

The quantity and diversity of mobile malware have increased exponentially as a result of the quick spread of
mobile applications. Traditional detection techniques have not been able to keep up with the sophistication of
these threats, especially when it comes to recognizing novel and unidentified malware variants. Advanced
detection systems are essential for protecting mobile users' privacy and security.

In this study, we use machine learning models to detect mobile malware in an effort to overcome this difficulty.
In particular, we have a dataset with a variety of attributes taken from malicious and benign mobile applications.
The main challenge is to train several machine learning models, including Random Forest, Logistic Regression,
and Artificial Neural Networks (ANN), on this dataset so that they reliably identify applications as benign or
malicious.
The successful use of these models will aid in the creation of more resilient and efficient mobile malware
detection systems that can change with the rapidly evolving mobile threat landscape. The performance of these
models will also be compared in this study to determine which strategy is best for implementation in mobile
security frameworks in the real world.

4. Objectives of the Study


The main goal of this study is to create and assess machine learning models for mobile malware detection. The
research pursues the following specific objectives:
1. To examine and preprocess a dataset of features taken from both malicious and benign mobile
applications, so that the data is appropriate for training machine learning models.
2. To apply several machine learning models (Random Forest, Logistic Regression, Classification Tree,
and Artificial Neural Networks (ANN)) (Yerima et al., 2014) and evaluate how accurately they
categorize mobile applications as benign or malicious.
3. To evaluate how well various feature engineering and selection strategies increase the
robustness and accuracy of the malware detection models (Xiong & Zhang, 2024).
4. To identify the best model for detecting mobile malware by analyzing the performance of the
trained models using a variety of assessment criteria, including accuracy, precision, recall, F1-score,
and the area under the receiver operating characteristic curve (AUC-ROC).
5. To evaluate each machine learning model's advantages and disadvantages in relation to the detection
of mobile malware and offer suggestions for how these models (Zhao et al., 2018) might be used in
practical mobile security frameworks.
6. To make suggestions for further development and expansion of the research, such as adding
sophisticated deep learning methods or using the models on bigger and more varied datasets.
5. Proposed Methodology
This study's methodology entails creating, honing, and assessing machine learning models for mobile malware
detection in an organized manner. The following are the steps to follow:
1. Dataset Acquisition
The study's dataset came from Kaggle, a well-known data science competition and dataset website. The dataset
contains numerous features extracted from mobile applications, each labeled as benign or malicious.
These features help differentiate between legitimate and malicious activity, and they include
network traffic, API requests, application permissions, and other pertinent details.
2. Data Preprocessing
• Data Cleaning: First, any inconsistent or missing values in the dataset are either deleted or
imputed.
• Feature Encoding: To transform categorical information into a numerical format appropriate
for machine learning models, encoding approaches like one-hot encoding are employed.
• Normalization/Scaling: To make sure that all inputs are on a similar scale, features are either
scaled or normalized. This is crucial for algorithms like Logistic Regression and Neural
Networks.
• Data Splitting: The dataset is split into training and testing subsets: eighty percent is used for
training the models and twenty percent is set aside for testing. To fine-tune
hyperparameters, the training data may also be further divided to provide a validation set.
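
The encoding and splitting steps above can be sketched in plain Python. The toy apps, their permission names, and the labels below are hypothetical and are not drawn from the study's Kaggle dataset; the sketch only shows one plausible way to turn requested permissions into 0/1 features and to hold out twenty percent of the rows.

```python
import random

# Hypothetical toy data: permissions requested by each app, with a label
# (1 = malicious, 0 = benign). Not taken from the study's dataset.
apps = [
    {"INTERNET", "SEND_SMS"}, {"INTERNET"}, {"CAMERA"},
    {"SEND_SMS", "READ_CONTACTS"}, {"INTERNET", "CAMERA"}, {"READ_CONTACTS"},
    {"INTERNET", "READ_CONTACTS", "SEND_SMS"}, {"CAMERA", "INTERNET"},
    {"SEND_SMS"}, {"INTERNET"},
]
labels = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Feature encoding: one 0/1 column per permission seen anywhere in the data.
vocabulary = sorted(set().union(*apps))
X = [[int(perm in requested) for perm in vocabulary] for requested in apps]


def split_80_20(rows, targets, seed=0):
    """Shuffle row indices with a fixed seed and hold out the last 20%."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(0.8 * len(indices))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [targets[i] for i in train],
            [rows[i] for i in test], [targets[i] for i in test])


X_train, y_train, X_test, y_test = split_80_20(X, labels)
print(vocabulary)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Binary permission flags are already on a common 0 to 1 scale, so the scaling step would matter mainly for any non-binary features in the dataset.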
3. Feature Engineering and Selection
• Feature Selection: To determine which features are most pertinent for malware detection, a
comprehensive feature selection process is conducted. To choose a subset of features that best
enhances the predictive power of the model, methods like Random Forest importance scores
and Recursive Feature Elimination (RFE) are employed.
• Dimensionality Reduction: When necessary, the dataset's dimensionality is reduced using
methods like Principal Component Analysis (PCA)(Bulut & Yavuz, 2017), which improves model
efficiency without compromising accuracy.
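
A minimal sketch of the two feature selection techniques named above, using scikit-learn on synthetic data. The data, the label rule, the choice of Logistic Regression as the RFE estimator, and the number of features to keep are all assumptions made for illustration; the study does not report these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a permission matrix: 300 apps, 12 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 12))
# Assumed label rule: only features 0, 1 and 2 carry signal.
y = (X[:, 0] | (X[:, 1] & X[:, 2])).astype(int)

# Random Forest importance scores: one non-negative score per feature.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]

# Recursive Feature Elimination: repeatedly drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
selected = np.flatnonzero(rfe.support_)

print("features ranked by forest importance:", ranked.tolist())
print("features kept by RFE:", selected.tolist())
```

The two methods can disagree, since importance scores come from a tree ensemble while RFE here relies on the coefficients of a linear model.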
4. Model Implementation
Four machine learning models are implemented for the detection of mobile malware:
• Random Forest (RF): a strong ensemble learning technique that constructs several decision
trees and combines them to increase accuracy and avoid overfitting. The chosen features are
used to train the RF model, and hyperparameters like the maximum depth and number of trees
are adjusted using cross-validation. On the testing dataset, the accuracy of the RF model was
96.62462%.

Figure 1: Random Forest-Important Features In Model Training


• Artificial Neural Network (ANN): a multi-layered neural network deep learning model that
can recognize intricate patterns in the input. The input, hidden, and output layers of the ANN are
created, and backpropagation is used for training. A validation set is used to optimize the
architecture and hyperparameters (such as the number of hidden layers, activation functions, and
learning rate). The ANN model attained an accuracy of 96.42005%.
• Logistic Regression (LR): This statistical model uses a logistic function to represent the
likelihood of a binary result. Regularization techniques are used in the implementation
in order to prevent overfitting. With an accuracy of 95.90863%, the LR model remained competitive
with RF and ANN, trailing them by less than one percentage point despite being simpler.
• Classification Tree (CL): A classification tree divides data according to feature values in order
to predict a target class. The model reached 92.36% accuracy, so it classifies most applications
correctly, a result that may reflect an appropriate tree depth, informative features, and
data quality. It is nevertheless the least accurate of the four models evaluated.
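
The four model families listed above can be trained side by side with scikit-learn, as in the following sketch. Everything specific here is an assumption: the data is synthetic, MLPClassifier stands in for the ANN, and the hyperparameter values are placeholders rather than the settings used in the study. Cross-validated tuning is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a permission matrix (not the study's data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 15))
y = (X[:, 0] | (X[:, 1] & X[:, 2])).astype(int)

# 80/20 split, stratified so both subsets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Placeholder hyperparameters; the study's tuned values are not reported.
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, random_state=0),
    "Artificial Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
    "Classification Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
```

Accuracies on this toy data say nothing about the study's results; the sketch only shows the shared fit and predict workflow that makes such a comparison straightforward.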
5. Model Evaluation
• Accuracy: The percentage of correctly identified occurrences in the test set is measured by
accuracy, which is the main statistic used to assess the performance of the model.
• Precision, Recall, and F1-Score: Beyond accuracy, each model is evaluated using
precision (the ratio of true positive predictions to all predicted positives), recall (the ratio
of true positives to all actual positives, also known as sensitivity), and F1-score (the
harmonic mean of precision and recall).

• ROC-AUC Curve: For every model, the trade-off between the true positive rate and the false
positive rate across various threshold values is assessed by plotting and analyzing the Area Under
the Receiver Operating Characteristic Curve (ROC-AUC).
Figure 2: ROC curves of all models applied
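
The evaluation metrics defined above can be computed directly from predictions, as in this plain-Python sketch. The labels, scores, and the 0.5 decision threshold are hypothetical. The AUC is computed through its ranking interpretation: the fraction of (malicious, benign) pairs in which the malicious app receives the higher score, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = malicious."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores for eight apps.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
auc = roc_auc(y_true, y_score)

print(accuracy, precision, recall, f1, auc)
```

Unlike accuracy, precision, and recall, the AUC does not depend on the chosen threshold, which is why it is reported alongside them.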
6. Comparison and Analysis
• Based on the criteria mentioned above, the performances of the three strongest models (Random
Forest, ANN, and Logistic Regression) are contrasted. The investigation looks at a number of aspects,
including feature interactions, model complexity, and the capacity to identify non-linear
relationships in the data, to determine why RF and ANN obtained slightly higher accuracy
(96.62462% and 96.42005%) than LR (95.90863%).
Model Proposed              Accuracy Obtained   Precision Obtained   Recall Obtained   F1-Score Obtained
Artificial Neural Network   96.42005%           96.57064%            96.2406%          96.40534%
Random Forest               96.62462%           96.38024%            96.90476%         96.64179%
Logistic Regression         95.90863%           95.27163%            96.63265%         95.94732%
Classification Tree         92.36277%           90.19355%            95.10204%         92.58278%

Figure 3: Comparison of model metrics obtained (bar chart of accuracy, precision, recall, and F1-score for the
Artificial Neural Network, Random Forest, Logistic Regression, and Classification Tree models; vertical axis from
86.00% to 98.00%)


• Each model's computational efficiency is also examined, taking into account the trade-offs
between accuracy and resource usage—especially when deploying on mobile devices.
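
As a consistency check on the comparison table, the F1-scores can be recomputed from the reported precision and recall using the harmonic mean. The numbers below are copied from the table; the names reported, f1_score, and differences are illustrative.

```python
# Precision, recall and reported F1-score (all in percent) from the table.
reported = {
    "Artificial Neural Network": (96.57064, 96.2406, 96.40534),
    "Random Forest": (96.38024, 96.90476, 96.64179),
    "Logistic Regression": (95.27163, 96.63265, 95.94732),
    "Classification Tree": (90.19355, 95.10204, 92.58278),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported F1-score per model.
differences = {
    model: abs(f1_score(p, r) - f1)
    for model, (p, r, f1) in reported.items()
}

for model, diff in differences.items():
    print(f"{model}: recomputed F1 differs from reported value by {diff:.5f}")
```

For all four models the recomputed values agree with the reported F1-scores to within rounding, so the table is internally consistent on this metric.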

6. Conclusion

This study examined four machine learning models, namely Random Forest (RF), Artificial Neural Network
(ANN), Classification Tree, and Logistic Regression (LR), for mobile malware detection using a Kaggle
dataset. The major goal was to develop, refine, and test these models to see how successfully they
identified mobile applications as benign or malicious.

The results showed that the RF and ANN models obtained 96.62462% and 96.42005%
accuracy, respectively, while the LR model lagged slightly behind with 95.90863% accuracy. The
accuracy of the Classification Tree model was 92.36277%. These findings show how well-suited
ANN and RF are for handling complex datasets and detecting the subtle patterns required for effective
malware detection. Despite its simplicity, the LR model performed well, which makes it a good option
when computational economy is critical.

The study's findings demonstrate how mobile security may be enhanced by machine learning
algorithms, which provide accurate and reliable malware detection. ANN and RF models have the
potential to improve malware prevention and detection in mobile security frameworks because of their
high accuracy. Moreover, the competitive performance of the LR model shows how well-suited
it is for deployment in resource-constrained environments, such as mobile devices, where speed
and simplicity are critical.

Nevertheless, the study acknowledges certain limitations, including the challenge of detecting
malware that evades detection and the potential for variations in model performance across datasets.
Future research may be able to get around these limitations by looking at more intricate deep learning
architectures, combining hybrid detection techniques, and applying these models in real-world
scenarios.
In summary, this study establishes the foundation for future research and development in this vital
area of mobile security by demonstrating how machine learning effectively detects mobile malware.
As the threat landscape shifts, continuous machine learning research and innovation will be needed to
maintain the security and integrity of mobile devices.

References:

Bulut, I., & Yavuz, A. G. (2017). Mobile malware detection using deep neural network. 2017 25th Signal Processing and
Communications Applications Conference (SIU), 1–4. https://doi.org/10.1109/SIU.2017.7960568

Duan, G., Liu, H., Cai, M., Sun, J., & Chen, H. (2024). MaDroid: A maliciousness-aware multifeatured dataset for
detecting android malware. Computers & Security, 144, 103969. https://doi.org/10.1016/j.cose.2024.103969

Khalifa, M. A., Elsayed, A., Hussien, A., & Hussainy, A. S. (2024). Android Malware Detection and Prevention Based on
Deep Learning and Tweets Analysis. 2024 6th International Conference on Computing and Informatics (ICCI), 153–
157. https://doi.org/10.1109/ICCI61671.2024.10485022

Peiravian, N., & Zhu, X. (2013). Machine Learning for Android Malware Detection Using Permission and API Calls. 2013
IEEE 25th International Conference on Tools with Artificial Intelligence, 300–305.
https://doi.org/10.1109/ICTAI.2013.53

Rathore, H., Sahay, S. K., Rajvanshi, R., & Sewak, M. (2021). Identification of Significant Permissions for Efficient Android
Malware Detection (pp. 33–52). https://doi.org/10.1007/978-3-030-68737-3_3

Sahs, J., & Khan, L. (2012). A Machine Learning Approach to Android Malware Detection. 2012 European Intelligence
and Security Informatics Conference, 141–147. https://doi.org/10.1109/EISIC.2012.34

Xiong, S., & Zhang, H. (2024). A Multi-model Fusion Strategy for Android Malware Detection Based on Machine
Learning Algorithms. Journal of Computer Science Research, 6(2), 7–17. https://doi.org/10.30564/jcsr.v6i2.6632

Yerima, S. Y., Sezer, S., & Muttik, I. (2014). Android Malware Detection Using Parallel Machine Learning Classifiers. 2014
Eighth International Conference on Next Generation Mobile Apps, Services and Technologies, 37–42.
https://doi.org/10.1109/NGMAST.2014.23

Zhao, L., Li, D., Zheng, G., & Shi, W. (2018). Deep Neural Network Based on Android Mobile Malware Detection System
Using Opcode Sequences. 2018 IEEE 18th International Conference on Communication Technology (ICCT), 1141–
1147. https://doi.org/10.1109/ICCT.2018.8600052
