Fake URL Detection Using Machine Learning Algorithms

Ms. Neha Gupta
Department of Information Technology
Greater Noida Institute of Technology (Engineering Institute)

Nitesh Kumar
Department of Information Technology
Greater Noida Institute of Technology (Engineering Institute)
niteshkumar909206@gmail.com
Abstract - Phishing is a common scam in which people are misled into supplying personal information through fraudulent websites. Phishing URLs are used to obtain usernames, passwords, and online banking credentials. Phishers build websites that look and behave like legitimate ones, and their strategies have grown more complex as technology has improved. To address this, phishing attempts must be detected by anti-phishing software. Machine learning is an efficient way to detect phishing attacks. This study examines the feature sets used in machine learning-based detection techniques. Phishing is the practice of deceiving an individual into clicking on a harmful link that appears to be authentic.

Keywords - website authenticity, phishing websites, fake websites, website content analysis.

I. INTRODUCTION

Security experts are increasingly concerned about phishing because of the ease with which a counterfeit website that closely mimics a genuine one can be built. While professionals can identify fraudulent websites, the general population is more susceptible and falls victim to phishing attacks [12]. The attacker's primary purpose is to obtain bank account credentials. Because users are becoming less aware, phishing attacks are becoming increasingly effective. Because these attacks exploit human vulnerabilities, they are hard to thwart, and phishing detection systems still need to be updated and improved [4], [11].

Phishing is a kind of cybercrime in which a malicious website impersonates a trustworthy real-world source in an effort to defraud victims of large amounts of money [1], [11]. It obtains sensitive data, such as passwords and credit card numbers, by means of social engineering while posing as a trustworthy person or corporation in electronic communication [2]. Links are used to trick users into visiting fraudulent websites, and the phony emails are made to look like communications from respectable businesses, banks, or internet marketers. Phishing is therefore an effective tactic that a malicious actor can use to extort a large number of people at once.

To evade existing software and frameworks, phishers devise inventive and hybrid strategies. Countermeasures include techniques for identifying phishing content online and for identifying potential phishing attempts in communications [7].
II. LITERATURE REVIEW

In this section, we review earlier work on machine learning techniques for spotting fraudulent websites. In one study, neural networks are used to classify URLs as phishing or not, and binary visualization techniques are applied to improve the accuracy of the model. This method can build on the phishing detection frameworks already in place to determine whether a website is phishing. The small dataset used in the experiment gave the researchers some insight into its effect on the model's predictions and effectiveness. Consequently, the technique might be improved by using other prediction models and additional datasets for training and testing [4].

According to the findings in [5], the reviewed studies used supervised machine learning techniques, with deep neural networks being the most popular approach. This information was gathered through a systematic review of the literature. The study examined only Deep Learning (DL) approaches for phishing detection; these algorithms have the potential to improve online system security. All things considered, this work [5] significantly advances the field of phishing detection, especially with regard to the effectiveness of deep learning algorithms. To address the shortcomings of previous research, future work may look at alternative machine learning techniques. In another study, convolutional neural networks (CNN) and machine learning were used [6].

III. PROJECT DESCRIPTION

As part of the project, we created a website that acts as a platform for every user. This dynamic, flexible website is used to distinguish between legitimate and counterfeit sites [12]. It was built with a number of web technologies, including HTML, CSS, JavaScript, and Django. HTML provides the website's core structure, and CSS is used to enhance its look and feel. The website is intended to be accessible to everyone, so anyone should be able to use it without any problems.

The dataset includes several elements that should be taken into account when determining whether a URL on the internet is phishing or legitimate. The following elements are used to identify phishing URLs:

1. Address Bar-based Features.
2. Atypical-based Features.
3. HTML and JavaScript-based Features.

1. Address Bar-based Features

1.1. Using an IP address. When an IP address (125.98.3.123, for example) is used instead of a domain name to access a website, users are more likely to experience identity theft.

1.2. Long URLs that hide the suspicious part. Phishers can use long URLs to hide malicious material within the address bar.

1.3. Using URL shortening services such as TinyURL. "URL shortening" is a technique used on the Internet by which a URL can be made much shorter while still leading to the intended page.

1.4. URLs that contain the @ sign. The browser disregards everything that comes before the @ sign in a URL; the actual address follows it.

2. Data Preprocessing

Pre-processing is a vital first step in getting data ready for machine learning algorithms. The first stage in cleaning the original data is to remove unnecessary information, missing values, and duplicate data. It is the primary reason for our model's high accuracy. To clean the data, we employ a variety of techniques, such as eliminating duplicate URLs and rows with missing information. We also eliminate irrelevant data, such as URLs that do not belong to any of the malware, phishing, defacement, or benign classes. After cleaning, the data was organized so that machine learning algorithms could use it. Here feature engineering is used, which is the process of discovering and selecting significant attributes from datasets [11].

3. Word Cloud

Word clouds can be used to assess token distribution within particular data categories [13]. Figure 2 depicts a word cloud for each of the four classes examined in this study. Benign URLs frequently include commonly used tokens like html, com, and org.
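A word cloud is a visual summary of token frequencies, so the counting step behind it can be illustrated briefly. The following is a minimal sketch, not the procedure used in this study: the tokenization rule (split on every non-alphanumeric character), the helper names `url_tokens` and `token_frequencies`, and the sample URLs are all assumptions made for illustration.

```python
import re
from collections import Counter

# Split on any run of characters that is not a letter or digit.
_TOKEN_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def url_tokens(url):
    """Lower-cased alphanumeric tokens of a URL, empty pieces removed."""
    return [tok for tok in _TOKEN_SPLIT.split(url.lower()) if tok]


def token_frequencies(urls):
    """Count how often each token occurs across a list of URLs."""
    counts = Counter()
    for url in urls:
        counts.update(url_tokens(url))
    return counts


# Hypothetical sample URLs, used only to illustrate the counting step.
sample_benign = [
    "http://example.org/index.html",
    "http://example.com/about.html",
    "http://docs.example.com/guide.html",
]
top_tokens = token_frequencies(sample_benign).most_common(3)
```

Running the same count separately for each class would give the per-class token rankings that a word cloud renders as font sizes.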
Figure 1. Methodology for Detecting Fake Website URLs.

Figure 2. Word clouds for each of the four classes: (a) Benign URLs, (b) Phishing URLs, (c) Malware URLs, (d) Defacement URLs.

Phishing URLs frequently contain tokens like www, file, tools, ietf, and fight, which trick visitors into thinking they are legitimate URLs (see Figure 2b). Malware URLs frequently contain high-frequency tokens like exe, E7, BB, and MOZI. These tokens are spread via trojans, which are the executable files represented in Figure 2c. Figure 2d shows defacement URLs, which attempt to modify the original website's code and typically use development terms (php, index, itemid, etc.). During the feature engineering process, specific lexical features are extracted from raw URLs and used as input features for the machine learning model. The features listed in Table I are meant to help identify such URLs.

IV. RESULT ANALYSIS AND DISCUSSION

Table I shows that the RF algorithm outperformed the other two, with an accuracy of 97%. In addition, metrics such as F1 score, recall, and accuracy are used to assess each algorithm's overall performance. The result suggests that the system is accurate in identifying bogus URLs, which makes it a useful tool for guarding against phishing schemes and other internet threats.

Table I. Performance of our proposed model.

For classification problems, RF generates a forest of decision trees and combines the predictions of the individual trees. During training, it builds a large number of decision trees, from which it computes the class mode (classification) or the mean prediction (regression). The overall RF performance obtained in our investigation is displayed in Table II, and Figure 4 depicts the confusion matrix of RF.

Table II. Evaluation report for Random Forest.
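The metrics named in this section can all be derived from the four counts of a binary confusion matrix. The sketch below shows that derivation under stated assumptions: the function name `binary_metrics` is hypothetical, "malicious" is treated as the positive class, and the example counts are invented and are not taken from the paper's tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts with "malicious" treated as the positive class.
    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the paper's tables.
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

For a multi-class problem such as the four URL classes discussed earlier, the same formulas would be applied per class (one class versus the rest) and then averaged.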
A. LightGBM
Our investigation found that LightGBM identifies phony URLs with 96% accuracy, as shown in Table IV. This demonstrates the algorithm's ability to identify phony websites and supports its use in a variety of cybersecurity applications. Because of its speed and scalability, LightGBM has become widely used and is a good option for real-time applications that demand quick responses [12].
LightGBM, which the Train Using AutoML tool makes use of, is a decision tree-based gradient-boosting ensemble method. Like other decision tree-based techniques, LightGBM can be applied to both classification and regression.

Table III. Evaluation report for LightGBM.
Accuracy: 0.96 (130239)

Figure 3. Confusion matrix of Random Forest.

Figure 4. Confusion matrix of LightGBM.

B. XGBoost
XGBoost is well known for its high accuracy, fast execution, and particular ability to handle imbalanced datasets, missing values, and other problems that come with real-world data [11]. We found that the XGBoost algorithm achieved a precision of 96.2%, which suggests that it can be a useful tool for accurately classifying URLs as benign or malicious.
C. Real-Time Prediction and Results
The model can accurately assess whether a given URL is harmful or benign. The complete assessment findings are shown in Table V, and the confusion matrix of XGBoost is shown in Figure 5.

Figure 5: LightGBM Confusion Matrix

D. Comparative Results
The main goal of this research is to use cutting-edge machine learning techniques to identify bogus websites [13]. To do this, we compared the results of the three algorithms employed in our research with a few earlier studies, which are listed in Table V. Comparing the output of our proposed machine learning model with that of other models shows that our model, which makes use of the RF, LightGBM, and XGBoost algorithms, performed better overall, with RF exhibiting the best performance [1].

V. ALGORITHMS USED

There are currently two methods for determining whether a URL is genuine.

Decision trees are used to build training models that predict target values or classes in a tree representation [11]. The best splitter among the attributes available for classification is selected as the root of the tree, and the algorithm keeps building the tree until it reaches a leaf node. Each leaf node of the tree corresponds to a class label, and each internal node corresponds to an attribute [14].

The random forest algorithm creates a forest made up of many decision trees [12]. A high tree count results in high detection accuracy. Tree creation is based on the bootstrap method [11]: to build a single tree, samples are drawn at random from the dataset with replacement [15], [18]. At each split, the random forest method chooses among a randomly selected subset of attributes [14].

Fig. 7. Working of the Random Forest Algorithm

Users may only browse websites that appear on the blacklist and whitelist, which are seldom updated, and cannot access any other websites [19].
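The bootstrap sampling, random attribute selection, and voting described for the random forest in this section can be sketched in a few lines. This is a simplified illustration, not the implementation used in this study: depth-1 trees (decision stumps) are swapped in for full decision trees, the attribute subset is drawn once per tree rather than at every split, the attributes are assumed to be binary, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature_ids):
    """Fit a depth-1 tree: pick the candidate feature with the fewest errors.

    Each row is (features, label) with binary features. The stump stores
    the majority label for feature value 0 and for feature value 1.
    """
    default = majority([label for _, label in rows])
    best = None
    for f in feature_ids:
        leaf = {}
        for value in (0, 1):
            labels = [label for x, label in rows if x[f] == value]
            leaf[value] = majority(labels) if labels else default
        errors = sum(leaf[x[f]] != label for x, label in rows)
        if best is None or errors < best[0]:
            best = (errors, f, leaf)
    _, feature, leaf = best
    return feature, leaf


def fit_forest(rows, n_trees, n_candidates, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        candidates = rng.sample(range(n_features), n_candidates)
        forest.append(fit_stump(sample, candidates))
    return forest


def predict(forest, x):
    """Classify x by majority vote over all stumps."""
    return majority([leaf[x[feature]] for feature, leaf in forest])


# Hypothetical toy data: three binary features, label 1 = phishing.
# Features 0 and 1 follow the label; feature 2 is unrelated noise.
toy_rows = [((1, 1, 0), 1), ((1, 1, 1), 1), ((0, 0, 0), 0), ((0, 0, 1), 0)] * 5
toy_forest = fit_forest(toy_rows, n_trees=25, n_candidates=2)
```

Calling `predict(toy_forest, (1, 1, 0))` returns the class chosen by most of the 25 stumps, which is the class mode mentioned in the results discussion.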
Figure 6: The Decision Tree Algorithm in Action

When a user enters a URL that leads to a phishing website, a pop-up window appears to warn them. If the user still wants to access data from the website, they can choose the 'CONFIRM' option; otherwise, they are sent back to the previous page. This brings phishing detection technology to end users [11].

Our proposed solution incorporates three approaches: blacklists and whitelists, heuristics, and visual similarity. Our proposed system employs the following algorithm [15].

1. Create a browser extension that monitors all "http" traffic from the end user's system. Using an extension rather than a standalone application or program enables real-time processing and dynamic output delivery.
2. Compare each URL's domain against lists of trusted and illegitimate domains [2].
3. In addition, evaluate the website on a range of characteristics. We investigated the following characteristics: website protocol (secure or unsecure), URL length, and number of hyperlinks.
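Step 2 of the algorithm above, the comparison of a URL's domain against trusted and illegitimate lists, can be sketched as follows. This is an illustration under stated assumptions rather than the extension's actual code: the two sets contain placeholder domains, the names `domain_of` and `list_verdict` are hypothetical, and only exact host matches are considered.

```python
from urllib.parse import urlsplit

# Hypothetical lists, for illustration only; a real deployment would load
# and regularly refresh them from a maintained source.
TRUSTED_DOMAINS = {"example.org"}
ILLEGITIMATE_DOMAINS = {"phish.example"}


def domain_of(url):
    """Lower-cased host name of a URL, or '' if none can be parsed."""
    if "://" not in url:
        url = "http://" + url  # let urlsplit find the host of scheme-less input
    return (urlsplit(url).hostname or "").lower()


def list_verdict(url):
    """Return 'trusted', 'illegitimate' or 'unknown' for a URL.

    'unknown' means the lists are silent and the feature-based
    checks of step 3 would have to decide.
    """
    host = domain_of(url)
    if host in ILLEGITIMATE_DOMAINS:
        return "illegitimate"
    if host in TRUSTED_DOMAINS:
        return "trusted"
    return "unknown"
```

Because `urlsplit` separates any user-info part from the host, a URL such as `http://user@phish.example/login` is judged by the host after the @ sign, which matches the address-bar behaviour described for feature 1.4.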
VI. WORKING

• We collected unstructured URL data from many websites, including Phishtank, Kaggle, and Alexa.
• After structuring the dataset, each feature is assigned a binary (0,1) value, which is then fed into the classifiers.
• We train three distinct classifiers using the Decision Tree and Random Forest approaches and test their accuracy [1].

VII. RESULTS

The machine learning algorithms were loaded using Scikit-Learn. Classifiers are trained on one set of data and then evaluated on a different set. An accuracy score was used to evaluate the effectiveness of each classifier.

Fig. 8. Random Forest Algorithm Accuracy

Fig. 9. Accuracy with Decision Tree Algorithm
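The train-then-test procedure and the accuracy score described in this section can be illustrated with a small stand-alone sketch. It uses only the Python standard library as a stand-in for the Scikit-Learn utilities the study relied on; the names `holdout_split` and `accuracy`, the 25% test fraction, the toy rows, and the trivial "classifier" are assumptions made for illustration.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical rows: (binary feature vector, label), label 1 = phishing.
rows = [((i % 2, (i // 2) % 2), i % 2) for i in range(40)]
train_rows, test_rows = holdout_split(rows)

# Stand-in "classifier" for the sketch: predict the first feature.
y_true = [label for _, label in test_rows]
y_pred = [features[0] for features, _ in test_rows]
test_accuracy = accuracy(y_true, y_pred)
```

Keeping the test rows out of training is what makes the reported accuracy an estimate of performance on unseen URLs rather than a measure of memorization.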
Although lexical features alone yield a high level of accuracy (about 97%), phishers have figured out how to alter URLs so that they avoid detection. Combining these features with other kinds of features is the most effective tactic. In the future, we must expand the phishing detection architecture, leverage online learning to better understand contemporary attack strategies, and increase the model's accuracy by extracting further features.
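The lexical features discussed in this paper (URL length, number of dots, the @ sign, an IP address in place of a domain name, URL shorteners, protocol, and keywords) can be computed directly from the raw URL string. The sketch below is one possible encoding, not the feature set used in this study: the keyword list, the shortener list, the IPv4 pattern, and the function name `lexical_features` are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword and shortener lists, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account")
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly"}
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def lexical_features(url):
    """Map a raw URL to a dict of simple lexical features."""
    host = (urlsplit(url).hostname or "").lower()
    lowered = url.lower()
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        "has_at_sign": int("@" in url),
        "host_is_ip": int(bool(_IPV4.match(host))),
        "uses_shortener": int(host in SHORTENER_HOSTS),
        "uses_https": int(lowered.startswith("https://")),
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = lexical_features("http://125.98.3.123/secure/login.html")
```

The resulting dictionary can be turned into a fixed-order numeric vector and passed to any of the classifiers compared in this paper.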
In conclusion, phishing presents a significant risk to online security and safety, making its detection a crucial concern. We examined conventional phishing detection methods, such as heuristic assessments and blacklists [6]. By using machine learning algorithms, we may be able to automatically identify fraudulent websites and stop a variety of attacks, such as phishing, malware, and defacement [9]. This study evaluated three popular algorithms, XGBoost, LightGBM, and RF, comparing their results on the "Phishing Websites Dataset". We found that they yielded impressive results on the dataset. With an accuracy rate of more than 97%, our proposed method successfully distinguishes between bogus and real websites. According to a feature importance study, the length of the URL, the number of dots in it, and the presence of certain keywords are some of the most important characteristics that reveal phony websites. To detect bogus websites, we developed a Chrome plugin using the fastest algorithm. Future studies might examine the application of more sophisticated deep learning methods to enhance detection.