Phishing Detection Using Named Entity Recognition

ABSTRACT
Phishing is an attempt to acquire sensitive information such as usernames, passwords
and credit card details by masquerading as a trustworthy entity in an electronic
communication. Phishing is a major security threat to the online community, and phishing
scams are escalating in number and sophistication by the day. A phishing attack today
reaches its audience through mass mailings to millions of email addresses around the
world, as well as through messages to highly targeted groups of customers that have been
enumerated through security faults in small clicks-and-mortar retail websites.
This project proposes a methodology to detect phishing attacks and to discover the
entity or organization that the attackers impersonate during phishing attacks. The
methodology first discovers
(i) named entities, which include names of people, organizations, and locations; and
(ii) hidden topics.
Utilizing topics and named entities as features, the next stage classifies each
message as phishing or non-phishing. For messages classified as phishing, the final stage
discovers the impersonated entity. Automatically discovering the impersonated entity
helps the legitimate organization to take down the offending phishing site. This project
also proposes a technique to discriminate phishing emails from legitimate emails using
the distinct structural features present in them. The derived features can be used to
efficiently classify phishing emails before they reach the user's inbox.

Dept of Computer Science And Engineering,SJCET,Palai

TABLE OF CONTENTS
SL.NO  TITLE
1.     Introduction
2.     Proposed System
3.     Proposed System Architecture
4.     Literature Survey
5.     Objectives
6.     Statement of how the objectives are to be tackled
7.     Time Schedule
8.     References

1. INTRODUCTION
Phishing is a word derived from 'fishing'. It refers to an attack in which the attacker
lures users to a fake website by sending them fake e-mails (or instant messages) and
stealthily obtains the victim's personal information, such as user name, password, and
national security ID. This information can then be used for targeted advertisements or
even identity theft (e.g., transferring money from the victim's bank account). The most
frequently used attack method is to send potential victims e-mails that appear to come
from banks, online organizations, or ISPs. In these e-mails the attackers make up a
pretext, e.g. that the password of your credit card has been mis-entered many times, or
that they are providing upgrade services, to lure you to their website to confirm or
modify your account number and password through the hyperlink provided in the e-mail.
If you enter the account number and password, the attackers collect the information at
the server side and are able to perform their next steps with it (e.g., withdrawing
money from your account). Phishing itself is not a new concept, but in recent years
phishers have increasingly used it to steal user information and commit business crime.
Within one to two years, the number of phishing attacks increased dramatically.
Phishing is a type of deception designed to steal your valuable personal data, such as
credit card numbers, passwords, account data, or other information. It is a form of social
engineering that is executed via electronic means and can lead to identity theft and
fraud. Phishing email messages take a number of forms:

They might appear to come from your bank or financial institution, a company you
regularly do business with, such as Microsoft, or from your social networking site.

They might appear to be from someone in your email address book.

They might ask you to make a phone call. Phone phishing scams direct you to call
a phone number where a person or an audio response unit waits to take your
account number, personal identification number, password, or other valuable
personal data.

They might include official-looking logos and other identifying information taken
directly from legitimate websites, and they might include convincing details about
your personal history that scammers found on your social networking pages.

They might include links to spoofed websites where you are asked to enter
personal information.

2. PROPOSED SYSTEM
In this project we propose a method for classifying emails as legitimate or phishing
using named entity recognition. Features extracted from each email are used to detect
phishing mails. Emails labeled as spam, ham or phishing are passed to a classifier, which
identifies each mail as phishing or non-phishing.

[Figure: block diagram of the proposed system. Labels: Emails, Feature comparison,
Classifier, Phishing, Non-phishing]


3. PROPOSED SYSTEM ARCHITECTURE

[Figure: architecture of the proposed system, B-OnGuaRd. Labels: Phisher sends e-mail,
B-OnGuaRd, Feature Comparison, Classifier, Legitimate, Phished mail, Alert, Inbox, User]

4. LITERATURE SURVEY

Topic: Phishing Detection: A Literature Survey [1]

This article surveys the literature on the detection of phishing attacks. Phishing
attacks target vulnerabilities that exist in systems due to the human factor. The paper
surveys many of the recently proposed phishing mitigation techniques and presents a
high-level overview of their categories: detection, offensive defense, correction, and
prevention.
The phishing detection survey proceeds by

defining the phishing problem

categorizing anti-phishing solutions from the perspective of the phishing campaign
life cycle

presenting evaluation metrics that are commonly used in the phishing domain to
evaluate the performance of phishing detection techniques

presenting a literature survey of anti-phishing detection techniques

presenting a comparison of the various phishing detection techniques proposed in
the literature.

Definition
The definition of phishing attacks is not consistent in the literature, because the
phishing problem is broad and incorporates varying scenarios. According to PhishTank:
"Phishing is a fraudulent attempt, usually made through email, to steal your personal
information."

Categorizing anti-phishing solutions

Fig 1: Life cycle of phishing campaign [1]

Detection approaches
User training approaches: end-users can be educated to better understand the
nature of phishing attacks, which ultimately leads them to correctly
identify phishing and non-phishing messages.
Software classification approaches: these mitigation approaches aim at
classifying phishing and legitimate messages on behalf of the user, in an attempt
to bridge the gap left by human error or ignorance.

Fig 2: Overview of phishing detection approaches [1]

Evaluation Metrics
Based on the survey's review of the literature, the following are the most
commonly used evaluation metrics:
True Positive (TP) rate: measures the rate of correctly detected phishing attacks
in relation to all existing phishing attacks.

False Positive (FP) rate: measures the rate of legitimate instances that are
incorrectly detected as phishing attacks in relation to all existing legitimate
instances.

True Negative (TN) rate: measures the rate of correctly detected legitimate
instances in relation to all existing legitimate instances.

False Negative (FN) rate: measures the rate of phishing attacks that are
incorrectly detected as legitimate in relation to all existing phishing attacks.

Precision (P): measures the rate of correctly detected phishing attacks in
relation to all instances that were detected as phishing.

Recall (R): equivalent to the TP rate.

f1 score: the harmonic mean of P and R.

Accuracy (ACC): measures the overall rate of correctly detected phishing and
legitimate instances in relation to all instances.

Weighted Error (WErr): measures the overall weighted rate of incorrectly
detected phishing and legitimate instances in relation to all instances.
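
The definitions above can be turned into a few lines of arithmetic. The following is a minimal sketch, not code from the survey: the names ConfusionCounts and evaluate are hypothetical, and Weighted Error is left out because the weighting scheme is not given in this report.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Raw counts from running a phishing detector on labelled data."""
    tp: int  # phishing instances correctly detected as phishing
    fp: int  # legitimate instances incorrectly detected as phishing
    tn: int  # legitimate instances correctly detected as legitimate
    fn: int  # phishing instances incorrectly detected as legitimate


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when the denominator is zero,
    # e.g. when one class is absent from the evaluation data.
    return numerator / denominator if denominator else 0.0


def evaluate(c):
    """Compute the rate-based metrics listed above from raw counts."""
    phishing = c.tp + c.fn      # all existing phishing instances
    legitimate = c.tn + c.fp    # all existing legitimate instances
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, phishing)
    return {
        "tp_rate": recall,
        "fp_rate": _ratio(c.fp, legitimate),
        "tn_rate": _ratio(c.tn, legitimate),
        "fn_rate": _ratio(c.fn, phishing),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "accuracy": _ratio(c.tp + c.tn, phishing + legitimate),
    }
```

Note that the TP rate and the FN rate always sum to 1, as do the TN rate and the FP rate, because each pair shares the same denominator.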

Topic: Multi-tier phishing detection and filtering approach [2]

Phishing attacks continue to pose serious risks for consumers and businesses, as
well as threatening global security and the economy. Developing countermeasures
against such attacks is therefore an important step towards defending critical
infrastructures such as banking. This paper presents a phishing email filtering approach
using a multi-tier classification technique that combines multiple classification
algorithms.
The major contributions are summarised as follows:
Proposes a new method for extracting the features of phishing email, based on
weighting of message content and message header, and selects the features
according to priority ranking.
Presents a new approach, called the multi-tier classification model, for filtering
phishing emails.
Examines the impact of rescheduling the classifier algorithms in a multi-tier
classification process to classify phishing email and to find the optimum
scheduling.
Provides empirical evidence that the proposed approach substantially reduces
false positives, with lower complexity.

The multi-tier model


In this approach, each email message is classified sequentially by the machine
learning (ML) algorithms of the first two tiers, and the outputs are sent to the
analyser section. The analyser examines the outputs and sends the messages to the
corresponding mailboxes based on the labels assigned by the ML algorithms. If a
message is misclassified by either of the first two tiers (T1 and T2), the analyser
invokes the tier-3 (T3) ML algorithm, which classifies the message again and sends it
to the corresponding mailbox based on its identification.
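
The control flow of the three tiers can be sketched as follows. This is an illustration under a stated assumption, not the algorithm from [2]: the report does not say how the analyser recognises a misclassification, so the sketch reads a disagreement between T1 and T2 as the trigger for T3. The function multi_tier_classify and the three toy keyword rules are hypothetical stand-ins for trained ML classifiers.

```python
def multi_tier_classify(email, tier1, tier2, tier3):
    """Return (label, tiers_used) for one email.

    Each tier is a callable mapping the email text to "phishing" or
    "legitimate". ASSUMPTION: the analyser invokes tier 3 only when
    tier 1 and tier 2 disagree.
    """
    label1 = tier1(email)
    label2 = tier2(email)
    if label1 == label2:
        # Both tiers agree, so the analyser routes the mail directly.
        return label1, 2
    # The tiers disagree, so the third tier makes the final decision.
    return tier3(email), 3


# Toy stand-ins for trained classifiers (hypothetical rules).
def urgent_wording(email):
    return "phishing" if "verify your account" in email.lower() else "legitimate"


def raw_ip_link(email):
    return "phishing" if "http://192." in email else "legitimate"


def mentions_password(email):
    return "phishing" if "password" in email.lower() else "legitimate"
```

A call such as multi_tier_classify(text, urgent_wording, raw_ip_link, mentions_password) only pays the cost of the third classifier for the messages the first two cannot settle.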

Fig 3: Block diagram for multi-tier classification model[2]

Feature construction
Features are extracted from each email based on weighting of the message content and
message header, and are selected according to priority ranking. Each phishing email
is parsed as a text file to identify each header element and distinguish it from the
body of the message. Every substring within the subject header and the message body that
is delimited by white space is considered a token, and an alphabetic word is defined as
a token that contains only English alphabetic characters (A-Z, a-z) or apostrophes.

Category 1: features from the message subject header


Binary feature indicating 3 or more repeated characters.
Number of words with all letters in uppercase.
Number of words with at least 15 characters.
Number of words with at least two of letters J, K, Q, X, Z.
Number of words with no vowels.
Number of words with non-English characters, special characters such as
punctuation, or digits at beginning or middle of word.
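
As an illustration, the six subject-header features could be computed as below. This is a sketch under assumptions, not the extraction code of [2]: the function name and dictionary keys are hypothetical, "repeated characters" is read as the same non-space character three times in a row, and "at least two of letters J, K, Q, X, Z" is read as at least two occurrences of those letters within one word.

```python
import re

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")


def subject_features(subject):
    """Compute the six Category 1 features for one subject header."""
    features = {
        # Binary: the same non-space character three or more times in a row.
        "repeated_chars": 1 if re.search(r"(\S)\1\1", subject) else 0,
        "all_caps_words": 0,
        "long_words": 0,
        "rare_letter_words": 0,
        "no_vowel_words": 0,
        "odd_char_words": 0,
    }
    # Tokens are substrings delimited by white space.
    for token in subject.split():
        lower = token.lower()
        letters = [ch for ch in lower if ch.isalpha()]
        # str.isupper() is True when every cased character is uppercase.
        if token.isupper():
            features["all_caps_words"] += 1
        if len(token) >= 15:
            features["long_words"] += 1
        if sum(ch in RARE_LETTERS for ch in lower) >= 2:
            features["rare_letter_words"] += 1
        if letters and not any(ch in VOWELS for ch in letters):
            features["no_vowel_words"] += 1
        # Anything other than A-Z, a-z or an apostrophe at the beginning
        # or middle of the word (the last character is ignored).
        if re.search(r"[^A-Za-z']", token[:-1]):
            features["odd_char_words"] += 1
    return features
```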

Category 2: features from the priority and content-type headers


Binary feature indicating whether the priority had been set to any level
besides normal or medium.
Binary feature indicating whether a content-type header appeared within the
message header.

Category 3: features from the message body


Proportion of alphabetic words with no vowels and at least 7 characters
Proportion of alphabetic words with at least two of letters J, K, Q, X, Z
Proportion of alphabetic words at least 15 characters long
Binary feature indicating whether the strings "From:" and "To:" were
both present
Number of HTML opening comment tags
Number of hyperlinks (href)
Number of clickable images represented in HTML
Binary feature indicating whether a text colour was set to white
Number of URLs in hyperlinks with digits or &, %, or @
Number of colour element (both CSS and HTML format)
Binary feature indicating whether JavaScript has been used or not
Binary feature indicating whether CSS has been used or not
Binary feature indicating opening tag of table
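
Several of the HTML-based body features can be counted with a small parser. The sketch below uses Python's standard html.parser module and covers only six of the Category 3 features; the class name, the feature keys, and the reading of "clickable image" as an img element nested inside a hyperlink are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class BodyFeatureParser(HTMLParser):
    """Collects a subset of the Category 3 features from an HTML body."""

    def __init__(self):
        super().__init__()
        self.features = {
            "comments": 0,          # HTML opening comment tags
            "hyperlinks": 0,        # <a> elements carrying an href
            "clickable_images": 0,  # <img> nested inside a hyperlink
            "suspicious_urls": 0,   # hrefs with digits or &, %, @
            "uses_javascript": 0,   # binary: a <script> tag was seen
            "has_table": 0,         # binary: an opening <table> tag was seen
        }
        self._open_links = 0

    def handle_comment(self, data):
        self.features["comments"] += 1

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.features["hyperlinks"] += 1
            self._open_links += 1
            if re.search(r"[0-9&%@]", attrs["href"]):
                self.features["suspicious_urls"] += 1
        elif tag == "img" and self._open_links > 0:
            self.features["clickable_images"] += 1
        elif tag == "script":
            self.features["uses_javascript"] = 1
        elif tag == "table":
            self.features["has_table"] = 1

    def handle_endtag(self, tag):
        if tag == "a" and self._open_links > 0:
            self._open_links -= 1


def extract_body_features(html_body):
    """Parse one HTML message body and return the feature dictionary."""
    parser = BodyFeatureParser()
    parser.feed(html_body)
    parser.close()
    return parser.features
```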

Multi-tier filtering algorithm

Topic: An efficacious method for detecting phishing webpages through target domain
identification [3]

Any anti-phishing technique is incomplete without identification of the phishing
target. Hence there is a need for a holistic approach that can identify the right
phishing target even when attackers use masquerading techniques. Such a method would
gain significant importance among anti-phishing techniques, as it alerts the target
owners to take the necessary countermeasures and enhance security.
In this paper, a novel approach to detecting phishing webpages is proposed. The
webpage under scrutiny is examined to identify all the direct and indirect links
associated with it, which generate the domain sets S1 and S2 respectively. From these
sets the target domain set is identified and given as input to the Target Identification
(TID) algorithm to identify the phishing target. Using DNS lookup, the domains of the
suspicious webpage and the phishing target are mapped to their corresponding IP
addresses. Comparing the two IP addresses determines the authenticity of the suspicious
webpage. As this approach depends only on the content of the suspicious webpage, it
requires neither prior knowledge about the site nor training data.

System overview
This system identifies phishing websites based on the following premise: for a
phishing website, the target will be a legitimate site, whereas for a genuine website,
the system will point to the genuine site itself as its own target. On this basis, the
phishing webpage is identified by comparing the suspicious webpage with its target.
For a given suspicious page, the method first identifies all the direct and indirect
links associated with that page. The links directly associated with the webpage are
extracted from the HTML source of the page and grouped by domain into a domain set S1.
The indirectly associated links are retrieved by extracting the keywords in the webpage
and feeding these keywords to a search engine. The first n links returned by the search
engine are taken as the indirectly associated links and grouped by domain into a second
domain set S2. A reduced domain set S3 is constructed by extracting only the domains
common to both S1 and S2. This set S3 is fed as input to the TID algorithm to identify
the phishing target domain. DNS lookup is used to map the domain of the identified
phishing target to its corresponding IP address, and the domain of the suspicious
webpage is mapped to its IP address in the same way. Comparing the two IP addresses
determines the authenticity of the suspicious webpage.
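
The construction of S1, S2 and S3 amounts to grouping links by domain and intersecting two sets. The sketch below illustrates that step only. It assumes the direct links and the search-engine results are already available as lists of URLs, and it simplifies "domain" to the URL host name with a leading "www." removed; the function names are hypothetical and the TID algorithm itself is not reproduced.

```python
from urllib.parse import urlparse


def domain_set(links):
    """Group links by host name and return the set of distinct hosts."""
    hosts = set()
    for link in links:
        host = urlparse(link).hostname  # None for relative links
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        hosts.add(host)
    return hosts


def candidate_targets(direct_links, search_result_links, n=10):
    """Return S3, the domains common to S1 and S2."""
    s1 = domain_set(direct_links)             # links found in the page source
    s2 = domain_set(search_result_links[:n])  # first n search-engine results
    return s1 & s2
```

A phishing page typically links to its target to look convincing, so the target's domain tends to appear in both sets while the attacker's own domain appears only in S1.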

Fig 4: System design (A1-A3: extract links present in webpage; group links according to
domains; domain set S1 given for set comparison; B1-B5: extract keywords; keywords fed to
search engine; extract the results; group links according to domains; domain set S2 given
for set comparison; C1-C4: identified target domain set; input target domain set to TID
algorithm; identify the target domain; supply domain name of the target domain to
third-party DNS server; D1: supply domain name of the suspicious webpage to third-party
DNS server; E1: label generation based on DNS comparison (phishing = 0, legitimate = 1)) [3]

Identifying the target domain


The target domain is identified from the target domain set (S3), and then the
authenticity of the suspicious webpage is checked. The set S3 contains the predicted
target domains and, depending on the number of domains in it, two scenarios are possible.

Fig 5: TID algorithm

Phishing detection using DNS lookup


Here the target domain and the domain of the suspicious webpage P are taken, and a
third-party DNS lookup is performed on both. As a result, the corresponding IP addresses
are obtained for both domains. Comparing these two sets of IP addresses determines the
legitimacy of the webpage P. If the IP addresses of the domain of P match those
retrieved for the target domain, P is declared a legitimate webpage; otherwise, it is
declared a phishing webpage. A third-party DNS lookup is used to avoid pharming attacks,
in which the user is redirected to a phished page even though he enters a correct URL;
attackers carry this out by exploiting vulnerabilities in DNS server software. IP
addresses are compared instead of domain names to overcome discrepancies in domain
names.
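
The comparison step can be sketched as follows. Two points are assumptions rather than details from [3]: the sketch resolves names through the operating system's resolver (socket.gethostbyname_ex) instead of a dedicated third-party DNS server, and it treats the two domains as matching when their address sets share at least one IP. The function names are hypothetical; the labels follow the caption of Fig 4 (phishing = 0, legitimate = 1).

```python
import socket


def resolve_ips(domain):
    """Resolve a domain to its set of IPv4 addresses.

    ASSUMPTION: uses the system resolver; the paper uses a third-party
    DNS server so that a poisoned local resolver cannot interfere.
    """
    try:
        _, _, addresses = socket.gethostbyname_ex(domain)
    except OSError:
        # Covers socket.gaierror and socket.herror for unresolvable names.
        return set()
    return set(addresses)


def label_page(page_domain, target_domain, resolve=resolve_ips):
    """Return 1 (legitimate) if the domains share an IP address, else 0 (phishing)."""
    page_ips = resolve(page_domain)
    target_ips = resolve(target_domain)
    return 1 if page_ips & target_ips else 0
```

Passing the resolver as a parameter keeps the comparison logic testable without network access and makes it straightforward to substitute a resolver that queries a specific DNS server.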

Topic: Intelligent phishing detection and protection scheme for online transactions [4]

Phishing is an instance of social engineering used to deceive users into giving up
their sensitive information through an illegitimate website that looks and feels
exactly like the target organization's website. Most phishing detection approaches
utilize Uniform Resource Locator (URL) blacklists or phishing website features combined
with machine learning techniques to combat phishing. Approaches that rely on URL
blacklists cannot generalize well to new phishing attacks because of human weakness in
verifying blacklists, while existing feature-based methods suffer from high false
positive rates and insufficient phishing features. As a result, online transactions
remain inadequately protected.
To address the problem robustly, the paper builds a state-of-the-art model using a
Neuro-Fuzzy scheme with five inputs. Neuro-Fuzzy combines Fuzzy Logic and a Neural
Network.

Methodologies
The proposed approach uses Neuro-Fuzzy with five inputs to detect phishing websites
in online transactions, maximizing accuracy while minimizing false positives and
operation time.
Neuro-Fuzzy
Neuro-Fuzzy is a combination of Fuzzy Logic and a neural network with the ability
to reason and to learn. This combination allows the use of both numeric and linguistic
properties. The advantage of the Neuro-Fuzzy approach is that it is a universal
approximator with the ability to use Fuzzy IF...THEN rules. While a Neural Network
performs well when dealing with raw data, Fuzzy Logic deals with reasoning on a higher
level, using numerical and linguistic information from domain experts. Neuro-Fuzzy was
chosen because it can learn from data, from the Neural Network point of view, and
form linguistic rules, from the Fuzzy Inference point of view, thus allowing the power
of intelligent systems to be used.
Five Inputs
The five inputs are five tables in which extracted features are stored for reference.
These include:
1. Legitimate site rules: a summary of the law covering phishing crime.
2. User-behavior profile: a list of people's behavior when interacting with phishing
and legitimate websites.
3. PhishTank: a free community website operated by OpenDNS, where suspected websites
are verified and voted as phish by community experts.
4. User-specific site: contains binding requirements between a user and online
transaction service providers.
5. Pop-Ups from Email: regular phrases used by phishers as they appear on screen.
These five inputs are used because they are wholly representative of phishing attack
techniques and strategies. From the five inputs, 288 features are extracted and used
as training and testing input data for the Neuro-Fuzzy system, to generate Fuzzy
IF...THEN rules and to discriminate between phishing, suspicious and legitimate sites
accurately in real time. If a phishing website is detected, a voice alarm is generated.
For a suspicious website, the system generates red.

Fig 6: Intelligent phishing detection system [4]

Fuzzy Inference Structure


For simplicity, assume a Fuzzy inference system with two inputs, x and y, and one
output. The Sugeno Fuzzy model applies Fuzzy rules of the form:

IF x1 is A1 AND x2 is A2 ... AND xm is Am THEN y = f(x1, x2, ..., xm)

where x1, x2, ..., xm are input variables; A1, A2, ..., Am are Fuzzy sets; and y is
either a constant or a linear function of the input variables. When y is constant, we
obtain the zero-order Sugeno Fuzzy model, in which the consequent of a rule is
specified by a singleton. When y is a first-order polynomial (y = k0 + k1x1 + k2x2 +
... + kmxm), we get a first-order Sugeno Fuzzy model.
Layer 1 is the input layer. Neurons in this layer simply transmit external crisp
signals to the next layer. Layer 2 is the fuzzification layer. Neurons in this layer
perform fuzzification and use a bell activation function; the activation function of a
fuzzification neuron is the membership function that specifies its Fuzzy set. Thus, the
activation of each neuron in layer 2 is set to a generalized bell (gbell) membership
function. Layer 3 is the rule base. Each neuron in this layer receives inputs from the
corresponding fuzzification neurons and calculates the firing strength of the rule it
represents. Layer 4 is the normalization layer. Each neuron in this layer receives
inputs from every neuron in the rule layer and calculates the normalized firing
strength of a given rule. The normalized firing strength is the ratio of the firing
strength of a given rule to the sum of the firing strengths of all rules. Layer 5 is
defuzzification. This layer computes the sum of the outputs of all the combined neurons
and produces the overall Adaptive Neuro-Fuzzy Inference System output, y.
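
The layer-by-layer computation can be made concrete with a small forward pass. The sketch below is illustrative and is not the 288-feature system of [4]: it assumes a zero-order Sugeno model (constant consequents), uses the product to combine membership degrees into a firing strength, and all parameter values in the usage note are made up. The generalized bell function 1 / (1 + |(x - c) / a|^(2b)) is the standard gbell definition.

```python
from itertools import product


def gbell(x, a, b, c):
    """Generalized bell membership function: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(inputs, membership_params, consequents):
    """Forward pass of a zero-order Sugeno Neuro-Fuzzy network.

    inputs            -- crisp input values (Layer 1 passes them through)
    membership_params -- for each input, a list of (a, b, c) gbell parameters
    consequents       -- one constant per rule, ordered like
                         itertools.product over the fuzzy sets of each input
    """
    # Layer 2: fuzzification of every input against each of its fuzzy sets.
    degrees = [
        [gbell(x, *params) for params in sets]
        for x, sets in zip(inputs, membership_params)
    ]
    # Layer 3: one rule per combination of fuzzy sets.
    # ASSUMPTION: firing strength is the product of the membership degrees.
    strengths = []
    for combo in product(*degrees):
        strength = 1.0
        for degree in combo:
            strength *= degree
        strengths.append(strength)
    # Layer 4: normalize each firing strength by the sum over all rules.
    total = sum(strengths)
    normalized = [s / total for s in strengths]
    # Layer 5: weight each rule's constant output and sum the results.
    return sum(w * k for w, k in zip(normalized, consequents))
```

With two inputs and two fuzzy sets per input ("low" centred at 0 and "high" centred at 1, both with a = b = 1), the call anfis_forward([0.0, 1.0], [[(1, 1, 0), (1, 1, 1)], [(1, 1, 0), (1, 1, 1)]], [0.0, 1.0, 0.0, 1.0]) evaluates four rules and returns their normalized weighted sum.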

Fig 7: Intelligent phishing detection fuzzy inference system structure [4]

5. OBJECTIVES
The aim of this project is to provide the anti-phishing industry with a solution that can
detect sophisticated phishing attacks as well as simple ones.
To achieve this aim, the following detailed objectives and tasks need to be
performed:
To survey and examine the current anti-phishing techniques and solutions, and to
gain further knowledge through understanding these techniques.
To conduct an investigation of new phishing attacks and potential threats.
To collect the requirements of the proposed system.
To design the proposed system's architecture.
To implement the designed architecture as a working program.
To evaluate the resulting system.

6. STATEMENT OF HOW THE OBJECTIVES ARE TO BE TACKLED
Phishing is a continual threat that keeps growing to this day. The damage caused
by phishing ranges from denial of access to email to substantial financial loss.
To achieve the objectives, a survey of various papers was carried out, through which
different phishing and anti-phishing techniques were identified and studied. A new
system, B-OnGuaRd, is proposed that discriminates phished mails from legitimate
emails before they reach the user's inbox, by comparing the features present in the
emails and classifying them. The architecture of the proposed system specifies the
overall working of the system. Improvements are made to the architecture for better
results.

7. TIME SCHEDULE

8. REFERENCES
[1] Mahmoud Khonji, Youssef Iraqi, and Andrew Jones, "Phishing Detection: A Literature
Survey".
[2] Rafiqul Islam and Jemal Abawajy, "A multi-tier phishing detection and filtering
approach".
[3] Gowtham Ramesh, Ilango Krishnamurthi, and K. Sampath Sree Kumar, "An efficacious
method for detecting phishing webpages through target domain identification".
[4] P.A. Barraclough, M.A. Hossain, M.A. Tahir, G. Sexton, and N. Aslam, "Intelligent
phishing detection and protection scheme for online transactions".
[5] Ian Fette, Norman Sadeh, and Anthony Tomasic, "Learning to Detect Phishing Emails".
