Improving spam detection with automaton
1 March 2016
Whois
● Antonio Costa (Cooler)
● Just another systems analyst
● GitHub: CoolerVoid, https://github.com/CoolerVoid
● Contact: acosta@conviso.com.br, coolerlair@gmail.com
How it works
● Anti-spam: the common way
● Fetch e-mails over POP3 / IMAP ...
● Validate
● Clean up and tokenize
● BoW (bag-of-words), SoW (set-of-words) ...
● tf–idf (term frequency–inverse document frequency) ...
● Supervised learning
● Classification (SVM, KNN, NB, random forest ...)
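The first steps of this pipeline (get the message, clean it, tokenize it) can be sketched in a few lines. This is a minimal Python sketch and not the project's code: the sample message, the addresses and the function names `extract_text` and `tokenize` are made up for illustration, and it only handles a simple non-multipart message.

```python
import re
from email import message_from_string

# Hypothetical raw message, as it could arrive over POP3 / IMAP.
RAW_MESSAGE = """\
From: promo@example.com
To: user@example.org
Subject: You WON a prize

Click http://198.51.100.7/prize.exe now! Luan likes hacking.
"""

TOKEN_RE = re.compile(r"[a-z0-9]+")


def extract_text(raw):
    """Return subject plus body of a simple, non-multipart message."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    body = msg.get_payload()  # a plain string for non-multipart messages
    return f"{subject}\n{body}"


def tokenize(text):
    """Clean step: lowercase the text and keep only alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize(extract_text(RAW_MESSAGE))
print(tokens)
```

A real mailbox would also need MIME decoding, HTML stripping and character-set handling, which are left out here.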
How it works
● Anti-spam: the common way
● Fetch e-mails over POP3 / IMAP
● Validate
– Country-based filtering
– DNS-based blacklists
– Enforcing RFC standards
– SMTP callback verification
● DNS-based blacklists
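A DNS-based blacklist is queried by reversing the octets of the sender's IPv4 address, appending the blacklist zone, and asking DNS for an A record. The sketch below shows that convention in Python; the zone `dnsbl.example.org` and the function names are placeholders, not a real list or the project's API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the lookup name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the zone answers with an A record (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No record (or any resolver failure) is treated as "not listed" here.
        return False


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Treating every resolver failure as "not listed" is a simplification; a production check would probably tell a timeout apart from a negative answer.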
Wake UP
How it works
● Anti-spam: the common way
● Fetch e-mails over POP3 / IMAP ... (input: string)
● Validate
● Clean up and tokenize
● BoW (bag-of-words), SoW (set-of-words), tf–idf (term frequency–inverse document frequency) ... (create the matrix)
● Supervised learning (uses the matrix)
● Classification (SVM, KNN, NB, random forest ...)
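To make the "create the matrix" step concrete, here is a minimal tf–idf sketch in Python. It assumes the plain, unsmoothed definition (term frequency times log of documents over document frequency); libraries often use smoothed variants, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, matrix) for a list of non-empty token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix


docs = [["cheap", "pills", "cheap"], ["meeting", "notes"], ["cheap", "meeting"]]
vocab, matrix = tf_idf(docs)
for row in matrix:
    print([round(x, 3) for x in row])
```

Each row of the matrix is one e-mail and each column one term, which is the input the supervised learning step consumes.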
Bag-of-words
[0] “Luan likes to make hacking. Josimar likes to make hacking too.”
[1] “Luan also likes to web hacking.”
● Create the array of words (tokenize ...)
{ “Luan”, “likes”, “to”, “make”, “hacking”, “Josimar”, “too”, “also”, “web” } (9 elements in total)
● Count the number of appearances of each word
[0] { 1, 2, 2, 2, 2, 1, 1, 0, 0 }
[1] { 1, 1, 1, 0, 1, 0, 0, 1, 1 }
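The two count vectors can be reproduced with a short Python sketch. The helper names are made up for illustration; the vocabulary keeps first-seen order so that the columns line up with the word array on the slide.

```python
import re


def tokenize(text):
    """Split on anything that is not a letter (case is kept as-is)."""
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(docs):
    """Collect distinct words in order of first appearance."""
    vocab = []
    for doc in docs:
        for tok in tokenize(doc):
            if tok not in vocab:
                vocab.append(tok)
    return vocab


def bag_of_words(doc, vocab):
    """Count how many times each vocabulary word appears in the document."""
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]


docs = [
    "Luan likes to make hacking. Josimar likes to make hacking too.",
    "Luan also likes to web hacking.",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```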
The common way
Look at the following
The common way
Why naive Bayes?
● In my tests:
Classifier  Accuracy  Performance
KNN         96%       Slow
SVM         92%       Medium
NB          94%       Fast
Super simple: you're just doing a bunch of counts. Naive Bayes is an eager learning classifier and it is much faster than KNN. Nowadays it can be used for real-time prediction.
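The "bunch of counts" remark can be illustrated with a small multinomial naive Bayes in Python. This is a sketch under assumptions, not the classifier behind the numbers above: it uses add-one (Laplace) smoothing, and the class name and the four training messages are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # messages seen per label
        self.word_counts = defaultdict(Counter)   # word counts per label
        self.vocab = set()

    def train(self, tokens, label):
        # Training is only counting, which is why it is cheap.
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train(["cheap", "pills", "buy", "now"], "spam")
nb.train(["buy", "cheap", "watches"], "spam")
nb.train(["meeting", "notes", "attached"], "ham")
nb.train(["lunch", "meeting", "tomorrow"], "ham")
print(nb.predict(["cheap", "pills"]), nb.predict(["meeting", "tomorrow"]))
```

Prediction is one pass over the tokens per label, while KNN has to compare the message against the stored training set, which fits the "Fast" versus "Slow" column.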
My way
Automata as match rules
● Gain accuracy!
● Gain performance!
● Because rules can match spam before the classifier is used
● www.site.com/www.bank.com/
● URL/malware.exe, with a rule like URL/[a-zA-Z]*.exe ...
● A rule to detect an IP address in a URL
● Deterministic finite automaton to detect them
● Use ranking!
Classifier: NB, accuracy: 94% +4%, performance: fast
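As an illustration of a deterministic finite automaton used as a rule, the Python sketch below recognizes a host made of four dot-separated groups of one to three digits, which is one way to detect an IP address in a URL. The state encoding and function names are assumptions made for this example; it does not check that each group is at most 255.

```python
from urllib.parse import urlsplit

DIGITS = "0123456789"


def build_ipv4_dfa():
    """Transition table; a state is (octet index, digits seen in that octet)."""
    table = {}
    for octet in range(4):
        for seen in range(4):
            if seen < 3:
                table[((octet, seen), "digit")] = (octet, seen + 1)
            if seen >= 1 and octet < 3:
                table[((octet, seen), "dot")] = (octet + 1, 0)
    accepting = {(3, 1), (3, 2), (3, 3)}
    return table, accepting


TABLE, ACCEPTING = build_ipv4_dfa()


def char_class(ch):
    """Map a character to the DFA input alphabet."""
    if ch in DIGITS:
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def is_ipv4_like(host):
    """Run the DFA: one table lookup per character, no backtracking."""
    state = (0, 0)
    for ch in host:
        state = TABLE.get((state, char_class(ch)))
        if state is None:  # missing transition means the dead state
            return False
    return state in ACCEPTING


def url_has_ip_host(url):
    """Extract the host part of a URL and test it with the DFA."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit find the host of scheme-less URLs
    host = urlsplit(url).hostname or ""
    return is_ipv4_like(host)


print(url_has_ip_host("http://198.51.100.7/prize.exe"))
```

Because the automaton reads each character once, the cost of a rule grows linearly with the input, which is where the performance gain over running the classifier on every message would come from.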
My way
Automata as match rules
● Gain accuracy!
● Gain performance!
● Because rules can match spam before the classifier is used
● Deterministic finite automaton rules to detect:
● www.site.com/www.bank.com/
● URL/malware.exe, with a rule like URL/[a-zA-Z]*.exe ...
● An IP address in a URL
● Phishing
● Use ranking!
Classifier: NB, accuracy: 94% +4%, performance: fast
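The idea of matching rules before the classifier runs can be sketched as a pre-filter. The patterns below are guesses at what such rules could look like (note the escaped dot in `\.exe`), the rule names are invented, and Python's `re` module is a backtracking engine rather than a DFA, so this shows the control flow and not the automaton implementation.

```python
import re

# Hypothetical rules, loosely following the three examples on the slide.
RULES = {
    "exe_download": re.compile(r"https?://\S+/[a-zA-Z]*\.exe\b", re.IGNORECASE),
    "ip_in_url": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}(?!\.?\w)"),
    "nested_domain": re.compile(
        r"[\w-]+(?:\.[\w-]+)+/www\.[\w-]+(?:\.[\w-]+)+", re.IGNORECASE
    ),
}


def match_rules(text):
    """Return the names of all rules that match the text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


def classify_with_prefilter(text, classifier):
    """Flag as spam on any rule hit; otherwise fall back to the classifier."""
    hits = match_rules(text)
    if hits:
        return "spam", hits
    return classifier(text), hits


print(match_rules("Get it at http://198.51.100.7/prize.exe now"))
print(match_rules("Log in at http://www.site.com/www.bank.com/ today"))
```

Any callable that takes the text and returns a label can be passed as `classifier`, for example a trained naive Bayes model.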
Why ranking?
Automata as match rules
● Gain accuracy!
Classifier: NB, accuracy: 94% +4%, performance: fast
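The slides do not say how the ranking is computed. One plausible reading, shown here purely as an assumption, is that every matched rule adds a weight and the message is flagged without calling the classifier once the total reaches a threshold. The rule names, weights and threshold in this Python sketch are all hypothetical.

```python
# Hypothetical weights: how strongly each rule suggests spam.
RULE_WEIGHTS = {"exe_download": 5, "ip_in_url": 3, "nested_domain": 4}
SPAM_THRESHOLD = 5  # hypothetical cut-off


def rank(matched_rules):
    """Sum the weights of the matched rules (unknown rules count as 0)."""
    return sum(RULE_WEIGHTS.get(rule, 0) for rule in matched_rules)


def decide(matched_rules, classify):
    """Return 'spam' on a high rank; otherwise ask the classifier callable."""
    if rank(matched_rules) >= SPAM_THRESHOLD:
        return "spam"
    return classify()


print(decide(["ip_in_url"], lambda: "ham"))
print(decide(["ip_in_url", "nested_domain"], lambda: "ham"))
```

With weights like these, a single weak signal still defers to the classifier, while several signals together decide on their own.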
E-mail audit
The project!
● All source code in C++, 100% open source
● IMAP: communication
● Blacklists: DNS, bad domains, e-mail addresses ...
● Deterministic finite automaton: filters
● tf–idf (term frequency–inverse document frequency)
● Naive Bayes: classifier
E-mail audit
The project!
● In the future, use the GPU to run KNN and the automata ...
● With the GPU, everything becomes fast ...
● Next step: 100% accuracy?
https://github.com/CoolerVoid/email_audit
Thanks
● https://github.com/CoolerVoid
