Improving spam detection with automaton

Improving SPAM detection
March 1, 2016

Whois
● Antonio Costa – Cooler
● Just another system analyst
● GitHub: https://github.com/CoolerVoid
● Contact: acosta@conviso.com.br, coolerlair@gmail.com

How it works
● Anti-spam: the common way
● Fetch e-mails over POP3 / IMAP ...
● Validate
● Clean up and tokenize
● BoW (Bag-of-Words), SoW (Set-of-Words)...
● tf–idf (term frequency–inverse document frequency)...
● Supervised learning
● Classification (SVM, KNN, NB, Random Forest...)
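
The clean-up and tokenization step could look roughly like the Python sketch below. The function name `clean_and_tokenize` and the exact cleaning rules (drop HTML tags, decode entities, lowercase, keep letters, digits and apostrophes) are assumptions made for illustration; the slides do not say which rules the author applies.

```python
import html
import re

def clean_and_tokenize(raw):
    """Turn a raw e-mail body into a list of lowercase tokens (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML tags
    text = html.unescape(text)            # decode entities such as &amp;
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = clean_and_tokenize("<p>WIN &amp; get FREE money!!!</p>")
print(tokens)
```

The resulting token list is what the BoW or tf–idf step would consume.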

How it works
● Fetch e-mails over POP3 / IMAP
● Validate
– Country-based filtering
– DNS-based blacklists
– Enforcing RFC standards
– SMTP callback verification

● DNS-based blacklists
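
A DNS-based blacklist is usually queried by reversing the octets of the sender's IPv4 address and looking the result up under the blacklist's zone; an answer means "listed". The Python sketch below shows that convention. The zone `dnsbl.example.org` is a placeholder, not a real blacklist, and the function names are made up for this example.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the zone answers with an A record for the reversed IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or the lookup failed): treated here as "not listed".
        return False

# 192.0.2.99 is a documentation address; the zone is a placeholder.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Only `dnsbl_query_name` runs offline; `is_listed` needs network access and a real blacklist zone.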

How it works
● Fetch e-mails over POP3 / IMAP ... – INPUT STRING
● Validate
● Clean up and tokenize
● BoW (Bag-of-Words), SoW (Set-of-Words), tf–idf (term frequency–inverse document frequency)... – CREATE MATRIX
● Supervised learning – USING MATRIX
● Classification (SVM, KNN, NB, Random Forest...)
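
As an illustration of the "create matrix" step, here is a small Python sketch that builds a tf–idf matrix with one row per document and one column per vocabulary word. It assumes the plain variant tf = count / document length and idf = log(N / document frequency); several other weightings exist, and the slides do not say which one the author uses. The two sample documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) for a list of text documents."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero on empty documents
        matrix.append([(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, matrix

vocab, matrix = tfidf_matrix(["free money now", "meeting now"])
print(vocab)
print(matrix)
```

A word that appears in every document ("now" here) gets weight 0, which is the point of the idf factor: it tells the classifier nothing.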

Bag-of-words
[ 1 ] - "Luan likes to make hacking. Josimar likes to make hacking too."
[ 2 ] - "Luan also likes to web hacking."
● Create an array of words (tokenize...)
{ "Luan", "likes", "to", "make", "hacking", "Josimar", "too", "also", "web" } – 9 elements in total
● Count how many times each word appears in each sentence
[ 1 ] – { 1, 2, 2, 2, 2, 1, 1, 0, 0 }
[ 2 ] – { 1, 1, 1, 0, 1, 0, 0, 1, 1 }
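
The two example sentences above can be turned into count vectors with a few lines of Python. This is a minimal sketch: the tokenizer (runs of ASCII letters, case-sensitive) is an assumption, chosen so that the vocabulary comes out in the same order as on the slide.

```python
import re

def tokenize(text):
    """Split text into words made of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

def bag_of_words(docs):
    """Return (vocabulary in first-seen order, one count vector per document)."""
    vocab = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    vectors = [[tokenize(doc).count(w) for w in vocab] for doc in docs]
    return vocab, vectors

docs = [
    "Luan likes to make hacking. Josimar likes to make hacking too.",
    "Luan also likes to web hacking.",
]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```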

The common way
Look at the following

The common way
Why Naive Bayes?
● In my tests:
Classifier   Accuracy   Performance
KNN          96%        Slow
SVM          92%        Medium
NB           94%        Fast
Super simple: you're just doing a bunch of counts. Naive Bayes is an eager learning classifier and it is much faster than KNN. Nowadays it can be used for real-time prediction.
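
To show what "a bunch of counts" means in practice, here is a minimal multinomial Naive Bayes sketch in Python. It assumes add-one (Laplace) smoothing and log-probabilities, which are common choices but not confirmed by the slides; the class name and the tiny training set are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word occurrences per class
        self.vocab = set()

    def train(self, tokens, label):
        """Training is only counting: one document, and its words, for a class."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest log-probability for the tokens."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                # Add-one smoothing so unseen words do not zero the product.
                count = self.word_counts[label][t] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("win free money now".split(), "spam")
nb.train("free prize click now".split(), "spam")
nb.train("meeting agenda for monday".split(), "ham")
nb.train("lunch on monday".split(), "ham")
print(nb.predict("free money".split()))
print(nb.predict("monday meeting".split()))
```

Because all the work at training time is incrementing counters, the model is ready before any message has to be classified, whereas KNN must compare each new message against the stored training set.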

My way
Automata as match rules
● Gain accuracy!
● Gain performance!
● Because a rule can match spam before the classifier is used!
● www.site.com/www.bank.com/
● URL/malware.exe: a rule like URL/[a-zA-Z]*\.exe ...
● A rule to detect an IP address in a URL
● Deterministic finite automaton to do the detection
● Use ranking!
Naive Bayes: 94% accuracy, +4% with the rules, Fast
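
The idea of matching rules before calling the classifier can be sketched in Python as below. The three patterns are illustrative guesses at the rules named on the slide, not the project's actual rule set, and Python's `re` module is a backtracking matcher rather than a deterministic finite automaton, so this only shows the control flow.

```python
import re

# (rule name, pattern) pairs; names and patterns are made up for this sketch.
RULES = [
    ("exe_download", re.compile(r"https?://\S+/[A-Za-z0-9_-]*\.exe\b", re.I)),
    ("ip_in_url", re.compile(r"https?://(?:\d{1,3}\.){3}\d{1,3}(?=[:/\s]|$)")),
    ("domain_in_path", re.compile(r"[\w.-]+\.[a-z]{2,}/www\.[\w-]+\.[a-z]{2,}", re.I)),
]

def match_rule(text):
    """Return the name of the first rule that matches, or None."""
    for name, pattern in RULES:
        if pattern.search(text):
            return name
    return None

def classify(text, fallback):
    """Flag rule hits as spam; otherwise defer to the (slower) classifier."""
    if match_rule(text) is not None:
        return "spam"
    return fallback(text)

print(match_rule("get it at http://example.com/files/malware.exe now"))
print(match_rule("visit www.site.com/www.bank.com/ today"))
print(classify("see you at the meeting on monday", lambda t: "ham"))
```

Messages caught by a rule never reach the classifier, which is where both the speed gain and the accuracy gain claimed on the slide would come from.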

My way
● Gain accuracy!
● Deterministic finite automaton in the rules to do the detection
● Rule to detect phishing
● Use ranking!
Naive Bayes: 94% accuracy, +4% with the rules, Fast
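
For readers who have not seen one written out, here is a small table-driven deterministic finite automaton in Python that recognises a dotted-quad host (four groups of one to three digits), the kind of "IP address in a URL" signal that phishing rules look for. The state layout and function names are assumptions for this sketch, and it does not check that each group is at most 255.

```python
DEAD = None  # no transition available: the input is rejected

def build_ipv4_dfa():
    """Build the transition table; a state is (group index, digits seen in group)."""
    table = {}
    accepting = set()
    for group in range(4):
        for digits in range(4):
            state = (group, digits)
            if digits < 3:
                table[(state, "digit")] = (group, digits + 1)
            if digits >= 1 and group < 3:
                table[(state, "dot")] = (group + 1, 0)
            if group == 3 and digits >= 1:
                accepting.add(state)
    return table, accepting

TABLE, ACCEPTING = build_ipv4_dfa()

def symbol_class(ch):
    """Map a character to the input alphabet of the automaton."""
    if ch in "0123456789":
        return "digit"
    return "dot" if ch == "." else "other"

def accepts(text):
    """Run the automaton: exactly one table lookup per input character."""
    state = (0, 0)
    for ch in text:
        state = TABLE.get((state, symbol_class(ch)), DEAD)
        if state is DEAD:
            return False
    return state in ACCEPTING

def url_host_is_ip(url):
    """Extract the host part of a URL and test it with the automaton."""
    host = url.split("://", 1)[-1]
    for sep in "/:?":
        host = host.split(sep, 1)[0]
    return accepts(host)

print(url_host_is_ip("http://192.0.2.10/login"))
print(url_host_is_ip("http://www.bank.com/"))
```

The automaton never backtracks, so the matching cost is linear in the length of the input.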

Why ranking?
● Gain accuracy!
Naive Bayes: 94% accuracy, +4% with the rules, Fast
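
The slides do not say how the ranking is computed. One plausible reading, sketched below in Python, is that each rule carries a weight, the weights of all matching rules are summed into a rank, and only a rank above a threshold marks the message as spam without consulting the classifier. The patterns, weights and threshold are all invented for illustration.

```python
import re

# (pattern, weight) pairs; every value here is a hypothetical example.
WEIGHTED_RULES = [
    (re.compile(r"https?://(?:\d{1,3}\.){3}\d{1,3}"), 3),
    (re.compile(r"\.exe\b", re.I), 4),
    (re.compile(r"\b(?:password|verify your account)\b", re.I), 2),
]
THRESHOLD = 5  # hypothetical cut-off

def rank(text):
    """Sum the weights of all rules that match the text."""
    return sum(weight for pattern, weight in WEIGHTED_RULES if pattern.search(text))

def decide(text, classifier):
    """High-ranked messages are spam outright; the rest go to the classifier."""
    if rank(text) >= THRESHOLD:
        return "spam"
    return classifier(text)

print(rank("download http://192.0.2.10/update.exe"))
print(decide("please verify your account", lambda t: "ham"))
```

Under this reading, a single weak signal is not enough to override the classifier, which would limit false positives from the rules.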

E-mail audit
The project!
● All source code in C++! 100% open source!
● IMAP – communication
● Blacklists – DNS, bad domains, e-mail addresses...
● Deterministic finite automaton – filters
● tf–idf (term frequency–inverse document frequency)
● Naive Bayes – classifier

My way
● Gain accuracy!
● Deterministic finite automaton to do the detection
● Use ranking!
Naive Bayes: 94% accuracy, +4% with the rules, Fast

E-mail audit
The project!
● In the future: use the GPU to run KNN and the automata...
● With a GPU, everything gets faster...
● Next step: 100% accuracy?
https://github.com/CoolerVoid/email_audit

Thanks
● https://github.com/CoolerVoid
