Using Lexigraphical Distancing to
Block Spam
Jonathan Oliver
Spam Conference
January 21, 2005
2
Bayesian Filters for Spam
ƒ Personal Bayesian Filters
– Can be re-trained on a regular basis
– Trained for the individual
– Can focus on terms which reflect good mail
ƒ Server-side Bayesian Filters
– Trained for a group of people (probably for an organization)
– Less focused on terms which reflect good mail
– Often used as part of a layered approach – configured for a very
low false positive rate
ƒ Differences
– Personal and server-side filters are solving different problems
– Effectiveness rates – typically server-side filters catch less spam
than personal Bayesian filters
3
Evolution of Spam 2004
ƒ Spammers started sending spam like:
Genierc Viagr and Sepur Viarga (Caiils)
available onlnie!
Most trsuted onilne source!
Vagira & Cilais takes afefct right away & lasts
24-36 huors!
FOR SUEPR VAIRGA TOCUH HERE
4
People can read jumbled text
I cdnuolt blveiee taht I cluod aulaclty
uesdnatnrd waht I was rdgnieg. The
phaonmneal pweor of the hmuan mnid.
Aoccdrnig to a rscheearch at Cmabrigde
Uinervtisy, it deosn't mttaer in waht oredr
the ltteers in a wrod are, the olny iprmoatnt
tihng is taht the frist and lsat ltteer be in the
rghit pclae.
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
5
Evolution of Spam 2004 (cont.)
http://cockeyed.com/lessons/viagra/viagra.html
6
Variants of “Viagra”
V&iagra
V iagr a
vi-@gr@
vi@gr*@
vi**agra
Viag&ra
vi@|g|r@
Viag)ra
V|i|a|g|r|a
Viarga
Viag%ra
Vi/agra
Viaoygra
Vi.ag.ra
via---gra
Viag$ra
ViaJ1gra
Vi$agra
ViaaPrga
V1@grA
V-i.a-g*r-a
vi@g*r@
Viag&ra
Viag@ra
Viagara
Vi Ἧtd>
Viagr^a
Viagr(a
ViaVErga
ViaTagra
vi@gr|@|
Viaggra
VIxAGRA
V/i/a/g/r/a
VIA7GRA
V l A G R A
v-ii-a=g-ra
via.gra
Vkiagra
vigra
Vi-ag.ra
V-I-A-G-R-A
Viagvra
vi(@)gr@
ViagWra
Vii-agra
ViagrYa
Viargvra
ViaZUgra
via_gra
viagdra
Viagzra
'V 1 @ G' Ra
via-gra
viagrga
via---gra
VyAGRA
V l a g r a
ViagDrHa
Viagorea
7
Applying Text Classification to Spam
ƒ Classification is difficult if you cannot identify key
terms
ƒ It would be very useful to identify variants of
“Viagra” as “Viagra”
ƒ Possible approaches
– Regular expressions
– Spell checking
– Edit distance (Lexigraphical Distancing)
8
Regular Expressions
ƒ Difficult to write a regular expression which covers
600,426,974,379,824,381,952 variants of “Viagra”
ƒ Potentially time consuming
ƒ Can be error prone
ƒ Possibly computationally intensive
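A minimal sketch of the regular-expression approach, to show where it starts to strain. The pattern below is hypothetical (it is not from the talk): it allows a few look-alike characters per letter and any run of non-alphanumeric junk between letters. It handles padded and symbol-substituted variants but misses transposed or letter-inserted ones, each of which would need yet another hand-written alternative.

```python
import re

# Hypothetical pattern, for illustration only: each letter of "viagra" may be
# written with a few look-alike characters, and any run of non-alphanumeric
# characters may sit between letters.
SEP = r"[^a-z0-9]*"
LETTERS = ["v", "[i1l|]", "[a@]", "g", "r", "[a@]"]
VIAGRA_RE = re.compile(SEP.join(LETTERS))


def regex_match(token: str) -> bool:
    """Return True if the lower-cased token contains a match for the pattern."""
    return VIAGRA_RE.search(token.lower()) is not None


# Variants taken from the "Variants of Viagra" slide.
caught = [s for s in ["V-I-A-G-R-A", "vi@gr@", "Vi.ag.ra"] if regex_match(s)]
missed = [s for s in ["Viarga", "Viagvra", "vigra"] if not regex_match(s)]
```

Here `caught` holds the three separator and symbol variants, while `missed` holds the transposed, inserted-letter, and dropped-letter variants that the pattern does not cover.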
9
Spell Checking
ƒ Spell checking does not cope well with letters
constructed from other elements:
\/ 1 /-\ G R @
ƒ Spell checking does not cope well with split words:
Mort gage ra tes
ƒ Spell checking does not cope well with words run
together:
BuyViagaraNow
ƒ Microsoft Word spell check caught 24 of the
variants listed
10
Lexigraphical Distance
ƒ Algorithm estimates the probability that the content
being inspected is an “edited” version of the term in
question
ƒ Edits allowed:
– Insertion
– Deletion
– Substitution
ƒ Probability of insertions / deletions / substitutions
estimated from data
ƒ If the probability is greater than some threshold
then the term is determined to be a variant
ƒ Identified 51 out of 60 as variants of Viagra
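One way to read the description above is as a weighted edit distance in which each edit costs the negative log of its probability. The sketch below follows that reading; it is not the talk's actual implementation. The edit probabilities and the threshold are hand-picked assumptions (the talk estimates the probabilities from data), and the function scores only the single most likely edit sequence rather than summing over all of them.

```python
import math

# Assumed, hand-picked values; the talk estimates edit probabilities from data.
P_INSERT = 0.05      # an extra character appears in the observed text
P_DELETE = 0.02      # a character of the term is dropped
P_SUBSTITUTE = 0.03  # a character of the term is replaced
P_MATCH = 0.90       # a character of the term is copied unchanged
THRESHOLD = 1e-4     # hypothetical cut-off for calling something a variant


def _cost(p: float) -> float:
    """Negative log-probability, so that costs add where probabilities multiply."""
    return -math.log(p)


def variant_probability(term: str, observed: str) -> float:
    """Probability of the most likely edit sequence turning term into observed."""
    term, observed = term.lower(), observed.lower()
    rows, cols = len(term) + 1, len(observed) + 1
    # dist[i][j] = minimum cost of turning term[:i] into observed[:j]
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + _cost(P_DELETE)
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + _cost(P_INSERT)
    for i in range(1, rows):
        for j in range(1, cols):
            same = term[i - 1] == observed[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + _cost(P_DELETE),
                dist[i][j - 1] + _cost(P_INSERT),
                dist[i - 1][j - 1] + _cost(P_MATCH if same else P_SUBSTITUTE),
            )
    return math.exp(-dist[-1][-1])


def is_variant(term: str, observed: str) -> bool:
    """Apply the threshold test described on the slide."""
    return variant_probability(term, observed) > THRESHOLD
```

With these assumed numbers, `is_variant("viagra", "vi@gr@")` is true and `is_variant("viagra", "vitamin")` is false. Heavily padded forms such as `V-I-A-G-R-A` fall below this threshold, which is one reason per-character probabilities estimated from real spam (where punctuation insertions are common) matter.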
11
Experiment
ƒ Trained up a “Plan for Spam” Naïve Bayes Classifier
ƒ Training set:
20,000 spam and 20,000 good mail from a diverse set of
people
ƒ Test set:
3,459 spam sent on Jan. 11, 2005
25,555 good mail from a different set of people
ƒ Edit distance caught spammers transforming spam terms in 27% of the spam
ƒ Edit distance incorrectly identified spam phrase variants in 48 of the
good mail (0.19%), with close to no impact on the false positive rate
12
Results
(considered 3 thresholds for separating spam from good mail)
GOOD MAIL (25,555): false positives
                                          Threshold 1      Threshold 2      Threshold 3
Naïve Bayes                               11 (0.04%)       27 (0.11%)       83 (0.32%)
Naïve Bayes with lexigraphical distance   11 (0.04%)       29 (0.11%)       91 (0.36%)
SPAM (3,459 spam from Jan 11 2005): spam caught
                                          Threshold 1      Threshold 2      Threshold 3
Naïve Bayes                               1,596 (46.10%)   1,836 (53.10%)   2,209 (63.90%)
Naïve Bayes with lexigraphical distance   2,063 (59.60%)   2,328 (67.30%)   2,618 (75.70%)
Improvement                               13.50%           14.20%           11.80%
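The "Improvement" figures on this slide are the difference, in percentage points, between the two spam-caught rates at each threshold. A small sketch that recomputes them from the counts on the slide; the variable names are illustrative, and the three thresholds are simply taken in the order the figures appear here.

```python
SPAM_TOTAL = 3_459
GOOD_TOTAL = 25_555

# Counts from the slide, one entry per threshold.
spam_caught_nb = [1_596, 1_836, 2_209]
spam_caught_nb_edit = [2_063, 2_328, 2_618]
false_pos_nb = [11, 27, 83]
false_pos_nb_edit = [11, 29, 91]


def pct(count: int, total: int, digits: int = 1) -> float:
    """Share of total as a percentage, rounded to the given number of digits."""
    return round(100 * count / total, digits)


# Percentage-point gain in spam caught when edit distance is added.
improvement = [
    round(pct(with_edit, SPAM_TOTAL) - pct(plain, SPAM_TOTAL), 1)
    for plain, with_edit in zip(spam_caught_nb, spam_caught_nb_edit)
]

# Extra good messages misclassified when edit distance is added.
extra_false_positives = [
    with_edit - plain for plain, with_edit in zip(false_pos_nb, false_pos_nb_edit)
]
```

This gives `improvement == [13.5, 14.2, 11.8]` and `extra_false_positives == [0, 2, 8]`, so the gain in spam caught comes at the cost of at most 8 additional false positives out of 25,555 good messages.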
13
Lexigraphical Distance in Use
ƒ Edit distance is a very useful step in server-side
Bayesian filters, especially as part of a
layered approach to spam filtering
ƒ MailFrontier uses edit distance as a pre-processing
step before statistical text analysis
ƒ Computationally feasible – the software using edit
distance was rated in the fastest 20% of spam
filters by Network World
http://www.nwfusion.com/reviews/2004/122004spampkg.html
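A rough sketch of what a pre-processing step of this kind could look like: obfuscated tokens are mapped back to a canonical term before the text reaches the statistical classifier. This is not MailFrontier's implementation. It swaps in plain, unweighted Levenshtein distance for simplicity, and the watch list `SPAM_TERMS` and the cut-off `MAX_EDITS` are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


SPAM_TERMS = ["viagra", "cialis"]  # hypothetical watch list
MAX_EDITS = 2                      # assumed cut-off; would need tuning on real mail


def canonicalize(tokens: list[str]) -> list[str]:
    """Replace tokens within MAX_EDITS of a watched term by that term."""
    out = []
    for tok in tokens:
        low = tok.lower()
        best = min(SPAM_TERMS, key=lambda term: levenshtein(term, low))
        out.append(best if levenshtein(best, low) <= MAX_EDITS else low)
    return out


cleaned = canonicalize("Genierc Viarga and Cilais onlnie".split())
```

Here `cleaned` is `["genierc", "viagra", "and", "cialis", "onlnie"]`: the two watched terms are restored, so a downstream Bayesian filter sees the tokens it was trained on, while unrelated words pass through unchanged.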
14
Conclusions
ƒ Comparison of Lexigraphical distance with regular
expressions
– More suited to certain tasks
– More robust
– More intuitive – easier for the end-user
ƒ Other applications
– Content filtering/compliance
– Intellectual property protection
– Adversarial search
