NAME

       spamoracle - a spam classification tool

SYNOPSIS

       spamoracle [-config conf] [-f database] mark [ mailbox ...  ]

       spamoracle  [-config  conf]  [-f  database]  add [-v] -spam spambox ...
       -good goodbox ...

       spamoracle [-config conf] [-f database] test [-min prob] [-max prob]  [
       mailbox ...  ]

       spamoracle [-config conf] [-f database] stat [ mailbox ...  ]

       spamoracle [-config conf] [-f database] list regexp ...

       spamoracle [-config conf] [-f database] backup > backupfile

       spamoracle [-config conf] [-f database] restore < backupfile

       spamoracle [-config conf] [-f database] words [ mailbox ...  ]

DESCRIPTION

       SpamOracle is a tool to help detect and filter away "spam" (unsolicited
       commercial e-mail).  It proceeds by statistical analysis of  the  words
       that  appear  in  the  e-mail,  comparing the frequencies of words with
       those  found  in  a  user-provided  corpus  of  known  spam  and  known
       legitimate  e-mail.   The  classification  algorithm is based on Bayes’
       formula, and is described in Paul Graham’s  paper,  A  plan  for  spam,
       http://www.paulgraham.com/spam.html.
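
       As a rough illustration of the per-word statistics behind this
       analysis, the sketch below follows the formula given in Paul Graham's
       paper.  It is not taken from SpamOracle's source code: the function
       name and its parameters are hypothetical, and SpamOracle's actual
       computation may differ in its details.

```python
def word_spam_probability(good_count, bad_count, ngood, nbad):
    """Estimate the probability that a word denotes spam (Graham's formula).

    good_count / bad_count: occurrences of the word in good / spam mail.
    ngood / nbad: number of good / spam messages in the corpus (both > 0).
    Returns None when the word is too rare to be informative.
    """
    g = 2 * good_count  # Graham doubles good counts to reduce false positives
    b = bad_count
    if g + b < 5:
        return None
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single word is ever treated as absolute proof.
    return max(0.01, min(0.99, p))
```

       The clamping to the range 0.01 to 0.99 keeps a word seen only in spam
       (or only in good mail) from deciding the outcome on its own.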

       This program is designed to work in conjunction with procmail(1).  The
       result of the analysis is output as an additional message header
       X-Spam: followed by yes, no or unknown, plus additional details.  A
       procmail rule can then test this X-Spam: header and deliver the e-mail
       to the appropriate mailbox.

       In  addition,  SpamOracle  also  analyses  MIME attachments, extracting
       relevant information such as MIME type, character encoding and attached
       file name, and summarizing them in an additional X-Attachments: header.
       This allows procmail to easily  reject  e-mails  containing  suspicious
       attachments, e.g. Windows executables which often indicate a virus.

REQUIREMENTS AND LIMITATIONS

       To  use  SpamOracle,  your  mail must be delivered to a Unix machine on
       which you have a shell account.  This  machine  must  have  procmail(1)
       (see http://www.procmail.org/) installed.  Your ~/.forward file must be
       set up to run all incoming e-mail through procmail(1).   If  your  mail
       server   supports   the  POP  or  IMAP  protocols,  you  can  also  use
       fetchmail(1) to fetch your mail from the server and have  it  delivered
       to your local machine.

       To  provide  the  corpus of messages from which SpamOracle "learns", an
       archive of about 1000 of your e-mails is needed.  The archive  must  be
       manually  or  semi-automatically  split into known spams and known good
       messages.  Mis-classified messages in the corpus (e.g. spams mistakenly
       stored  among  the  good  messages) will decrease the efficiency of the
       classification.  The archive must be in Unix mailbox format, or in "one
       message  per  file"  format  (a  la MH).  Other formats, such as Emacs’
       Babyl, are not supported.

       The notion of "word" used by  SpamOracle  is  slanted  towards  Western
       European  languages,  i.e.  the ISO Latin-1 and Latin-9 character sets.
       Preliminary  support  for  JIS-encoded  Japanese  can  be  selected  at
       compile-time.   SpamOracle  will  not  work  well  if  you receive many
       legitimate e-mails written in other character sets, such as Chinese  or
       Korean sets.

INITIALIZATION

       To build the database of word frequencies from the corpus, do:

              rm ~/.spamoracle.db
              spamoracle add -v -good goodmails -spam spammails

       By  default,  the database is stored in the file .spamoracle.db in your
       home directory.  This can be overridden with the -f option: spamoracle
       -f mydatabase add ...  The -v option prints progress information during
       the processing of the corpus.

       This assumes that the good,  non-spam  messages  from  the  corpus  are
       stored  in  the file goodmails, and the known spam messages in the file
       spammails.  You can also fetch  corpus  messages  from  several  files,
       and/or process them via several invocations of SpamOracle:

              spamoracle add -good goodmails1 ... goodmailsN
              spamoracle add -spam spammails1 ... spammailsP

TESTING THE DATABASE

       To  check  that  the  database  was  built  correctly,  and familiarize
       yourself with the statistical analysis performed by SpamOracle,  invoke
       the  "test"  mode  on the mailboxes that you just used for building the
       corpus:

              spamoracle test goodmails | more
              spamoracle test spammails | more

       For each message in the given mailboxes,  you’ll  see  a  summary  like
       this:

               From: bbo <midhack@ureach.com>
               Subject: Check This Out
               Score: 1.00 -- 15
               Details: refid:98 $$$$:98 surfing:98 asp:95 click:93 cable:92
                 instantly:90 https:88 internet:87 www:86 U4:85 isnt:14 month:81
                 com:75 surf:75
               Attachments: cset="GB2312" type="application/octet-stream"
                 name="Guangwen4.zip"
               File: inbox/314

       The  first  two  lines  are  just  the From: and Subject: fields of the
       original message.

       The Score: line summarizes the  result  of  the  analysis.   The  first
       number  (between  0.0  and  1.0) is the probability that the message is
       actually spam --- or, equivalently, the degree  of  similarity  of  the
       message  with  the  spam messages in the corpus.  The second number (an
       integer between 0 and 15) is the number of "interesting" words found in
       the message.  "Interesting" words are those that occur at least 5 times
       in the corpus.  In the example,  we  have  15  interesting  words  (the
       maximum) and a score of 1.00, indicating a spam with high certainty.

       The  Details:  line provides an explanation of the score.  It lists the
       15 most interesting words  found  in  the  message,  that  is,  the  15
       interesting words whose probability of denoting a spam is farthest away
       from the neutral 0.5.  Each word is given with  its  individual  score,
       written  as  a  percentage  (between  01  and  99)  rather  than  as  a
       probability so as to save  space.   Here,  we  see  a  number  of  very
       "spammish"  words such as $$$$ or click, with probability 0.98 and 0.93
       respectively, and a few "innocent" words  such  as  isnt  (probability
       0.14).   The  U4  word  with probability 0.85 is actually a pseudo-word
       representing a 4-letter word all in uppercase -- something spammers are
       fond of.
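
       The selection of words and the resulting score can be approximated as
       follows.  This is a minimal sketch that assumes SpamOracle combines
       word probabilities with the formula from Paul Graham's paper;
       most_interesting and combined_score are illustrative names, not part
       of SpamOracle.  Applied to the Details: line of the example above, it
       gives a score that prints as 1.00.

```python
from math import prod


def most_interesting(probabilities, limit=15):
    """Keep the probabilities farthest away from the neutral 0.5."""
    ranked = sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:limit]


def combined_score(probabilities):
    """Combine per-word spam probabilities with Bayes' formula."""
    spam = prod(probabilities)
    good = prod(1.0 - p for p in probabilities)
    return spam / (spam + good)


# Per-word scores from the Details: line of the example, as probabilities.
details = [0.98, 0.98, 0.98, 0.95, 0.93, 0.92, 0.90, 0.88,
           0.87, 0.86, 0.85, 0.14, 0.81, 0.75, 0.75]
score = combined_score(most_interesting(details))
print(f"Score: {score:.2f} -- {len(details)}")
```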

       The   Attachments:   line   summarizes   some  information  about  MIME
       attachments for this message.  Here, we have  one  attachment  of  type
       application/octet-stream,  file  name Guangwen4.zip, and character set
       GB2312 (an encoding for Chinese).

       The File: line shows the file that is being tested.

       Normally, when running spamoracle test goodmails, most messages  should
       come out with low score (0.2 or less), and when running spamoracle test
       spammails, most messages should come out with  a  high  score  (0.8  or
       more).   If  not,  your  corpus isn’t very good, or not well classified
       into spam and non-spam.  To quickly see the outliers,  you  can  reduce
       the  interval  of  scores for which message summaries are displayed, as
       follows:

              spamoracle test -min 0.2 goodmails | more
                     # Shows only good mails with score >= 0.2
              spamoracle test -max 0.8 spammails | more
                     # Shows only spam mails with score <= 0.8

       Now, for  a  more  challenging  test,  take  a  mailbox  that  contains
       unfiltered  e-mails, i.e. a mixture of spam and legitimate e-mails, and
       run it through SpamOracle:

              spamoracle test mymailbox | less

       Marvel at how well the oracle recognizes spam from the  rest!   If  the
       result isn’t that marvelous to you, keep in mind that certain spams are
       just too short to be recognized (not enough significant words).   Also,
       perhaps your corpus was too small, or not well categorized...

MARKING AND FILTERING INCOMING E-MAIL

       Once  the  database  is  built,  you’re  ready  to run incoming e-mails
       through SpamOracle.  The command spamoracle mark reads one e-mail  from
       standard  input,  and  copies  it  to standard output, with two headers
       inserted: X-Spam: and X-Attachments:.  The X-Spam: header has one of
       the following formats:

          X-Spam: yes; score; details

       or

          X-Spam: no; score; details

       or

          X-Spam: unknown; score; details

       The score and details are as described for spamoracle test.

       The  yes/no/unknown  tag  synthesizes  the results of the analysis: yes
       means that the score is >= 0.8 and at least 5  interesting  words  were
       found;  no  means  that  the score is <= 0.2 and at least 5 interesting
       words were found; unknown is  returned  otherwise.   The  unknown  case
       generally  occurs for very short messages, where not enough interesting
       words were found.
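
       The decision rule just described can be written down as a small
       function.  This is only a sketch of the rule as stated here, using the
       thresholds quoted above; the function and parameter names are made up
       for the illustration and do not come from SpamOracle itself.

```python
def classify(score, interesting_words, low=0.2, high=0.8, min_words=5):
    """Return the X-Spam: tag for a score and a count of interesting words."""
    if interesting_words >= min_words:
        if score >= high:
            return "yes"
        if score <= low:
            return "no"
    # Too few interesting words, or a score between the two thresholds.
    return "unknown"
```

       For instance, classify(1.00, 15) gives "yes", while a very short
       message with only 3 interesting words gives "unknown" whatever its
       score.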

       The  X-Attachments:  header  contains  the  same  information  as   the
       Attachments:  output  of  spamoracle  test,  that  is, a summary of the
       message attachments.

       To automatically process your incoming e-mail through SpamOracle and
       act upon the results of the analysis, just insert the following
       "recipes" in the file ~/.procmailrc:

              :0fw
              | /usr/local/bin/spamoracle mark

              :0
              * ^X-Spam: yes;
              spambox

       What these cryptic commands mean is:

       - Run every mail through the spamoracle mark command.   (If  spamoracle
       wasn’t  installed  in  /usr/local/bin,  adjust  the path as necessary.)
       This adds two headers  to  the  message:  X-Spam:  and  X-Attachments:,
       describing  the  results  of  the  spam  analysis  and  the  attachment
       analysis.

       - If we have an X-Spam: yes header, deliver the  message  to  the  file
       spambox  rather  than to your regular mailbox.  Presumably, you’ll read
       spambox once in a while, but less  often  than  your  regular  mailbox.
       Daring  users  can  put /dev/null instead of spambox to just throw away
       the message, but please don’t do that until you’ve used SpamOracle  for
       a  while  and  are happy with the results.  SpamOracle’s false positive
       rate (i.e. legitimate mails classified as spam) is low (0.1%) but not
       zero.  So it is better to save the presumed spams somewhere and scan
       them quickly from time to time.

       If you’d like to enjoy a bit of attachment-based  filtering,  here  are
       some procmail rules for that:

              :0
              * ^X-Attachments:.*name=".*\.(pif|scr|exe|bat|com)"
              spambox

              :0
              * ^X-Attachments:.*type="audio/(x-wav|x-midi)
              spambox

              :0
              * ^(Content-type:.*|X-Attachments:.*cset="|^Subject:.*=\?)(ks_c|gb2312|iso-2|euc-|big5|windows-1251)
              spambox

       The  first rule treats as spam every mail that has a Windows executable
       as attachment.  These mails are typically sent by viruses.  The  second
       rule does the same with attachments of type x-wav or x-midi.  I normally
       never receive music by e-mail, but some popular e-mail viruses
       seem  fond  of  these  attachment types.  The third rule treats as spam
       every mail that  uses  character  encodings  corresponding  to  Korean,
       Chinese, Japanese, and Cyrillic.
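
       To get a feel for what these three conditions match before putting
       them in ~/.procmailrc, they can be tried out on sample header lines.
       The sketch below uses Python's re module as a stand-in for procmail's
       own matcher, which is only an approximation; case is ignored because
       procmail conditions ignore case by default.  The rule labels and the
       matching_rules helper are hypothetical.

```python
import re

# The three conditions from the recipes above, without the leading "* ".
RULES = {
    "executable": r'^X-Attachments:.*name=".*\.(pif|scr|exe|bat|com)"',
    "audio": r'^X-Attachments:.*type="audio/(x-wav|x-midi)',
    "charset": (r'^(Content-type:.*|X-Attachments:.*cset="|^Subject:.*=\?)'
                r'(ks_c|gb2312|iso-2|euc-|big5|windows-1251)'),
}


def matching_rules(header_line):
    """Return the labels of the rules that match one header line."""
    return [label for label, pattern in RULES.items()
            if re.search(pattern, header_line, re.IGNORECASE)]


sample = ('X-Attachments: cset="GB2312" type="application/octet-stream" '
          'name="Guangwen4.zip"')
print(matching_rules(sample))
```

       The sample header is the attachment summary from the test output shown
       earlier; only the character set rule applies to it, since the
       attached file is a .zip archive and not one of the listed executable
       types.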

UPDATING THE DATABASE

       At  any time, you can add more known spams or known legitimate messages
       to the database by using the spamoracle add command.

       For instance, if you find a spam message that  was  not  classified  as
       such, run it through spamoracle add -spam, so that SpamOracle can learn
       from its mistake.  (Without additional  arguments,  this  command  will
       read  a  single  message  from  standard  input and record it as spam.)
       Under mutt(1) for instance, just highlight the spam message and type

              |spamoracle add -spam

       Similarly, if you find a legitimate message while  checking  your  spam
       box, run it through spamoracle add -good.

       Another  option  is  to  collect  more  known  spams or more known good
       messages into mailbox files, and once in  a  while  do  spamoracle  add
       -good new_good_mails or spamoracle add -spam new_spam_mails.

QUERYING THE DATABASE

       For  your  edification  and entertainment, the contents of the database
       can be queried by regular  expressions.   The  spamoracle  list  regexp
       command  lists  all  words in the database that match regexp (an Emacs-
       style regular expression), along with their number  of  occurrences  in
       spam mail and in good mail.  For instance:

              spamoracle list '.*'       # show all words -- big list!
              spamoracle list 'sex.*'
              spamoracle list 'linux.*'

DATABASE BACKUPS

       The  database  used by SpamOracle is stored in a compact, binary format
       that is not humanly readable.  Moreover,  this  format  is  subject  to
       change  in  later  versions  of  SpamOracle.  To facilitate backups and
       upgrades, the database contents can also be manipulated in a  portable,
       text format.

       The  spamoracle  backup  command  dumps the contents of the database to
       standard output, in a textual, portable format.

       The spamoracle restore command reads such a dump  from  standard  input
       and rebuilds the database with this data.

       The   recommended  procedure  for  upgrading  to  a  newer  version  of
       SpamOracle is:

              # Before the upgrade:
              spamoracle backup > backupfile
              # Upgrade SpamOracle
              # Restore the database
              spamoracle restore < backupfile

CONFIGURING FILTERING PARAMETERS

       Many of the  parameters  that  govern  message  classification  can  be
       configured  via a configuration file.  By default, the configuration is
       read from the file .spamoracle.conf in the user’s  home  directory.   A
       different configuration file can be specified on the command line using
       the -config option: spamoracle -config myconfigfile ...

       The list of configurable parameters and the format of the configuration
       file are described in spamoracle.conf(5).

       All parameters have reasonable defaults, but you can try to improve the
       quality of classification further by tweaking them.  To  determine  the
       impact  of  your  changes,  use  either  the  test  or stat commands to
       spamoracle.  The spamoracle stat command prints a one-line  summary  of
       how  many  spam,  non-spam,  and  unknown  messages  were  found in the
       mailboxes given as arguments.

TECHNICAL DETAILS

       SpamOracle’s notion of "word" is any run of 3 to 12  of  the  following
       characters:  letters,  single  quotes,  and dashes (-).  If support for
       non-English European languages was compiled in,  word  characters  also
       include  the  relevant  accented letters for the languages in question.
       All words are mapped to lowercase, and accented letters are  mapped  to
       the corresponding non-accented letters.

       A  run  of 3 to 12 of the following characters also constitutes a word:
       digits, dots, commas, and dollar, Euro and percent signs.

       In addition, a run of three  or  more  uppercase  letters  generates  a
       pseudo-word  Un  where n is the length of the run.  Similarly, a run of
       three or more non-ASCII characters (code >= 128)  generates  a  pseudo-
       word Wn where n is the length of the run.

       For instance, the following text:

              SUMMER in English is written "été" in French

       is  processed  into  the  following  words, assuming French support was
       selected at compile-time:

              U6 summer english written ete french W3

       and if French support was not selected:

              U6 summer english written french W3

       To see  the  words  that  are  extracted  from  a  message,  issue  the
       spamoracle  words  command.   It  reads  either  a  single message from
       standard input, or  all  messages  from  the  mailbox  files  given  as
       arguments, decomposes the messages into words and prints the words.
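
       The word rules above can be approximated with a few regular
       expressions.  The following is a simplified sketch, not SpamOracle's
       actual lexer, and it makes several assumptions that the text does not
       settle: no support for accented letters is compiled in, runs shorter
       than 3 or longer than 12 characters are dropped entirely, non-ASCII
       runs are counted in characters rather than bytes, and a pseudo-word is
       emitted just before the word it comes from.  The name extract_words is
       hypothetical.

```python
import re

# A run of letters, quotes and dashes; a run of digits and currency or
# punctuation signs; or a run of non-ASCII characters.
TOKEN_RE = re.compile(
    r"(?P<alpha>[A-Za-z'-]+)"
    r"|(?P<num>[0-9.,$%\u20ac]+)"
    r"|(?P<high>[^\x00-\x7f]+)"
)
UPPER_RE = re.compile(r"[A-Z]{3,}")


def extract_words(text):
    """Split text into words and Un / Wn pseudo-words (simplified)."""
    words = []
    for match in TOKEN_RE.finditer(text):
        run = match.group()
        if match.lastgroup == "high":
            if len(run) >= 3:
                words.append(f"W{len(run)}")
            continue
        if match.lastgroup == "alpha":
            # Each run of three or more uppercase letters yields Un.
            for upper in UPPER_RE.finditer(run):
                words.append(f"U{len(upper.group())}")
        if 3 <= len(run) <= 12:
            words.append(run.lower())
    return words


print(extract_words("FREE offer, save $99.99 today"))
```

       To see what SpamOracle really extracts from a message, the spamoracle
       words command remains the reference.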

RANDOM NOTES

       The database file can be compressed with gzip(1) to save disk space, at
       the expense of slower spamoracle  operations.   If  the  database  file
       specified  with  the  -f  option has the extension .gz, spamoracle will
       automatically uncompress it  on  start-up,  and  re-compress  it  after
       updates.

       If your mail is stored in MH format, you may run into "command line too
       long" errors while trying to process a lot  of  small  files  with  the
       spamoracle add command, e.g. when doing

              spamoracle add -good archives/*/* -spam spam/*

       Instead, do something like:

              find archives -type f -print | xargs spamoracle add -good
              find spam -type f -print | xargs spamoracle add -spam

AUTHOR

       Xavier Leroy <Xavier.Leroy@inria.fr>

SEE ALSO

       spamoracle.conf(5); procmail(1); fetchmail(1)

       http://cristal.inria.fr/~xleroy/software/    (SpamOracle   distribution
       site)

       http://www.paulgraham.com/spam.html (Paul Graham’s seminal paper)