bmf - efficient Bayesian mail filter

NAME

       bmf - efficient Bayesian mail filter

SYNOPSIS

       bmf [-t] [-n] [-s] [-N] [-S] [-f fmt] [-d db] [-i file] [-k n] [-m type] [-p]
           [-v] [-V] [-h]

DESCRIPTION

       bmf  is  a  Bayesian  mail  filter. In its normal mode of operation, it
       takes an email  message  or  other  text  on  standard  input,  does  a
       statistical  check  against lists of "good" and "spam" words, registers
       the new data, and returns a status code indicating whether or  not  the
       message is spam. bmf is written with fast, zero-copy algorithms, coded
       directly in C, and tuned for speed. It aims to be faster, smaller, and
       more versatile than similar applications.

       bmf  supports  both  mbox  and  maildir  mail  storage formats. It will
       automatically process multiple messages within an mbox file separately.

OPTIONS

       Without command-line options, bmf processes the input, registers it as
       either "good" or "spam", and returns the corresponding exit status. The
       wordlist directory and word files are created if they do not exist.

       -t Test to see if the input is spam. The word lists are not updated. A
       report is written to stdout showing the final score and the tokens with
       the highest deviation from a mean of 0.5.

       -n Register the input as non-spam.

       -s Register the input as spam.

       -N  Register  the  input  as  non-spam and undo a prior registration as
       spam.

       -S Register the input as spam and undo a  prior  registration  as  non-
       spam.

       -f  fmt Specify database format. Valid formats are text, db, and mysql.
       Text  is  always  valid.  The  others  may  not  be  available  if  the
       corresponding option was not enabled at compile time. The default is db
       if available, else text.

       -d db Specify database or directory for loading and saving word  lists.
       The default is ~/.bmf in text mode.

       -i file Use file for input instead of stdin.

       -k  n  Specify  the  number  of  extrema  (keepers) to use in the Bayes
       calculation. The default is 15.

       -m type Specify mail storage format. Valid types are mbox and maildir.
       The default is to automatically detect the mail storage format. This
       option is deprecated.

       -p Copy the input to the output (passthrough) and insert  spam  headers
       in  the  style  of  SpamAssassin.  An  X-Spam-Status  header  is always
       inserted with processing details. The contents of  this  header  always
       begin with either "Yes" or "No". If the input is judged to be spam, the
       header "X-Spam-Flag: YES" is also inserted.
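
       A downstream tool can act on the headers described for -p. The
       following is a minimal Python sketch, assuming only what is stated
       above: X-Spam-Status begins with "Yes" or "No", and X-Spam-Flag is
       "YES" for spam. The function name spam_verdict and the sample message
       are hypothetical, and the text after "Yes," in the sample is a
       made-up placeholder, not real bmf output.

```python
from email import message_from_string


def spam_verdict(raw_message):
    """Return True if the passthrough headers mark the message as spam."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    flag = msg.get("X-Spam-Flag", "")
    # X-Spam-Status always begins with "Yes" or "No";
    # X-Spam-Flag: YES is only present for spam.
    return status.strip().lower().startswith("yes") or flag.strip().upper() == "YES"


# Hypothetical sample; the details after "Yes," are a placeholder.
sample = (
    "X-Spam-Status: Yes, details-not-shown\n"
    "X-Spam-Flag: YES\n"
    "Subject: example\n"
    "\n"
    "body text\n"
)
```

       A message with neither header is treated as non-spam here, which is
       a design choice of the sketch rather than documented bmf behavior.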

       -v Be more verbose. This option is not well supported yet.

       -V Display version information.

       -h Display usage information.

THEORY OF OPERATION

       bmf treats its input as a bag of tokens. Each token is checked  against
       "good"  and  "bad"  wordlists,  which maintain counts of the numbers of
       times it has occurred in non-spam and spam  mails.  These  numbers  are
       used  to  compute the probability that a mail in which the token occurs
       is spam. After probabilities for all input tokens have been computed, a
       fixed  number  of  the probabilities that deviate furthest from average
       are combined using Bayes’s theorem on conditional probabilities.
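
       The calculation can be illustrated with a short Python sketch. It
       follows the formulas in Graham's essay in simplified form (his
       doubling of good counts and his minimum-occurrence threshold are
       omitted) and is not taken from the bmf source. The function names,
       the 0.01/0.99 clamps, and the 0.4 value for unseen tokens are
       assumptions borrowed from that essay. The keepers default of 15
       matches the documented default of -k.

```python
from math import prod


def token_probability(good, bad, ngood, nbad):
    """Estimate the chance that a mail containing this token is spam.

    good/bad are the token's counts in the two word lists; ngood/nbad
    are the numbers of non-spam and spam mails registered so far.
    """
    g = min(1.0, good / max(ngood, 1))
    b = min(1.0, bad / max(nbad, 1))
    if g + b == 0:
        return 0.4  # assumption: Graham's value for never-seen tokens
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, b / (g + b)))


def spam_score(probs, keepers=15):
    """Combine the `keepers` probabilities furthest from 0.5."""
    extrema = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keepers]
    spam = prod(extrema)
    ham = prod(1.0 - p for p in extrema)
    # Inputs are expected to be clamped as above, so spam + ham > 0.
    return spam / (spam + ham)
```

       For example, spam_score([0.99, 0.99, 0.01]) evaluates to 0.99: two
       strongly spammy tokens outweigh one strongly innocent token.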

       While this method sounds crude compared  to  the  more  usual  pattern-
       matching  approach,  it  turns  out  to  be  extremely  effective. Paul
       Graham’s paper A Plan For Spam: http://www.paulgraham.com/spam.html  is
       recommended reading.

       bmf improves on Graham’s proposal by doing smarter lexical analysis. In
       particular, hostnames and IP addresses are kept as tokens, while certain
       types of MTA information (such as message IDs and dates) are discarded.
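
       The idea can be sketched in Python as follows. This is an
       illustration only: the actual token rules of bmf's lexer are not
       described here, so the regular expression, the list of skipped
       headers, and the function name tokenize are all assumptions. Like
       the real lexer, the sketch works line by line and does not handle
       multiline headers.

```python
import re

# Assumption: only these two header types are dropped in this sketch.
SKIPPED_HEADERS = ("message-id:", "date:")

# Words may contain inner dots, so hostnames and dotted-quad IP
# addresses survive as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'_-]+(?:\.[A-Za-z0-9$'_-]+)*")


def tokenize(raw_message):
    """Return the list of tokens in a message, headers included."""
    tokens = []
    in_headers = True
    for line in raw_message.splitlines():
        if in_headers:
            if not line.strip():
                in_headers = False  # blank line ends the header block
                continue
            if line.lower().startswith(SKIPPED_HEADERS):
                continue  # discard message IDs and dates
        tokens.extend(TOKEN_RE.findall(line))
    return tokens
```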

       MIME and other attachments are not decoded.  Experience  from  watching
       the  token  streams suggests that spam with enclosures invariably gives
       itself away through  cues  in  the  headers  and  non-enclosure  parts.
       Nonetheless, I would like to add the ability to decode quoted-printable
       and perhaps base64 encodings for textual attachments.

INTEGRATION WITH OTHER TOOLS

       Please see /usr/share/doc/bmf/README.gz for samples and suggestions.

RETURN VALUES

       In passthrough mode: zero for success, nonzero for failure.

       In non-passthrough mode: 0 for spam; 1 for non-spam; 2 for I/O or other
       errors.
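
       A calling script can map these codes to a verdict. The Python
       sketch below assumes that -t, being a non-passthrough mode, returns
       the codes listed above. The names interpret_status and classify
       are hypothetical, and any code other than 0 or 1 is treated as an
       error.

```python
import subprocess

# Exit codes documented for non-passthrough mode.
VERDICTS = {0: "spam", 1: "non-spam"}


def interpret_status(code):
    """Translate a bmf exit code into a verdict string."""
    return VERDICTS.get(code, "error")


def classify(message_bytes, bmf="bmf"):
    """Run `bmf -t` on a message given as bytes and return the verdict.

    -t leaves the word lists untouched; its report on stdout is
    discarded here because only the exit code is needed.
    """
    result = subprocess.run(
        [bmf, "-t"], input=message_bytes, stdout=subprocess.DEVNULL
    )
    return interpret_status(result.returncode)
```

       Note that 0 means spam, which is the reverse of the usual shell
       convention that 0 means "nothing wrong"; checking the code
       explicitly avoids that trap.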

FILES

       ~/.bmf/goodlist.txt
              List of good tokens for text mode.

       ~/.bmf/spamlist.txt
              List of bad tokens for text mode.

       ~/.bmf/goodlist.db
              List of good tokens for libdb mode.

       ~/.bmf/spamlist.db
              List of bad tokens for libdb mode.

BUGS

       Only one instance of bmf(1) can access the database at a time (see
       options -d and -f). In Procmail recipes, ensure sequential access with
       a lock file:

               :0 fw: bmf.lock
               | bmf -p
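
       Callers other than Procmail need to serialize access themselves.
       The Python sketch below uses an advisory flock(2) lock, which is a
       different mechanism from Procmail's lock file and does not
       interoperate with it. The names exclusive_lock and filter_message
       and the lock path are hypothetical. It requires a Unix-like
       system.

```python
import fcntl
import subprocess
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the with-block."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def filter_message(message_bytes, lock_path, bmf="bmf"):
    """Run `bmf -p` under the lock; return (exit code, filtered message)."""
    with exclusive_lock(lock_path):
        result = subprocess.run(
            [bmf, "-p"], input=message_bytes, stdout=subprocess.PIPE
        )
    return result.returncode, result.stdout
```

       The lock only protects against other processes that take the same
       lock on the same path, so every caller must use this wrapper.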

       The lexer does not recognize multiline headers.

       The lexer does not recognize MIME attachments.

       Content-Transfer-Encoding is not decoded.

AUTHOR

       Tom Marshall <tommy@tig-grr.com>.

       The  Bayes  algorithm  is  from   bogofilter   by   Eric   S.   Raymond
       <esr@thyrsus.com>.  bogofilter  can  be found at the bogofilter project
       page: http://bogofilter.sourceforge.net/.