spamoracle.conf - SpamOracle configuration file format

NAME

       spamoracle.conf - SpamOracle configuration file format

DESCRIPTION

       The spamoracle.conf file is a configuration file governing the
       operation of the spamoracle(1) e-mail classification tool.  By default,
       spamoracle(1) looks for the configuration file at
       $HOME/.spamoracle.conf, but an alternate location can be specified
       using the -config flag to spamoracle(1).

       Important  note:  most  of  the  configuration parameters should not be
       modified lightly,  as  this  may  result  in  completely  wrong  e-mail
       classification.   Familiarity  with  Graham’s  filtering  algorithm, as
       described in the paper referenced at the end of this page, is  required
       to really understand the effect of the parameters.

SYNTAX

       The  spamoracle.conf  file  is composed of lines of the form variable =
       value.  Lines starting with a hash sign (#) are treated as comments and
       ignored.  Blank lines are ignored.

       Depending  on  the  type  of  the  variable  (see the list of variables
       below), the value part is of the following forms:

       string A sequence of characters.  Blanks (spaces, tabs) at the
              beginning and the end of the string are ignored.  Alternatively,
              the string can be enclosed in double quotes ("), in which case
              blanks are not trimmed.  Inside quoted strings, backslashes (\)
              and double quotes (") must be escaped with a backslash, as in \\
              or \".

       boolean
              Either  on,  yes,  true, or 1 to activate the boolean option, or
              off, no, false, or 0 to deactivate it.

       integer
              A decimal integer.

       float  A decimal floating-point number.

       regexp A  regular  expression  in  emacs(1)  syntax.   The   repetition
              operators  are  *,  +,  and  ?.   Alternation  is written \| and
              grouping is written  \(...\).   Character  classes  are  written
              between  brackets  [...]   as  usual.   A single dot denotes any
              character  except  newline.   Regular  expressions   are   case-
              insensitive.
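
       The line format and the string and boolean value forms described
       above can be illustrated with a minimal Python sketch.  This is not
       spamoracle's own parser: the function names parse_config,
       parse_string and parse_bool are hypothetical, and details the text
       leaves open (such as how a line without an equals sign is handled)
       are assumptions marked in the comments.

```python
def parse_string(raw):
    """Decode a value of type string."""
    raw = raw.strip(" \t")
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        body, out, i = raw[1:-1], [], 0
        while i < len(body):
            if body[i] == "\\" and i + 1 < len(body):
                i += 1  # skip the escaping backslash, keep the next character
            out.append(body[i])
            i += 1
        return "".join(out)  # blanks inside the quotes are preserved
    return raw  # unquoted: leading and trailing blanks already trimmed


def parse_bool(raw):
    """Decode a value of type boolean."""
    word = raw.strip(" \t")
    if word in ("on", "yes", "true", "1"):
        return True
    if word in ("off", "no", "false", "0"):
        return False
    raise ValueError("not a boolean: %r" % raw)


def parse_config(text):
    """Return a dict mapping variable names to their raw value text."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        name, sep, value = line.partition("=")
        if not sep:
            # Assumption: a line without '=' is reported as an error.
            raise ValueError("expected 'variable = value': %r" % line)
        settings[name.strip()] = value
    return settings


sample = "\n".join([
    "# sample configuration",
    "",
    "spam_header = X-Spam",
    'database_file = " my \\"db\\" file "',
    "html_retain_tags = yes",
])
conf = parse_config(sample)
spam_header = parse_string(conf["spam_header"])
database_file = parse_string(conf["database_file"])
retain_tags = parse_bool(conf["html_retain_tags"])
```

       The raw value text is kept as-is by parse_config and decoded
       afterwards, because the correct decoding depends on the type of each
       variable.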

CONFIGURABLE PARAMETERS

       database_file
              (type string, default value $HOME/.spamoracle.db)
              The  location  of  the  file  that contains the database of word
              frequencies used by spamoracle(1).

       html_retain_tags
              (type boolean, default value false)
              In HTML-formatted e-mails and attachments,  the  names  of  HTML
              tags  are  normally not treated as words and are ignored for the
              word frequency calculations. If the  html_retain_tags  parameter
              is set to true, HTML tag names (such as img or b) are treated as
              words and included in the computation of word frequencies.

       html_tag_attributes
              (type regexp, default value
              a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
              This regular expression matches pairs  of  HTML  tags  and  HTML
              attributes   written  as  tag/attribute.   When  scanning  HTML-
              formatted e-mails and attachments, attributes to HTML  tags  are
              normally  ignored,  unless  the  tag/attribute  pair matches the
              regular expression html_tag_attributes.   If  the  tag/attribute
              pair  matches  this  regexp,  the  value  of  the attribute (for
              instance, the URL for  the  a/href  attribute)  is  scanned  for
              words.
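
              To see which tag/attribute pairs the default value selects,
              the regular expression can be translated to Python's re
              syntax, as in the sketch below.  The helper names
              emacs_to_python and attribute_is_scanned are hypothetical,
              the translation covers only the operators listed in the
              SYNTAX section, and matching the whole tag/attribute string
              (rather than a prefix or substring) is an assumption, not
              documented behavior.

```python
import re

DEFAULT_HTML_TAG_ATTRIBUTES = (
    r"a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color"
)


def emacs_to_python(pattern):
    """Translate the Emacs-style subset (\\| and \\(...\\)) to `re` syntax."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \| \( \) are operators in Emacs syntax; keep other escapes.
            out.append(nxt if nxt in "|()" else "\\" + nxt)
            i += 2
        elif ch in "|()":
            out.append("\\" + ch)  # unescaped | ( ) are literals in Emacs syntax
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def attribute_is_scanned(tag, attribute, regexp=DEFAULT_HTML_TAG_ATTRIBUTES):
    """True if the tag/attribute pair matches the configured regexp."""
    compiled = re.compile(emacs_to_python(regexp), re.IGNORECASE)
    # Assumption: the regexp must match the entire "tag/attribute" string.
    return compiled.fullmatch(tag + "/" + attribute) is not None


href_scanned = attribute_is_scanned("A", "HREF")
width_scanned = attribute_is_scanned("img", "width")
```

              With the default value, the href attribute of an a tag is
              scanned regardless of letter case, while the width attribute
              of an img tag is not.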

       mail_headers
              (type regexp, default value from:\|subject:)
              A  regular  expression  determining  which  headers of an e-mail
              message are scanned for words.

       spam_header
              (type string, default value X-Spam)
              The name of the header that spamoracle mark adds to incoming  e-
              mail   messages,   with   the   results   of  the  spam/non-spam
              classification.

       attachments_header
              (type string, default value X-Attachments)
              The name of the header that spamoracle mark adds to incoming  e-
              mail  messages,  with  the one-line summary of attachment types,
              names and character sets.  The generation of this header can  be
              turned off with the summarize_attachment parameter.

       summarize_attachment
              (type boolean, default value true)
              If  this  parameter is set, spamoracle mark generates a one-line
              summary of the attachments of the incoming messages, and inserts
              this  summary in the message headers.  Setting this parameter to
              false disables the generation of this extra header.

       num_meaningful_words
              (type integer, default value 15)
              Maximum number of "meaningful" words that are retained for
              computing   the   spam   probability.    During  mail  analysis,
              spamoracle extracts all words of the message, and retains  those
              whose  spam frequency (frequency of occurrence in spam messages)
              is closest to 1 or to  0.   At  most  num_meaningful_words  such
              "meaningful" words are retained.

       max_repetitions
              (type integer, default value 2)
              Maximum  number  of  times  a given word can occur in the set of
              "meaningful" words retained for computing the spam  probability.
              The  default  value of 2 means that at most 2 occurrences of the
              same word will be retained.
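
              The interaction between num_meaningful_words and
              max_repetitions can be sketched as follows.  This is an
              illustration of the selection rule described above, not
              spamoracle's implementation: the function name
              select_meaningful and the input representation (one
              (word, spam frequency) pair per occurrence in the message)
              are assumptions.

```python
def select_meaningful(occurrences, num_meaningful_words=15, max_repetitions=2):
    """Keep the occurrences whose spam frequency is closest to 0 or to 1.

    occurrences: list of (word, spam_frequency) pairs, one per occurrence
    of a word in the message.
    """
    # "Closest to 1 or to 0" is the same as "farthest from 0.5".
    ranked = sorted(occurrences, key=lambda wf: abs(wf[1] - 0.5), reverse=True)
    kept, times_kept = [], {}
    for word, freq in ranked:
        if len(kept) == num_meaningful_words:
            break  # at most num_meaningful_words words are retained
        if times_kept.get(word, 0) < max_repetitions:
            kept.append((word, freq))
            times_kept[word] = times_kept.get(word, 0) + 1
    return kept


message = [("viagra", 0.99)] * 3 + [("hello", 0.5), ("meeting", 0.02)]
top3 = select_meaningful(message, num_meaningful_words=3)
```

              In this example the third occurrence of the same word is
              skipped because max_repetitions is 2, which leaves room for a
              different word among the three retained.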

       low_freq_limit
              (type float, default value 0.01)

       high_freq_limit
              (type float, default value 0.99)
              The spam frequency of a word is computed as the number of
              occurrences in spam divided by the number of occurrences in all
              messages.  This ratio is then clipped to the interval
              [low_freq_limit, high_freq_limit], so that words that are
              extremely rare or extremely common in spam do not bias the
              probability  computation  too  much.  The default values of 0.01
              and 0.99 are adequate for a corpus of a  few  thousand  e-mails.
              For  larger  corpora  (e.g. 10000 e-mails), the values 0.001 and
              0.999 may give better results.
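
              The clipping step amounts to the following small Python
              sketch.  The function name spam_frequency and its argument
              names are hypothetical, and rejecting a word with no recorded
              occurrences is an assumption; the text above does not say how
              such words are treated.

```python
def spam_frequency(spam_count, total_count,
                   low_freq_limit=0.01, high_freq_limit=0.99):
    """Occurrences in spam over occurrences in all messages, clipped."""
    if total_count <= 0:
        # Assumption: words never seen before are handled elsewhere.
        raise ValueError("word has no recorded occurrences")
    ratio = spam_count / total_count
    return min(high_freq_limit, max(low_freq_limit, ratio))


never_in_spam = spam_frequency(0, 40)    # raw ratio 0.0, raised to the low limit
only_in_spam = spam_frequency(40, 40)    # raw ratio 1.0, lowered to the high limit
mixed = spam_frequency(3, 4)             # raw ratio 0.75, left unchanged
```

              Without clipping, a word seen only in non-spam would have
              frequency exactly 0, and a single such word could dominate
              the probability computation.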

       min_meaningful_words
              (type integer, default value 5)
              Minimum number of "meaningful" words below which spamoracle mark
              refuses  to  classify  the  e-mail and outputs "unknown" status.
              This happens with very short e-mails, or  e-mails  that  consist
              exclusively of links and pictures.

       good_mail_prob
              (type float, default value 0.2)
              Spam  probability  below  which the e-mail is classified as non-
              spam.

       spam_mail_prob
              (type float, default value 0.8)
              Spam probability above which the e-mail is classified  as  spam.
              Messages  whose  probability  falls  between  good_mail_prob and
              spam_mail_prob are classified as "unknown".
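
              Taken together, min_meaningful_words, good_mail_prob and
              spam_mail_prob define a three-way decision, sketched below in
              Python.  The function name classify and the returned label
              strings are illustrative assumptions; the exact text that
              spamoracle mark writes into the header is not specified here.

```python
def classify(spam_prob, num_words_found,
             min_meaningful_words=5, good_mail_prob=0.2, spam_mail_prob=0.8):
    """Map a spam probability to "spam", "non-spam" or "unknown"."""
    if num_words_found < min_meaningful_words:
        return "unknown"  # too few meaningful words to decide
    if spam_prob < good_mail_prob:
        return "non-spam"
    if spam_prob > spam_mail_prob:
        return "spam"
    return "unknown"  # probability falls between the two thresholds


verdicts = [
    classify(0.05, 15),   # low probability, enough words
    classify(0.97, 15),   # high probability, enough words
    classify(0.50, 15),   # between the thresholds
    classify(0.97, 3),    # too few meaningful words
]
```

              Moving the two thresholds closer together shrinks the
              "unknown" band at the cost of more misclassified messages.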

AUTHOR

       Xavier Leroy <Xavier.Leroy@inria.fr>
