mailfoot - a full-online-ordered-training simulator for use with dbacl.

NAME

       mailfoot - a full-online-ordered-training simulator for use with dbacl.

SYNOPSIS

       mailfoot command [ command_arguments ]

DESCRIPTION

       mailfoot  automates  the  task   of   testing   email   filtering   and
       classification  programs  such as dbacl(1).  Given a set of categorized
       documents, mailfoot initiates test runs to estimate the  classification
       errors  and  thereby  permit  fine  tuning  of  the  parameters  of the
       classifier.

       Full Online Ordered Training is a learning method for email classifiers
       where  each  incoming  email  is learned as soon as it arrives, thereby
       always  keeping  category  descriptions  up  to  date  for   the   next
       classification.    This   directly  models  the  way  that  some  email
       classifiers are used in practice.
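
       As a rough illustration of the loop described above, the following
       shell sketch classifies each message, counts the error if the guess
       is wrong, and then learns the message before the next one is seen.
       Everything in it is hypothetical: the foot_* function names, the
       one-message-per-line stream format and the word-list "classifier"
       are stand-ins chosen for brevity, not the way mailfoot or dbacl(1)
       work internally.

```shell
#!/bin/bash
# Toy FOOT loop (illustration only, not mailfoot's implementation).
# A "model" is a file holding one learned word per line.

foot_learn() {      # foot_learn MODELDIR CATEGORY MESSAGE
    printf '%s\n' "$3" | tr -s ' ' '\n' >> "$1/$2"
}

foot_classify() {   # foot_classify MODELDIR MESSAGE CATEGORY...
    local dir=$1 msg=$2 best= best_score=-1 c score
    shift 2
    for c in "$@"; do
        score=0
        if [ -s "$dir/$c" ]; then
            # Count the message words already learned for this category.
            score=$(printf '%s\n' "$msg" | tr -s ' ' '\n' |
                    grep -c -x -F -f "$dir/$c" || true)
        fi
        # Ties go to the category listed first.
        if [ "$score" -gt "$best_score" ]; then best=$c; best_score=$score; fi
    done
    printf '%s' "$best"
}

foot_run() {        # foot_run MODELDIR CATEGORY... < stream
    # Each input line is "truecategory message words..."; prints "errors total".
    local dir=$1 errors=0 total=0 truecat msg guess
    shift
    while read -r truecat msg; do
        guess=$(foot_classify "$dir" "$msg" "$@")
        [ "$guess" = "$truecat" ] || errors=$((errors + 1))
        # Learn at once, so the next classification sees this message.
        foot_learn "$dir" "$truecat" "$msg"
        total=$((total + 1))
    done
    echo "$errors $total"
}
```

       Feeding the same messages in a different order changes which words
       are known at each step, which is why the error count depends on the
       ordering of the stream.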

       FOOT’s error rates depend directly on the order  in  which  emails  are
       seen.   A  small  change in ordering, as might happen due to networking
       delays, can  have  an  impact  on  the  number  of  misclassifications.
       Consequently, mailfoot does not give meaningful results unless the
       sample emails are chosen carefully. However, as this method is
       commonly used by spam filters, it is still worth computing to foster
       comparisons. Other methods (see mailcross(1), mailtoe(1)) attempt to
       capture the behaviour of classification errors in other ways.

       To  improve and stabilize the error rate calculation, mailfoot performs
       the FOOT simulations several times on slightly reordered email streams,
       and  averages  the  results.  The reorderings occur by multiplexing the
       emails from each category mailbox in random order. Thus  if  there  are
       three  categories,  the  first email classified is chosen randomly from
       the front of the sample email streams of each type.  The  second  email
       is also chosen randomly among the three types, from the front of the
       streams after the first email was removed. Simulation stops when all
       sample streams are exhausted.
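
       The multiplexing step can be pictured with the following sketch. It
       assumes one message identifier per line in each input file and a
       uniform random choice among the streams that still have messages
       left; the function name multiplex and the seed argument are
       hypothetical, and the real mailfoot may select streams differently.

```shell
#!/bin/sh
# Randomly interleave several streams while keeping each stream's own order.
# Illustration only: one "message" per line, one file per category.

multiplex() {   # multiplex SEED FILE...  -> prints "FILE<TAB>message"
    seed=$1; shift
    awk -v seed="$seed" '
        BEGIN    { srand(seed) }
        FNR == 1 { files[++nf] = FILENAME }          # remember each stream
                 { n[FILENAME]++; q[FILENAME, n[FILENAME]] = $0 }
        END {
            for (i = 1; i <= nf; i++) head[files[i]] = 1
            left = nf
            while (left > 0) {
                k = int(rand() * left) + 1           # pick a non-empty stream
                f = files[k]
                print f "\t" q[f, head[f]]           # emit its front message
                if (++head[f] > n[f]) {              # stream exhausted: drop it
                    files[k] = files[left]
                    left--
                }
            }
        }' "$@"
}
```

       Calling the function again with a different seed gives another
       reordering in which the messages of each category still appear in
       their original relative order.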

       mailfoot uses the environment variable MAILFOOT_FILTER when  executing,
       which  permits  the  simulation  of  arbitrary  filters, provided these
       satisfy the compatibility conditions stated in the ENVIRONMENT  section
       below.

       For   convenience,  mailfoot  implements  a  testsuite  framework  with
       predefined wrappers for several open source classifiers.  This  permits
       the  direct  comparison  of  dbacl(1) with competing classifiers on the
       same set of email samples. See the USAGE section below.

       During preparation, mailfoot builds a subdirectory named mailfoot.d  in
       the  current  working directory.  All needed calculations are performed
       inside this subdirectory.

EXIT STATUS

       mailfoot returns 0 on success, 1 if a problem occurred.

COMMANDS

       prepare size
              Prepares a subdirectory named mailfoot.d in the current  working
              directory, and populates it with empty subdirectories for
              exactly size simulation runs.

       add category [ FILE ]...
              Takes a set of emails from either FILE if specified,  or  STDIN,
              and  associates  them  with  category.   The  ordering of emails
              within FILE is preserved, and subsequent FILEs are  appended  to
              the  first  in  each  category.   This  command  can be repeated
              several times, but should be executed at least once.

       clean  Deletes the directory mailfoot.d and all its contents.

       run    Multiplexes randomly from the email streams added  earlier,  and
              relearns  categories  only  when a misclassification occurs. The
              simulation is repeated size times.

       summarize
              Prints average error rates for the simulations.

       plot [ ps | logscale ]...
              Plots the number of errors over simulation time. The "ps"
              option, if present, writes the plot to a postscript file in the
              directory mailfoot.d/plots instead of showing it on-screen. The
              "logscale" option, if present, plots both axes on a logarithmic
              scale.

       review truecat predcat
              Scans the last run statistics  and  extracts  all  the  messages
              which  belong  to category truecat but have been classified into
              category predcat.  The extracted  messages  are  copied  to  the
              directory mailfoot.d/review for perusal.

       testsuite list
              Shows  a  list of available filters/wrapper scripts which can be
              selected.

       testsuite select [ FILTER ]...
              Prepares the filter(s) named FILTER to be used  for  simulation.
              The  filter  name is the name of a wrapper script located in the
              directory /usr/share/dbacl/testsuite.  Each filter has  a  rigid
              interface  documented  below, and the act of selecting it copies
              it to the mailfoot.d/filters  directory.  Only  filters  located
              there are used in the simulations.

       testsuite deselect [ FILTER ]...
              Removes    the    named    filter(s)    from    the    directory
              mailfoot.d/filters so that they are not used in the  simulation.

       testsuite run [ plots ]
              Invokes  every selected filter on the datasets added previously,
              and calculates misclassification rates. If the "plots" option is
              present,  each filter simulation is plotted as a postscript file
              in the directory mailfoot.d/plots.

       testsuite status
              Describes the scheduled simulations.

       testsuite summarize
              Shows the simulation results for all filters. Only makes sense
              after the testsuite run command.

USAGE

       The  normal  usage pattern is the following: first, you should separate
       your email collection into several categories (manually or  otherwise).
       Each  category  should be associated with one or more folders, but each
       folder should not contain more than  one  category.  Next,  you  should
       decide how many runs to use, say 10. More runs give more stable
       error rate estimates, but take more time. Now you can type

       % mailfoot prepare 10

       Next,  for  every  category,  you must add every folder associated with
       this category. Suppose you have three categories named spam, work,  and
       play,  which  are  associated with the mbox files spam.mbox, work.mbox,
       and play.mbox respectively. You would type

       % mailfoot add spam spam.mbox
       % mailfoot add work work.mbox
       % mailfoot add play play.mbox

       You should aim for a similar number of emails in each category, as  the
       random  multiplexing  will be unbalanced otherwise. The ordering of the
       email messages in each *.mbox  file  is  important,  and  is  preserved
       during each simulation. If you repeatedly add to the same category, the
       later mailboxes will be appended to the first, preserving  the  implied
       ordering.

       You   can  now  perform  as  many  FOOT  simulations  as  desired.  The
       multiplexed emails are  classified  and  learned  one  at  a  time,  by
       executing    the    command   given   in   the   environment   variable
       MAILFOOT_FILTER. If not set, a default value is used.

       % mailfoot run
       % mailfoot summarize
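
       The steps so far can be collected into a small helper. This is only
       a convenience sketch: the function name foot_session, the
       category=file argument form and the MAILFOOT override variable are
       hypothetical, while the mailfoot commands themselves are the ones
       shown above.

```shell
#!/bin/sh
# Run prepare, add, run and summarize in one go.
# Set MAILFOOT=echo to preview the commands without executing them.
: "${MAILFOOT:=mailfoot}"

foot_session() {   # foot_session RUNS CATEGORY=MBOX...
    runs=$1; shift
    $MAILFOOT prepare "$runs" || return 1
    for pair in "$@"; do
        # Split "category=file" at the first "=".
        $MAILFOOT add "${pair%%=*}" "${pair#*=}" || return 1
    done
    $MAILFOOT run && $MAILFOOT summarize
}
```

       For the three categories used above, the call would be
       foot_session 10 spam=spam.mbox work=work.mbox play=play.mbox.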

       The testsuite commands are designed to simplify  the  above  steps  and
       allow  comparison  of  a wide range of email classifiers, including but
       not limited  to  dbacl.   Classifiers  are  supported  through  wrapper
       scripts, which are located in the /usr/share/dbacl/testsuite directory.

       The first stage when using the testsuite is deciding which  classifiers
       to compare.  You can view a list of available wrappers by typing:

       % mailfoot testsuite list

       Note  that  the  wrapper  scripts are NOT the actual email classifiers,
       which must be installed separately  by  your  system  administrator  or
       otherwise.   Once this is done, you can select one or more wrappers for
       the simulation by typing, for example:

       % mailfoot testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are  not  selected.  Note  also  that some wrappers can have hard-coded
       category  names,  e.g.  if  the   classifier   only   supports   binary
       classification. Heed the warning messages.

       It  remains  only  to  run the simulation. Beware, this can take a long
       time (several hours depending on the classifier).

       % mailfoot testsuite run
       % mailfoot testsuite summarize

       Once you are all done, you can delete the working files, log files etc.
       by typing

       % mailfoot clean

SCRIPT INTERFACE

       mailfoot testsuite takes care of learning and classifying your prepared
       email corpora for each  selected  classifier.  Since  classifiers  have
       widely  varying  interfaces,  this  is  only possible by wrapping those
       interfaces individually into a standard  form  which  can  be  used  by
       mailfoot testsuite.

       Each  wrapper  script  is  a  command  line tool which accepts a single
       command followed by zero or more optional arguments,  in  the  standard
       form:

       wrapper command [argument]...

       Each  wrapper  script  also  makes  use  of  STDIN and STDOUT in a well
       defined way. If no behaviour is described,  then  no  output  or  input
       should be used.  The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
              category filenames is expected in $2, $3, etc. The script writes
              the category name corresponding to the input email on STDOUT. No
              trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
              a  suitable  category  file name is expected in $2. No output is
              written to STDOUT.

       clean  In this case, a directory is expected in $2, which  is  examined
              for  old  database  information. If any old databases are found,
              they are purged or reset. No output is written to STDOUT.

       describe
              In this case, a single line of text is written to STDOUT,
              describing  the  filter’s functionality. The line should be kept
              short to prevent line wrapping on a terminal.

       bootstrap
              In this case, a directory is expected in $2. The wrapper  script
              first checks for the existence of its associated classifier, and
              other prerequisites.  If  the  check  is  successful,  then  the
              wrapper  is  cloned  into  the  supplied  directory.  A courtesy
              notification should be given on STDOUT  to  express  success  or
              failure. It is also permissible to give longer descriptions or
              caveats.

       toe    Used by mailtoe(1).

       foot   In this case, a list of categories is expected in $3,  $4,  etc.
              Every possible category must be listed. Preceding this list, the
              true category is given in $2.
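
       To make the interface more concrete, here is a skeleton of what a
       wrapper could look like. It is a sketch under several assumptions:
       the "classifier" is a hypothetical word-list matcher, the .toydb
       suffix and the toy_* names are invented, the category name is taken
       to be the base name of the category file, and the foot command is
       assumed to learn every message under its true category. The toe
       command is left out. Inside toy_wrapper the command has been shifted
       away, so the $2 of the descriptions above appears as $1.

```shell
#!/bin/bash
# Toy wrapper skeleton (illustration only, not a shipped testsuite wrapper).

toy_score() {      # toy_score EMAILFILE CATEGORYFILE -> count of known words
    [ -s "$2.toydb" ] || { echo 0; return; }
    tr -cs '[:alnum:]' '\n' < "$1" | grep -c -x -F -f "$2.toydb" || true
}

toy_classify() {   # toy_classify EMAILFILE CATEGORYFILE... -> category name
    local email=$1 best= best_score=-1 f s
    shift
    for f in "$@"; do
        s=$(toy_score "$email" "$f")
        if [ "$s" -gt "$best_score" ]; then best=$f; best_score=$s; fi
    done
    printf '%s' "$(basename "$best")"      # no trailing newline
}

toy_learn() {      # toy_learn CATEGORYFILE < mbox stream
    tr -cs '[:alnum:]' '\n' | grep . >> "$1.toydb" || true
}

toy_wrapper() {    # toy_wrapper command [argument]...
    local cmd=$1 tmp="${TEMPDIR:-/tmp}/toy.$$" truecat f
    [ $# -gt 0 ] && shift
    case "$cmd" in
    filter)    cat > "$tmp"; toy_classify "$tmp" "$@"; rm -f "$tmp" ;;
    learn)     toy_learn "$1" ;;
    clean)     rm -f "$1"/*.toydb ;;
    describe)  echo "toy word-list matcher (illustration only)" ;;
    bootstrap) if command -v grep >/dev/null && cp "$0" "$1/"; then
                   echo "toy wrapper installed"
               else
                   echo "toy wrapper could not be installed"; return 1
               fi ;;
    foot)      truecat=$1; shift
               cat > "$tmp"; toy_classify "$tmp" "$@"
               for f in "$@"; do   # learn under the true category
                   [ "$(basename "$f")" = "$truecat" ] && toy_learn "$f" < "$tmp"
               done
               rm -f "$tmp" ;;
    *)         echo "usage: wrapper command [argument]..." >&2; return 1 ;;
    esac
}
# When saved as a script, the last line would be:  toy_wrapper "$@"
```

       Temporary files are placed under TEMPDIR when it is set, following
       the convention described in the ENVIRONMENT section.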

ENVIRONMENT

       Right after loading, mailfoot reads the hidden file .mailfootrc in  the
       $HOME  directory, if it exists, so this would be a good place to define
       custom values for environment variables.

       MAILFOOT_FILTER
              This variable contains a shell command to be executed repeatedly
              during  the  running  stage.  The command should accept an email
              message on STDIN and output a resulting category  name.  On  the
              command  line,  it  should  also  accept first the true category
              name, then a list of all possible category file names.   If  the
              output  category  does  not  match  the  true category, then the
              relevant  categories  are  assumed   to   have   been   silently
              updated/relearned.   If  MAILFOOT_FILTER  is undefined, mailfoot
              uses a default value.

       TEMPDIR
              This directory is exported for the benefit of  wrapper  scripts.
              Scripts which need to create temporary files should place them at
              the location given in TEMPDIR.
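
       A .mailfootrc file might look like the fragment below. This assumes
       the file is read as ordinary shell variable assignments, which is
       not stated explicitly above. The wrapper path is hypothetical and
       must be replaced by a command that satisfies the conditions listed
       for MAILFOOT_FILTER.

```shell
# ~/.mailfootrc (sketch)
# Hypothetical wrapper path; its "foot" command receives the true category
# followed by the list of category files, and reads the email on STDIN.
MAILFOOT_FILTER="$HOME/bin/my_wrapper foot"
export MAILFOOT_FILTER
```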

NOTES

       The subdirectory mailfoot.d can grow quite large. It  contains  a  full
       copy  of the training corpora, as well as learning files for size times
       all the added categories, and various log files.

       FOOT simulations for dbacl(1) are very, very slow (order n squared) and
       will take all night to perform. This is not easy to improve.

WARNING

       Because  the ordering of emails within the added mailboxes matters, the
       estimated error rates are not well defined or  even  meaningful  in  an
       objective  sense.   However,  if  the sample emails represent an actual
       snapshot of a user’s incoming email, then the error rates are  somewhat
       meaningful.  The  simulations  can  then  be  interpreted  as alternate
       realities where a given classifier would have intercepted the  incoming
       mail.

SOURCE

       The  source code for the latest version of this program is available at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR

       Laird A. Breyer <laird@lbreyer.com>
