NAME

       ifile - core executable for the ifile mail filtering system

SYNOPSIS

       ifile  [-b  file] [-q|-Q] [-g] [-k] [-o] [-v num] [lexing options] file
       ...
       ifile -c -q|-Q [-T threshold] [-b file] [-g] [-k] [-o] [lexing options]
       file ...
       ifile  [-b  file]  [-d folder] [-i folder|-u folder] [-g] [-k] [-o] [-v
       num] [lexing options] file ...
       ifile -r [-b file]

DESCRIPTION

       ifile is a mail filter client that uses machine learning to classify
       e-mail into folders (mailboxes).  The algorithm it uses is called
       Naive Bayes.  Naive Bayes treats each document as an unordered
       collection of words and assigns the document to the folder whose
       word distribution most closely matches the document's own word
       distribution.
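
       The idea can be illustrated with a minimal sketch.  This is not
       ifile's actual implementation: it assumes multinomial Naive Bayes
       with add-one smoothing and equal folder priors, and the names
       train, score and classify as well as the toy folders are
       hypothetical.

```python
import math
from collections import Counter

def train(folders):
    """folders maps a folder name to a list of messages (each a list of words)."""
    word_counts = {name: Counter(w for msg in msgs for w in msg)
                   for name, msgs in folders.items()}
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, vocab

def score(words, counts, vocab_size):
    """Log-likelihood of the words under one folder (add-one smoothing, an assumption)."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)

def classify(words, word_counts, vocab):
    """Return folder names ordered from highest to lowest score."""
    scores = {name: score(words, counts, len(vocab))
              for name, counts in word_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy training data, invented for illustration only.
folders = {
    "spam": [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"]],
    "friends": [["lunch", "tomorrow"], ["see", "you", "at", "lunch"]],
}
word_counts, vocab = train(folders)
ranking = classify(["buy", "cheap", "lunch"], word_counts, vocab)
print(ranking)
```

       Word order plays no role in the score, which is what "unordered
       collection of words" means in practice.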

OPTIONS

       -b, --db-file=file
              Location to read/store ifile database.  Default is ~/.idata

       -c, --concise
              Equivalent of "ifile -v 0 | head -1 | cut -f1 -d' '".  Must be
              used with -q or -Q.

       -d, --delete=folder
              Delete the statistics for each of the files from the category
              folder

       -f, --folder-calcs=folder
              Show the word-probability calculations for folder

       -g, --log-file
              Create and store debugging information in ~/.ifile.log

       -i, --insert=folder
              Add the statistics for each of the files to the category folder

       -k, --keep-infrequent
              Leave  in  the  database words that occur infrequently (normally
              they are tossed)

       -l, --query-loocv=folder
              For each of the files, temporarily  removes  file  from  folder,
              performs  query  and then reinserts file in folder.  Database is
              not modified.

       -o, --occur
              Use a bit-vector document representation: count each word at
              most once per document.

       -q, --query
              Output rating scores for each of the files

       -Q, --query-insert
              For  each  of the files, output rating scores and add statistics
              for the folder with the highest score

       -T, --threshold=threshold
              When used with both -c and -q, output the two highest-ranking
              categories if their scores differ by at most threshold / 1000,
              which can be used to detect borderline cases.  When used with
              -q only and any threshold > 0, also output the score
              difference as a percentage.  For example,
                     ifile -T1 -q foo.txt
              might result in
                     spam -15570.48640776
                     non-spam -18728.00272369
                     diff[spam,non-spam](%) 9.21
              If so, then
                     ifile -T93 -q -c foo.txt
              will result in
                     foo.txt spam,non-spam
              whereas
                     ifile -T92 -q -c foo.txt
              will result in
                     foo.txt spam

       -r, --reset-data
              Erases all currently stored information

       -u, --update=folder
              Same as --insert, except that statistics are added only if
              folder already exists

       -v, --verbosity=num
              Amount  of  output while running: 0=silent, 1=quiet, 2=progress,
              3=verbose, 4=debug

       Lexing options:

       -a, --alpha-lexer
              Lex words as sequences of alphabetic characters (default)

       -A, --alpha-only-lexer
              Only lex space-separated character sequences which are  composed
              entirely of alphabetic characters

       -h, --strip-header
              Skip all of the header lines except Subject:, From: and To:

       -m, --max-length=char
              Ignore  portion  of  message  after  first char characters.  Use
              entire message if char set to 0.  Default is 50,000.

       -p, --print-tokens
              Just  tokenize  and  print,  don’t  do  any  other   processing.
              Documents are returned as a list of word, frequency pairs.

       -s, --no-stoplist
              Do not throw out overly frequent (stoplist) words when lexing

       -S, --stemming
              Use ’Porter’ stemming algorithm when lexing documents

       -w, --white-lexer
              Lex words as sequences of space separated characters

       If  no files are specified on the command line, ifile will use standard
       input as its message to process.
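
       The difference between the three lexers (-a, -A and -w) can be
       sketched as follows.  This is an illustration only: it assumes
       ASCII letters and no case folding, which may differ from what
       ifile actually does, and the function names are hypothetical.

```python
import re

def alpha_lexer(text):
    # -a style: every maximal run of alphabetic characters is a token
    return re.findall(r"[A-Za-z]+", text)

def alpha_only_lexer(text):
    # -A style: keep whitespace-separated tokens made up entirely of letters
    return [tok for tok in text.split() if tok.isalpha()]

def white_lexer(text):
    # -w style: tokens are whitespace-separated character sequences
    return text.split()

sample = "Re: meet at 10am, jrennie@example.com"
print(alpha_lexer(sample))
print(alpha_only_lexer(sample))
print(white_lexer(sample))
```

       On this sample the alphabetic lexer splits the address into three
       tokens, the alpha-only lexer keeps just "meet" and "at", and the
       whitespace lexer keeps punctuation attached to each token.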

       -?, --help
              Give this help list

       --usage
              Give a short usage message

       -V, --version
              Print program version

       Mandatory or optional arguments to long options are also  mandatory  or
       optional for any corresponding short options.
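
       The numbers in the -T example are consistent with the following
       calculation.  The formula is inferred from that example, not taken
       from the ifile source, so treat it as an assumption: the reported
       percentage is the gap between the two scores divided by the sum of
       their magnitudes, and -T compares that ratio with threshold / 1000.
       The function names are hypothetical.

```python
def diff_percent(best, runner_up):
    """Gap between two scores as a percentage of their combined magnitude (assumed formula)."""
    return 100.0 * abs(best - runner_up) / (abs(best) + abs(runner_up))

def is_border_case(best, runner_up, threshold):
    """True when the relative gap is at most threshold / 1000."""
    return diff_percent(best, runner_up) / 100.0 <= threshold / 1000.0

# Scores copied from the -T example in the OPTIONS section.
spam, non_spam = -15570.48640776, -18728.00272369
print(round(diff_percent(spam, non_spam), 2))   # 9.21
print(is_border_case(spam, non_spam, 93))       # True: both categories listed
print(is_border_case(spam, non_spam, 92))       # False: only the top category
```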

FILES

       ~/.idata
              ifile  database  (default  location).  See FAQ included in ifile
              package for description of database format.

AUTHOR

       Jason  Rennie  <jrennie@csail.mit.edu>  and  many  others.    See   the
       ChangeLog for the full list.

EXAMPLES

       Before  using  ifile,  you  need  to train it.  Let’s say that you have
       three  folders,  "spam",  "ifile"  and  "friends",  and  the  following
       directory structure:

              /--+--spam----+--1
                 |          +--2
                 |          +--3
                 |
                 +--ifile---+--1
                 |          +--2
                 |          +--3
                 |
                 +--friends-+--1
                            +--2
                            +--3

       The following commands build the ifile database in ~/.idata (use the -b
       option to specify a different location for the database):

              ifile -h -i spam /spam/*
              ifile -h -i ifile /ifile/*
              ifile -h -i friends /friends/*

       The -h option strips off headers besides "Subject:", "From:" and "To:".
       I find that -h improves ifile’s performance, but you may find otherwise
       for your personal collection.

       Note that we have made the argument to -i the same as the corresponding
       folder  name. This is not necessary. The argument to -i can be any word
       you want to use to identify a category of e-mails, but it must not
       contain whitespace characters (space, tab, linefeed, etc.).

       At this point, your ~/.idata file should look something like this:

              spam ifile friends
              662 1020 6451
              3 3 3
              jrennie 9 0:3 1:18 2:16
              mindspring 6 1:7 2:5
              make 9 0:5 1:3
              yahoo 9 0:1 1:22 2:2

       The  first  line is the space-separated list of folders. Their ordering
       specifies a numbering (spam=0, ifile=1, friends=2). The second line  is
       a  token  count  for each folder (e.g. 662 tokens observed in the three
       spam messages). The third line is an e-mail count for each folder (e.g.
       3  e-mails  for  each  of spam, ifile and friends). Each following line
       specifies statistics for a word. The format of a line is

              word age folder:count [folder:count ...]

       where folder  is  the  folder  number  determined  by  the  first  line
       ordering.  Folders  with  a  count of zero are not listed. So, the line
       beginning with "jrennie" indicates that "jrennie" appeared 3  times  in
       "spam"  e-mails,  18 times in "ifile" e-mails and 16 times in "friends"
       e-mails. The age is the number of  e-mails  that  have  been  processed
       since  the  word  was  added to the database. Very infrequent words are
       pruned from the database to keep the database size down.
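
       The format described above is simple enough to read with a few
       lines of code.  The following is a minimal sketch that handles
       only the layout shown in this example; parse_idata is a
       hypothetical helper, not part of ifile, and a real database may
       contain details not covered here.

```python
def parse_idata(text):
    """Parse the database layout shown above into plain Python structures."""
    lines = text.strip().splitlines()
    folders = lines[0].split()                                  # line 1: folder names
    token_counts = dict(zip(folders, map(int, lines[1].split())))  # line 2: tokens per folder
    email_counts = dict(zip(folders, map(int, lines[2].split())))  # line 3: e-mails per folder
    words = {}
    for line in lines[3:]:
        word, age, *pairs = line.split()
        counts = {}
        for pair in pairs:
            index, count = pair.split(":")
            counts[folders[int(index)]] = int(count)            # folder number -> folder name
        words[word] = {"age": int(age), "counts": counts}
    return folders, token_counts, email_counts, words

sample = """\
spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
mindspring 6 1:7 2:5
make 9 0:5 1:3
yahoo 9 0:1 1:22 2:2
"""
folders, token_counts, email_counts, words = parse_idata(sample)
print(words["jrennie"]["counts"])
```

       Folders with a count of zero are simply absent from a word's
       counts, so "spam" does not appear in the entry for "mindspring".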

       Now that you have a database, you might want to  filter  some  e-mails.
       Say you have the following incoming e-mails:

              /--inbox--+--1
                        +--2
                        +--3

       To find out what folders ifile thinks these e-mails belong in, run

              ifile -c -q /inbox/1
              ifile -c -q /inbox/2
              ifile -c -q /inbox/3

       Let’s  say  that  1  is  about ifile, 2 is spam and 3 is from a friend.
       Assuming ifile does its job correctly, you’ll see output like this:

              /inbox/1 ifile
              /inbox/2 spam
              /inbox/3 friends

       With so little training data, ifile is unlikely to get the labels
       correct, but you should get the idea :-)

       Now, if you move the e-mails to the folders suggested by ifile, you’ll
       want to update the database accordingly. You can do this with the -i
       option, as before. Alternatively, use -Q in place of -q above; this
       automatically adds each e-mail’s statistics to the folder that ifile
       suggests.

       Now, assume for a moment that e-mail 1 was actually spam. We have
       already added its statistics to the "ifile" category and put it in
       the ifile folder, so we need to move it to the spam folder and
       correct the ifile database. We can update the database with the
       following command:

              ifile -d ifile -i spam /inbox/1

       This deletes the e-mail’s statistics from "ifile" and adds them to
       "spam".
