qsf - quick spam filter

NAME

       qsf - quick spam filter

SYNOPSIS

       Filtering:       qsf [-snrAtav] [-d DB] [-g DB]
                            [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
                            [-X NUM]
       Training:        qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
       Retraining:      qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:        qsf -[p|D|R|O] [-d DB]
       Database merge:  qsf -E OTHERDB [-d DB]
       Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
       Denylist query:  qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:            qsf -[h|V]

DESCRIPTION

       qsf  reads  a single email on standard input, and by default outputs it
       on standard output.   If  the  email  is  determined  to  be  spam,  an
       additional  header  ("X-Spam:  YES")  will be added, and optionally the
       subject line can have "[SPAM]" prepended to it.

       qsf is intended to be used in a procmail(1) recipe, in a  ruleset  such
       as this:

               :0 wf
               | qsf -ra

               :0 H:
               * X-Spam: YES
               $HOME/mail/spam

       For  more  examples,  including  sample  procmail(1)  recipes,  see the
       EXAMPLES section below.

TRAINING

       Before qsf can be used properly, it needs to be trained.  A good way to
       train qsf is to collect a copy of all your email into two folders - one
       for spam, and one for non-spam.  Once you have done this, you  can  use
       the training function, like this:

               qsf -aT spam-folder non-spam-folder

       This  will generate a database that can be used by qsf to guess whether
       email received in the future is spam or not.  Note  that  this  initial
       training  run  may  take a long time, but you should only need to do it
       once.
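
       The -T option (see OPTIONS) tests each message and updates the
       database only when a message is miscategorised, repeating for up
       to MAXROUNDS rounds.  That loop can be sketched in Python as
       below.  The names train, is_spam and mark are hypothetical
       stand-ins, and stopping once a round produces no mistakes is an
       assumption; this page does not state the real stopping rule.

```python
def train(spam, nonspam, is_spam, mark, max_rounds=200):
    """Train-on-error loop over two folders of messages.

    spam, nonspam: lists of messages (hypothetical in-memory folders).
    is_spam(msg) -> bool and mark(msg, as_spam) are caller-supplied
    stand-ins for the real classifier and database update.
    Returns the number of rounds actually run.
    """
    round_no = 0
    for round_no in range(1, max_rounds + 1):
        mistakes = 0
        for msg in spam:
            if not is_spam(msg):      # spam that slipped through
                mark(msg, True)
                mistakes += 1
        for msg in nonspam:
            if is_spam(msg):          # false positive
                mark(msg, False)
                mistakes += 1
        if mistakes == 0:             # assumed stopping rule
            break
    return round_no
```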

       To mark a single message as spam, pipe it to qsf with  the  --mark-spam
       or  -m  ("mark  as  spam")  option.   This  will  update  the  database
       accordingly and discard the email.

       To mark a single message as non-spam, pipe it to qsf with  the  --mark-
       nonspam  or  -M  ("mark as non-spam") option.  Again, this will discard
       the email.

       If a message has been mis-tagged, simply send it to qsf as the opposite
       type,  i.e.  if it has been mistakenly tagged as spam, pipe it into qsf
       --mark-nonspam --weight=2 to  add  it  to  the  non-spam  side  of  the
       database with double the usual weighting.

OPTIONS

       The qsf options are listed below.

       -d, --database [TYPE:]FILE
              Use  FILE  as the spam/non-spam database.  The default is to use
              /var/lib/qsfdb and, if that is not available  or  is  read-only,
              $HOME/.qsfdb.   This  option  can  also  be useful if there is a
              system-wide database but you do not want to use it -  specifying
              your own here will override the default.

              If   you   prefix   the  filename  with  a  TYPE,  of  the  form
              btree:$HOME/.qsfdb, then this will specify what kind of database
              FILE is, such as list, btree, gdbm, sqlite and so on.  Check the
              output of qsf -V to see which database backends  are  available.
              The default is to auto-detect the type, or, if the file does not
              already exist, use list.  Note that TYPE is not  case-sensitive.

       -g, --global [TYPE:]FILE
              Use   FILE   as   the   default   global  database,  instead  of
              /var/lib/qsfdb.  If you also specify a database  with  -d,  then
              this  "global"  database  will  be  used  in  read-only  mode in
              conjunction with the read-write database specified with -d.  The
              -g option can be used a second time to specify a third database,
              which will also be used in read-only mode.  Again, the  filename
              can  optionally  be  prefixed  with  a  TYPE which specifies the
              database type.

       -P, --plain-map FILE
              Maintain a mapping of all database tokens  to  their  non-hashed
              counterparts in FILE, one token per line.  This can be useful if
              you want to be able to list the contents of your database  at  a
              later  date,  for  instance  to get a list of email addresses in
              your allow-list.  Note that using this option may slow qsf down,
              and  only  entries  written to the database while this option is
              active will be stored in FILE.

       -s, --subject
              Rewrite the Subject line of any email that turns out to be spam,
              adding "[SPAM]" to the start of the line.

       -S, --subject-marker SUBJECT
              Instead  of  adding "[SPAM]", add SUBJECT to the Subject line of
              any email that turns out to be spam.  Implies -s.

       -H, --header-marker MARK
              Instead of setting the X-Spam header to "YES", set it to MARK if
              email  turns  out  to be spam.  This can be useful if your email
              client can only search all headers for a string, rather than one
              particular  header (so searching for "YES" might match more than
              just the output of qsf).

       -n, --no-header
              Do not add an X-Spam header to messages.

       -r, --add-rating
              Insert an additional header X-Spam-Rating which is a  rating  of
              the  "spamminess"  of  a message from 0 to 100; 90 and above are
              counted as spam, anything under 90 is not considered  spam.   If
              combined with -t, then the rating (0-100) will be output, on its
              own, on standard output.

       -A, --asterisk
              Insert an additional  header  X-Spam-Level  which  will  contain
              between 0 and 20 asterisks (*), depending on the spam rating.

       -t, --test
              Instead  of  passing  the message out on standard output, output
              nothing, and exit 0 if the message is not spam, or exit 1 if the
              message is spam.  If combined with -r, then the spam rating will
              be output on standard output.

       -a, --allowlist
              Enable the allow-list.  This causes the email addresses given in
              the  message’s  "From:" and "Return-Path:" headers to be checked
              against a list; if either  one  matches,  then  the  message  is
              always  treated  as  non-spam,  regardless  of  what  the  token
              database says. When specified with  a  retraining  flag,  -a  -m
              (mark  as  spam) will remove that address from the allow-list as
              well as marking the message as spam, and -a  -M  (mark  as  non-
              spam) will add that address to the allow-list as well as marking
              the message as non-spam.  The idea is that you add all  of  your
              friends  to the allow-list, and then none of their messages ever
              get marked as spam.

       -y, --denylist
              Enable the deny-list.  This causes the email addresses given  in
              the  message’s  "From:" and "Return-Path:" headers to be checked
              against a second list; if either one matches, then the message
              is  always  treated  as spam.  Training works in the same way as
              with -a, except that you must specify -m or -M twice  to  modify
              the  deny-list  instead  of the allow-list, and with the reverse
              syntax: -y -m -m (mark as spam) will add  that  address  to  the
              deny-list,  whereas -y -M -M (mark as non-spam) will remove that
              address from the deny-list.  This  double  specification  is  so
              that  the  usual retraining process never touches the deny-list;
              the  deny-list  should  be  carefully  maintained  rather   than
              automatically generated.

              Normally you would not need to use the deny-list.

       -L, --level, --threshold LEVEL
              Change  the  spam  scoring threshold level which must be reached
              before an email is classified as spam.  The default is 90.

       -Q, --min-tokens NUM
              Only give a score if more than  NUM  tokens  are  found  in  the
              message  -  otherwise the message is assumed to be non-spam, and
              it is not modified in any way.  The default is 0.   This  option
              might  be  useful if you find that very short messages are being
              frequently miscategorised.

       -e, --email, --email-only EMAIL
              Query or update the  allow-list  entry  for  the  email  address
              EMAIL.   With no other options, this will simply output "YES" if
              EMAIL is in the allow-list, or "NO" if it is not.  With  -t,  it
              will  not output anything, but will exit 0 (success) if EMAIL is
              in the allow-list, or 1 (failure) if it  is  not.  With  the  -m
              (mark-spam) option, any previous allow-list entry for EMAIL will
              be removed. Finally, with the -M  (mark-nonspam)  option,  EMAIL
              will be added to the allow-list if it is not already on it.

              If  EMAIL is just the word MSG on its own, then an email will be
              read from standard input, and the email addresses given  in  the
              "From:" and "Return-Path:" headers will be used.

              Using -e automatically switches on -a.

              If  you also specify -y, then the deny-list will be operated on.
              Remember that -m and -M are reversed with the deny-list.

              If you specify an email address of  the  form  @domain  (nothing
              before the @), then the whole domain will be allow-listed or
              deny-listed.

       -v, --verbose
              Add extra X-QSF-Info headers to any filtered  email,  containing
              error  messages  and  so on if applicable.  Specify -v more than
              once to increase verbosity.

       -T, --train SPAM NONSPAM [MAXROUNDS]
              Train the database using the two mbox folders SPAM and  NONSPAM,
              by testing each message in each folder and updating the database
              each time a message is miscategorised.   This  is  done  several
              times, and may take a while to run.  Specify the -a (allow-list)
              flag to add every sender in the NONSPAM folder  to  your  allow-
              list  as a side-effect of the training process.  If MAXROUNDS is
              specified, training will end after this number of rounds if  the
              results  are  still not good enough. The default is a maximum of
              200 rounds.

       -m, --mark-spam
              Instead of passing the message out on standard output, mark  its
              contents  as  spam  and update the database accordingly.  If the
              allow-list (-a) is enabled, the message’s "From:"  and  "Return-
              Path:"  addresses are removed from the allow-list.  If the deny-
              list (-y) is enabled and you specify  -m  twice,  the  message’s
              addresses are added to the deny-list instead.

       -M, --mark-nonspam
              Instead  of passing the message out on standard output, mark its
              contents as non-spam and update the  database  accordingly.   If
              the  allow-list  (-a)  is  enabled,  the  message’s  "From:" and
              "Return-Path:" addresses are added to the allow-list (see the -a
              option above).  If the deny-list (-y) is enabled and you specify
              -M twice, the message’s addresses are removed from the deny-list
              instead.

       -w, --weight WEIGHT
              When  marking  as  spam  or non-spam, update the database with a
              weighting of WEIGHT per token  instead  of  the  default  of  1.
              Useful when correcting mistakes, e.g. a message that has been
              mistakenly detected as spam should be marked as non-spam using a
              weighting  of  2, i.e. double the usual weighting, to counteract
              the error.

       -D, --dump [FILE]
              Dump the contents of the database as a platform-independent text
              file, suitable for archival, transfer to another machine, and so
              on.  The data is output on stdout or into the given FILE.

       -R, --restore [FILE]
              Rebuild the database from scratch from the text file  on  stdin.
              If  a  FILE  is  given,  data is read from there instead of from
              stdin.

       -O, --tokens
              Instead of filtering, output a list of the tokens found  in  the
              message read from standard input, along with the number of times
              each token was found.  This is only useful if you  want  to  use
              qsf  as  a  general  tokeniser  for  use  with another filtering
              package.

       -E, --merge OTHERDB
              Merge the OTHERDB database into the current database.  This  can
              be useful if you want to take one user’s database and merge it
              into the system-wide one, for instance (this would be  done  by,
              as  root,  doing  qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
              then removing /home/user/.qsfdb).

       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
              Benchmark the training process using the two mbox  folders  SPAM
              and  NONSPAM.  A temporary database is created and trained using
              the first 75% of the messages  in  each  folder,  and  then  the
              entire  contents  of each folder is tested to see how many false
              positives and false negatives occur. Some timing information  is
              also displayed.

              This can be used to decide which backend is best on your system.
              Use -d to select a backend, e.g. qsf -B spam nonspam -d GDBM -
              this   will   create  a  temporary  database  which  is  removed
              afterwards.

              The exception to  this  is  the  MySQL  backend,  where  a  full
              database      specification      must      be      given     (-d
              MySQL:database=db;host=localhost;...)  and  the  database  table
              given will not be wiped beforehand or dropped afterwards.

              As  with  -T,  if MAXROUNDS is specified, training will never be
              done for more than this number of rounds; the default is 200.

       -h, --help
              Print a usage message on standard output and exit  successfully.

       -V, --version
              Print   version  information,  including  a  list  of  available
              database backends, on standard output and exit successfully.
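
       The exit status documented for -t lends itself to scripting.  A
       minimal Python sketch follows, assuming qsf is on the PATH.  The
       function name classify and the choice to raise an error on any
       other exit status are illustrative assumptions, not part of qsf.

```python
import subprocess

def classify(message: bytes, command=("qsf", "-t")) -> bool:
    """Return True if the command reports spam (exit 1), False for exit 0."""
    result = subprocess.run(list(command), input=message,
                            stdout=subprocess.DEVNULL)
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # Anything else is treated as a failure here (an assumption).
    raise RuntimeError(f"unexpected exit status {result.returncode}")
```

       Passing the command as a parameter makes it easy to add options
       such as -a or -d, or to substitute a stub when testing.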

DEPRECATED OPTIONS

       The following options are  only  for  use  with  the  old  binary  tree
       database backend or old databases that haven’t been upgraded to the new
       format that came in with version 1.1.0.

       -N, --no-autoprune
              When marking as spam or nonspam, never automatically  prune  the
              database.  Usually the database is pruned after every 500 marks;
              if  you  would  rather  --prune  manually,  use  -N  to  disable
              automatic pruning.

       -p, --prune
              Remove  redundant  entries  from  the database and clean it up a
              little.  This is  automatically  done  after  several  calls  to
              --mark-spam  or --mark-nonspam, and during training with --train
              if the training takes a large number of  rounds,  so  it  should
              rarely be necessary to use --prune manually unless you are using
              -N / --no-autoprune.

       -X, --prune-max NUM
              When the database is being pruned, no more than NUM entries will
              be  considered  for  removal.  This is to prevent CPU and memory
              resources being taken over.  The default is 100,000 but in  some
              circumstances  (if  you  find  that pruning takes too long) this
              option may be used to reduce it to a more manageable number.

FILES

       /var/lib/qsfdb
              The default (system-wide) spam database.  If you wish to install
              qsf  system-wide,  this  should  be read-only to everyone; there
              should be one user with write access who  can  update  the  spam
              database with qsf --mark-spam and qsf --mark-nonspam when
              necessary.

       /var/lib/qsfdb2
              A second, read-only, system-wide database. This  can  be  useful
              when  installing  qsf  system-wide  and  using  third-party spam
              databases; the first global database can be updated with system-
              specific  changes,  and this second database can be periodically
              updated when the third-party spam database is updated.

       $HOME/.qsfdb
              The default spam database  for  per-user  data.   Users  without
              write  access  to  the system-wide database will have their data
              written here, and the two databases will be read together.   The
              per-user  database  will  be  given a weighting equivalent to 10
              times the weighting of the global database.
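
       As a rough illustration of the 10:1 weighting, the Python sketch
       below merges the counts for one token from the two databases.
       Treating the weighting as a simple weighted sum is an
       assumption, and combined_counts is a hypothetical name; the page
       does not say exactly how qsf applies the weighting.

```python
USER_WEIGHT = 10  # per-user data counts ten times as much as global data

def combined_counts(user, global_):
    """Merge (spam, nonspam) count pairs for one token."""
    u_spam, u_non = user
    g_spam, g_non = global_
    return (USER_WEIGHT * u_spam + g_spam,
            USER_WEIGHT * u_non + g_non)
```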

NOTES

       Currently, you cannot use qsf to check for spam while the  database  is
       being  updated.   This  means  that while an update is in progress, all
       email is passed through as non-spam.

       There is an upper size limit of 512KB on incoming email; anything
       larger  than this is just passed through as non-spam, to avoid tying up
       machine resources.

       The plaintext  token  mapping  maintained  by  --plain-map  will  never
       shrink,  only  grow.   It  is intended for use by housekeeping and user
       interface scripts that, for instance, the user  can  use  to  list  all
       email addresses on their allow-list.  These scripts should take care of
       weeding out entries for tokens that are no longer in the database.   If
       you  have no such scripts, there is probably no point in using --plain-
       map anyway.

       Avoid using the deny-list (-y) in any automated retraining, as it can
       cause the filter to reject mail unnecessarily.  In general the
       deny-list is probably best left unused unless explicitly required by
       your particular setup.

       If  both  the  allow-list  and  the  deny-list  are enabled, then email
       addresses will first be checked against the deny-list, then the  allow-
       list, then the domain of the email address will be checked for matching
       "@domain" entries in the deny-list and then in the allow-list.

EXAMPLES

       To filter all of your mail through qsf, with the allow-list enabled and
       the  "spam  rating"  header  being  added, add this to your .procmailrc
       file:

               :0 wf
               | qsf -ra

       If you want qsf to add "[SPAM]" to the subject line of any messages  it
       thinks are spam, do this instead:

               :0 wf
               | qsf -sra

       To  automatically mark any email sent to spambox@yourdomain.com as spam
       (this is the "naive" version):

               :0 H
               * ^To:.*spambox@yourdomain\.com
               | qsf -am

       To   do   the   same,   but   cleverly,   so   that   only   email   to
       spambox@yourdomain.com which qsf does NOT already classify as spam gets
       marked as spam in the database (this stops  the  database  getting  too
       heavily weighted):

               # If sent to spambox@yourdomain.com:
               :0
               * ^To:.*spambox@yourdomain\.com
               {
                  :0 wf
                  | qsf -a

                  # The above two lines can be skipped if you’ve
                  # already piped the message through qsf.

                  # If the qsf database says it’s not spam,
                  # mark it as spam!
                  :0 H
                  * ^X-Spam: NO
                  | qsf -am
               }

       Remove the -a option in the above examples if you don’t want to use the
       allow-list.

       A more complicated filtering example  -  this  will  only  run  qsf  on
       messages which don’t have a subject line saying "your <something> is on
       fire" and which don’t have a sender address  ending  in  "@foobar.com",
       meaning  that  messages  with  that subject line OR that sender address
       will NEVER be marked as spam, no matter what:

               :0 wf
               * ! ^Subject: Your .* is on fire
               * ! ^From: .*@foobar\.com
               | qsf -ra

       For  more  on  procmail(1)   recipes,   see   the   procmailrc(5)   and
       procmailex(5) manual pages.

       A couple of macros to add to your .muttrc file, if you use mutt(1) as a
       mail user agent:

               # Press F5 to mark a message as spam and delete it
               macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
               macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

               # Press F9 to mark a message as non-spam
               macro index <f9> "<pipe-message>qsf -aM\n"
               macro pager <f9> "<pipe-message>qsf -aM\n"

       Again, remove the -a option in the above examples if you don’t want  to
       use the allow-list.

       Note,  however,  that  the  above  macros  won’t work when operating on
       multiple tagged messages. For that, you’d need something like this:

               macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"

       If you use qmail(7), then to get procmail working with it you will need
       to  put  a  line  containing just DEFAULT=./Maildir/ at the top of your
       ~/.procmailrc file, so that procmail delivers to  your  Maildir  folder
       instead  of  trying  to  deliver to /var/spool/mail/$USER, and you will
       need to put this in your ~/.qmail file:

               | preline procmail

       This will cause all your mail to be delivered via procmail  instead  of
       being delivered directly into your mail directory.

       See the qmail(7) documentation for more about mail delivery with qmail.

       If you use postfix(1), you can set up  a  system-wide  mail  filter  by
       creating  a  user account for the purpose of filtering mail, populating
       that account’s .qsfdb, and then creating a shell script, to run as that
       user, which runs qsf on stdin and passes stdout to sendmail(8).

       Doing  this  requires  some knowledge of postfix configuration and care
       needs to be taken to avoid mail loops.  One qsf user’s  full  HOWTO  is
       included in the doc/ directory with this package.

THE ALLOW-LIST

       A  feature called the "allow-list" can be switched on by specifying the
       --allowlist or -a option.  This causes messages’ "From:"  and  "Return-
       Path:"  addresses  to be checked against a list of people you have said
       to allow all messages from, and if  a  message’s  "From:"  or  "Return-
       Path:"  address is in the list, it is never marked as spam.  This means
       you can add all your friends to an "allow-list" and qsf will then never
       mis-file  their  messages - a quick way to do this is to use -a with -T
       (train); everyone in your non-spam folder who has  sent  you  an  email
       will be added to the allow-list automatically during training.

       You  can  manually  add and remove addresses to and from the allow-list
       using the -e (email) option. For instance, to add  foo@bar.com  to  the
       allow-list, do this:

               qsf -e foo@bar.com -M

       To remove bad@nasty.com from the allow-list, do this:

               qsf -e bad@nasty.com -m

       And  to  see whether someone@somewhere.com is in the allow-list or not,
       just do this:

               qsf -e someone@somewhere.com

       In general, you probably always  want  to  enable  the  allow-list,  so
       always  specify  the -a option when using qsf.  This will automatically
       maintain the allow-list based on what you classify as spam or non-spam.

       The  only  times  you might want to turn it off are when people on your
       allow-list are prone to getting viruses or if a virus is causing  email
       to  be sent to you that is pretending to be from someone on your allow-
       list.

BACKUP AND RESTORE

       Because the database format is platform-specific, it is a good idea  to
       periodically  dump the database to a text file using qsf -D so that, if
       necessary, it can be transferred to another machine and  restored  with
       qsf -R later on.

       Also  note  that  since the actual contents of email messages are never
       stored in the database (see TECHNICAL DETAILS), you  can  safely  share
       your  qsf  database with friends - simply dump your database to a file,
       like this:

               qsf -D > your-database-dump.txt

       Once you have sent your-database-dump.txt to another person,  they  can
       do this:

               qsf -R < your-database-dump.txt

       They will then have an identical database to yours.

TECHNICAL DETAILS

       When  a message is passed to qsf, any attachments are decoded, all HTML
       elements are removed, and the message  text  is  then  broken  up  into
       "tokens",  where  a  "token"  is  a  single word or URL.  Each token is
       hashed using the MD5 algorithm (see below for why), and  that  hash  is
       then used to look up each token in the qsf database.
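
       The hashing step can be sketched in Python as follows.
       Splitting on whitespace is a simplification of the real
       tokeniser, and hashed_tokens is a hypothetical name; only the
       use of MD5 comes from this page.

```python
import hashlib

def hashed_tokens(text):
    """Split text into words and return the MD5 hex digest of each."""
    return [hashlib.md5(word.encode("utf-8")).hexdigest()
            for word in text.split()]
```

       Identical words always produce identical hashes, which is what
       allows the hash to serve as the database key.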

       For   full   details  of  which  parts  of  an  email  (headers,  body,
       attachments, etc) are used  to  calculate  the  spam  rating,  see  the
       TOKENISATION section below.

       Within the database, each token has two numbers associated with it: the
       number of times that token has been seen in spam,  and  the  number  of
       times  it has been seen in non-spam.  These two numbers, along with the
       total number of spam and non-spam messages seen, are then used to  give
       a  "spamminess"  value  for  that  particular token.  This "spamminess"
       value ranges from "definitely not spammy" at  one  end  of  the  scale,
       through "neutral" in the middle, up to "definitely spammy" at the other
       end.
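
       This page does not give the formula qsf uses.  Purely as an
       illustration of how four such numbers can yield a value on that
       scale, the Python sketch below uses a simple relative-frequency
       ratio, with 0.0 as "definitely not spammy", 0.5 as "neutral"
       and 1.0 as "definitely spammy".  Both the formula and the name
       spamminess are assumptions.

```python
def spamminess(spam_count, nonspam_count, total_spam, total_nonspam):
    """Return a value in [0, 1] for one token; 0.5 means neutral."""
    if total_spam == 0 or total_nonspam == 0:
        return 0.5                      # nothing to compare against yet
    spam_freq = spam_count / total_spam
    nonspam_freq = nonspam_count / total_nonspam
    if spam_freq + nonspam_freq == 0:
        return 0.5                      # token never seen before
    return spam_freq / (spam_freq + nonspam_freq)
```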

       Once a "spamminess" value has been calculated for all of the tokens  in
       the  message, a summary calculation is made to give an overall "is this
       spam?"  probability rating for the message.  If the overall probability
       is 0.9 or above, the message is flagged as spam.

       In  addition  to  the probability test is the "allow-list".  If enabled
       (with the -a option), the whole probability check  is  skipped  if  the
       sender  of  the message is listed in the allow-list, and the message is
       not marked as spam.
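
       When the deny-list is enabled as well, the NOTES section gives
       the order in which the lists are consulted: the full address
       against the deny-list, then the allow-list, then the "@domain"
       form against each in turn.  A Python sketch of that order is
       shown below, using plain sets of strings for readability (the
       real lists are stored as hashes); check_lists is a hypothetical
       name.

```python
def check_lists(address, deny, allow):
    """Return "deny", "allow", or None, following the documented order."""
    domain = "@" + address.rpartition("@")[2]
    checks = (
        (address, deny, "deny"),
        (address, allow, "allow"),
        (domain, deny, "deny"),
        (domain, allow, "allow"),
    )
    for key, entries, verdict in checks:
        if key in entries:
            return verdict
    return None
```

       Note that an allow-list entry for a full address takes effect
       before a deny-list entry for its whole domain.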

       When training the database, a  message  is  split  up  into  tokens  as
       described  above,  and  then the numbers in the database for each token
       are simply added to: if you tell qsf that a message is  spam,  it  adds
       one  to  the "number of times seen in spam" counter for each token, and
       if you tell it a message is not spam, it adds one  to  the  "number  of
       times  seen  in  non-spam"  counter  for  each token.  If you specify a
       weight, with -w, then the number you specify is added instead of one.
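
       The counter update can be sketched in Python with a dictionary
       mapping each token to a [spam, non-spam] pair.  Counting each
       distinct token once per message is an assumption, and mark is a
       hypothetical name.

```python
from collections import defaultdict

def mark(db, tokens, as_spam, weight=1):
    """Add weight to the spam or non-spam counter of each distinct token."""
    index = 0 if as_spam else 1
    for token in set(tokens):
        db[token][index] += weight

# Each token starts with zero sightings on both sides.
db = defaultdict(lambda: [0, 0])
```

       For example, mark(db, tokens, False, weight=2) corresponds to
       the correction described under -w.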

       To stop the database growing uncontrollably, the database  keeps  track
       of  when  a  token  was  last used.  Underused tokens are automatically
       removed from the database.  (The old method was to "prune" every 500
       updates.)

       Finally,  the  reason  MD5  hashes were used is privacy.  If the actual
       tokens from the messages, and the actual email addresses in the  allow-
       list,  were  stored,  you could not share a single qsf database between
       multiple users because bits of everyone’s  messages  would  be  in  the
       database - things like emailed passwords, keywords relating to personal
       gossip, and so on.  So a hash is stored instead.  A hash is a "one-way"
       function;  it  is  easy to turn a token into a hash but very hard (some
       might say impossible) to turn a hash back into the token  that  created
       it.   This  means  that  you  end  up  with a database with no personal
       information in it.

TOKENISATION

       When a message is broken up into tokens, various parts of  the  message
       are treated in different ways.

       First,  all header fields are discarded, except for the important ones:
       From, Return-Path, Sender, To, Reply-To, and Subject.
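
       The header selection can be sketched with Python's standard
       email package.  The function name kept_headers is hypothetical,
       and matching header names case-insensitively is an assumption.

```python
from email import message_from_string

KEPT = {"from", "return-path", "sender", "to", "reply-to", "subject"}

def kept_headers(raw_message):
    """Return (name, value) pairs for the headers that are tokenised."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if name.lower() in KEPT]
```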

       Next, any MIME-encoded attachments are decoded.  Any attachments  whose
       MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
       having  any  HTML  tags  stripped.   Any  non-textual  attachments  are
       replaced  with their MD5 hash (such that two identical attachments will
       have the same hash), and that hash is then used as a token.

       In addition to single-word tokens from textual message parts, qsf  adds
       doubled-up  tokens  so that word pairs get added to the database.  This
       makes the database a bit bigger (although the automatic  pruning  tends
       to take care of that) but makes matching more exact.
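
       A Python sketch of the word-pair idea follows.  Joining the two
       words with a space is an assumption about the pair format, and
       with_pairs is a hypothetical name.

```python
def with_pairs(words):
    """Return the single-word tokens followed by adjacent word pairs."""
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs
```

       A message of n words therefore yields up to 2n - 1 tokens, which
       is why the database grows.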

SPECIAL FILTERS

       As  well as using the textual content of email to detect spam, qsf also
       uses special filters which  create  "pseudo-tokens"  based  on  various
       rules.   This  means that specific patterns, not just individual words,
       can be used to determine whether a message is spam or not.

       For example, if a message contains lots of words with several consonants
       in a row, like "ashjkbnxcsdjh", then each time a word like that is
       seen the special token ".GIBBERISH-CONSONANTS." is added to the list of
       tokens  found  in the message.  If it turns out that most messages with
       words that trigger this filter rule are spam, then other messages  with
       gibberish  consonant strings will be more likely to be flagged as spam.
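
       A filter of this kind can be sketched in Python as below.  The
       minimum run length of six consonants and the decision not to
       count "y" as a consonant are assumptions, as this page does not
       state the real limits; gibberish_tokens is a hypothetical name.

```python
import re

# Six or more consonants in a row (threshold is an assumption).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{6,}", re.IGNORECASE)

def gibberish_tokens(words):
    """Emit one pseudo-token for every word containing a consonant run."""
    return [".GIBBERISH-CONSONANTS." for word in words
            if CONSONANT_RUN.search(word)]
```

       The pseudo-tokens are then counted like any other token, so the
       filter only influences the rating once training has shown that
       such words correlate with spam.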

       Currently the special filters are:

       GTUBE  Flags      any      message      containing      the      string
              XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-
              EMAIL*C.34X  as  spam  -  useful  for  testing  that  your   qsf
              installation is working.

       ATTACH-SCR

       ATTACH-PIF

       ATTACH-EXE

       ATTACH-VBS

       ATTACH-VBA

       ATTACH-LNK

       ATTACH-COM

       ATTACH-BAT
              Adds a token for every attachment whose filename ends in ".scr",
              ".pif", ".exe",  ".vbs",  ".vba",  ".lnk",  ".com",  and  ".bat"
              respectively (these are often viruses).

       ATTACH-GIF

       ATTACH-JPG

       ATTACH-PNG
              Adds a token for every attachment whose filename ends in ".gif",
              ".jpg" or ".jpeg", and ".png" respectively.

       ATTACH-DOC

       ATTACH-XLS

       ATTACH-PDF
              Adds a token for every attachment whose filename ends in ".doc",
              ".xls",  or  ".pdf"  respectively (these tend to indicate a non-
              spam email).

       SINGLE-IMAGE
              Adds a token if the message contains exactly one attached image.

       MULTIPLE-IMAGES
              Adds  a  token  if  the  message contains more than one attached
              image.

       GIBBERISH-CONSONANTS
              Adds a token for every word found that has  multiple  consonants
              in  a  row,  as described above.  Spam often contains strings of
              gibberish.

       GIBBERISH-VOWELS
              Adds a token for every word found that has multiple vowels in  a
              row, eg "aeaiaiaeeio".

       GIBBERISH-FROMCONS
              Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
              Path:" addresses on their own.

       GIBBERISH-FROMVOWL
              Like GIBBERISH-VOWELS, but only for  the  "From:"  and  "Return-
              Path:" addresses on their own.

       GIBBERISH-BADSTART
              Adds  a  token  for  every word that starts with a bad character
              such as %.

       GIBBERISH-HYPHENS
              Adds a token for every word with  more  than  three  hyphens  or
              underscores in it.

       GIBBERISH-LONGWORDS
              Adds a token for every word with over 30 characters in it (but
              fewer than 60).

       HTML-COMMENTS-IN-WORDS
              Adds a token for every HTML comment found in  the  middle  of  a
              word.   Spam  often  contains  HTML  inside  words,  like  this:
              w<!--dsgfhsdgjgh-->ord

       HTML-EXTERNAL-IMG
              Adds a token  for  every  HTML  <img>  (image)  tag  found  that
              contains :// (i.e.  it refers to an external image).

       HTML-FONT
              Adds a token for every HTML <font> tag found.

       HTML-IP-IN-URLS
              Adds a token for every URL found containing an IP address.

       HTML-INT-IN-URL
              Adds  a  token  for every URL found containing an integer in its
              hostname.

       HTML-URLENCODED-URL
              Adds a token for every URL found containing  a  %  sign  in  its
              hostname.

       Normally, filters will just cause a token to be added, and these tokens
       are processed by the normal weighting algorithm.  However, the GTUBE
       filter will immediately flag any matching message as spam, bypassing
       the token matching.
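
       The difference between the two kinds of filter can be sketched as
       follows.  This is a simplified illustration, not qsf's implementation:
       the function names are hypothetical, only a few of the rules listed
       above are shown, and writing every pseudo-token with a leading and
       trailing dot is an assumption based on the ".GIBBERISH-CONSONANTS."
       example.

```python
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"

# Filename extension -> filter name, as listed above (".jpeg" maps to ATTACH-JPG).
ATTACH = {ext: "ATTACH-" + ext.upper() for ext in
          "scr pif exe vbs vba lnk com bat gif jpg png doc xls pdf".split()}
ATTACH["jpeg"] = "ATTACH-JPG"


def pseudo(name):
    # Assumption: all pseudo-tokens are spelled like ".GIBBERISH-CONSONANTS.".
    return "." + name + "."


def special_tokens(body, attachment_names):
    """Return (forced_spam, tokens) for one message."""
    if GTUBE in body:
        # GTUBE short-circuits: the message is spam, no tokens are weighed.
        return True, []
    tokens = []
    for word in body.split():
        if word.count("-") + word.count("_") > 3:
            tokens.append(pseudo("GIBBERISH-HYPHENS"))
        if 30 < len(word) < 60:
            tokens.append(pseudo("GIBBERISH-LONGWORDS"))
    for filename in attachment_names:
        extension = filename.lower().rpartition(".")[2]
        if "." in filename and extension in ATTACH:
            tokens.append(pseudo(ATTACH[extension]))
    return False, tokens
```

       When the first element of the result is true, the token list is
       ignored; otherwise the tokens are handed to the normal weighting along
       with the ordinary words of the message.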

DATABASE BACKENDS

       The inbuilt "list" database backend will not  necessarily  provide  the
       best performance, but is provided because using it requires no external
       libraries.

       If, when qsf was compiled, the correct libraries were  available,  then
       it  will be possible to use qsf with alternative database backends.  To
       find out which backends you have available, run qsf -V (capital V)  and
       read  the  second  line of output.  To see how well a backend performs,
       collect some spam and non-spam and use qsf -d BACKEND -B  SPAM  NONSPAM
       (see the entry for -B above).

       Some people find that they get the best performance out of the gdbm
       backend; gdbm is a library that is available on many systems.

       To efficiently share a qsf database across multiple machines,  you  may
       find  the  MySQL  backend  useful.   However, using it is a little more
       complicated.

       To use the MySQL backend you will need to create a table with the
       fields key1, key2, token, value1, value2 and value3.  The token field
       must be VARCHAR(64), and the value1, value2 and value3 fields must each
       be BIGINT or INT; indexing on the token field is a good idea.  The key1
       and key2 fields can be of any type, but they must be present.

       For example:

                USE mydatabase;
                CREATE TABLE qsfdb (
                  key1      BIGINT UNSIGNED NOT NULL,
                  key2      BIGINT UNSIGNED NOT NULL,
                  token     VARCHAR(64) DEFAULT '' NOT NULL,
                  value1    INT UNSIGNED NOT NULL,
                  value2    INT UNSIGNED NOT NULL,
                  value3    INT UNSIGNED NOT NULL,
                  PRIMARY KEY (key1,key2,token),
                  KEY (key1),
                  KEY (key2),
                  KEY (token)
                );

       The  key1  and  key2 fields allow you to have multiple qsf databases in
       one table, by specifying different key1 and key2 values on  invocation.

       Instead  of specifying a database file with the --database / -d option,
       you must specify either a specification string as described  below,  or
       the name of a file containing such a string on its first line.

       The specification string is as follows:

                database=DATABASE;host=HOST;port=PORT;
                user=USER;pass=PASS;table=TABLE;
                key1=KEY1;key2=KEY2

       This string must be all on one line, with no spaces; it is wrapped
       above only to fit the page.

       DATABASE
              is the name of the MySQL database.

       HOST   is the hostname of the database server (eg "localhost").

       PORT   is the TCP port to connect on (eg 3306).

       USER   is the username to connect with.

       PASS   is the password to connect with.

       TABLE  is  the  database  table to use.  If a table with this name does
              not exist when qsf is called in update or training mode, then it
              will be created if permissions allow this to be done.

       KEY1   is the value to use for the key1 field.

       KEY2   is the value to use for the key2 field.

       Since  command  lines  can  be seen in the process list, it is probably
       best to specify a filename (eg qsf -d  mysql:qsfdb.spec)  and  put  the
       specification string inside that file.
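
       One way to produce such a file is sketched below.  The helper names
       and the sample values are hypothetical, the 0600 file mode is a
       suggested precaution rather than something qsf requires, and rejecting
       ";" inside a value is an assumption based on its use as the field
       separator.

```python
import os

# Field order as given in the specification string above.
FIELDS = ("database", "host", "port", "user", "pass", "table", "key1", "key2")


def build_spec(values):
    """Join the fields into the single-line specification string."""
    parts = []
    for name in FIELDS:
        value = str(values[name])
        if any(ch.isspace() for ch in value) or ";" in value:
            raise ValueError("%s must not contain whitespace or ';'" % name)
        parts.append("%s=%s" % (name, value))
    return ";".join(parts)


def write_spec(path, spec):
    """Write spec as the first line of path.

    A newly created file gets mode 0600 (less any umask bits); the mode of
    an existing file is left unchanged.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(spec + "\n")


# Placeholder values for illustration only.
example = build_spec({
    "database": "mydatabase", "host": "localhost", "port": 3306,
    "user": "qsfuser", "pass": "secret", "table": "qsfdb",
    "key1": 1, "key2": 0,
})
```

       Calling write_spec("qsfdb.spec", example) would then allow qsf to be
       run as qsf -d mysql:qsfdb.spec, keeping the password out of the
       process list.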

TROUBLESHOOTING

       If  you  have  problems  with qsf, please check the list below; if this
       does not help, go to the qsf home  page  and  investigate  the  mailing
       lists, or email the author.

       Nothing is being marked as spam.
              First,  use the -r option to switch on the X-Spam-Rating header,
              and check that this header appears in email passed through  qsf.
              If  it  does not, then it is likely that qsf is not being run at
              all - check your configuration of procmail(1) or its equivalent.

              If  you  are  seeing X-Spam-Rating headers, and different emails
              have different scores, then you may simply need to retrain  your
              database a little more.  Take more spam email and pass it to qsf
              -m.

              If you are seeing X-Spam-Rating headers but they  all  give  the
              same spam rating, then the most likely reason is that qsf is not
              reading any database.  Make sure that whatever is processing the
              email  has  read permissions on /var/lib/qsfdb and/or ~/.qsfdb -
              and make sure  that,  if  you  are  using  ~/.qsfdb,  what  your
              database  creator thought was ~ ($HOME) is the same as it is for
              whatever is processing the email.

       Retraining sometimes takes a very long time.
              With the obtree backend or  2-column  MySQL  or  SQLite  tables,
              every 500th retrain (-m or -M), the database is pruned.  On some
              systems this may take  some  time,  and  during  this  time  the
              database  is  locked  (except  when  using  the  MySQL or SQLite
              backends).  If you do a lot of retraining and want to avoid
              this, then use the -N option to suppress auto-pruning, and have
              a cron(8) job or similar run a manual prune (qsf -p)
              periodically.

       Running qsf from procmail fails with an error.
              If  you  can run qsf from the command line, but in your procmail
              log file you get errors about "qsf: cannot execute binary file",
              then  contact your system administrator for help. It may be that
              incoming email is handled by a different server to the  one  you
              normally  shell  into,  and  either  they  are  of  a  different
              architecture or operating system, or  the  mail  server  is  not
              permitted to execute user-owned binaries.
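
       The checks described under "Nothing is being marked as spam" above can
       be run over a batch of already-filtered messages.  The sketch below is
       only an illustration: the function name and the returned labels are
       hypothetical, and the rating is treated as an opaque string because
       its exact format is not described in this section.

```python
from email import message_from_string


def diagnose(raw_messages):
    """Classify a batch of filtered messages by their X-Spam-Rating headers."""
    ratings = []
    for raw in raw_messages:
        value = message_from_string(raw).get("X-Spam-Rating")
        if value is not None:
            ratings.append(value.strip())
    if not ratings:
        # No header at all: qsf (with -r) is probably not being run.
        return "not-run"
    if len(ratings) < 2:
        # A single rating cannot show whether scores vary.
        return "inconclusive"
    if len(set(ratings)) == 1:
        # Every message scored the same: the database is probably unread.
        return "no-database"
    # Scores differ between messages: more training with qsf -m may help.
    return "needs-training"
```

       Each message is passed in as a string containing its full headers and
       body, for example as read from individual files in a mail folder.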

ACKNOWLEDGEMENTS

       The  following  people have contributed suggestions, comments, patches,
       and testing:

              Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
              Dr Kelly A. Parker
              Vesselin Mladenov <http://www.antipodes.bg/>
              Glyn Faulkner
              Mark Reynolds
              Sam Roberts
              Scott Allen
              Karsten Kankowski
              M. Kolbl
              Micha Holzmann
              Jef Poskanzer <http://www.acme.com/jef/>
              Clemens Fischer <http://ino-waiting.gmxhome.de/>
              Nelson A. de Oliveira
              Michal Vitecek
              Tommy Pettersson <http://www.lysator.liu.se/~ptp/>

AUTHOR

       The author:

              Andrew Wood <andrew.wood@ivarch.com>
              http://www.ivarch.com/

       Project home page:

              http://www.ivarch.com/programs/qsf/

BUGS

       If you find any bugs, please contact the author, either by email or  by
       using the contact form on the web site.

LICENSE

       This is free software, distributed under the ARTISTIC 2.0 license.
