NAME
dbacl - a digramic Bayesian classifier for text recognition.
SYNOPSIS
dbacl [-01dvnirmwMNDXW] [-T type ] -l category [-h size] [-H gsize] [-x
decim] [-q quality] [-w max_order] [-e deftok] [-o online] [-L
measure] [-g regex]... [FILE]...
dbacl [-vnimNRX] [-h size] [-T type] -c category [-c category]... [-f
keep]... [FILE]...
dbacl -V
OVERVIEW
dbacl is a Bayesian text and email classifier. When using the -l
switch, it learns a body of text and produces a file named category
which summarizes the text. When using the -c switch, it compares an
input text stream with any number of category files, and outputs the
name of the closest match, or optionally various numerical scores
explained below.
Whereas this manual page is intended as a reference, there are several
tutorials and documents you can read to get specialized information.
Specific documentation about the design of dbacl and the statistical
models that it uses can be found in dbacl.ps. For a basic overview of
text classification using dbacl, see tutorial.html. A companion
tutorial geared towards email filtering is email.html. If you have
trouble getting dbacl to classify reliably, read is_it_working.html.
The USAGE section of this manual page also has some examples.
/usr/share/dbacl/doc/dbacl.ps
/usr/share/dbacl/doc/tutorial.html
/usr/share/dbacl/doc/email.html
/usr/share/dbacl/doc/is_it_working.html
dbacl uses a maximum entropy (minimum divergence) language model
constructed with respect to a digramic reference measure (unknown
tokens are predicted from digrams, i.e. pairs of letters). Practically,
this means that a category is constructed from tokens in the training
set, while previously unseen tokens can be predicted automatically from
their letters. A token here is either a word (fragment) or a
combination of words (fragments), selected according to various
switches. Learning roughly works by tweaking token probabilities until
the training data is least surprising.
EXIT STATUS
The normal shell exit conventions aren’t followed (sorry!). When using
the -l command form, dbacl returns zero on success, nonzero if an error
occurs. When using the -c form, dbacl returns a positive integer
corresponding to the category with the highest posterior probability.
In case of a tie, the first most probable category is chosen. If an
error occurs, dbacl returns zero.
DESCRIPTION
When using the -l command form, dbacl learns a category when given one
or more FILE names, which should contain readable ASCII text. If no
FILE is given, dbacl learns from STDIN. If FILE is a directory, it is
opened and all its files are read, but not its subdirectories. The
result is saved in the binary file named category, and completely
replaces any previous contents. As a convenience, if the environment
variable DBACL_PATH contains a directory, then that is prepended to the
file path, unless category starts with a ’/’ or a ’.’.
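For example, with categories kept in a common directory (the directory
name below is illustrative):
% DBACL_PATH=$HOME/.dbacl dbacl -l twain Mark_Twain.txt
The category file is then written to $HOME/.dbacl/twain.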
The input text for learning is assumed to be unstructured plain text by
default. This is not suitable for learning email, because email
contains various transport encodings and formatting instructions which
can reduce classification effectiveness. You must use the -T switch in
that case so that dbacl knows it should perform decoding and filtering
of MIME and HTML as appropriate. Appropriate switch values are "-T
email" for RFC2822 email input, "-T html" for HTML input, "-T xml" for
generic XML style input, and "-T text" for the default plain text
format.
There are other values of the -T switch that also allow fine tuning of
the decoding capabilities.
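For example, to learn a spam category from a mailbox file:
% dbacl -T email -l spam $MAILDIR/spam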
When using the -c command form, dbacl attempts to classify the text
found in FILE, or STDIN if no FILE is given. Each possible category
must be given separately, and should be the file name of a previously
learned text corpus. As a convenience, if the variable DBACL_PATH
contains a directory, it is prepended to each file path which doesn’t
start with a ’/’ or a ’.’. The visible output of the classification
depends on the combination of extra switches used. If no switch is
used, then no output is shown on STDOUT. However, dbacl always produces
an exit code which can be tested.
To see an output for a classification, you must use at least one of the
-v,-U,-n,-N,-D,-d switches. They can sometimes be used in combination
to produce natural variations of their individual outputs. Where
applicable, dbacl also prints warnings on STDERR.
The -v switch outputs the name of the best category among all the
choices given.
The -U switch outputs the name of the best category followed by a
confidence percentage. Normally, this is the switch that you want to
use. A percentage of 100% means that dbacl is sure of its choice, while
a percentage of 0% means that some other category is equally likely.
This is not the model probability, but measures how unambiguous the
classification is, and can be used to tag unsure classifications (e.g.
if the confidence is 25% or less).
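For example (the category names and the reported percentage are
illustrative):
% dbacl -U -c work -c personal message.txt
work # 85%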
The -N switch prints each category name followed by its (posterior)
probability, expressed as a percentage. The percentages always sum to
100%. This is intuitive, but only valuable if the document being
classified contains a handful of tokens (ten or less). In the common
case with many more tokens, the probabilities are always extremely
close to 100% and 0%.
The -n switch prints each category name followed by the negative
logarithm of its probability. This is equivalent to using the -N
switch, but much more useful. The smallest number gives the best
category. A more convenient form is to use both -n and -v which prints
each category name followed by the cross entropy and the number of
tokens analyzed. The cross entropy measures (in bits) the average
compression rate which is achievable, under the given category model,
per token of input text. If you use all three of -n,-v,-X then an extra
value is output for each category, representing a kind of p-value for
each category score. This indicates how typical the score is compared
to the training documents, but only works if the -X switch was used
during learning, and only for some types of models (e.g. email). These
p-values are uniformly distributed and independent (if the categories
are independent), so can be combined using Fisher’s chi squared test to
obtain composite p-values for groupings of categories.
The -v and -X switches together print each category name followed by a
detailed decomposition of the category score, factored into (
divergence rate + shannon entropy rate )* token count @ p-value. Again,
this only works in some types of models.
The -v and -U switches print each category name followed by a
decomposition of the category score into ( divergence rate + shannon
entropy rate # score variance )* token count.
The -D switch prints out the input text as modified internally by dbacl
prior to tokenization. For example, if a MIME encoded email document is
classified, then this prints the decoded text that will be actually
tokenized and classified. This switch is mainly useful for debugging.
The -d switch dumps tokens and scores while they are being read. It is
useful for debugging, or if you want to create graphical
representations of the classification. A detailed explanation of the
output is beyond the scope of this manual page, but is straightforward
if you’ve read dbacl.ps. Possible variations include -d together with
-n or -N.
Classification can be done with one or several categories in principle.
When two or more categories are used, the Bayesian posterior
probability is used, given the input text, with a uniform prior
distribution on categories. For other choices of prior, see the
companion utility bayesol(1). When a single category is used,
classification can be done by comparing the score with a threshold. In
practice however, much better results are obtained with several
categories.
Learning and classifying cannot be mixed in the same command
invocation; however, there are no locking issues, and separate dbacl
processes can operate simultaneously with predictable results, because
file operations are designed to be atomic.
Finally, note that dbacl does not manage your document corpora or your
computed categories, and in particular it does not allow you to extend
an existing category file with new documents. This is unlike various
current spam filters, which can learn new emails incrementally. This
limitation of dbacl is partially due to the nonlinear procedure used in
the learning algorithm, and partially a desire for increased
flexibility.
You can simulate the effect of incremental learning by saving your
training documents into archives and adding to these archives over
time, relearning from scratch periodically. Learning is actually faster
if these archives are compressed and decompressed on the fly when
needed. By keeping control of your archives, you can never lose the
information in your categories, and you can easily experiment with
different switches or tokenizations or sets of training documents if
you like.
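For example, assuming the training corpus is kept in a compressed
archive (the file name is illustrative):
% zcat twain_archive.txt.gz | dbacl -l twain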
SECONDARY SWITCHES
By default, dbacl classifies the input text as a whole. However, when
using the -f option, dbacl can be used to filter each input line
separately, printing only those lines which match one or more models
identified by keep (use the category name or number to refer to a
category). This is useful if you want to filter out some lines, but
note that if the lines are short, then the error rate can be high.
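For example, assuming two previously learned categories named good and
junk (illustrative names), the following keeps only the lines which
best match good:
% dbacl -c good -c junk -f good notes.txt > notes_good.txt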
The -e,-w,-g,-j switches are used for selecting an appropriate
tokenization scheme. A token is a word or word fragment or combination
of words or fragments. The shape of tokens is important because it
forms the basis of the language models used by dbacl. The -e switch
selects a predefined tokenization scheme, which is speedy but limited.
The -w switch specifies composite tokens derived from the -e switch.
For example, "-e alnum -w 2" means that tokens should be alphanumeric
word fragments combined into overlapping pairs (bigrams). By default,
all tokens are converted to lowercase, which reduces the number of
possible tokens and therefore memory consumption; the -j switch makes
tokens case sensitive instead (see the -j entry below).
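For example, the following learns a category from overlapping
alphanumeric bigrams (using the sample corpus from the USAGE section
below):
% dbacl -e alnum -w 2 -l twain Mark_Twain.txt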
If the -g switch is used, you can completely specify what the tokens
should look like using a regular expression. Several -g switches can be
used to construct complex tokenization schemes, and parentheses within
each expression can be used to select fragments and combine them into
n-grams. The cost of such flexibility is reduced classification and
learning speed. When experimenting with tokenization schemes, try using
the -d or -D switches while learning or classifying, as they will print
the tokens explicitly so you can see what text fragments are picked up
or missed out. For regular expression syntax, see regex(7).
The -h and -H switches regulate how much memory dbacl may use for
learning. Text classification can use a lot of memory, and by default
dbacl limits itself even at the expense of learning accuracy. In many
cases if a limit is reached, a warning message will be printed on
STDERR with some advice.
When relearning the same category several times, a significant speedup
can be obtained by using the -1 switch, as this allows the previously
learned probabilities to be read from the category and reused.
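For example, when relearning an existing email category (paths as in
the USAGE section below):
% dbacl -1 -T email -l $CATS/spam $MAILDIR/spam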
Note that classification accuracy depends foremost on the amount and
quality of the training samples, and only secondarily on the amount of
switch tweaking.
OPTIONS
-0 When learning, prevents weight preloading. Normally, dbacl
checks if the category file already exists, and if so, tries to
use the existing weights as a starting point. This can
dramatically speed up learning. If the -0 (zero) switch is set,
then dbacl behaves as if no category file already exists. This
is mainly useful for testing. This switch is now enabled by
default, to protect against weight drift which can reduce
accuracy over many learning iterations. Use -1 to force
preloading.
-1 Force weight preloading if the category file already exists. See
discussion of the -0 switch.
-a Append scores. Every input line is written to STDOUT and the
dbacl scores are appended. This is useful for postprocessing
with bayesol(1). For ease of processing, every original input
line is indented by a single space (to distinguish them from the
appended scores), and the line with the scores (if -n is used)
is prefixed with the string "scores ". If a second copy of dbacl
needs to read this output later, it should be invoked with the
-A switch.
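For example, a sketch of postprocessing with bayesol(1), where
message.txt and the risk specification file my.risk are illustrative:
% dbacl -T email -c work -c spam -n -a message.txt | bayesol -c
my.risk -v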
-d Dump the model parameters to STDOUT. In conjunction with the -l
option, this produces a human-readable summary of the maximum
entropy model. In conjunction with the -c option, displays the
contribution of each token to the final score. Suppresses all
other normal output.
-e Select character class for default (not regex-based)
tokenization. By default, tokens are alphabetic strings only.
This corresponds to the case when deftok is "alpha". Possible
values for deftok are "alpha", "alnum", "graph", "char", "cef"
and "adp". The last two are custom tokenizers intended for
email messages. See also isalpha(3). The "char" tokenizer
picks up single printable characters rather than bigger tokens,
and is intended for testing only.
-f Filter each line of input separately, passing to STDOUT only
lines which match the category identified as keep. This option
should be used repeatedly for each category which must be kept.
keep can be either the category file name, or a positive integer
representing the required category in the same order it appears
on the command line.
Output lines are flushed as soon as they are written. If the
input file is a pipe or character device, then an attempt is
made to use line buffering mode, otherwise the more efficient
block buffering is used.
-g Learn only features described by the extended regular expression
regex. This overrides the default feature selection method (see
-w option) and learns, for each line of input, only tokens
constructed from the concatenation of strings which match the
tagged subexpressions within the supplied regex. All substrings
which match regex within a suffix of each input line are treated
as features, even if they overlap on the input line.
As an optional convenience, regex can include the suffix ||xyz
which indicates which parenthesized subexpressions should be
tagged. In this case, xyz should consist exclusively of digits 1
to 9, numbering exactly those subexpressions which should be
tagged. Alternatively, if no parentheses exist within regex,
then it is assumed that the whole expression must be captured.
-h Set the size of the hash table to 2^size elements. When using
the -l option, this refers to the total number of features
allowed in the maximum entropy model being learned. When using
the -c option together with the -M switch and multinomial type
categories, this refers to the maximum number of features taken
into account during classification. Without the -M switch, this
option has no effect when classifying.
-i Fully internationalized mode. Forces the use of wide characters
internally, which is necessary in some locales. This incurs a
noticeable performance penalty.
-j Make features case sensitive. Normally, all features are
converted to lower case during processing, which reduces storage
requirements and improves statistical estimates for small
datasets. With this option, the original capitalization is used
for each feature. This can improve classification accuracy.
-m Aggressively maps categories into memory and locks them into RAM
to prevent swapping, if possible. This is useful when speed is
paramount and memory is plentiful, for example when testing the
classifier on large datasets.
Locking may require relaxing user limits with ulimit(1). Ask
your system administrator. Beware when using the -m switch
together with the -o switch, as only one dbacl process must
learn or classify at a time to prevent file corruption. If no
learning takes place, then the -m switch for classifying is
always safe to use. See also the discussion for the -o switch.
-n Print scores for each category. Each score is the product of
two numbers, the cross entropy and the complexity of the input
text under each model. Multiplied together, they represent the
log probability that the input resembles the model. To see these
numbers separately, use also the -v option. In conjunction with
the -f option, stops filtering but prints each input line
prepended with a list of scores for that line.
-o When learning, reads/writes partial token counts so they can be
reused. Normally, category files are learned from exactly the
input data given, and don’t contain extraneous information. When
this option is in effect, some extra information is saved in the
file online, after all input has been read. This information can
be reread the next time that learning occurs, to continue where
the previous dataset left off. If online doesn’t exist, it is
created. If online exists, it is read before learning, and
updated afterwards. The file is at least approximately 3 times
bigger than the learned category.
In dbacl, file updates are atomic, but if using the -o switch,
two or more processes should not learn simultaneously, as only
one process will write a lasting category and memory dump. The
-m switch can also speed up online learning, but beware of
possible corruption. Only one process should read or write a
file. This option is intended primarily for controlled test
runs.
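For example, to learn a category in two separate runs while carrying
partial token counts over (file names are illustrative):
% dbacl -o spam.onl -T email -l spam spam1.mbox
% dbacl -o spam.onl -T email -l spam spam2.mbox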
-q Select quality of learning, where quality can be 1,2,3,4. Higher
values take longer to learn, and should be slightly more
accurate. The default quality is 1 if the category file doesn’t
exist or weights cannot be preloaded, and 2 otherwise.
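For example, to learn more slowly but slightly more accurately:
% dbacl -q 3 -l twain Mark_Twain.txt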
-r Learn the digramic reference model only. Skips the learning of
extra features in the text corpus.
-v Verbose mode. When learning, print out details of the
computation, when classifying, print out the name of the most
probable category. In conjunction with the -n option, prints
the scores as an explicit product of the cross entropy and the
complexity.
-w Select default features to be n-grams up to max_order. This is
incompatible with the -g option, which always takes precedence.
If no -w or -g options are given, dbacl assumes -w 1. Note that
n-grams for n greater than 1 do not straddle line breaks by
default. The -S switch enables line straddling.
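For example, to learn word pairs which may straddle line breaks:
% dbacl -w 2 -S -l twain Mark_Twain.txt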
-x Set decimation probability to 1 - 2^(-decim). To reduce memory
requirements when learning, some inputs are randomly skipped,
and only a few are added to the model. Exact behaviour depends
on the applicable -T option (default is -T "text"). When the
type is not "email" (e.g. "text"), then individual input features
are added with probability 2^(-decim). When the type is "email",
then full input messages are added with probability 2^(-decim).
Within each such message, all features are used.
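For example, the following learns from roughly half of the input
messages, since each message is added with probability 2^(-1) = 1/2:
% dbacl -x 1 -T email -l spam $MAILDIR/spam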
-A Expect indented input and scores. With this switch, dbacl
expects input lines to be indented by a single space character
(which is then skipped). Lines starting with any other
character are ignored. This is the counterpart to the -a switch
above. When used together with the -a switch, dbacl outputs the
skipped lines as they are, and reinserts the space at the front
of each processed input line.
-D Print debug output. Do not use normally, but can be very useful
for displaying the list of features picked up while learning.
-H Allow hash table to grow up to a maximum of 2^gsize elements
during learning. Initial size is given by -h option.
-L Select the digramic reference measure for character transitions.
The measure can be one of "uniform", "dirichlet" or "maxent".
Default is "uniform".
-M Force multinomial calculations. When learning, forces the model
features to be treated multinomially. When classifying, corrects
entropy scores to reflect multinomial probabilities (only
applicable to multinomial type models, if present). Scores will
always be lower, because the ordering of features is lost.
-N Print posterior probabilities for each category. This assumes
the supplied categories form an exhaustive list of
possibilities. In conjunction with the -f option, stops
filtering but prints each input line prepended with a summary of
the posterior distribution for that line.
-R Include an extra category for purely random text. The category
is called "random". Only makes sense when using the -c option.
-S Enable line straddling. This is useful together with the -w
option to allow n-grams for n > 1 to ignore line breaks, so a
complex token can continue past the end of the line. This is not
recommended for email.
-T Specify nonstandard text format. By default, dbacl assumes that
the input text is a purely ASCII text file. This corresponds to
the case when type is "text".
There are several types and subtypes which can be used to clean
the input text of extraneous tokens before actual learning or
classifying takes place. Each (sub)type you wish to use must be
indicated with a separate -T option on the command line, and
automatically implies the corresponding type.
The "text" type is for unstructured plain text. No cleanup is
performed. This is the default if no types are given on the
command line.
The "email" type is for mbox format input files or single RFC822
emails. Headers are recognized and most are skipped. To include
extra RFC822 standard headers (except for trace headers), use
the "email:headers" subtype. To include trace headers, use the
"email:theaders" subtype. To include all headers in the email,
use the "email:xheaders" subtype. To skip all headers, except
the subject, use "email:noheaders". To scan binary attachments
for strings, use the "email:atts" subtype.
When the "email" type is in effect, HTML markup is automatically
removed from text attachments except text/plain attachments. To
also remove HTML markup from plain text attachments, use
"email:noplain". To prevent HTML markup removal in all text
attachments, use "email:plain".
The "html" type is for removing HTML markup (between <html> and
</html> tags) and surrounding text. Note that if the "email"
type is enabled, then "html" is automatically enabled for
compatible message attachments only.
The "xml" type is like "html", but doesn’t honour <html> and
</html>, and doesn’t interpret tags (so this should be more
properly called "angle markup" removal, and has nothing to do
with actual XML semantics).
When "html" is enabled, most markup attributes are lost (for
values of ’most’ close to ’all’). The "html:links" subtype
forces link urls to be parsed and learned, which would otherwise
be ignored. The "html:alt" subtype forces parsing of alternative
text in ALT attributes and various other tags. The
"html:scripts" subtype forces parsing of scripts, "html:styles"
forces parsing of styles, "html:forms" forces parsing of form
values, while "html:comments" forces parsing of HTML comments.
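For example, to learn an email corpus while also parsing standard
headers and link urls:
% dbacl -T email -T email:headers -T html:links -l spam $MAILDIR/spam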
-U Print (U)nambiguity. When used in conjunction with the -v
switch, prints scores followed by their empirical standard
deviations. When used alone, prints the best category, followed
by an estimated probability that this category choice is
unambiguous. More precisely, the probability measures lack of
overlap of CLT confidence intervals for each category score (If
there is overlap, then there is ambiguity).
This estimated probability can be used as an "unsure" flag, e.g.
if the estimated probability is lower than 50%. Formally, a
score of 0% means another category is equally likely to apply to
the input, and a score of 100% means no other category is likely
to apply to the input. Note that this type of confidence is
unrelated to the -X switch. Also, the probability estimate is
usually low if the document is short, or if the message contains
many tokens that have never been seen before (only applies to
uniform digramic measure).
-V Print the program version number and exit.
-W Like -w, but prevents features from straddling newlines. See the
description of -w.
-X Print the confidence in the score calculated for each category,
when used together with the -n or -N switch. Prepares the model
for confidence scores, when used with the -l switch. The
confidence is an estimate of the typicality of the score,
assuming the null hypothesis that the given category is correct.
When used with the -v switch alone, factorizes the score as the
empirical divergence plus the shannon entropy, multiplied by
complexity, in that order. The -X switch is not supported in all
possible models, and displays a percentage of "0.0" if it can’t
be calculated. Note that for unknown documents, it is quite
common to have confidences close to zero.
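For example, a category must be prepared with -X at learning time
before confidences can be printed (category and file names are
illustrative):
% dbacl -T email -X -l spam $MAILDIR/spam
% dbacl -T email -X -n -v -c spam -c good message.txt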
USAGE
To create two category files in the current directory from two ASCII
text files named Mark_Twain.txt and William_Shakespeare.txt
respectively, type:
% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt
Now you can classify input text, for example:
% echo "howdy" | dbacl -v -c twain -c shake
twain
% echo "to be or not to be" | dbacl -v -c twain -c shake
shake
Note that at least the -v option is necessary, otherwise dbacl does
not print anything. The return value is 1 in the first case, 2 in the
second.
% echo "to be or not to be" | dbacl -v -N -c twain -c shake
twain 22.63% shake 77.37%
% echo "to be or not to be" | dbacl -v -n -c twain -c shake
twain 7.04 * 6.0 shake 6.74 * 6.0
These invocations are equivalent. The numbers 6.74 and 7.04 represent
how close the average token is to each category, and 6.0 is the number
of tokens observed. If you want to print a simple confidence value
together with the best category, replace -v with -U.
% echo "to be or not to be" | dbacl -U -c twain -c shake
shake # 34%
Note that the true probability of category shake versus category twain
is 77.37%, but the calculation is somewhat ambiguous, and 34% is the
confidence out of 100% that the calculation is qualitatively correct.
Suppose a file document.txt contains English text lines interspersed
with noise lines. To filter out the noise lines from the English lines,
assuming you have an existing category shake say, type:
% dbacl -c shake -f shake -R document.txt > document.txt_eng
% dbacl -c shake -f random -R document.txt > document.txt_rnd
Note that the quality of the results will vary depending on how well
the categories shake and random represent each input line. It is
sometimes useful to see the posterior probabilities for each line
without filtering:
% dbacl -c shake -f shake -RN document.txt > document.txt_probs
You can now postprocess the posterior probabilities for each line of
text with another script, to replicate an arbitrary Bayesian decision
rule of your choice.
In the special case of exactly two categories, the optimal Bayesian
decision procedure can be implemented for documents as follows: let p1
be the prior probability that the input text is classified as
category1. Consequently, the prior probability of classifying as
category2 is 1 - p1. Let u12 be the cost of misclassifying a category1
input text as belonging to category2 and vice versa for u21. We assume
there is no cost for classifying correctly. Then the following command
implements the optimal Bayesian decision:
% dbacl -n -c category1 -c category2 | awk ’{ if($2 * p1 * u12 > $4 *
(1 - p1) * u21) { print $1; } else { print $3; } }’
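For example, with p1, u12 and u21 replaced by concrete (illustrative)
numerical values through awk variables:
% dbacl -n -c category1 -c category2 | awk -v p1=0.7 -v u12=1 -v u21=2
’{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else { print
$3; } }’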
dbacl can also be used in conjunction with procmail(1) to implement a
simple Bayesian email classification system. Assume that incoming mail
should be automatically delivered to one of three mail folders located
in $MAILDIR and named work, personal, and spam. Initially, these must
be created and filled with appropriate sample emails. A crontab(1)
file can be used to learn the three categories once a day, e.g.
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
To automatically deliver each incoming email into the appropriate
folder, the following procmailrc(5) recipe fragment could be used:
CATS=$HOME/.dbacl
# run the spam classifier
:0 c
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam
# send to the appropriate mailbox
:0:
* ? test -n "$YAY"
$MAILDIR/$YAY
:0:
$DEFAULT
Sometimes, dbacl will send the email to the wrong mailbox. In that
case, the misclassified message should be removed from its wrong
destination and placed in the correct mailbox. The error will be
corrected the next time your messages are learned. If it is left in
the wrong category, dbacl will learn the wrong corpus statistics.
The default text features (tokens) read by dbacl are purely alphabetic
strings, which minimizes memory requirements but can be unrealistic in
some cases. To construct models based on alphanumeric tokens, use the
-e switch. The example below also uses the optional -D switch, which
prints a list of actual tokens found in the document:
% dbacl -e alnum -D -l twain Mark_Twain.txt | less
It is also possible to override the default feature selection method
used to learn the category model by means of regular expressions. For
example, the following duplicates the default feature selection method
in the C locale, while being much slower:
% dbacl -l twain -g ’^([[:alpha:]]+)’ -g ’[^[:alpha:]]([[:alpha:]]+)’
Mark_Twain.txt
The category twain which is obtained depends only on single alphabetic
words in the text file Mark_Twain.txt (and computed digram statistics
for prediction). For a second example, the following command builds a
smoothed Markovian (word bigram) model which depends on pairs of
consecutive words within each line (but pairs cannot straddle a line
break):
% dbacl -l twain2 -g ’(^|[^[:alpha:]])([[:alpha:]]+)||2’ -g
’(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23’
Mark_Twain.txt
More general line-based n-gram models of all orders (up to 7) can be
built in a similar way. To construct paragraph based models, you
should reformat the input corpora with awk(1) or sed(1) to obtain one
paragraph per line. Line size is limited by available memory, but note
that regex performance will degrade quickly for long lines.
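For example, a possible reformatting sketch using awk’s paragraph
mode, which joins the lines of each paragraph before learning:
% awk ’BEGIN{RS=""} {gsub(/\n/," "); print}’ Mark_Twain.txt | dbacl -l
twain2 -w 2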
PERFORMANCE
The underlying assumption of statistical learning is that a relatively
small number of training documents can represent a much larger set of
input documents. Thus in the long run, learning can eventually be
stopped without serious impact on classification accuracy. While not
true in
reality, this assumption is surprisingly accurate for problems such as
email filtering. In practice, this means that a well chosen corpus on
the order of ten thousand documents is sufficient for highly accurate
results for years. Continual learning after such a critical mass
results in diminishing returns. Of course, when real world input
document patterns change dramatically, the predictive power of the
models can be lost. At the other end, a few hundred documents already
give acceptable results in most cases.
dbacl is heavily optimized for the case of frequent classifications but
infrequent batch learning. This is the long run optimum described
above. Under ideal conditions, dbacl can classify a hundred emails per
second on low end hardware (500MHz Pentium III). Learning is not much
slower per token, but in practice takes much longer for large document
collections for various reasons. When using the -m switch, data
structures are aggressively mapped into memory if possible, reducing
overheads for both I/O and memory allocations.
dbacl throws away its input as soon as possible, and has no limits on
the input document size. Both classification and learning speed are
directly proportional to the number of tokens in the input, but
learning also needs a nonlinear optimization step which takes time
proportional to the number of unique tokens discovered. At time of
writing, dbacl is one of the fastest open source mail filters given its
optimal usage scenario, but uses more memory for learning than other
filters.
MULTIPLE PROCESSES AND DATA CORRUPTION
When saving category files, dbacl first writes out a temporary file in
the same location, and renames it afterwards. If a problem or crash
occurs during learning, the old category file is therefore left
untouched. This ensures that categories can never be corrupted, no
matter how many processes try to simultaneously learn or classify, and
means that valid categories are available for classification at any
time.
When using the -m switch, file contents are memory mapped for speedy
reading and writing. This, together with the -o switch, is intended
mainly for testing purposes, when tens of thousands of messages must be
learned and scored in a laboratory to measure dbacl’s accuracy. Because
no file locking is attempted for performance reasons, corruptions are
possible, unless you make sure that only one dbacl process reads or
writes any file at any given time. This is the only case (-m and -o
together) when corruption is possible.
MEMORY USE
When classifying a document, dbacl loads all indicated categories into
RAM, so the total memory needed is approximately the sum of the
category file sizes plus a fixed small overhead. The input document is
consumed while being read, so its size doesn’t matter, but very long
lines can take up space. When using the -m switch, the categories are
read using mmap(2) as available.
When learning, dbacl keeps a large structure in memory which contains
many objects which won’t be saved into the output category. The size of
this structure is proportional to the number of unique tokens read, but
not the size of the input documents, since they are discarded while
being read. As a rough guide, this structure is 4x-5x the size of the
final category file that is produced.
To prevent unchecked memory growth, dbacl allocates by default a fixed
smallish amount of memory for tokens. When this space is used up,
further tokens are discarded, which skews the learned category and
makes it less usable as more tokens are dropped. A
warning is printed on STDERR in such a case.
The -h switch lets you fix the initial size of the token space in
powers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you
type "dbacl -V", you can see the number of bytes needed for each token
when either learning or classifying. Multiply this number by the
maximum number of possible tokens to estimate the memory needed for
learning. The -H switch lets dbacl grow its tables automatically if and
when needed, up to a maximum specified. So if you type "-H 21", then
the initial size will be doubled repeatedly if necessary, up to
approximately two million unique tokens.
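For example, if "dbacl -V" reported (hypothetically) 18 bytes per
token for learning, then "-h 17" would budget about 18 * 131072 bytes,
or roughly 2.4MB, while "-H 21" would allow growth up to about 18 *
2097152 bytes, or roughly 38MB.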
When learning with the -X switch, a handful of input documents are also
kept in RAM throughout.
ENVIRONMENT
DBACL_PATH
When this variable is set, its value is prepended to every
category filename which doesn’t start with a ’/’ or a ’.’.
SIGNALS
INT If this signal is caught, dbacl simply exits without doing any
cleanup or other operations. This signal can often be sent by
pressing Ctrl-C on the keyboard. See stty(1).
HUP, QUIT, TERM
If one of these signals is caught, dbacl stops reading input and
continues its operation as if no more input was available. This
is a way of quitting gracefully, but note that in learning mode,
a category file will be written based on the incomplete input.
The QUIT signal can often be sent by pressing Ctrl-\ on the
keyboard. See stty(1).
USR1 If this signal is caught, dbacl reloads the current categories
at the earliest feasible opportunity. This is not normally
useful at all, but might be in special cases, such as if the -f
switch is invoked together with input from a long running pipe.
NOTES
dbacl generated category files are in binary format, and may or may not
be portable to systems using a different byte order architecture (this
depends on how dbacl was compiled). The -V switch prints out whether
categories are portable, or else you can just experiment.
dbacl does not recognize functionally equivalent regular expressions,
and in this case duplicate features will be counted several times.
With every learned category, the command line options that were used
are saved. When classifying, make sure that every relevant category
was learned with the same set of options (regexes are allowed to
differ), otherwise behaviour is undefined. There is no need to repeat
all the switches when classifying.
If you get many digitization warnings, then you are trying to learn too
much data at once, or your model is too complex. dbacl is compiled to
save memory by digitizing final weights, but you can disable
digitization by editing dbacl.h and recompiling.
dbacl offers several built-in tokenizers (see -e switch) with more to
come in future versions, as the author invents them. While the default
tokenizer may evolve, no tokenizer should ever be removed, so that you
can always simulate previous dbacl behaviour subject to bug fixes and
architectural changes.
The confidence estimates obtained through the -X switch are
underestimates, i.e. they are more conservative than they should be.
BUGS
"Ya know, some day scientists are gonna invent something that will
outsmart a rabbit." (Robot Rabbit, 1953)
SOURCE
The source code for the latest version of this program is available at
the following locations:
http://www.lbreyer.com/gpl.html
http://dbacl.sourceforge.net
AUTHOR
Laird A. Breyer <laird@lbreyer.com>
SEE ALSO
awk(1), bayesol(1), crontab(1), hmine(1), hypex(1), less(1),
mailcross(1), mailfoot(1), mailinspect(1), mailtoe(1), procmailex(5),
regex(7), stty(1), sed(1)