NAME
dbacl - a digramic Bayesian classifier for text recognition.
SYNOPSIS
dbacl [-01dvnirmwMNDXW] [-T type ] -l category [-h size] [-H gsize] [-x
decim] [-q quality] [-w max_order] [-e deftok] [-o online] [-L
measure] [-g regex]... [FILE]...
dbacl [-vnimNRX] [-h size] [-T type] -c category [-c category]... [-f
keep]... [FILE]...
dbacl -V
OVERVIEW
dbacl is a Bayesian text and email classifier. When using the -l
switch, it learns a body of text and produces a file named category
which summarizes the text. When using the -c switch, it compares an
input text stream with any number of category files, and outputs the
name of the closest match, or optionally various numerical scores
explained below.
Whereas this manual page is intended as a reference, there are several
tutorials and documents you can read to get specialized information.
Specific documentation about the design of dbacl and the statistical
models that it uses can be found in dbacl.ps. For a basic overview of
text classification using dbacl, see tutorial.html. A companion
tutorial geared towards email filtering is email.html. If you have
trouble getting dbacl to classify reliably, read is_it_working.html.
The USAGE section of this manual page also has some examples.
/usr/share/dbacl/doc/dbacl.ps
/usr/share/dbacl/doc/tutorial.html
/usr/share/dbacl/doc/email.html
/usr/share/dbacl/doc/is_it_working.html
dbacl uses a maximum entropy (minimum divergence) language model
constructed with respect to a digramic reference measure (unknown
tokens are predicted from digrams, i.e. pairs of letters). Practically,
this means that a category is constructed from tokens in the training
set, while previously unseen tokens can be predicted automatically from
their letters. A token here is either a word (fragment) or a
combination of words (fragments), selected according to various
switches. Learning roughly works by tweaking token probabilities until
the training data is least surprising.
EXIT STATUS
The normal shell exit conventions aren’t followed (sorry!). When using
the -l command form, dbacl returns zero on success, nonzero if an error
occurs. When using the -c form, dbacl returns a positive integer
corresponding to the category with the highest posterior probability.
In case of a tie, the first most probable category is chosen. If an
error occurs, dbacl returns zero.
DESCRIPTION
When using the -l command form, dbacl learns a category when given one
or more FILE names, which should contain readable ASCII text. If no
FILE is given, dbacl learns from STDIN. If FILE is a directory, it is
opened and all its files are read, but not its subdirectories. The
result is saved in the binary file named category, and completely
replaces any previous contents. As a convenience, if the environment
variable DBACL_PATH contains a directory, then that is prepended to the
file path, unless category starts with a ’/’ or a ’.’.
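For example, with categories kept in a common directory (the directory
name below is illustrative):
% DBACL_PATH=$HOME/.dbacl dbacl -l twain Mark_Twain.txt
The category file is then written to $HOME/.dbacl/twain.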
The input text for learning is assumed to be unstructured plain text by
default. This is not suitable for learning email, because email
contains various transport encodings and formatting instructions which
can reduce classification effectiveness. You must use the -T switch in
that case so that dbacl knows it should perform decoding and filtering
of MIME and HTML as appropriate. Appropriate switch values are "-T
email" for RFC2822 email input, "-T html" for HTML input, "-T xml" for
generic XML style input, and "-T text" for the default plain text
format.
There are other values of the -T switch that also allow fine tuning of
the decoding capabilities.
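For example, to learn a spam category from a mailbox file:
% dbacl -T email -l spam $MAILDIR/spam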
When using the -c command form, dbacl attempts to classify the text
found in FILE, or STDIN if no FILE is given. Each possible category
must be given separately, and should be the file name of a previously
learned text corpus. As a convenience, if the variable DBACL_PATH
contains a directory, it is prepended to each file path which doesn’t
start with a ’/’ or a ’.’. The visible output of the classification
depends on the combination of extra switches used. If no switch is
used, then no output is shown on STDOUT. However, dbacl always produces
an exit code which can be tested.
To see an output for a classification, you must use at least one of the
-v,-U,-n,-N,-D,-d switches. They can sometimes be used in combination
to produce natural variations of their individual outputs. Where
applicable, dbacl also prints warnings on STDERR.
The -v switch outputs the name of the best category among all the
choices given.
The -U switch outputs the name of the best category followed by a
confidence percentage. Normally, this is the switch that you want to
use. A percentage of 100% means that dbacl is sure of its choice, while
a percentage of 0% means that some other category is equally likely.
This is not the model probability, but measures how unambiguous the
classification is, and can be used to tag unsure classifications (e.g.
if the confidence is 25% or less).
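For example (the category names and the reported percentage are
illustrative):
% dbacl -U -c work -c personal message.txt
work # 85%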
The -N switch prints each category name followed by its (posterior)
probability, expressed as a percentage. The percentages always sum to
100%. This is intuitive, but only valuable if the document being
classified contains a handful of tokens (ten or less). In the common
case with many more tokens, the probabilities are always extremely
close to 100% and 0%.
The -n switch prints each category name followed by the negative
logarithm of its probability. This is equivalent to using the -N
switch, but much more useful. The smallest number gives the best
category. A more convenient form is to use both -n and -v which prints
each category name followed by the cross entropy and the number of
tokens analyzed. The cross entropy measures (in bits) the average
compression rate which is achievable, under the given category model,
per token of input text. If you use all three of -n,-v,-X then an extra
value is output for each category, representing a kind of p-value for
each category score. This indicates how typical the score is compared
to the training documents, but only works if the -X switch was used
during learning, and only for some types of models (e.g. email). These
p-values are uniformly distributed and independent (if the categories
are independent), so can be combined using Fisher’s chi squared test to
obtain composite p-values for groupings of categories.
The -v and -X switches together print each category name followed by a
detailed decomposition of the category score, factored into (
divergence rate + shannon entropy rate )* token count @ p-value. Again,
this only works in some types of models.
The -v and -U switches print each category name followed by a
decomposition of the category score into ( divergence rate + shannon
entropy rate # score variance )* token count.
The -D switch prints out the input text as modified internally by dbacl
prior to tokenization. For example, if a MIME encoded email document is
classified, then this prints the decoded text that will be actually
tokenized and classified. This switch is mainly useful for debugging.
The -d switch dumps tokens and scores while they are being read. It is
useful for debugging, or if you want to create graphical
representations of the classification. A detailed explanation of the
output is beyond the scope of this manual page, but is straightforward
if you’ve read dbacl.ps. Possible variations include -d together with
-n or -N.
Classification can be done with one or several categories in principle.
When two or more categories are used, the Bayesian posterior
probability is used, given the input text, with a uniform prior
distribution on categories. For other choices of prior, see the
companion utility bayesol(1). When a single category is used,
classification can be done by comparing the score with a threshold. In
practice however, much better results are obtained with several
categories.
Learning and classifying cannot be mixed in the same command
invocation; however, there are no locking issues, and separate dbacl
processes can operate simultaneously with predictable results, because
file operations are designed to be atomic.
Finally, note that dbacl does not manage your document corpora or your
computed categories, and in particular it does not allow you to extend
an existing category file with new documents. This is unlike various
current spam filters, which can learn new emails incrementally. This
limitation of dbacl is partially due to the nonlinear procedure used in
the learning algorithm, and partially a desire for increased
flexibility.
You can simulate the effect of incremental learning by saving your
training documents into archives and adding to these archives over
time, relearning from scratch periodically. Learning is actually faster
if these archives are compressed and decompressed on the fly when
needed. By keeping control of your archives, you can never lose the
information in your categories, and you can easily experiment with
different switches or tokenizations or sets of training documents if
you like.
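For example, assuming the training corpus is kept in a compressed
archive (the file name is illustrative):
% zcat twain_archive.txt.gz | dbacl -l twain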
SECONDARY SWITCHES
By default, dbacl classifies the input text as a whole. However, when
using the -f option, dbacl can be used to filter each input line
separately, printing only those lines which match one or more models
identified by keep (use the category name or number to refer to a
category). This is useful if you want to filter out some lines, but
note that if the lines are short, then the error rate can be high.
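For example, assuming two previously learned categories named good and
junk (illustrative names), the following keeps only the lines which
best match good:
% dbacl -c good -c junk -f good notes.txt > notes_good.txt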
The -e,-w,-g,-j switches are used for selecting an appropriate
tokenization scheme. A token is a word or word fragment or combination
of words or fragments. The shape of tokens is important because it
forms the basis of the language models used by dbacl. The -e switch
selects a predefined tokenization scheme, which is speedy but limited.
The -w switch specifies composite tokens derived from the -e switch.
For example, "-e alnum -w 2" means that tokens should be alphanumeric
word fragments combined into overlapping pairs (bigrams). By default,
all tokens are converted to lowercase, which reduces the number of
possible tokens and therefore memory consumption; the -j switch makes
tokens case sensitive instead (see the -j entry below).
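For example, the following learns a category from overlapping
alphanumeric bigrams (using the sample corpus from the USAGE section
below):
% dbacl -e alnum -w 2 -l twain Mark_Twain.txt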
If the -g switch is used, you can completely specify what the tokens
should look like using a regular expression. Several -g switches can be
used to construct complex tokenization schemes, and parentheses within
each expression can be used to select fragments and combine them into
n-grams. The cost of such flexibility is reduced classification and
learning speed. When experimenting with tokenization schemes, try using
the -d or -D switches while learning or classifying, as they will print
the tokens explicitly so you can see what text fragments are picked up
or missed out. For regular expression syntax, see regex(7).
The -h and -H switches regulate how much memory dbacl may use for
learning. Text classification can use a lot of memory, and by default
dbacl limits itself even at the expense of learning accuracy. In many
cases if a limit is reached, a warning message will be printed on
STDERR with some advice.
When relearning the same category several times, a significant speedup
can be obtained by using the -1 switch, as this allows the previously
learned probabilities to be read from the category and reused.
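For example, when relearning an existing email category (paths as in
the USAGE section below):
% dbacl -1 -T email -l $CATS/spam $MAILDIR/spam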
Note that classification accuracy depends foremost on the amount and
quality of the training samples, and only secondarily on the amount of
switch tweaking.
OPTIONS
-0 When learning, prevents weight preloading. Normally, dbacl
checks if the category file already exists, and if so, tries to
use the existing weights as a starting point. This can
dramatically speed up learning. If the -0 (zero) switch is set,
then dbacl behaves as if no category file already exists. This
is mainly useful for testing. This switch is now enabled by
default, to protect against weight drift which can reduce
accuracy over many learning iterations. Use -1 to force
preloading.
-1 Force weight preloading if the category file already exists. See
discussion of the -0 switch.
-a Append scores. Every input line is written to STDOUT and the
dbacl scores are appended. This is useful for postprocessing
with bayesol(1). For ease of processing, every original input
line is indented by a single space (to distinguish them from the
appended scores), and the line with the scores (if -n is used)
is prefixed with the string "scores ". If a second copy of dbacl
needs to read this output later, it should be invoked with the
-A switch.
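For example, a sketch of postprocessing with bayesol(1), where
message.txt and the risk specification file my.risk are illustrative:
% dbacl -T email -c work -c spam -n -a message.txt | bayesol -c
my.risk -v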
-d Dump the model parameters to STDOUT. In conjunction with the -l
option, this produces a human-readable summary of the maximum
entropy model. In conjunction with the -c option, displays the
contribution of each token to the final score. Suppresses all
other normal output.
-e Select character class for default (not regex-based)
tokenization. By default, tokens are alphabetic strings only.
This corresponds to the case when deftok is "alpha". Possible
values for deftok are "alpha", "alnum", "graph", "char", "cef"
and "adp". The last two are custom tokenizers intended for
email messages. See also isalpha(3). The "char" tokenizer
picks up single printable characters rather than bigger tokens,
and is intended for testing only.
-f Filter each line of input separately, passing to STDOUT only
lines which match the category identified as keep. This option
should be used repeatedly for each category which must be kept.
keep can be either the category file name, or a positive integer
representing the required category in the same order it appears
on the command line.
Output lines are flushed as soon as they are written. If the
input file is a pipe or character device, then an attempt is
made to use line buffering mode, otherwise the more efficient
block buffering is used.
-g Learn only features described by the extended regular expression
regex. This overrides the default feature selection method (see
-w option) and learns, for each line of input, only tokens
constructed from the concatenation of strings which match the
tagged subexpressions within the supplied regex. All substrings
which match regex within a suffix of each input line are treated
as features, even if they overlap on the input line.
As an optional convenience, regex can include the suffix ||xyz
which indicates which parenthesized subexpressions should be
tagged. In this case, xyz should consist exclusively of digits 1
to 9, numbering exactly those subexpressions which should be
tagged. Alternatively, if no parentheses exist within regex,
then it is assumed that the whole expression must be captured.
-h Set the size of the hash table to 2^size elements. When using
the -l option, this refers to the total number of features
allowed in the maximum entropy model being learned. When using
the -c option together with the -M switch and multinomial type
categories, this refers to the maximum number of features taken
into account during classification. Without the -M switch, this
option has no effect when classifying.
-i Fully internationalized mode. Forces the use of wide characters
internally, which is necessary in some locales. This incurs a
noticeable performance penalty.
-j Make features case sensitive. Normally, all features are
converted to lower case during processing, which reduces storage
requirements and improves statistical estimates for small
datasets. With this option, the original capitalization is used
for each feature. This can improve classification accuracy.
-m Aggressively maps categories into memory and locks them into RAM
to prevent swapping, if possible. This is useful when speed is
paramount and memory is plentiful, for example when testing the
classifier on large datasets.
Locking may require relaxing user limits with ulimit(1). Ask
your system administrator. Beware when using the -m switch
together with the -o switch, as only one dbacl process must
learn or classify at a time to prevent file corruption. If no
learning takes place, then the -m switch for classifying is
always safe to use. See also the discussion for the -o switch.
-n Print scores for each category. Each score is the product of
two numbers, the cross entropy and the complexity of the input
text under each model. Multiplied together, they represent the
log probability that the input resembles the model. To see these
numbers separately, use also the -v option. In conjunction with
the -f option, stops filtering but prints each input line
prepended with a list of scores for that line.
-o When learning, reads/writes partial token counts so they can be
reused. Normally, category files are learned from exactly the
input data given, and don’t contain extraneous information. When
this option is in effect, some extra information is saved in the
file online, after all input has been read. This information can
be reread the next time that learning occurs, to continue where
the previous dataset left off. If online doesn’t exist, it is
created. If online exists, it is read before learning, and
updated afterwards. The file is at least approximately 3 times
bigger than the learned category.
In dbacl, file updates are atomic, but if using the -o switch,
two or more processes should not learn simultaneously, as only
one process will write a lasting category and memory dump. The
-m switch can also speed up online learning, but beware of
possible corruption. Only one process should read or write a
file. This option is intended primarily for controlled test
runs.
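For example, to learn a category in two separate runs while carrying
partial token counts over (file names are illustrative):
% dbacl -o spam.onl -T email -l spam spam1.mbox
% dbacl -o spam.onl -T email -l spam spam2.mbox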
-q Select quality of learning, where quality can be 1,2,3,4. Higher
values take longer to learn, and should be slightly more
accurate. The default quality is 1 if the category file doesn’t
exist or weights cannot be preloaded, and 2 otherwise.
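For example, to learn more slowly but slightly more accurately:
% dbacl -q 3 -l twain Mark_Twain.txt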
-r Learn the digramic reference model only. Skips the learning of
extra features in the text corpus.
-v Verbose mode. When learning, print out details of the
computation, when classifying, print out the name of the most
probable category. In conjunction with the -n option, prints
the scores as an explicit product of the cross entropy and the
complexity.
-w Select default features to be n-grams up to max_order. This is
incompatible with the -g option, which always takes precedence.
If no -w or -g options are given, dbacl assumes -w 1. Note that
n-grams for n greater than 1 do not straddle line breaks by
default. The -S switch enables line straddling.
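For example, to learn word pairs which may straddle line breaks:
% dbacl -w 2 -S -l twain Mark_Twain.txt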
-x Set decimation probability to 1 - 2^(-decim). To reduce memory
requirements when learning, some inputs are randomly skipped,
and only a few are added to the model. Exact behaviour depends
on the applicable -T option (default is -T "text"). When the
type is not "email" (e.g. "text"), then individual input features
are added with probability 2^(-decim). When the type is "email",
then full input messages are added with probability 2^(-decim).
Within each such message, all features are used.
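For example, the following learns from roughly half of the input
messages, since each message is added with probability 2^(-1) = 1/2:
% dbacl -x 1 -T email -l spam $MAILDIR/spam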
-A Expect indented input and scores. With this switch, dbacl
expects input lines to be indented by a single space character
(which is then skipped). Lines starting with any other
character are ignored. This is the counterpart to the -a switch
above. When used together with the -a switch, dbacl outputs the
skipped lines as they are, and reinserts the space at the front
of each processed input line.
-D Print debug output. Do not use normally, but can be very useful
for displaying the list of features picked up while learning.
-H Allow hash table to grow up to a maximum of 2^gsize elements
during learning. Initial size is given by -h option.
-L Select the digramic reference measure for character transitions.
The measure can be one of "uniform", "dirichlet" or "maxent".
Default is "uniform".
-M Force multinomial calculations. When learning, forces the model
features to be treated multinomially. When classifying, corrects
entropy scores to reflect multinomial probabilities (only
applicable to multinomial type models, if present). Scores will
always be lower, because the ordering of features is lost.
-N Print posterior probabilities for each category. This assumes
the supplied categories form an exhaustive list of
possibilities. In conjunction with the -f option, stops
filtering but prints each input line prepended with a summary of
the posterior distribution for that line.
-R Include an extra category for purely random text. The category
is called "random". Only makes sense when using the -c option.
-S Enable line straddling. This is useful together with the -w
option to allow n-grams for n > 1 to ignore line breaks, so a
complex token can continue past the end of the line. This is not
recommended for email.
-T Specify nonstandard text format. By default, dbacl assumes that
the input text is a purely ASCII text file. This corresponds to
the case when type is "text".
There are several types and subtypes which can be used to clean
the input text of extraneous tokens before actual learning or
classifying takes place. Each (sub)type you wish to use must be
indicated with a separate -T option on the command line, and
automatically implies the corresponding type.
The "text" type is for unstructured plain text. No cleanup is
performed. This is the default if no types are given on the
command line.
The "email" type is for mbox format input files or single RFC822
emails. Headers are recognized and most are skipped. To include
extra RFC822 standard headers (except for trace headers), use
the "email:headers" subtype. To include trace headers, use the
"email:theaders" subtype. To include all headers in the email,
use the "email:xheaders" subtype. To skip all headers, except
the subject, use "email:noheaders". To scan binary attachments
for strings, use the "email:atts" subtype.
When the "email" type is in effect, HTML markup is automatically
removed from text attachments except text/plain attachments. To
also remove HTML markup from plain text attachments, use
"email:noplain". To prevent HTML markup removal in all text
attachments, use "email:plain".
The "html" type is for removing HTML markup (between <html> and
</html> tags) and surrounding text. Note that if the "email"
type is enabled, then "html" is automatically enabled for
compatible message attachments only.
The "xml" type is like "html", but doesn’t honour <html> and
</html>, and doesn’t interpret tags (so this should be more
properly called "angle markup" removal, and has nothing to do
with actual XML semantics).
When "html" is enabled, most markup attributes are lost (for
values of ’most’ close to ’all’). The "html:links" subtype
forces link urls to be parsed and learned, which would otherwise
be ignored. The "html:alt" subtype forces parsing of alternative
text in ALT attributes and various other tags. The
"html:scripts" subtype forces parsing of scripts, "html:styles"
forces parsing of styles, "html:forms" forces parsing of form
values, while "html:comments" forces parsing of HTML comments.
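For example, to learn an email corpus while also parsing standard
headers and link urls:
% dbacl -T email -T email:headers -T html:links -l spam $MAILDIR/spam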
-U Print (U)nambiguity. When used in conjunction with the -v
switch, prints scores followed by their empirical standard
deviations. When used alone, prints the best category, followed
by an estimated probability that this category choice is
unambiguous. More precisely, the probability measures lack of
overlap of CLT confidence intervals for each category score (If
there is overlap, then there is ambiguity).
This estimated probability can be used as an "unsure" flag, e.g.
if the estimated probability is lower than 50%. Formally, a
score of 0% means another category is equally likely to apply to
the input, and a score of 100% means no other category is likely
to apply to the input. Note that this type of confidence is
unrelated to the -X switch. Also, the probability estimate is
usually low if the document is short, or if the message contains
many tokens that have never been seen before (only applies to
uniform digramic measure).
-V Print the program version number and exit.
-W Like -w, but prevents features from straddling newlines. See the
description of -w.
-X Print the confidence in the score calculated for each category,
when used together with the -n or -N switch. Prepares the model
for confidence scores, when used with the -l switch. The
confidence is an estimate of the typicality of the score,
assuming the null hypothesis that the given category is correct.
When used with the -v switch alone, factorizes the score as the
empirical divergence plus the shannon entropy, multiplied by
complexity, in that order. The -X switch is not supported in all
possible models, and displays a percentage of "0.0" if it can’t
be calculated. Note that for unknown documents, it is quite
common to have confidences close to zero.
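For example, a category must be prepared with -X at learning time
before confidences can be printed (category and file names are
illustrative):
% dbacl -T email -X -l spam $MAILDIR/spam
% dbacl -T email -X -n -v -c spam -c good message.txt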
USAGE
To create two category files in the current directory from two ASCII
text files named Mark_Twain.txt and William_Shakespeare.txt
respectively, type:
% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt
Now you can classify input text, for example:
% echo "howdy" | dbacl -v -c twain -c shake
twain
% echo "to be or not to be" | dbacl -v -c twain -c shake
shake
Note that at least the -v option is necessary, otherwise dbacl does
not print anything. The return value is 1 in the first case, 2 in the
second.
% echo "to be or not to be" | dbacl -v -N -c twain -c shake
twain 22.63% shake 77.37%
% echo "to be or not to be" | dbacl -v -n -c twain -c shake
twain 7.04 * 6.0 shake 6.74 * 6.0
These invocations are equivalent. The numbers 6.74 and 7.04 represent
how close the average token is to each category, and 6.0 is the number
of tokens observed. If you want to print a simple confidence value
together with the best category, replace -v with -U.
% echo "to be or not to be" | dbacl -U -c twain -c shake
shake # 34%
Note that the true probability of category shake versus category twain
is 77.37%, but the calculation is somewhat ambiguous, and 34% is the
confidence out of 100% that the calculation is qualitatively correct.
Suppose a file document.txt contains English text lines interspersed
with noise lines. To filter out the noise lines from the English lines,
assuming you have an existing category shake say, type:
% dbacl -c shake -f shake -R document.txt > document.txt_eng
% dbacl -c shake -f random -R document.txt > document.txt_rnd
Note that the quality of the results will vary depending on how well
the categories shake and random represent each input line. It is
sometimes useful to see the posterior probabilities for each line
without filtering:
% dbacl -c shake -f shake -RN document.txt > document.txt_probs
You can now postprocess the posterior probabilities for each line of
text with another script, to replicate an arbitrary Bayesian decision
rule of your choice.
In the special case of exactly two categories, the optimal Bayesian
decision procedure can be implemented for documents as follows: let p1
be the prior probability that the input text is classified as
category1. Consequently, the prior probability of classifying as
category2 is 1 - p1. Let u12 be the cost of misclassifying a category1
input text as belonging to category2 and vice versa for u21. We assume
there is no cost for classifying correctly. Then the following command
implements the optimal Bayesian decision:
% dbacl -n -c category1 -c category2 | awk ’{ if($2 * p1 * u12 > $4 *
(1 - p1) * u21) { print $1; } else { print $3; } }’
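For example, with p1, u12 and u21 replaced by concrete (illustrative)
numerical values through awk variables:
% dbacl -n -c category1 -c category2 | awk -v p1=0.7 -v u12=1 -v u21=2
’{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else { print
$3; } }’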
dbacl can also be used in conjunction with procmail(1) to implement a
simple Bayesian email classification system. Assume that incoming mail
should be automatically delivered to one of three mail folders located
in $MAILDIR and named work, personal, and spam. Initially, these must
be created and filled with appropriate sample emails. A crontab(1)
file can be used to learn the three categories once a day, e.g.
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
To automatically deliver each incoming email into the appropriate
folder, the following procmailrc(5) recipe fragment could be used:
CATS=$HOME/.dbacl
# run the spam classifier
:0 c
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam
# send to the appropriate mailbox
:0:
* ? test -n "$YAY"
$MAILDIR/$YAY
:0:
$DEFAULT
Sometimes, dbacl will send the email to the wrong mailbox. In that
case, the misclassified message should be removed from its wrong
destination and placed in the correct mailbox. The error will be
corrected the next time your messages are learned. If it is left in
the wrong category, dbacl will learn the wrong corpus statistics.
The default text features (tokens) read by dbacl are purely alphabetic
strings, which minimizes memory requirements but can be unrealistic in
some cases. To construct models based on alphanumeric tokens, use the
-e switch. The example below also uses the optional -D switch, which
prints a list of actual tokens found in the document:
% dbacl -e alnum -D -l twain Mark_Twain.txt | less
It is also possible to override the default feature selection method
used to learn the category model by means of regular expressions. For
example, the following duplicates the default feature selection method
in the C locale, while being much slower:
% dbacl -l twain -g ’^([[:alpha:]]+)’ -g ’[^[:alpha:]]([[:alpha:]]+)’
Mark_Twain.txt
The category twain which is obtained depends only on single alphabetic
words in the text file Mark_Twain.txt (and computed digram statistics
for prediction). For a second example, the following command builds a
smoothed Markovian (word bigram) model which depends on pairs of
consecutive words within each line (but pairs cannot straddle a line
break):
% dbacl -l twain2 -g ’(^|[^[:alpha:]])([[:alpha:]]+)||2’ -g
’(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23’
Mark_Twain.txt
More general line-based n-gram models of all orders (up to 7) can be
built in a similar way. To construct paragraph based models, you
should reformat the input corpora with awk(1) or sed(1) to obtain one
paragraph per line. Line size is limited by available memory, but note
that regex performance will degrade quickly for long lines.
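For example, a possible reformatting sketch using awk’s paragraph
mode, which joins the lines of each paragraph before learning:
% awk ’BEGIN{RS=""} {gsub(/\n/," "); print}’ Mark_Twain.txt | dbacl -l
twain2 -w 2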
PERFORMANCE
The underlying assumption of statistical learning is that a relatively
small number of training documents can represent a much larger set of
input documents. Thus in the long run, learning can eventually be
stopped without serious impact on classification accuracy. While not
true in
reality, this assumption is surprisingly accurate for problems such as
email filtering. In practice, this means that a well chosen corpus on
the order of ten thousand documents is sufficient for highly accurate
results for years. Continual learning after such a critical mass
results in diminishing returns. Of course, when real world input
document patterns change dramatically, the predictive power of the
models can be lost. At the other end, a few hundred documents already
give acceptable results in most cases.
dbacl is heavily optimized for the case of frequent classifications but
infrequent batch learning. This is the long run optimum described
above. Under ideal conditions, dbacl can classify a hundred emails per
second on low end hardware (500MHz Pentium III). Learning is not much
slower per token, but in practice takes much longer for large document
collections for various reasons. When using the -m switch, data
structures are aggressively mapped into memory if possible, reducing
overheads for both I/O and memory allocations.
dbacl throws away its input as soon as possible, and has no limits on
the input document size. Both classification and learning speed are
directly proportional to the number of tokens in the input, but
learning also needs a nonlinear optimization step which takes time
proportional to the number of unique tokens discovered. At time of
writing, dbacl is one of the fastest open source mail filters given its
optimal usage scenario, but uses more memory for learning than other
filters.
MULTIPLE PROCESSES AND DATA CORRUPTION
When saving category files, dbacl first writes out a temporary file in
the same location, and renames it afterwards. If a problem or crash
occurs during learning, the old category file is therefore left
untouched. This ensures that categories can never be corrupted, no
matter how many processes try to simultaneously learn or classify, and
means that valid categories are available for classification at any
time.
When using the -m switch, file contents are memory mapped for speedy
reading and writing. This, together with the -o switch, is intended
mainly for testing purposes, when tens of thousands of messages must be
learned and scored in a laboratory to measure dbacl’s accuracy. Because
no file locking is attempted for performance reasons, corruptions are
possible, unless you make sure that only one dbacl process reads or
writes any file at any given time. This is the only case (-m and -o
together) when corruption is possible.
MEMORY USE
When classifying a document, dbacl loads all indicated categories into
RAM, so the total memory needed is approximately the sum of the
category file sizes plus a fixed small overhead. The input document is
consumed while being read, so its size doesn’t matter, but very long
lines can take up space. When using the -m switch, the categories are
read using mmap(2) as available.
When learning, dbacl keeps a large structure in memory which contains
many objects which won’t be saved into the output category. The size of
this structure is proportional to the number of unique tokens read, but
not the size of the input documents, since they are discarded while
being read. As a rough guide, this structure is 4x-5x the size of the
final category file that is produced.
To prevent unchecked memory growth, dbacl allocates by default a fixed
smallish amount of memory for tokens. When this space is used up,
further tokens are discarded, which skews the learned category and
makes it less usable as more tokens are dropped. A
warning is printed on STDERR in such a case.
The -h switch lets you fix the initial size of the token space in
powers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you
type "dbacl -V", you can see the number of bytes needed for each token
when either learning or classifying. Multiply this number by the
maximum number of possible tokens to estimate the memory needed for
learning. The -H switch lets dbacl grow its tables automatically if and
when needed, up to a maximum specified. So if you type "-H 21", then
the initial size will be doubled repeatedly if necessary, up to
approximately two million unique tokens.
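For example, if "dbacl -V" reported (hypothetically) 18 bytes per
token for learning, then "-h 17" would budget about 18 * 131072 bytes,
or roughly 2.4MB, while "-H 21" would allow growth up to about 18 *
2097152 bytes, or roughly 38MB.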
When learning with the -X switch, a handful of input documents are also
kept in RAM throughout.
ENVIRONMENT
DBACL_PATH
When this variable is set, its value is prepended to every
category filename which doesn’t start with a ’/’ or a ’.’.
SIGNALS
INT If this signal is caught, dbacl simply exits without doing any
cleanup or other operations. This signal can often be sent by
pressing Ctrl-C on the keyboard. See stty(1).
HUP, QUIT, TERM
If one of these signals is caught, dbacl stops reading input and
continues its operation as if no more input was available. This
is a way of quitting gracefully, but note that in learning mode,
a category file will be written based on the incomplete input.
The QUIT signal can often be sent by pressing Ctrl-\ on the
keyboard. See stty(1).
USR1 If this signal is caught, dbacl reloads the current categories
at the earliest feasible opportunity. This is not normally
useful at all, but might be in special cases, such as if the -f
switch is invoked together with input from a long running pipe.
NOTES
dbacl generated category files are in binary format, and may or may not
be portable to systems using a different byte order architecture (this
depends on how dbacl was compiled). The -V switch prints out whether
categories are portable, or else you can just experiment.
dbacl does not recognize functionally equivalent regular expressions,
and in this case duplicate features will be counted several times.
With every learned category, the command line options that were used
are saved. When classifying, make sure that every relevant category
was learned with the same set of options (regexes are allowed to
differ), otherwise behaviour is undefined. There is no need to repeat
all the switches when classifying.
If you get many digitization warnings, then you are trying to learn too
much data at once, or your model is too complex. dbacl is compiled to
save memory by digitizing final weights, but you can disable
digitization by editing dbacl.h and recompiling.
dbacl offers several built-in tokenizers (see -e switch) with more to
come in future versions, as the author invents them. While the default
tokenizer may evolve, no tokenizer should ever be removed, so that you
can always simulate previous dbacl behaviour subject to bug fixes and
architectural changes.
The confidence estimates obtained through the -X switch are
underestimates, i.e. they are more conservative than they should be.
BUGS
"Ya know, some day scientists are gonna invent something that will
outsmart a rabbit." (Robot Rabbit, 1953)
SOURCE
The source code for the latest version of this program is available at
the following locations:
http://www.lbreyer.com/gpl.html
http://dbacl.sourceforge.net
AUTHOR
Laird A. Breyer <laird@lbreyer.com>
SEE ALSO
awk(1), bayesol(1), crontab(1), hmine(1), hypex(1), less(1),
mailcross(1), mailfoot(1), mailinspect(1), mailtoe(1), procmailex(5),
regex(7), stty(1), sed(1)