NAME
autoclass - automatically discover classes in data
SYNOPSIS
autoclass -search data_file header_file model_file s_param_file
autoclass -reports results_file search_file r_params_file
autoclass -predict results_file search_file results_file
DESCRIPTION
AutoClass solves the problem of automatic discovery of classes in data
(sometimes called clustering, or unsupervised learning), as distinct
from the generation of class descriptions from labeled examples (called
supervised learning). It aims to discover the "natural" classes in the
data. AutoClass is applicable to observations of things that can be
described by a set of attributes, without referring to other things.
The data values corresponding to each attribute are limited to be
either numbers or the elements of a fixed set of symbols. With numeric
data, a measurement error must be provided.
AutoClass is looking for the best classification(s) of the data it can
find. A classification is composed of:
1) A set of classes, each of which is described by a set of class
parameters, which specify how the class is distributed along the
various attributes. For example, "height normally distributed
with mean 4.67 ft and standard deviation .32 ft",
2) A set of class weights, describing what percentage of cases are
likely to be in each class.
3) A probabilistic assignment of cases in the data to these
classes. I.e. for each case, the relative probability that it
is a member of each class.
As a strictly Bayesian system (accept no substitutes!), the quality
measure AutoClass uses is the total probability that, had you known
nothing about your data or its domain, you would have found this set of
data generated by this underlying model. This includes the prior
probability that the "world" would have chosen this number of classes,
this set of relative class weights, and this set of parameters for each
class, and the likelihood that such a set of classes would have
generated this set of values for the attributes in the data cases.
These probabilities are typically very small, in the range of e^-30000,
and so are usually expressed in exponential notation.
When run with the -search command, AutoClass searches for a
classification. The required arguments are the paths to the four input
files, which supply the data, the data format, the desired
classification model, and the search parameters, respectively.
By default, AutoClass writes intermediate results to a binary file.
With the -reports command, AutoClass generates an ASCII report. The
arguments are the full path names of the .results, .search, and
.r-params files.
When run with the -predict command, AutoClass predicts the class
membership of a "test" data set based on classes found in a "training"
data set (see "PREDICTIONS" below).
INPUT FILES
An AutoClass data set resides in two files. There is a header file
(file type "hd2") that describes the specific data format and attribute
definitions. The actual data values are in a data file (file type
"db2"). We use two files to allow editing of data descriptions without
having to deal with the entire data set. This makes it easy to
experiment with different descriptions of the database without having
to reproduce the data set. Internally, an AutoClass database structure
is identified by its header and data files, and the number of data
loaded.
For more detailed information on the formats of these files, see
/usr/share/doc/autoclass/preparation-c.text.
DATA FILE
The data file contains a sequence of data objects (datum or case)
terminated by the end of the file. The number of values for each data
object must be equal to the number of attributes defined in the header
file. Data objects must be groups of tokens delimited by "new-line".
Attributes are typed as REAL, DISCRETE, or DUMMY. Real attribute
values are numbers, either integer or floating point. Discrete
attribute values can be strings, symbols, or integers. A dummy
attribute value can be any of these types. Dummies are read in but
otherwise ignored -- they will be set to zeros in the internal
database. Thus the actual values will not be available for use in
report output. To have these attribute values available, use either
type REAL or type DISCRETE, and define their model type as IGNORE in
the .model file. Missing values for any attribute type may be
represented by "?" or by another token specified in the header file.
All are translated to a special unique value after being read, so this
symbol is effectively reserved for unknown/missing values.
For example:
white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 ? 8 2 1
all_white 28.953982 0.5267696 0 1 1
HEADER FILE
The header file specifies the data file format, and the definitions of
the data attributes. The header file specification consists of two
parts -- the data set format definition specifications,
and the attribute descriptors. ";" in column 1 identifies a comment.
A header file follows this general format:
;; num_db2_format_defs value (number of format def lines
;; that follow), range of n is 1 -> 5
num_db2_format_defs n
;; number_of_attributes token and value required
number_of_attributes <as required>
;; following are optional - default values are specified
;; separator_char ' '
;; comment_char ';'
;; unknown_token '?'
separator_char ','
;; attribute descriptors
;; <zero-based att#> <att_type> <att_sub_type> <att_description>
;; <att_param_pairs>
Each attribute descriptor is a line of:
Attribute index (zero based, beginning in column 1)
Attribute type. See below.
Attribute subtype. See below.
Attribute description: symbol (no embedded blanks) or
string; <= 40 characters
Specific property and value pairs.
Currently available combinations:
type subtype property type(s)
---- -------- ---------------
dummy none/nil --
discrete nominal range
real location error
real scalar zero_point rel_error
The ERROR property should represent your best estimate of the average
error expected in the measurement and recording of that real attribute.
Lacking better information, the error can be taken as 1/2 the minimum
possible difference between measured values. It can be argued that
real values are often truncated, so that smaller errors may be
justified, particularly for generated data. But AutoClass only sees
the recorded values. So it needs the error in the recorded values,
rather than the actual measurement error. Setting this error much
smaller than the minimum expressible difference implies the possibility
of values that cannot be expressed in the data. Worse, it implies that
two identical values must represent measurements that were much closer
than they might actually have been. This leads to over-fitting of the
classification.
The REL_ERROR property is used for SCALAR reals, where the error is
taken as proportional to the measured value; the ERROR property is not
supported for this subtype.
AutoClass uses the error as a lower bound on the width of the normal
distribution. So small error estimates tend to give narrower peaks and
to increase both the number of classes and the classification
probability. Broad error estimates tend to limit the number of
classes.
The scalar ZERO_POINT property is the smallest value that the
measurement process could have produced. This is often 0.0, or less by
some error range. Similarly, the bounded real’s min and max properties
are exclusive bounds on the attribute's generating process. For a
calculated percentage these would be 0-e and 100+e, where e is an error
value. The discrete attribute’s range is the number of possible values
the attribute can take on. This range must include unknown as a value
when such values occur.
Header File Example:
!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)
;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified
;; separator_char ' '
;; comment_char ';'
;; unknown_token '?'
separator_char ’,’
;; <zero-based att#> <att_type> <att_sub_type> <att_description>
;; <att_param_pairs>
0 dummy nil "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
4 discrete nominal "Truth value, range = 1 - 2" range 2
5 discrete nominal "Color of foobar, 10 values" range 10
6 discrete nominal Spectral_color_group range 6
MODEL FILE
A classification of a data set is made with respect to a model which
specifies the form of the probability distribution function for classes
in that data set. Normally the model structure is defined in a model
file (file type "model"), containing one or more models. Internally, a
model is defined relative to a particular database. Thus it is
identified by the corresponding database, the model’s model file and
its sequential position in the file.
Each model is specified by one or more model group definition lines.
Each model group line associates attribute indices with a model term
type.
Here is an example model file:
# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default
Here, the first line is a comment. The following characters in column
1 make the line a comment: '!', '#', ' ', ';', and '\n' (empty line).
The tokens "model_index n m" must appear on the first non-comment line,
and precede the model term definition lines. n is the zero-based model
index, typically 0 where there is only one model -- the majority of
search situations. m is the number of model term definition lines that
follow.
The last seven lines are model group lines. Each model group line
consists of:
A model term type (one of single_multinomial, single_normal_cm,
single_normal_cn, multi_normal_cn, or ignore).
A list of attribute indices (the attribute set list), or the symbol
default. Attribute indices are zero-based. Single model terms may
have one or more attribute indices on each line, while multi model
terms require two or more attribute indices per line. An attribute
index must not appear more than once in a model list.
Notes:
1) At least one model definition is required (model_index token).
2) There may be multiple entries in a model for any model term
type.
3) Model term types currently consist of:
single_multinomial
models discrete attributes as multinomials, with missing
values.
single_normal_cn
models real valued attributes as normals; no missing
values.
single_normal_cm
models real valued attributes with missing values.
multi_normal_cn
is a covariant normal model without missing values.
ignore allows the model to ignore one or more attributes.
ignore is not a valid default model term type.
See the documentation in models-c.text for further information
about specific model terms.
4) Data modeled by single_normal_cn, single_normal_cm, or
multi_normal_cn whose subtype is scalar (the value distribution
is away from 0.0, and thus not a "normal" distribution) will be
log-transformed and modeled with the log-normal model. For data
whose subtype is location (the value distribution is around
0.0), no transform is done, and the normal model is used.
SEARCHING
AutoClass, when invoked in the "search" mode, will check the validity of
the set of data, header, model, and search parameter files. Errors
will stop the search from starting, and warnings will ask the user
whether to continue. A history of the error and warning messages is
saved, by default, in the log file.
Once you have succeeded in describing your data with a header file and
model file that passes the AUTOCLASS -SEARCH <...> input checks, you
will have entered the search domain where AutoClass classifies your
data. (At last!)
The main function to use in finding a good classification of your data
is AUTOCLASS -SEARCH, and using it will take most of the computation
time. Searches are invoked with:
autoclass -search <.db2 file path> <.hd2 file path>
<.model file path> <.s-params file path>
All files must be specified as fully qualified relative or absolute
pathnames. File name extensions (file types) for all files are forced
to canonical values required by the AutoClass program:
data file ("ascii") db2
data file ("binary") db2-bin
header file hd2
model file model
search params file s-params
The sample-run (/usr/share/doc/autoclass/examples/) that comes with
AutoClass shows some sample searches, and browsing these is probably
the fastest way to get familiar with how to do searches. The test data
sets located under /usr/share/doc/autoclass/examples/ will show you
some other header (.hd2), model (.model), and search params (.s-params)
file setups. The remainder of this section describes how to do
searches in somewhat more detail.
The bold faced tokens below are generally search params file
parameters. For more information on the s-params file, see SEARCH
PARAMETERS below, or /usr/share/doc/autoclass/search-c.text.gz.
WHAT RESULTS ARE
AutoClass is looking for the best classification(s) of the data it can
find. A classification is composed of:
1) a set of classes, each of which is described by a set of class
parameters, which specify how the class is distributed along the
various attributes. For example, "height normally distributed
with mean 4.67 ft and standard deviation .32 ft",
2) a set of class weights, describing what percentage of cases are
likely to be in each class.
3) a probabilistic assignment of cases in the data to these
classes. I.e. for each case, the relative probability that it
is a member of each class.
As a strictly Bayesian system (accept no substitutes!), the quality
measure AutoClass uses is the total probability that, had you known
nothing about your data or its domain, you would have found this set of
data generated by this underlying model. This includes the prior
probability that the "world" would have chosen this number of classes,
this set of relative class weights, and this set of parameters for each
class, and the likelihood that such a set of classes would have
generated this set of values for the attributes in the data cases.
These probabilities are typically very small, in the range of e^-30000,
and so are usually expressed in exponential notation.
WHAT RESULTS MEAN
It is important to remember that all of these probabilities are GIVEN
that the real model is in the model family that AutoClass has
restricted its attention to. If AutoClass is looking for Gaussian
classes and the real classes are Poisson, then the fact that AutoClass
found 5 Gaussian classes may not say much about how many Poisson
classes there really are.
The relative probability between different classifications found can be
very large, like e^1000, so the very best classification found is
usually overwhelmingly more probable than the rest (and overwhelmingly
less probable than any better classifications as yet undiscovered). If
AutoClass should manage to find two classifications that are within
about exp(5-10) of each other (i.e. within 100 to 10,000 times more
probable) then you should consider them to be about equally probable,
as our computation is usually not more accurate than this (and
sometimes much less).
HOW IT WORKS
AutoClass repeatedly creates a random classification and then tries to
massage this into a high probability classification through local
changes, until it converges to some "local maximum". It then remembers
what it found and starts over again, continuing until you tell it to
stop. Each effort is called a "try", and the computed probability is
intended to cover the whole volume in parameter space around this
maximum, rather than just the peak.
The standard approach to massaging is to
1) Compute the probabilistic class memberships of cases using the
class parameters and the implied relative likelihoods.
2) Using the new class members, compute class statistics (like
mean) and revise the class parameters.
and repeat till they stop changing. There are three available
convergence algorithms: "converge_search_3" (the default),
"converge_search_4" and "converge". Their specification is controlled
by search params file parameter try_fn_type.
WHEN TO STOP
You can tell AUTOCLASS -SEARCH to stop by: 1) giving a max_duration (in
seconds) argument at the beginning; 2) giving a max_n_tries (an
integer) argument at the beginning; or 3) typing a "q" and <return>
after you have seen enough tries. The max_duration and max_n_tries
arguments are useful if you desire to run AUTOCLASS -SEARCH in batch
mode. If you are restarting AUTOCLASS -SEARCH from a previous search,
the value of max_n_tries you provide, for instance 3, will tell the
program to compute 3 more tries in addition to however many it has
already done. The same incremental behavior is exhibited by
max_duration.
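For example, a minimal .s-params fragment for an unattended batch run
might contain (the particular limits are illustrative only):
;; run in batch: stop after 25 tries or 4 hours, whichever comes first
interactive_p = false
max_n_tries = 25
max_duration = 14400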
Deciding when to stop is a judgment call and it’s up to you. Since the
search includes a random component, there’s always the chance that if
you let it keep going it will find something better. So you need to
trade off how much better it might be with how long it might take to
find it. The search status reports that are printed when a new best
classification is found are intended to provide you information to help
you make this tradeoff.
One clear sign that you should probably stop is if most of the
classifications found are duplicates of previous ones (flagged by "dup"
as they are found). This should only happen for very small sets of
data or when fixing a very small number of classes, like two.
Our experience is that for moderately large to extremely large data
sets (~200 to ~10,000 cases), it is necessary to run AutoClass for at
least 50 trials.
WHAT GETS RETURNED
Just before returning, AUTOCLASS -SEARCH will give short descriptions
of the best classifications found. How many will be described can be
controlled with n_final_summary.
By default AUTOCLASS -SEARCH will write out a number of files, both at
the end and periodically during the search (in case your system crashes
before it finishes). These files will all have the same name (taken
from the search params pathname [<name>.s-params]), and differ only in
their file extensions. If your search runs are very long and there is
a possibility that your machine may crash, you can have intermediate
"results" files written out. These can be used to restart your search
run with minimum loss of search effort. See the documentation file
/usr/share/doc/autoclass/checkpoint-c.text.
A ".log" file will hold a listing of most of what was printed to the
screen during the run, unless you set log_file_p to false to say you
want no such foolishness. Unless results_file_p is false, a binary
".results-bin" file (the default) or an ASCII ".results" text file,
will hold the best classifications that were returned, and unless
search_file_p is false, a ".search" file will hold the record of the
search tries. save_compact_p controls whether the "results" files are
saved as binary or ASCII text.
If the C global variable "G_safe_file_writing_p" is defined as TRUE in
"autoclass-c/prog/globals.c", the names of "results" files (those that
contain the saved classifications) are modified internally to account
for redundant file writing. If the search params file name is
"my_saved_clsfs" you will see the following "results" file names
(ignoring directories and pathnames for this example):
save_compact_p = true --
"my_saved_clsfs.results-bin" - completely written file
"my_saved_clsfs.results-tmp-bin" - partially written file, renamed
when complete
save_compact_p = false --
"my_saved_clsfs.results" - completely written file
"my_saved_clsfs.results-tmp" - partially written file, renamed
when complete
If checkpointing is being done, these additional names will appear:
save_compact_p = true --
"my_saved_clsfs.chkpt-bin" - completely written checkpoint file
"my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
renamed when complete
save_compact_p = false --
"my_saved_clsfs.chkpt" - completely written checkpoint file
"my_saved_clsfs.chkpt-tmp" - partially written checkpoint file,
renamed when complete
HOW TO GET STARTED
The way to invoke AUTOCLASS -SEARCH is:
autoclass -search <.db2 file path> <.hd2 file path>
<.model file path> <.s-params file path>
To restart a previous search, specify that force_new_search_p has the
value false in the search params file, since its default is true.
Specifying false tells AUTOCLASS -SEARCH to try to find a previous
compatible search (<...>.results[-bin] & <...>.search) to continue
from, and will restart using it if found. To force a new search
instead of restarting an old one, give the parameter force_new_search_p
the value of true, or use the default. If there is an existing search
(<...>.results[-bin] & <...>.search), the user will be asked to
confirm, since proceeding will discard the existing search files.
If a previous search is continued, the message "RESTARTING SEARCH" will
be given instead of the usual "BEGINNING SEARCH". It is generally
better to continue a previous search than to start a new one, unless
you are trying a significantly different search method, in which case
statistics from the previous search may mislead the current one.
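For instance, a restart that continues a previous search for three
more tries might use an .s-params file containing (values
illustrative):
force_new_search_p = false
max_n_tries = 3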
STATUS REPORTS
A running commentary on the search will be printed to the screen and to
the log file (unless log_file_p is false). Note that the ".log" file
will contain a listing of all default search params values, and the
values of all params that are overridden.
After each try a very short report (only a few characters long) is
given. After each new best classification, a longer report is given,
but no more often than min_report_period (default is 30 seconds).
SEARCH VARIATIONS
AUTOCLASS -SEARCH by default uses a certain standard search method or
"try function" (try_fn_type = "converge_search_3"). Two others are
also available: "converge_search_4" and "converge". They are provided
in case your problem is one that may happen to benefit from them. In
general the default method will result in finding better
classifications at the expense of a longer search time. The default
was chosen so as to be robust, giving even performance across many
problems. The alternatives to the default may do better on some
problems, but may do substantially worse on others.
"converge_search_3" uses an absolute stopping criterion
(rel_delta_range, default value of 0.0025) which tests, for each
class, the delta of the log approximate-marginal-likelihood of the
class statistics with respect to the class hypothesis
(class->log_a_w_s_h_j) divided by the class weight (class->w_j) between
successive convergence cycles. Increasing this value loosens the
convergence and reduces the number of cycles. Decreasing this value
tightens the convergence and increases the number of cycles. n_average
(default value of 3) specifies how many successive cycles must meet the
stopping criterion before the trial terminates.
"converge_search_4" uses an absolute stopping criterion
(cs4_delta_range, default value of 0.0025) which tests, for each
class, the slope of the log approximate-marginal-likelihood of the
class statistics with respect to the class hypothesis
(class->log_a_w_s_h_j) divided by the class weight (class->w_j) over
sigma_beta_n_values (default value 6) convergence cycles. Increasing
the value of cs4_delta_range loosens the convergence and reduces the
number of cycles. Decreasing this value tightens the convergence and
increases the number of cycles. Computationally, this try function is
more expensive than "converge_search_3", but may prove useful if the
computational "noise" is significant compared to the variations in the
computed values. Key calculations are done in double precision
floating point, and for the largest data base we have tested so far
(5,420 cases of 93 attributes), computational noise has not been a
problem, although the value of max_cycles needed to be increased to
400.
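As an illustration, an .s-params fragment selecting this try function,
with the cycle limit raised as described above, might read:
try_fn_type = "converge_search_4"
max_cycles = 400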
"converge" uses one of two absolute stopping criterion which test the
variation of the classification (clsf) log_marginal (clsf->log_a_x_h)
delta between successive convergence cycles. The largest of halt_range
(default value 0.5) and halt_factor * current_clsf_log_marginal) is
used (default value of halt_factor is 0.0001). Increasing these values
loosens the convergence and reduces the number of cycles. Decreasing
these values tightens the convergence and increases the number of
cycles. n_average (default value of 3) specifies how many cycles must
meet the stopping criteria before the trial terminates. This is a very
approximate stopping criterion, but will give you some feel for the
kind of classifications to expect. It would be useful for
"exploratory" searches of a data base.
The purpose of reconverge_type = "chkpt" is to complete an interrupted
classification by continuing from its last checkpoint. The purpose of
reconverge_type = "results" is to attempt further refinement of the
best completed classification using a different value of try_fn_type
("converge_search_3", "converge_search_4", "converge"). If max_n_tries
is greater than 1, then in each case, after the reconvergence has
completed, AutoClass will perform further search trials based on the
parameter values in the <...>.s-params file.
With the use of reconverge_type (default value ""), you may apply more
than one try function to a classification. Say you generate several
exploratory trials using try_fn_type = "converge", and quit the search
saving .search and .results[-bin] files. Then you can begin another
search with try_fn_type = "converge_search_3", reconverge_type =
"results", and max_n_tries = 1. This will result in the further
convergence of the best classification generated with try_fn_type =
"converge", with try_fn_type = "converge_search_3". When AutoClass
completes this search try, you will have an additional refined
classification.
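The two-stage workflow just described might be captured in two
.s-params files along these lines (a sketch; only the parameters
mentioned above are shown):
;; stage 1 -- exploratory search
try_fn_type = "converge"
;; stage 2 -- refine the best stage-1 classification
force_new_search_p = false
try_fn_type = "converge_search_3"
reconverge_type = "results"
max_n_tries = 1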
A good way to verify that an alternate try_fn_type is
generating a well converged classification is to run AutoClass in
prediction mode on the same data used for generating the
classification. Then generate and compare the corresponding case or
class cross reference files for the original classification and the
prediction. Small differences between these files are to be expected,
while large differences indicate incomplete convergence. Differences
between such file pairs should, on average and modulo class deletions,
decrease monotonically with further convergence.
The standard way to create a random classification to begin a try is
with the default value of "random" for start_fn_type. The only
alternative, "block", produces repeatable, non-random searches. That
is how the <..>.s-params
files in the autoclass-c/data/.. sub-directories are specified. This
is how development testing is done.
max_cycles controls the maximum number of convergence cycles that will
be performed in any one trial by the convergence functions. Its
default value is 200. The screen output shows a period (".") for each
cycle completed. If your search trials run for 200 cycles, then either
your data base is very complex (increase the value), or the try_fn_type
is not adequate for the situation (try another of the available ones, and
use converge_print_p to get more information on what is going on).
Specifying converge_print_p to be true will generate a brief print-out
for each cycle which will provide information so that you can modify
the default values of rel_delta_range & n_average for
"converge_search_3"; cs4_delta_range & sigma_beta_n_values for
"converge_search_4"; and halt_range, halt_factor, and n_average for
"converge". Their default values are given in the <..>.s-params files
in the autoclass-c/data/.. sub-directories.
HOW MANY CLASSES?
Each new try begins with a certain number of classes and may end up
with a smaller number, as some classes may drop out of the convergence.
In general, you want to begin the try with some number of classes that
previous tries have indicated look promising, and you want to be sure
you are fishing around elsewhere in case you missed something before.
n_classes_fn_type = "random_ln_normal" is the default way to make this
choice. It fits a log normal to the number of classes (usually called
"j" for short) of the 10 best classifications found so far, and
randomly selects from that. There is currently no alternative.
To start the game off, the default is to go down start_j_list for the
first few tries, and then switch to n_classes_fn_type. If you believe
that the probable number of classes in your data base is say 75, then
instead of using the default value of start_j_list (2, 3, 5, 7, 10, 15,
25), specify something like 50, 60, 70, 80, 90, 100.
If one wants to always look for, say, three classes, one can use
fixed_j and override the above. Search status reports will describe
what the current method for choosing j is.
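For example, either of the following .s-params settings could be used
(values illustrative; fixed_j overrides start_j_list):
;; bracket an expected ~75 classes
start_j_list = 50, 60, 70, 80, 90, 100
;; or instead, always search for exactly three classes
fixed_j = 3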
DO I HAVE ENOUGH MEMORY AND DISK SPACE?
Internally, the storage requirements in the current system are of order
n_classes_per_clsf * (n_data + n_stored_clsfs * n_attributes *
n_attribute_values). This depends on the number of cases, the number
of attributes, the values per attribute (use 2 if a real value), and
the number of classifications stored away for comparison to see if
others are duplicates -- controlled by max_n_store (default value =
10). The search process does not itself consume significant memory,
but storage of the results may do so.
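As a rough worked example with hypothetical numbers -- 10 classes per
classification, 10,000 cases, 20 attributes averaging 5 values each,
and max_n_store = 10 stored classifications -- the estimate is
10 * (10,000 + 10 * 20 * 5) = 110,000
stored quantities.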
AutoClass C is configured to handle a maximum of 999 attributes. If
you attempt to run with more than that you will get array bound
violations. In that case, change these configuration parameters in
prog/autoclass.h and recompile AutoClass C:
#define ALL_ATTRIBUTES 999
#define VERY_LONG_STRING_LENGTH 20000
#define VERY_LONG_TOKEN_LENGTH 500
For example, these values will handle several thousand attributes:
#define ALL_ATTRIBUTES 9999
#define VERY_LONG_STRING_LENGTH 50000
#define VERY_LONG_TOKEN_LENGTH 50000
Disk space taken up by the "log" file will of course depend on the
duration of the search. n_save (default value = 2) determines how many
best classifications are saved into the ".results[-bin]" file.
save_compact_p controls whether the "results" and "checkpoint" files
are saved as binary. Binary files are faster and more compact, but are
not portable. The default value of save_compact_p is true, which
causes binary files to be written.
If the time taken to save the "results" files is a problem, consider
increasing min_save_period (default value = 1800 seconds or 30
minutes). Files are saved to disk this often if there is anything
different to report.
JUST HOW SLOW IS IT?
Compute time is of order n_data * n_attributes * n_classes * n_tries *
converge_cycles_per_try. The major uncertainties in this are the number
of basic back and forth cycles till convergence in each try, and of
course the number of tries. The number of cycles per trial is
typically 10-100 for try_fn_type "converge", and 10-200+ for
"converge_search_3" and "converge_search-4". The maximum number is
specified by max_n_tries (default value = 200). The number of trials
is up to you and your available computing resources.
The running time for very large data sets will be quite uncertain. We
advise that a few small scale test runs be made on your system to
determine a baseline. Specify n_data to limit how many data vectors
are read. Given a very large quantity of data, AutoClass may find its
most probable classifications at upwards of a hundred classes, and this
will require that start_j_list be specified appropriately (See above
section HOW MANY CLASSES?). If you are quite certain that you only
want a few classes, you can force AutoClass to search with a fixed
number of classes specified by fixed_j. You will then need to run
separate searches with each different fixed number of classes.
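For instance, a small-scale trial run might use an .s-params file
containing (values illustrative):
;; read only the first 500 data vectors
n_data = 500
max_n_tries = 5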
CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE
AutoClass caches the data, header, and model file pathnames in the
saved classification structure of the binary (".results-bin") or ASCII
(".results") "results" files. If the "results" and "search" files are
moved to a different directory location, the search cannot be
successfully restarted if you have used absolute pathnames. Thus it is
advantageous to invoke AutoClass in a parent directory of the data,
header, and model files, so that relative pathnames can be used. Since
the pathnames cached will then be relative, the files can be moved to a
different host or file system and restarted -- providing the same
relative pathname hierarchy exists.
However, since the ".results" file is ASCII text, those pathnames could
be changed with a text editor (save_compact_p must be specified as
false).
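For example, assuming hypothetical old and new directory prefixes, the
cached pathnames in an ASCII "results" file could be rewritten with a
stream editor instead of an interactive one:
# hypothetical paths -- adjust to your own layout
sed 's|/home/olduser/project/|data/|g' my_saved_clsfs.results > tmp.results
mv tmp.results my_saved_clsfs.results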
SEARCH PARAMETERS
The search is controlled by the ".s-params" file. In this file, an
empty line or a line starting with one of these characters is treated
as a comment: "#", "!", or ";". The parameter name and its value can
be separated by an equal sign, a space, or a tab:
n_clsfs 1
n_clsfs = 1
n_clsfs<tab>1
Spaces are ignored if "=" or "<tab>" are used as separators. Note
there are no trailing semicolons.
The search parameters, with their default values, are as follows:
rel_error = 0.01
Specifies the relative difference measure used by clsf-DS-%=,
when deciding if a new clsf is a duplicate of an old one.
start_j_list = 2, 3, 5, 7, 10, 15, 25
Initially try these numbers of classes, so as not to narrow the
search too quickly. The state of this list is saved in the
<..>.search file and used on restarts, unless an override
specification of start_j_list is made in the .s-params file for
the restart run. This list should bracket your expected number
of classes, and by a wide margin! "start_j_list = -999"
specifies an empty list (allowed only on restarts).
n_classes_fn_type = "random_ln_normal"
Once start_j_list is exhausted, AutoClass will call this
function to decide how many classes to start with on the next
try, based on the 10 best classifications found so far.
Currently only "random_ln_normal" is available.
fixed_j = 0
When fixed_j > 0, overrides start_j_list and n_classes_fn_type,
and AutoClass will always use this value for the initial number
of classes.
min_report_period = 30
Wait at least this time (in seconds) since last report until
reporting verbosely again. Should be set longer than the
expected run time when checking for repeatability of results.
For repeatable results, also see force_new_search_p,
start_fn_type and randomize_random_p. NOTE: At least one of
"interactive_p", "max_duration", and "max_n_tries" must be
active. Otherwise AutoClass will run indefinitely. See below.
interactive_p = true
When false, allows run to continue until otherwise halted. When
true, standard input is queried on each cycle for the quit
character "q", which, when detected, triggers an immediate halt.
max_duration = 0
When = 0, allows run to continue until otherwise halted. When >
0, specifies the maximum number of seconds to run.
max_n_tries = 0
When = 0, allows run to continue until otherwise halted. When >
0, specifies the maximum number of tries to make.
n_save = 2
Save this many clsfs to disk in the .results[-bin] and .search
files. If 0, don't save anything (no .search & .results[-bin]
files).
log_file_p = true
If false, do not write a log file.
search_file_p = true
If false, do not write a search file.
results_file_p = true
If false, do not write a results file.
min_save_period = 1800
CPU crash protection. This specifies the maximum time, in
seconds, that AutoClass will run before it saves the current
results to disk. The default time is 30 minutes.
max_n_store = 10
Specifies the maximum number of classifications stored
internally.
n_final_summary = 10
Specifies the number of trials to be printed out after search
ends.
start_fn_type = "random"
One of {"random", "block"}. This specifies the type of class
initialization. For normal search, use "random", which randomly
selects instances to be initial class means, and adds
appropriate variances. For testing with repeatable search, use
"block", which partitions the database into successive blocks of
near equal size. For repeatable results, also see
force_new_search_p, min_report_period, and randomize_random_p.
try_fn_type = "converge_search_3"
One of {"converge_search_3", "converge_search_4", "converge"}.
These specify alternate search stopping criteria. "converge"
merely tests the rate of change of the log_marginal
classification probability (clsf->log_a_x_h), without checking
rate of change of individual classes (see halt_range and
halt_factor). "converge_search_3" and "converge_search_4" each
monitor the ratio class->log_a_w_s_h_j/class->w_j for all
classes, and continue convergence until all pass the quiescence
criteria for n_average cycles. "converge_search_3" tests
differences between successive convergence cycles (see
rel_delta_range). This provides a reasonable, general purpose
stopping criterion. "converge_search_4" averages the ratio over
"sigma_beta_n_values" cycles (see cs4_delta_range). This is
preferred when converge_search_3 produces many similar classes.
initial_cycles_p = true
If true, perform base_cycle in initialize_parameters. false is
used only for testing.
save_compact_p = true
true saves classifications as machine dependent binary
(.results-bin & .chkpt-bin). false saves as ascii text
(.results & .chkpt)
read_compact_p = true
true reads classifications as machine dependent binary
(.results-bin & .chkpt-bin). false reads as ascii text
(.results & .chkpt).
randomize_random_p = true
false seeds lrand48, the pseudo-random number generator, with
1 to give repeatable test cases. true uses the universal time
clock as the seed, giving semi-random searches. For repeatable results,
also see force_new_search_p, min_report_period and
start_fn_type.
n_data = 0
With n_data = 0, the entire database is read from .db2. With
n_data > 0, only this number of data are read.
halt_range = 0.5
Passed to try_fn_type "converge". With the "converge"
try_fn_type, convergence is halted when the larger of halt_range
and (halt_factor * current_log_marginal) exceeds the difference
between successive cycle values of the classification
log_marginal (clsf->log_a_x_h). Decreasing this value may
tighten the convergence and increase the number of cycles.
halt_factor = 0.0001
Passed to try_fn_type "converge". With the "converge"
try_fn_type, convergence is halted when the larger of halt_range
and (halt_factor * current_log_marginal) exceeds the difference
between successive cycle values of the classification
log_marginal (clsf->log_a_x_h). Decreasing this value may
tighten the convergence and increase the number of cycles.
rel_delta_range = 0.0025
Passed to try function "converge_search_3", which monitors the
ratio of log approx-marginal-likelihood of class statistics
with-respect-to the class hypothesis (class->log_a_w_s_h_j)
divided by the class weight (class->w_j), for each class.
"converge_search_3" halts convergence when the difference
between cycles, of this ratio, for every class, has been
exceeded by "rel_delta_range" for "n_average" cycles.
Decreasing "rel_delta_range" tightens the convergence and
increases the number of cycles.
cs4_delta_range = 0.0025
Passed to try function "converge_search_4", which monitors the
ratio of (class->log_a_w_s_h_j)/(class->w_j), for each class,
averaged over "sigma_beta_n_values" convergence cycles.
"converge_search_4" halts convergence when the maximum
difference in average values of this ratio falls below
"cs4_delta_range". Decreasing "cs4_delta_range" tightens the
convergence and increases the number of cycles.
n_average = 3
Passed to try functions "converge_search_3" and "converge". The
number of cycles for which the convergence criterion must be
satisfied for the trial to terminate.
sigma_beta_n_values = 6
Passed to try_fn_type "converge_search_4". The number of past
values to use in computing sigma^2 (noise) and beta^2 (signal).
max_cycles = 200
This is the maximum number of cycles permitted for any one
convergence of a classification, regardless of any other
stopping criteria. This is very dependent upon your database
and choice of model and convergence parameters, but should be
about twice the average number of cycles reported in the screen
dump and .log file.
converge_print_p = false
If true, the selected try function will print to the screen
values useful in specifying non-default values for halt_range,
halt_factor, rel_delta_range, n_average, sigma_beta_n_values,
and range_factor.
force_new_search_p = true
If true, will ignore any previous search results, discarding the
existing .search and .results[-bin] files after confirmation by
the user; if false, will continue the search using the existing
.search and .results[-bin] files. For repeatable results, also
see min_report_period, start_fn_type and randomize_random_p.
checkpoint_p = false
If true, checkpoints of the current classification will be
written every "min_checkpoint_period" seconds, with file
extension .chkpt[-bin]. This is only useful for very large
classifications.
min_checkpoint_period = 10800
If checkpoint_p = true, the checkpointed classification will be
written this often, in seconds (default = 3 hours).
reconverge_type = ""
Can be either "chkpt" or "results". If "checkpoint_p" = true
and "reconverge_type" = "chkpt", then continue convergence of
the classification contained in <...>.chkpt[-bin]. If
"checkpoint_p " = false and "reconverge_type" = "results",
continue convergence of the best classification contained in
<...>.results[-bin].
screen_output_p = true
If false, no output is directed to the screen. Assuming
log_file_p = true, output will be directed to the log file only.
break_on_warnings_p = true
The default value asks the user whether or not to continue, when
data definition warnings are found. If specified as false, then
AutoClass will continue, despite warnings -- the warning will
continue to be output to the terminal and the log file.
free_storage_p = true
The default value tells AutoClass to free the majority of its
allocated storage. This is not required, and in the case of the
DEC Alpha causes a core dump [is this still true?]. If specified
as false, AutoClass will not attempt to free storage.
HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS
In some situations, repeatable classifications are required: comparing
basic AutoClass C integrity on different platforms, porting AutoClass C
to a new platform, etc. In order to accomplish this, two things are
necessary: 1) the same random number generator must be used, and 2) the
search parameters must be specified properly.
Random Number Generator. This implementation of AutoClass C uses the
Unix srand48/lrand48 random number generator which generates pseudo-
random numbers using the well-known linear congruential algorithm and
48-bit integer arithmetic. lrand48() returns non-negative long
integers uniformly distributed over the interval [0, 2**31).
Search Parameters. The following .s-params file parameters should be
specified:
force_new_search_p = true
start_fn_type "block"
randomize_random_p = false
;; specify the number of trials you wish to run
max_n_tries = 50
;; specify a time greater than duration of run
min_report_period = 30000
Note that no current best classification reports will be produced.
Only a final classification summary will be output.
CHECKPOINTING
With very large databases there is a significant probability of a
system crash during any one classification try. Under such
circumstances it is advisable to take the time to checkpoint the
calculations for possible restart.
Checkpointing is initiated by specifying "checkpoint_p = true" in the
".s-params" file. This causes the inner convergence step, to save a
copy of the classification onto the checkpoint file each time the
classification is updated, providing a certain period of time has
elapsed. The file extension is ".chkpt[-bin]".
Each time AutoClass completes a cycle, a "." is output to the screen
to provide you with information to be used in setting the
min_checkpoint_period value (default 10800 seconds or 3 hours). There
is obviously a trade-off between frequency of checkpointing and the
probability that your machine may crash, since the repetitive writing
of the checkpoint file will slow the search process.
Restarting AutoClass Search:
To recover the classification and continue the search after rebooting
and reloading AutoClass, specify reconverge_type = "chkpt" in the
".s-params" file (and specify force_new_search_p as false).
AutoClass will reload the appropriate database and models, provided
there has been no change in their filenames since the time they were
loaded for the checkpointed classification run. The ".s-params" file
contains any non-default arguments that were provided to the original
call.
If the crash occurred early in the search, before start_j_list had
been emptied, it will be necessary to trim the original list to what
would have remained in the crashed search. This can be determined by
looking at the ".log"
file to determine what values were already used. If the start_j_list
has been emptied, then an empty start_j_list should be specified in the
".s-params" file. This is done either by
start_j_list =
or
start_j_list = -999
Here is a set of scripts to demonstrate check-pointing:
autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params
Run 1)
## glassc-chkpt.s-params
max_n_tries = 2
force_new_search_p = true
## --------------------
;; run to completion
Run 2)
## glassc-chkpt.s-params
force_new_search_p = false
max_n_tries = 10
checkpoint_p = true
min_checkpoint_period = 2
## --------------------
;; after 1 checkpoint, ctrl-C to simulate cpu crash
Run 3)
## glassc-chkpt.s-params
force_new_search_p = false
max_n_tries = 1
checkpoint_p = true
min_checkpoint_period = 1
reconverge_type = "chkpt"
## --------------------
;; checkpointed trial should finish
OUTPUT FILES
The standard reports are
1) Attribute influence values: presents the relative influence or
significance of the data’s attributes both globally (averaged
over all classes), and locally (specifically for each class). A
heuristic for relative class strength is also listed;
2) Cross-reference by case (datum) number: lists the primary class
probability for each datum, ordered by case number. When
report_mode = "data", additional lesser class probabilities
(greater than or equal to 0.001) are listed for each datum;
3) Cross-reference by class number: for each class the primary
class probability and any lesser class probabilities (greater
than or equal to 0.001) are listed for each datum in the class,
ordered by case number. It is also possible to list, for each
datum, the values of attributes, which you select.
The attribute influence values report attempts to provide relative
measures of the "influence" of the data attributes on the classes found
by the classification. The normalized class strengths, the normalized
attribute influence values summed over all classes, and the individual
influence values (I[jkl]) are all only relative measures; they should
not be interpreted with more meaning than a rank ordering, and
certainly not as anything approaching absolute values.
The reports are output to files whose names and pathnames are taken
from the ".r-params" file pathname. The report file types (extensions)
are:
influence values report
"influ-o-text-n" or "influ-no-text-n"
cross-reference by case
"case-text-n"
cross-reference by class
"class-text-n"
or, if report_mode is overridden to "data":
influence values report
"influ-o-data-n" or "influ-no-data-n"
cross-reference by case
"case-data-n"
cross-reference by class
"class-data-n"
where n is the classification number from the "results" file. The
first or best classification is numbered 1, the next best 2, etc. The
default is to generate reports only for the best classification in the
"results" file. You can produce reports for other saved
classifications by using report params keywords n_clsfs and
clsf_n_list. The "influ-o-text-n" file type is the default
(order_attributes_by_influence_p = true), and lists each class’s
attributes in descending order of attribute influence value. If the
value of order_attributes_by_influence_p is overridden to be false in
the <...>.r-params file, then each class’s attributes will be listed in
ascending order by attribute number. The extension of the file
generated will be "influ-no-text-n". This method of listing
facilitates the visual comparison of attribute values between classes.
For example, this command:
autoclass -reports sample/imports-85c.results-bin
sample/imports-85c.search sample/imports-85c.r-params
with this line in the ".r-params" file:
xref_class_report_att_list = 2, 5, 6
will generate these output files:
imports-85c.influ-o-text-1
imports-85c.case-text-1
imports-85c.class-text-1
The AutoClass C reports provide the capability to compute sigma class
contour values for specified pairs of real valued attributes, when
generating the influence values report with the data option
(report_mode = "data"). Note that sigma class contours are not
generated from discrete type attributes.
The sigma contours are the two dimensional equivalent of n-sigma error
bars in one dimension. Specifically, for two independent attributes
the n-sigma contour is defined as the ellipse where
((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n
With covariant attributes, the n-sigma contours are defined
identically, in the rotated coordinate system of the distribution’s
principal axes. Thus independent attributes give ellipses oriented
parallel with the attribute axes, while the axes of sigma contours of
covariant attributes are rotated about the center determined by the
means. In either case the sigma contour represents a line where the
class probability is constant, irrespective of any other class
probabilities.
With three or more attributes the n-sigma contours become k-dimensional
ellipsoidal surfaces. This code takes advantage of the fact that the
parallel projection of an n-dimensional ellipsoid, onto any 2-dim
plane, is bounded by an ellipse. In this simplified case of projecting
the single sigma ellipsoid onto the coordinate planes, it is also true
that the 2-dim covariances of this ellipse are equal to the
corresponding elements of the n-dim ellipsoid’s covariances. The
Eigen-system of the 2-dim covariance then gives the variances w.r.t.
the principal components of the ellipse, and the rotation that aligns
it with the data. This represents the best way to display a
distribution in the marginal plane.
To get contour values, set the keyword sigma_contours_att_list to a
list of real valued attribute indices (from .hd2 file), and request an
influence values report with the data option. For example,
report_mode = "data"
sigma_contours_att_list = 3, 4, 5, 8, 15
OUTPUT REPORT PARAMETERS
The contents of the output report are controlled by the ".r-params"
file. In this file, an empty line or a line starting with one of these
characters is treated as a comment: "#", "!", or ";". The parameter
name and its value can be separated by an equal sign, a space, or a
tab:
n_clsfs 1
n_clsfs = 1
n_clsfs<tab>1
Spaces are ignored if "=" or "<tab>" are used as separators. Note
there are no trailing semicolons.
The following are the allowed parameters and their default values:
n_clsfs = 1
number of clsfs in the .results file for which to generate
reports, starting with the first or "best".
clsf_n_list =
if specified, this is a one-based index list of clsfs in the
clsf sequence read from the .results file. It overrides
"n_clsfs". For example:
clsf_n_list = 1, 2
will produce the same output as
n_clsfs = 2
but
clsf_n_list = 2
will only output the "second best" classification report.
report_type =
type of reports to generate: "all", "influence_values",
"xref_case", or "xref_class".
report_mode =
mode of reports to generate. "text" is formatted text layout.
"data" is numerical -- suitable for further processing.
comment_data_headers_p = false
the default value does not insert # in column 1 of most
report_mode = "data" header lines. If specified as true, the
comment character will be inserted in most header lines.
num_atts_to_list =
if specified, the number of attributes to list in influence
values report. if not specified, all attributes will be listed.
(e.g. "num_atts_to_list = 5")
xref_class_report_att_list =
if specified, a list of attribute numbers (zero-based), whose
values will be output in the "xref_class" report along with the
case probabilities. if not specified, no attributes values will
be output. (e.g. "xref_class_report_att_list = 1, 2, 3")
order_attributes_by_influence_p = true
The default value lists each class’s attributes in descending
order of attribute influence value, and uses ".influ-o-text-n"
as the influence values report file type. If specified as
false, then each class’s attributes will be listed in ascending
order by attribute number. The extension of the file generated
will be "influ-no-text-n".
break_on_warnings_p = true
The default value asks the user whether to continue or not when
data definition warnings are found. If specified as false, then
AutoClass will continue, despite warnings -- the warning will
continue to be output to the terminal.
free_storage_p = true
The default value tells AutoClass to free the majority of its
allocated storage. This is not required, and in the case of the
DEC Alpha causes a core dump [is this still true?]. If
specified as false, AutoClass will not attempt to free storage.
max_num_xref_class_probs = 5
Determines how many lesser class probabilities will be printed
for the case and class cross-reference reports. The default is
to print the most probable class probability value and up to 4
lesser class probabilities. Note this is true for both the
"text" and "data" class cross-reference reports, but only true
for the "data" case cross-reference report. The "text" case
cross-reference report only has the most probable class
probability.
sigma_contours_att_list =
If specified, a list of real valued attribute indices (from .hd2
file) will be used to compute sigma class contour values, when
generating influence values report with the data option
(report_mode = "data"). If not specified, there will be no
sigma class contour output. (e.g. "sigma_contours_att_list = 3,
4, 5, 8, 15")
INTERPRETATION OF AUTOCLASS RESULTS
WHAT HAVE YOU GOT?
Now you have run AutoClass on your data set -- what have you got?
Typically, the AutoClass search procedure finds many classifications,
but only saves the few best. These are now available for inspection
and interpretation. The most important indicator of the relative
merits of these alternative classifications is the Log total posterior
probability value. Note that since the probability lies between 0 and
1, the corresponding Log probability is negative and ranges from 0 to
negative infinity. Raising e to the difference between two Log
probability values gives the relative probability of the alternative
classifications. So a difference of, say 100, implies one
classification is e^100 ~= 10^43 more likely than the other. However,
these numbers can be very misleading, since they give the relative
probability of alternative classifications under the AutoClass
assumptions.
ASSUMPTIONS
Specifically, the most important AutoClass assumptions are the use of
normal models for real variables, and the assumption of independence of
attributes within a class. Since these assumptions are often violated
in practice, the difference in posterior probability of alternative
classifications can be partly due to one classification being closer to
satisfying the assumptions than another, rather than to a real
difference in classification quality. Another source of uncertainty
about the utility of Log probability values is that they do not take
into account any specific prior knowledge the user may have about the
domain. This means that it is often worth looking at alternative
classifications to see if you can interpret them, but it is worth
starting from the most probable first. Note that if the Log
probability value is much greater than that for the one class case, it
is saying that there is overwhelming evidence for some structure in the
data, and part of this structure has been captured by the AutoClass
classification.
INFLUENCE REPORT
So you have now picked a classification you want to examine, based on
its Log probability value; how do you examine it? The first thing to
do is to generate an "influence" report on the classification using the
report generation facilities documented in
/usr/share/doc/autoclass/reports-c.text. An influence report is
designed to summarize the important information buried in the AutoClass
data structures.
The first part of this report gives the heuristic class "strengths".
Class "strength" is here defined as the geometric mean probability that
any instance "belonging to" class, would have been generated from the
class probability model. It thus provides a heuristic measure of how
strongly each class predicts "its" instances.
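As an illustration, the geometric mean can be computed from
per-instance likelihoods (a hedged Python sketch, not AutoClass's
internal code; the likelihood values are hypothetical):

    import math

    # Hypothetical probabilities P(instance | class model) for the
    # instances "belonging to" one class.
    likelihoods = [0.12, 0.08, 0.15, 0.10]

    # The heuristic class "strength" is the geometric mean, i.e.
    # exp of the mean of the log probabilities.
    strength = math.exp(sum(math.log(p) for p in likelihoods)
                        / len(likelihoods))
    print(strength)    # ~0.11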
The second part is a listing of the overall "influence" of each of the
attributes used in the classification. These give a rough heuristic
measure of the relative importance of each attribute in the
classification. Attribute "influence values" are a class probability
weighted average of the "influence" of each attribute in the classes,
as described below.
The next part of the report is a summary description of each of the
classes. The classes are arbitrarily numbered from 0 up to n, in order
of descending class weight. A class weight of, say, 34.1 means that
the sum of membership probabilities for that class is 34.1. Note that
a class weight of 34 does not necessarily mean that 34 cases belong to
that class, since many cases may have only partial membership in that
class. Within each class, attributes or attribute sets are ordered by
the "influence" of their model term.
CROSS ENTROPY
A commonly used measure of the divergence between two probability
distributions is the cross entropy: the sum over all possible values x,
of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the
distributions. It ranges from zero, for identical distributions, to
infinity for distributions placing probability 1 on differing values of
an attribute. With conditionally independent terms in the probability
distributions, the cross entropy can be factored into a sum over these
terms. These factors provide a measure of the corresponding modeled
attribute's influence in differentiating the two distributions.
We define the modeled term's "influence" on a class to be the cross
entropy term for the class distribution w.r.t. the global class
distribution of the single class classification. "Influence" is thus a
measure of how strongly the model term helps differentiate the class
from the whole data set. With independently modeled attributes, the
influence can legitimately be ascribed to the attribute itself. With
correlated or covariant attribute sets, the cross entropy factor is a
function of the entire set, and we distribute the influence value
equally over the modeled attributes.
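For a single independently modeled discrete attribute, the influence
is just this cross entropy sum (a minimal Python sketch following the
definition above; the distributions are hypothetical):

    import math

    # Hypothetical distributions over one discrete attribute:
    # class_probs  = P(x | c...) for the class,
    # global_probs = P(x | g...) for the single-class classification.
    class_probs  = [0.70, 0.20, 0.10]
    global_probs = [0.30, 0.40, 0.30]

    # The model term's "influence" on the class: the cross entropy of
    # the class distribution w.r.t. the global distribution.
    influence = sum(p * math.log(p / g)
                    for p, g in zip(class_probs, global_probs))
    print(influence)    # ~0.34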
ATTRIBUTE INFLUENCE VALUES
In the "influence" report on each class, the attribute parameters for
that class are given in order of highest influence value for the model
term attribute sets. Only the first few attribute sets usually have
significant influence values. If an influence value drops below about
20% of the highest value, then it is probably not significant, but all
attribute sets are listed for completeness. In addition to the
influence value for each attribute set, the values of the attribute set
parameters in that class are given along with the corresponding
"global" values. The global values are computed directly from the data
independent of the classification. For example, if the class mean of
attribute "temperature" is 90 with standard deviation of 2.5, but the
global mean is 68 with a standard deviation of 16.3, then this class
has selected out cases with much higher than average temperature, and a
rather small spread in this high range. Similarly, for discrete
attribute sets, the probability of each outcome in that class is given,
along with the corresponding global probability -- ordered by its
significance: the absolute value of (log {<local-probability> /
<global-probability>}). The sign of the significance value shows the
direction of change from the global class. This information gives an
overview of how each class differs from the average for all the data,
in order of the most significant differences.
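The outcome ordering can be sketched as follows (a Python illustration
with hypothetical probabilities, not AutoClass's report code):

    import math

    # Hypothetical (outcome, local probability, global probability)
    # triples for one discrete attribute in one class.
    outcomes = [("red", 0.70, 0.30),
                ("green", 0.20, 0.40),
                ("blue", 0.10, 0.30)]

    # significance = log(local / global); outcomes are ordered by its
    # absolute value, and the sign gives the direction of change.
    ranked = sorted(((name, math.log(p / g)) for name, p, g in outcomes),
                    key=lambda t: abs(t[1]), reverse=True)
    print(ranked)    # blue (-1.10), red (+0.85), green (-0.69)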
CLASS AND CASE REPORTS
Having gained a description of the classes from the "influence" report,
you may want to follow-up to see which classes your favorite cases
ended up in. Conversely, you may want to see which cases belong to a
particular class. For this kind of cross-reference information two
complementary reports can be generated. These are more fully
documented in /usr/share/doc/autoclass/reports-c.text. The "class"
report lists all the cases which have significant membership in each
class, and the degree to which each such case belongs to that class.
Cases whose membership in the current class is less than 90% have
their other class memberships listed as well. The cases within a class
are ordered by increasing case number. The alternative "cases" report
states which class (or classes) a case belongs to, and the membership
probability in the most probable class. These two reports allow you to
find which cases belong to which classes or the other way around. If
nearly every case has close to 99% membership in a single class, then
it means that the classes are well separated, while a high degree of
cross-membership indicates that the classes are heavily overlapped.
Highly overlapped classes are an indication that the idea of
classification is breaking down, and that groups of mutually highly
overlapped classes -- a kind of meta-class -- are probably a better
way of understanding the data.
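One quick way to gauge the separation is the fraction of cases whose
best class membership is nearly 1 (a hedged Python sketch; the
membership rows are hypothetical):

    # Hypothetical per-case membership probabilities in two classes.
    memberships = [[0.995, 0.005], [0.99, 0.01], [0.60, 0.40]]

    # Fraction of cases with >= 99% membership in their best class;
    # values near 1 suggest well separated classes.
    well_separated = (sum(max(row) >= 0.99 for row in memberships)
                      / len(memberships))
    print(well_separated)    # ~0.67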
COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
The class weight, given as the class probability parameter, is
essentially the sum over all data instances of the normalized
probability that the instance is a member of the class. It is probably
an error on our part that we format this number as an integer in the
report, rather than emphasizing its real nature. You will find the
actual real value recorded as the w_j parameter in the class_DS
structures in any .results[-bin] file.
The .case and .class reports give probabilities that cases are members
of classes. Any assignment of cases to classes requires some decision
rule. The maximum probability assignment rule is often implicitly
assumed, but it cannot be expected that the resulting partition sizes
will equal the class weights unless nearly all class membership
probabilities are effectively one or zero. With non-1/0 membership
probabilities, matching the class weights requires summing the
probabilities.
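The difference is easy to see in a small sketch (a hedged Python
illustration; the membership probabilities are hypothetical):

    # Hypothetical membership probabilities for four cases in two
    # classes; each row sums to 1.
    memberships = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.5, 0.5]]

    # Class weights: the sum of membership probabilities per class.
    weights = [sum(row[j] for row in memberships) for j in (0, 1)]
    print(weights)    # [2.55, 1.45]

    # Maximum probability assignment: the resulting partition sizes
    # [4, 0] do not match the class weights.
    counts = [0, 0]
    for row in memberships:
        counts[row.index(max(row))] += 1
    print(counts)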
In addition, there is the question of completeness of the EM
(expectation maximization) convergence. EM alternates between
estimating class parameters and estimating class membership
probabilities. These estimates converge on each other, but never
actually meet. AutoClass implements several convergence algorithms
with alternative stopping criteria, controlled by parameters in the
.s-params file. Proper setting of these parameters, to get reasonably
complete and efficient convergence, may require experimentation.
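To make the convergence issue concrete, the following self-contained
Python toy shows an EM loop for a two-class one-dimensional normal
mixture, stopping when the log likelihood improves by less than a
tolerance. It is an illustration only; AutoClass's actual convergence
algorithms and their controlling .s-params parameters are documented
in search-c.text.

    import math

    def normal_pdf(x, mu, sigma):
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2 * math.pi)))

    def em(data, tol=1e-6, max_cycles=500):
        mu, sigma, w = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]
        prev_ll = float("-inf")
        for _ in range(max_cycles):
            # E step: estimate class membership probabilities.
            resp = []
            for x in data:
                p = [w[j] * normal_pdf(x, mu[j], sigma[j]) for j in (0, 1)]
                s = sum(p) or 1e-300
                resp.append([pj / s for pj in p])
            # M step: re-estimate class weights and parameters.
            for j in (0, 1):
                nj = sum(r[j] for r in resp)
                w[j] = nj / len(data)
                mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - mu[j]) ** 2
                          for r, x in zip(resp, data)) / nj
                sigma[j] = max(math.sqrt(var), 1e-3)
            # The estimates converge but never meet: stop when the log
            # likelihood improves by less than the tolerance.
            ll = sum(math.log(sum(w[j] * normal_pdf(x, mu[j], sigma[j])
                                  for j in (0, 1))) for x in data)
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return w, mu, sigma

    # Example: two well-separated groups.
    print(em([1.0, 1.1, 0.9, 5.0, 5.2, 4.8]))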
ALTERNATIVE CLASSIFICATIONS
In summary, the various reports that can be generated give you a way of
viewing the current classification. It is usually a good idea to look
at alternative classifications even though they do not have the highest
Log probability values. These other classifications usually have
classes that correspond closely to strong classes in other
classifications, but can differ in the weak classes. The "strength" of
a class within a classification can usually be judged by how
dramatically the highest influence value attributes in the class differ
from the corresponding global attributes. If none of the
classifications seem quite satisfactory, it is always possible to run
AutoClass again to generate new classifications.
WHAT NEXT?
Finally, the question of what to do after you have found an insightful
classification arises. Usually, classification is a preliminary data
analysis step for examining a set of cases (things, examples, etc.) to
see if they can be grouped so that members of the group are "similar"
to each other. AutoClass gives such a grouping without the user having
to define a similarity measure. The built-in "similarity" measure is
the mutual predictiveness of the cases. The next step is to try to
"explain" why some objects are more like others than those in a
different group. Usually, domain knowledge suggests an answer. For
example, a classification of people based on income, buying habits,
location, age, etc., may reveal particular social classes that were not
obvious before the classification analysis. To learn more about such
classes, additional attributes, such as number of cars or what TV
shows are watched, could be collected and would reveal even more.
Longitudinal studies would give information about how
social classes arise and what influences their attitudes -- all of
which is going way beyond the initial classification.
PREDICTIONS
Classifications can be used to predict class membership for new cases.
So in addition to possibly giving you some insight into the structure
behind your data, you can now use AutoClass directly to make
predictions, and compare AutoClass to other learning systems.
This technique for predicting class probabilities is applicable to all
attributes, regardless of data type/sub_type or likelihood model term
type.
In the event that the class membership of a data case does not exceed
0.0099999 for any of the "training" classes, the following message will
appear in the screen output for each case:
xref_get_data: case_num xxx => class 9999
Class 9999 members will appear in the "case" and "class"
cross-reference reports with a class membership of 1.0.
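In effect, the cross-reference reports apply the following rule to
each test case (a Python sketch of the documented behavior; the
function and argument names are hypothetical, not AutoClass source):

    # If no training class gives the case a membership probability
    # above the threshold, the case is reported under pseudo-class
    # 9999 with a membership of 1.0.
    THRESHOLD = 0.0099999

    def assign(case_num, memberships):
        # memberships: hypothetical list of per-class probabilities
        best = max(range(len(memberships)), key=lambda j: memberships[j])
        if memberships[best] <= THRESHOLD:
            print("xref_get_data: case_num %d => class 9999" % case_num)
            return 9999, 1.0
        return best, memberships[best]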
Cautionary Points:
The usual way of using AutoClass is to put all of your data in a
data_file, describe that data with model and header files, and run
"autoclass -search". Now, instead of one data_file you will have two,
a training_data_file and a test_data_file.
It is most important that both databases have the same AutoClass
internal representation. Should this not be true, AutoClass will exit,
or possibly, in some situations, crash. The prediction mode is
designed to help direct the user into conforming to this requirement.
Preparation:
Prediction requires having a training classification and a test
database. The training classification is generated by running
"autoclass -search" on the training data_file
("data/soybean/soyc.db2"), for example:
autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
data/soybean/soyc.model data/soybean/soyc.s-params
This will produce "soyc.results-bin" and "soyc.search". Then create a
"reports" parameter file, such as "soyc.r-params" (see
/usr/share/doc/autoclass/reports-c.text), and run AutoClass in
"reports" mode, such as:
autoclass -reports data/soybean/soyc.results-bin
data/soybean/soyc.search data/soybean/soyc.r-params
This will generate class and case cross-reference files, and an
influence values file. The file names are based on the ".r-params"
file name:
data/soybean/soyc.class-text-1
data/soybean/soyc.case-text-1
data/soybean/soyc.influ-text-1
These will describe the classes found in the training_data_file. Now
this classification can be used to predict the probabilistic class
membership of the test_data_file cases
("data/soybean/soyc-predict.db2") in the training_data_file classes.
autoclass -predict data/soybean/soyc-predict.db2
data/soybean/soyc.results-bin data/soybean/soyc.search
data/soybean/soyc.r-params
This will generate class and case cross-reference files for the
test_data_file cases predicting their probabilistic class memberships
in the training_data_file classes. The file names are based on the
".db2" file name:
data/soybean/soyc-predict.class-text-1
data/soybean/soyc-predict.case-text-1
DOCUMENTATION
AutoClass is documented fully here:
/usr/share/doc/autoclass/introduction-c.text Guide to the documentation
/usr/share/doc/autoclass/preparation-c.text How to prepare data for use
by AutoClass
/usr/share/doc/autoclass/search-c.text How to run AutoClass to find
classifications.
/usr/share/doc/autoclass/reports-c.text How to examine the
classification in various ways.
/usr/share/doc/autoclass/interpretation-c.text How to interpret
AutoClass results.
/usr/share/doc/autoclass/checkpoint-c.text Protocols for running a
checkpointed search.
/usr/share/doc/autoclass/prediction-c.text Use classifications to
predict class membership for new cases.
These provide supporting documentation:
/usr/share/doc/autoclass/classes-c.text What classification is all
about, for beginners.
/usr/share/doc/autoclass/models-c.text Brief descriptions of the model
term implementations.
The mathematical theory behind AutoClass is explained in these
documents:
/usr/share/doc/autoclass/kdd-95.ps Postscript file containing: P.
Cheeseman, J. Stutz, "Bayesian Classification (AutoClass): Theory and
Results", in "Advances in Knowledge Discovery and Data Mining", Usama
M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy
Uthurusamy, Eds. The AAAI Press, Menlo Park, 1996.
/usr/share/doc/autoclass/tr-fia-90-12-7-01.ps Postscript file
containing: R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification
Theory", Technical Report FIA-90-12-7-01, NASA Ames Research Center,
Artificial Intelligence Branch, May 1991 (The figures are not included,
since they were inserted by "cut-and-paste" methods into the original
"camera-ready" copy.)
AUTHORS
Dr. Peter Cheeseman
Principal Investigator - NASA Ames, Computational Sciences Division
cheesem@ptolemy.arc.nasa.gov
John Stutz
Research Programmer - NASA Ames, Computational Sciences Division
stutz@ptolemy.arc.nasa.gov
Will Taylor
Support Programmer - NASA Ames, Computational Sciences Division
taylor@ptolemy.arc.nasa.gov
SEE ALSO
multimix(1).
December 9, 2001