autoclass - automatically discover classes in data

NAME

       autoclass - automatically discover classes in data

SYNOPSIS

       autoclass -search data_file header_file model_file s_param_file
       autoclass -report results_file search_file r_params_file
       autoclass -predict results_file search_file results_file

DESCRIPTION

       AutoClass  solves the problem of automatic discovery of classes in data
       (sometimes called clustering, or unsupervised  learning),  as  distinct
       from the generation of class descriptions from labeled examples (called
       supervised learning).  It aims to discover the "natural" classes in the
       data.   AutoClass  is  applicable to observations of things that can be
       described by a set of attributes, without referring  to  other  things.
       The  data  values  corresponding  to  each  attribute are limited to be
       either numbers or the elements of a fixed set of symbols.  With numeric
       data, a measurement error must be provided.

       AutoClass  is looking for the best classification(s) of the data it can
       find.  A classification is composed of:

       1)     A set of classes, each of which is described by a set  of  class
              parameters, which specify how the class is distributed along the
              various attributes.  For example, "height  normally  distributed
              with mean 4.67 ft and standard deviation .32 ft",

       2)     A  set of class weights, describing what percentage of cases are
              likely to be in each class.

       3)     A probabilistic  assignment  of  cases  in  the  data  to  these
              classes.   I.e.  for each case, the relative probability that it
              is a member of each class.

       As a strictly Bayesian system (accept  no  substitutes!),  the  quality
       measure  AutoClass  uses  is  the total probability that, had you known
       nothing about your data or its domain, you would have found this set of
       data  generated  by  this  underlying  model.   This includes the prior
       probability that the "world" would have chosen this number of  classes,
       this set of relative class weights, and this set of parameters for each
       class, and the likelihood  that  such  a  set  of  classes  would  have
       generated this set of values for the attributes in the data cases.

       These probabilities are typically very small, in the range of e^-30000,
       and so are usually expressed in exponential notation.

       When  run  with  the  -search  command,  AutoClass   searches   for   a
       classification.  The required arguments are the paths to the four input
       files,  which  supply  the  data,  the   data   format,   the   desired
       classification model, and the search parameters, respectively.

       By  default,  AutoClass  writes  intermediate results in a binary file.
       With the -report command, AutoClass generates  an  ASCII  report.   The
       arguments are the full path names  of  the  .results,  .search,  and
       .r-params files.

       When run with  the  -predict  command,  AutoClass  predicts  the  class
       membership  of a "test" data set based on classes found in a "training"
       data set (see "PREDICTIONS" below).

INPUT FILES

An AutoClass data set resides in two files. There is a header file
(file type "hd2") that describes the specific data format and attribute
definitions. The actual data values are in a data file (file type
"db2"). We use two files to allow editing of data descriptions without
having to deal with the entire data set. This makes it easy to
experiment with different descriptions of the database without having
to reproduce the data set. Internally, an AutoClass database structure
is identified by its header and data files, and the number of data
loaded.

For more detailed information on the formats of these files, see
/usr/share/doc/autoclass/preparation-c.text.

DATA FILE
The data file contains a sequence of data objects (datum or case)
terminated by the end of the file. The number of values for each data
object must be equal to the number of attributes defined in the header
file. Data objects must be groups of tokens delimited by "new-line".
Attributes are typed as REAL, DISCRETE, or DUMMY. Real attribute
values are numbers, either integer or floating point. Discrete
attribute values can be strings, symbols, or integers. A dummy
attribute value can be any of these types. Dummies are read in but
otherwise ignored -- they will be set to zeros in the internal
database. Thus the actual values will not be available for use in
report output. To have these attribute values available, use either
type REAL or type DISCRETE, and define their model type as IGNORE in
the .model file. Missing values for any attribute type may be
represented by either "?" or another token specified in the header file.
All are translated to a special unique value after being read, so this
symbol is effectively reserved for unknown/missing values.

For example:
white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 ? 8 2 1
all_white 28.953982 0.5267696 0 1 1
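
The rules above can be sketched as a small reader. This is an
illustration only, not the AutoClass parser: the function names are
hypothetical, and it assumes the default separator (blank), comment
character (";") and unknown token ("?").

```python
def parse_cases(text, separator=None, unknown_token="?", comment_char=";"):
    """Split data-file text into cases; unknown tokens become None."""
    cases = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue  # skip blank lines and comment lines
        tokens = [t.strip() for t in line.split(separator)]
        cases.append([None if t == unknown_token else t for t in tokens])
    return cases


def check_widths(cases, number_of_attributes):
    """Return the indices of cases whose value count differs from the header."""
    return [i for i, case in enumerate(cases)
            if len(case) != number_of_attributes]


SAMPLE = """white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 ? 8 2 1
all_white 28.953982 0.5267696 0 1 1"""

cases = parse_cases(SAMPLE)
bad_rows = check_widths(cases, number_of_attributes=6)
```

Running a check like this before a search catches rows whose value
count does not match number_of_attributes in the header file.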

HEADER FILE
The header file specifies the data file format, and the definitions of
the data attributes. The functional specification of the header file
consists of two parts -- the data set format definition specifications,
and the attribute descriptors. ";" in column 1 identifies a comment.

A header file follows this general format:

;; num_db2_format_defs value (number of format def lines
;; that follow), range of n is 1 -> 4
num_db2_format_defs n
;; number_of_attributes token and value required
number_of_attributes <as required>
;; following are optional - default values are specified
separator_char ' '
comment_char ';'
unknown_token '?'
separator_char ','

;; attribute descriptors
;; <zero-based att#> <att_type> <att_sub_type> <att_description>
;; <att_param_pairs>

Each attribute descriptor is a line of:

Attribute index (zero based, beginning in column 1)
Attribute type. See below.
Attribute subtype. See below
Attribute description: symbol (no embedded blanks) or
string; <= 40 characters
Specific property and value pairs.
Currently available combinations:

type       subtype    property type(s)
----       -------    ----------------
dummy      none/nil   --
discrete   nominal    range
real       location   error
real       scalar     zero_point rel_error

The ERROR property should represent your best estimate of the average
error expected in the measurement and recording of that real attribute.
Lacking better information, the error can be taken as 1/2 the minimum
possible difference between measured values. It can be argued that
real values are often truncated, so that smaller errors may be
justified, particularly for generated data. But AutoClass only sees
the recorded values. So it needs the error in the recorded values,
rather than the actual measurement error. Setting this error much
smaller than the minimum expressible difference implies the possibility
of values that cannot be expressed in the data. Worse, it implies that
two identical values must represent measurements that were much closer
than they might actually have been. This leads to over-fitting of the
classification.
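
As a rough illustration of the "1/2 the minimum possible difference"
rule, the sketch below infers the recording resolution from the number
of decimal places in the recorded tokens. That inference is an
assumption of this example (it is not something AutoClass does), and
the function name is hypothetical.

```python
from decimal import Decimal


def half_resolution_error(tokens):
    """Half of the smallest step expressible in the recorded tokens.

    Assumes the finest number of decimal places seen in the data is the
    recording resolution of the attribute.
    """
    exponents = [Decimal(t).as_tuple().exponent for t in tokens]
    finest = min(exponents)          # e.g. -2 for values such as "25.25"
    return Decimal(1).scaleb(finest) / 2


error = half_resolution_error(["25.25", "32.40", "28.95"])
```

A better estimate of the true measurement and recording error, when
one is known, should take precedence over this default.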

The REL_ERROR property is used for SCALAR reals when the error is
proportional to the measured value. The ERROR property is not
supported for SCALAR reals.

AutoClass uses the error as a lower bound on the width of the normal
distribution. So small error estimates tend to give narrower peaks and
to increase both the number of classes and the classification
probability. Broad error estimates tend to limit the number of
classes.

The scalar ZERO_POINT property is the smallest value that the
measurement process could have produced. This is often 0.0, or less by
some error range. Similarly, the bounded real’s min and max properties
are exclusive bounds on the attribute’s generating process. For a
calculated percentage these would be 0-e and 100+e, where e is an error
value. The discrete attribute’s range is the number of possible values
the attribute can take on. This range must include unknown as a value
when such values occur.

Header File Example:

!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)

;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified
;; separator_char ' '
;; comment_char ';'
;; unknown_token '?'
separator_char ','

;; <zero-based att#> <att_type> <att_sub_type> <att_description>
;; <att_param_pairs>
0 dummy nil "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
4 discrete nominal "Truth value, range = 1 - 2" range 2
5 discrete nominal "Color of foobar, 10 values" range 10
6 discrete nominal Spectral_color_group range 6
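
To make the layout of an attribute descriptor concrete, here is a
minimal sketch that splits one descriptor line into its fields. It is
not the AutoClass reader; the function name and the returned
dictionary keys are hypothetical, and it assumes each descriptor sits
on a single line with numeric property values.

```python
import shlex


def parse_attribute(line):
    """Split one attribute descriptor line into its fields."""
    tokens = shlex.split(line)       # keeps a quoted description together
    index, att_type, subtype, description = tokens[:4]
    rest = tokens[4:]
    if len(rest) % 2:
        raise ValueError("property names and values must come in pairs")
    params = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return {
        "index": int(index),
        "type": att_type,
        "subtype": subtype,
        "description": description,
        "params": params,
    }


weight = parse_attribute(
    '3 real scalar "Weight, kg. in range of 5.0 - 10.0" '
    'zero_point 0.0 rel_error .001'
)
label = parse_attribute('0 dummy nil "True class, range = 1 - 3"')
```

shlex follows shell quoting rules, so a description containing an
unpaired apostrophe would need different handling.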

MODEL FILE
A classification of a data set is made with respect to a model which
specifies the form of the probability distribution function for classes
in that data set. Normally the model structure is defined in a model
file (file type "model"), containing one or more models. Internally, a
model is defined relative to a particular database. Thus it is
identified by the corresponding database, the model’s model file and
its sequential position in the file.

Each model is specified by one or more model group definition lines.
Each model group line associates attribute indices with a model term
type.

Here is an example model file:

# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default

Here, the first line is a comment. The following characters in column
1 make the line a comment: ‘!’, ‘#’, ‘ ’, ‘;’, and ‘\n’ (empty line).

The tokens "model_index n m" must appear on the first non-comment line,
and precede the model term definition lines. n is the zero-based model
index, typically 0 where there is only one model -- the majority of
search situations. m is the number of model term definition lines that
follow.

The last seven lines are model group lines. Each model group line
consists of:

A model term type (one of single_multinomial, single_normal_cm,
single_normal_cn, multi_normal_cn, or ignore).

A list of attribute indices (the attribute set list), or the symbol
default. Attribute indices are zero-based. Single model terms may
have one or more attribute indices on each line, while multi model
terms require two or more attribute indices per line. An attribute
index must not appear more than once in a model list.

Notes:

1) At least one model definition is required (model_index token).

2) There may be multiple entries in a model for any model term
type.

3) Model term types currently consist of:

single_multinomial
models discrete attributes as multinomials, with missing
values.

single_normal_cn
models real valued attributes as normals; no missing
values.

single_normal_cm
models real valued attributes with missing values.

multi_normal_cn
is a covariant normal model without missing values.

ignore
allows the model to ignore one or more attributes.
ignore is not a valid default model term type.

See the documentation in models-c.text for further information
about specific model terms.

4) Single_normal_cn, single_normal_cm, and multi_normal_cn modeled
data, whose subtype is scalar (value distribution is away from
0.0, and is thus not a "normal" distribution) will be log
transformed and modeled with the log-normal model. For data
whose subtype is location (value distribution is around 0.0), no
transform is done, and the normal model is used.
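
The structural rules above can be checked mechanically. The sketch
below is an illustration, not the validation AutoClass itself
performs; the function name is hypothetical and it assumes a file
holding a single model.

```python
TERM_TYPES = {"single_multinomial", "single_normal_cm", "single_normal_cn",
              "multi_normal_cn", "ignore"}
COMMENT_STARTS = ("!", "#", ";", " ")


def check_model(text):
    """Return a list of problems found in a one-model file (empty if none)."""
    lines = [ln for ln in text.splitlines()
             if ln and not ln.startswith(COMMENT_STARTS)]
    if not lines:
        return ["no model_index line found"]
    header = lines[0].split()
    if len(header) != 3 or header[0] != "model_index":
        return ["first non-comment line must be: model_index n m"]
    problems = []
    terms = lines[1:]
    if len(terms) != int(header[2]):
        problems.append(
            f"model_index declares {header[2]} lines, found {len(terms)}")
    seen = set()
    for line in terms:
        kind, *args = line.split()
        if kind not in TERM_TYPES:
            problems.append(f"unknown model term type: {kind}")
        elif args == ["default"]:
            if kind == "ignore":
                problems.append("ignore is not a valid default model term type")
        else:
            indices = [int(a) for a in args]
            if kind.startswith("multi_") and len(indices) < 2:
                problems.append(f"{kind} needs two or more attribute indices")
            for index in indices:
                if index in seen:
                    problems.append(f"attribute {index} appears more than once")
                seen.add(index)
    return problems


EXAMPLE = """# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default"""

problems = check_model(EXAMPLE)
```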

SEARCHING

       AutoClass, when invoked in "search" mode, will check  the  validity  of
       the set of data, header, model, and  search  parameter  files.   Errors
       will  stop  the  search  from  starting, and warnings will ask the user
       whether to continue.  A history of the error and  warning  messages  is
       saved, by default, in the log file.

       Once  you have succeeded in describing your data with a header file and
       model file that passes the AUTOCLASS -SEARCH <...>  input  checks,  you
       will  have  entered  the  search domain where AutoClass classifies your
       data.  (At last!)

       The main function to use in finding a good classification of your  data
       is  AUTOCLASS  -SEARCH,  and using it will take most of the computation
       time.  Searches are invoked with:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       All files must be specified as fully  qualified  relative  or  absolute
       pathnames.   File name extensions (file types) for all files are forced
       to canonical values required by the AutoClass program:

               data file   ("ascii")   db2
               data file   ("binary")  db2-bin
               header file             hd2
               model file              model
               search params file      s-params

       The sample-run  (/usr/share/doc/autoclass/examples/)  that  comes  with
       AutoClass  shows  some  sample searches, and browsing these is probably
       the fastest way to get familiar with how to do searches.  The test data
       sets  located  under  /usr/share/doc/autoclass/examples/  will show you
       some other header (.hd2), model (.model), and search params (.s-params)
       file  setups.   The  remainder  of  this  section  describes  how to do
       searches in somewhat more detail.

       The  bold  faced  tokens  below  are  generally  search   params   file
       parameters.   For  more  information  on  the s-params file, see SEARCH
       PARAMETERS below, or /usr/share/doc/autoclass/search-c.text.gz.

   WHAT RESULTS ARE
       AutoClass is looking for the best classification(s) of the data it  can
       find.  A classification is composed of:

       1)     a  set  of classes, each of which is described by a set of class
              parameters, which specify how the class is distributed along the
              various  attributes.   For example, "height normally distributed
              with mean 4.67 ft and standard deviation .32 ft",

       2)     a set of class weights, describing what percentage of cases  are
              likely to be in each class.

       3)     a  probabilistic  assignment  of  cases  in  the  data  to these
              classes.  I.e. for each case, the relative probability  that  it
              is a member of each class.

       As  a  strictly  Bayesian  system (accept no substitutes!), the quality
       measure AutoClass uses is the total probability  that,  had  you  known
       nothing about your data or its domain, you would have found this set of
       data generated by this  underlying  model.   This  includes  the  prior
       probability  that the "world" would have chosen this number of classes,
       this set of relative class weights, and this set of parameters for each
       class,  and  the  likelihood  that  such  a  set  of classes would have
       generated this set of values for the attributes in the data cases.

       These probabilities are typically very small, in the range of e^-30000,
       and so are usually expressed in exponential notation.

   WHAT RESULTS MEAN
       It  is  important to remember that all of these probabilities are GIVEN
       that the  real  model  is  in  the  model  family  that  AutoClass  has
       restricted  its  attention  to.   If  AutoClass is looking for Gaussian
       classes and the real classes are Poisson, then the fact that  AutoClass
       found  5  Gaussian  classes  may  not  say  much about how many Poisson
       classes there really are.

       The relative probability between different classifications found can be
       very  large,  like  e^1000,  so  the  very best classification found is
       usually overwhelmingly more probable than the rest (and  overwhelmingly
       less probable than any better classifications as yet undiscovered).  If
       AutoClass should manage to find two  classifications  that  are  within
       about exp(5-10) of each other (i.e. one is only about 150 to 22,000 times
       more probable than the other), then you should consider them to be about
       equally probable,
       as  our  computation  is  usually  not  more  accurate  than  this (and
       sometimes much less).
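
       Because probabilities of this size underflow ordinary floating
       point, comparisons are easiest on the logarithms. The sketch
       below is an illustration only; the function name is hypothetical
       and the tolerance of 5 is an assumption taken from the low end
       of the exp(5-10) range quoted above.

```python
import math


def about_equally_probable(log_p_a, log_p_b, tolerance=5.0):
    """True when two log probabilities differ by no more than tolerance."""
    return abs(log_p_a - log_p_b) <= tolerance


# exp(-30000) is far below the smallest double and underflows to zero,
# so the raw probabilities themselves cannot be compared.
underflowed = math.exp(-30000.0)

# The ratio between two classifications is the difference of their logs.
close_pair = about_equally_probable(-30000.0, -30003.0)
distant_pair = about_equally_probable(-30000.0, -31000.0)

# Size of the exp(5-10) band expressed as plain ratios.
low_ratio = math.exp(5)
high_ratio = math.exp(10)
```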

   HOW IT WORKS
       AutoClass repeatedly creates a random classification and then tries  to
       massage  this  into  a  high  probability  classification through local
       changes, until it converges to some "local maximum".  It then remembers
       what  it  found  and starts over again, continuing until you tell it to
       stop.  Each effort is called a "try", and the computed  probability  is
       intended  to  cover  the  whole  volume  in parameter space around this
       maximum, rather than just the peak.

       The standard approach to massaging is to

       1)     Compute the probabilistic class memberships of cases  using  the
              class parameters and the implied relative likelihoods.

       2)     Using  the  new  class  members,  compute class statistics (like
              mean) and revise the class parameters.

       and  repeat  till  they  stop  changing.   There  are  three  available
       convergence     algorithms:    "converge_search_3"    (the    default),
       "converge_search_4" and "converge".  Their specification is  controlled
       by search params file parameter try_fn_type.
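
       The two-step loop can be sketched for a single real attribute
       with normally distributed classes. This is a plain
       maximum-likelihood illustration, not AutoClass's Bayesian
       implementation; the names em_try, error and tol are assumptions
       of the example. The error argument plays the role described
       under HEADER FILE: a lower bound on the width of each normal.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of a normal distribution at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_try(data, means, sigmas, weights, error=0.05, max_cycles=200, tol=1e-9):
    """Alternate the two steps until the class means stop changing."""
    n_classes = len(means)
    for _ in range(max_cycles):
        # Step 1: probabilistic class memberships from current parameters.
        memberships = []
        for x in data:
            likes = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)]
            total = sum(likes)
            if total == 0.0:
                # Every likelihood underflowed: use equal memberships.
                memberships.append([1.0 / n_classes] * n_classes)
            else:
                memberships.append([like / total for like in likes])

        # Step 2: class statistics from the memberships, then new parameters.
        new_means, new_sigmas, new_weights = [], [], []
        for j in range(n_classes):
            w_j = sum(row[j] for row in memberships)
            if w_j == 0.0:
                # Empty class: keep its old parameters, give it no weight.
                new_means.append(means[j])
                new_sigmas.append(sigmas[j])
                new_weights.append(0.0)
                continue
            mean = sum(row[j] * x for row, x in zip(memberships, data)) / w_j
            var = sum(row[j] * (x - mean) ** 2
                      for row, x in zip(memberships, data)) / w_j
            new_means.append(mean)
            new_sigmas.append(max(math.sqrt(var), error))  # width floor
            new_weights.append(w_j / len(data))

        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means, sigmas, weights = new_means, new_sigmas, new_weights
        if shift < tol:
            break
    return means, sigmas, weights


data = [0.9, 0.95, 1.0, 1.05, 1.1, 4.9, 4.95, 5.0, 5.05, 5.1]
means, sigmas, weights = em_try(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```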

   WHEN TO STOP
       You can tell AUTOCLASS -SEARCH to stop by: 1) giving a max_duration (in
       seconds) argument  at  the  beginning;  2)  giving  a  max_n_tries  (an
       integer)  argument at the beginning; or 3) by typing a "q" and <return>
       after you have seen enough tries.   The  max_duration  and  max_n_tries
       arguments  are  useful  if you desire to run AUTOCLASS -SEARCH in batch
       mode.  If you are restarting AUTOCLASS -SEARCH from a previous  search,
       the  value  of  max_n_tries  you provide, for instance 3, will tell the
       program to compute 3 more tries in addition  to  however  many  it  has
       already   done.    The   same  incremental  behavior  is  exhibited  by
       max_duration.

       Deciding when to stop is a judgment call and it’s up to you.  Since the
       search  includes  a random component, there’s always the chance that if
       you let it keep going it will find something better.  So  you  need  to
       trade  off  how  much better it might be with how long it might take to
       find it.  The search status reports that are printed when  a  new  best
       classification is found are intended to provide you information to help
       you make this tradeoff.

       One clear sign that  you  should  probably  stop  is  if  most  of  the
       classifications found are duplicates of previous ones (flagged by "dup"
       as they are found).  This should only happen for  very  small  sets  of
       data or when fixing a very small number of classes, like two.

       Our  experience  is  that  for moderately large to extremely large data
       sets (~200 to ~10,000 cases), it is necessary to run AutoClass  for  at
       least 50 trials.

   WHAT GETS RETURNED
       Just  before  returning, AUTOCLASS -SEARCH will give short descriptions
       of the best classifications found.  How many will be described  can  be
       controlled with n_final_summary.

       By  default AUTOCLASS -SEARCH will write out a number of files, both at
       the end and periodically during the search (in case your system crashes
       before  it  finishes).   These files will all have the same name (taken
       from the search params pathname [<name>.s-params]), and differ only  in
       their  file extensions.  If your search runs are very long and there is
       a possibility that your machine may crash, you  can  have  intermediate
       "results"  files written out.  These can be used to restart your search
       run with minimum loss of search effort.   See  the  documentation  file
       /usr/share/doc/autoclass/checkpoint-c.text.

       A  ".log"  file  will hold a listing of most of what was printed to the
       screen during the run, unless you set log_file_p to false  to  say  you
       want  no  such  foolishness.   Unless results_file_p is false, a binary
       ".results-bin" file (the default) or an  ASCII  ".results"  text  file,
       will  hold  the  best  classifications  that  were returned, and unless
       search_file_p is false, a ".search" file will hold the  record  of  the
       search  tries.  save_compact_p controls whether the "results" files are
       saved as binary or ASCII text.

       If the C global variable "G_safe_file_writing_p" is defined as TRUE  in
       "autoclass-c/prog/globals.c",  the names of "results" files (those that
       contain the saved classifications) are modified internally  to  account
       for  redundant  file  writing.   If  the  search  params  file  name is
       "my_saved_clsfs" you  will  see  the  following  "results"  file  names
       (ignoring directories and pathnames for this example)

         save_compact_p = true --
         "my_saved_clsfs.results-bin"     - completely written file
         "my_saved_clsfs.results-tmp-bin" - partially written file, renamed
                             when complete

         save_compact_p = false --
         "my_saved_clsfs.results"    - completely written file
         "my_saved_clsfs.results-tmp"  - partially written file, renamed
                             when complete

       If check pointing is being done, these additional names will appear

         save_compact_p = true --
         "my_saved_clsfs.chkpt-bin"  - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
                                renamed when complete
         save_compact_p = false --
         "my_saved_clsfs.chkpt" - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp"    - partially written checkpoint file,
                                renamed when complete
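
       The naming scheme listed above can be summarized as a small
       helper. It is an illustration only; the function and key names
       are hypothetical, and it simply reproduces the names shown for
       the "my_saved_clsfs" example.

```python
from pathlib import Path


def results_file_names(s_params_path, save_compact_p=True, checkpoint=False):
    """Derive the "results" file names from the search params file name."""
    stem = Path(s_params_path).name
    if stem.endswith(".s-params"):
        stem = stem[: -len(".s-params")]
    suffix = "-bin" if save_compact_p else ""   # binary files end in -bin
    names = {
        "complete": f"{stem}.results{suffix}",
        "partial": f"{stem}.results-tmp{suffix}",
    }
    if checkpoint:
        names["chkpt_complete"] = f"{stem}.chkpt{suffix}"
        names["chkpt_partial"] = f"{stem}.chkpt-tmp{suffix}"
    return names


binary_names = results_file_names("runs/my_saved_clsfs.s-params")
ascii_names = results_file_names("my_saved_clsfs.s-params",
                                 save_compact_p=False, checkpoint=True)
```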

   HOW TO GET STARTED
       The way to invoke AUTOCLASS -SEARCH is:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       To  restart  a previous search, specify that force_new_search_p has the
       value false in the search params  file,  since  its  default  is  true.
       Specifying  false  tells  AUTOCLASS  -SEARCH  to try to find a previous
       compatible search  (<...>.results[-bin]  &  <...>.search)  to  continue
       from,  and  will  restart  using  it  if  found.  To force a new search
       instead of restarting an old one, give the parameter force_new_search_p
       the  value of true, or use the default.  If there is an existing search
       (<...>.results[-bin] & <...>.search), the user will be asked to confirm
       continuation since continuation will discard the existing search.

       If a previous search is continued, the message "RESTARTING SEARCH" will
       be given instead of the usual  "BEGINNING  SEARCH".   It  is  generally
       better  to  continue  a previous search than to start a new one, unless
       you are trying a significantly different search method, in  which  case
       statistics from the previous search may mislead the current one.

   STATUS REPORTS
       A running commentary on the search will be printed to the screen and to
       the log file (unless log_file_p is false).  Note that the  ".log"  file
       will  contain  a  listing  of all default search params values, and the
       values of all params that are overridden.

       After each try a very short report (only  a  few  characters  long)  is
       given.   After  each new best classification, a longer report is given,
       but no more often than min_report_period (default is 30 seconds).

   SEARCH VARIATIONS
       AUTOCLASS -SEARCH by default uses a certain standard search  method  or
       "try  function"  (try_fn_type  =  "converge_search_3").  Two others are
       also available: "converge_search_4" and "converge".  They are  provided
       in  case  your problem is one that may happen to benefit from them.  In
       general  the   default   method   will   result   in   finding   better
       classifications  at  the  expense of a longer search time.  The default
       was chosen so as to be robust,  giving  even  performance  across  many
       problems.   The  alternatives  to  the  default  may  do better on some
       problems, but may do substantially worse on others.

       "converge_search_3"    uses    an    absolute    stopping     criterion
       (rel_delta_range, default value of 0.0025) which tests the variation of
       each class of the delta of the log  approximate-marginal-likelihood  of
       the    class    statistics   with-respect-to   the   class   hypothesis
       (class->log_a_w_s_h_j) divided by the class weight (class->w_j) between
       successive  convergence  cycles.   Increasing  this  value  loosens the
       convergence and reduces the number of cycles.   Decreasing  this  value
       tightens  the convergence and increases the number of cycles. n_average
       (default value of 3) specifies how many successive cycles must meet the
       stopping criterion before the trial terminates.

       "converge_search_4"     uses    an    absolute    stopping    criterion
       (cs4_delta_range, default value of 0.0025) which tests the variation of
       each  class  of  the  slope for each class of log approximate-marginal-
       likelihood of the class statistics with-respect-to the class hypothesis
       (class->log_a_w_s_h_j)  divided  by  the class weight (class->w_j) over
       sigma_beta_n_values (default value 6) convergence  cycles.   Increasing
       the  value  of  cs4_delta_range loosens the convergence and reduces the
       number of cycles.  Decreasing this value tightens the  convergence  and
       increases  the number of cycles.  Computationally, this try function is
       more expensive than "converge_search_3", but may prove  useful  if  the
       computational  "noise" is significant compared to the variations in the
       computed  values.   Key  calculations  are  done  in  double  precision
       floating  point,  and for the largest data base we have tested so far
       (5,420 cases of 93 attributes),  computational  noise  has  not  been  a
       problem,  although  the  value  of max_cycles needed to be increased to
       400.

       "converge" uses one of two absolute stopping criteria which  test  the
       variation  of  the classification (clsf) log_marginal (clsf->log_a_x_h)
       delta between successive convergence cycles.  The largest of halt_range
       (default  value  0.5)  and (halt_factor * current_clsf_log_marginal) is
       used (default value of halt_factor is 0.0001).  Increasing these values
       loosens  the  convergence and reduces the number of cycles.  Decreasing
       these values tightens the  convergence  and  increases  the  number  of
       cycles.   n_average (default value of 3) specifies how many cycles must
       meet the stopping criteria before the trial terminates.  This is a very
       approximate  stopping  criterion,  but  will give you some feel for the
       kind  of  classifications  to  expect.   It   would   be   useful   for
       "exploratory" searches of a data base.

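       The "converge" stopping rule can be sketched as follows. This is
       one reading of the description above, not the AutoClass source:
       the function name is hypothetical, the absolute value of the log
       marginal is assumed (log marginals are large and negative), and
       the n_average cycles are assumed to be consecutive.

```python
def converge_should_stop(log_marginals, halt_range=0.5, halt_factor=0.0001,
                         n_average=3):
    """True when the last n_average cycle-to-cycle deltas are all small."""
    if len(log_marginals) < n_average + 1:
        return False                      # not enough cycles to judge yet
    recent = log_marginals[-(n_average + 1):]
    for previous, current in zip(recent, recent[1:]):
        # The larger of the fixed range and the scaled log marginal is used.
        threshold = max(halt_range, halt_factor * abs(current))
        if abs(current - previous) >= threshold:
            return False
    return True


history = [-30100.0, -30010.0, -30001.0, -30000.5, -30000.2]
still_moving = converge_should_stop(history)
settled = converge_should_stop(history + [-30000.1])
```

       With a log marginal near -30000 the scaled term (about 3.0)
       exceeds halt_range, which shows why this criterion is loose for
       large data sets.
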
       The  purpose of reconverge_type = "chkpt" is to complete an interrupted
       classification by continuing from its last checkpoint.  The purpose  of
       reconverge_type  =  "results"  is  to attempt further refinement of the
       best completed classification using a different  value  of  try_fn_type
       ("converge_search_3", "converge_search_4", "converge").  If max_n_tries
       is greater than 1, then in  each  case,  after  the  reconvergence  has
       completed,  AutoClass  will  perform further search trials based on the
       parameter values in the <...>.s-params file.

       With the use of reconverge_type (default value ""), you may apply  more
       than  one  try  function to a classification.  Say you generate several
       exploratory trials using try_fn_type = "converge", and quit the  search
       saving  .search  and  .results[-bin] files.  Then you can begin another
       search  with  try_fn_type  =  "converge_search_3",  reconverge_type   =
       "results",  and  max_n_tries  =  1.   This  will  result in the further
       convergence of the best classification  generated  with  try_fn_type  =
       "converge",  with  try_fn_type  =  "converge_search_3".  When AutoClass
       completes  this  search  try,  you  will  have  an  additional  refined
       classification.

       A good way to verify that any of the alternate  try_fn_type  values  is
       generating a well converged  classification  is  to  run  AutoClass  in
       prediction   mode   on   the   same   data   used  for  generating  the
       classification.  Then generate and compare the  corresponding  case  or
       class  cross  reference  files  for the original classification and the
       prediction.  Small differences between these files are to be  expected,
       while  large  differences indicate incomplete convergence.  Differences
       between such file pairs should, on average and modulo class  deletions,
       decrease monotonically with further convergence.

       The  standard  way  to create a random classification to begin a try is
       with the default value of "random" for start_fn_type.  There is currently
       no  other  random  start  function.   Specifying "block" for start_fn_type
       produces repeatable non-random searches.  That is how the <..>.s-params
       files  in  the autoclass-c/data/.. sub-directories are specified.  This
       is how development testing is done.

       max_cycles controls the maximum number of convergence cycles that  will
       be  performed  in  any  one  trial  by  the convergence functions.  Its
       default value is 200.  The screen output shows a period (".") for  each
       cycle  completed. If your search trials run for 200 cycles, then either
       your data base is very complex (increase the value), or the try_fn_type
       is not adequate for the situation (try another of the available ones, and
       use converge_print_p to get more information on what is going on).

       Specifying converge_print_p to be true will generate a brief  print-out
       for  each  cycle  which will provide information so that you can modify
       the   default   values   of    rel_delta_range    &    n_average    for
       "converge_search_3";    cs4_delta_range   &   sigma_beta_n_values   for
       "converge_search_4"; and halt_range,  halt_factor,  and  n_average  for
       "converge".   Their default values are given in the <..>.s-params files
       in the autoclass-c/data/..  sub-directories.

   HOW MANY CLASSES?
       Each new try begins with a certain number of classes  and  may  end  up
       with a smaller number, as some classes may drop out of the convergence.
       In general, you want to begin the try with some number of classes  that
       previous  tries  have indicated look promising, and you want to be sure
       you are fishing around elsewhere in case you missed something before.

       n_classes_fn_type = "random_ln_normal" is the default way to make  this
       choice.   It fits a log normal to the number of classes (usually called
       "j" for short) of  the  10  best  classifications  found  so  far,  and
       randomly selects from that.  There is currently no alternative.
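
       The idea behind "random_ln_normal" can be sketched in a few lines of
       Python.  This is an illustration only, not the AutoClass C code: the
       function name, the floor of 0.5 on the spread, and the rounding rule
       are all assumptions.

```python
import math
import random

def pick_n_classes(best_js, rng=random):
    """Draw a starting class count from a log-normal fitted to best_js.

    best_js: class counts (j) of the best classifications found so far.
    Hypothetical helper; AutoClass C may fit and sample differently.
    """
    logs = [math.log(j) for j in best_js]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    # Assumed floor on the spread, so that sampling still explores when
    # all of the best classifications share the same j.
    sigma = max(math.sqrt(var), 0.5)
    j = round(math.exp(rng.gauss(mean, sigma)))
    return max(1, j)
```

       Sampling in log space keeps the draw positive and lets it reach well
       above the current best j, which matches the stated aim of fishing
       around elsewhere.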

       To  start  the game off, the default is to go down start_j_list for the
       first few tries, and then switch to n_classes_fn_type.  If you  believe
       that  the  probable number of classes in your data base is say 75, then
       instead of using the default value of start_j_list (2, 3, 5, 7, 10, 15,
       25), specify something like 50, 60, 70, 80, 90, 100.

       If  one  wants  to  always  look  for,  say, three classes, one can use
       fixed_j and override the above.  Search status  reports  will  describe
       what the current method for choosing j is.

   DO I HAVE ENOUGH MEMORY AND DISK SPACE?
       Internally, the storage requirements in the current system are of order
       n_classes_per_clsf  *  (n_data  +  n_stored_clsfs  *   n_attributes   *
       n_attribute_values).   This  depends on the number of cases, the number
       of attributes, the values per attribute (use 2 if a  real  value),  and
       the  number  of  classifications  stored  away for comparison to see if
       others are duplicates -- controlled by  max_n_store  (default  value  =
       10).   The  search  process does not itself consume significant memory,
       but storage of the results may do so.
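
       The order-of-magnitude expression above can be evaluated directly.
       The Python sketch below only restates that formula; the result is a
       unitless count rather than bytes, and the sample numbers are invented
       for illustration.

```python
def storage_order(n_classes_per_clsf, n_data, n_stored_clsfs,
                  n_attributes, n_attribute_values):
    """Return the storage term from the formula above (a count, not bytes)."""
    return n_classes_per_clsf * (
        n_data + n_stored_clsfs * n_attributes * n_attribute_values)

# Illustrative numbers only: 10 classes, 10000 cases, max_n_store = 10,
# and 20 real valued attributes (2 values each).
estimate = storage_order(10, 10_000, 10, 20, 2)
```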

       AutoClass C is configured to handle a maximum of  999  attributes.   If
       you  attempt  to  run  with  more  than  that  you will get array bound
       violations.  In that case, change  these  configuration  parameters  in
       prog/autoclass.h and recompile AutoClass C:

       #define ALL_ATTRIBUTES                  999
       #define VERY_LONG_STRING_LENGTH         20000
       #define VERY_LONG_TOKEN_LENGTH          500

       For example, these values will handle several thousand attributes:

       #define ALL_ATTRIBUTES                  9999
       #define VERY_LONG_STRING_LENGTH         50000
       #define VERY_LONG_TOKEN_LENGTH          50000

       Disk  space  taken  up  by  the "log" file will of course depend on the
       duration of the search.  n_save (default value = 2) determines how many
       best   classifications   are  saved  into  the  ".results[-bin]"  file.
       save_compact_p controls whether the "results"  and  "checkpoint"  files
       are saved as binary.  Binary files are faster and more compact, but are
       not portable.  The default  value  of  save_compact_p  is  true,  which
       causes binary files to be written.

       If  the  time  taken to save the "results" files is a problem, consider
       increasing  min_save_period  (default  value  =  1800  seconds  or   30
       minutes).   Files  are  saved  to  disk this often if there is anything
       different to report.

   JUST HOW SLOW IS IT?
       Compute time is of order n_data * n_attributes * n_classes * n_tries  *
       converge_cycles_per_try. The major uncertainties in this are the number
       of basic back and forth cycles till convergence in  each  try,  and  of
       course the number of tries.  The number of cycles per trial is
       typically 10-100 for try_fn_type "converge", and 10-200+ for
       "converge_search_3" and "converge_search_4".  The maximum number of
       cycles is specified by max_cycles (default value = 200).  The number of
       trials is up to you and your available computing resources.

       The  running  time of very large data sets will be quite uncertain.  We
       advise that a few small scale test runs  be  made  on  your  system  to
       determine  a  baseline.   Specify n_data to limit how many data vectors
       are read.  Given a very large quantity of data, AutoClass may find  its
       most probable classifications at upwards of a hundred classes, and this
       will require that start_j_list be specified  appropriately  (See  above
       section  HOW  MANY  CLASSES?).   If you are quite certain that you only
       want a few classes, you can force AutoClass  to  search  with  a  fixed
       number  of  classes  specified  by  fixed_j.  You will then need to run
       separate searches with each different fixed number of classes.
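
       A small baseline run can be extrapolated with the compute-time
       expression given at the start of this section.  The Python sketch
       below assumes run time scales linearly with that product, which is
       only a rough guide because the number of cycles per try is itself
       uncertain.  The function names and the sample timings are
       hypothetical.

```python
def work_units(n_data, n_attributes, n_classes, n_tries, cycles_per_try):
    """Product that compute time is proportional to, per the text above."""
    return n_data * n_attributes * n_classes * n_tries * cycles_per_try

def extrapolate_seconds(baseline_seconds, baseline_units, target_units):
    """Scale a measured baseline linearly (an assumption) to a larger run."""
    return baseline_seconds * target_units / baseline_units

# Hypothetical baseline: 1000 cases (n_data = 1000) took 12 seconds.
small = work_units(1_000, 20, 5, 10, 50)
large = work_units(100_000, 20, 5, 10, 50)
projected = extrapolate_seconds(12.0, small, large)
```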

   CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE
       AutoClass caches the data, header, and  model  file  pathnames  in  the
       saved  classification structure of the binary (".results-bin") or ASCII
       (".results") "results" files.  If the "results" and "search" files  are
       moved   to  a  different  directory  location,  the  search  cannot  be
       successfully restarted if you have used absolute pathnames.  Thus it is
       advantageous to invoke AutoClass in a parent directory of the data,
       header, and model files, so that relative pathnames can be used.  Since
       the pathnames cached will then be relative, the files can be moved to a
       different host or file system  and  restarted  --  providing  the  same
       relative pathname hierarchy exists.

       However, since the ".results" file is ASCII text, those pathnames could
       be changed with a text editor  (save_compact_p  must  be  specified  as
       false).

   SEARCH PARAMETERS
       The  search  is  controlled  by the ".s-params" file.  In this file, an
       empty line or a line starting with one of these characters  is  treated
       as  a  comment: "#", "!", or ";".  The parameter name and its value can
       be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>"  are  used  as  separators.   Note
       there are no trailing semicolons.
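
       The line syntax described above can be sketched as a small Python
       parser.  This is not the AutoClass C reader: the function name is
       hypothetical, values are returned as uninterpreted strings, and
       stripping leading white space before testing for a comment character
       is an assumption.

```python
COMMENT_CHARS = ("#", "!", ";")

def parse_param_line(line):
    """Return (name, value) for a parameter line, or None for a
    comment or empty line."""
    stripped = line.strip()
    if not stripped or stripped[0] in COMMENT_CHARS:
        return None
    if "=" in stripped:
        name, value = stripped.split("=", 1)
    else:
        # No equal sign: split on the first run of spaces or tabs.
        parts = stripped.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
    return name.strip(), value.strip()
```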

       The search parameters, with their default values, are as follows:

       rel_error = 0.01
              Specifies  the  relative  difference measure used by clsf-DS-%=,
              when deciding if a new clsf is a duplicate of an old one.

       start_j_list = 2, 3, 5, 7, 10, 15, 25
              Initially try these numbers of classes, so as not to narrow  the
              search  too  quickly.   The  state  of this list is saved in the
              <..>.search file  and  used  on  restarts,  unless  an  override
              specification  of start_j_list is made in the .s-params file for
              the restart run.  This list should bracket your expected  number
              of  classes,  and  by  a  wide  margin!   "start_j_list  = -999"
              specifies an empty list (allowed only on restarts)

       n_classes_fn_type = "random_ln_normal"
              Once  start_j_list  is  exhausted,  AutoClass  will  call   this
              function  to  decide  how many classes to start with on the next
              try,  based  on  the  10  best  classifications  found  so  far.
              Currently only "random_ln_normal" is available.

       fixed_j = 0
              When  fixed_j > 0, overrides start_j_list and n_classes_fn_type,
              and AutoClass will always use this value for the initial  number
              of classes.

       min_report_period = 30
              Wait  at  least  this  time (in seconds) since last report until
              reporting verbosely  again.   Should  be  set  longer  than  the
              expected  run  time  when checking for repeatability of results.
              For   repeatable   results,   also    see    force_new_search_p,
              start_fn_type  and  randomize_random_p.  NOTE:  At  least one of
              "interactive_p",  "max_duration",  and  "max_n_tries"  must   be
              active.   Otherwise AutoClass will run indefinitely.  See below.

       interactive_p = true
              When false, allows run to continue until otherwise halted.  When
              true,  standard  input  is  queried  on  each cycle for the quit
              character "q", which, when detected, triggers an immediate halt.

       max_duration = 0
              When = 0, allows run to continue until otherwise halted.  When >
              0, specifies the maximum number of seconds to run.

       max_n_tries = 0
              When = 0, allows run to continue until otherwise halted.  When >
              0, specifies the maximum number of tries to make.

       n_save = 2
              Save this many clsfs to disk in the .results[-bin] and .search
              files.  If 0, don’t save anything (no .search & .results[-bin]
              files).

       log_file_p = true
              If false, do not write a log file.

       search_file_p = true
              If false, do not write a search file.

       results_file_p = true
              If false, do not write a results file.

       min_save_period = 1800
              CPU  crash  protection.   This  specifies  the  maximum time, in
              seconds, that AutoClass will run before  it  saves  the  current
              results to disk.  The default time is 30 minutes.

       max_n_store = 10
              Specifies   the   maximum   number   of  classifications  stored
              internally.

       n_final_summary = 10
              Specifies the number of trials to be printed  out  after  search
              ends.

       start_fn_type = "random"
              One  of  {"random",  "block"}.  This specifies the type of class
              initialization.  For normal search, use "random", which randomly
              selects   instances   to   be  initial  class  means,  and  adds
              appropriate variances. For testing with repeatable  search,  use
              "block", which partitions the database into successive blocks of
              near   equal   size.    For   repeatable   results,   also   see
              force_new_search_p, min_report_period, and randomize_random_p.

       try_fn_type = "converge_search_3"
              One  of  {"converge_search_3", "converge_search_4", "converge"}.
              These specify alternate search stopping criteria.  "converge"
              merely tests the rate of change of the log_marginal
              classification probability (clsf->log_a_x_h), without checking
              the rate of change of individual classes (see halt_range and
              halt_factor).  "converge_search_3" and "converge_search_4" each
              monitor the ratio class->log_a_w_s_h_j/class->w_j for all
              classes, and continue convergence until all pass the quiescence
              criteria for n_average cycles.  "converge_search_3" tests
              differences between successive convergence cycles (see
              rel_delta_range).  This provides a reasonable, general purpose
              stopping criterion.  "converge_search_4" averages the ratio over
              "sigma_beta_n_values" cycles (see cs4_delta_range).  This is
              preferred when converge_search_3 produces many similar classes.

       initial_cycles_p = true
              If  true, perform base_cycle in initialize_parameters.  false is
              used only for testing.

       save_compact_p = true
              true saves classifications as machine dependent binary
              (.results-bin & .chkpt-bin).  false saves as ASCII text
              (.results & .chkpt).

       read_compact_p = true
              true reads classifications as machine dependent binary
              (.results-bin & .chkpt-bin).  false reads as ASCII text
              (.results & .chkpt).

       randomize_random_p = true
              false seeds lrand48, the pseudo-random number function with 1 to
              give  repeatable  test cases.  true uses universal time clock as
              the seed, giving semi-random searches.  For repeatable  results,
              also     see     force_new_search_p,    min_report_period    and
              start_fn_type.

       n_data = 0
              With n_data = 0, the entire database is read  from  .db2.   With
              n_data > 0, only this number of data are read.

       halt_range = 0.5
              Passed   to   try_fn_type   "converge".    With  the  "converge"
              try_fn_type, convergence is halted when the larger of halt_range
              and  (halt_factor * current_log_marginal) exceeds the difference
              between  successive   cycle   values   of   the   classification
              log_marginal   (clsf->log_a_x_h).   Decreasing  this  value  may
              tighten the convergence and increase the number of cycles.

       halt_factor = 0.0001
              Passed  to  try_fn_type   "converge".    With   the   "converge"
              try_fn_type, convergence is halted when the larger of halt_range
              and (halt_factor * current_log_marginal) exceeds the  difference
              between   successive   cycle   values   of   the  classification
              log_marginal  (clsf->log_a_x_h).   Decreasing  this  value   may
              tighten the convergence and increase the number of cycles.

       rel_delta_range = 0.0025
              Passed  to  try function "converge_search_3", which monitors the
              ratio of  log  approx-marginal-likelihood  of  class  statistics
              with-respect-to   the  class  hypothesis  (class->log_a_w_s_h_j)
              divided by the class weight (class->w_j), for each class.
              "converge_search_3" halts convergence when, for every class, the
              difference in this ratio between successive cycles has remained
              below "rel_delta_range" for "n_average" cycles.  Decreasing
              "rel_delta_range" tightens the convergence and increases the
              number of cycles.

       cs4_delta_range = 0.0025
              Passed  to  try function "converge_search_4", which monitors the
              ratio of (class->log_a_w_s_h_j)/(class->w_j),  for  each  class,
              averaged    over   "sigma_beta_n_values"   convergence   cycles.
              "converge_search_4"   halts   convergence   when   the   maximum
              difference   in   average  values  of  this  ratio  falls  below
              "cs4_delta_range".  Decreasing  "cs4_delta_range"  tightens  the
              convergence and increases the number of cycles.

       n_average = 3
              Passed to try functions "converge_search_3" and "converge".  The
              number of cycles for which the  convergence  criterion  must  be
              satisfied for the trial to terminate.

       sigma_beta_n_values = 6
              Passed  to  try_fn_type "converge_search_4".  The number of past
              values to use in computing sigma^2 (noise) and beta^2  (signal).

       max_cycles = 200
              This  is  the  maximum  number  of  cycles permitted for any one
              convergence  of  a  classification,  regardless  of  any   other
              stopping  criteria.   This  is very dependent upon your database
              and choice of model and convergence parameters,  but  should  be
              about twice the average number of cycles reported in the screen
              dump and .log file.

       converge_print_p = false
              If true, the selected try function  will  print  to  the  screen
              values useful in specifying non-default values for halt_range,
              halt_factor, rel_delta_range, cs4_delta_range, n_average, and
              sigma_beta_n_values.

       force_new_search_p = true
              If true, will ignore any previous search results, discarding the
              existing .search and .results[-bin] files after confirmation  by
              the  user; if false, will continue the search using the existing
              .search and .results[-bin] files.  For repeatable results,  also
              see min_report_period, start_fn_type and randomize_random_p.

       checkpoint_p = false
              If  true,  checkpoints  of  the  current  classification will be
              written  every  "min_checkpoint_period"   seconds,   with   file
              extension .chkpt[-bin].  This is only useful for very large
              classifications.

       min_checkpoint_period = 10800
              If checkpoint_p = true, the checkpointed classification will be
              written this often, in seconds (default = 3 hours).

       reconverge_type = ""
              Can be either "chkpt" or "results".  If "checkpoint_p" = true
              and "reconverge_type" = "chkpt", then continue convergence of
              the classification contained in <...>.chkpt[-bin].  If
              "checkpoint_p" = false and "reconverge_type" = "results",
              continue convergence of the best classification contained in
              <...>.results[-bin].

       screen_output_p = true
              If false,  no  output  is  directed  to  the  screen.   Assuming
              log_file_p = true, output will be directed to the log file only.

       break_on_warnings_p = true
              The default value asks the user whether or not to continue, when
              data definition warnings are found.  If specified as false, then
              AutoClass will continue, despite warnings --  the  warning  will
              continue to be output to the terminal and the log file.

       free_storage_p = true
              The  default  value  tells AutoClass to free the majority of its
              allocated storage.  This is not required, and in the case of the
              DEC Alpha causes a core dump [is this still true?].  If
              specified as false, AutoClass will not attempt to free storage.
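
       The halting rule described under halt_range and halt_factor for the
       "converge" try function can be sketched as follows.  This Python
       fragment is an interpretation, not the AutoClass C code: taking
       absolute values (the log marginal is negative) and requiring the test
       to hold for n_average consecutive cycles are assumptions, and the
       function name is hypothetical.

```python
def converge_should_halt(log_marginals, halt_range=0.5, halt_factor=0.0001,
                         n_average=3):
    """Decide whether a "converge"-style trial should stop.

    log_marginals: per-cycle classification log marginal values,
    oldest first.
    """
    if len(log_marginals) < n_average + 1:
        return False
    recent = log_marginals[-(n_average + 1):]
    for prev, cur in zip(recent, recent[1:]):
        # Assumed reading: compare magnitudes of change and of the marginal.
        threshold = max(halt_range, halt_factor * abs(cur))
        if abs(cur - prev) >= threshold:
            return False
    return True
```

       Under this reading, lowering either parameter shrinks the threshold,
       so more cycles are needed before the trial halts.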

   HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS
       In some situations, repeatable classifications are required:  comparing
       basic AutoClass C integrity on different platforms, porting AutoClass C
       to a new platform, etc.  In order to accomplish  this  two  things  are
       necessary: 1) the same random number generator must be used, and 2) the
       search parameters must be specified properly.

       Random Number Generator.  This implementation of AutoClass C uses the
       Unix srand48/lrand48 random number generator, which generates
       pseudo-random numbers using the well-known linear congruential
       algorithm and 48-bit integer arithmetic.  lrand48() returns
       non-negative long integers uniformly distributed over the interval
       [0, 2**31).
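
       When porting, it can help to have a reference model of the generator
       to compare against.  The Python sketch below follows the POSIX
       description of srand48/lrand48 (multiplier 0x5DEECE66D, addend 0xB,
       48-bit state, low 16 bits seeded with 0x330E).  The class name is
       hypothetical, and the output should be checked against the platform’s
       own lrand48 before relying on it.

```python
class Rand48:
    """Pure-Python model of the POSIX srand48/lrand48 generator."""

    A = 0x5DEECE66D          # multiplier
    C = 0xB                  # addend
    MASK = (1 << 48) - 1     # keep 48 bits of state

    def __init__(self, seed):
        # srand48: low 32 bits of the seed become the high 32 bits of
        # the state; the low 16 bits are set to 0x330E.
        self.state = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def lrand48(self):
        # One linear congruential step, then return the top 31 bits.
        self.state = (self.A * self.state + self.C) & self.MASK
        return self.state >> 17
```

       Two instances created with the same seed yield the same sequence,
       which is the property that randomize_random_p = false relies on.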

       Search Parameters.  The following .s-params file parameters  should  be
       specified:

       force_new_search_p = true
       start_fn_type   "block"
       randomize_random_p = false
       ;; specify the number of trials you wish to run
       max_n_tries = 50
       ;; specify a time greater than duration of run
       min_report_period = 30000

       Note  that  no  current  best  classification reports will be produced.
       Only a final classification summary will be output.

CHECKPOINTING

       With very large databases there  is  a  significant  probability  of  a
       system   crash   during   any   one  classification  try.   Under  such
       circumstances it is advisable  to  take  the  time  to  checkpoint  the
       calculations for possible restart.

       Checkpointing is initiated by specifying "checkpoint_p = true" in the
       ".s-params" file.  This causes the inner convergence step to save a
       copy of the classification onto the checkpoint file each time the
       classification is updated, provided a certain period of time has
       elapsed.  The file extension is ".chkpt[-bin]".

       Each time AutoClass completes a cycle, a "." is output to the screen
       to provide you with information to be used in setting the
       min_checkpoint_period value (default 10800 seconds or 3 hours).  There
       is a trade-off between the frequency of checkpointing and the
       protection it gives against a machine crash, since the repetitive
       writing of the checkpoint file will slow the search process.

       Restarting AutoClass Search:

       To recover the classification and continue the search after rebooting
       and reloading AutoClass, specify reconverge_type = "chkpt" in the
       ".s-params" file (and specify force_new_search_p as false).

       AutoClass will reload the appropriate  database  and  models,  provided
       there  has  been  no change in their filenames since the time they were
       loaded for the checkpointed classification run.  The  ".s-params"  file
       contains  any  non-default arguments that were provided to the original
       call.

       In the beginning of a search, before start_j_list has been emptied,  it
       will be necessary to trim the original list to what would have remained
       in the crashed search.  This can be determined by looking at the ".log"
       file  to  determine what values were already used.  If the start_j_list
       has been emptied, then an empty start_j_list should be specified in the
       ".s-params" file.  This is done either by

               start_j_list =

       or

               start_j_list = -999

       Here is a set of scripts to demonstrate checkpointing:

       autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
            data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

       Run 1)
         ## glassc-chkpt.s-params
         max_n_tries = 2
         force_new_search_p = true
         ## --------------------
         ;; run to completion

       Run 2)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 10
         checkpoint_p = true
         min_checkpoint_period = 2
         ## --------------------
         ;; after 1 checkpoint, ctrl-C to simulate cpu crash

       Run 3)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 1
         checkpoint_p = true
         min_checkpoint_period = 1
         reconverge_type = "chkpt"
         ## --------------------
         ;; checkpointed trial should finish

OUTPUT FILES

       The standard reports are

       1)     Attribute  influence  values: presents the relative influence or
              significance of the data’s attributes  both  globally  (averaged
              over  all classes), and locally (specifically for each class). A
              heuristic for relative class strength is also listed;

       2)     Cross-reference by case (datum) number: lists the primary  class
              probability  for  each  datum,  ordered  by  case  number.  When
              report_mode =  "data",  additional  lesser  class  probabilities
              (greater than or equal to 0.001) are listed for each datum;

       3)     Cross-reference  by  class  number:  for  each class the primary
              class probability and any lesser  class  probabilities  (greater
              than  or equal to 0.001) are listed for each datum in the class,
              ordered by case number. It is also possible to  list,  for  each
              datum, the values of attributes, which you select.

       The attribute influence values report attempts to provide relative
       measures of the "influence" of the data attributes on the classes found
       by the classification.  The normalized class strengths, the normalized
       attribute influence values summed over all classes, and the individual
       influence values (I[jkl]) are all only relative measures.  They can be
       interpreted with somewhat more meaning than a rank ordering, but not as
       anything approaching absolute values.

       The  reports  are  output  to files whose names and pathnames are taken
       from the ".r-params" file pathname.  The report file types (extensions)
       are:

       influence values report
              "influ-o-text-n" or "influ-no-text-n"

       cross-reference by case
              "case-text-n"

       cross-reference by class
              "class-text-n"

       or, if report_mode is overridden to "data":

       influence values report
              "influ-o-data-n" or "influ-no-data-n"

       cross-reference by case
              "case-data-n"

       cross-reference by class
              "class-data-n"

       where  n  is  the  classification  number from the "results" file.  The
       first or best classification is numbered 1, the next best 2, etc.   The
       default  is to generate reports only for the best classification in the
       "results"  file.    You   can   produce   reports   for   other   saved
       classifications   by   using   report   params   keywords  n_clsfs  and
       clsf_n_list.   The  "influ-o-text-n"   file   type   is   the   default
       (order_attributes_by_influence_p   =  true),  and  lists  each  class’s
       attributes in descending order of attribute influence  value.   If  the
       value  of  order_attributes_by_influence_p is overridden to be false in
       the <...>.r-params file, then each class’s attributes will be listed in
       ascending  order  by  attribute  number.   The  extension  of  the file
       generated  will  be  "influ-no-text-n".    This   method   of   listing
       facilitates  the visual comparison of attribute values between classes.

       For example, this command:

            autoclass -reports sample/imports-85c.results-bin \
                 sample/imports-85c.search sample/imports-85c.r-params

       with this line in the ".r-params" file:

            xref_class_report_att_list = 2, 5, 6

       will generate these output files:

            imports-85c.influ-o-text-1
            imports-85c.case-text-1
            imports-85c.class-text-1

       The AutoClass C reports provide the capability to compute  sigma  class
       contour  values  for  specified  pairs  of real valued attributes, when
       generating  the  influence  values  report   with   the   data   option
       (report_mode  =  "data").   Note  that  sigma  class  contours  are not
       generated from discrete type attributes.

       The sigma contours are the two dimensional equivalent of n-sigma  error
       bars  in  one  dimension.  Specifically, for two independent attributes
       the n-sigma contour is defined as the ellipse where

       ((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n^2

       With covariant attributes, the n-sigma contours are defined
       identically, in the rotated coordinate system of the distribution’s
       principal axes.  Thus independent attributes give ellipses oriented
       parallel  with  the attribute axes, while the axes of sigma contours of
       covariant attributes are rotated about the  center  determined  by  the
       means.   In  either  case the sigma contour represents a line where the
       class  probability  is  constant,  irrespective  of  any  other   class
       probabilities.

       With three or more attributes the n-sigma contours become k-dimensional
       ellipsoidal surfaces.  This code takes advantage of the fact that the
       parallel projection of a k-dimensional ellipsoid, onto any 2-dim
       plane, is bounded by an ellipse.  In this simplified case of projecting
       the single sigma ellipsoid onto the coordinate planes, it is also true
       that the 2-dim covariances of this ellipse are equal to the
       corresponding elements of the k-dim ellipsoid’s covariances.  The
       Eigen-system of the 2-dim covariance then gives the variances w.r.t.
       the principal components of the ellipse, and the rotation that aligns
       it with the data.  This represents the best way to display a
       distribution in the marginal plane.
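
       The eigen-system step can be illustrated for a single pair of
       attributes.  The Python sketch below takes the 2-dim covariance
       entries and returns the semi-axis lengths and the rotation of the
       ellipse.  It is not taken from the AutoClass C source; the function
       name is hypothetical, and scaling the axes by n_sigma is an assumed
       convention.

```python
import math

def sigma_ellipse(var_x, var_y, cov_xy, n_sigma=1.0):
    """Return (semi_major, semi_minor, angle_radians) for a 2x2 covariance.

    angle_radians is the rotation of the major axis from the x axis.
    """
    half_sum = 0.5 * (var_x + var_y)
    half_diff = 0.5 * (var_x - var_y)
    root = math.hypot(half_diff, cov_xy)
    # Eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]].
    lam_major = half_sum + root
    lam_minor = max(half_sum - root, 0.0)  # guard against round-off
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return (n_sigma * math.sqrt(lam_major),
            n_sigma * math.sqrt(lam_minor),
            angle)
```

       With cov_xy = 0 the angle is zero and the axes reduce to the two
       attribute sigmas, which is the independent-attribute case described
       above.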

       To  get  contour  values,  set the keyword sigma_contours_att_list to a
       list of real valued attribute indices (from .hd2 file), and request  an
       influence values report with the data option.  For example,

            report_mode = "data"
            sigma_contours_att_list = 3, 4, 5, 8, 15

   OUTPUT REPORT PARAMETERS
       The  contents  of  the  output report are controlled by the ".r-params"
       file.  In this file, an empty line or a line starting with one of these
       characters  is  treated  as a comment: "#", "!", or ";".  The parameter
       name and its value can be separated by an equal sign,  a  space,  or  a
       tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces  are  ignored  if  "="  or "<tab>" are used as separators.  Note
       there are no trailing semicolons.

       The following are the allowed parameters and their default values:

       n_clsfs = 1
              number of clsfs in the  .results  file  for  which  to  generate
              reports, starting with the first or "best".

       clsf_n_list =
              if  specified,  this  is  a one-based index list of clsfs in the
              clsf  sequence  read  from  the  .results  file.   It  overrides
              "n_clsfs".  For example:

                   clsf_n_list = 1, 2

              will produce the same output as

                   n_clsfs = 2

              but

                   clsf_n_list = 2

              will only output the "second best" classification report.

       report_type =
              type   of   reports   to  generate:  "all",  "influence_values",
              "xref_case", or "xref_class".

       report_mode =
              mode of reports to generate. "text" is  formatted  text  layout.
              "data" is numerical -- suitable for further processing.

       comment_data_headers_p = false
              the  default  value  does  not  insert  #  in  column  1 of most
              report_mode = "data" header lines.  If specified  as  true,  the
              comment character will be inserted in most header lines.

       num_atts_to_list =
              if  specified,  the  number  of  attributes to list in influence
              values report.  if not specified, all attributes will be listed.
              (e.g. "num_atts_to_list = 5")

       xref_class_report_att_list =
              if specified, a list of attribute numbers (zero-based), whose
              values will be output in the "xref_class" report along with the
              case probabilities.  if not specified, no attribute values will
              be output.  (e.g. "xref_class_report_att_list = 1, 2, 3")

       order_attributes_by_influence_p = true
              The default value lists each class’s  attributes  in  descending
              order  of  attribute influence value, and uses ".influ-o-text-n"
              as the influence values  report  file  type.   If  specified  as
              false,  then each class’s attributes will be listed in ascending
              order by attribute number.  The extension of the file  generated
              will be "influ-no-text-n".

       break_on_warnings_p = true
              The  default value asks the user whether to continue or not when
              data definition warnings are found.  If specified as false, then
              AutoClass  will  continue,  despite warnings -- the warning will
              continue to be output to the terminal.

       free_storage_p = true
              The default value tells AutoClass to free the  majority  of  its
              allocated storage.  This is not required, and in the case of the
              DEC Alpha  causes  a  core  dump  [is  this  still  true?].   If
              specified  as false, AutoClass will not attempt to free storage.

       max_num_xref_class_probs = 5
              Determines how many lesser class probabilities will be printed
              for the case and class cross-reference reports.  The default is
              to print the most probable class probability value and up to 4
              lesser class probabilities.  Note this is true for both the
              "text" and "data" class cross-reference reports, but only true
              for the "data" case cross-reference report.  The "text" case
              cross-reference report only has the most probable class
              probability.

       sigma_contours_att_list =
              If specified, a list of real valued attribute indices (from the
              .hd2 file) that will be used to compute sigma class contour
              values when generating the influence values report with the
              data option (report_mode = "data").  If not specified, there
              will be no sigma class contour output.  (e.g.
              "sigma_contours_att_list = 3, 4, 5, 8, 15")

INTERPRETATION OF AUTOCLASS RESULTS

   WHAT HAVE YOU GOT?
       Now you have run AutoClass on your data  set  --  what  have  you  got?
       Typically,  the  AutoClass search procedure finds many classifications,
       but only saves the few best.  These are now  available  for  inspection
       and interpretation.  The most important indicator of the relative
       merits of these alternative classifications is the Log total posterior
       probability value.  Note that since the probability lies between 1 and
       0, the corresponding Log probability is negative and ranges from 0 to
       negative infinity.  Raising e to the power of the difference between
       two of these Log probability values gives the relative probability of
       the alternative classifications.  So a difference of, say, 100 implies
       one classification is e^100 ~= 10^43 times more likely than the other.
       However, these numbers can be very misleading, since they give the
       relative probability of alternative classifications under the
       AutoClass assumptions.
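
       The conversion from a difference in Log probability to a relative
       probability can be sketched in a few lines of Python.  The two Log
       probability values below are made up for illustration and are not
       taken from any AutoClass run; the result is reported as a power of
       ten to keep the number readable.

```python
import math

def relative_probability_log10(log_prob_a, log_prob_b):
    """Return log10 of P(a)/P(b), given natural-log posterior probabilities."""
    return (log_prob_a - log_prob_b) / math.log(10)

# Hypothetical Log posterior probability values for two classifications.
log_p_best = -12340.0
log_p_next = -12440.0

exponent = relative_probability_log10(log_p_best, log_p_next)
print(f"best is about 10^{exponent:.1f} times more probable than next")
```

       Working with the exponent avoids overflow when the difference between
       the two Log probability values is large.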

   ASSUMPTIONS
       Specifically,  the  most important AutoClass assumptions are the use of
       normal models for real variables, and the assumption of independence of
       attributes  within a class.  Since these assumptions are often violated
       in practice, the difference in  posterior  probability  of  alternative
       classifications can be partly due to one classification being closer to
       satisfying  the  assumptions  than  another,  rather  than  to  a  real
       difference  in  classification  quality.  Another source of uncertainty
       about the utility of Log probability values is that they  do  not  take
       into  account  any specific prior knowledge the user may have about the
       domain.  This means that it  is  often  worth  looking  at  alternative
       classifications  to  see  if  you  can  interpret them, but it is worth
       starting  from  the  most  probable  first.   Note  that  if  the   Log
       probability  value is much greater than that for the one class case, it
       is saying that there is overwhelming evidence for some structure in the
       data,  and  part  of  this structure has been captured by the AutoClass
       classification.

   INFLUENCE REPORT
       So you have now picked a classification you want to examine,  based  on
       its  Log  probability value; how do you examine it?  The first thing to
       do is to generate an "influence" report on the classification using the
       report         generation        facilities        documented        in
       /usr/share/doc/autoclass/reports-c.text.   An   influence   report   is
       designed to summarize the important information buried in the AutoClass
       data structures.

       The first part of this report gives the heuristic class "strengths".
       Class "strength" is here defined as the geometric mean probability that
       any instance "belonging to" a class would have been generated from the
       class probability model.  It thus provides a heuristic measure of how
       strongly each class predicts "its" instances.
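
       As a rough illustration of a geometric mean probability, the Python
       sketch below averages in log space.  It assumes you already have, for
       each instance counted as belonging to a class, the probability of
       that instance under the class model; how AutoClass selects or weights
       those instances is not specified here, and the function name and the
       numbers are hypothetical.

```python
import math

def class_strength(instance_probabilities):
    """Geometric mean of per-instance probabilities, computed in log space."""
    log_sum = sum(math.log(p) for p in instance_probabilities)
    return math.exp(log_sum / len(instance_probabilities))

# Hypothetical probabilities of three instances under one class model.
print(class_strength([0.2, 0.05, 0.1]))
```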

       The second part is a listing of the overall "influence" of each of  the
       attributes  used  in  the classification.  These give a rough heuristic
       measure  of  the  relative  importance  of  each   attribute   in   the
       classification.   Attribute  "influence values" are a class probability
       weighted average of the "influence" of each attribute in  the  classes,
       as described below.

       The  next  part  of  the report is a summary description of each of the
       classes.  The classes are arbitrarily numbered from 0 up to n, in order
       of descending class weight.  A class weight of, say, 34.1 means that
       the weighted sum of membership probabilities for that class is 34.1.
       Note that a class weight of 34 does not necessarily mean that 34 cases
       belong to that class, since many cases may have only partial
       membership in that class.  Within each class, attributes or attribute
       sets are ordered by the "influence" of their model term.

   CROSS ENTROPY
       A commonly used measure  of  the  divergence  between  two  probability
       distributions is the cross entropy: the sum over all possible values x,
       of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g...  define  the
       distributions.  It ranges from zero, for identical distributions, to
       infinity, for distributions placing probability 1 on differing values
       of an attribute.  With conditionally independent terms in the
       probability distributions, the cross entropy can be factored into a
       sum over these terms.  Each term of this sum provides a measure of the
       corresponding modeled attribute’s influence in differentiating the two
       distributions.

       We define the modeled term’s "influence" on a class  to  be  the  cross
       entropy  term  for  the  class  distribution  w.r.t.  the  global class
       distribution of the single class classification.  "Influence" is thus a
       measure  of  how  strongly the model term helps differentiate the class
       from the whole data set.  With independently modeled attributes, the
       influence can legitimately be ascribed to the attribute itself.  With
       correlated or covariant attribute sets, the cross entropy term is a
       function of the entire set, and we distribute the influence value
       equally over the modeled attributes.
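
       The cross entropy formula and its factoring over independent terms
       can be checked numerically.  The Python sketch below is illustrative
       only: the attribute names and probabilities are invented, it handles
       discrete attributes only, and it is not the code AutoClass uses to
       compute influence values.

```python
import math
from itertools import product

def cross_entropy_term(p_class, p_global):
    """Sum over x of P(x|c) * log(P(x|c) / P(x|g)) for discrete values x."""
    total = 0.0
    for value, pc in p_class.items():
        if pc == 0.0:
            continue  # a term with P(x|c) = 0 contributes nothing
        pg = p_global.get(value, 0.0)
        if pg == 0.0:
            return math.inf  # class puts mass where the global model has none
        total += pc * math.log(pc / pg)
    return total

def joint(p_a, p_b):
    """Joint distribution of two independently modeled attributes."""
    return {(a, b): p_a[a] * p_b[b] for a, b in product(p_a, p_b)}

# Hypothetical class and global distributions for two attributes.
color_class, color_global = {"red": 0.7, "blue": 0.3}, {"red": 0.4, "blue": 0.6}
shape_class, shape_global = {"round": 0.5, "square": 0.5}, {"round": 0.5, "square": 0.5}

influence_color = cross_entropy_term(color_class, color_global)
influence_shape = cross_entropy_term(shape_class, shape_global)
joint_term = cross_entropy_term(joint(color_class, shape_class),
                                joint(color_global, shape_global))

print(influence_color, influence_shape, joint_term)
```

       In this example "shape" is distributed identically in the class and
       globally, so its term is zero, and the joint value equals the sum of
       the two per-attribute terms.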

   ATTRIBUTE INFLUENCE VALUES
       In the "influence" report on each class, the attribute  parameters  for
       that  class are given in order of highest influence value for the model
       term attribute sets.  Only the first few attribute  sets  usually  have
       significant  influence values.  If an influence value drops below about
       20% of the highest value, then it is probably not significant, but  all
       attribute  sets  are  listed  for  completeness.   In  addition  to the
       influence value for each attribute set, the values of the attribute set
       parameters  in  that  class  are  given  along  with  the corresponding
       "global" values.  The global values are computed directly from the data
       independent  of  the classification.  For example, if the class mean of
       attribute "temperature" is 90 with standard deviation of 2.5,  but  the
       global  mean  is  68 with a standard deviation of 16.3, then this class
       has selected out cases with much higher than average temperature, and a
       rather  small  spread  in  this  high  range.   Similarly, for discrete
       attribute sets, the probability of each outcome in that class is given,
       along  with  the  corresponding  global  probability  -- ordered by its
       significance:  the  absolute  value  of  (log  {<local-probability>   /
       <global-probability>}).   The  sign of the significance value shows the
       direction of change from the global class.  This information  gives  an
       overview  of  how each class differs from the average for all the data,
       in order of the most significant differences.
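
       The ordering of discrete attribute values by significance can be
       sketched as follows.  The probabilities are hypothetical and assumed
       to be non-zero; the function name is illustrative and does not come
       from AutoClass.

```python
import math

def rank_by_significance(local_probs, global_probs):
    """Return (value, local, global, signed_log_ratio) rows, sorted by
    the absolute value of log(local / global), largest first."""
    rows = []
    for value, p_local in local_probs.items():
        p_global = global_probs[value]
        rows.append((value, p_local, p_global, math.log(p_local / p_global)))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows

# Hypothetical outcome probabilities for one discrete attribute.
local_probs = {"yes": 0.9, "no": 0.06, "unknown": 0.04}
global_probs = {"yes": 0.5, "no": 0.4, "unknown": 0.1}

rows = rank_by_significance(local_probs, global_probs)
for value, p_local, p_global, signed in rows:
    print(f"{value:8s} local={p_local:.2f} global={p_global:.2f} {signed:+.3f}")
```

       A negative sign marks a value that is less common in the class than
       in the data as a whole, a positive sign one that is more common.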

   CLASS AND CASE REPORTS
       Having gained a description of the classes from the "influence" report,
       you may want to follow up to see which classes your favorite cases
       ended up in.  Conversely, you may want to see which cases belong to a
       particular class.  For this kind of cross-reference information two
       complementary reports can be generated.  These are more fully
       documented in /usr/share/doc/autoclass/reports-c.text.  The "class"
       report lists all the cases which have significant membership in each
       class  and  the  degree  to which each such case belongs to that class.
       Cases whose class membership is less than 90% in the current class have
       their  other class membership listed as well.  The cases within a class
       are ordered in increasing case number.  The alternative "cases"  report
       states  which  class (or classes) a case belongs to, and the membership
       probability in the most probable class.  These two reports allow you to
       find  which  cases belong to which classes or the other way around.  If
       nearly every case has close to 99% membership in a single  class,  then
       it  means  that  the classes are well separated, while a high degree of
       cross-membership indicates that the  classes  are  heavily  overlapped.
       Highly overlapped classes are an indication that the idea of
       classification is breaking down and that groups of mutually highly
       overlapped classes, a kind of meta class, are probably a better way of
       understanding the data.

   COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
       The class weight, given as the class probability parameter, is
       essentially the sum, over all data instances, of the normalized
       probability that the instance is a member of the class.  It is probably
       an error on our part that we format this number as an integer in the
       report, rather than emphasizing its real nature.  You will find the
       actual real value recorded as the w_j parameter in the class_DS
       structures in any .results[-bin] file.

       The .case and .class reports give probabilities that cases are  members
       of  classes.  Any assignment of cases to classes requires some decision
       rule.  The maximum probability  assignment  rule  is  often  implicitly
       assumed,  but  it cannot be expected that the resulting partition sizes
       will equal  the  class  weights  unless  nearly  all  class  membership
       probabilities  are  effectively  one  or zero.  With non-1/0 membership
       probabilities,  matching  the  class  weights  requires   summing   the
       probabilities.
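
       The gap between class weights and maximum probability partition sizes
       shows up even in a tiny example.  The membership table in the Python
       sketch below is invented for illustration; each row is one case and
       each column one class.

```python
# Hypothetical normalized membership probabilities: rows are cases,
# columns are classes.
membership = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.2, 0.8],
]
num_classes = 2

# Class weight: sum of membership probabilities over all cases.
class_weights = [sum(row[j] for row in membership) for j in range(num_classes)]

# Partition size: number of cases whose most probable class is j.
partition_sizes = [0] * num_classes
for row in membership:
    best = max(range(num_classes), key=lambda j: row[j])
    partition_sizes[best] += 1

print("class weights:  ", class_weights)
print("partition sizes:", partition_sizes)
```

       Here the weights come out near 2.3 and 1.7, while the maximum
       probability rule assigns 3 cases to the first class and 1 to the
       second.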

       In   addition,  there  is  the  question  of  completeness  of  the  EM
       (expectation  maximization)   convergence.    EM   alternates   between
       estimating   class   parameters   and   estimating   class   membership
       probabilities.  These estimates  converge  on  each  other,  but  never
       actually  meet.   AutoClass  implements  several convergence algorithms
       with alternate stopping criteria using appropriate  parameters  in  the
       .s-params  file.  Proper setting of these parameters, to get reasonably
       complete and efficient convergence may require experimentation.
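
       To make the alternation and the role of a stopping criterion
       concrete, here is a minimal EM sketch in Python for two normally
       distributed classes over a single real attribute.  It is not the
       AutoClass implementation: the stopping rule (a small increase in log
       likelihood), the parameter names "tolerance" and "max_cycles", the
       lower bound on sigma, and the data are all assumptions made for
       illustration.

```python
import math

def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_classes(data, means, sigmas, weights, tolerance=1e-6, max_cycles=200):
    """Alternate membership and parameter estimates until the log
    likelihood improves by less than `tolerance` (hypothetical rule)."""
    previous = -math.inf
    for cycle in range(1, max_cycles + 1):
        # Estimate class membership probabilities from current parameters.
        memberships = []
        log_likelihood = 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)]
            total = sum(joint)
            memberships.append([j / total for j in joint])
            log_likelihood += math.log(total)
        # Re-estimate class parameters from the membership probabilities.
        for k in range(2):
            w_k = sum(row[k] for row in memberships)
            means[k] = sum(row[k] * x for row, x in zip(memberships, data)) / w_k
            var = sum(row[k] * (x - means[k]) ** 2
                      for row, x in zip(memberships, data)) / w_k
            sigmas[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
            weights[k] = w_k / len(data)
        if log_likelihood - previous < tolerance:
            break
        previous = log_likelihood
    return means, sigmas, weights, cycle

# Hypothetical data with two well separated groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, sigmas, weights, cycles = em_two_classes(
    data, means=[0.0, 6.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print(means, sigmas, weights, cycles)
```

       Loosening the tolerance stops the loop earlier with less complete
       convergence; tightening it costs more cycles for smaller changes in
       the estimates.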

   ALTERNATIVE CLASSIFICATIONS
       In summary, the various reports that can be generated give you a way of
       viewing the current classification.  It is usually a good idea to look
       at alternative classifications even though they do not have the maximum
       Log probability value.  These other classifications usually have
       classes  that  correspond  closely   to   strong   classes   in   other
       classifications, but can differ in the weak classes.  The "strength" of
       a  class  within  a  classification  can  usually  be  judged  by   how
       dramatically the highest influence value attributes in the class differ
       from  the  corresponding   global   attributes.    If   none   of   the
       classifications  seem  quite satisfactory, it is always possible to run
       AutoClass again to generate new classifications.

   WHAT NEXT?
       Finally, the question of what to do after you have found an  insightful
       classification  arises.   Usually, classification is a preliminary data
       analysis step for examining a set of cases (things, examples, etc.)  to
       see  if  they can be grouped so that members of the group are "similar"
       to each other.  AutoClass gives such a grouping without the user having
       to  define  a similarity measure.  The built-in "similarity" measure is
       the mutual predictiveness of the cases.  The next step  is  to  try  to
       "explain"  why  some  objects  are  more  like  others  than those in a
       different group.  Usually, domain knowledge suggests  an  answer.   For
       example,  a  classification  of  people based on income, buying habits,
       location, age, etc., may reveal particular social classes that were not
       obvious before the classification analysis.  Collecting further
       information about such classes, such as number of cars, what TV shows
       are watched, etc., would reveal even more.  Longitudinal studies would
       give information about how social classes arise and what influences
       their attitudes -- all of which goes well beyond the initial
       classification.

PREDICTIONS

       Classifications can be used to predict class membership for new  cases.
       So  in  addition to possibly giving you some insight into the structure
       behind  your  data,  you  can  now  use  AutoClass  directly  to   make
       predictions, and compare AutoClass to other learning systems.

       This  technique for predicting class probabilities is applicable to all
       attributes, regardless of data type/sub_type or likelihood  model  term
       type.

       In  the  event that the class membership of a data case does not exceed
       0.0099999 for any of the "training" classes, the following message will
       appear in the screen output for each case:

               xref_get_data: case_num xxx => class 9999

       Class 9999 members will appear in the "case" and "class"
       cross-reference reports with a class membership of 1.0.

       Cautionary Points:

       The usual way of using AutoClass is to  put  all  of  your  data  in  a
       data_file,  describe  that  data  with  model and header files, and run
       "autoclass -search".  Now, instead of one data_file you will have  two,
       a training_data_file and a test_data_file.

       It is most important that both databases have the same AutoClass
       internal representation.  Should this not be true, AutoClass will exit
       or, in some situations, possibly crash.  The prediction mode is
       designed to direct the user into conforming to this requirement.

       Preparation:

       Prediction requires having a training classification and a test
       database.  The training classification is generated by running
       "autoclass -search" on the training data_file
       ("data/soybean/soyc.db2"), for example:

           autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
               data/soybean/soyc.model data/soybean/soyc.s-params

       This will produce "soyc.results-bin" and "soyc.search".  Then create  a
       "reports"    parameter    file,    such    as    "soyc.r-params"   (see
       /usr/share/doc/autoclass/reports-c.text),   and   run   AutoClass    in
       "reports" mode, such as:

           autoclass -reports data/soybean/soyc.results-bin
               data/soybean/soyc.search data/soybean/soyc.r-params

       This  will  generate  class  and  case  cross-reference  files,  and an
       influence values file.  The file names are  based  on  the  ".r-params"
       file name:

               data/soybean/soyc.class-text-1
               data/soybean/soyc.case-text-1
               data/soybean/soyc.influ-text-1

       These will describe the classes found in the training_data_file.  Now
       this classification can be used to predict the probabilistic class
       membership of the test_data_file cases
       ("data/soybean/soyc-predict.db2") in the training_data_file classes.

           autoclass -predict data/soybean/soyc-predict.db2
               data/soybean/soyc.results-bin data/soybean/soyc.search
               data/soybean/soyc.r-params

       This will  generate  class  and  case  cross-reference  files  for  the
       test_data_file  cases  predicting their probabilistic class memberships
       in the training_data_file classes.  The file names  are  based  on  the
       ".db2" file name:

               data/soybean/soyc-predict.class-text-1
               data/soybean/soyc-predict.case-text-1

AUTHORS

       Dr. Peter Cheeseman
       Principal Investigator - NASA Ames, Computational Sciences Division
       cheesem@ptolemy.arc.nasa.gov

       John Stutz
       Research Programmer - NASA Ames, Computational Sciences Division
       stutz@ptolemy.arc.nasa.gov

       Will Taylor
       Support Programmer - NASA Ames, Computational Sciences Division
       taylor@ptolemy.arc.nasa.gov