NAME
multimix, multimix-prep - automatically discover classes in data
SYNOPSIS
multimix
multimix-prep
DESCRIPTION
multimix fits a mixture of multivariate distributions to a set of
observations using the EM algorithm. The data file may contain both
categorical and continuous variables.
multimix prompts for the names of the data and parameter files.
The assignment of the observations to groups and the posterior
probabilities are written to GROUPS.OUT. Parameter estimates,
convergence information, and group assignment probabilities are written
to GENERAL.OUT.
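As an illustration of what a posterior probability means here, the
following Python sketch applies Bayes' rule to one observation. It is
not taken from the multimix source. The function names are
hypothetical, and assigning each observation to the group with the
largest posterior probability is an assumption about how the group
assignments are made.

```python
def posterior_probs(pi, densities):
    """Posterior probability of each group for a single observation.

    pi[k] is the mixing proportion of group k and densities[k] is the
    value of group k's fitted density at the observation.
    """
    weighted = [p * f for p, f in zip(pi, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]


def assign_group(posteriors):
    """1-based index of the group with the largest posterior.

    Assumed assignment rule; ties go to the lowest group number.
    """
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best + 1


# Two equally weighted groups; the observation is three times as
# likely under group 1 as under group 2.
post = posterior_probs([0.5, 0.5], [0.3, 0.1])
print(post, assign_group(post))
```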
If multimix does not converge after ITER=200 iterations, the estimates
of the parameters will be written to EMPARAMEST.OUT. This file can then
be used as the parameter input file for multimix if desired.
multimix is limited to a maximum of:
1500 observations (IOB=1500)
6 groups (IK6=6)
15 attributes and partition cells (IP15=15)
10 levels of categories (IM10=10)
200 iterations to convergence (ITER=200)
Recompilation is required to change these parameters.
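Because these limits are compiled in, it can be convenient to check a
problem against them before running multimix. The Python sketch below
is one way to do that. The function name is hypothetical, and the
mapping of the limits onto NOBS, NG, NVAR, NPAR and NCAT (described
under PARAMETER FILE) is an assumption.

```python
# Compiled-in maxima quoted in the list above.
MAX_OBS = 1500      # observations
MAX_GROUPS = 6      # groups
MAX_ATTRS = 15      # attributes, and also partition cells
MAX_LEVELS = 10     # levels of a categorical attribute


def limit_problems(nobs, ng, nvar, npar, ncat):
    """Return a list of messages, one per limit that is exceeded.

    ncat is a list with the number of categories of each attribute
    (0 for continuous attributes).
    """
    problems = []
    if nobs > MAX_OBS:
        problems.append(f"NOBS={nobs} exceeds {MAX_OBS}")
    if ng > MAX_GROUPS:
        problems.append(f"NG={ng} exceeds {MAX_GROUPS}")
    if nvar > MAX_ATTRS:
        problems.append(f"NVAR={nvar} exceeds {MAX_ATTRS}")
    if npar > MAX_ATTRS:
        problems.append(f"NPAR={npar} exceeds {MAX_ATTRS}")
    if ncat and max(ncat) > MAX_LEVELS:
        problems.append(f"max NCAT={max(ncat)} exceeds {MAX_LEVELS}")
    return problems


print(limit_problems(nobs=2000, ng=3, nvar=7, npar=2, ncat=[4, 0, 0]))
```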
DATA FILE
The data file has one line for each observation. Each line has one
entry for each variable. Only the first NVAR entries on each line are
read.
PARAMETER FILE
The parameter file contains free field values which describe the data
and the fitting models. multimix-prep will ask the user a series of
questions and write a suitable parameter file. If the starting point
for the fit is given by specifying initial group assignments for the
observations, then the user should prepare the file of group
assignments before starting multimix-prep. The file format is simple:
the Ith line of the file contains an integer between 1 and NG giving
the group number of the Ith observation. (The experienced user finds
it faster to edit old parameter files into new ones.)
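A file of initial group assignments in the format just described could
be written with a short script such as the following Python sketch. The
function name and the file name groups.txt are hypothetical. The range
check is only a convenience and is not claimed to mirror any check made
by multimix-prep.

```python
import os
import tempfile


def write_group_file(path, groups, ng):
    """Write one group number per line; line I holds observation I."""
    for i, g in enumerate(groups, start=1):
        if not 1 <= g <= ng:
            raise ValueError(f"observation {i}: group {g} not in 1..{ng}")
    with open(path, "w") as fh:
        for g in groups:
            fh.write(f"{g}\n")


# Five observations assigned to three groups, written to a scratch
# directory so that the example leaves no file behind in the cwd.
demo_path = os.path.join(tempfile.mkdtemp(), "groups.txt")
write_group_file(demo_path, [1, 2, 2, 1, 3], ng=3)
```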
multimix requires variables in a partition to be stored contiguously.
Hence the data is read in with the variable order being specified by
JP(J). IVARTYPE(J) and NCAT(J) both refer to the rearranged data.
The first five values are:
NG The number of groups (distributions) in the finite mixture to be
fitted.
NOBS The number of observations.
NVAR The number of attributes.
NPAR The number of partition cells (sets of attributes associated
within each distribution).
ISPEC Flag indicating how the starting point is specified for the fit:
1 Initial parameter estimates are specified.
2 Observations are assigned to groups.
Next come eight arrays of data:
JP JP(J) is the column of the data array into which the Jth
attribute of the data file will be stored, where J varies from 1
to NVAR. For example, suppose we want the third attribute in
the first column, attribute 4 in the second column, attribute 7
in the 3rd column, and then attributes 1, 2, 5, and 6. Then
JP(J) = 4 5 1 2 6 7 3, for J=1,...,7.
IP IP(L) is the number of attributes in the Lth partition cell,
L=1,...,NPAR.
IPC IPC(L) is the number of continuous attributes in the Lth
partition cell.
ISV ISV(L) gives the index J of the start of partition cell L. E.g.
if attributes 6, 7, and 8 are in the same partition cell L, then
ISV(L)=6 and IEV(L)=8.
IEV IEV(L) gives the index J of the end of partition cell L.
IPARTYPE
IPARTYPE(L) is an indicator giving the type of model for
partition L:
1 for a categorical model.
2 for a multivariate normal model.
3 for a location model.
IVARTYPE
IVARTYPE(J) is an indicator for the type of attribute J:
1 for a categorical attribute.
2 for a multivariate normal attribute.
3 for a categorical attribute in a location model.
4 for a multivariate normal attribute in a location
model.
NCAT NCAT(J) is the number of categories for the Jth categorical
attribute. For continuous attributes, NCAT(J) should be 0.
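The index arrays are easy to get wrong by hand, so the following
Python sketch derives JP from the desired column order and ISV and IEV
from IP. The function names are hypothetical. The second function
assumes that the partition cells occupy the rearranged columns in
order, with no gaps, which is one reading of the contiguity
requirement stated earlier.

```python
def jp_from_order(order):
    """order[c-1] is the data-file attribute wanted in column c.

    Returns JP as a list, where JP[j-1] is the column that receives
    attribute j of the data file.
    """
    jp = [0] * len(order)
    for column, attribute in enumerate(order, start=1):
        jp[attribute - 1] = column
    return jp


def isv_iev_from_ip(ip):
    """Start and end column of each partition cell, given cell sizes."""
    isv, iev = [], []
    start = 1
    for size in ip:
        isv.append(start)
        iev.append(start + size - 1)
        start += size
    return isv, iev


# The JP example from the text: attributes 3, 4, 7, then 1, 2, 5, 6.
print(jp_from_order([3, 4, 7, 1, 2, 5, 6]))   # [4, 5, 1, 2, 6, 7, 3]

# Two cells of 5 and 3 attributes: the second spans columns 6 to 8.
print(isv_iev_from_ip([5, 3]))                # ([1, 6], [5, 8])
```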
If observations are assigned to groups (ISPEC=2), then those
assignments are next:
IGRP IGRP(I) is the index of the group that observation I is in.
If observations are not assigned to groups (ISPEC=1), then estimates of
the parameters are next:
PI PI(K) is the estimated mixing proportion for group K
(K=1,...,NG).
The parameters for each group depend on the type of attribute:
THETA THETA(K,J,M) is the estimated probability that the Jth
categorical attribute is at level M, given that the observation
is in group K. Repeat for each attribute, J=ISV(L),...,IEV(L).
(Categorical attributes only.)
EMU EMU(K,L,J) is the estimated mean vector for group K, partition
cell L and attribute J. (Multivariate normal model only.)
THETA THETA(K,J,M) is the estimated probability that the Jth
categorical attribute in the location model is at level M, given
that the observation is in group K. (Categorical attributes only.)
EMUL EMUL(K,L,J,M) is the estimated mean vector for group K,
partition cell L and attribute J, at the Mth level of the
categorical attribute in the location model. (Multivariate
normal model only.)
VARIX ((VARIX(K,L,I,J),J=1,IPC(L)), I=1,IPC(L)) An entry in VARIX is
the estimated covariance between attributes I and J for group K,
partition cell L, where I=1,...,IPC(L), and J=1,...,IPC(L).
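The VARIX expression above is a Fortran implied DO loop in which J
varies fastest, so the covariance matrix is listed row by row. The
Python sketch below flattens a matrix in that order. The function name
is hypothetical, and the symmetry check is an added convenience (a
covariance matrix is symmetric by definition), not a documented
multimix requirement.

```python
def varix_values(cov, tol=1e-12):
    """Flatten a square covariance matrix row by row (I outer, J inner)."""
    n = len(cov)
    for i in range(n):
        if len(cov[i]) != n:
            raise ValueError("covariance matrix must be square")
        for j in range(i):
            if abs(cov[i][j] - cov[j][i]) > tol:
                raise ValueError(f"entries ({i+1},{j+1}) and "
                                 f"({j+1},{i+1}) differ")
    return [cov[i][j] for i in range(n) for j in range(n)]


# A cell with two continuous attributes (IPC(L)=2).
cov = [[4.0, 1.5],
       [1.5, 9.0]]
print(" ".join(str(v) for v in varix_values(cov)))   # 4.0 1.5 1.5 9.0
```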
The required parameters are read in for each partition cell,
L=1,...,NPAR. For example, if the attributes within the partition cell
are all categorical, that is, IPARTYPE(L)=1, then THETA(K,J,M), for
M=1,...,NCAT(J), is required for each attribute in that partition cell.
If the attributes within the partition cell are continuous
multivariate normal attributes, that is, IPARTYPE(L)=2, then estimates
of EMU(K,L,J) are required for each attribute.
If the attributes within the partition cell follow the location model,
that is, IPARTYPE(L)=3, then THETA(K,J,M), M=1,...,NCAT(J), is required
for the categorical attribute, and EMUL(K,L,J,M), M=1,...,IM(L), is
required for each continuous multivariate normal attribute. (Note that
IM(L) is the number of categories of the categorical attribute
associated with the location model.)
The estimates are read in first for group 1, then for group 2, etc.
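Before typing initial estimates into a parameter file, it may help to
confirm that they are valid probabilities: the mixing proportions PI
should sum to 1, and so should THETA(K,J,M) over the levels M for each
group K and categorical attribute J. The Python sketch below performs
that check. The function name and the dictionary layout are
hypothetical, and whether multimix makes a similar check itself is not
stated here.

```python
def estimate_problems(pi, theta, tol=1e-6):
    """Return messages for probability vectors that do not sum to 1.

    pi is the list of mixing proportions. theta maps a (K, J) pair to
    the list of level probabilities THETA(K,J,1..NCAT(J)).
    """
    problems = []
    if abs(sum(pi) - 1.0) > tol:
        problems.append(f"PI sums to {sum(pi)}")
    for (k, j), levels in sorted(theta.items()):
        if abs(sum(levels) - 1.0) > tol:
            problems.append(f"THETA({k},{j},.) sums to {sum(levels)}")
    return problems


pi = [0.4, 0.6]
theta = {(1, 1): [0.2, 0.8], (2, 1): [0.5, 0.5]}
print(estimate_problems(pi, theta))   # []
```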
EXAMPLES
See /usr/share/doc/multimix/examples.
FILES
GROUPS.OUT multimix output: the assignment of the observations to
groups and the posterior probabilities. If observations were initially
assigned to groups (ISPEC=2), these assignments may be different. Some
are likely to be different if the fitting distributions overlap.
GENERAL.OUT multimix output: parameter estimates, convergence
information, and group assignment probabilities.
EMPARAMEST.OUT multimix output on failure to converge: current
parameter estimates. This file can then be used as the parameter input
file for multimix if desired.
AUTHORS
Lynette A. Hunt <lah@waikato.ac.nz> and Murray Jorgensen
<maj@waikato.ac.nz>.
SEE ALSO
/usr/share/doc/multimix/paper.ps.gz
/usr/share/doc/multimix/talk.ps.gz
/usr/share/doc/multimix/notes.ps.gz
/usr/share/doc/multimix/PPAPER.ps.gz
/usr/share/doc/multimix/alltables.ps.gz
autoclass(1).
December 10, 2001