NAME
cmcalibrate - fit exponential tails for E-values and determine HMM
filter thresholds for a CM
SYNOPSIS
cmcalibrate [options] cmfile
DESCRIPTION
cmcalibrate calibrates E-value statistics and HMM filter thresholds for
the covariance models (CMs) in cmfile. The E-values and HMM filter
threshold statistics are added to the cmfile and are used by cmsearch
for increased sensitivity and acceleration in RNA homology search.
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the cmbuild program, from a given RNA sequence
alignment of known consensus structure.
cmcalibrate is very slow. It takes several hours to calibrate a single
average sized CM. cmcalibrate can be run in parallel with MPI. To do
this, use the --mpi option and run cmsearch inside a MPI wrapper
program such as mpirun. For example: mpirun C cmcalibrate --mpi [other
options] cmfile. Note that cmcalibrate must have been compiled in MPI
mode to use the --mpi option. See the Installation seciton of the
user’s guide for instructions on how to compile in MPI mode.
The --forecast <n> option can be used to estimate how long the program
will take to run on <n> processors. Unless you plan on running
cmcalibrate in MPI mode, <n> should be set as 1.
cmcalibrate performs two main tasks. The first is to calibrate E-value
statistics. This is done by generating random sequences and searching
them with the CM and collecting hits. The histogram of the bit scores
of the hits is fit to an exponential tail, and the parameters of the
fitted tail are saved to the CM file. The exponential tail is used to
predict the expected number of hits (E-values) at a given bit score in
cmsearch. The random sequences are generated by an HMM that was
trained on real genomic sequences with various GC contents. The goal is
to have the GC distributions in the random sequences to be similar to
actual genomic sequences.
The second task is to determine appropriate HMM filter thresholds for
the CM over the possible range of final CM bit score thresholds. This
is done by sampling 10,000 sequences from the CM itself and searching
them with the CM and HMM. The appropriate HMM bit score threshold for a
given CM threshold is set as the HMM threshold that will recognize
99.5% of the hits that score above the CM threshold. This HMM threshold
is calculated over the range of reasonable CM thresholds.
Both tasks must be performed for each configuration and algorithm that
cmsearch might use. These include HMM Viterbi, HMM Forward, CM CYK and
CM Inside algorithms for E-value calibration, and CM CYK and CM Inside
algorithms for HMM filter thresholds. Additionally, for each algorithm,
each task must be performed twice, once for a locally configured model
and once for a globally configured model.
The E-values and HMM filter thresholds determined by cmcalibrate are
only used by the cmsearch program. If you are not going to use
cmsearch, do not waste time calibrating your models.
The majority of the options to cmcalibrate fall into one of two
categories, depending on which of the two main tasks they’re associated
with. Options that affect the exponential tail E-value fitting are
prefixed with --exp. Options that affect the HMM filter threshold
determination are prefixed with --fil.
The calibration of E-value statistics takes the majority of the running
time of cmcalibrate. This is because CM search algorithms are slow,
and the random sequences that must be searched have to be long enough
to include enough random hits that can be binned into a histogram to
which an exponential tail can be reliably fit. By default the random
sequence length for CM searches is 1.5 megabases (Mb), for all search
modes, but 1.5 can be changed to <x> with --exp-cmL-glc <x> or --exp-
cmL-loc <x> options for glocal and local CM search calibrations
respectively. Because cmsearch uses HMM search algorithms to filter,
cmcalibrate must also fit exponential tails for HMM search algorithms.
HMMs are much faster than CMs so it is possible to search much longer
random sequence than 1.5 MB and not significantly increase the running
time of cmcalibrate. The length of sequence searched with the HMM is
controlled by the --exp-fract <x>, --exp-hmmLn-glc <x>, --exp-hmmLn-loc
<x>, and the --exp-hmmLx <x> options. By default, the sequence length
for HMM calibration is set as the length that will require 0.10 times
the number of dynamic programming calculations as a CM E-value
calibration step. (The value 0.10 can be changed to <x> with the --exp-
fract <x> option). If this sequence length is less than a minimum
value, which by default is 15.0 MB, then the minimum value is used. The
minimum value can be changed to <x> with --exp-hmmLn-glc <x> and --exp-
hmmLn-loc <x> for glocal and local HMM search calibrations separately.
Similarily if this value is more than a maximum value, which by default
is 1000.0 MB, then the maximum value is used. The maximum value can be
changed to <x> with the --exp-hmmLx <x> option.
OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.
-s <n> Set the random number generator seed to <n>, where <n> is a
positive integer. The default is to use time() to generate a
different seed for each run, which means that two different runs
of cmcalibrate on the same CM will give slightly different E-
value and HMM filter threshold parameters. You can use this
option to generate reproducible results.
--forecast <n>
Predict the running time of the calibration for cmfile and
provided options and exit, DO NOT perform the calibration. The
predictions should be used as rough estimates. The value <n> is
the number of processors the calibration will be run on, so <n>
equal to 1 is appropriate unless you will run cmcalibrate in
parallel with MPI.
--devhelp
Print help, as with -h , but also include undocumented developer
options. These options are not listed below, are under
development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only
resources for understanding what they actually do are the brief
one-line description printed when --devhelp is enabled, and the
source code.
--mpi Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User’s Guide for details).
EXPERT OPTIONS
--exp-cmL-glc <x>
Set the length of random sequence to search for the CM glocal
exponential tail fits to <x> megabases (Mb). By default, <x> is
1.5 Mb. Searching more sequences will make the exponential tail
fits more precise, but will take longer: using <x> of 3.0
instead of the default of 1.5 will cause the running time of
cmcalibrate to increase by roughly 50%.
--exp-cmL-loc <x>
Set the length of random sequence to search for the CM local
exponential tail fits to <x> megabases (Mb). By default, <x> is
1.5 Mb. Searching more sequences will make the exponential tail
fits more precise, but will take longer: using <x> of 3.0
instead of the default of 1.5 will cause the running time of
cmcalibrate to increase by roughly 50%.
--exp-hmmLn-glc <x>
Set the minimum random sequence length to search for the HMM
glocal exponential tail fits to <x> megabases (Mb). By default,
<x> is 15.0. For more information, see the explanation
regarding sequence lengths for E-value calibration above before
the Options section.
--exp-hmmLn-loc <x>
Set the minimum random sequence length to search for the HMM
local exponential tail fits to <x> megabases (Mb). By default,
<x> is 15.0. For more information, see the explanation
regarding sequence lengths for E-value calibration above before
the Options section.
--exp-hmmLx <x>
Set the maximum random sequence length to search when
determining HMM E-values to <x> megabases (Mb). By default, <x>
is 1000.0. For more information, see the explanation regarding
sequence lengths for E-value calibration above before the
Options section.
--exp-fract <x>
Set the HMM/CM fraction of dynamic programming calculations to
<x>. By default, <x> is 0.10. For more information, see the
explanation regarding sequence lengths for E-value calibration
above before the Options section.
--exp-tailn-cglc <x>
During E-value calibration of glocal CM search modes fit the
exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this <x> is 25.
The value 25 was chosen because it works well empirically for
glocal CM modes relative to other values.
--exp-tailn-cloc <x>
During E-value calibration of local CM search modes fit the
exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this <x> is 75.
The value 75 was chosen because it works well empirically for
local CM modes relative to other values.
--exp-tailn-hglc <x>
During E-value calibration of glocal HMM search modes fit the
exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this <x> is 250.
The value 250 was chosen because it works well empirically for
glocal HMM modes relative to other values.
--exp-tailn-hloc <x>
During E-value calibration of local HMM search modes fit the
exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this <x> is 750.
The value 750 was chosen because it works well empirically for
glocal HMM modes relative to other values.
--exp-tailp <x>
Ignore the --exp-tailn prefixed options and fit the <x> fraction
right tail of the histogram to exponential tails, for all search
modes.
--exp-tailxn <n>
With --exp-tailp enforce that the maximum number of hits in the
tail that is fit is <n>.
--exp-beta <x>
During E-value calibration, by default query-dependent banding
(QDB) is used to accelerate the CM search algorithms with a beta
tail loss probability of 1E-15. This beta value can be changed
to <x> using the --exp-beta <x> option. The beta parameter is
the amount of probability mass excluded during band calculation,
higher values of beta give greater speedups but sacrifice more
accuracy than lower values. A recommended value is 1E-7
(0.00001). QDB is explained in more detail in the manual page
for cmsearch and in (Nawrocki and Eddy, PLoS Computational
Biology 3(3): e56).
--exp-no-qdb
Turn of QDB during E-value calibration. This will slow down
calibration, and is not recommended unless you plan on using
--no-qdb in cmsearch.
--exp-hfile <f>
Save the histograms fit for the E-value calibration to file <f>.
The format of this file is two tab delimited columns. The first
column is the x-axis values of bit scores of each bin. The
second column is the y-axis values of number of hits per bin.
Each series is delimited by a line with a single character "&".
The file will contain one series for each exponential tail fit,
i.e. one series of empirical data for each line of output from
cmcalibrate that begins with "exp tail".
--exp-sfile <f>
Save a survival plot for the E-value calibration to file <f>.
The format of this file is two tab delimited columns. The first
column is the x-axis values of bit scores of each bin. The
second column is the y-axis values of fraction of hits that meet
or exceed the score for each bin. Each series is delimited by a
line with a single character "&". The file will contain three
series’ of data for each exponential tail fit, i.e. three series
for each line of output from cmcalibrate that begins with "exp
tail". The first series is the empirical survival plot from the
histogram of hits to the random sequence. The second series is
the exponential tail fit to the empirical distribution. The
third series is the exponential tail fit if lambda were fixed
and set as the natural log of 2 (0.691314718).
--exp-qqfile <f>
Save a quantile-quantile plot for the E-value calibration to
file <f>. The format of this file is two tab delimited columns.
The first column is the x-axis values, and the second column is
the y-axis values. The distance of the points from the identity
line (y=x) is a measure of how good the exponential tail fit is,
the closer the points are to the identity line, the better the
fit is. Each series is delimited by a line with a single
character "&". The file will contain one series of empirical
data for each exponential tail fit, i.e. one series for each
line of output from cmcalibrate that begins with "exp tail".
--exp-ffile <f>
Save statistics on the exponential tail statistics to file <f>.
The file will contain the lambda and mu values for exponential
tails fit to tails of different sizes. For example, by default
cmcalibrate fits exponential tails to the rightmost 0.01 (1) of
the score histogram and stores the parameters of that
exponential tail to the CM file. (The value of 0.01 can be
changed to <x> with the --exp-tailp <x> option). When --exp-
ffile <f> is used the file <f> will include the exponential
tail parameters for fits to various fractions of the histogram
tail, instead of just to 0.01.
--fil-N <n>
Set the number of sequences sampled and searched for the HMM
filter threshold calibration to <n>. By default, <n> is 10,000.
--fil-F <x>
Set the fraction of sample sequences the HMM filter must be able
to recognize, and allow to survive, to <x>, where <x> is a
positive real number less than or equal to 1.0. By default, <x>
is 0.993.
--fil-tau <x>
Set the tail loss probability during HMM band calculation for
HMM filter threshold calibration to <x>. This is the amount of
probability mass within the HMM posterior probabilities that is
considered negligible. The default value is 1E-7. In general,
higher values will result in greater acceleration, but increase
the chance of missing the optimal alignment due to the HMM
bands.
--fil-gemit
During HMM filter calibration, always sample sequences from a
globally configured CM, even when calibrating local modes. By
default, sequences are sampled from a globally configured CM
when calibrating the global search modes, and sampled from a
locally configured CM when calibrating the local search modes.
--fil-dfile <f>
Save statistics on filter threshold calibration, including HMM
and CM scores for all sampled sequences, to file <f>.
--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By
default this size is 2,048 Mb. This should be large enough for
the vast majority of calibrations, however if it is not
cmcalibrate will exit prematurely and report an error message
that the matrix exceeded it’s maximum allowable size. In this
case, the --mxsize can be used to raise the limit.
SEE ALSO
For complete documentation, see the User’s Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on
redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/