NAME
cmscore - align and score one or more sequences to a CM
SYNOPSIS
cmscore [options] cmfile seqfile
DESCRIPTION
cmscore uses the covariance model (CM) in cmfile to align and score the
sequences in seqfile, and output summary statistics on timings and
scores. cmscore is a testbed for new CM alignment algorithms, and it
is also used by the testsuite. It is not intended to be particularly
useful in the real world. Documentation is provided for completeness,
and to aid our own memories.
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the cmbuild program, from a given RNA sequence
alignment of known consensus structure.
cmscore aligns the sequence(s) in seqfile using two alignment
algorithms and compares the scores and timings of each algorithm. By
default the two algorithms compared are the divide and conquer (D&C)
CYK algorithm (SR Eddy, BMC Bioinformatics 3:18, 2002), and the HMM
banded standard CYK algorithm. When the --nonbanded option is enabled
D&C CYK is compared with the standard CYK alignment algorithm. In this
case, because both algorithms should find the optimal alignment the
parsetree scores should be nearly identical (within 0.01 bits), if this
is not the case for any sequence cmscore exits and prints an error
message. When the --viterbi option is enabled D&C CYK is compared with
Viterbi alignment to a CM Plan 9 HMM constructed to be maximally
similar to the CM.
While non-banded CYK variants are guaranteed to find the optimal
alignment and score of each sequence, HMM banded CYK sacrifices this
guarantee for acceleration. The level of acceleration can be controlled
by the tau parameter, which is set with the --tau <x> option. This is
described in more detail in the man page for cmalign, but in short, <x>
is a rough estimate at the probability that the optimal alignment will
be missed. The greater <x> is, the greater the acceleration, but the
greater the chance of missing the optimal alignment. By default tau is
set as 1E-7. cmscore is useful for testing for values of tau that give
the best trade-off between acceleration versus accuracy. To make this
testing easier, multiple tau values can be tested within a single
cmscore call. The --taus <x> and --taue <x> option combination allow
the user to specify a beginning tau value and an ending tau value. For
example, --taus 3 and and --taue 5 would first align the sequences in
seqfile with non-banded D&C CYK, and then perform 3 additional HMM
banded alignments, first with tau=1E-3, next with tau=1E-4 and finally
with tau=1E-5. Currently, only values of 1E-<x> can be used. Summary
statistics on timings and how often the optimal alignment is missed for
each value or tau are printed to stdout.
When comparing the non-banded standard CYK and D&C CYK algorithms with
the --nonbanded option, the two parse trees should usually be identical
for any sequence, because the optimal alignment score is guaranteed.
However, there can be cases of ties, where two or more different parse
trees have identical scores. In such cases, it is possible for the two
parse trees to differ. The parse tree selected as "optimal" from
amongst the ties is arbitrary, dependent on order of evaluation in the
DP traceback, and the order of evaluation for D&C vs. standard CYK is
different. Thus, in its testsuite role, cmscore checks that the scores
are within 0.01 bits of each other, but does not check that the parse
trees are absolutely identical.
The alignment algorithms can be run in "search" mode within cmscore by
using the --search option. When --search is enabled, --inside
specifies that the Inside algorithm be used instead of CYK and
--forward specifies that the HMM Forward algorithm be used instead of
CYK.
The sequences are treated as single stranded RNAs; that is, only the
given strand of each sequence is aligned and scored, and no reverse
complementing is done.
OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.
-n <n> Set the number of sequences to generate and align to <n>. This
option is incompatible with the --infile option.
-l Turn on the local alignment algorithm, which allows the
alignment to span two or more subsequences if necessary (e.g. if
the structures of the query model and target sequence are only
partially shared), allowing certain large insertions and
deletions in the structure to be penalized differently than
normal indels. The default is to globally align the query model
to the target sequences.
-s <n> Set the random seed to <n>, where <n> is a positive integer. The
default is to use time() to generate a different seed for each
run, which means that two different runs of cmscore on the same
CM will give different results. You can use this option to
generate reproducible results. The random number generator is
used to generate sequences to score, so -s is incompatible with
the --infile option which supplies the sequences to score in an
input file.
-a Print individual timings and score comparisons for each sequence
in seqfile. By default only summary statistics are printed.
--sub Turn on the sub model construction and alignment procedure. For
each sequence, an HMM is first used to predict the model start
and end consensus columns, and a new sub CM is constructed that
only models consensus columns from start to end. The sequence is
then aligned to this sub CM. This option is useful for aligning
sequences that are known to truncated, non-full length
sequences. This "sub CM" procedure is not the same as the "sub
CMs" described by Weinberg and Ruzzo. When used in combination
with --tfile the parsetree printed is not the sub CM parsetree,
but rather the sub CM parstree mapped onto the topology of the
original CM. This mapped parsetree will likely have a different
score (sometimes much worse) than the sub CM parsetree, both of
those scores are printed to the parsetree file for each
sequence.
--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By
default this size is 2048 Mb. This should be large enough for
the most alignments, however if it is not cmscore will exit
prematurely and report an error message that the matrix exceeded
it’s maximum allowable size. In this case, the --mxsize can be
used to raise the limit, or if --nonbanded is enabled, the
--scoreonly option will solve the memory issue. This memory
error is most likely to occur when the --nonbanded option is
used without the --scoreonly option, but can still occur when
--nonbanded is not used.
--devhelp
Print help, as with -h , but also include undocumented developer
options. These options are not listed below, are under
development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only
resources for understanding what they actually do are the brief
one-line description printed when --devhelp is enabled, and the
source code.
--mpi Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
--enable-mpi flag (see User’s Guide for details).
EXPERT OPTIONS
--emit Generate sequences to score by sampling from the CM. This
option is on by default. The number of sequences generated is 10
by default but can be changed with the -n option. The sequences
generated from the CM can be saved to an output file in FASTA
format with the --outfile option.
--random
Generate sequences to score by sampling from the CMs null
distribution. This option turns the --emit option off. By
default the CM distribution will be 0.25 for each of the four
RNA nucleotides, but it may be different if the --null option
was used when cmbuild created the cmfile. By default, the
length of the sequences generated is sampled from the length
distribution of the CM. The average length of the random
sequences will be the consensus length of the RNA family
modelled by the CM (or very close to it). Alternatively, the
--Lmin <n> and --Lmax <n> options can be used to specify a
length distribution. The number of sequences generated is 10 by
default but can be changed with the -n option. The random
sequences generated can be saved to an output file in FASTA
format with the --outfile option.
--infile <f>
Sequences to score are read from the file <f>. All the
sequences from <f> are read and scored, the -n and -s options
are incompatible with --infile.
--outfile <f>
Save generated sequences that are scored to the file <f> in
FASTA format. This option is incompatible with the --infile
option.
--Lmin <n1>
Must be used in combination with --random and --Lmax <n2>. The
lengths of the random sequences generated and scored will be
uniform between the range of <n1>..<n2>.
--Lmax <n2>
Must be used in combination with --random and --Lmin <n1>. The
lengths of the random sequences generated and scored will be
uniform between the range of <n1>..<n2>.
--pad Must be used in combination with --emit and --search. Add <n>
cm->W (max hit length) minus L (sequence <x> length) residues to
the 5’ and 3’ end of each emitted sequence <x>.
--hbanded
Specify that the second stage alignment algorithm be HMM banded
CYK. This option is on by default. For more information on this
option, see the description of the --hbanded option in the man
page for cmalign.
--tau <x>
For stage 2 alignment, set the tail loss probability used during
HMM band calculation to <x>. This is the amount of probability
mass within the HMM posterior probabilities that is considered
negligible. The default value is 1E-7. In general, higher
values will result in greater acceleration, but increase the
chance of missing the optimal alignment due to the HMM bands.
--aln2bands
With --search, when calculating HMM bands, use an HMM alignment
algorithm instead of an HMM search algorithm. In general, using
this option will result in greater acceleration, but will
increase the chance of missing the optimal alignment.
--hsafe
For stage 2 HMM banded alignment, realign any sequences with a
negative alignment score using non-banded CYK to guarantee
finding the optimal alignment.
--nonbanded
Specify that the second stage alignment algorithm be standard,
non-banded, non-D&C CYK. When --nonbanded is enabled, the
program fails with a non-zero exit code and prints an error
message if the parsetree score for any sequence from stage 1 D&C
alignment and stage 2 alignment differs by more than 0.01 bits.
In theory, this should never happen as both algorithms are
guaranteed to determine the optimal parsetree. For larger RNAs
(more than 300 residues) if memory is limiting, --nonbanded
should be used in combination with --scoreonly.
--scoreonly
With --nonbanded during the second stage standard non-banded CYK
alignment, use the "score only" variant of the algorithm to
save memory, and don’t recover a parse tree.
--viterbi
Specify that the second stage alignment algorithm be Viterbi to
a CM Plan 9 HMM.
--search
Run all algorithms in scanning mode, not alignment mode.
This means the highest scoring subsequence within each
sequence is returned as the score, not necessarily the
score of an alignment of the full sequence.
--inside
With --search Compare the non-banded scanning Inside
algorithm to the HMM banded scanning Inside algorith,
instead of using CYK versions.
--forward
With --search Compare the scanning Forward scoring
algorithm against CYK.
--taus <n>
Specify the first alignment algorithm as non-banded D&C
CYK, and multiple stages of HMM banded CYK alignment. The
first HMM banded alignment will use tau=1E-<x>, which
will be the highest value of tau used. Must be used in
combination with --taue.
--taue <n>
Specify the first alignment algorithm as non-banded D&C
CYK, and multiple stages of HMM banded CYK alignment. The
final HMM banded alignment will use tau=1E-<x>, which
will be the lowest value of tau used. Must be used in
combination with --taus.
--tfile <f>
Print the parsetrees for each alignment of each sequence
to file <f>.
SEE ALSO
For complete documentation, see the User’s Guide (Userguide.pdf)
that came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on
redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/