NAME
cmbuild - construct a CM from an RNA multiple sequence alignment
SYNOPSIS
cmbuild [options] cmfile alifile
DESCRIPTION
cmbuild reads an RNA multiple sequence alignment from alifile,
constructs a covariance model (CM), and saves the CM to cmfile.
The alignment file must be in Stockholm format, and must contain
consensus secondary structure annotation. cmbuild uses the consensus
structure to determine the architecture of the CM.
The alignment file may be a database containing more than one
alignment. If so, the resulting cmfile will be a database of CMs, one
per alignment.
The expert options --ctarget, --cmaxid, and --call result in multiple
CMs being built from each alignment in alifile, as described below.
OUTPUT
The default output from cmbuild is tabular, with a single line printed
for each model. Each line has the following fields: aln: the index of
the alignment used to build the CM; cm idx: the index of the CM in the
cmfile; name: the name of the CM; nseq: the number of sequences in the
alignment used to build the CM; eff_nseq: the effective number of
sequences used to build the model (see the User’s Guide); alen: the
length of the alignment used to build the CM; clen: the number of
columns from the alignment defined as consensus columns; rel entropy,
CM: the total relative entropy of the model divided by the number of
consensus columns; rel entropy, HMM: the total relative entropy of the
model ignoring secondary structure divided by the number of consensus
columns.
OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.
-n <s> Name the covariance model <s>. This option does not work if
alifile contains more than one alignment. The default is to use the
name of the alignment (given by the #=GF ID tag, in Stockholm
format), or if that is not present, to use the name of the
alignment file minus any file type extension plus a "-" and a
positive integer indicating the position of that alignment in
the file (that is, the first alignment in a file "myrnas.sto"
would give a CM named "myrnas-1", the second alignment would
give a CM named "myrnas-2").
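The default naming rule described for -n can be sketched in Python. This only illustrates the rule as stated above and is not cmbuild's own code; the function name default_cm_name and its arguments are hypothetical, and stripping any leading directory from the file name is an assumption.

```python
import os

def default_cm_name(alifile, aln_index, gf_id=None):
    """Return the default CM name for the aln_index-th alignment (1-based).

    Hypothetical helper that mirrors the rule described for -n.
    """
    if gf_id:
        # A "#=GF ID" tag in the alignment takes precedence.
        return gf_id
    # Assumption: any leading directory is dropped before naming.
    base = os.path.basename(alifile)
    stem, _ext = os.path.splitext(base)  # remove the file type extension
    return f"{stem}-{aln_index}"

print(default_cm_name("myrnas.sto", 1))          # myrnas-1
print(default_cm_name("myrnas.sto", 2))          # myrnas-2
print(default_cm_name("myrnas.sto", 1, "tRNA"))  # tRNA
```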
-A Append the CM to cmfile, if cmfile already exists.
-F Allow cmfile to be overwritten. Normally, if cmfile already
exists, cmbuild exits with an error unless the -A or -F option
is set.
-v Run in verbose output mode instead of using the default single
line tabular format. This output format is similar to that used
by older versions of Infernal.
--iins Allow informative insert emissions for the CM. By default, all
CM insert emission scores are set to 0.0 bits. The motivation
for zero bit scores is to avoid high-scoring hits to low
complexity sequence favored by high insert state emission
scores.
--Wbeta <x>
Set the beta tail loss probability for query-dependent banding
(QDB) to <x>. The QDB algorithm is used to determine the maximum
length of a hit to the model. For more information on QDB see
(Nawrocki and Eddy, PLoS Computational Biology 3(3): e56). The
beta parameter is the amount of probability mass considered
negligible during band calculation; lower values of beta will
result in shorter maximum hit lengths, which will yield faster
searches. The default beta is 1E-7, determined empirically as a
good tradeoff between sensitivity, specificity and speed.
--devhelp
Print help, as with -h, but also include undocumented developer
options. These options are not listed below. They are under
development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only
resources for understanding what they actually do are the brief
one-line description printed when --devhelp is enabled, and the
source code.
EXPERT OPTIONS
--rsearch <f>
Parameterize emission scores a la RSEARCH, using the RIBOSUM
matrix in file <f>. (Actually, the emission scores will not be
identical to RIBOSUM scores due to differences in the modelling
strategy between Infernal and RSEARCH, but they will be as
similar as possible.) RIBOSUM matrix files are included with
Infernal in the "matrices/" subdirectory of the top-level
Infernal directory. RIBOSUM matrices are substitution score
matrices trained specifically for structural RNAs with separate
single stranded residue and base pair substitution scores. For
more information see the RSEARCH publication (Klein and Eddy,
BMC Bioinformatics 4:44, 2003).
With --rsearch enabled, all alignments in alifile must contain
exactly one sequence or the --call option must also be enabled.
--binary
Save the model in a compact binary format. The default is a more
readable ASCII text format.
--rf Use reference coordinate annotation (#=GC RF line, in Stockholm)
to determine which columns are consensus, and which are inserts.
Any non-gap character indicates a consensus column. (For
example, mark consensus columns with "x", and insert columns
with ".".) The default is to determine this automatically; if
the frequency of gap characters in a column is greater than a
threshold, gapthresh (default 0.5), the column is called an
insertion.
--gapthresh <x>
Set the gap threshold (used for determining which columns are
insertions versus consensus; see --rf above) to <x>. The
default is 0.5.
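The automatic consensus/insert decision described for --rf and --gapthresh can be illustrated with a short Python sketch. It is a simplified reading of the rule above, not cmbuild's implementation: the function name is hypothetical, gap frequencies are computed from unweighted counts, and only "-" and "." are treated as gap characters. All three are assumptions.

```python
GAP_CHARS = set("-.")  # assumption: characters counted as gaps

def consensus_columns(aligned_seqs, gapthresh=0.5):
    """Return one boolean per column: True for consensus, False for insert.

    A column is called an insertion when its gap frequency is greater
    than gapthresh. Unweighted counts are an assumption of this sketch.
    """
    nseq = len(aligned_seqs)
    alen = len(aligned_seqs[0])
    calls = []
    for col in range(alen):
        ngap = sum(1 for seq in aligned_seqs if seq[col] in GAP_CHARS)
        calls.append(ngap / nseq <= gapthresh)
    return calls

aln = ["GGA-CU",
       "GGAACU",
       "GG--CU",
       "GC--CU"]
print(consensus_columns(aln))                 # default threshold 0.5
print(consensus_columns(aln, gapthresh=0.4))  # stricter threshold
```

In this toy alignment the third column has a gap frequency of exactly 0.5, so it stays a consensus column at the default threshold and becomes an insert column at 0.4.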
--ignorant
Strip all base pair secondary structure information from all
input alignments in alifile before building the CM(s). All
resulting CM(s) will have zero MATP (base pair) nodes, with zero
bifurcations.
--wgsc Use the Gerstein/Sonnhammer/Chothia (GSC) weighting algorithm.
This is the default unless the number of sequences in the
alignment exceeds a cutoff (see --pbswitch), in which case the
default becomes the faster Henikoff position-based weighting
scheme.
--wblosum
Use the BLOSUM filtering algorithm to weight the sequences,
instead of the default GSC weighting. Cluster the sequences at
a given percentage identity (see --wid); assign each cluster a
total weight of 1.0, distributed equally amongst the members of
that cluster.
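The weight assignment described for --wblosum can be sketched as follows. The clustering step itself (percent identity, see --wid) is not shown; the sketch assumes cluster labels are already available, and the function name is hypothetical.

```python
from collections import Counter

def blosum_weights(cluster_ids):
    """Given one cluster label per sequence, return one weight per sequence.

    Each cluster receives a total weight of 1.0, shared equally by its
    members. How the clusters are found is outside this sketch.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]

labels = [0, 0, 0, 1, 2, 2]   # toy labels: three clusters
weights = blosum_weights(labels)
print(weights)
print(sum(weights))           # total weight equals the number of clusters
```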
--wpb Use the Henikoff position-based weighting scheme. This weighting
scheme is automatically used (overriding --wgsc and --wblosum)
if the number of sequences in the alignment exceeds a cutoff
(see --pbswitch).
--wnone
Turn sequence weighting off; i.e. explicitly set all sequence
weights to 1.0.
--wgiven
Use sequence weights as given in annotation in the input
alignment file. If no weights were given, assume they are all
1.0. The default is to determine new sequence weights by the
Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated
weights.
--pbswitch <n>
Set the cutoff for automatically switching the weighting method
to the Henikoff position-based weighting scheme to <n>. If the
number of sequences in the alignment exceeds <n>, Henikoff
weighting is used. By default <n> is 5000.
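The interaction between the weighting options and --pbswitch can be summarized in a few lines of Python. The scheme labels and the function name are invented for this sketch. The text above only states that --wgsc and --wblosum are overridden, so the sketch leaves the other schemes unchanged, which is an assumption.

```python
def effective_weighting(nseq, requested="gsc", pbswitch=5000):
    """Return the weighting scheme that would actually be applied.

    requested is one of "gsc", "blosum", "pb", "none", "given"
    (labels chosen for this sketch to match --wgsc, --wblosum, ...).
    """
    if requested in ("gsc", "blosum") and nseq > pbswitch:
        # Large alignments switch to Henikoff position-based weights.
        return "pb"
    return requested

print(effective_weighting(100))             # gsc
print(effective_weighting(5001))            # pb
print(effective_weighting(6000, "blosum"))  # pb
```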
--wid <x>
Controls the behavior of the --wblosum weighting option by
setting the percent identity for clustering the alignment to
<x>.
--eent Use the entropy weighting strategy to determine the effective
sequence number that gives a target mean match state relative
entropy. This option is the default, and can be turned off with
--enone. The default target mean match state relative entropy
is 0.59 bits but can be changed with --ere. The default of 0.59
bits is automatically changed if the total relative entropy of
the model (summed match state relative entropy) is less than a
cutoff, which is 6.0 bits by default, but can be changed with
the expert, undocumented --eX option. If you really want to play
with that option, consult the source code.
--enone
Turn off the entropy weighting strategy. The effective sequence
number is just the number of sequences in the alignment.
--ere <x>
Set the target mean match state relative entropy as <x>. By
default the target relative entropy per match position is 0.59
bits.
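The idea behind entropy weighting (--eent, --ere) can be sketched as a search for the effective sequence number that brings the mean match state relative entropy down to the target. The sketch below is a loose illustration only: it uses a flat pseudocount in place of the mixture Dirichlet prior, a uniform 0.25 null model, single-residue columns only, and it assumes the effective number never exceeds the real number of sequences. All function names are hypothetical.

```python
import math

def mean_relative_entropy(freqs, eff_nseq, pseudo=1.0):
    """Mean per-column relative entropy in bits against a uniform null.

    freqs: list of columns, each a list of 4 observed frequencies.
    Counts are eff_nseq * frequency; `pseudo` is a flat pseudocount
    (an assumption standing in for the real mixture Dirichlet prior).
    """
    total = 0.0
    denom = eff_nseq + 4 * pseudo
    for col in freqs:
        for f in col:
            p = (eff_nseq * f + pseudo) / denom
            total += p * math.log2(p / 0.25)
    return total / len(freqs)

def effective_seq_number(freqs, nseq, target=0.59, tol=1e-6):
    """Bisect for the effective sequence number that meets the target."""
    if mean_relative_entropy(freqs, nseq) <= target:
        return float(nseq)  # already at or below the target
    lo, hi = 0.0, float(nseq)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_relative_entropy(freqs, mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Toy columns: fully conserved, two-way split, and uninformative.
freqs = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.25] * 4]
eff = effective_seq_number(freqs, nseq=50)
print(eff, mean_relative_entropy(freqs, eff))
```

Bisection works here because, for fixed observed frequencies, the relative entropy grows monotonically as the counts outweigh the pseudocounts.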
--null <f>
Read a null model from <f>. The null model defines the
probability of each RNA nucleotide in background sequence; the
default is to use 0.25 for each nucleotide. The format of null
files is documented in the User’s Guide.
--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture
Dirichlet. The format of prior files is documented in the
User’s Guide.
--ctarget <n>
Cluster each alignment in alifile by percent identity. Find a
cutoff percent id threshold that gives exactly <n> clusters and
build a separate CM from each cluster. If <n> is greater than
the number of sequences in the alignment the program will not
complain, and each sequence in the alignment will be its own
cluster. Each CM will have a positive integer appended to its
name indicating the order in which it was built. For example, if
cmbuild --ctarget 3 is called with alifile "myrnas.sto", and
"myrnas.sto" has exactly one Stockholm alignment in it with no
#=GF ID tag annotation, three CMs will be built, the first will
be named "myrnas-1.1", the second, "myrnas-1.2", and the third
"myrnas-1.3". (As explained above for the -n option, the first
number "1" after "myrnas" indicates the CM was built from the
first alignment in "myrnas.sto".)
--cmaxid <x>
Cluster each sequence alignment in alifile by percent identity.
Define clusters at the cutoff fractional id similarity of <x>
and build a separate CM from each cluster. No two sequences
will be more than <x> fractionally identical (<x> * 100
percent identical) if those two sequences are in different
clusters. The CMs are named as described above for --ctarget.
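The guarantee stated for --cmaxid (no two sequences in different clusters are more than <x> identical) is what single-linkage clustering provides. The Python sketch below shows that idea; it is not cmbuild's code. The identity definition (matches over columns where both sequences have a residue) and the function names are assumptions.

```python
def fractional_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap.

    One of several possible definitions; an assumption of this sketch.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def cluster_by_max_id(seqs, maxid):
    """Single linkage: join any two sequences whose identity exceeds maxid."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if fractional_identity(seqs[i], seqs[j]) > maxid:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

seqs = ["GGAACU", "GGAACC", "CCUUGA", "CCUUGG"]
print(cluster_by_max_id(seqs, 0.8))  # two clusters of two sequences
print(cluster_by_max_id(seqs, 0.9))  # every sequence on its own
```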
--call Build a separate CM from each sequence in each alignment in
alifile. Naming of CMs takes place as described above for
--ctarget. Using this option in combination with --rsearch
causes a separate CM to be built and parameterized using a
RIBOSUM matrix for each sequence in alifile.
--corig
After building multiple CMs using --ctarget, --cmaxid or
--call as described above, build a final CM using the complete
original alignment from alifile. The CMs are named as described
above for --ctarget with the exception of the final CM built
from the original alignment which is named in the default
manner, without an appended integer.
--cdump <f>
Dump the multiple alignments of each cluster to <f> in Stockholm
format. This option only works in combination with --ctarget,
--cmaxid or --call.
--refine <f>
Attempt to refine the alignment before building the CM using
expectation-maximization (EM). A CM is first built from the
initial alignment as usual. Then, the sequences in the alignment
are realigned optimally (with the HMM banded CYK algorithm;
"optimal" here means optimal given the bands) to the CM, and a new CM
is built from the resulting alignment. The sequences are then
realigned to the new CM, and a new CM is built from that
alignment. This is continued until convergence, specifically
when the alignments for two successive iterations are not
significantly different (the summed bit score of all the
sequences in the alignment changes by less than 1% between two
successive iterations). The final alignment (the alignment used
to build the CM that gets written to cmfile) is written to <f>.
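The refinement loop described for --refine can be outlined in Python. The two callables stand in for model construction and realignment and are placeholders supplied by the caller, not Infernal functions; the max_iter safeguard is also an assumption. The convergence test follows the 1% rule on the summed bit score stated above.

```python
def refine(alignment, build_cm, realign, max_iter=100):
    """Iterate build/realign until the summed bit score changes by < 1%.

    build_cm(alignment) -> model
    realign(model, alignment) -> (new_alignment, list_of_bit_scores)
    Both callables are placeholders; max_iter is a safeguard added here.
    """
    model = build_cm(alignment)
    prev_total = None
    for _ in range(max_iter):
        alignment, scores = realign(model, alignment)
        model = build_cm(alignment)  # new CM from the new alignment
        total = sum(scores)
        if prev_total is not None and prev_total != 0:
            if abs(total - prev_total) / abs(prev_total) < 0.01:
                break
        prev_total = total
    return model, alignment

# Toy stand-ins: the "alignment" is just an iteration counter and the
# score approaches 100 bits, so the loop stops once the change is < 1%.
def toy_build(aln):
    return ("cm", aln)

def toy_realign(model, aln):
    nxt = aln + 1
    return nxt, [100.0 - 50.0 / (2 ** nxt)]

final_model, final_aln = refine(0, toy_build, toy_realign)
print(final_model, final_aln)
```

The returned model is always the one built from the returned alignment, matching the statement that the final alignment is the one used to build the CM written to cmfile.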
--gibbs
Modifies the behavior of --refine so Gibbs sampling is used
instead of EM. The difference is that during the alignment stage
the alignment is not necessarily optimal; instead, an alignment
(parsetree) for each sequence is sampled from the posterior
distribution of alignments as determined by the Inside
algorithm. Due to this sampling step --gibbs is non-
deterministic, so different runs with the same alignment may
yield different results. This is not true when --refine is used
without the --gibbs option, in which case the final alignment
and CM will always be the same. When --gibbs is enabled, the -s
<n> option can be used to seed the random number generator
predictably, making the results reproducible. The goal of the
--gibbs option is to help expert RNA alignment curators refine
structural alignments by allowing them to observe alternative
high scoring alignments.
-s <n> Set the random seed to <n>, where <n> is a positive integer.
This option can only be used in combination with --gibbs. The
default is to use time() to generate a different seed for each
run, which means that two different runs of cmbuild --refine <f>
--gibbs on the same alignment will give slightly different
results. You can use this option to generate reproducible
results.
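The difference between the deterministic choice made by plain --refine and the sampled choice made with --gibbs, and the effect of fixing the seed with -s, can be illustrated with Python's random module. The candidate alignments and their posterior probabilities are made-up toy values, the function names are hypothetical, and Python's generator merely stands in for cmbuild's own random number generator.

```python
import random

def sample_alignment(candidates, posteriors, rng):
    """Pick one candidate at random, in proportion to its posterior."""
    return rng.choices(candidates, weights=posteriors, k=1)[0]

def best_alignment(candidates, posteriors):
    """Deterministic choice: the candidate with the highest posterior."""
    best = max(range(len(candidates)), key=lambda i: posteriors[i])
    return candidates[best]

candidates = ["aln_A", "aln_B", "aln_C"]  # toy alternatives
posteriors = [0.6, 0.3, 0.1]              # toy posterior probabilities

def run(seed, nsamples=5):
    """One 'run': a fresh generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_alignment(candidates, posteriors, rng)
            for _ in range(nsamples)]

print(best_alignment(candidates, posteriors))  # always aln_A
print(run(42) == run(42))                      # same seed, same samples
```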
-l With --refine, turn on the local alignment algorithm, which
allows the alignment to span two or more subsequences if
necessary (e.g. if the structures of the query model and target
sequence are only partially shared), allowing certain large
insertions and deletions in the structure to be penalized
differently than normal indels. The default is to globally
align the query model to the target sequences.
-a With --refine, print the scores of each individual sequence
alignment.
--cyk With --refine, align with the CYK algorithm. By default the
optimal accuracy algorithm is used. There is more information on
this in the cmalign manual page.
--sub With --refine, turn on the sub model construction and alignment
procedure. For each sequence to be realigned an HMM is first
used to predict the model start and end consensus columns, and a
new sub CM is constructed that only models consensus columns
from start to end. The sequence is then aligned to this sub CM.
This option is useful for building CMs for alignments with
sequences that are known to be truncated, non-full length
sequences. This option is experimental and not rigorously
tested; use at your own risk. This "sub CM" procedure is not
the same as the "sub CMs" described by Weinberg and Ruzzo.
--nonbanded
With --refine, do not use HMM bands to accelerate alignment.
Use the full CYK algorithm which is guaranteed to give the
optimal alignment. This will slow down the run significantly,
especially for large models.
--tau <x>
With --refine, set the tail loss probability used during HMM
band calculation to <x>. This is the amount of probability mass
within the HMM posterior probabilities that is considered
negligible. The default value is 1E-7. In general, higher
values will result in greater acceleration, but increase the
chance of missing the optimal alignment due to the HMM bands.
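The role of the tail loss probability can be illustrated with a small Python sketch that derives a band from a vector of posterior probabilities. This is not the HMM banding code used by Infernal. The function name is hypothetical, and discarding up to tau of probability mass from each end separately is an assumption made for the illustration.

```python
def band_from_posteriors(probs, tau=1e-7):
    """Return (first, last) indices of the band for one model position.

    probs: posterior probabilities over sequence positions (sum near 1).
    Up to tau of probability mass is discarded from each end; applying
    tau to each tail separately is an assumption of this sketch.
    """
    first, mass = 0, 0.0
    while first < len(probs) - 1 and mass + probs[first] <= tau:
        mass += probs[first]
        first += 1
    last, mass = len(probs) - 1, 0.0
    while last > first and mass + probs[last] <= tau:
        mass += probs[last]
        last -= 1
    return first, last

probs = [1e-9, 1e-4, 0.4, 0.5, 0.0999, 1e-12]  # toy posteriors
print(band_from_posteriors(probs, tau=1e-7))   # wider band
print(band_from_posteriors(probs, tau=1e-3))   # tighter band
```

A larger tau trims more positions, which is why higher values give more acceleration at a higher risk of excluding the optimal alignment.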
--fins With --refine, change the behavior of how insert emissions are
placed in the alignment. By default, all contiguous blocks of
inserts are split in half, and half the residues are flushed
left against the nearest consensus column to the left, and half
are flushed right against the nearest consensus column on the
right. With --fins inserts are not split in half; instead, all
inserted residues from IL states are flushed left, and all
inserted residues from IR states are flushed right. This was the
default behavior of previous versions of Infernal.
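The default (split) placement of inserted residues can be sketched as follows. The function name, the use of "." as the gap character, and giving the extra residue of an odd-length insert to the left half are all assumptions; the --fins behavior depends on which state emitted each residue and is not modelled here.

```python
def place_inserts(residues, width, gap="."):
    """Lay out an insert block of `width` columns in the default style.

    Half of the residues are flushed left and half flushed right, with
    gaps in between. Giving the extra residue of an odd-length insert to
    the left half is an assumption of this sketch.
    """
    if len(residues) > width:
        raise ValueError("insert longer than the available columns")
    nleft = (len(residues) + 1) // 2
    left, right = residues[:nleft], residues[nleft:]
    return left + gap * (width - len(residues)) + right

print(place_inserts("acgu", 8))  # ac....gu
print(place_inserts("acg", 8))   # ac.....g
```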
--mxsize <x>
With --refine, set the maximum allowable matrix size for
alignment to <x> megabytes. By default this size is 2 Gb. This
should be large enough for the vast majority of alignments;
however, it is possible that when run with --refine, cmbuild will
exit prematurely, reporting an error message that the matrix
exceeded its maximum allowable size. In this case, the --mxsize
option can be used to raise the limit.
--rdump <f>
With --refine, output the intermediate alignments at each
iteration of the refinement procedure (as described above for
--refine) to file <f>.
SEE ALSO
For complete documentation, see the User’s Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on
redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/