NAME
cmalign - use a CM to make a structured RNA multiple alignment
SYNOPSIS
Align sequences to a CM:
cmalign [options] cmfile seqfile
Merge two alignments:
cmalign --merge [options] cmfile msafile1 msafile2
DESCRIPTION
cmalign aligns the RNA sequences in seqfile to the covariance model
(CM) in cmfile, and outputs a multiple sequence alignment.
Alternatively, with the --merge option, cmalign merges the two
alignments msafile1 and msafile2 created by previous runs of cmalign
with cmfile into a single alignment.
The sequence file seqfile must be in FASTA, EMBL, or Genbank format.
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the cmbuild program, from a given RNA sequence
alignment of known consensus structure.
The alignment that cmalign makes is written in Stockholm format. It
can be redirected to a file using the -o option.
cmalign uses an HMM banding technique to accelerate alignment by
default as described below for the --hbanded option. HMM banding can be
turned off with the --nonbanded option.
By default, cmalign computes the alignment with maximum expected
accuracy that is consistent with constraints (bands) derived from an
HMM, using a banded version of the Durbin/Holmes optimal accuracy
algorithm. This behavior can be changed, as described below and in the
User’s Guide, with the --cyk, --sample, or --viterbi options.
It is possible to include the fixed training alignment used to build
the CM within the output alignment of cmalign. This is done using the
--withali option, as described below and in the User’s Guide.
OUTPUT
cmalign first outputs tabular information on the scores of each
sequence being aligned, then the alignment itself is printed. The
alignment can be redirected to an output file <f> with the -o <f>
option. The tabular output section includes one line per sequences and
seven fields per line: "seq idx": the index of the sequence in the
input file, "seq name": the sequence name, "len": the length of the
sequence, "total": the total bit score of the sequence, "struct": an
approximation of the contribution of the secondary structure to the bit
score, "avg prob": the average posterior probability (confidence
estimate) of each aligned residue, and "elapsed": the wall time spent
aligning the sequence.
The fields can change if different options are selected. For example if
the --cyk option is enabled, the "avg prob" field disappears because
posterior probabilities are not calculated by the CYK algorithm.
OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.
-o <f> Save the alignment in Stockholm format to a file <f>. The
default is to write it to standard output.
-l Turn on the local alignment algorithm, which allows the
alignment to span two or more subsequences if necessary (e.g. if
the structures of the query model and target sequence are only
partially shared), allowing certain large insertions and
deletions in the structure to be penalized differently than
normal indels. The default is to globally align the query model
to the target sequences.
-p Annotate the alignment with posterior probabilities calculated
using the Inside and Outside algorithms. The -p option causes
additional annotation to appear in the output alignment, but
does not modify the alignment itself (that is, the relative
positions of the residues are unchanged). Two characters for
each residue are used to annotate the posterior probability that
the corresponding residue aligns at the corresponding position
in the Stockholm alignment. These characters have the Stockholm
markup tags "#=GR <seq name> POSTX." and "#=GR <seq name>
POST.X", and can only have the values: "0-9", "*" or ".". They
indicate the tens and ones place for the posterior probability:
an "8" for "POSTX." and a "3" for "POST.X" indicates that the
posterior probability is between 0.83 and 0.84. A "*" for both
"POSTX." and "POST.X" indicates that the confidence estimate is
"very nearly" 1.0 (it’s hard to be exact here due to numerical
precision issues) A "." in both "POSTX." and "POST.X" indicates
that that column aligns to a gap. When used in combination with
--nonbanded, the calculation of the posterior probabilities
considers all possible alignments of the target sequence to the
CM. Without --nonbanded (in HMM banded mode), the calculation
considers only possible alignments within the HMM bands.
-q Quiet; suppress the verbose banner, and only print the resulting
alignment to stdout. This allows piping the alignment to the
input of other programs, for example.
-1 Output the alignment in pfam format, a non-interleaved Stockholm
format in which each sequence is on a single line.
--informat <s>
Assert that the input seqfile is in format <s>. Do not run
Babelfish format autodection. This increases the reliability of
the program somewhat, because the Babelfish can make mistakes;
particularly recommended for unattended, high-throughput runs of
Infernal. Acceptable formats are: FASTA, EMBL, UNIPROT,
GENBANK, and DDBJ. <s> is case-insensitive.
--devhelp
Print help, as with -h , but also include undocumented developer
options. These options are not listed below, are under
development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only
resources for understanding what they actually do are the brief
one-line description printed when --devhelp is enabled, and the
source code.
--mpi Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User’s Guide for details).
EXPERT OPTIONS
--optacc
Align sequences using the Durbin/Holmes optimal accuracy
algorithm. This is default behavior, so this option is probably
useless. The optimal accuracy alignment will be constrained by
HMM bands for acceleration unless the --nonbanded option is
enabled. The optimal accuracy algorithm determines the
alignment that maximizes the posterior probabilities of the
aligned residues within it. The posterior probabilites are
determined using (possibly HMM banded) variants of the Inside
and Outside algorithms.
--cyk Do not use the Durbin/Holmes optimal accuracy alignment to align
the sequences, instead use the CYK algorithm which determines
the optimally scoring alignment of the sequence to the model.
--sample
Sample an alignment from the posterior distribution of
alignments. The posterior distribution is determined using an
HMM banded (unless --nonbanded) variant of the Inside algorithm.
-s <n> Set the random number generator seed to <n>, where <n> is a
positive integer. This option can only be used in combination
with --sample. The default is to use time() to generate a
different seed for each run, which means that two different runs
of cmalign --sample on the same alignment will give slightly
different results. You can use this option to generate
reproducible results.
--viterbi
Do not use the CM to align the sequences, instead use the HMM
Viterbi algorithm to align with a CM Plan 9 HMM. The HMM is
automatically constructed to be maximally similar to the CM.
This HMM alignment is faster than CM alignment, but can be less
accurate because the structure of the RNA family is ignored.
--sub Turn on the sub model construction and alignment procedure. For
each sequence, an HMM is first used to predict the model start
and end consensus columns, and a new sub CM is constructed that
only models consensus columns from start to end. The sequence is
then aligned to this sub CM. This option is useful for aligning
sequences that are known to truncated, non-full length
sequences. This "sub CM" procedure is not the same as the "sub
CMs" described by Weinberg and Ruzzo.
--small
Use the divide and conquer CYK alignment algorithm described in
SR Eddy, BMC Bioinformatics 3:18, 2002. The --nonbanded option
must be used in combination with this options. Also, it is
recommended whenever --nonbanded is used that --small is also
used because standard CM alignment without HMM banding requires
a lot of memory, especially for large RNAs. --small allows CM
alignment within practical memory limits, reducing the memory
required for alignment LSU rRNA, the largest known RNAs, from
150 Gb to less than 300 Mb. This option can only be used in
combination with --nonbanded and --cyk.
--hbanded
This option is turned on by default. Accelerate alignment by
pruning away regions of the CM DP matrix that are deemed
negligible by an HMM. First, each sequence is scored with a CM
plan 9 HMM derived from the CM using the Forward and Backward
HMM algorithms and calculate posterior probabilities that each
residue aligns to each state of the HMM. These posterior
probabilities are used to derive constraints (bands) on the CM
DP matrix. Finally, the target sequence is aligned to the CM
using the banded DP matrix, during which cells outside the bands
are ignored. Usually most of the full DP matrix lies outside the
bands (often more than 95%), making this technique faster
because fewer DP calculations are required, and more memory
efficient because only cells within the bands need be allocated.
Importantly, HMM banding sacrifices the guarantee of determining
the optimally accurarte or optimal alignment, which will be
missed if it lies outside the bands. The tau paramater
(analagous to the beta parameter for QDB calculation in cmsearch
) is the amount of probability mass considered negligible during
HMM band calculation; lower values of tau yield greater speedups
but also a greater chance of missing the optimal alignment. The
default tau is 1E-7, determined empirically as a good tradeoff
between sensitivity and speed, though this value can be changed
with the --tau <x> option. The level of acceleration increases
with both the length and primary sequence conservation level of
the family. For example, with the default tau of 1E-7, tRNA
models (low primary sequence conservation with length of about
75 residues) show about 10X acceleration, and SSU bacterial rRNA
models (high primary sequence conservation with length of about
1500 residues) show about 700X. HMM banding can be turned off
with the --nonbanded option.
--nonbanded
Turns off HMM banding. The returned alignment is guaranteed to
be the globally optimally accurate one (by default) or the
globally optimally scoring one (if --cyk is enabled). The
--small option is recommended in combination with this option,
because standard alignment without HMM banding requires a lot of
memory (see --small ).
--tau <x>
Set the tail loss probability used during HMM band calculation
to <x>. This is the amount of probability mass within the HMM
posterior probabilities that is considered negligible. The
default value is 1E-7. In general, higher values will result in
greater acceleration, but increase the chance of missing the
optimal alignment due to the HMM bands.
--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By
default this size is 2,048 Mb. This should be large enough for
the vast majority of alignments, however if it is not cmalign
will exit prematurely and report an error message that the
matrix exceeded it’s maximum allowable size. In this case, the
--mxsize can be used to raise the limit. This is most likely to
occur when the --nonbanded option is used without the --small
option, but can still occur when --nonbanded is not used.
--rna Output the alignments as RNA sequence alignments. This is true
by default.
--dna Output the alignments as DNA sequence alignments.
--matchonly
Only include match columns in the output alignment, do not
include any insertions relative to the consensus model.
--resonly
Only include match columns in the output alignment that have at
least 1 residue (non-gap character) in them. By default all
match columns are printed to the alignment, even those that are
100% gaps. --resonly replicates the default behavior of
previous versions of cmalign.
--fins Change the behavior of how insert emissions are placed in the
alignment. By default, all contiguous blocks of inserts are
split in half, and half the residues are flushed left against
the nearest consensus column to the left, and half are flushed
right against the nearest consensus column on the right. With
--fins inserts are not split in half, instead all inserted
residues from IL states are flushed left, and all inserted
residues from IR states are flushed right. --fins replicates
the default behavior of previous versions of cmalign.
--onepost
Modifies behavior of the -p option. Use only one character
instead of two to annotate the posterior probability of each
aligned residue. Specifically, only the "#=GR <seq name> POSTX."
tag is printed to the alignment. An "8" for "POSTX." indicates a
posterior probability between 0.8 and 0.9 for the corresponding
residue.
--merge
With --merge the usage of cmalign changes to cmalign --merge
[options] cmfile msafile1 msafile2. Merge the two alignments in
msafile1 and msafile2 created by previous runs of cmalign with
cmfile together into a single alignment and exit. msafile1 and
msafile2 must only have one alignment per file. This option
allows the user to split up large sequence files into many
smaller files, align them independently to cmfile on different
computers to get many small alignments, and then merge them into
a single large alignment.
--withali <f>
Reads an alignment from file <f> and aligns it as a single
object to the CM; e.g. the alignment in <f> is held fixed. This
allows you to align sequences to a model with cmalign and view
them in the context of an existing trusted multiple alignment.
The alignment in the file <f> must be exactly the alignment that
the CM was built from, or a subset of it with the following
special property: the definition of consensus columns and
consensus secondary structure must be identical between <f> and
the alignment the CM was built from. One easy way to achieve
this is to use the --rf option to cmbuild (see man page for
cmbuild ) and to maintain the "#=GC RF" annotation in the
alignment when removing sequences to create the subset alignment
<f>. To specify that the --rf option to cmbuild was used,
enable the --rf option to cmalign (see --rf below).
--withpknots
Must be used in combination with --withali <f>. Propogate
structural information for any pseudoknots that exist in <f> to
the output alignment.
--rf Must be used in combination with --withali <f>. Specify that
the alignment in <f> has the same "#=GC RF" annotation as the
alignment file the CM was built from using cmbuild and further
that the --rf option was supplied to cmbuild when the CM was
constructed.
--gapthresh <x>
Must be used in combination with --withali <f>. Specify that
the --gapthresh <x> option was supplied to cmbuild when the CM
was constructed from the alignment file <f>.
--tfile <f>
Dump tabular sequence tracebacks for each individual sequence to
a file <f>. Primarily useful for debugging.
SEE ALSO
For complete documentation, see the User’s Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on
redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/