NAME
cmsearch - search a sequence database for RNAs homologous to a CM
SYNOPSIS
cmsearch [options] cmfile seqfile
DESCRIPTION
cmsearch uses the covariance model (CM) in cmfile to search for
homologous RNAs in seqfile, and outputs high-scoring alignments.
Currently, the sequence file must be in FASTA format.
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the cmbuild program, from a given RNA sequence
alignment of known consensus structure. CM files can be calibrated
prior to running cmsearch with the cmcalibrate program. Searches with
calibrated CM files will include E-values and will use appropriate
filter thresholds for acceleration. It is strongly recommended to
calibrate your CM files before using cmsearch. CM calibration is
described in more detail below and in chapters 5 and 6 of the User’s
Guide.
cmsearch output consists of alignments of all hits in the database
sorted by decreasing score per sequence and per strand. That is, all
hits for the same sequence and the same (Watson or Crick) strand are
sorted, but hits across sequences or strands are not sorted.
The threshold for reporting scores is different depending on whether
the CM file has been calibrated or not. If the CM file has been
calibrated, the default reporting threshold is an E-value of 1.0. This
is the threshold at which 1 hit is expected by chance. It is possible
to manually set the threshold to bit score <x> using the -T <x> option
as described below, or to E-value <x> using the -E <x> option. The -E
option will only work if the CM file has been calibrated.
RNA homology search with CMs is slow. To speed it up, cmsearch by
default uses two rounds of filters with faster algorithms to prune the
database prior to searching with the slow CM algorithm. The first
round of filtering is faster but less strict than the second round.
First, the full database is searched with the first round filter, then
any hits that survive the first round are searched with the second
round filter. Finally any hits that survive the first and second round
of filtering are searched with the final round search strategy. During
the filter rounds, hits are padded with a short stretch of residues on
either side prior to searching with the subsequent round. The exact
number of residues is dependent on the size of the model being searched
with.
The first round of filtering is performed with an HMM. If the CM file
is calibrated, the threshold for the HMM filter will be automatically
chosen as an appropriate one as determined in cmcalibrate. The minimum
threshold that will automatically be chosen is the threshold that will
allow a predicted fraction (0.02 by default, changeable to <x> with
--fil-Smin-hmm <x> ) of the database to survive the filter. The
maximum threshold that will automatically be chosen is the threshold
that will allow a predicted fraction (0.5 by default, changeable to <x>
with --fil-Smax-hmm <x> ) of the database to survive the filter. If the
threshold from cmcalibrate is greater than this maximum fraction, the
HMM filter will be turned off and not used. To ensure that the HMM
filter is never turned off and always uses a threshold that gives this
maximum fraction you must use the --fil-A-hmm option. If the model is
not calibrated, the default HMM filter threshold is 3.0 bits. The HMM
filter threshold can be manually set to bit score <x> using the --fil-
T-hmm <x> option as described below, or to E-value <x> using the --fil-
E-hmm <x> option, or to the bit score that will allow a predicted
database fraction of <x> to survive the filter using the --fil-S-hmm
<x> option. The --fil-E-hmm and --fil-S-hmm options will only work if
the CM file has been calibrated. The HMM filter can be turned off with
the --fil-no-hmm option.
The second round of filtering is performed with the CM CYK algorithm
(not an HMM) using query-dependent banding (QDB) for acceleration.
Briefly, QDB precalculates regions of the dynamic programming matrix
that have negligible probability based on the query CM’s transition
probabilities. During search, these regions of the matrix are ignored
to make searches faster. For more information on QDB see (Nawrocki and
Eddy, PLoS Computational Biology 3(3): e56). The beta paramater is the
amount of probability mass considered negligible during band
calculation, lower values of beta yield greater speedups but also a
greater chance of missing the optimal alignment. The default beta is
1E-10: determined empirically as a good tradeoff between sensitivity
and speed, though this value can be changed with the --fil-beta <x>
option. If the CM file has been calibrated, the QDB filter threshold
will be automatic set to an appropriate value using an ad-hoc procedure
(see the User’s Guide). If the CM file has not been calibrated, the
default QDB filter threshold is 0.0 bits. The QDB filter threshold can
be manually set to bit score <x> using the --fil-T-qdb <x> option as
described below, or to E-value <x> using the --fil-E-qdb <x> option.
The --fil-E-qdb option will only work if the CM file has been
calibrated. The QDB filter can be turned off with the --fil-no-qdb
option.
Another way to accelerate cmsearch is to run it in parallel with MPI on
multiple computers. To do this, use the --mpi option and run cmsearch
inside a MPI wrapper program such as mpirun. For example: mpirun C
cmsearch --mpi [other options] cmfile seqfile. The cmsearch program
must have been compiled in MPI mode for this to work. See the
Installation section of the User’s Guide for more information.
The --forecast <n> option will estimate how long a search will take for
your cmfile and seqfile on <n> processors. Unless you plan on running
cmsearch in MPI mode, <n> should be set as 1.
Another technique for accelerated CM homology search with HMM filters
is the construction and use of a "rigorous filter" HMM which was
developed by Zasha Weinberg and Larry Ruzzo. All hits above a certain
CM bit score threshold are guaranteed to survive the HMM filtering
step. Their implementation of rigorous filters has been included in
previous versions of Infernal, but not in the current version. For more
information see the User’s Guide.
OUTPUT
By default, cmsearch outputs the alignments of search hits that score
above the final search round threshold. The format of this output is
described in the "Tutorial" section of the User’s Guide. This format
has purposefully not been changed from the 0.x versions of Infernal so
as not to break existing parsers. However, it can be augmented with a
line of output that marks non-compensatory (negative scoring) basepairs
with an ’x’ by using the -x option. Alternatively, only negative
scoring non-canonical basepairs (those other than A:U, U:A, C:G, G:C,
U:G, and G:U) are marked if the -v option is enabled. These two options
were added to facilitate quick analysis of the secondary structure of
hits by eye. Additionally, the -p option can be used to annotate the
posterior probability of each aligned residue in the hit alignments as
described below.
The --tabfile <f> outputs a tabular representation of the hits found by
cmsearch to the file <f>. Each non-such line has 9 fields: "model
name" the name of the CM used for the search, "target name" the name of
the target sequence the hit was found in, "target coord - start": the
start position of the hit in the target sequence, "target coord -
stop": the end position of hit in the target sequence, "query coord -
start": the start position of the hit in the query model, "query coord
- stop": the end position of hit in the query sequence, "bit sc": the
bit score of the hit, "E-value": the E-value of the hit (if available,
"-" if not), and "GC" the percentage of G and C residues in the hit
within the target sequence. cmsearch tab files can be used as input to
the Easel miniapp esl-sfetch (included in the easel/miniapp/
subdirectory of infernal) with the -C -f --tabfile options to extract
all the hits from the target database file to a new FASTA file. This
file can then be aligned to a CM with cmalign.
OPTIONS
-h Print brief help; includes version number and summary of all
options, including expert options.
-o <f> Save the high-scoring alignments of hits to a file <f>. The
default is to write them to standard output.
-g <f> Turn on the ’glocal’ alignment algorithm, local with respect to
the target database, and global with respect to the model. By
default, the local alignment algorithm is used which is local
with respect to both the target sequence and the model. In local
mode, the alignment to span two or more subsequences if
necessary (e.g. if the structures of the query model and target
sequence are only partially shared), allowing certain large
insertions and deletions in the structure to be penalized
differently than normal indels. Local mode performs better on
empirical benchmarks and is significantly more sensitive for
remote homology detection. Empirically, glocal searches return
many fewer hits than local searches, so glocal may be desired
for some applications.
-p Append posterior probabilities to alignments of hits. For more
information on posterior probabilities see the description of
the -p option in the manual page for cmalign.
-x Annotate negative scoring basepairs and basepairs that include a
gap in the left or right half of the pair (but not both) with
x’s in the alignments of hits. The x’s appear above the
structural annotation in the alignment output. Basepairs without
x’s above them are compensatory with respect to the model.
Compensatory mutations are good evidence for structural
homology.
-v Very similar to -x, but only mark negative scoring basepairs
that are non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U
or U:G), and mark them with a ’v’ instead of an ’x’ in the
output.
-Z <x> Calculate E-values as if the target database size was <x>
megabases (Mb). Ignore the actual size of the database. This
option is only valid if the CM file has been calibrated.
Warning: the predictions for timings and survival fractions will
be calculated as if the database was of size <x> Mb, which means
they will be inaccurate.
--toponly
Only search the top (Watson) strand of the sequences in seqfile.
By default, both strands are searched.
--bottomonly
Only search the bottom (Crick) strand of the sequences in
seqfile. By default, both strands are searched.
--forecast <n>
Predict the running time of the search with provided files and
options and exit, DO NOT perform the search. This option is only
available with calibrated CM files. The predictions should be
used as rough estimates and can be fairly inaccurate, especially
for highly biased target databases (for example 80% AT genomes).
The value for <n> is the number of processors the search will be
run on, so <n> equal to 1 is appropriate unless you will run
cmsearch in parallel with MPI.
--informat <s>
Assert that the input seqfile is in format <s>. Do not run
Babelfish format autodection. This increases the reliability of
the program somewhat, because the Babelfish can make mistakes;
particularly recommended for unattended, high-throughput runs of
Infernal. <s> is case-insensitive. Acceptable formats are:
FASTA, EMBL, UNIPROT, GENBANK, and DDBJ. <s> is case-
insensitive.
--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By
default this size is 2,048 Mb. This should be large enough for
the vast majority of alignments, however if it is not cmsearch
will exit prematurely and report an error message that the
matrix exceeded it’s maximum allowable size. In this case, the
--mxsize can be used to raise the limit.
--devhelp
Print help, as with -h , but also include undocumented developer
options. These options are not listed below, are under
development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only
resources for understanding what they actually do are the brief
one-line description printed when --devhelp is enabled, and the
source code.
--mpi Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User’s Guide for details).
EXPERT OPTIONS
--inside
Use the Inside algorithm for the final round of searching. This
is true by default.
--cyk Use the CYK algorithm for the final round of searching.
--forward
Search only with an HMM. This is much faster but less sensitive
than a CM search. Use the Forward algorithm for the HMM search.
--viterbi
Search only with an HMM. This is much faster but less sensitive
than a CM search. Use the Viterbi algorithm for the HMM search.
-E <x> Set the E-value cutoff for the per-sequence/strand ranked hit
list to <x>, where <x> is a positive real number. Hits with E-
values better than (less than) or equal to this threshold will
be shown. This option is only available if the CM file has been
calibrated. This threshold is relevant only to the final round
of searching performed after all filters have been used, not to
the filter rounds themselves.
-T <x> Set the bit score cutoff for the per-sequence ranked hit list to
<x>, where <x> is a positive real number. Hits with bit scores
better than (greater than) this threshold will be shown. This
threshold is relevant only to the final round of searching
performed after all filters have been used, not to the filter
rounds themselves.
--nc Set the bit score cutoff as the NC cutoff value used by Rfam
curators as the noise cutoff score. This is the highest scoring
hit found by this model during Rfam curation that the Rfam
curators defined as a noise (false positive) sequence. The NC
cutoff is defined as <x> bits in the original Stockholm
alignment the model was built from with a line: #=GF NC <x>
positioned before the sequence alignment. If such a line existed
in the alignment provided to cmbuild then the --nc option will
be available in cmsearch. If no such line existed when cmbuild
was run, then using the --nc option to cmsearch will cause the
program to print an error message and exit.
--ga Set the bit score cutoff as the GA cutoff value used by Rfam
curators as the gathering threshold. The GA cutoff is defined in
a stockholm file used to build the model in the same way as the
NC cutoff (see above), but with a line: #=GF GA <x>
--tc Set the bit score cutoff as the TC cutoff value used by Rfam
curators as the trusted cutoff. The TC cutoff is defined in the
stockholm file used to build the model in the same way as the NC
cutoff (see above), but with a line: #=GF TC <x>
--no-qdb
Do not use query-dependent banding (QDB) for the final round of
search. By default, QDB is used in the final round of search
with beta = 1E-15, after all filtering is finished.
--beta <x>
For query-dependent banding (QDB) during the final round of
search, set the beta parameter to <x> where <x> is any positive
real number less than 1.0. Beta is the probability mass
considered negligible during band calculation. The default beta
for the final round of search is 1E-15.
--hbanded
Use HMM bands to accelerate the final round of search.
Constraints for the CM search are derived from posterior
probabilities from an HMM. This is an experimental option and
it is not recommended for use unless you know exactly what
you’re doing.
--tau <x>
Set the tail loss probability during HMM band calculation to
<x>. This is the amount of probability mass within the HMM
posterior probabilities that is considered negligible. The
default value is 1E-7. In general, higher values will result in
greater acceleration, but increase the chance of missing the
optimal alignment due to the HMM bands. This option only makes
sense in combination with --hbanded
--fil-no-hmm
Turn the HMM filter off.
--fil-no-qdb
Turn the QDB filter off.
--fil-beta
For the QDB filter, set the beta parameter to <x> where <x> is
any positive real number less than 1.0. Beta is the probability
mass considered negligible during band calculation. The default
beta for the QDB filter round of search is 1E-10.
--fil-T-qdb <x>
Set the bit score cutoff for the QDB filter round to <x>, where
<x> is a positive real number. Hits with bit scores better than
(greater than) this threshold will survive the QDB filter and be
passed to the final round.
--fil-T-hmm <x>
Set the bit score cutoff for the HMM filter round to <x>, where
<x> is a positive real number. Hits with bit scores better than
(greater than) this threshold will survive the HMM filter and be
passed to the next round, either a QDB filter round, or if the
QDB filter is disabled, to the final round of search.
--fil-E-qdb <x>
Set the E-value cutoff for the QDB filter round. <x>, where <x>
is a positive real number. Hits with E-values better than (less
than) or equal to this threshold will survive and be passed to
the final round. This option is only available if the CM file
has been calibrated.
--fil-E-hmm <x>
Set the E-value cutoff for the HMM filter round. <x>, where <x>
is a positive real number. Hits with E-values better than (less
than) or equal to this threshold will survive and be passed to
the next round, either a QDB filter round, or if the QDB filter
is disable, to the final round of search. This option is only
available if the CM file has been calibrated.
--fil-S-hmm <x>
Set the bit score cutoff for the HMM filter round as the score
that will allow a predicted <x> fraction of the database to
survive the HMM filter round, where <x> is a positive real
number between 0 and 1.
--fil-Smax-hmm <x>
When using automatically calibrated HMM thresholds for a CM file
calibrated with cmcalibrate, set the maximum HMM filter
threshold as the score that will allow a predicted <x> fraction
of the database to survive the filter. If the automatic
threshold from cmcalibrate exceeds this value, turn the HMM
filter off and do not use it for the search. By default, this
option is ON with the default value of 0.5 used for <x>. To
modify the behavior of this option so it does not turn off the
HMM filter if exceeded use the --fil-A-hmm option described
below.
--fil-Smin-hmm <x>
When using automatically calibrated HMM thresholds for a CM file
calibrated with cmcalibrate, set the minimum HMM filter
threshold as the score that will allow a predicted <x> fraction
of the database to survive the filter. By default, this option
is ON with the default value of 0.02 used for <x>. Setting <x>
lower will only accelerate the majority of searches by a small
amount.
--fil-A-hmm
Always enforce the maximum HMM filter threshold of <x> from
--fil-Smax-hmm <x>. That is, never turn off the HMM filter, or
set its threshold above the score that will allow a predicted
<x> fraction of the database to survive. This option is OFF by
default.
--hmm-W <n>
Set the HMM window size W (maximum size of a hit) to <n>. This
option only works in combination with --forward or --viterbi.
By default, W is calculated automatically, but this automatic
calculation is time consuming for large models.
--hmm-cW <x>
Set the HMM window size W (maximum size of a hit) as <x> times
the consensus length of the CM. The consensus length (clen) of
the CM can be determined using the cmstat program. This option
only works in combination with --forward or --viterbi. By
default, W is calculated automatically, but this automatic
calculation is time consuming for large models. To find
potential full length hits to the model <x> should be greater
than 1.0, but values above 2.0 are probably wasteful.
--noalign
Do not calculate and print alignments of each hit, only print
locations and scores.
--aln-hbanded
Use HMM bands to accelerate alignment during the hit alignment
stage.
--aln-optacc
Calculate alignments of hits from final round of search using
the optimal accuracy algorithm which computes the alignment that
maximizes the summed posterior probability of all aligned
residues given the model, which can be different from the
highest scoring one.
--tabfile <f>
Create a new output file <f> and print tabular results to it.
The format of the tabular results is listed in the OUTPUT
section. The tabular results can be more easily parsed by
scripts than the default cmsearch output. The esl-sfetch miniapp
included in the easel/miniapps/ subdirectory of infernal has a
--tabfile option that allows it to read cmsearch tab files and
fetch the hits reported within them from the target database
into a new sequence file.
--gcfile <f>
Create a new output file <f> and print statistics of the GC
content of the sequences in seqfile to it. The sequences are
partitioned into 100 nt non-overlapping windows, and the GC
percentage of each window is calculated. A normalized histogram
of those GC percentages is then printed to <f> This file can be
generated even if cmsearch is run with --forecast and no search
is performed.
--rna Output the hit alignments as RNA sequences alignments. This is
true by default.
--dna Output the hit alignments as DNA sequence alignments.
SEE ALSO
For complete documentation, see the User’s Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on
redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/