cmsearch - search a sequence database for RNAs homologous to a CM

NAME

       cmsearch - search a sequence database for RNAs homologous to a CM

SYNOPSIS

       cmsearch [options] cmfile seqfile

DESCRIPTION

       cmsearch  uses  the  covariance  model  (CM)  in  cmfile  to search for
       homologous RNAs in seqfile, and outputs high-scoring alignments.

       Currently, the sequence file must be in FASTA format.

       CMs are profiles of RNA consensus sequence and secondary  structure.  A
       CM  file  is produced by the cmbuild program, from a given RNA sequence
       alignment of known consensus structure.  CM  files  can  be  calibrated
       prior  to  running cmsearch with the cmcalibrate program. Searches with
       calibrated CM files will include  E-values  and  will  use  appropriate
       filter  thresholds  for  acceleration.  It  is  strongly recommended to
       calibrate your CM files  before  using  cmsearch.   CM  calibration  is
       described  in  more  detail below and in chapters 5 and 6 of the User’s
       Guide.

       cmsearch output consists of alignments of  all  hits  in  the  database
       sorted  by  decreasing  score per sequence and per strand. That is, all
       hits for the same sequence and the same (Watson or  Crick)  strand  are
       sorted, but hits across sequences or strands are not sorted.

       The  threshold  for  reporting scores is different depending on whether
       the CM file has been calibrated or  not.   If  the  CM  file  has  been
       calibrated,  the default reporting threshold is an E-value of 1.0. This
       is the threshold at which 1 hit is expected by chance. It  is  possible
       to  manually set the threshold to bit score <x> using the -T <x> option
       as described below, or to E-value <x> using the -E <x> option.  The  -E
       option will only work if the CM file has been calibrated.

       RNA  homology  search  with  CMs  is slow.  To speed it up, cmsearch by
       default uses two rounds of filters with faster algorithms to prune  the
       database  prior  to  searching  with  the slow CM algorithm.  The first
       round of filtering is faster but less strict  than  the  second  round.
       First,  the full database is searched with the first round filter, then
       any hits that survive the first round  are  searched  with  the  second
       round  filter. Finally any hits that survive the first and second round
       of filtering are searched with the final round search strategy.  During
       the  filter rounds, hits are padded with a short stretch of residues on
       either side prior to searching with the subsequent  round.   The  exact
       number of residues is dependent on the size of the model being searched
       with.

       The first round of filtering is performed with an HMM. If the  CM  file
       is  calibrated,  the threshold for the HMM filter will be automatically
       chosen as an appropriate one as determined in cmcalibrate.  The minimum
       threshold  that will automatically be chosen is the threshold that will
       allow a predicted fraction (0.02 by default,  changeable  to  <x>  with
       --fil-Smin-hmm  <x>  )  of  the  database  to  survive the filter.  The
       maximum threshold that will automatically be chosen  is  the  threshold
       that will allow a predicted fraction (0.5 by default, changeable to <x>
       with --fil-Smax-hmm <x> ) of the database to survive the filter. If the
       threshold  from  cmcalibrate is greater than this maximum fraction, the
       HMM filter will be turned off and not used.  To  ensure  that  the  HMM
       filter  is never turned off and always uses a threshold that gives this
       maximum fraction you must use the --fil-A-hmm option.  If the model  is
       not  calibrated,  the default HMM filter threshold is 3.0 bits. The HMM
       filter threshold can be manually set to bit score <x> using the  --fil-
       T-hmm <x> option as described below, or to E-value <x> using the --fil-
       E-hmm <x> option, or to the bit  score  that  will  allow  a  predicted
       database  fraction  of  <x> to survive the filter using the --fil-S-hmm
       <x> option. The --fil-E-hmm and --fil-S-hmm options will only  work  if
       the CM file has been calibrated.  The HMM filter can be turned off with
       the --fil-no-hmm option.

       The second round of filtering is performed with the  CM  CYK  algorithm
       (not  an  HMM)  using  query-dependent  banding (QDB) for acceleration.
       Briefly, QDB precalculates regions of the  dynamic  programming  matrix
       that  have  negligible  probability  based on the query CM’s transition
       probabilities.  During search, these regions of the matrix are  ignored
       to make searches faster.  For more information on QDB see (Nawrocki and
       Eddy, PLoS Computational Biology 3(3): e56).  The beta paramater is the
       amount   of   probability   mass   considered  negligible  during  band
       calculation, lower values of beta yield greater  speedups  but  also  a
       greater  chance  of  missing the optimal alignment. The default beta is
       1E-10: determined empirically as a good  tradeoff  between  sensitivity
       and  speed,  though  this  value can be changed with the --fil-beta <x>
       option.  If the CM file has been calibrated, the QDB  filter  threshold
       will be automatic set to an appropriate value using an ad-hoc procedure
       (see the User’s Guide). If the CM file has  not  been  calibrated,  the
       default QDB filter threshold is 0.0 bits.  The QDB filter threshold can
       be manually set to bit score <x> using the --fil-T-qdb  <x>  option  as
       described  below,  or  to E-value <x> using the --fil-E-qdb <x> option.
       The --fil-E-qdb  option  will  only  work  if  the  CM  file  has  been
       calibrated.   The  QDB  filter  can be turned off with the --fil-no-qdb
       option.

       Another way to accelerate cmsearch is to run it in parallel with MPI on
       multiple  computers.  To do this, use the --mpi option and run cmsearch
       inside a MPI wrapper program such as mpirun.   For  example:  mpirun  C
       cmsearch  --mpi  [other  options] cmfile seqfile.  The cmsearch program
       must have been compiled  in  MPI  mode  for  this  to  work.   See  the
       Installation section of the User’s Guide for more information.

       The --forecast <n> option will estimate how long a search will take for
       your cmfile and seqfile on <n> processors. Unless you plan  on  running
       cmsearch in MPI mode, <n> should be set as 1.

       Another  technique  for accelerated CM homology search with HMM filters
       is the construction and use  of  a  "rigorous  filter"  HMM  which  was
       developed  by  Zasha Weinberg and Larry Ruzzo. All hits above a certain
       CM bit score threshold are guaranteed  to  survive  the  HMM  filtering
       step.  Their  implementation  of  rigorous filters has been included in
       previous versions of Infernal, but not in the current version. For more
       information see the User’s Guide.

OUTPUT

       By  default,  cmsearch outputs the alignments of search hits that score
       above the final search round threshold. The format of  this  output  is
       described  in  the  "Tutorial" section of the User’s Guide. This format
       has purposefully not been changed from the 0.x versions of Infernal  so
       as  not  to break existing parsers. However, it can be augmented with a
       line of output that marks non-compensatory (negative scoring) basepairs
       with  an  ’x’  by  using  the  -x  option. Alternatively, only negative
       scoring non-canonical basepairs (those other than A:U, U:A,  C:G,  G:C,
       U:G, and G:U) are marked if the -v option is enabled. These two options
       were added to facilitate quick analysis of the secondary  structure  of
       hits  by  eye.  Additionally, the -p option can be used to annotate the
       posterior probability of each aligned residue in the hit alignments  as
       described below.

       The --tabfile <f> outputs a tabular representation of the hits found by
       cmsearch to the file <f>.  Each non-such  line  has  9  fields:  "model
       name" the name of the CM used for the search, "target name" the name of
       the target sequence the hit was found in, "target coord -  start":  the
       start  position  of  the  hit  in  the target sequence, "target coord -
       stop": the end position of hit in the target sequence, "query  coord  -
       start":  the start position of the hit in the query model, "query coord
       - stop": the end position of hit in the query sequence, "bit sc":   the
       bit  score of the hit, "E-value": the E-value of the hit (if available,
       "-" if not), and "GC" the percentage of G and C  residues  in  the  hit
       within the target sequence.  cmsearch tab files can be used as input to
       the  Easel  miniapp  esl-sfetch   (included   in   the   easel/miniapp/
       subdirectory  of  infernal) with the -C -f --tabfile options to extract
       all the hits from the target database file to a new  FASTA  file.  This
       file can then be aligned to a CM with cmalign.

OPTIONS

       -h     Print  brief  help;  includes  version number and summary of all
              options, including expert options.

       -o <f> Save the high-scoring alignments of hits to  a  file  <f>.   The
              default is to write them to standard output.

       -g <f> Turn  on the ’glocal’ alignment algorithm, local with respect to
              the target database, and global with respect to  the  model.  By
              default,  the  local  alignment algorithm is used which is local
              with respect to both the target sequence and the model. In local
              mode,  the  alignment  to  span  two  or  more  subsequences  if
              necessary (e.g. if the structures of the query model and  target
              sequence  are  only  partially  shared),  allowing certain large
              insertions and  deletions  in  the  structure  to  be  penalized
              differently  than  normal indels.  Local mode performs better on
              empirical benchmarks and is  significantly  more  sensitive  for
              remote  homology detection.  Empirically, glocal searches return
              many fewer hits than local searches, so glocal  may  be  desired
              for some applications.

       -p     Append  posterior  probabilities to alignments of hits. For more
              information on posterior probabilities see  the  description  of
              the -p option in the manual page for cmalign.

       -x     Annotate negative scoring basepairs and basepairs that include a
              gap in the left or right half of the pair (but  not  both)  with
              x’s  in  the  alignments  of  hits.  The  x’s  appear  above the
              structural annotation in the alignment output. Basepairs without
              x’s  above  them  are  compensatory  with  respect to the model.
              Compensatory  mutations  are  good   evidence   for   structural
              homology.

       -v     Very  similar  to  -x,  but only mark negative scoring basepairs
              that are non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U
              or  U:G),  and  mark  them  with  a ’v’ instead of an ’x’ in the
              output.

       -Z <x> Calculate E-values as  if  the  target  database  size  was  <x>
              megabases  (Mb).  Ignore  the  actual size of the database. This
              option is only  valid  if  the  CM  file  has  been  calibrated.
              Warning: the predictions for timings and survival fractions will
              be calculated as if the database was of size <x> Mb, which means
              they will be inaccurate.

       --toponly
              Only search the top (Watson) strand of the sequences in seqfile.
              By default, both strands are searched.

       --bottomonly
              Only search the  bottom  (Crick)  strand  of  the  sequences  in
              seqfile.  By default, both strands are searched.

       --forecast <n>
              Predict  the  running time of the search with provided files and
              options and exit, DO NOT perform the search. This option is only
              available  with  calibrated  CM files. The predictions should be
              used as rough estimates and can be fairly inaccurate, especially
              for highly biased target databases (for example 80% AT genomes).
              The value for <n> is the number of processors the search will be
              run  on,  so  <n>  equal to 1 is appropriate unless you will run
              cmsearch in parallel with MPI.

       --informat <s>
              Assert that the input seqfile is in  format  <s>.   Do  not  run
              Babelfish  format autodection. This increases the reliability of
              the program somewhat, because the Babelfish can  make  mistakes;
              particularly recommended for unattended, high-throughput runs of
              Infernal.  <s> is  case-insensitive.   Acceptable  formats  are:
              FASTA,   EMBL,   UNIPROT,  GENBANK,  and  DDBJ.   <s>  is  case-
              insensitive.

       --mxsize <x>
              Set the maximum allowable DP matrix size to  <x>  megabytes.  By
              default  this size is 2,048 Mb.  This should be large enough for
              the vast majority of alignments, however if it is  not  cmsearch
              will  exit  prematurely  and  report  an  error message that the
              matrix exceeded it’s maximum allowable size. In this  case,  the
              --mxsize can be used to raise the limit.

       --devhelp
              Print help, as with -h , but also include undocumented developer
              options.  These  options  are  not  listed  below,   are   under
              development or experimental, and are not guaranteed to even work
              correctly. Use developer options at  your  own  risk.  The  only
              resources  for understanding what they actually do are the brief
              one-line description printed when --devhelp is enabled, and  the
              source code.

       --mpi  Run  as  an  MPI  parallel  program.  This  option  will only be
              available if Infernal has been configured  and  built  with  the
              "--enable-mpi" flag (see User’s Guide for details).

EXPERT OPTIONS

       --inside
              Use  the Inside algorithm for the final round of searching. This
              is true by default.

       --cyk  Use the CYK algorithm for the final round of searching.

       --forward
              Search only with an HMM. This is much faster but less  sensitive
              than  a CM search. Use the Forward algorithm for the HMM search.

       --viterbi
              Search only with an HMM. This is much faster but less  sensitive
              than  a CM search. Use the Viterbi algorithm for the HMM search.

       -E <x> Set the E-value cutoff for the  per-sequence/strand  ranked  hit
              list  to  <x>, where <x> is a positive real number. Hits with E-
              values better than (less than) or equal to this  threshold  will
              be  shown. This option is only available if the CM file has been
              calibrated. This threshold is relevant only to the  final  round
              of  searching performed after all filters have been used, not to
              the filter rounds themselves.

       -T <x> Set the bit score cutoff for the per-sequence ranked hit list to
              <x>,  where <x> is a positive real number.  Hits with bit scores
              better than (greater than) this threshold will  be  shown.  This
              threshold  is  relevant  only  to  the  final round of searching
              performed after all filters have been used, not  to  the  filter
              rounds themselves.

       --nc   Set  the  bit  score  cutoff as the NC cutoff value used by Rfam
              curators as the noise cutoff score. This is the highest  scoring
              hit  found  by  this  model  during  Rfam curation that the Rfam
              curators defined as a noise (false positive) sequence.   The  NC
              cutoff  is  defined  as  <x>  bits  in  the  original  Stockholm
              alignment the model was built from with  a  line:  #=GF  NC  <x>
              positioned before the sequence alignment. If such a line existed
              in the alignment provided to cmbuild then the --nc  option  will
              be  available in cmsearch.  If no such line existed when cmbuild
              was run, then using the --nc option to cmsearch will  cause  the
              program to print an error message and exit.

       --ga   Set  the  bit  score  cutoff as the GA cutoff value used by Rfam
              curators as the gathering threshold. The GA cutoff is defined in
              a  stockholm file used to build the model in the same way as the
              NC cutoff (see above), but with a line: #=GF GA <x>

       --tc   Set the bit score cutoff as the TC cutoff  value  used  by  Rfam
              curators  as the trusted cutoff. The TC cutoff is defined in the
              stockholm file used to build the model in the same way as the NC
              cutoff (see above), but with a line: #=GF TC <x>

       --no-qdb
              Do  not use query-dependent banding (QDB) for the final round of
              search. By default, QDB is used in the  final  round  of  search
              with beta = 1E-15, after all filtering is finished.

       --beta <x>
              For  query-dependent  banding  (QDB)  during  the final round of
              search, set the beta parameter to <x> where <x> is any  positive
              real  number  less  than  1.0.  Beta  is  the  probability  mass
              considered negligible during band calculation. The default  beta
              for the final round of search is 1E-15.

       --hbanded
              Use   HMM  bands  to  accelerate  the  final  round  of  search.
              Constraints  for  the  CM  search  are  derived  from  posterior
              probabilities  from  an HMM.  This is an experimental option and
              it is not recommended for  use  unless  you  know  exactly  what
              you’re doing.

       --tau <x>
              Set  the  tail  loss  probability during HMM band calculation to
              <x>.  This is the amount of  probability  mass  within  the  HMM
              posterior  probabilities  that  is  considered  negligible.  The
              default value is 1E-7.  In general, higher values will result in
              greater  acceleration,  but  increase  the chance of missing the
              optimal alignment due to the HMM bands. This option  only  makes
              sense in combination with --hbanded

       --fil-no-hmm
              Turn the HMM filter off.

       --fil-no-qdb
              Turn the QDB filter off.

       --fil-beta
              For  the  QDB filter, set the beta parameter to <x> where <x> is
              any positive real number less than 1.0. Beta is the  probability
              mass  considered negligible during band calculation. The default
              beta for the QDB filter round of search is 1E-10.

       --fil-T-qdb <x>
              Set the bit score cutoff for the QDB filter round to <x>,  where
              <x> is a positive real number.  Hits with bit scores better than
              (greater than) this threshold will survive the QDB filter and be
              passed to the final round.

       --fil-T-hmm <x>
              Set  the bit score cutoff for the HMM filter round to <x>, where
              <x> is a positive real number.  Hits with bit scores better than
              (greater than) this threshold will survive the HMM filter and be
              passed to the next round, either a QDB filter round, or  if  the
              QDB filter is disabled, to the final round of search.

       --fil-E-qdb <x>
              Set the E-value cutoff for the QDB filter round.  <x>, where <x>
              is a positive real number. Hits with E-values better than  (less
              than)  or  equal to this threshold will survive and be passed to
              the final round. This option is only available if  the  CM  file
              has been calibrated.

       --fil-E-hmm <x>
              Set the E-value cutoff for the HMM filter round.  <x>, where <x>
              is a positive real number. Hits with E-values better than  (less
              than)  or  equal to this threshold will survive and be passed to
              the next round, either a QDB filter round, or if the QDB  filter
              is  disable,  to  the final round of search. This option is only
              available if the CM file has been calibrated.

       --fil-S-hmm <x>
              Set the bit score cutoff for the HMM filter round as  the  score
              that  will  allow  a  predicted  <x> fraction of the database to
              survive the HMM filter round,  where  <x>  is  a  positive  real
              number between 0 and 1.

       --fil-Smax-hmm <x>
              When using automatically calibrated HMM thresholds for a CM file
              calibrated  with  cmcalibrate,  set  the  maximum   HMM   filter
              threshold  as the score that will allow a predicted <x> fraction
              of  the  database  to  survive  the  filter.  If  the  automatic
              threshold  from  cmcalibrate  exceeds  this  value, turn the HMM
              filter off and do not use it for the search.  By  default,  this
              option  is  ON  with  the default value of 0.5 used for <x>.  To
              modify the behavior of this option so it does not turn  off  the
              HMM  filter  if  exceeded  use  the --fil-A-hmm option described
              below.

       --fil-Smin-hmm <x>
              When using automatically calibrated HMM thresholds for a CM file
              calibrated   with   cmcalibrate,  set  the  minimum  HMM  filter
              threshold as the score that will allow a predicted <x>  fraction
              of  the  database to survive the filter. By default, this option
              is ON with the default value of 0.02 used for <x>.  Setting  <x>
              lower  will  only accelerate the majority of searches by a small
              amount.

       --fil-A-hmm
              Always enforce the maximum HMM  filter  threshold  of  <x>  from
              --fil-Smax-hmm  <x>.  That is, never turn off the HMM filter, or
              set its threshold above the score that will  allow  a  predicted
              <x>  fraction  of the database to survive. This option is OFF by
              default.

       --hmm-W <n>
              Set the HMM window size W (maximum size of a hit) to <n>.   This
              option  only  works  in combination with --forward or --viterbi.
              By default, W is calculated automatically,  but  this  automatic
              calculation is time consuming for large models.

       --hmm-cW <x>
              Set  the  HMM window size W (maximum size of a hit) as <x> times
              the consensus length of the CM. The consensus length  (clen)  of
              the  CM can be determined using the cmstat program.  This option
              only works in  combination  with  --forward  or  --viterbi.   By
              default,  W  is  calculated  automatically,  but  this automatic
              calculation  is  time  consuming  for  large  models.  To   find
              potential  full  length  hits to the model <x> should be greater
              than 1.0, but values above 2.0 are probably wasteful.

       --noalign
              Do not calculate and print alignments of each  hit,  only  print
              locations and scores.

       --aln-hbanded
              Use  HMM  bands to accelerate alignment during the hit alignment
              stage.

       --aln-optacc
              Calculate alignments of hits from final round  of  search  using
              the optimal accuracy algorithm which computes the alignment that
              maximizes  the  summed  posterior  probability  of  all  aligned
              residues  given  the  model,  which  can  be  different from the
              highest scoring one.

       --tabfile <f>
              Create a new output file <f> and print tabular  results  to  it.
              The  format  of  the  tabular  results  is  listed in the OUTPUT
              section. The tabular  results  can  be  more  easily  parsed  by
              scripts than the default cmsearch output. The esl-sfetch miniapp
              included in the easel/miniapps/ subdirectory of infernal  has  a
              --tabfile  option  that allows it to read cmsearch tab files and
              fetch the hits reported within them  from  the  target  database
              into a new sequence file.

       --gcfile <f>
              Create  a  new  output  file  <f> and print statistics of the GC
              content of the sequences in seqfile to it.   The  sequences  are
              partitioned  into  100  nt  non-overlapping  windows, and the GC
              percentage of each window is calculated. A normalized  histogram
              of  those GC percentages is then printed to <f> This file can be
              generated even if cmsearch is run with --forecast and no  search
              is performed.

       --rna  Output  the  hit alignments as RNA sequences alignments. This is
              true by default.

       --dna  Output the hit alignments as DNA sequence alignments.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See the  file  COPYING  that  came  with  the  source  for  details  on
       redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/

NAME

SYNOPSIS

DESCRIPTION

OUTPUT

OPTIONS

EXPERT OPTIONS

SEE ALSO

COPYRIGHT

AUTHOR