cmscore - align and score one or more sequences to a CM

NAME

       cmscore - align and score one or more sequences to a CM

SYNOPSIS

       cmscore [options] cmfile seqfile

DESCRIPTION

       cmscore uses the covariance model (CM) in cmfile to align and score the
       sequences in seqfile, and output  summary  statistics  on  timings  and
       scores.   cmscore  is a testbed for new CM alignment algorithms, and it
       is also used by the testsuite.  It is not intended to  be  particularly
       useful  in  the real world. Documentation is provided for completeness,
       and to aid our own memories.

       CMs are profiles of RNA consensus sequence and secondary  structure.  A
       CM  file  is produced by the cmbuild program, from a given RNA sequence
       alignment of known consensus structure.

       cmscore  aligns  the  sequence(s)  in  seqfile  using   two   alignment
       algorithms  and  compares  the scores and timings of each algorithm. By
       default the two algorithms compared are the divide  and  conquer  (D&C)
       CYK  algorithm  (SR  Eddy,  BMC Bioinformatics 3:18, 2002), and the HMM
       banded standard CYK algorithm.  When the --nonbanded option is  enabled
       D&C CYK is compared with the standard CYK alignment algorithm.  In this
       case, because both algorithms should find  the  optimal  alignment  the
       parsetree scores should be nearly identical (within 0.01 bits), if this
       is not the case for any sequence cmscore  exits  and  prints  an  error
       message.  When the --viterbi option is enabled D&C CYK is compared with
       Viterbi alignment to a CM  Plan  9  HMM  constructed  to  be  maximally
       similar to the CM.

       While  non-banded  CYK  variants  are  guaranteed  to  find the optimal
       alignment and score of each sequence, HMM banded  CYK  sacrifices  this
       guarantee for acceleration. The level of acceleration can be controlled
       by the tau parameter, which is set with the --tau <x> option.  This  is
       described in more detail in the man page for cmalign, but in short, <x>
       is a rough estimate at the probability that the optimal alignment  will
       be  missed.  The  greater <x> is, the greater the acceleration, but the
       greater the chance of missing the optimal alignment. By default tau  is
       set as 1E-7.  cmscore is useful for testing for values of tau that give
       the best trade-off between acceleration versus accuracy.  To make  this
       testing  easier,  multiple  tau  values  can  be tested within a single
       cmscore call. The --taus <x> and --taue <x>  option  combination  allow
       the  user to specify a beginning tau value and an ending tau value. For
       example, --taus 3 and and --taue 5 would first align the  sequences  in
       seqfile  with  non-banded  D&C  CYK,  and then perform 3 additional HMM
       banded alignments, first with tau=1E-3, next with tau=1E-4 and  finally
       with  tau=1E-5.  Currently,  only values of 1E-<x> can be used. Summary
       statistics on timings and how often the optimal alignment is missed for
       each value or tau are printed to stdout.

       When  comparing the non-banded standard CYK and D&C CYK algorithms with
       the --nonbanded option, the two parse trees should usually be identical
       for  any  sequence,  because the optimal alignment score is guaranteed.
       However, there can be cases of ties, where two or more different  parse
       trees  have identical scores. In such cases, it is possible for the two
       parse trees to differ.  The  parse  tree  selected  as  "optimal"  from
       amongst  the ties is arbitrary, dependent on order of evaluation in the
       DP traceback, and the order of evaluation for D&C vs. standard  CYK  is
       different.  Thus, in its testsuite role, cmscore checks that the scores
       are within 0.01 bits of each other, but does not check that  the  parse
       trees are absolutely identical.

       The  alignment algorithms can be run in "search" mode within cmscore by
       using  the  --search  option.   When  --search  is  enabled,   --inside
       specifies  that  the  Inside  algorithm  be  used  instead  of  CYK and
       --forward specifies that the HMM Forward algorithm be used  instead  of
       CYK.

       The  sequences  are  treated as single stranded RNAs; that is, only the
       given strand of each sequence is aligned and  scored,  and  no  reverse
       complementing is done.

OPTIONS

       -h     Print  brief  help;  includes  version number and summary of all
              options, including expert options.

       -n <n> Set the number of sequences to generate and align to <n>.   This
              option is incompatible with the --infile option.

       -l     Turn   on  the  local  alignment  algorithm,  which  allows  the
              alignment to span two or more subsequences if necessary (e.g. if
              the  structures  of the query model and target sequence are only
              partially  shared),  allowing  certain  large   insertions   and
              deletions  in  the  structure  to  be penalized differently than
              normal indels.  The default is to globally align the query model
              to the target sequences.

       -s <n> Set the random seed to <n>, where <n> is a positive integer. The
              default is to use time() to generate a different seed  for  each
              run,  which means that two different runs of cmscore on the same
              CM will give different results.  You  can  use  this  option  to
              generate  reproducible  results.  The random number generator is
              used to generate sequences to score, so -s is incompatible  with
              the  --infile option which supplies the sequences to score in an
              input file.

       -a     Print individual timings and score comparisons for each sequence
              in seqfile.  By default only summary statistics are printed.

       --sub  Turn  on the sub model construction and alignment procedure. For
              each sequence, an HMM is first used to predict the  model  start
              and  end consensus columns, and a new sub CM is constructed that
              only models consensus columns from start to end. The sequence is
              then aligned to this sub CM.  This option is useful for aligning
              sequences  that  are  known  to   truncated,   non-full   length
              sequences.   This "sub CM" procedure is not the same as the "sub
              CMs" described by Weinberg and Ruzzo.  When used in  combination
              with  --tfile the parsetree printed is not the sub CM parsetree,
              but rather the sub CM parstree mapped onto the topology  of  the
              original  CM. This mapped parsetree will likely have a different
              score (sometimes much worse) than the sub CM parsetree, both  of
              those  scores  are  printed  to  the  parsetree  file  for  each
              sequence.

       --mxsize <x>
              Set the maximum allowable DP matrix size to  <x>  megabytes.  By
              default  this  size is 2048 Mb.  This should be large enough for
              the most alignments, however if it  is  not  cmscore  will  exit
              prematurely and report an error message that the matrix exceeded
              it’s maximum allowable size. In this case, the --mxsize  can  be
              used  to  raise  the  limit,  or  if --nonbanded is enabled, the
              --scoreonly option will solve the  memory  issue.   This  memory
              error  is  most  likely  to occur when the --nonbanded option is
              used without the --scoreonly option, but can  still  occur  when
              --nonbanded is not used.

       --devhelp
              Print help, as with -h , but also include undocumented developer
              options.  These  options  are  not  listed  below,   are   under
              development or experimental, and are not guaranteed to even work
              correctly. Use developer options at  your  own  risk.  The  only
              resources  for understanding what they actually do are the brief
              one-line description printed when --devhelp is enabled, and  the
              source code.

       --mpi  Run  as  an  MPI  parallel  program.   This  option will only be
              available if Infernal has been configured  and  built  with  the
              --enable-mpi flag (see User’s Guide for details).

EXPERT OPTIONS

       --emit Generate  sequences  to  score  by  sampling  from the CM.  This
              option is on by default. The number of sequences generated is 10
              by  default but can be changed with the -n option. The sequences
              generated from the CM can be saved to an output  file  in  FASTA
              format with the --outfile option.

       --random
              Generate  sequences  to  score  by  sampling  from  the CMs null
              distribution. This option  turns  the  --emit  option  off.   By
              default  the  CM  distribution will be 0.25 for each of the four
              RNA nucleotides, but it may be different if  the  --null  option
              was  used  when  cmbuild  created  the  cmfile.  By default, the
              length of the sequences generated is  sampled  from  the  length
              distribution  of  the  CM.  The  average  length  of  the random
              sequences will  be  the  consensus  length  of  the  RNA  family
              modelled  by  the  CM (or very close to it).  Alternatively, the
              --Lmin <n> and --Lmax <n> options  can  be  used  to  specify  a
              length distribution.  The number of sequences generated is 10 by
              default but can be  changed  with  the  -n  option.  The  random
              sequences  generated  can  be  saved  to an output file in FASTA
              format with the --outfile option.

       --infile <f>
              Sequences to  score  are  read  from  the  file  <f>.   All  the
              sequences  from  <f>  are read and scored, the -n and -s options
              are incompatible with --infile.

       --outfile <f>
              Save generated sequences that are scored  to  the  file  <f>  in
              FASTA  format.   This  option  is incompatible with the --infile
              option.

       --Lmin <n1>
              Must be used in combination with --random and --Lmax <n2>.   The
              lengths  of  the  random  sequences generated and scored will be
              uniform between the range of <n1>..<n2>.

       --Lmax <n2>
              Must be used in combination with --random and --Lmin <n1>.   The
              lengths  of  the  random  sequences generated and scored will be
              uniform between the range of <n1>..<n2>.

       --pad  Must be used in combination with --emit and --search.   Add  <n>
              cm->W (max hit length) minus L (sequence <x> length) residues to
              the 5’ and 3’ end of each  emitted sequence <x>.

       --hbanded
              Specify that the second stage alignment algorithm be HMM  banded
              CYK.  This option is on by default. For more information on this
              option, see the description of the --hbanded option in  the  man
              page for cmalign.

       --tau <x>
              For stage 2 alignment, set the tail loss probability used during
              HMM band calculation to <x>.  This is the amount of  probability
              mass  within  the HMM posterior probabilities that is considered
              negligible. The default  value  is  1E-7.   In  general,  higher
              values  will  result  in  greater acceleration, but increase the
              chance of missing the optimal alignment due to the HMM bands.

       --aln2bands
              With --search, when calculating HMM bands, use an HMM  alignment
              algorithm  instead of an HMM search algorithm. In general, using
              this option  will  result  in  greater  acceleration,  but  will
              increase the chance of missing the optimal alignment.

       --hsafe
              For  stage  2 HMM banded alignment, realign any sequences with a
              negative alignment  score  using  non-banded  CYK  to  guarantee
              finding the optimal alignment.

       --nonbanded
              Specify  that  the second stage alignment algorithm be standard,
              non-banded,  non-D&C  CYK.  When  --nonbanded  is  enabled,  the
              program  fails  with  a  non-zero  exit code and prints an error
              message if the parsetree score for any sequence from stage 1 D&C
              alignment  and stage 2 alignment differs by more than 0.01 bits.
              In theory, this should  never  happen  as  both  algorithms  are
              guaranteed  to determine the optimal parsetree.  For larger RNAs
              (more than 300 residues)  if  memory  is  limiting,  --nonbanded
              should be used in combination with --scoreonly.

       --scoreonly
              With --nonbanded during the second stage standard non-banded CYK
              alignment, use the  "score only" variant  of  the  algorithm  to
              save memory, and don’t recover a parse tree.

       --viterbi
              Specify  that the second stage alignment algorithm be Viterbi to
              a CM Plan 9 HMM.

       --search
              Run all algorithms in scanning mode, not alignment  mode.
              This  means  the  highest scoring subsequence within each
              sequence is returned as the score,  not  necessarily  the
              score of an alignment of the full sequence.

       --inside
              With  --search  Compare  the  non-banded  scanning Inside
              algorithm to the HMM  banded  scanning  Inside  algorith,
              instead of using CYK versions.

       --forward
              With   --search  Compare  the  scanning  Forward  scoring
              algorithm against CYK.

       --taus <n>
              Specify the first alignment algorithm as  non-banded  D&C
              CYK, and multiple stages of HMM banded CYK alignment. The
              first HMM banded alignment  will  use  tau=1E-<x>,  which
              will  be  the highest value of tau used.  Must be used in
              combination with --taue.

       --taue <n>
              Specify the first alignment algorithm as  non-banded  D&C
              CYK, and multiple stages of HMM banded CYK alignment. The
              final HMM banded alignment  will  use  tau=1E-<x>,  which
              will  be  the  lowest value of tau used.  Must be used in
              combination with --taus.

       --tfile <f>
              Print the parsetrees for each alignment of each  sequence
              to file <f>.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See  the  file  COPYING that came with the source for details on
       redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

EXPERT OPTIONS

SEE ALSO

COPYRIGHT

AUTHOR