cmcalibrate - fit exponential tails for E-values and determine HMM

NAME

       cmcalibrate - fit exponential tails for E-values and determine HMM
                         filter thresholds for a CM

SYNOPSIS

       cmcalibrate [options] cmfile

DESCRIPTION

       cmcalibrate calibrates E-value statistics and HMM filter thresholds for
       the covariance models (CMs) in cmfile.  The  E-values  and  HMM  filter
       threshold  statistics  are added to the cmfile and are used by cmsearch
       for increased sensitivity and acceleration in RNA homology search.

       CMs are profiles of RNA consensus sequence and secondary  structure.  A
       CM  file  is produced by the cmbuild program, from a given RNA sequence
       alignment of known consensus structure.

       cmcalibrate is very slow. It takes several hours to calibrate a  single
       average  sized CM.  cmcalibrate can be run in parallel with MPI.  To do
       this, use the --mpi option  and  run  cmsearch  inside  a  MPI  wrapper
       program such as mpirun.  For example: mpirun C cmcalibrate --mpi [other
       options] cmfile.  Note that cmcalibrate must have been compiled in  MPI
       mode  to  use  the  --mpi  option.  See the Installation seciton of the
       user’s guide for instructions on how to compile in MPI mode.

       The --forecast <n> option can be used to estimate how long the  program
       will  take  to  run  on  <n>  processors.  Unless  you  plan on running
       cmcalibrate in MPI mode, <n> should be set as 1.

       cmcalibrate performs two main tasks. The first is to calibrate  E-value
       statistics.   This is done by generating random sequences and searching
       them with the CM and collecting hits. The histogram of the  bit  scores
       of  the  hits  is fit to an exponential tail, and the parameters of the
       fitted tail are saved to the CM file. The exponential tail is  used  to
       predict  the expected number of hits (E-values) at a given bit score in
       cmsearch.  The random sequences  are  generated  by  an  HMM  that  was
       trained on real genomic sequences with various GC contents. The goal is
       to have the GC distributions in the random sequences to be  similar  to
       actual genomic sequences.

       The  second  task is to determine appropriate HMM filter thresholds for
       the CM over the possible range of final CM bit score  thresholds.  This
       is  done  by sampling 10,000 sequences from the CM itself and searching
       them with the CM and HMM. The appropriate HMM bit score threshold for a
       given  CM  threshold  is  set  as the HMM threshold that will recognize
       99.5% of the hits that score above the CM threshold. This HMM threshold
       is calculated over the range of reasonable CM thresholds.

       Both  tasks must be performed for each configuration and algorithm that
       cmsearch might use. These include HMM Viterbi, HMM Forward, CM CYK  and
       CM  Inside algorithms for E-value calibration, and CM CYK and CM Inside
       algorithms for HMM filter thresholds. Additionally, for each algorithm,
       each  task must be performed twice, once for a locally configured model
       and once for a globally configured model.

       The E-values and HMM filter thresholds determined  by  cmcalibrate  are
       only  used  by  the  cmsearch  program.   If  you  are not going to use
       cmsearch, do not waste time calibrating your models.

       The majority of the  options  to  cmcalibrate  fall  into  one  of  two
       categories, depending on which of the two main tasks they’re associated
       with. Options that affect the  exponential  tail  E-value  fitting  are
       prefixed  with  --exp.   Options  that  affect the HMM filter threshold
       determination are prefixed with --fil.

       The calibration of E-value statistics takes the majority of the running
       time  of  cmcalibrate.   This is because CM search algorithms are slow,
       and the random sequences that must be searched have to be  long  enough
       to  include  enough  random hits that can be binned into a histogram to
       which an exponential tail can be reliably fit. By  default  the  random
       sequence  length  for CM searches is 1.5 megabases (Mb), for all search
       modes, but 1.5 can be changed to <x> with --exp-cmL-glc <x>  or  --exp-
       cmL-loc  <x>  options  for  glocal  and  local  CM  search calibrations
       respectively.  Because cmsearch uses HMM search algorithms  to  filter,
       cmcalibrate  must also fit exponential tails for HMM search algorithms.
       HMMs are much faster than CMs so it is possible to search  much  longer
       random  sequence than 1.5 MB and not significantly increase the running
       time of cmcalibrate.  The length of sequence searched with the  HMM  is
       controlled by the --exp-fract <x>, --exp-hmmLn-glc <x>, --exp-hmmLn-loc
       <x>, and the --exp-hmmLx <x> options.  By default, the sequence  length
       for  HMM  calibration is set as the length that will require 0.10 times
       the  number  of  dynamic  programming  calculations  as  a  CM  E-value
       calibration step. (The value 0.10 can be changed to <x> with the --exp-
       fract <x> option).  If this sequence length  is  less  than  a  minimum
       value, which by default is 15.0 MB, then the minimum value is used. The
       minimum value can be changed to <x> with --exp-hmmLn-glc <x> and --exp-
       hmmLn-loc  <x> for glocal and local HMM search calibrations separately.
       Similarily if this value is more than a maximum value, which by default
       is  1000.0 MB, then the maximum value is used. The maximum value can be
       changed to <x> with the --exp-hmmLx <x> option.

OPTIONS

       -h     Print brief help; includes version number  and  summary  of  all
              options, including expert options.

       -s <n> Set  the  random  number  generator  seed to <n>, where <n> is a
              positive integer.  The default is to use time()  to  generate  a
              different seed for each run, which means that two different runs
              of cmcalibrate on the same CM will give  slightly  different  E-
              value  and  HMM  filter  threshold  parameters. You can use this
              option to generate reproducible results.

       --forecast <n>
              Predict the running time  of  the  calibration  for  cmfile  and
              provided  options and exit, DO NOT perform the calibration.  The
              predictions should be used as rough estimates. The value <n>  is
              the  number of processors the calibration will be run on, so <n>
              equal to 1 is appropriate unless you  will  run  cmcalibrate  in
              parallel with MPI.

       --devhelp
              Print help, as with -h , but also include undocumented developer
              options.  These  options  are  not  listed  below,   are   under
              development or experimental, and are not guaranteed to even work
              correctly. Use developer options at  your  own  risk.  The  only
              resources  for understanding what they actually do are the brief
              one-line description printed when --devhelp is enabled, and  the
              source code.

       --mpi  Run  as  an  MPI  parallel  program.  This  option  will only be
              available if Infernal has been configured  and  built  with  the
              "--enable-mpi" flag (see User’s Guide for details).

EXPERT OPTIONS

       --exp-cmL-glc <x>
              Set  the  length  of random sequence to search for the CM glocal
              exponential tail fits to <x> megabases (Mb). By default, <x>  is
              1.5  Mb. Searching more sequences will make the exponential tail
              fits more precise, but  will  take  longer:  using  <x>  of  3.0
              instead  of  the  default  of 1.5 will cause the running time of
              cmcalibrate to increase by roughly 50%.

       --exp-cmL-loc <x>
              Set the length of random sequence to search  for  the  CM  local
              exponential  tail fits to <x> megabases (Mb). By default, <x> is
              1.5 Mb. Searching more sequences will make the exponential  tail
              fits  more  precise,  but  will  take  longer:  using <x> of 3.0
              instead of the default of 1.5 will cause  the  running  time  of
              cmcalibrate to increase by roughly 50%.

       --exp-hmmLn-glc <x>
              Set  the  minimum  random  sequence length to search for the HMM
              glocal exponential tail fits to <x> megabases (Mb). By  default,
              <x>   is  15.0.   For  more  information,  see  the  explanation
              regarding sequence lengths for E-value calibration above  before
              the Options section.

       --exp-hmmLn-loc <x>
              Set  the  minimum  random  sequence length to search for the HMM
              local exponential tail fits to <x> megabases (Mb).  By  default,
              <x>   is  15.0.   For  more  information,  see  the  explanation
              regarding sequence lengths for E-value calibration above  before
              the Options section.

       --exp-hmmLx <x>
              Set   the   maximum   random  sequence  length  to  search  when
              determining HMM E-values to <x> megabases (Mb). By default,  <x>
              is  1000.0.  For more information, see the explanation regarding
              sequence  lengths  for  E-value  calibration  above  before  the
              Options section.

       --exp-fract <x>
              Set  the  HMM/CM fraction of dynamic programming calculations to
              <x>.  By default, <x> is 0.10.  For more  information,  see  the
              explanation  regarding  sequence lengths for E-value calibration
              above before the Options section.

       --exp-tailn-cglc <x>
              During E-value calibration of glocal CM  search  modes  fit  the
              exponential  tail  to the high scores in the histogram tail that
              includes <x> hits per Mb searched.  By default this <x>  is  25.
              The  value  25  was chosen because it works well empirically for
              glocal CM modes relative to other values.

       --exp-tailn-cloc <x>
              During E-value calibration of local  CM  search  modes  fit  the
              exponential  tail  to the high scores in the histogram tail that
              includes <x> hits per Mb searched.  By default this <x>  is  75.
              The  value  75  was chosen because it works well empirically for
              local CM modes relative to other values.

       --exp-tailn-hglc <x>
              During E-value calibration of glocal HMM search  modes  fit  the
              exponential  tail  to the high scores in the histogram tail that
              includes <x> hits per Mb searched.  By default this <x> is  250.
              The  value  250 was chosen because it works well empirically for
              glocal HMM modes relative to other values.

       --exp-tailn-hloc <x>
              During E-value calibration of local HMM  search  modes  fit  the
              exponential  tail  to the high scores in the histogram tail that
              includes <x> hits per Mb searched.  By default this <x> is  750.
              The  value  750 was chosen because it works well empirically for
              glocal HMM modes relative to other values.

       --exp-tailp <x>
              Ignore the --exp-tailn prefixed options and fit the <x> fraction
              right tail of the histogram to exponential tails, for all search
              modes.

       --exp-tailxn <n>
              With --exp-tailp enforce that the maximum number of hits in  the
              tail that is fit is <n>.

       --exp-beta <x>
              During  E-value  calibration, by default query-dependent banding
              (QDB) is used to accelerate the CM search algorithms with a beta
              tail  loss probability of 1E-15.  This beta value can be changed
              to <x> using the --exp-beta <x> option. The  beta  parameter  is
              the amount of probability mass excluded during band calculation,
              higher values of beta give greater speedups but  sacrifice  more
              accuracy   than  lower  values.  A  recommended  value  is  1E-7
              (0.00001).  QDB is explained in more detail in the  manual  page
              for  cmsearch  and  in  (Nawrocki  and  Eddy, PLoS Computational
              Biology 3(3): e56).

       --exp-no-qdb
              Turn of QDB during E-value  calibration.  This  will  slow  down
              calibration,  and  is  not  recommended unless you plan on using
              --no-qdb in cmsearch.

       --exp-hfile <f>
              Save the histograms fit for the E-value calibration to file <f>.
              The  format of this file is two tab delimited columns. The first
              column is the x-axis values of  bit  scores  of  each  bin.  The
              second  column  is  the y-axis values of number of hits per bin.
              Each series is delimited by a line with a single character  "&".
              The  file will contain one series for each exponential tail fit,
              i.e. one series of empirical data for each line of  output  from
              cmcalibrate that begins with "exp tail".

       --exp-sfile <f>
              Save  a  survival  plot for the E-value calibration to file <f>.
              The format of this file is two tab delimited columns. The  first
              column  is  the  x-axis  values  of  bit scores of each bin. The
              second column is the y-axis values of fraction of hits that meet
              or  exceed the score for each bin. Each series is delimited by a
              line with a single character "&".  The file will  contain  three
              series’ of data for each exponential tail fit, i.e. three series
              for each line of output from cmcalibrate that begins  with  "exp
              tail".  The first series is the empirical survival plot from the
              histogram of hits to the random sequence. The second  series  is
              the  exponential  tail  fit  to  the empirical distribution. The
              third series is the exponential tail fit if  lambda  were  fixed
              and set as the natural log of 2 (0.691314718).

       --exp-qqfile <f>
              Save  a  quantile-quantile  plot  for the E-value calibration to
              file <f>.  The format of this file is two tab delimited columns.
              The  first column is the x-axis values, and the second column is
              the y-axis values. The distance of the points from the  identity
              line (y=x) is a measure of how good the exponential tail fit is,
              the closer the points are to the identity line, the  better  the
              fit  is.   Each  series  is  delimited  by  a line with a single
              character "&".  The file will contain one  series  of  empirical
              data  for  each  exponential  tail fit, i.e. one series for each
              line of output from cmcalibrate that begins with "exp tail".

       --exp-ffile <f>
              Save statistics on the exponential tail statistics to file  <f>.
              The  file  will contain the lambda and mu values for exponential
              tails fit to tails of different sizes. For example,  by  default
              cmcalibrate  fits exponential tails to the rightmost 0.01 (1) of
              the  score  histogram  and  stores  the   parameters   of   that
              exponential  tail  to  the  CM  file.  (The value of 0.01 can be
              changed to <x> with the --exp-tailp <x>  option).   When  --exp-
              ffile  <f>  is  used  the file  <f> will include the exponential
              tail parameters for fits to various fractions of  the  histogram
              tail, instead of just to 0.01.

       --fil-N <n>
              Set  the  number  of  sequences sampled and searched for the HMM
              filter threshold calibration to <n>.  By default, <n> is 10,000.

       --fil-F <x>
              Set the fraction of sample sequences the HMM filter must be able
              to recognize, and allow to survive,  to  <x>,  where  <x>  is  a
              positive real number less than or equal to 1.0.  By default, <x>
              is 0.993.

       --fil-tau <x>
              Set the tail loss probability during HMM  band  calculation  for
              HMM  filter threshold calibration to <x>.  This is the amount of
              probability mass within the HMM posterior probabilities that  is
              considered  negligible.  The default value is 1E-7.  In general,
              higher values will result in greater acceleration, but  increase
              the  chance  of  missing  the  optimal  alignment due to the HMM
              bands.

       --fil-gemit
              During HMM filter calibration, always sample  sequences  from  a
              globally  configured  CM,  even when calibrating local modes. By
              default, sequences are sampled from  a  globally  configured  CM
              when  calibrating  the  global  search modes, and sampled from a
              locally configured CM when calibrating the local  search  modes.

       --fil-dfile <f>
              Save  statistics  on filter threshold calibration, including HMM
              and CM scores for all sampled sequences, to file <f>.

       --mxsize <x>
              Set the maximum allowable DP matrix size to  <x>  megabytes.  By
              default  this size is 2,048 Mb.  This should be large enough for
              the  vast  majority  of  calibrations,  however  if  it  is  not
              cmcalibrate  will  exit  prematurely and report an error message
              that the matrix exceeded it’s maximum allowable  size.  In  this
              case, the --mxsize can be used to raise the limit.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See  the  file  COPYING  that  came  with  the  source  for  details on
       redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

EXPERT OPTIONS

SEE ALSO

COPYRIGHT

AUTHOR