cmbuild - construct a CM from an RNA multiple sequence alignment

NAME

       cmbuild - construct a CM from an RNA multiple sequence alignment

SYNOPSIS

       cmbuild [options] cmfile alifile

DESCRIPTION

       cmbuild   read   an  RNA  multiple  sequence  alignment  from  alifile,
       constructs a covariance model (CM), and saves the CM to cmfile.

       The alignment file must  be  in  Stockholm  format,  and  must  contain
       consensus  secondary  structure annotation.  cmbuild uses the consensus
       structure to determine the architecture of the CM.

       The  alignment  file  may  be  a  database  containing  more  than  one
       alignment.   If so, the resulting cmfile will be a database of CMs, one
       per alignment.

       The expert options --ctarget, --cmindiff, and --call result in multiple
       CMs being built from each alignment in alifile as described below.

OUTPUT

       The  default output from cmbuild is tabular, with a single line printed
       for each model . Each line has the following fields: aln: the index  of
       the  alignment used to build the CM, cm idx: the index of the CM in the
       cmfile; name: the name of the CM, nseq: the number of sequences in  the
       alignment  used  to  build  the  CM,  eff_nseq: the effective number of
       sequences used to build the model  (see  the  User  Guide);  alen:  the
       length  of  the  alignment  used  to  build the CM; clen: the number of
       columns from the alignment defined as consensus columns;  rel  entropy,
       CM:  the  total  relative entropy of the model divided by the number of
       consensus columns; rel entropy, HMM: the total relative entropy of  the
       model  ignoring  secondary structure divided by the number of consensus
       columns.

OPTIONS

       -h     Print brief help; includes version number  and  summary  of  all
              options, including expert options.

       -n <s> Name  the  covariance  model  <s>.   (Does  not  work if alifile
              contains more than one alignment).  The default is  to  use  the
              name  of  the  alignment (given by the #=GF ID tag, in Stockholm
              format), or if that is not present,  to  use  the  name  of  the
              alignment  file  minus  any file type extension plus a "-" and a
              positive integer indicating the position of  that  alignment  in
              the  file  (that  is, the first alignment in a file "myrnas.sto"
              would give a CM named "myrnas-1",  the  second  alignment  would
              give a CM named "myrnas-2").

       -A     Append the CM to cmfile, if cmfile already exists.

       -F     Allow  cmfile  to  be  overwritten.  Normally, if cmfile already
              exists, cmbuild exits with an error unless the -A or  -F  option
              is set.

       -v     Run  in  verbose output mode instead of using the default single
              line tabular format. This output format is similar to that  used
              by older versions of Infernal.

       --iins Allow  informative insert emissions for the CM.  By default, all
              CM insert emission scores are set to 0.0 bits.   The  motivation
              for  zero  bit  scores  is  to  avoid  high-scoring  hits to low
              complexity  sequence  favored  by  high  insert  state  emission
              scores.

       --Wbeta<x>
              Set  the  beta tail loss probability for query-dependent banding
              (QDB) to <x> The QDB algorithm is used to determine the maximium
              length  of  a  hit to the model. For more information on QDB see
              (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56).   The
              beta  paramater  is  the  amount  of probability mass considered
              negligible during band calculation, lower values  of  beta  will
              result  in  shorter maximum hit lengths, which will yield faster
              searches.  The default beta is 1E-7: determined empirically as a
              good tradeoff between sensitivity, specificity  and speed.

       --devhelp
              Print help, as with -h , but also include undocumented developer
              options. These options are not  listed  below.  They  are  under
              development or experimental, and are not guaranteed to even work
              correctly. Use developer options at  your  own  risk.  The  only
              resources  for understanding what they actually do are the brief
              one-line description printed when --devhelp is enabled, and  the
              source code.

EXPERT OPTIONS

       --rsearch <f>
              Parameterize  emission  scores  a  la RSEARCH, using the RIBOSUM
              matrix in file <f>.  (Actually, the emission scores will not  be
              identical  to RIBOSUM scores due of differences in the modelling
              strategy between Infernal and  RSEARCH,  but  they  will  be  as
              similar  as  possible.)   RIBOSUM matrix files are included with
              Infernal  in  the  "matrices/"  subdirectory  of  the  top-level
              Infernal  directory.   RIBOSUM  matrices  are substitution score
              matrices trained specifically for structural RNAs with  separate
              single  stranded  residue and base pair substitution scores. For
              more information see the RSEARCH publication  (Klein  and  Eddy,
              BMC  Bioinformatics  4:44,  2003). Actually, the emission scores
              will not exactly

              With --rsearch enabled, all alignments in alifile  must  contain
              exactly  one sequence or the --call option must also be enabled.

       --binary
              Save the model in a compact binary format. The default is a more
              readable ASCII text format.

       --rf   Use reference coordinate annotation (#=GC RF line, in Stockholm)
              to determine which columns are consensus, and which are inserts.
              Any   non-gap  character  indicates  a  consensus  column.  (For
              example, mark consensus columns with  "x",  and  insert  columns
              with  ".".)   The default is to determine this automatically; if
              the frequency of gap characters in a column is  greater  than  a
              threshold,  gapthresh  (default  0.5),  the  column is called an
              insertion.

       --gapthresh <x>
              Set the gap threshold (used for determining  which  columns  are
              insertions  versus  consensus;  see  --rf  above)  to  <x>.  The
              default is 0.5.

       --ignorant
              Strip all base pair secondary  structure  information  from  all
              input  alignments  in  alifile  before  building  the CM(s). All
              resulting CM(s) will have zero MATP (base pair) nodes, with zero
              bifurcations.

       --wgsc Use  the  Gerstein/Sonnhammer/Chothia (GSC) weighting algorithm.
              This is the default  unless  the  number  of  sequences  in  the
              alignment  exceeds  a cutoff (see --pbswitch), in which case the
              default becomes the  faster  Henikoff  position-based  weighting
              scheme.

       --wblosum
              Use  the  BLOSUM  filtering  algorithm  to weight the sequences,
              instead of the default GSC weighting.  Cluster the sequences  at
              a  given  percentage identity (see --wid); assign each cluster a
              total weight of 1.0, distributed equally amongst the members  of
              that cluster.

       --wpb  Use the Henikoff position-based weighting scheme. This weighting
              scheme is automatically used (overriding --wgsc  and  --wblosum)
              if  the  number  of  sequences in the alignment exceeds a cutoff
              (see --pbswitch).

       --wnone
              Turn sequence weighting off; e.g. explicitly  set  all  sequence
              weights to 1.0.

       --wgiven
              Use  sequence  weights  as  given  in  annotation  in  the input
              alignment file. If no weights were given, assume  they  are  all
              1.0.   The  default  is to determine new sequence weights by the
              Gerstein/Sonnhammer/Chothia algorithm,  ignoring  any  annotated
              weights.

       --pbswitch <n>
              Set  the cutoff for automatically switching the weighting method
              to the Henikoff position-based weighting scheme to <n>.  If  the
              number  of  sequences  in  the  alignment  exceeds  <n> Henikoff
              weighting is used.  By default <n> is 5000.

       --wid <x>
              Controls the behavior  of  the  --wblosum  weighting  option  by
              setting  the  percent  identity  for clustering the alignment to
              <x>.

       --eent Use the entropy weighting strategy to  determine  the  effective
              sequence  number  that  gives a target mean match state relative
              entropy. This option is the default, and can be turned off  with
              --enone.   The  default target mean match state relative entropy
              is 0.59 bits but can be changed with --ere.  The default of 0.59
              bits  is  automatically changed if the total relative entropy of
              the model (summed match state relative entropy) is less  than  a
              cutoff, which is is 6.0 bits by default, but can be changed with
              the expert, undocumented --eX option. If you really want to play
              with that option, consult the source code.

       --enone
              Turn  off the entropy weighting strategy. The effective sequence
              number is just the number of sequences in the alignment.

       --ere <x>
              Set the target mean match state relative  entropy  as  <x>.   By
              default  the  target relative entropy per match position is 0.59
              bits.

       --null <f>
              Read a  null  model  from  <f>.   The  null  model  defines  the
              probability  of  each RNA nucleotide in background sequence, the
              default is to use 0.25 for each nucleotide.  The format of  null
              files is documented in the User’s Guide.

       --prior <f>
              Read  a  Dirichlet prior from <f>, replacing the default mixture
              Dirichlet.  The format of  prior  files  is  documented  in  the
              User’s Guide.

       --ctarget <n>
              Cluster  each  alignment  in alifile by percent identity. Find a
              cutoff percent id threshold that gives exactly <n> clusters  and
              build  a  separate  CM from each cluster. If <n> is greater than
              the number of sequences in the alignment the  program  will  not
              complain,  and  each  sequence  in the alignment will be its own
              cluster.  Each CM will have a positive integer appended  to  its
              name indicating the order in which it was built. For example, if
              cmbuild --ctarget 3 is called  with  alifile  "myrnas.sto",  and
              "myrnas.sto"  has  exactly one Stockholm alignment in it with no
              #=GF ID tag annotation, three CMs will be built, the first  will
              be  named  "myrnas-1.1", the second, "myrnas-1.2", and the third
              "myrnas-1.3".  (As explained above for the -n option, the  first
              number  "1"  after  "myrnas" indicates the CM was built from the
              first alignment in "myrnas.sto".)

       --cmaxid <x>
              Cluster each sequence alignment in alifile by percent  identity.
              Define  clusters  at  the cutoff fractional id similarity of <x>
              and build a separate CM from each  cluster.   No  two  sequences
              will  be  be  more  than  <x> fractionally identical ( <x> * 100
              percent identical) if  those  two  sequences  are  in  different
              clusters.  The CMs are named as described above for --ctarget.

       --call Build  a  separate  CM  from  each sequence in each alignment in
              alifile.  Naming of CMs  takes  place  as  described  above  for
              --ctarget.   Using  this  option  in  combination with --rsearch
              causes a separate CM to  be  built  and  parameterized  using  a
              RIBOSUM matrix for each sequence in alifile.

       --corig
              After  building  multiple  CMs  using  --ctarget,  --cmindiff or
              --call as described above, build a final CM using  the  complete
              original alignment from alifile.  The CMs are named as described
              above for --ctarget with the exception of  the  final  CM  built
              from  the  original  alignment  which  is  named  in the default
              manner, without an appended integer.

       --cdump<f>
              Dump the multiple alignments of each cluster to <f> in Stockholm
              format.   This  option only works in combination with --ctarget,
              --cmindiff or --call.

       --refine <f>
              Attempt to refine the alignment before  building  the  CM  using
              expectation-maximization  (EM).  A  CM  is  first built from the
              initial alignment as usual. Then, the sequences in the alignment
              are  realigned  optimally  (with  the  HMM banded CYK algorithm,
              optimal means optimal given the bands) to the CM, and a  new  CM
              is  built  from  the resulting alignment. The sequences are then
              realigned to the new CM,  and  a  new  CM  is  built  from  that
              alignment.  This  is  continued  until convergence, specifically
              when the  alignments  for  two  successive  iterations  are  not
              significantly  different  (the  summed  bit  scores  of  all the
              sequences in the alignment changes  less  than  1%  between  two
              successive  iterations). The final alignment (the alignment used
              to build the CM that gets written to cmfile) is written to  <f>.

       --gibbs
              Modifies  the  behavior  of  --refine  so Gibbs sampling is used
              instead of EM. The difference is that during the alignment stage
              the  alignment  is not necessarily optimal, instead an alignment
              (parsetree) for each sequences is  sampled  from  the  posterior
              distribution   of   alignments   as  determined  by  the  Inside
              algorithm.  Due  to  this  sampling   step   --gibbs   is   non-
              deterministic,  so  different  runs  with the same alignment may
              yield different results. This is not true when --refine is  used
              without  the  --gibbs  option, in which case the final alignment
              and CM will always be the same. When --gibbs is enabled, the  -s
              <n>  option  can  be  used  to  seed the random number generator
              predictably, making the results reproducible.  The goal  of  the
              --gibbs  option  is to help expert RNA alignment curators refine
              structural alignments by allowing them  to  observe  alternative
              high scoring alignments.

       -s <n> Set  the  random  seed  to <n>, where <n> is a positive integer.
              This option can only be used in combination with  --gibbs.   The
              default  is  to use time() to generate a different seed for each
              run, which means that two different runs of cmbuild --refine <f>
              --gibbs  on  the  same  alignment  will  give slightly different
              results. You  can  use  this  option  to  generate  reproducible
              results.

       -l     With  --refine,  turn  on  the  local alignment algorithm, which
              allows the  alignment  to  span  two  or  more  subsequences  if
              necessary  (e.g. if the structures of the query model and target
              sequence are only  partially  shared),  allowing  certain  large
              insertions  and  deletions  in  the  structure  to  be penalized
              differently than normal indels.   The  default  is  to  globally
              align the query model to the target sequences.

       -a     With  --refine,  print  the  scores  of each individual sequence
              alignment.

       --cyk  With --refine, align with the  CYK  algorithm.  By  default  the
              optimal accuracy algorithm is used. There is more information on
              this in the cmalign manual page.

       --sub  With --refine, turn on the sub model construction and  alignment
              procedure.  For  each  sequence  to be realigned an HMM is first
              used to predict the model start and end consensus columns, and a
              new  sub  CM  is  constructed that only models consensus columns
              from start to end. The sequence is then aligned to this sub  CM.
              This  option  is  useful  for  building  CMs for alignments with
              sequences  that  are  known  to   truncated,   non-full   length
              sequences.  This  option  is  experimental  and  not  rigorously
              tested, use at your own risk.  This "sub CM"  procedure  is  not
              the same as the "sub CMs" described by Weinberg and Ruzzo.

       --nonbanded
              With  --refine,  do  not  use HMM bands to accelerate alignment.
              Use the full CYK algorithm  which  is  guaranteed  to  give  the
              optimal  alignment.   This will slow down the run significantly,
              especially for large models.

       --tau <x>
              With --refine, set the tail loss  probability  used  during  HMM
              band calculation to <f>.  This is the amount of probability mass
              within  the  HMM  posterior  probabilities  that  is  considered
              negligible.  The  default  value  is  1E-7.   In general, higher
              values will result in greater  acceleration,  but  increase  the
              chance of missing the optimal alignment due to the HMM bands.

       --fins With  --refine,  change the behavior of how insert emissions are
              placed in the alignment.  By default, all contiguous  blocks  of
              inserts  are  split  in  half, and half the residues are flushed
              left against the nearest consensus column to the left, and  half
              are  flushed  right  against the nearest consensus column on the
              right. With --fins inserts are not split in  half,  instead  all
              inserted  residues  from IL states are flushed left, instead all
              inserted residues from IR states are flushed right. This was the
              default behavior of previous versions of Infernal.

       --mxsize <x>
              With  --refine,  set  the  maximum  allowable  matrix  size  for
              alignment to <x> megabytes. By default this size is 2 Gb.   This
              should  be  large  enough  for  the vast majority of alignments,
              however it is possible that when run with --refine, cmbuild will
              exit  prematurely,  reporting  an  error message that the matrix
              exceeded it’s maximum allowable size. In this case, the --mxsize
              can be used to raise the limit.

       --rdump<x>
              With  --refine,  output  the  intermediate  alignments  at  each
              iteration of the refinement procedure (as  described  above  for
              --refine ) to file <f>.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See  the  file  COPYING  that  came  with  the  source  for  details on
       redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/

NAME

SYNOPSIS

DESCRIPTION

OUTPUT

OPTIONS

EXPERT OPTIONS

SEE ALSO

COPYRIGHT

AUTHOR