cmalign - use a CM to make a structured RNA multiple alignment

NAME

       cmalign - use a CM to make a structured RNA multiple alignment

SYNOPSIS

       Align sequences to a CM:
              cmalign [options] cmfile seqfile

       Merge two alignments:
              cmalign --merge [options] cmfile msafile1 msafile2

DESCRIPTION

       cmalign  aligns  the  RNA  sequences in seqfile to the covariance model
       (CM)  in  cmfile,  and   outputs   a   multiple   sequence   alignment.
       Alternatively,   with  the  --merge  option,  cmalign  merges  the  two
       alignments msafile1 and msafile2 created by previous  runs  of  cmalign
       with cmfile into a single alignment.

       The sequence file seqfile must be in FASTA, EMBL, or GenBank format.

       CMs  are  profiles of RNA consensus sequence and secondary structure. A
       CM file is produced by the cmbuild program, from a given  RNA  sequence
       alignment of known consensus structure.

       The  alignment  that  cmalign makes is written in Stockholm format.  It
       can be redirected to a file using the -o option.

       cmalign uses an  HMM  banding  technique  to  accelerate  alignment  by
       default as described below for the --hbanded option. HMM banding can be
       turned off with the --nonbanded option.

       By default,  cmalign  computes  the  alignment  with  maximum  expected
       accuracy  that  is  consistent with constraints (bands) derived from an
       HMM, using a banded  version  of  the  Durbin/Holmes  optimal  accuracy
       algorithm.  This behavior can be changed, as described below and in the
       User’s Guide, with the --cyk, --sample, or --viterbi options.

       It is possible to include the fixed training alignment  used  to  build
       the  CM within the output alignment of cmalign.  This is done using the
       --withali option, as described below and in the User’s Guide.

OUTPUT

       cmalign first  outputs  tabular  information  on  the  scores  of  each
       sequence  being  aligned,  then  the  alignment  itself is printed. The
       alignment can be redirected to an output  file  <f>  with  the  -o  <f>
       option.  The tabular output section includes one line per sequence and
       seven fields per line:  "seq idx": the index of  the  sequence  in  the
       input  file,  "seq  name":  the sequence name, "len": the length of the
       sequence, "total": the total bit score of the  sequence,  "struct":  an
       approximation of the contribution of the secondary structure to the bit
       score,  "avg  prob":  the  average  posterior  probability  (confidence
       estimate)  of  each aligned residue, and "elapsed": the wall time spent
       aligning the sequence.

       The fields can change if different options are selected. For example,
       if the --cyk option is enabled, the "avg prob" field disappears because
       posterior probabilities are not calculated by the CYK algorithm.

OPTIONS

       -h     Print brief help; includes version number  and  summary  of  all
              options, including expert options.

       -o <f> Save  the  alignment  in  Stockholm  format  to a file <f>.  The
              default is to write it to standard output.

       -l     Turn  on  the  local  alignment  algorithm,  which  allows   the
              alignment to span two or more subsequences if necessary (e.g. if
              the structures of the query model and target sequence  are  only
              partially   shared),   allowing  certain  large  insertions  and
              deletions in the structure  to  be  penalized  differently  than
              normal indels.  The default is to globally align the query model
              to the target sequences.

       -p     Annotate the alignment with posterior  probabilities  calculated
              using  the  Inside and Outside algorithms.  The -p option causes
              additional annotation to appear in  the  output  alignment,  but
              does  not  modify  the  alignment  itself (that is, the relative
              positions of the residues are unchanged).   Two  characters  for
              each residue are used to annotate the posterior probability that
              the corresponding residue aligns at the  corresponding  position
              in  the Stockholm alignment. These characters have the Stockholm
              markup tags "#=GR  <seq  name>  POSTX."  and  "#=GR  <seq  name>
              POST.X", and can only take the values "0" through "9", "*", or
              ".". They indicate the tens and ones places of the posterior
              probability expressed as a percentage: an "8" for "POSTX." and
              a "3" for "POST.X" indicates that the posterior probability is
              between 0.83 and 0.84. A "*" for both "POSTX." and "POST.X"
              indicates that the confidence estimate is "very nearly" 1.0 (it
              is hard to be exact here due to numerical precision issues). A
              "." in both "POSTX." and "POST.X" indicates that the column
              aligns to a gap. When used in combination with
              --nonbanded,  the  calculation  of  the  posterior probabilities
              considers all possible alignments of the target sequence to  the
              CM.  Without  --nonbanded  (in HMM banded mode), the calculation
              considers only possible alignments within the HMM bands.
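
              As an illustration of the two-character encoding described
              for -p, the following minimal Python sketch decodes one pair
              of annotation characters into a probability range. The
              function name decode_post is hypothetical and not part of
              Infernal, and rejecting mixed pairs (for example a digit
              paired with "*") is an assumption, since the text above only
              describes matching pairs.

```python
from typing import Optional, Tuple

DIGITS = "0123456789"


def decode_post(tens: str, ones: str) -> Optional[Tuple[float, float]]:
    """Return the (lower, upper) posterior probability range for one residue.

    tens is the character from the "POSTX." line and ones is the character
    from the "POST.X" line. Returns None when the column aligns to a gap.
    """
    if tens == "." and ones == ".":
        return None  # the column aligns to a gap
    if tens == "*" and ones == "*":
        return (1.0, 1.0)  # "very nearly" 1.0
    if len(tens) == 1 and len(ones) == 1 and tens in DIGITS and ones in DIGITS:
        percent = 10 * int(tens) + int(ones)
        return (percent / 100, (percent + 1) / 100)
    # Assumption: any other combination is treated as malformed input.
    raise ValueError(f"unexpected annotation pair: {tens!r}, {ones!r}")
```

              For example, decode_post("8", "3") gives the range from 0.83
              to 0.84 described above.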

       -q     Quiet; suppress the verbose banner, and only print the resulting
              alignment  to  stdout.  This  allows piping the alignment to the
              input of other programs, for example.

       -1     Output the alignment in pfam format, a non-interleaved Stockholm
              format in which each sequence is on a single line.
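
              To make the difference between interleaved and
              non-interleaved Stockholm concrete, here is a minimal Python
              sketch that joins the blocks of an interleaved alignment so
              that each sequence ends up on a single line. It is only an
              illustration of the layout, not how cmalign produces its
              output: the function name to_single_line is hypothetical,
              and the sketch deliberately drops all "#=" markup lines.

```python
def to_single_line(stockholm_text):
    """Map each sequence name to its full aligned row.

    Markup lines (starting with "#"), blank lines, and the "//" terminator
    are skipped; only plain "name  aligned-chunk" lines are joined.
    """
    rows = {}  # dicts preserve insertion order, so sequence order is kept
    for line in stockholm_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "name chunk" line; ignored in this sketch
        name, chunk = parts
        rows[name] = rows.get(name, "") + chunk
    return rows


interleaved = """# STOCKHOLM 1.0

seq1 ACG
seq2 A-G

seq1 UU
seq2 UC
//
"""
for name, row in to_single_line(interleaved).items():
    print(name, row)
```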

       --informat <s>
              Assert  that  the  input  seqfile  is in format <s>.  Do not run
              Babelfish format autodetection. This increases the reliability of
              the  program  somewhat, because the Babelfish can make mistakes;
              particularly recommended for unattended, high-throughput runs of
              Infernal.    Acceptable   formats  are:  FASTA,  EMBL,  UNIPROT,
              GENBANK, and DDBJ.  <s> is case-insensitive.

       --devhelp
              Print help, as with -h , but also include undocumented developer
              options.   These   options  are  not  listed  below,  are  under
              development or experimental, and are not guaranteed to even work
              correctly.  Use  developer  options  at  your own risk. The only
              resources for understanding what they actually do are the  brief
              one-line  description printed when --devhelp is enabled, and the
              source code.

       --mpi  Run as an  MPI  parallel  program.  This  option  will  only  be
              available  if  Infernal  has  been configured and built with the
              "--enable-mpi" flag (see User’s Guide for details).

EXPERT OPTIONS

       --optacc
              Align  sequences  using  the  Durbin/Holmes   optimal   accuracy
              algorithm.  This is default behavior, so this option is probably
              useless.  The optimal accuracy alignment will be constrained  by
              HMM  bands  for  acceleration  unless  the --nonbanded option is
              enabled.   The  optimal  accuracy   algorithm   determines   the
              alignment  that  maximizes  the  posterior  probabilities of the
              aligned residues within it.  The posterior probabilities are
              determined  using  (possibly  HMM banded) variants of the Inside
              and Outside algorithms.

       --cyk  Do not use the Durbin/Holmes optimal accuracy algorithm to align
              the sequences; instead, use the CYK algorithm, which determines
              the optimally scoring alignment of the sequence to the model.

       --sample
              Sample  an  alignment  from  the   posterior   distribution   of
              alignments.   The  posterior distribution is determined using an
              HMM banded (unless --nonbanded) variant of the Inside algorithm.

       -s <n> Set  the  random  number  generator  seed to <n>, where <n> is a
              positive integer. This option can only be  used  in  combination
              with  --sample.   The  default  is  to  use time() to generate a
              different seed for each run, which means that two different runs
              of  cmalign  --sample  on  the same alignment will give slightly
              different  results.  You  can  use  this  option   to   generate
              reproducible results.

       --viterbi
              Do not use the CM to align the sequences; instead, use the HMM
              Viterbi algorithm to align with a CM Plan 9 HMM. The HMM is
              automatically  constructed  to  be  maximally similar to the CM.
              This HMM alignment is faster than CM alignment, but can be  less
              accurate because the structure of the RNA family is ignored.

       --sub  Turn  on the sub model construction and alignment procedure. For
              each sequence, an HMM is first used to predict the  model  start
              and  end consensus columns, and a new sub CM is constructed that
              only models consensus columns from start to end. The sequence is
              then aligned to this sub CM.  This option is useful for aligning
              sequences that are known to be truncated rather than full
              length.  This "sub CM" procedure is not the same as the "sub
              CMs" described by Weinberg and Ruzzo.

       --small
              Use the divide and conquer CYK alignment algorithm described  in
              SR Eddy, BMC Bioinformatics 3:18, 2002. The --nonbanded option
              must be used in combination with this option.  Also, it is
              recommended that --small be used whenever --nonbanded is used,
              because standard CM alignment without HMM banding requires a
              lot of memory, especially for large RNAs.  --small allows CM
              alignment within practical memory limits, reducing the memory
              required for alignment of LSU rRNA, the largest known RNAs,
              from 150 Gb to less than 300 Mb.  This option can only be used
              in combination with --nonbanded and --cyk.

       --hbanded
              This  option  is  turned on by default.  Accelerate alignment by
              pruning away regions  of  the  CM  DP  matrix  that  are  deemed
              negligible by an HMM.  First, each sequence is scored with a CM
              Plan 9 HMM derived from the CM, using the Forward and Backward
              HMM algorithms to calculate the posterior probability that each
              residue aligns to each state of the HMM. These posterior
              probabilities  are  used to derive constraints (bands) on the CM
              DP matrix. Finally, the target sequence is  aligned  to  the  CM
              using the banded DP matrix, during which cells outside the bands
              are ignored. Usually most of the full DP matrix lies outside the
              bands  (often  more  than  95%),  making  this  technique faster
              because fewer DP calculations  are  required,  and  more  memory
              efficient because only cells within the bands need be allocated.

              Importantly, HMM banding sacrifices the guarantee of determining
              the optimally accurate or optimally scoring alignment, which
              will be missed if it lies outside the bands.  The tau parameter
              (analogous to the beta parameter for QDB calculation in
              cmsearch) is the amount of probability mass considered
              negligible during HMM band calculation; higher values of tau
              yield greater speedups but also a greater chance of missing the
              optimal alignment.  The
              default  tau  is 1E-7, determined empirically as a good tradeoff
              between sensitivity and speed, though this value can be  changed
              with  the --tau  <x> option. The level of acceleration increases
              with both the length and primary sequence conservation level  of
              the  family.  For  example,  with  the default tau of 1E-7, tRNA
              models (low primary sequence conservation with length  of  about
              75 residues) show about 10X acceleration, and SSU bacterial rRNA
              models (high primary sequence conservation with length of  about
              1500  residues)  show about 700X.  HMM banding can be turned off
              with the --nonbanded option.
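
              The role of tau can be illustrated with a small Python
              sketch that derives a band from one posterior distribution
              by discarding negligible probability mass from its tails.
              This is a simplified illustration and not Infernal's
              implementation: the function name tail_loss_band is
              hypothetical, and splitting tau evenly between the left and
              right tails is an assumption made for this example.

```python
def tail_loss_band(posteriors, tau):
    """Return (lo, hi), the inclusive index range kept as the band.

    posteriors is a list of probabilities over sequence positions for one
    HMM state. Assumption: at most tau / 2 of the probability mass is
    trimmed from each tail.
    """
    half = tau / 2.0

    lo, mass = 0, 0.0
    while lo < len(posteriors) - 1 and mass + posteriors[lo] <= half:
        mass += posteriors[lo]
        lo += 1

    hi, mass = len(posteriors) - 1, 0.0
    while hi > lo and mass + posteriors[hi] <= half:
        mass += posteriors[hi]
        hi -= 1

    return lo, hi


toy = [0.01, 0.04, 0.90, 0.04, 0.01]
for tau in (1e-7, 0.05, 0.2):
    print(tau, tail_loss_band(toy, tau))
```

              On this toy distribution the band narrows as tau grows,
              which is the tradeoff described above: fewer cells to
              compute, but a greater chance that the optimal alignment
              falls outside the band.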

       --nonbanded
              Turns off HMM banding. The returned alignment is  guaranteed  to
              be  the  globally  optimally  accurate  one  (by default) or the
              globally optimally scoring  one  (if  --cyk  is  enabled).   The
              --small  option  is recommended in combination with this option,
              because standard alignment without HMM banding requires a lot of
              memory (see --small ).

       --tau <x>
              Set  the  tail loss probability used during HMM band calculation
              to <x>.  This is the amount of probability mass within  the  HMM
              posterior  probabilities  that  is  considered  negligible.  The
              default value is 1E-7.  In general, higher values will result in
              greater  acceleration,  but  increase  the chance of missing the
              optimal alignment due to the HMM bands.

       --mxsize <x>
              Set the maximum allowable DP matrix size to  <x>  megabytes.  By
              default  this size is 2,048 Mb.  This should be large enough for
              the vast majority of alignments.  If it is not, cmalign will
              exit prematurely and report an error message that the matrix
              exceeded its maximum allowable size. In this case, the --mxsize
              option can be used to raise the limit.  This is most likely to
              occur when the --nonbanded option is used  without  the  --small
              option, but can still occur when --nonbanded is not used.

       --rna  Output  the  alignments as RNA sequence alignments. This is true
              by default.

       --dna  Output the alignments as DNA sequence alignments.

       --matchonly
              Only include match columns  in  the  output  alignment,  do  not
              include any insertions relative to the consensus model.

       --resonly
              Only  include match columns in the output alignment that have at
              least 1 residue (non-gap character)  in  them.  By  default  all
              match  columns are printed to the alignment, even those that are
              100%  gaps.   --resonly  replicates  the  default  behavior   of
              previous versions of cmalign.
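
              The effect of --resonly can be sketched in Python as a
              column filter over an alignment. The helper below is
              hypothetical and only illustrates the idea; the caller is
              assumed to supply a flag per column saying whether it is a
              match column, and the gap characters "-" and "." are
              assumptions about how gaps appear in the alignment.

```python
def drop_empty_match_columns(rows, is_match, gaps="-."):
    """Remove match columns in which every row has a gap character.

    rows is a list of equal-length aligned strings and is_match holds one
    boolean per column. Insert columns are always kept.
    """
    keep = [
        not (match and all(row[i] in gaps for row in rows))
        for i, match in enumerate(is_match)
    ]
    return ["".join(c for c, k in zip(row, keep) if k) for row in rows]


rows = ["A-C.", "G-U."]
is_match = [True, True, True, False]
print(drop_empty_match_columns(rows, is_match))
```

              Here the second column is an all-gap match column and is
              removed, while the last column is kept because it is not a
              match column.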

       --fins Change  the  behavior  of how insert emissions are placed in the
              alignment.  By default, all contiguous  blocks  of  inserts  are
              split  in  half,  and half the residues are flushed left against
              the nearest consensus column to the left, and half  are  flushed
              right  against  the  nearest consensus column on the right. With
              --fins inserts are not  split  in  half,  instead  all  inserted
              residues  from  IL  states  are  flushed  left, and all inserted
              residues from IR states are flushed  right.   --fins  replicates
              the default behavior of previous versions of cmalign.
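
              The two insert placement rules can be compared with a short
              Python sketch. Both function names are hypothetical, the "."
              padding character is an assumption, and so is the choice to
              put the extra residue in the left half when a block has an
              odd number of residues; the text above does not specify
              that case.

```python
def place_split(inserts, width, gap="."):
    """Default rule: split the block, flush half left and half right."""
    if len(inserts) > width:
        raise ValueError("insert block is longer than the available columns")
    half = (len(inserts) + 1) // 2  # assumption: odd residue goes left
    return inserts[:half] + gap * (width - len(inserts)) + inserts[half:]


def place_fins(il, ir, width, gap="."):
    """--fins rule: IL residues flush left, IR residues flush right."""
    if len(il) + len(ir) > width:
        raise ValueError("insert residues do not fit in the available columns")
    return il + gap * (width - len(il) - len(ir)) + ir


print(place_split("acgu", 8))
print(place_fins("acg", "u", 8))
```

              With the default rule the four residues are divided evenly
              between the two sides, whereas with --fins the side depends
              on which insert state emitted each residue.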

       --onepost
              Modifies  behavior  of  the  -p  option.  Use only one character
              instead of two to annotate the  posterior  probability  of  each
              aligned residue. Specifically, only the "#=GR <seq name> POSTX."
              tag is printed to the alignment. An "8" for "POSTX." indicates a
              posterior  probability between 0.8 and 0.9 for the corresponding
              residue.

       --merge
              With --merge the usage of cmalign  changes  to  cmalign  --merge
              [options] cmfile msafile1 msafile2.  Merge the two alignments in
              msafile1 and msafile2 created by previous runs of  cmalign  with
              cmfile  together into a single alignment and exit.  msafile1 and
              msafile2 must only have one alignment  per  file.   This  option
              allows  the  user  to  split  up  large sequence files into many
              smaller files, align them independently to cmfile  on  different
              computers to get many small alignments, and then merge them into
              a single large alignment.
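
              The first step of the split-and-merge workflow described
              for --merge is dividing a large sequence file into smaller
              ones. A minimal Python sketch for FASTA input follows. The
              function name split_fasta is hypothetical, and the sketch
              assumes every record starts with a ">" header line.

```python
def split_fasta(text, per_chunk):
    """Split FASTA text into chunks of at most per_chunk records each."""
    if per_chunk < 1:
        raise ValueError("per_chunk must be at least 1")
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return [
        "".join(records[i:i + per_chunk])
        for i in range(0, len(records), per_chunk)
    ]


fasta = ">a\nAC\n>b\nGU\n>c\nAA\n"
for number, chunk in enumerate(split_fasta(fasta, 2), start=1):
    print(f"chunk {number}:")
    print(chunk, end="")
```

              Each chunk would then be written to its own file and
              aligned to cmfile separately, and the resulting alignments
              combined two at a time with cmalign --merge.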

       --withali <f>
              Reads an alignment from file <f>  and  aligns  it  as  a  single
              object to the CM; i.e., the alignment in <f> is held fixed.  This
              allows you to align sequences to a model with cmalign  and  view
              them  in  the context of an existing trusted multiple alignment.
              The alignment in the file <f> must be exactly the alignment that
              the  CM  was  built  from,  or a subset of it with the following
              special  property:  the  definition  of  consensus  columns  and
              consensus  secondary structure must be identical between <f> and
              the alignment the CM was built from. One  easy  way  to  achieve
              this  is  to  use  the  --rf option to cmbuild (see man page for
              cmbuild ) and to  maintain  the  "#=GC  RF"  annotation  in  the
              alignment when removing sequences to create the subset alignment
              <f>.  To specify that the  --rf  option  to  cmbuild  was  used,
              enable the --rf option to cmalign (see --rf below).

       --withpknots
              Must be used in combination with --withali <f>.  Propagate
              structural information for any pseudoknots that exist in <f>  to
              the output alignment.

       --rf   Must  be  used  in combination with --withali <f>.  Specify that
              the alignment in <f> has the same "#=GC RF"  annotation  as  the
              alignment  file  the CM was built from using cmbuild and further
              that the --rf option was supplied to cmbuild  when  the  CM  was
              constructed.

       --gapthresh <x>
              Must  be  used  in combination with --withali <f>.  Specify that
              the --gapthresh <x> option was supplied to cmbuild when  the  CM
              was constructed from the alignment file <f>.

       --tfile <f>
              Dump tabular sequence tracebacks for each individual sequence to
              a file <f>.  Primarily useful for debugging.
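
       Several options above are documented as valid only in combination
       with other options. When cmalign command lines are generated by a
       script, these rules can be checked before launching a run. The Python
       sketch below is an illustration only: the table and the function name
       missing_requirements are hypothetical, the table covers just the
       combinations stated in this page, and it does not call cmalign.

```python
# Option -> options it must be combined with, as stated in this page.
REQUIRES = {
    "--small": {"--nonbanded", "--cyk"},
    "-s": {"--sample"},
    "--withpknots": {"--withali"},
    "--rf": {"--withali"},
    "--gapthresh": {"--withali"},
}


def missing_requirements(options):
    """Return {option: [missing companion options]} for the given options."""
    chosen = set(options)
    problems = {}
    for option, needed in REQUIRES.items():
        if option in chosen and not needed <= chosen:
            problems[option] = sorted(needed - chosen)
    return problems


print(missing_requirements(["--small", "--nonbanded"]))
print(missing_requirements(["--withali", "--rf"]))
```

       An empty result means none of the listed rules is violated; it does
       not guarantee that cmalign will accept the command line.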

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See the  file  COPYING  that  came  with  the  source  for  details  on
       redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/
