hmmbuild - build a profile HMM from an alignment

NAME

       hmmbuild - build a profile HMM from an alignment

SYNOPSIS

       hmmbuild [options] hmmfile alignfile

DESCRIPTION

       hmmbuild reads a multiple sequence alignment file alignfile, builds a
       new profile HMM, and saves the HMM in hmmfile.

       alignfile may be in ClustalW, GCG MSF,  SELEX,  Stockholm,  or  aligned
       FASTA alignment format. The format is automatically detected.

       By  default, the model is configured to find one or more nonoverlapping
       alignments to the  complete  model:  multiple  global  alignments  with
       respect  to the model, and local with respect to the sequence.  This is
       analogous to the  behavior  of  the  hmmls  program  of  HMMER  1.   To
       configure  the  model for multiple local alignments with respect to the
       model and local with respect to the sequence,  a  la  the  old  program
       hmmfs, use the -f (fragment) option. More rarely, you may want to
       configure the model for a single global alignment (global with respect
       to both model and sequence) using the -g option, or for a single
       local/local alignment (a la standard Smith/Waterman, or the old hmmsw
       program) using the -s option.
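
       The four configurations above can be summarized as a small lookup
       table. The following Python sketch assembles an argument list that
       follows the SYNOPSIS; the mode names and the helper hmmbuild_argv are
       hypothetical labels invented for this illustration and are not part
       of HMMER.

```python
# Map each alignment mode described above to its hmmbuild flag.
# The dictionary keys are labels made up for this sketch.
MODE_FLAGS = {
    "multi_global": None,   # default: multiple hits, global w.r.t. the model
    "multi_local": "-f",    # multiple hits, each local w.r.t. the model
    "single_global": "-g",  # one hit, global w.r.t. model and sequence
    "single_local": "-s",   # one hit, local/local (Smith/Waterman-like)
}


def hmmbuild_argv(hmmfile, alignfile, mode="multi_global"):
    """Build an argv list of the form: hmmbuild [options] hmmfile alignfile."""
    argv = ["hmmbuild"]
    flag = MODE_FLAGS[mode]  # raises KeyError for an unknown mode
    if flag is not None:
        argv.append(flag)
    argv.extend([hmmfile, alignfile])
    return argv
```

       The resulting list could be passed to subprocess.run() on a system
       where hmmbuild is installed; the file names you supply are your own.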

OPTIONS

       -f     Configure  the  model for finding multiple domains per sequence,
              where each domain can be a local (fragmentary)  alignment.  This
              is analogous to the old hmmfs program of HMMER 1.

       -g     Configure  the  model for finding a single global alignment to a
              target sequence, analogous to the old hmms program of HMMER 1.

       -h     Print brief help; includes version number  and  summary  of  all
              options, including expert options.

       -n <s> Name this HMM <s>. <s> can be any string of non-whitespace
              characters (i.e. one "word"). There is no length limit (at
              least not one imposed by HMMER; your shell will complain about
              command line lengths first).

       -o <f> Re-save the starting alignment to <f>, in Stockholm format.  The
              columns  which were assigned to match states will be marked with
              x’s in an #=RF annotation line.  If either the --hand or  --fast
              construction  options  were  chosen, the alignment may have been
              slightly altered to be compatible with Plan  7  transitions,  so
              saving  the  final  alignment  and  comparing  to  the  starting
              alignment can let you view these alterations.   See  the  User’s
              Guide for more information on this arcane side effect.

       -s     Configure  the  model  for  finding a single local alignment per
              target   sequence.   This   is   analogous   to   the   standard
              Smith/Waterman algorithm or the hmmsw program of HMMER 1.

       -A     Append  this  model  to an existing hmmfile rather than creating
              hmmfile.  Useful for building HMM libraries (like Pfam).

       -F     Force overwriting of an existing hmmfile.  Otherwise HMMER  will
              refuse to clobber your existing HMM files, for safety’s sake.

EXPERT OPTIONS

       --amino
              Force  the  sequence  alignment  to be interpreted as amino acid
              sequences. Normally HMMER autodetects whether the  alignment  is
              protein  or  DNA,  but  sometimes  alignments  are so small that
              autodetection is ambiguous. See --nucleic.

       --archpri <x>
              Set  the  "architecture  prior"   used   by   MAP   architecture
              construction to <x>, where <x> is a probability between 0 and 1.
              This parameter governs a geometric prior distribution over model
              lengths.  As  <x> increases, longer models are favored a priori.
              As <x> decreases, it takes more residue conservation in a column
              to  make  a  column  a  "consensus"  match  column  in the model
              architecture.  The 0.85 default has been chosen empirically as a
              reasonable setting.
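
              To give a feel for how such a prior behaves, here is a rough
              Python sketch. This page does not state the exact formula, so
              the sketch assumes the prior on a model with M match columns is
              proportional to <x> raised to the power M. That form, and the
              function names, are illustrative assumptions rather than the
              HMMER implementation.

```python
import math


def log_length_prior(model_length, archpri=0.85):
    """Unnormalized log prior for a model with `model_length` match columns.

    Assumes P(M) is proportional to archpri**M. This geometric form is an
    assumption made for illustration; it is not taken from the HMMER source.
    """
    if not 0.0 < archpri < 1.0:
        raise ValueError("archpri must be strictly between 0 and 1")
    return model_length * math.log(archpri)


def extra_column_cost(archpri=0.85):
    """Log-prior penalty paid for each additional match column."""
    if not 0.0 < archpri < 1.0:
        raise ValueError("archpri must be strictly between 0 and 1")
    return -math.log(archpri)
```

              Under this assumption the per-column penalty shrinks as <x>
              approaches 1, which matches the behavior described above: a
              larger <x> makes longer models cheaper, and a smaller <x>
              demands more evidence before a column earns a match state.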

       --binary
              Write  the  HMM  to  hmmfile  in  HMMER binary format instead of
              readable ASCII text.

       --cfile <f>
              Save the observed emission and transition counts to <f> after
              the architecture has been determined (i.e. after residues/gaps
              have been assigned to match, delete, and insert states). This
              option  is  used  in HMMER development for generating data files
              useful for training new Dirichlet priors. The  format  of  count
              files is documented in the User’s Guide.

       --fast Quickly  and  heuristically  determine  the  architecture of the
              model by assigning all columns with more than a certain fraction
              of  gap characters to insert states. By default this fraction is
              0.5, and it can be  changed  using  the  --gapmax  option.   The
              default  construction  algorithm is a maximum a posteriori (MAP)
              algorithm, which is slower.
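
              The gap-fraction rule can be sketched in a few lines of Python.
              This is a minimal illustration, not HMMER's code: the set of
              gap characters, the use of unweighted sequence counts, and the
              function name are all assumptions.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols; adjust for your alignment


def fast_match_columns(aligned_seqs, gapmax=0.5):
    """Return 0-based indices of columns assigned to match (consensus) states.

    A column whose gap fraction is strictly greater than `gapmax` is treated
    as an insert column; every other column becomes a match column.
    """
    if not aligned_seqs:
        return []
    ncols = len(aligned_seqs[0])
    if any(len(s) != ncols for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    nseq = len(aligned_seqs)
    match_cols = []
    for col in range(ncols):
        gaps = sum(1 for s in aligned_seqs if s[col] in GAP_CHARS)
        if gaps / nseq <= gapmax:
            match_cols.append(col)
    return match_cols
```

              Raising the threshold keeps more columns as match columns and
              so yields a longer model; lowering it has the opposite effect.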

       --gapmax <x>
              Controls the --fast model construction algorithm; has no effect
              unless --fast is also used. If a column has more than a
              fraction <x> of gap symbols in it, it is treated as an insert
              column. <x> is a fraction from 0 to 1, and by default is set
              to 0.5. Higher values of <x> mean more columns get assigned to
              consensus, and models get longer; smaller values of <x> mean
              fewer columns get assigned to consensus, and models get shorter.

       --hand Specify  the  architecture  of  the model by hand: the alignment
              file must be in SELEX or Stockholm  format,  and  the  reference
              annotation line (#=RF in SELEX, #=GC RF in Stockholm) is used to
              specify the architecture.  Any  column  marked  with  a  non-gap
              symbol (such as an ’x’, for instance) is assigned as a consensus
              (match) column in the model.
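
              As an illustration of how a reference line marks the
              architecture, the Python sketch below collects the #=GC RF
              annotation from Stockholm-format text and reports which columns
              carry a non-gap symbol. The gap character set and the function
              name are assumptions made for this example, and real Stockholm
              parsing handles more cases than shown here.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols for the RF line


def rf_match_columns(stockholm_text):
    """Return 0-based indices of columns marked non-gap in the #=GC RF line."""
    rf = ""
    for line in stockholm_text.splitlines():
        fields = line.split()
        # A '#=GC RF' line has three whitespace-separated fields. In an
        # interleaved file the annotation continues across blocks, so the
        # pieces are concatenated in the order they appear.
        if len(fields) == 3 and fields[0] == "#=GC" and fields[1] == "RF":
            rf += fields[2]
    return [i for i, ch in enumerate(rf) if ch not in GAP_CHARS]
```

              Every index returned would become a consensus (match) column;
              all remaining columns would be treated as insertions.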

       --idlevel <x>
              Controls both the determination of effective sequence number and
              the  behavior  of  the  --wblosum weighting option. The sequence
              alignment is clustered by percent identity, and  the  number  of
              clusters  at  a cutoff threshold of <x> is used to determine the
              effective sequence number.   Higher  values  of  <x>  give  more
              clusters  and higher effective sequence numbers; lower values of
              <x> give fewer clusters and lower  effective  sequence  numbers.
              <x>  is  a  fraction  from 0 to 1, and by default is set to 0.62
              (corresponding to the clustering level used in constructing  the
              BLOSUM62 substitution matrix).
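
              The relationship between the cutoff and the cluster count can
              be sketched as follows. This Python example assumes
              single-linkage clustering and one particular definition of
              pairwise identity (matches over columns where both sequences
              have a residue); neither detail is specified on this page, and
              the function names are hypothetical.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols


def fractional_identity(a, b):
    """Identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x.upper() == y.upper() for x, y in pairs) / len(pairs)


def count_clusters(aligned_seqs, idlevel=0.62):
    """Count single-linkage clusters at an identity cutoff of `idlevel`.

    Two sequences are linked when their identity is at least `idlevel`;
    a cluster is a connected group of linked sequences.
    """
    n = len(aligned_seqs)
    parent = list(range(n))  # union-find forest, one node per sequence

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if fractional_identity(aligned_seqs[i], aligned_seqs[j]) >= idlevel:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

              With a higher cutoff fewer pairs are linked, so the same
              alignment splits into more clusters, in line with the
              description above.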

       --informat <s>
              Assert that the input alignfile is in format <s>; do not run
              Babelfish format autodetection. This increases the reliability of
              the  program  somewhat, because the Babelfish can make mistakes;
              particularly recommended for unattended, high-throughput runs of
              HMMER.  Valid  format strings include FASTA, GENBANK, EMBL, GCG,
              PIR, STOCKHOLM, SELEX, MSF, CLUSTAL, and PHYLIP. See the  User’s
              Guide for a complete list.

       --noeff
              Turn  off the effective sequence number calculation, and use the
              true number of sequences instead. This will usually  reduce  the
              sensitivity  of  the  final  model  (so don’t do it without good
              reason!)

       --nucleic
              Force the alignment to be interpreted as nucleic acid  sequence,
              either  RNA  or  DNA.  Normally  HMMER  autodetects  whether the
              alignment is protein or DNA, but  sometimes  alignments  are  so
              small that autodetection is ambiguous. See --amino.

       --null <f>
              Read  a  null model from <f>.  The default for protein is to use
              average amino acid  frequencies  from  Swissprot  34  and  p1  =
              350/351;  for  nucleic acid, the default is to use 0.25 for each
              base and p1 = 1000/1001. For documentation of the format of  the
              null model file and further explanation of how the null model is
              used, see the User’s Guide.

       --pam <f>
              Apply a heuristic PAM- (substitution  matrix-)  based  prior  on
              match  emission  probabilities  instead  of  the default mixture
              Dirichlet. The  substitution  matrix  is  read  from  <f>.   See
              --pamwgt.

              The default Dirichlet state transition prior and insert emission
              prior are unaffected. Therefore in principle you could combine
              --prior with --pam, but this isn’t recommended, as it hasn’t been
              tested. (--pam itself hasn’t been tested much!)

       --pamwgt <x>
              Controls the weight on a PAM-based prior.  Only  has  effect  if
              --pam  option  is  also  in use.  <x> is a positive real number,
              20.0 by default. <x> is the number of "pseudocounts"
              contributed by the heuristic prior. Very high values of <x> can
              force  a  scoring  system  that  is  entirely  driven   by   the
              substitution  matrix, making HMMER somewhat approximate Gribskov
              profiles.

       --pbswitch <n>
              For alignments with a very large number of sequences,  the  GSC,
              BLOSUM,  and  Voronoi weighting schemes are slow; they’re O(N^2)
              for N sequences. Henikoff position-based  weights  (PB  weights)
              are  more  efficient.  At  or above a certain threshold sequence
              number <n> hmmbuild will switch from  GSC,  BLOSUM,  or  Voronoi
              weights to PB weights. To disable this switching behavior (at
              the cost of compute time), set <n> to be something larger than
              the number of sequences in your alignment. <n> is a positive
              integer; the default is 1000.

       --prior <f>
              Read a Dirichlet prior from <f>, replacing the  default  mixture
              Dirichlet.   The  format  of  prior  files  is documented in the
              User’s Guide, and an example is given in the Demos directory  of
              the HMMER distribution.

       --swentry <x>
              Controls  the  total  probability  that  is distributed to local
              entries into the model, versus starting at the beginning of  the
              model  as in a global alignment.  <x> is a probability from 0 to
              1, and by default is set to 0.5.  Higher values of <x> mean that
              hits  that  are  fragments on their left (N or 5’-terminal) side
              will be penalized less, but complete global alignments  will  be
              penalized  more.  Lower values of <x> mean that fragments on the
              left will be penalized more, and global alignments on this  side
              will  be  favored.   This option only affects the configurations
              that allow local alignments, e.g.  -s  and  -f;  unless  one  of
              these options is also activated, this option has no effect.  You
              have independent control over  local/global  alignment  behavior
              for  the  N/C  (5’/3’)  termini  of  your target sequences using
              --swentry and --swexit.
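
              A rough Python sketch of how an entry probability of <x> might
              be shared out over a model of M nodes is given below. It
              assumes the first node keeps 1 - <x> and the remainder is split
              evenly over the other nodes; the even split is an assumption of
              this sketch and is not stated on this page. The function name
              is hypothetical.

```python
def entry_distribution(model_length, swentry=0.5):
    """Return entry probabilities for nodes 1..model_length as a list.

    Element 0 is node 1 (the global-style start), which keeps 1 - swentry.
    The remaining swentry is divided evenly over nodes 2..M; the even
    division is an assumption made for illustration.
    """
    if model_length < 1:
        raise ValueError("model_length must be at least 1")
    if not 0.0 <= swentry <= 1.0:
        raise ValueError("swentry must be between 0 and 1")
    if model_length == 1:
        return [1.0]  # no internal nodes to enter at
    internal = swentry / (model_length - 1)
    return [1.0 - swentry] + [internal] * (model_length - 1)
```

              Under this assumption, raising <x> moves probability away from
              the first node and toward internal entry points, which is why
              left-side fragments are penalized less and full-length starts
              are penalized more.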

       --swexit <x>
              Controls the total probability  that  is  distributed  to  local
              exits  from  the model, versus ending an alignment at the end of
              the model as in a global alignment.  <x> is a probability from 0
              to  1,  and by default is set to 0.5.  Higher values of <x> mean
              that hits that are fragments on their right (C  or  3’-terminal)
              side will be penalized less, but complete global alignments will
              be penalized more.  Lower values of <x> mean that  fragments  on
              the  right will be penalized more, and global alignments on this
              side  will  be  favored.    This   option   only   affects   the
              configurations  that  allow  local  alignments, e.g.  -s and -f;
              unless one of these options is also activated, this  option  has
              no  effect.   You  have  independent  control  over local/global
              alignment behavior for the N/C (5’/3’) termini  of  your  target
              sequences using --swentry and --swexit.

       --verbose
              Print  more possibly useful stuff, such as the individual scores
              for each sequence in the alignment.

       --wblosum
              Use the BLOSUM filtering  algorithm  to  weight  the  sequences,
              instead  of  the  default.   Cluster  the  sequences  at a given
              percentage identity (see --idlevel); assign each cluster a total
              weight  of  1.0, distributed equally amongst the members of that
              cluster.
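
              The weight assignment itself is simple once the clusters are
              known. The Python sketch below takes one cluster label per
              sequence (how the labels are obtained is outside this example)
              and returns the per-sequence weights; the function name is
              hypothetical.

```python
from collections import Counter


def blosum_weights(cluster_ids):
    """Given one cluster label per sequence, return one weight per sequence.

    Each cluster receives a total weight of 1.0, shared equally among
    its members.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]
```

              The weights therefore sum to the number of clusters, so a
              group of near-identical sequences counts about as much as a
              single divergent one.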

       --wgsc Use the Gerstein/Sonnhammer/Chothia ad  hoc  sequence  weighting
              algorithm.  This  is  already the default, so this option has no
              effect (unless it follows another option in the --w  family,  in
              which case it overrides it).

       --wme  Use  the  Krogh/Mitchison  maximum entropy algorithm to "weight"
              the sequences. This supersedes the Eddy/Mitchison/Durbin maximum
              discrimination  algorithm,  which gives almost identical weights
              but is less robust.  ME  weighting  seems  to  give  a  marginal
              increase  in sensitivity over the default GSC weights, but takes
              a fair amount of time.

       --wnone
              Turn off all sequence weighting.

       --wpb  Use the Henikoff position-based weighting scheme.
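
              For readers unfamiliar with the scheme, here is a minimal
              Python sketch of position-based weighting as it is commonly
              described: in each column, a sequence earns 1/(k*n), where k is
              the number of distinct residue types in the column and n is the
              number of sequences sharing its residue. Ignoring gaps and
              rescaling the weights to sum to the number of sequences are
              choices made for this sketch, not details taken from this page.

```python
from collections import Counter

GAP_CHARS = set("-._~")  # assumed gap symbols


def position_based_weights(aligned_seqs):
    """Position-based sequence weights, rescaled to sum to len(aligned_seqs)."""
    nseq = len(aligned_seqs)
    if nseq == 0:
        return []
    ncols = len(aligned_seqs[0])
    if any(len(s) != ncols for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    weights = [0.0] * nseq
    for col in range(ncols):
        column = [s[col].upper() for s in aligned_seqs]
        # Count residues only; gap characters never enter the counter.
        counts = Counter(ch for ch in column if ch not in GAP_CHARS)
        ntypes = len(counts)
        for i, ch in enumerate(column):
            if ch in counts:
                weights[i] += 1.0 / (ntypes * counts[ch])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq  # alignment of gaps only: fall back to uniform
    return [w * nseq / total for w in weights]
```

              Each column costs one pass over the sequences, so the work
              grows linearly with the number of sequences rather than
              quadratically.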

       --wvoronoi
              Use the Sibbald/Argos Voronoi sequence  weighting  algorithm  in
              place of the default GSC weighting.

COPYRIGHT

       Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
       Freely distributed under the GNU General Public License (GPL).
       See the file COPYING in your distribution for details on redistribution
       conditions.

AUTHOR

       Sean Eddy
       HHMI/Dept. of Genetics
       Washington Univ. School of Medicine
       4566 Scott Ave.
       St Louis, MO 63110 USA
       http://www.genetics.wustl.edu/eddy/
