NAME
hmmbuild - build a profile HMM from an alignment
SYNOPSIS
hmmbuild [options] hmmfile alignfile
DESCRIPTION
hmmbuild reads a multiple sequence alignment file alignfile, builds a
new profile HMM, and saves the HMM in hmmfile.
alignfile may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned
FASTA alignment format. The format is automatically detected.
By default, the model is configured to find one or more nonoverlapping
alignments to the complete model: multiple global alignments with
respect to the model, and local with respect to the sequence. This is
analogous to the behavior of the hmmls program of HMMER 1. To
configure the model for multiple local alignments with respect to the
model and local with respect to the sequence, a la the old program
hmmfs, use the -f (fragment) option. More rarely, you may want to
configure the model for a single global alignment (global with respect
to both model and sequence) using the -g option, or for a single
local/local alignment (a la standard Smith/Waterman, or the old hmmsw
program) using the -s option.
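The four configurations described above can be summarized as a small lookup table. The following Python sketch is only an illustration of the text; the names ALIGNMENT_MODES and describe_mode are hypothetical and are not part of HMMER.

```python
# Alignment modes selected by hmmbuild's configuration flags, as described
# in this man page. Each value is:
#   (hits per sequence, alignment w.r.t. model, alignment w.r.t. sequence)
# The key None stands for "no mode flag given" (the default).
ALIGNMENT_MODES = {
    None: ("multiple", "global", "local"),  # default, analogous to hmmls
    "-f": ("multiple", "local", "local"),   # fragments, analogous to hmmfs
    "-g": ("single", "global", "global"),   # analogous to hmms
    "-s": ("single", "local", "local"),     # Smith/Waterman, analogous to hmmsw
}


def describe_mode(flag=None):
    """Return a one-line description of the mode selected by `flag`."""
    hits, wrt_model, wrt_seq = ALIGNMENT_MODES[flag]
    return f"{hits} hit(s); {wrt_model} w.r.t. model; {wrt_seq} w.r.t. sequence"


print(describe_mode())      # default configuration
print(describe_mode("-f"))  # fragment mode
```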
OPTIONS
-f Configure the model for finding multiple domains per sequence,
where each domain can be a local (fragmentary) alignment. This
is analogous to the old hmmfs program of HMMER 1.
-g Configure the model for finding a single global alignment to a
target sequence, analogous to the old hmms program of HMMER 1.
-h Print brief help; includes version number and summary of all
options, including expert options.
-n <s> Name this HMM <s>. <s> can be any string of non-whitespace
characters (e.g. one "word"). There is no length limit (at
least not one imposed by HMMER; your shell will complain about
command line lengths first).
-o <f> Re-save the starting alignment to <f>, in Stockholm format. The
columns which were assigned to match states will be marked with
x’s in an #=RF annotation line. If either the --hand or --fast
construction options were chosen, the alignment may have been
slightly altered to be compatible with Plan 7 transitions, so
saving the final alignment and comparing to the starting
alignment can let you view these alterations. See the User’s
Guide for more information on this arcane side effect.
-s Configure the model for finding a single local alignment per
target sequence. This is analogous to the standard
Smith/Waterman algorithm or the hmmsw program of HMMER 1.
-A Append this model to an existing hmmfile rather than creating
hmmfile. Useful for building HMM libraries (like Pfam).
-F Force overwriting of an existing hmmfile. Otherwise HMMER will
refuse to clobber your existing HMM files, for safety’s sake.
EXPERT OPTIONS
--amino
Force the sequence alignment to be interpreted as amino acid
sequences. Normally HMMER autodetects whether the alignment is
protein or DNA, but sometimes alignments are so small that
autodetection is ambiguous. See --nucleic.
--archpri <x>
Set the "architecture prior" used by MAP architecture
construction to <x>, where <x> is a probability between 0 and 1.
This parameter governs a geometric prior distribution over model
lengths. As <x> increases, longer models are favored a priori.
As <x> decreases, it takes more residue conservation in a column
to make a column a "consensus" match column in the model
architecture. The 0.85 default has been chosen empirically as a
reasonable setting.
--binary
Write the HMM to hmmfile in HMMER binary format instead of
readable ASCII text.
--cfile <f>
Save the observed emission and transition counts to <f> after
the architecture has been determined (e.g. after residues/gaps
have been assigned to match, delete, and insert states). This
option is used in HMMER development for generating data files
useful for training new Dirichlet priors. The format of count
files is documented in the User’s Guide.
--fast Quickly and heuristically determine the architecture of the
model by assigning all columns with more than a certain fraction
of gap characters to insert states. By default this fraction is
0.5, and it can be changed using the --gapmax option. The
default construction algorithm is a maximum a posteriori (MAP)
algorithm, which is slower.
--gapmax <x>
Controls the --fast model construction algorithm, but if --fast
is not being used, has no effect. If a column has more than a
fraction <x> of gap symbols in it, it gets assigned to an insert
column. <x> is a frequency from 0 to 1, and by default is set
to 0.5. Higher values of <x> mean more columns get assigned to
consensus, and models get longer; smaller values of <x> mean
fewer columns get assigned to consensus, and models get smaller.
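As a rough illustration of the gap-fraction rule, the Python sketch below marks each alignment column as match or insert. It is not HMMER's code: it assumes unweighted sequences and an assumed set of gap characters (GAP_CHARS), and the function name fast_architecture is hypothetical.

```python
# Assumed gap symbols; the set HMMER actually recognizes may differ.
GAP_CHARS = set("-._~")


def fast_architecture(alignment, gapmax=0.5):
    """Return a list of booleans, True where a column becomes a match column.

    `alignment` is a list of equal-length aligned sequence strings.
    A column with MORE than a fraction `gapmax` of gap symbols is
    assigned to insert (False); all other columns are match (True).
    """
    nseq = len(alignment)
    ncol = len(alignment[0])
    is_match = []
    for col in range(ncol):
        gaps = sum(1 for seq in alignment if seq[col] in GAP_CHARS)
        is_match.append(gaps / nseq <= gapmax)
    return is_match


aln = ["ACD-E", "AC--E", "A---E", "ACDFE"]
print(fast_architecture(aln))              # default gapmax of 0.5
print(fast_architecture(aln, gapmax=0.2))  # stricter: fewer match columns
```

In this toy alignment, lowering gapmax from 0.5 to 0.2 turns two more columns into insert columns, so the model gets shorter, which matches the behavior described for --gapmax.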
--hand Specify the architecture of the model by hand: the alignment
file must be in SELEX or Stockholm format, and the reference
annotation line (#=RF in SELEX, #=GC RF in Stockholm) is used to
specify the architecture. Any column marked with a non-gap
symbol (such as an ’x’, for instance) is assigned as a consensus
(match) column in the model.
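The rule "any non-gap symbol in the reference line marks a match column" can be sketched in Python as follows. This is an illustration only: the gap character set and both function names are assumptions, and the parser handles just the simple single-block Stockholm case shown.

```python
# Assumed gap symbols; the set HMMER actually recognizes may differ.
GAP_CHARS = set("-._~")


def parse_stockholm_rf(text):
    """Concatenate the '#=GC RF' annotation found in Stockholm-format text."""
    parts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "#=GC" and fields[1] == "RF":
            parts.append(fields[2])
    return "".join(parts)


def match_columns_from_rf(rf_line):
    """Return 0-based indices of columns marked with a non-gap symbol."""
    return [i for i, ch in enumerate(rf_line) if ch not in GAP_CHARS]


# A toy alignment, invented for this example.
STOCKHOLM = """# STOCKHOLM 1.0
seq1      ACD-E
seq2      AC-FE
#=GC RF   xx..x
//
"""

rf = parse_stockholm_rf(STOCKHOLM)
print(rf)
print(match_columns_from_rf(rf))
```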
--idlevel <x>
Controls both the determination of effective sequence number and
the behavior of the --wblosum weighting option. The sequence
alignment is clustered by percent identity, and the number of
clusters at a cutoff threshold of <x> is used to determine the
effective sequence number. Higher values of <x> give more
clusters and higher effective sequence numbers; lower values of
<x> give fewer clusters and lower effective sequence numbers.
<x> is a fraction from 0 to 1, and by default is set to 0.62
(corresponding to the clustering level used in constructing the
BLOSUM62 substitution matrix).
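To make the effect of the threshold concrete, here is a minimal Python sketch that counts clusters at a given identity cutoff. It assumes single-linkage clustering and one particular definition of pairwise identity (identical residues divided by columns where both sequences have a residue); HMMER's own definitions may differ, and all names here are hypothetical.

```python
# Assumed gap symbols; the set HMMER actually recognizes may differ.
GAP_CHARS = set("-._~")


def fractional_identity(a, b):
    """Identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def count_clusters(alignment, idlevel=0.62):
    """Single-linkage cluster count: link pairs with identity >= idlevel."""
    n = len(alignment)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if fractional_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


aln = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGAAAA", "WWWWWWWWWW"]
for cutoff in (0.5, 0.62, 0.95):
    print(cutoff, count_clusters(aln, cutoff))
```

In this toy alignment the cluster count rises as the cutoff rises, which is the trend described above: higher values of <x> give more clusters and a higher effective sequence number.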
--informat <s>
Assert that the input alignfile is in format <s>; do not run
Babelfish format autodetection. This increases the reliability of
the program somewhat, because the Babelfish can make mistakes;
particularly recommended for unattended, high-throughput runs of
HMMER. Valid format strings include FASTA, GENBANK, EMBL, GCG,
PIR, STOCKHOLM, SELEX, MSF, CLUSTAL, and PHYLIP. See the User’s
Guide for a complete list.
--noeff
Turn off the effective sequence number calculation, and use the
true number of sequences instead. This will usually reduce the
sensitivity of the final model (so don’t do it without good
reason!).
--nucleic
Force the alignment to be interpreted as nucleic acid sequence,
either RNA or DNA. Normally HMMER autodetects whether the
alignment is protein or DNA, but sometimes alignments are so
small that autodetection is ambiguous. See --amino.
--null <f>
Read a null model from <f>. The default for protein is to use
average amino acid frequencies from Swissprot 34 and p1 =
350/351; for nucleic acid, the default is to use 0.25 for each
base and p1 = 1000/1001. For documentation of the format of the
null model file and further explanation of how the null model is
used, see the User’s Guide.
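One way to read the default p1 values is as the self-transition probability of a geometric length distribution. Under that assumption (see the User's Guide for how HMMER actually uses p1), the mean number of self-transitions is p1 / (1 - p1). The short Python check below uses exact fractions; the function name is hypothetical.

```python
from fractions import Fraction


def expected_self_transitions(p1):
    """Mean number of times a self-loop with probability p1 is taken
    before leaving the state (mean of a geometric distribution)."""
    return p1 / (1 - p1)


protein = expected_self_transitions(Fraction(350, 351))
nucleic = expected_self_transitions(Fraction(1000, 1001))
print(protein)  # 350
print(nucleic)  # 1000
```

So the protein default corresponds to a typical length on the order of 350 residues, and the nucleic acid default to about 1000 bases.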
--pam <f>
Apply a heuristic PAM- (substitution matrix-) based prior on
match emission probabilities instead of the default mixture
Dirichlet. The substitution matrix is read from <f>. See
--pamwgt.
The default Dirichlet state transition prior and insert emission
prior are unaffected. Therefore in principle you could combine
--prior with --pam but this isn’t recommended, as it hasn’t been
tested. (--pam itself hasn’t been tested much!)
--pamwgt <x>
Controls the weight on a PAM-based prior. Only has effect if
--pam option is also in use. <x> is a positive real number,
20.0 by default. <x> is the number of "pseudocounts"
contributed by the heuristic prior. Very high values of <x> can
force a scoring system that is entirely driven by the
substitution matrix, making HMMER somewhat approximate Gribskov
profiles.
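The effect of the pseudocount weight can be seen in a simplified Python sketch. It assumes the generic pseudocount estimate (count + weight * prior) / (total + weight) with a fixed prior distribution. How HMMER derives the prior from the substitution matrix is not shown, and the two-letter alphabet, the numbers, and the function name are purely illustrative.

```python
def estimate_with_pseudocounts(counts, prior, weight=20.0):
    """Blend observed counts with `weight` pseudocounts drawn from `prior`.

    counts: dict of symbol -> observed (possibly weighted) count
    prior:  dict of symbol -> prior probability, summing to 1
    """
    total = sum(counts.values())
    return {a: (counts.get(a, 0.0) + weight * prior[a]) / (total + weight)
            for a in prior}


counts = {"A": 10.0, "B": 0.0}   # toy two-letter alphabet
prior = {"A": 0.5, "B": 0.5}

print(estimate_with_pseudocounts(counts, prior, weight=20.0))
print(estimate_with_pseudocounts(counts, prior, weight=2000.0))
```

With a weight of 20 the observed counts still pull the estimate for A well above the prior. With a weight of 2000 the estimate is almost exactly the prior, which is the sense in which very high values make the scoring system matrix-driven.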
--pbswitch <n>
For alignments with a very large number of sequences, the GSC,
BLOSUM, and Voronoi weighting schemes are slow; they’re O(N^2)
for N sequences. Henikoff position-based weights (PB weights)
are more efficient. At or above a certain threshold sequence
number <n> hmmbuild will switch from GSC, BLOSUM, or Voronoi
weights to PB weights. To disable this switching behavior (at
the cost of compute time), set <n> to be something larger than
the number of sequences in your alignment. <n> is a positive
integer; the default is 1000.
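The switching rule can be written as a small decision function. This Python sketch only restates the text above: the scheme labels and the function name are hypothetical, and leaving other schemes untouched is an assumption based on the option's description.

```python
# Schemes that the text above describes as slow for large alignments.
SLOW_SCHEMES = {"gsc", "blosum", "voronoi"}


def effective_weighting(scheme, nseq, pbswitch=1000):
    """Return the weighting scheme actually used for `nseq` sequences.

    At or above `pbswitch` sequences, the slow schemes are replaced by
    position-based ("pb") weights; any other scheme is returned unchanged.
    """
    if scheme in SLOW_SCHEMES and nseq >= pbswitch:
        return "pb"
    return scheme


print(effective_weighting("gsc", 999))    # below the threshold
print(effective_weighting("gsc", 1000))   # at the threshold
print(effective_weighting("voronoi", 5000, pbswitch=10000))  # switch disabled
```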
--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture
Dirichlet. The format of prior files is documented in the
User’s Guide, and an example is given in the Demos directory of
the HMMER distribution.
--swentry <x>
Controls the total probability that is distributed to local
entries into the model, versus starting at the beginning of the
model as in a global alignment. <x> is a probability from 0 to
1, and by default is set to 0.5. Higher values of <x> mean that
hits that are fragments on their left (N or 5’-terminal) side
will be penalized less, but complete global alignments will be
penalized more. Lower values of <x> mean that fragments on the
left will be penalized more, and global alignments on this side
will be favored. This option only affects the configurations
that allow local alignments, i.e. -s and -f; unless one of
these options is also activated, this option has no effect. You
have independent control over local/global alignment behavior
for the N/C (5’/3’) termini of your target sequences using
--swentry and --swexit.
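A simplified picture of what --swentry controls: a total probability <x> is shared among internal entry points, and the remaining 1 - <x> goes to starting at the first model position. The Python sketch below assumes the internal share is spread uniformly and ignores delete-state paths; both are assumptions made for illustration, not a description of HMMER's code, and the function name is hypothetical.

```python
def entry_distribution(model_length, swentry=0.5):
    """Return entry probabilities for model positions 1..model_length.

    Position 1 (a global-style start) gets 1 - swentry; the total
    probability swentry is split evenly over positions 2..model_length.
    """
    if model_length < 2:
        raise ValueError("need at least two model positions")
    if not 0.0 <= swentry <= 1.0:
        raise ValueError("swentry must be a probability between 0 and 1")
    internal = swentry / (model_length - 1)
    return [1.0 - swentry] + [internal] * (model_length - 1)


print(entry_distribution(5))               # default swentry of 0.5
print(entry_distribution(5, swentry=0.9))  # fragments penalized less
```

Raising swentry moves probability away from position 1, so alignments that begin inside the model cost less and full-length starts cost more, as described above.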
--swexit <x>
Controls the total probability that is distributed to local
exits from the model, versus ending an alignment at the end of
the model as in a global alignment. <x> is a probability from 0
to 1, and by default is set to 0.5. Higher values of <x> mean
that hits that are fragments on their right (C or 3’-terminal)
side will be penalized less, but complete global alignments will
be penalized more. Lower values of <x> mean that fragments on
the right will be penalized more, and global alignments on this
side will be favored. This option only affects the
configurations that allow local alignments, i.e. -s and -f;
unless one of these options is also activated, this option has
no effect. You have independent control over local/global
alignment behavior for the N/C (5’/3’) termini of your target
sequences using --swentry and --swexit.
--verbose
Print more possibly useful stuff, such as the individual scores
for each sequence in the alignment.
--wblosum
Use the BLOSUM filtering algorithm to weight the sequences,
instead of the default. Cluster the sequences at a given
percentage identity (see --idlevel); assign each cluster a total
weight of 1.0, distributed equally amongst the members of that
cluster.
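Given cluster assignments, the weighting rule is a one-liner. The Python sketch below takes the clustering as input (a list of cluster labels, one per sequence) rather than computing it; the function name and the example labels are hypothetical.

```python
from collections import Counter


def blosum_weights(cluster_ids):
    """Give each cluster a total weight of 1.0, split equally among members.

    cluster_ids: one cluster label per sequence, in alignment order.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


# Six sequences in three clusters of sizes 2, 1 and 3.
weights = blosum_weights([0, 0, 1, 2, 2, 2])
print(weights)
print(sum(weights))  # equals the number of clusters, up to rounding
```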
--wgsc Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
algorithm. This is already the default, so this option has no
effect (unless it follows another option in the --w family, in
which case it overrides it).
--wme Use the Krogh/Mitchison maximum entropy algorithm to "weight"
the sequences. This supersedes the Eddy/Mitchison/Durbin maximum
discrimination algorithm, which gives almost identical weights
but is less robust. ME weighting seems to give a marginal
increase in sensitivity over the default GSC weights, but takes
a fair amount of time.
--wnone
Turn off all sequence weighting.
--wpb Use the Henikoff position-based weighting scheme.
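For readers unfamiliar with position-based weights, here is a minimal Python sketch of the Henikoff scheme: in each column, a sequence receives 1 / (k * n), where k is the number of distinct residue types in the column and n is the number of sequences sharing its residue. Skipping gap symbols and normalizing the weights to sum to the number of sequences are assumptions of this sketch, as are the names used.

```python
from collections import Counter

# Assumed gap symbols; the set HMMER actually recognizes may differ.
GAP_CHARS = set("-._~")


def position_based_weights(alignment):
    """Henikoff position-based weights for a list of aligned sequences."""
    nseq = len(alignment)
    weights = [0.0] * nseq
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment]
        counts = Counter(r for r in residues if r not in GAP_CHARS)
        k = len(counts)  # distinct residue types in this column
        for i, r in enumerate(residues):
            if r in counts:
                weights[i] += 1.0 / (k * counts[r])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq  # all-gap alignment: fall back to equal weights
    # Normalize so the weights sum to the number of sequences.
    return [w * nseq / total for w in weights]


# Two identical sequences share weight; the divergent one gets more.
print(position_based_weights(["AAA", "AAA", "CCC"]))
```

Each column is processed in a single pass over the sequences, so the cost grows linearly with the number of sequences rather than quadratically.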
--wvoronoi
Use the Sibbald/Argos Voronoi sequence weighting algorithm in
place of the default GSC weighting.
SEE ALSO
Master man page, with full list of and guide to the individual man
pages: see hmmer(1).
For complete documentation, see the user guide that came with the
distribution (Userguide.pdf); or see the HMMER web page,
http://hmmer.wustl.edu/.
COPYRIGHT
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
Freely distributed under the GNU General Public License (GPL).
See the file COPYING in your distribution for details on redistribution
conditions.
AUTHOR
Sean Eddy
HHMI/Dept. of Genetics
Washington Univ. School of Medicine
4566 Scott Ave.
St Louis, MO 63110 USA
http://www.genetics.wustl.edu/eddy/