exonerate - a generic tool for sequence comparison

NAME

       exonerate - a generic tool for sequence comparison

SYNOPSIS

       exonerate [ options ] <query path> <target path>

DESCRIPTION

       exonerate is a general tool for sequence comparison.

       It  uses the C4 dynamic programming library.  It is designed to be both
       general and fast.  It can produce either gapped or ungapped alignments,
       according  to  a variety of different alignment models.  The C4 library
       allows  sequence  alignment  using  a  reduced   space   full   dynamic
       programming  implementation,  but  also  allows automated generation of
       heuristics from the alignment  models,  using  bounded  sparse  dynamic
       programming,  so  that  these alignments may also be rapidly generated.
       Alignments generated using these heuristics will represent a valid path
       through  the  alignment  model, yet (unlike the exhaustive alignments),
       the results are not guaranteed to be optimal.

CONVENTIONS

       A number of conventions (and idiosyncrasies) are used within exonerate.
       An understanding of them facilitates interpretation of the output.

       Coordinates
              An in-between coordinate system is used, where the positions are
              counted between the symbols, rather than on the  symbols.   This
              numbering  scheme  starts  from  zero.   This numbering is shown
              below for the sequence "ACGT":

               A C G T
              0 1 2 3 4

              Hence the  subsequence  "CG"  would  have  start=1,  end=3,  and
              length=2.    This   coordinate  system  is  used  internally  in
              exonerate, and for all the  output  formats  produced  with  the
              exception  of the "human readable" alignment display and the GFF
              output where convention and standards dictate otherwise.
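
              The in-between numbering above behaves like zero-based,
              half-open slicing.  The following is a minimal Python sketch
              of that correspondence; the helper names are illustrative and
              are not part of exonerate.  The second helper shows the usual
              conversion to 1-based inclusive positions, as used by GFF.

```python
def subsequence(seq, start, end):
    """Extract a region given in-between (zero-based, half-open) coordinates."""
    return seq[start:end]


def to_one_based_inclusive(start, end):
    """Convert in-between coordinates to 1-based inclusive positions."""
    return start + 1, end


seq = "ACGT"
region = subsequence(seq, 1, 3)          # the "CG" example from the text
length = 3 - 1                           # length is simply end - start
one_based = to_one_based_inclusive(1, 3)
print(region, length, one_based)
```

              Because the length is always end minus start, no off-by-one
              adjustment is needed when working in this coordinate system.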

       Reverse Complements
              When an alignment is reported on the  reverse  complement  of  a
              sequence,  the  coordinates  are  simply  given  on  the reverse
              complement  copy  of  the  sequence.   Hence  positions  on  the
              sequences  are never negative.  Generally, the forward strand is
              indicated by ’+’, the reverse strand by ’-’, and an  unknown  or
              not-applicable  strand (as in the case of a protein sequence) is
              indicated by ’.’

       Alignment Scores
              Currently, only the raw alignment scores  are  displayed.   This
              score is simply the sum of transition scores used in the dynamic
              programming.  For example,  in  the  case  of  a  Smith-Waterman
              alignment,  this  will  be  the  sum  of the substitution matrix
              scores and the gap penalties.
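
              As an illustration of such a raw score, the sketch below sums
              substitution scores and affine gap penalties over the columns
              of a gapped alignment.  The scoring values (match 5, mismatch
              -4, gap open -12, gap extend -4) and the function name are
              assumptions chosen for the example, and any run of adjacent
              gap columns is treated as a single gap.  This is not
              exonerate's own scoring code.

```python
def raw_score(columns, submat, gap_open, gap_extend):
    """Sum substitution scores and affine gap penalties over alignment columns.

    columns is a list of (query_symbol, target_symbol) pairs; '-' marks a gap.
    """
    score = 0
    in_gap = False
    for q, t in columns:
        if q == "-" or t == "-":
            # The first gap column pays the open penalty, later ones extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += submat[(q, t)]
            in_gap = False
    return score


# Hypothetical DNA matrix: +5 for a match, -4 for a mismatch.
submat = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}

# Query  AC--GT
# Target ACTTGT
columns = list(zip("AC--GT", "ACTTGT"))
total = raw_score(columns, submat, gap_open=-12, gap_extend=-4)
print(total)  # 5 + 5 - 12 - 4 + 5 + 5
```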

GENERAL OPTIONS

       Most arguments have short and long forms.   The  long  forms  are  more
       likely  to  be  stable  over  time, and hence should be used in scripts
       which call exonerate.

       -h | --shorthelp <boolean>
              Show help.  This will display a concise summary of the available
              options, defaults and values currently set.

       --help <boolean>
              This  shows  all  the  help  options including the defaults, the
              value currently set, and the environment variable which  may  be
              used  to  set  each  parameter.   There will be an indication of
              which options are mandatory.  Mandatory options have no default,
              and  must  have  a  value  supplied  for  exonerate  to run.  If
              mandatory options are used in order, their flags may be  skipped
              from  the  command  line  (see examples below).  Unlike this man
              page, the information from this option will always be up to date
              with the latest version of the program.

       -v | --version <boolean>
              Display  the  version  number.   Also displays other information
              such as the build date and glib version used.

SEQUENCE INPUT OPTIONS

       Pairwise comparisons will be performed between all query sequences  and
       all  target  sequences.   Generally,  for the best performance, shorter
       sequences (eg. ESTs, shotgun reads, proteins) should  be  used  as  the
       query sequences, and longer sequences (eg. genomic sequences) should be
       used as the target sequences.

       -q | --query  <paths>
              Specify the query sequences required.  These must be in a  FASTA
              format  file.   Single  or  multiple  query  sequences  may  be
              supplied.  Additionally, multiple FASTA files may  be  supplied
              following a --query flag, or by using multiple --query flags.

       -t | --target <paths>
              Specify the target sequences required.  These must also be in  a
              FASTA format file.  As with the query sequences, single or
              multiple target sequences and files may be supplied.  NEW: the
              target filename may be replaced by a server name and port number
              in the form of hostname:port when using exonerate-server.  See
              the man page for exonerate-server for more information on
              running exonerate in client:server mode.

       -Q | --querytype <dna | protein>
              Specify the query type to use.  If this  is  not  supplied,  the
              query  type  is assumed to be DNA when the first sequence in the
              file contains more than 85% [ACGTN]  bases.   Otherwise,  it  is
              assumed  to  be  peptide.   This option forces the query type as
              some nucleotide and peptide sequences can fall  either  side  of
              this threshold.
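
              The 85% rule described above can be sketched as follows.  This
              is an illustration of the stated heuristic only: the function
              name is hypothetical, and treating lower-case symbols the same
              as upper-case ones is an assumption.

```python
def guess_sequence_type(first_sequence, threshold=0.85):
    """Guess 'dna' or 'protein' from the fraction of [ACGTN] symbols."""
    symbols = first_sequence.upper()
    if not symbols:
        raise ValueError("cannot guess the type of an empty sequence")
    dna_like = sum(symbols.count(base) for base in "ACGTN")
    return "dna" if dna_like / len(symbols) > threshold else "protein"


print(guess_sequence_type("ACGTNNACGTACGT"))
print(guess_sequence_type("MKTAYIAKQRQISFVK"))
# A peptide made only of A, C, G and T residues is misclassified,
# which is the situation --querytype is meant to handle.
print(guess_sequence_type("GATTACA"))
```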

       -T | --targettype <dna | protein>
              Specify  the  target  type  to  use.   The  same  as --querytype
              (above), except that it applies to the target.   Specifying  the
              sequence  type  will  avoid  the  overhead of having to read the
              first sequence in the database twice (which may  be  significant
              with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
              These  options  facilitate running exonerate on compute farms,
              and avoid having to  split  up  sequence  databases  into  small
              chunks  to  run on different nodes.  If, for example, you wished
              to split the target database into three  parts,  you  would  run
              three exonerate jobs on different nodes including the options:

              --targetchunkid 1 --targetchunktotal 3
              --targetchunkid 2 --targetchunktotal 3
              --targetchunkid 3 --targetchunktotal 3

              NB.  The  granularity offered by this option only goes down to a
              single sequence, so when there are more chunks than sequences in
              the database, some processes will do nothing.
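
              When submitting many jobs, the chunk options can be generated
              rather than typed by hand.  The sketch below only builds the
              option strings shown above; the function name is hypothetical,
              and job submission itself is left to the local batch system.

```python
def chunk_options(total, side="target"):
    """Return one option string per chunk, numbered from 1 to total."""
    if side not in ("query", "target"):
        raise ValueError("side must be 'query' or 'target'")
    if total < 1:
        raise ValueError("total must be at least 1")
    return [
        f"--{side}chunkid {chunk_id} --{side}chunktotal {total}"
        for chunk_id in range(1, total + 1)
    ]


for options in chunk_options(3):
    print(options)
```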

       -V | --verbose <int>
              Be  verbose - show information about what is going on during the
              analysis.  The default is 1 (little information), the higher the
              number  given,  the more information is printed.  To silence all
              the   default   output   from   exonerate,   use   --verbose   0
              --showalignment no --showvulgar no

ANALYSIS OPTIONS

       -E | --exhaustive <boolean>
              Specify  whether or not exhaustive alignment should be used.  By
              default, this is FALSE, and alignment heuristics will  be  used.
              If   it  is  set  to  TRUE,  an  exhaustive  alignment  will  be
              calculated.  This requires quadratic time,  and  will  be  much,
              much  slower,  but will provide the optimal result for the given
              model.
       -B | --bigseq <int>
              Perform alignment of large (multi-megabase) sequences.  This  is
              very   memory   efficient  and  fast  when  both  sequences  are
              chromosome-sized, but does not currently permit the use of a  word
              neighbourhood (ie. exactly matching seeds only).
       --forcescan <none | query | target>
              Force the FSM to scan the query sequence rather than the target.
              This option is useful, for example, if you have a  single  piece
              of  genomic  sequence and you wish to compare it to the whole of
              dbEST.  By scanning the database, rather  than  the  query,  the
              analysis  will  be completed much more quickly, as the overheads
              of multiple query FSM construction, multiple target reading  and
              splice  site predictions will be removed.  By default, exonerate
              will guess the  optimal  strategy  based  on  database  sequence
              sizes.
       --saturatethreshold <number>
              When  set  to  zero,  this option does nothing.  Otherwise, once
              more than this number of words  (in  addition  to  the  expected
              number of words by chance) have matched a position on the query,
              the position on the  query  will  be  ’numbed’  (ignore  further
              matches) for the current pairwise comparison.
       --customserver <command>
              NEW:  When  using  exonerate  in  client:server mode with a non-
              standard server, this  command  allows  you  to  send  a  custom
              command  to  the  server.   This  command  is sent by the client
              (exonerate) before any other commands, and is provided as a  way
              of  passing  parameters or other commands specific to the custom
              server.  See the exonerate-server man page for more  information
              on running exonerate in client:server mode.

FASTA DATABASE OPTIONS

       --fastasuffix <extension>
              If  any  of  the  inputs  given  with  --query  or  --target are
              directories,  then  exonerate  will  recursively  descend  these
              directories,  reading all files ending with this suffix as fasta
              format input.

GAPPED ALIGNMENT OPTIONS

       -m | --model <alignment model>
              Specify the alignment model to use.  The models currently
              supported are:
              ungapped
                     The simplest type of model, used by default.  An
                     appropriate model will be selected automatically for
                     the type of input sequences provided.
              ungapped:trans
                     This ungapped model includes translation of all frames
                     of both the query and target sequences.  This is
                     similar to an ungapped tblastx type search.
              affine:global
                     This performs gapped global alignment, similar to the
                     Needleman-Wunsch algorithm, except with affine gaps.
                     Global alignment requires that both the sequences in
                     their entirety are included in the alignment.
              affine:bestfit
                     This performs a best fit or best location alignment of
                     the query onto the target sequence.  The entire query
                     sequence will be included in the alignment, but only
                     the best location for its alignment on the target
                     sequence.
              affine:local
                     This is local alignment with affine gaps, similar to
                     the Smith-Waterman-Gotoh algorithm.  A general-purpose
                     alignment algorithm.  As this is local alignment, any
                     subsequence of the query and target sequence may appear
                     in the alignment.
              affine:overlap
                     This type of alignment finds the best overlap between
                     the query and target.  The overlap alignment must
                     include the start of the query or target and the end of
                     the query or the target sequence, to align sequences
                     which overlap at the ends, or in the mid-section of a
                     longer sequence.  This is the type of alignment
                     frequently used in assembly algorithms.
              est2genome
                     This model is similar to the affine:local model, but it
                     also includes intron modelling on the target sequence
                     to allow alignment of spliced to unspliced coding
                     sequences for both forward and reversed genes.  This is
                     similar to the alignment models used in programs such
                     as EST_GENOME and sim4.
              ner    NERs are non-equivalenced regions - large regions in
                     both the query and target which are not aligned.  This
                     model can be used for protein alignments where strongly
                     conserved helix regions will be aligned, but weakly
                     conserved loop regions are not.  Similarly, this model
                     could be used to look for co-linearly conserved regions
                     in comparison of genomic sequences.
              protein2dna
                     This model compares a protein sequence to a DNA
                     sequence, incorporating all the appropriate gaps and
                     frameshifts.
              protein2dna:bestfit
                     NEW: This is a bestfit version of the protein2dna
                     model, with which the entire protein is included in the
                     alignment.  It is currently only available when using
                     exhaustive alignment.
              protein2genome
                     This model allows alignment of a protein sequence to
                     genomic DNA.  This is similar to the protein2dna model,
                     with the addition of modelling of introns and intron
                     phases.  This model is similar to those used by
                     genewise.
              protein2genome:bestfit
                     NEW: This is a bestfit version of the protein2genome
                     model, with which the entire protein is included in the
                     alignment.  It is currently only available when using
                     exhaustive alignment.
              coding2coding
                     This model is similar to the ungapped:trans model,
                     except that gaps and frameshifts are allowed.  It is
                     similar to a gapped tblastx search.
              coding2genome
                     This is similar to the est2genome model, except that
                     the query sequence is translated during comparison,
                     allowing a more sensitive comparison.
              cdna2genome
                     This combines properties of the est2genome and
                     coding2genome models, to allow modelling of a whole
                     cDNA where a central coding region can be flanked by
                     non-coding UTRs.  When the CDS start and end are known
                     they may be specified using the --annotation option
                     (see below) to permit only the correct coding region to
                     appear in the alignment.
              genome2genome
                     This model is similar to the coding2coding model,
                     except introns are modelled on both sequences.  (not
                     working well yet)

              The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d,
              p2d:b, p2g, p2g:b, c2c, c2g, cd2g and g2g can also be used for
              specifying models.

       -s | --score <threshold>
              This is the overall score threshold.  Alignments will not be
              reported below this threshold.  For heuristic alignments, the
              higher this threshold, the less time the analysis will take.

       --percent <percentage>
              Report only alignments scoring at least this percentage of the
              maximal score for each query.  eg. use --percent 90 to report
              alignments with 90% of the maximal score obtainable for that
              query.  This option is useful not only because it reduces the
              spurious matches in the output, but because it generates
              query-specific thresholds (unlike --score) for a set of
              queries of differing lengths, and will also speed up the
              search considerably.  NB. with this option, it is possible to
              have a cDNA match its corresponding gene exactly, yet still
              score less than 100%, due to the addition of the intron
              penalty scores, hence this option must be used with caution.

       --showalignment <boolean>
              Show the alignments in a human readable form.

       --showsugar <boolean>
              Display "sugar" output for ungapped alignments.  Sugar is
              Simple UnGapped Alignment Report, which displays ungapped
              alignments one-per-line.  The sugar line starts with the
              string "sugar:" for easy extraction from the output, and is
              followed by the following 9 fields in the order below:

              query_id        Query identifier
              query_start     Query position at alignment start
              query_end       Query position at alignment end
              query_strand    Strand of query matched
              target_id       |
              target_start    | the same 4 fields
              target_end      | for the target sequence
              target_strand   |
              score           The raw alignment score
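
              The nine fields above are straightforward to extract from the
              output.  The following Python sketch parses one sugar line; the
              class and function names and the sample line are hypothetical,
              and the sketch assumes whitespace-separated fields with
              integer positions and scores.

```python
from typing import NamedTuple


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line):
    """Parse a line beginning with 'sugar:' into its nine fields."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    f = line[len(prefix):].split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, found %d" % len(f))
    return Sugar(f[0], int(f[1]), int(f[2]), f[3],
                 f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


# Made-up example line, for illustration only.
hit = parse_sugar("sugar: query1 0 120 + target1 500 620 + 600")
print(hit.query_id, hit.target_start, hit.score)
```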

       --showcigar <boolean>
              Show the alignments in "cigar" format.  Cigar is a Compact
              Idiosyncratic Gapped Alignment Report, which displays gapped
              alignments one-per-line.  The format starts with the same 9
              fields as sugar output (see above), and is followed by a
              series of <operation, length> pairs where operation is one of
              match, insert or delete, and the length describes the number
              of times this operation is repeated.
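
              A cigar line can be split into its leading fields and its
              <operation, length> pairs as sketched below.  The function
              name and the sample line are hypothetical, and the single
              letter operation codes M, I and D are an assumption about how
              match, insert and delete are written in the output.

```python
def parse_cigar(line):
    """Split a cigar line into its 9 leading fields and (operation, length) pairs."""
    prefix = "cigar:"
    if not line.startswith(prefix):
        raise ValueError("not a cigar line")
    fields = line[len(prefix):].split()
    head, rest = fields[:9], fields[9:]
    if len(head) != 9 or len(rest) % 2 != 0:
        raise ValueError("malformed cigar line")
    operations = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return head, operations


# Made-up example line, for illustration only.
head, operations = parse_cigar("cigar: q1 0 10 + t1 100 111 + 42 M 6 D 1 M 4")
print(head[0], head[4], operations)
```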

       --showvulgar <boolean>
              Shows the alignments in "vulgar" format.  Vulgar is Verbose
              Useful Labelled Gapped Alignment Report.  This format also
              starts with the same 9 fields as sugar output (see above), and
              is followed by a series of <label, query_length,
              target_length> triplets.  The label may be one of the
              following:

              M      Match
              C      Codon
              G      Gap
              N      Non-equivalenced region
              5      5’ splice site
              3      3’ splice site
              I      Intron
              S      Split codon
              F      Frameshift
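
              The triplets of a vulgar line can be read in the same way as
              the other one-per-line formats.  The Python sketch below
              parses the triplets and performs a simple sanity check that
              the per-label lengths add up to the spans given in the leading
              fields.  The function names and the sample line are
              hypothetical, and the additivity check is an assumption about
              the format rather than something stated above.

```python
def parse_vulgar(line):
    """Split a vulgar line into 9 leading fields and (label, qlen, tlen) triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    head, rest = fields[:9], fields[9:]
    if len(head) != 9 or len(rest) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                for i in range(0, len(rest), 3)]
    return head, triplets


def spans_agree(head, triplets):
    """Check that the summed lengths match the query and target spans."""
    query_span = abs(int(head[2]) - int(head[1]))
    target_span = abs(int(head[6]) - int(head[5]))
    return (sum(q for _, q, _ in triplets) == query_span
            and sum(t for _, _, t in triplets) == target_span)


# Made-up example line, for illustration only.
head, triplets = parse_vulgar("vulgar: q1 0 10 + t1 100 111 + 42 M 6 6 G 0 1 M 4 4")
print(triplets, spans_agree(head, triplets))
```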

       --showquerygff <boolean>
              Report GFF output for features on the query sequence.  See
              http://www.sanger.ac.uk/Software/formats/GFF for more
              information.

       --showtargetgff <boolean>
              Report GFF output for features on the target sequence.

       --ryo <format>
              Roll-your-own output format.  This allows specification of a
              printf-esque format line which is used to specify which
              information to include in the output, and how it is to be
              shown.  The format field may contain the following fields:

              %[qt][idlsSt]
                     For either {query,target}, report the
                     {id,definition,length,sequence,Strand,type}.  Sequences
                     are reported in a fasta-format like block (no headers).
              %[qt]a[bels]
                     For either {query,target} region which occurs in the
                     alignment, report the {begin,end,length,sequence}.
              %[qt]c[bels]
                     For either {query,target} region which occurs in the
                     coding sequence in the alignment, report the
                     {begin,end,length,sequence}.
              %s     The raw score
              %r     The rank (in results from a bestn search)
              %m     Model name
              %e[tism]
                     Equivalenced {total,id,similarity,mismatches} (ie. %em
                     == (%et - %ei))
              %p[is] Percent {id,similarity} over the equivalenced portions
                     of the alignment (ie. %pi == 100*(%ei / %et))
              %g     Gene orientation (’+’ = forward, ’-’ = reverse, ’.’ =
                     unknown)
              %S     Sugar block (the 9 fields used in sugar output (see
                     above))
              %C     Cigar block (the fields of a cigar line after the sugar
                     portion)
              %V     Vulgar block (the fields of a vulgar line after the
                     sugar portion)
              %%     Expands to a percentage sign (%)
              \n     Newline
              \t     Tab
              \\     Expands to a backslash (\)
              \{     Open curly brace
              \}     Close curly brace
              {      Begin per-transition output section
              }      End per-transition output section
              %P[qt][sabe]
                     Per-transition output for {query,target}
                     {sequence,advance,begin,end}
              %P[nsl]
                     Per-transition output for {name,score,label}

              This option is very useful and flexible.  For example, to
              report all the sections of query sequences which feature in
              alignments in fasta format, use:

              --ryo ">%qi %qd\n%qas\n"

              To output all the symbols and scores in an alignment, try
              something like:

              --ryo "%V{%Pqs %Pts %Ps\n}"
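
              A --ryo format is convenient when the output is to be read by
              a script.  The following Python sketch builds an argument list
              with a tab-separated format and parses one resulting line.
              The function names, file names and sample line are
              hypothetical.  The format string is kept as a raw string so
              that the \t and \n escapes reach exonerate unexpanded, and the
              sketch assumes that any other output lines have been filtered
              out before parsing.

```python
# Raw string: the backslash escapes are interpreted by exonerate, not Python.
RYO_FORMAT = r"%qi\t%ti\t%s\t%pi\n"


def build_command(query_path, target_path):
    """Build an argument list suitable for subprocess.run()."""
    return [
        "exonerate",
        "--query", query_path,
        "--target", target_path,
        "--showalignment", "no",
        "--showvulgar", "no",
        "--ryo", RYO_FORMAT,
    ]


def parse_ryo_line(line):
    """Parse one line produced by RYO_FORMAT into a dictionary."""
    query_id, target_id, score, percent_id = line.rstrip("\n").split("\t")
    return {
        "query": query_id,
        "target": target_id,
        "score": int(score),
        "percent_id": float(percent_id),
    }


command = build_command("query.fasta", "target.fasta")
record = parse_ryo_line("q1\tt1\t600\t98.50\n")  # made-up output line
print(command[-1], record)
```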

       -n | --bestn <number>
              Report the best N results for each query.  (Only results
              scoring better than the score threshold will be reported).
              The option reduces the amount of output generated, and also
              allows exonerate to speed up the search.

       -S | --subopt <boolean>
              This option allows for the reporting of (Waterman-Eggert
              style) suboptimal alignments.  (It is on by default.)  All
              suboptimal (ie. non-intersecting) alignments will be reported
              for each pair of sequences scoring at least the threshold
              provided by --score.

              When this option is used with exhaustive alignments, several
              full quadratic time passes will be required, so the running
              time will be considerably increased.

       -g | --gappedextension <boolean>
              Causes a gapped extension stage to be performed, ie. dynamic
              programming is applied in arbitrarily shaped and dynamically
              sized regions surrounding HSP seeds.  The extension threshold
              is controlled by the --extensionthreshold option.

              Although sometimes slower than BSDP, gapped extension improves
              sensitivity with weak, gap-rich alignments such as during
              cross-species comparison.

              NB. This option is now the default.  Set it to false to revert
              to the old BSDP type alignments.  This option may be slower
              than BSDP for some large scale analyses with simple alignment
              models.

       --refine <strategy>
              Force exonerate to refine alignments generated by heuristics
              using dynamic programming over larger regions.  This takes
              more time, but improves the quality of the final alignments.

              The strategies available for refinement are:

              none   The default - no refinement is used.
              full   An exhaustive alignment is calculated from the pair of
                     sequences in their entirety.
              region DP is applied just to the region of the sequences
                     covered by the heuristic alignment.

       --refineboundary <size>
              Specify an extra boundary to be included in the region subject
              to alignment during refinement by region.

VITERBI ALGORITHM OPTIONS

       -D | --dpmemory <Mb>
              The  exhaustive  alignment traceback routines use a Hughey-style
              reduced memory technique.  This option specifies how much memory
              will  be used for this.  Generally, the more memory is permitted
              here, the faster the alignments will be produced.

CODE GENERATION OPTIONS

       -C | --compiled <boolean>
              This option allows  disabling  of  generated  code  for  dynamic
              programming.  It is mainly used during development of exonerate.
              When set to FALSE,  an  "interpreted"  version  of  the  dynamic
              programming implementation is used, which is much slower.

HEURISTIC OPTIONS

       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
              These  options are used to specify the size of the sub-alignment
              regions to which DP is applied around  the  ends  of  the  HSPs.
              This can be at the HSP ends (terminal range), between HSPs (join
              range), or between HSPs which may be connected by a large region
              such  as  an  intron  or  non-equivalenced  region (span range).
              These ranges can be specified for a number of matches back  onto
              the HSP (internal range) or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS

       -x | --extensionthreshold <score>
              This is the amount by which the score will be allowed to degrade
              during SDP.  This is the equivalent of the hspdropoff penalties,
              except  it  is  applied  during  dynamic  programming,  not  HSP
              extension.  Decreasing this parameter will increase the speed of
              the SDP, and increasing it will increase the sensitivity.

       --singlepass  <boolean>
              By  default  the  suboptimal  SDP  alignments  are reported by a
              singlepass algorithm, but may miss  some  suboptimal  alignments
              that  are  close together.  This option can be used to force the
              use of a  multipass  suboptimal  alignment  algorithm  for  SDP,
              resulting in higher quality suboptimal alignments.

BSDP OPTIONS

       --joinfilter <limit>
              (experimental)

              Only consider this number of SARs for joining HSPs together.
              The SARs with the highest potential for appearing in a
              high-scoring alignment are considered.  This option is useful for
              limiting time and memory usage when searching unmasked data with
              repetitive  sequences,  but  should not be set too low, as valid
              matches may be ignored.  Something like --joinfilter 32 seems to
              work well.

SEQUENCE OPTIONS

       --annotation <path>
              Specify  basic  sequence  annotation  information.  This is most
              useful with the cdna2genome model,  but  will  work  with  other
              models.  The annotation file contains four fields per line:

              <id> <strand> <cds_start> <cds_length>

              Here is a simple example of such a file for 4 cDNAs:

              dhh.human.cdna + 308 1191
              dhh.mouse.cdna + 250 1191
              csn7a.human.cdna + 178 828
              csn7a.mouse.cdna + 126 828

              These  annotation  lines  will also work when only the first two
              fields are used.  This can be used when specifying which  strand
              of a specific sequence should be included in a comparison.
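
              An annotation file in this layout can be read with a few lines
              of Python.  The sketch below accepts both the four-field and
              the two-field form; the function name is hypothetical, and the
              sketch makes no assumption about whether cds_start is counted
              from zero or from one.

```python
def parse_annotation(text):
    """Map each sequence id to (strand, cds_start, cds_length).

    Lines with only <id> <strand> yield None for the two CDS fields.
    """
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 4:
            seq_id, strand, cds_start, cds_length = fields
            records[seq_id] = (strand, int(cds_start), int(cds_length))
        elif len(fields) == 2:
            seq_id, strand = fields
            records[seq_id] = (strand, None, None)
        else:
            raise ValueError("expected 2 or 4 fields: %r" % line)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna + 178 828
csn7a.mouse.cdna + 126 828
"""
annotation = parse_annotation(example)
print(annotation["dhh.human.cdna"])
```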

SYMBOL COMPARISON OPTIONS

       --softmaskquery <boolean>
              Indicate  that  the  query is softmasked.  See description below
              for --softmasktarget
       --softmasktarget <boolean>
              Indicate  that  the  target  is  softmasked.   In  a  softmasked
              sequence  file,  instead of masking regions by Ns or Xs they are
              masked by putting those regions in lower case (and with unmasked
              regions  in  upper  case).  This option allows the masking to be
              ignored by some parts of the program,  combining  the  speed  of
              searching  masked  data  with  sensitivity of searching unmasked
              data.  The utility fastasoftmask, which is supplied with
              exonerate, can be used for producing softmasked sequence from
              conventionally masked sequence.
       -d | --dnasubmat <name>
              Specify the substitution matrix to be used for DNA comparison.
              This should be a path to a substitution matrix in the same
              format as that which is used by blast.
       -p | --proteinsubmat <name>
              Specify the substitution matrix to  be  used  for  protein
              comparison.   (Both  DNA  and  protein substitution matrices are
              required for some types of analysis).  The use  of  the  special
              names,  nucleic,  blosum62,  pam250, edit or identity will cause
              built-in substitution matrices to be used.

ALIGNMENT SEEDING OPTIONS

       -M | --fsmmemory <Mb>
              Specify the amount of memory to use for  the  FSM  in  heuristic
              analyses.   exonerate multiplexes the query to accelerate large-
              throughput database queries.  This figure should always be  less
              than  the  physical  memory  on  the machine, but when searching
              large databases, generally, the more memory  it  is  allowed  to
              use, the faster it will go.
       --forcefsm <none | normal | compact>
              Force the use of more compact finite state machines for analyses
              involving big  sequences  and  large  word  neighbourhoods.   By
              default, exonerate will pick a sensible strategy, so this option
              will rarely need to be set.
       --wordjump <int>
              The  jump  between  query  words  used   to   yield   the   word
              neighbourhood.   If  set  to 1, every word is used, if set to 2,
              every other word is used, and if set  to  the  wordlength,  only
              non-overlapping  words  will  be  used.  This option reduces the
              memory requirements when using very large query  sequences,  and
              makes  the  search  run  faster,  but  it  also  damages  search
              sensitivity when high values are set.
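
              The effect of the jump on the number of query words can be
              seen in the short Python sketch below.  It only enumerates
              the words taken from the query; the function name is
              hypothetical, and the construction of the word neighbourhood
              itself is not shown.

```python
def query_words(seq, wordlen, wordjump=1):
    """Return the words taken from seq when stepping wordjump positions at a time."""
    if wordlen < 1 or wordjump < 1:
        raise ValueError("wordlen and wordjump must be positive")
    return [seq[i:i + wordlen]
            for i in range(0, len(seq) - wordlen + 1, wordjump)]


seq = "ACGTACGTAC"
every_word = query_words(seq, 4, 1)       # every word is used
every_other = query_words(seq, 4, 2)      # every other word is used
non_overlapping = query_words(seq, 4, 4)  # wordjump equal to the wordlength
print(len(every_word), len(every_other), non_overlapping)
```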

AFFINE MODEL OPTIONS

       -o | --gapopen <penalty>
              This is the gap open penalty.
       -e | --gapextend <penalty>
              This is the gap extension penalty.
       --codongapopen <penalty>
              This is the codon gap open penalty.
       --codongapextend <penalty>
              This is the codon gap extension penalty.

NER OPTIONS

       --minner <length>
              Minimum NER length allowed.
       --maxner <length>
              Maximum NER  length  allowed.   NB.  this  option  only  affects
              heuristic alignments.
       --neropen <penalty>
              Penalty for opening a non-equivalenced region.

INTRON MODELLING OPTIONS

       --minintron <length>
              Minimum  intron  length  limit.   NB.  this  option only affects
              heuristic alignments.  This is  not  a  hard  limit  -  it  only
              affects  size  of  introns  which  are  sought  during heuristic
              alignment.
       --maxintron <length>
              Maximum intron length limit.  See notes above for --minintron
       -i | --intronpenalty <penalty>
              Penalty for introduction of an intron.

FRAMESHIFT MODELLING OPTIONS

       -f | --frameshift <penalty>
              The penalty for the inclusion of a frameshift in an alignment.

ALPHABET OPTIONS

       --useaatla <boolean>
              Use three-letter abbreviations for AA names, ie. when displaying
              an alignment, "Met" is used instead of " M ".

TRANSLATION OPTIONS

       --geneticcode <code>
              NEW:  Specify an alternative genetic code.  The default code (1)
              is the standard  genetic  code.   Other  genetic  codes  may  be
              specified in shorthand or longhand form.

              In  shorthand form, a number between 1 and 23 is used to specify
              one of 17 built-in genetic code  variants.   These  are  genetic
              code variants taken from:

              http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

              These are:
              1      The Standard Code
              2      The Vertebrate Mitochondrial Code
              3      The Yeast Mitochondrial Code
              4      The  Mold, Protozoan, and Coelenterate Mitochondrial Code
                     and the Mycoplasma/Spiroplasma Code
              5      The Invertebrate Mitochondrial Code
              6      The Ciliate, Dasycladacean and Hexamita Nuclear Code
              9      The Echinoderm and Flatworm Mitochondrial Code
              10     The Euplotid Nuclear Code
              11     The Bacterial and Plant Plastid Code
              12     The Alternative Yeast Nuclear Code
              13     The Ascidian Mitochondrial Code
              14     The Alternative Flatworm Mitochondrial Code
              15     Blepharisma Nuclear Code
              16     Chlorophycean Mitochondrial Code
              21     Trematode Mitochondrial Code
              22     Scenedesmus obliquus mitochondrial Code
              23     Thraustochytrium Mitochondrial Code

              In longhand form, a genetic code variant may be provided as a 64
              byte string in TCAG order, eg. the standard genetic code in this
              form would be:

              FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
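
              The longhand form can be read as a lookup table: the codons
              are enumerated with the first base varying slowest, each base
              running through T, C, A, G, and the n-th character of the
              string is the amino acid for the n-th codon.  The following
              Python sketch builds that table and translates a short
              sequence.  The function names are illustrative, and writing
              unknown codons as "X" is an assumption of the sketch.

```python
BASES = "TCAG"
STANDARD_CODE = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)


def codon_table(code):
    """Build a codon -> amino acid mapping from a 64 character TCAG-order string."""
    if len(code) != 64:
        raise ValueError("a longhand genetic code must have 64 characters")
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, code))


def translate(dna, code=STANDARD_CODE):
    """Translate dna in frame 0; codons not in the table become 'X'."""
    table = codon_table(code)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


print(translate("ATGGCCTGA"))
```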

HSP CREATION OPTIONS

       --hspfilter <threshold>
              Use  aggressive  HSP  filtering  to speed up heuristic searches.
              The threshold specifies the number of HSPs centred about a point
              in  the query which will be stored.  Any lower scoring HSPs will
              be discarded.  This is an experimental option  to  handle  speed
              problems  caused  by some sequences.  A value of about 100 seems
              to work well.
       --useworddropoff <boolean>
              When this is TRUE, the score threshold for admitting words into
              the word neighbourhood is set to be the initial word score minus
              the word threshold (see below).  This strategy is designed to
              prevent restricting the word neighbourhood.  When this is FALSE,
              the word threshold is taken to be an absolute value.
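
              The two interpretations of the word threshold can be written
              out as a small Python sketch.  The function and parameter
              names are hypothetical, and the scores used are made-up
              values.

```python
def neighbourhood_threshold(initial_word_score, word_threshold,
                            use_word_dropoff=True):
    """Return the score a word must reach to enter the word neighbourhood."""
    if use_word_dropoff:
        # Relative: allowed drop below the score of the initial word.
        return initial_word_score - word_threshold
    # Absolute: the threshold is used as given.
    return word_threshold


relative = neighbourhood_threshold(30, 5, use_word_dropoff=True)
absolute = neighbourhood_threshold(30, 5, use_word_dropoff=False)
print(relative, absolute)
```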
       --seedrepeat <count>
              NEW: The seedrepeat parameter sets the  number  of  seeds  which
              must  be  found on the same diagonal or reading frame before HSP
              extension will occur.  Increasing  the  value  for  --seedrepeat
              will  speed  up  searches,  and  is usually a better option than
              using  longer  word  lengths,  particularly   when   using   the
              exonerate-server   where   increasing   word   lengths  requires
              recomputing   the   index,   and   greatly   increases    memory
              requirements.
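
              The idea of requiring several seeds on one  diagonal  can  be
              sketched  in  Python  as follows.  This is a simplified illus-
              tration, not exonerate's implementation: it assumes  seeds  are
              given as (query position, target position) pairs, it defines the
              diagonal as target position minus query position, and it ignores
              reading frames.  The function name is hypothetical.

```python
from collections import Counter


def diagonals_to_extend(seeds, seedrepeat):
    """Return the diagonals that hold at least `seedrepeat` seeds.

    `seeds` is an iterable of (query_pos, target_pos) word matches; the
    diagonal of a seed is taken to be target_pos - query_pos.
    """
    counts = Counter(t - q for q, t in seeds)
    return sorted(d for d, n in counts.items() if n >= seedrepeat)
```

              With a higher seedrepeat fewer diagonals qualify, so fewer  HSP
              extensions are attempted.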
       -w --dnawordlen <bases>
       -W --proteinwordlen <residues>
       -W --codonwordlen <bases>
              The  word  length  used  for  DNA, protein or codon words.  When
              performing DNA vs protein comparisons, the DNA word length  will
              always (automatically) be triple the protein word length.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
              The  amount  by  which  an  HSP score will be allowed to degrade
              during HSP extension.  Separate thresholds can be set  for  DNA,
              protein or codon comparisons.
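
              A drop-off of this kind can be illustrated  with  an  ungapped
              extension  in  one  direction.  The Python sketch below is an
              assumption-laden simplification: it uses a flat  match/mismatch
              score  rather  than  a  substitution matrix, the default scores
              are illustrative only, and the function name  is  hypothetical.
              Extension  stops  once  the running score has fallen more than
              the drop-off below the best score seen.

```python
def extend_right(query, target, q_start, t_start, dropoff, match=5, mismatch=-4):
    """Extend an ungapped match rightwards until the score degrades too far.

    Returns (best_score, length) of the best-scoring extension.  The
    match/mismatch scores are illustrative, not exonerate's defaults.
    """
    score = best_score = 0
    best_len = 0
    limit = min(len(query) - q_start, len(target) - t_start)
    for i in range(limit):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        if score > best_score:
            best_score, best_len = score, i + 1
        elif best_score - score > dropoff:
            # Score has fallen more than `dropoff` below the best seen: stop.
            break
    return best_score, best_len
```

              A larger drop-off lets the extension cross longer  mismatched
              stretches, at the cost of more computation.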
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
              The  HSP score thresholds.  An HSP must score at least this much
              before it will be reported  or  be  used  in  preparation  of  a
              heuristic alignment.
       --dnawordlimit <score>
       --proteinwordlimit <score>
       --codonwordlimit <score>
              The threshold for admitting DNA, protein or  codon  words  into
              the word neighbourhood.  The behaviour of this option is altered
              by the --useworddropoff option (see above).

       --geneseed <threshold>
              Exclude  HSPs  from  gapped  alignment  computation which cannot
              feature in an alignment containing at least one HSP scoring  at
              least this threshold.

              This  option provides considerable speed up for gapped alignment
              computation, but may cause some very gap-rich alignments  to  be
              missed.

              It is useful when aligning similar sequences back onto a  genome
              quickly, eg. try --geneseed 250
       --geneseedrepeat <count>
              NEW:  The  geneseedrepeat  parameter  is  like  the   seedrepeat
              parameter,  but  is  only  applied when looking for the geneseed
              HSPs.  Using a larger value for --geneseedrepeat will  speed  up
              searches   when   the   --geneseed   parameter   is  also  used.
              (experimental, implementation incomplete)

ALIGNMENT OPTIONS

       --alignmentwidth <width>
              Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
              By default, all coordinates are reported on the forward  strand.
              Setting  this  option  to  false  reverts  to  the old behaviour
              (pre-0.8.3) whereby alignments on the reverse  complement  of  a
              sequence   are   reported   using  coordinates  on  the  reverse
              complement.
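
              The arithmetic relating the two coordinate systems can be  il-
              lustrated  in  Python,  using  the in-between coordinates de-
              scribed under CONVENTIONS.  This sketch only  shows  how  an
              interval  on  the reverse complement maps to the forward strand;
              it does not reproduce how exonerate orders or labels  reverse
              strand coordinates in its output, and the function name is hypo-
              thetical.

```python
def to_forward_coordinates(start, end, seq_length):
    """Convert an interval on the reverse complement to the forward strand.

    Coordinates are in-between positions (0..seq_length), so position p on
    the reverse complement corresponds to seq_length - p on the forward strand.
    """
    if not 0 <= start <= end <= seq_length:
        raise ValueError("expected 0 <= start <= end <= seq_length")
    return seq_length - end, seq_length - start
```

              For the sequence "AAGCT" (reverse complement "AGCTT"),  the  in-
              terval  0..2  on  the  reverse complement ("AG") maps to 3..5 on
              the forward strand ("CT").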

SUB-ALIGNMENT REGION OPTIONS

       --quality <percent>
              This option  excludes  HSPs  from  BSDP  when  their  components
              outside of the SARs fall below this quality threshold.

SPLICE SITE PREDICTION OPTIONS

       --splice3 <path>
       --splice5 <path>
              NEW:  Provide a file containing a custom PSSM (position specific
              score matrix) for prediction of the intron splice sites.

              The file format for splice data is simple: lines beginning  with
              '#'  are  comments,  a  line  containing  just the word 'splice'
              denotes the position of the splice site,  and  the  other  lines
              show the observed relative frequencies of the bases flanking the
              splice sites in the chosen organism (in ACGT order).

              Example 5’ splice data file:

               # start of example 5’ splice data
               # A C G T
               28 40  17  14
               59 14  13  14
                8  5  81   6
               splice
                0  0 100   0
                0  0   0 100
               54  2  42   2
               74  8  11   8
                5  6  85   4
               16 18  21  45
               # end of example 5’ splice data

              Example 3’ splice data file:

               # start of example 3’ splice data
               # A C G T
                10  31  14  44
                 8  36  14  43
                 6  34  12  48
                 6  34   8  52
                 9  37   9  45
                 9  38  10  44
                 8  44   9  40
                 9  41   8  41
                 6  44   6  45
                 6  40   6  48
                23  28  26  23
                 2  79   1  18
               100   0   0   0
                 0   0 100   0
               splice
                28  14  47  11
               # end of example 3’ splice data
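
              To check a splice data file before passing it to  exonerate,  a
              reader for the format described above might look like the Python
              sketch below.  It is not exonerate's own  parser:  the  function
              name is hypothetical, and the handling of blank lines and sur-
              rounding whitespace is an assumption, since  the  format  de-
              scription does not mention them.

```python
EXAMPLE_5 = """\
# start of example 5' splice data
# A C G T
28 40  17  14
59 14  13  14
 8  5  81   6
splice
 0  0 100   0
 0  0   0 100
54  2  42   2
74  8  11   8
 5  6  85   4
16 18  21  45
# end of example 5' splice data
"""


def parse_splice_data(text):
    """Parse splice data into (rows, splice_index).

    rows is a list of dicts keyed by base (ACGT order in the file);
    splice_index is the number of rows preceding the 'splice' line.
    """
    rows = []
    splice_index = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment, or (by assumption) an ignorable blank line
        if line == "splice":
            if splice_index is not None:
                raise ValueError("more than one 'splice' line")
            splice_index = len(rows)
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 values (A C G T), got: {line!r}")
        rows.append(dict(zip("ACGT", map(float, fields))))
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index
```

              Applied to the 5’ example, this yields nine rows with the splice
              site after the third, and the two rows following the splice site
              show the invariant G and T.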

       --forcegtag <boolean>
              Only allow splice sites at gt....ag  sites  (or  ct....ac  sites
              when the gene is reversed).  With this restriction in place, the
              splice site prediction scores  are  still  used  and  allow  tie
              breaking when there is more than one possible splice site.
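
              The restriction itself amounts to a check on the first and  last
              two  bases  of  a  candidate  intron.  The Python sketch below
              illustrates that check only; it says nothing about how exonerate
              scores  competing  sites,  and  the function name is hypothet-
              ical.

```python
def is_canonical_intron(intron, reverse=False):
    """Return True for gt....ag introns, or ct....ac when the gene is reversed."""
    intron = intron.upper()
    if len(intron) < 4:
        return False  # too short to hold both dinucleotides
    start, end = ("CT", "AC") if reverse else ("GT", "AG")
    return intron.startswith(start) and intron.endswith(end)
```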

STRATEGIES FOR SPEED

       Keep all data on local disks.

       Apply  the  highest  acceptable score thresholds using a combination of
       --score, --percent and --bestn.

       Repeat mask and dust the genomic (target)  sequence.   (Softmask  these
       sequences and use --softmasktarget).

       Increase the --fsmmemory option to allow more query multiplexing.

       Increase the value for --seedrepeat.

       When  using  an  alignment  model containing introns, set --geneseed as
       high as possible.

       If you are compiling exonerate yourself, see the README  file  supplied
       with the source code for details of compile-time optimisations.

STRATEGIES FOR SENSITIVITY

       Not documented yet.

       Increase the word neighbourhood.  Decrease the HSP threshold.  Increase
       the SAR ranges.  Run exhaustively.

ENVIRONMENT

       Not documented yet.

EXAMPLES

       exonerate cdna.fasta genomic.fasta
              This is the simplest way in which exonerate may  be  used.   By
              default, an ungapped alignment model will be used.

       exonerate     --exhaustive     y    --model    est2genome    cdna.fasta
       genomic.masked.fasta
              Exhaustively align cdnas to  genomic  sequence.   This  will  be
              much,  much  slower,  but  more  accurate.   This  option causes
              exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
              If the affine:local model is used with exhaustive alignment, you
              have the Smith-Waterman algorithm.

       exonerate     --exhaustive    --model    affine:global    protein.fasta
       protein.fasta
              Switch to a global model, and you have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped  no  --showhsp  yes  protein.fasta
       genome.fasta
              Generate ungapped Protein:DNA alignments.

       exonerate    --model    coding2coding   --score   1000   --bigseq   yes
       --proteinhspthreshold 90 chr21.fa chr22.fa
              Perform quick-and-dirty translated  pairwise  alignment  of  two
              very large DNA sequences.

       Many similar combinations should work.  Try them out.

VERSION

       This  documentation accompanies version 2.2.0 of the exonerate package.

AUTHOR

       Guy St.C. Slater.  <guy@ebi.ac.uk>.  See the AUTHORS file  accompanying
       the source code for a list of contributors.

AVAILABILITY

       The source code for the exonerate package is available under the  terms
       of the GNU General Public Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has been developed as part of the Ensembl project.  Please
       see http://www.ensembl.org/ for more information.