NAME

       exonerate - a generic tool for sequence comparison

SYNOPSIS

       exonerate [ options ] <query path> <target path>

DESCRIPTION

       exonerate is a general tool for sequence comparison.

       It  uses the C4 dynamic programming library.  It is designed to be both
       general and fast.  It can produce either gapped or ungapped alignments,
       according  to  a variety of different alignment models.  The C4 library
       allows  sequence  alignment  using  a  reduced   space   full   dynamic
       programming  implementation,  but  also  allows automated generation of
       heuristics from the alignment  models,  using  bounded  sparse  dynamic
       programming,  so  that  these alignments may also be rapidly generated.
       Alignments generated using these heuristics will represent a valid path
       through  the  alignment  model, yet (unlike the exhaustive alignments),
       the results are not guaranteed to be optimal.

CONVENTIONS

       A number of conventions (and idiosyncrasies) are used within exonerate.
       An understanding of them facilitates interpretation of the output.

       Coordinates
              An in-between coordinate system is used, where the positions are
              counted between the symbols, rather than on the  symbols.   This
              numbering  scheme  starts  from  zero.   This numbering is shown
              below for the sequence "ACGT":

               A C G T
              0 1 2 3 4

              Hence the  subsequence  "CG"  would  have  start=1,  end=3,  and
              length=2.    This   coordinate  system  is  used  internally  in
              exonerate, and for all the  output  formats  produced  with  the
              exception  of the "human readable" alignment display and the GFF
              output where convention and standards dictate otherwise.
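
              The in-between convention coincides with Python slice  indices,
              so a short sketch may help.  The function names below are purely
              illustrative and are not part of exonerate; the conversion to
              1-based, inclusive positions (of the kind used by formats such as
              GFF) is shown only for comparison.

```python
def subsequence(seq, start, end):
    """Return the symbols lying between in-between positions start and end."""
    return seq[start:end]


def to_one_based(start, end):
    """Convert an in-between interval to 1-based, inclusive symbol positions."""
    return start + 1, end


seq = "ACGT"
start, end = 1, 3                      # the "CG" example from the text
print(subsequence(seq, start, end))    # CG
print(end - start)                     # 2 (the length)
print(to_one_based(start, end))        # (2, 3)
```

              With in-between coordinates the length is simply end - start,
              and an empty interval (start == end) is representable.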

       Reverse Complements
              When an alignment is reported on the  reverse  complement  of  a
              sequence,  the  coordinates  are  simply  given  on  the reverse
              complement  copy  of  the  sequence.   Hence  positions  on  the
              sequences  are never negative.  Generally, the forward strand is
              indicated by ’+’, the reverse strand by ’-’, and an  unknown  or
              not-applicable  strand (as in the case of a protein sequence) is
              indicated by ’.’

       Alignment Scores
              Currently, only the raw alignment scores  are  displayed.   This
              score is just the sum of transition scores used in the  dynamic
              programming.  For example,  in  the  case  of  a  Smith-Waterman
              alignment,  this  will  be  the  sum  of the substitution matrix
              scores and the gap penalties.
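
              As a rough illustration of such a sum, the sketch below scores a
              gapped pairwise alignment.  The score values and the affine gap
              convention (first gap column costs the open penalty, each further
              column the extend penalty) are assumptions chosen for the
              example, not values taken from exonerate.

```python
def raw_score(query_row, target_row, match=5, mismatch=-4,
              gap_open=-12, gap_extend=-4):
    """Sum the per-column scores of a gapped pairwise alignment.

    Assumed convention: the first column of a gap costs gap_open and
    every further column of the same gap costs gap_extend.
    """
    score = 0
    previous_gap = None            # which row held a gap in the last column
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            gap_row = "query" if q == "-" else "target"
            score += gap_extend if previous_gap == gap_row else gap_open
            previous_gap = gap_row
        else:
            score += match if q == t else mismatch
            previous_gap = None
    return score


# Four matches (4 * 5) plus one gap of length two (-12 + -4).
print(raw_score("ACGTTA", "AC--TA"))   # 4
```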

GENERAL OPTIONS

       Most arguments have short and long forms.   The  long  forms  are  more
       likely  to  be  stable  over  time, and hence should be used in scripts
       which call exonerate.

       -h | --shorthelp <boolean>
              Show help.  This will display a concise summary of the available
              options, defaults and values currently set.

       --help <boolean>
              This  shows  all  the  help  options including the defaults, the
              value currently set, and the environment variable which  may  be
              used  to  set  each  parameter.   There will be an indication of
              which options are mandatory.  Mandatory options have no default,
              and  must  have  a  value  supplied  for  exonerate  to run.  If
              mandatory options are used in order, their flags may be  skipped
              from  the  command  line  (see examples below).  Unlike this man
              page, the information from this option will always be up to date
              with the latest version of the program.

       -v | --version <boolean>
              Display  the  version  number.   Also displays other information
              such as the build date and glib version used.

SEQUENCE INPUT OPTIONS

       Pairwise comparisons will be performed between all query sequences  and
       all  target  sequences.   Generally,  for the best performance, shorter
       sequences (eg. ESTs, shotgun reads, proteins) should  be  used  as  the
       query sequences, and longer sequences (eg. genomic sequences) should be
       used as the target sequences.

       -q | --query  <paths>
              Specify the query sequences required.  These must be in a  FASTA
              format  file.   Single  or  multiple  query  sequences  may   be
              supplied.  Additionally multiple copies of the fasta file may be
              supplied following a --query  flag,  or  by  using  multiple
              --query flags.

       -t | --target <paths>
              Specify the target sequences required.  These must also be in  a
              FASTA  format  file.   As  with the query sequences, single or
              multiple target sequences and files may be supplied.   NEW:  the
              target filename may be replaced by a server name and port number
              in the
              form of hostname:port when using exonerate-server.  See the  man
              page  for  exonerate-server  for  more  information  on  running
              exonerate in client:server mode.

       -Q | --querytype <dna | protein>
              Specify the query type to use.  If this  is  not  supplied,  the
              query  type  is assumed to be DNA when the first sequence in the
              file contains more than 85% [ACGTN]  bases.   Otherwise,  it  is
              assumed  to  be  peptide.   This option forces the query type as
              some nucleotide and peptide sequences can fall  either  side  of
              this threshold.
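
              The guessing rule described above can be sketched as follows.
              This is only an approximation of the behaviour: the function
              name is hypothetical, and treating lower-case symbols the same
              as upper-case ones is an assumption.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Guess "dna" or "protein" from the fraction of A, C, G, T and N symbols."""
    symbols = sequence.upper()          # assumption: case is ignored
    if not symbols:
        raise ValueError("empty sequence")
    nucleotide_like = sum(symbols.count(base) for base in "ACGTN")
    return "dna" if nucleotide_like / len(symbols) > threshold else "protein"


print(guess_sequence_type("ACGTNNACGTAC"))   # dna
print(guess_sequence_type("MKVLAAGIVG"))     # protein
print(guess_sequence_type("GATTACA"))        # dna, even if meant as a peptide
```

              The last call shows why forcing the type can be necessary: a
              short peptide made only of G, A, T and C residues is
              indistinguishable from DNA under this rule.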

       -T | --targettype <dna | protein>
              Specify  the  target  type  to  use.   The  same  as --querytype
              (above), except that it applies to the target.   Specifying  the
              sequence  type  will  avoid  the  overhead of having to read the
              first sequence in the database twice (which may  be  significant
              with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
              These options facilitate running exonerate  on  compute  farms,
              and avoid having to  split  up  sequence  databases  into  small
              chunks  to  run on different nodes.  If, for example, you wished
              to split the target database into three  parts,  you  would  run
              three exonerate jobs on different nodes including the options:

              --targetchunkid 1 --targetchunktotal 3
              --targetchunkid 2 --targetchunktotal 3
              --targetchunkid 3 --targetchunktotal 3

              NB.  The  granularity offered by this option only goes down to a
              single sequence, so when there are more chunks than sequences in
              the database, some processes will do nothing.
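
              When many chunks are needed, the per-node command lines can be
              generated rather than typed.  The sketch below only builds the
              argument lists; the file names are placeholders, and how the
              jobs are submitted to a farm is left to the local scheduler.

```python
def chunk_commands(query, target, total):
    """Build one exonerate argument list per target chunk."""
    return [
        ["exonerate", "--query", query, "--target", target,
         "--targetchunkid", str(chunk_id),
         "--targetchunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


# Placeholder file names; one line is printed per job.
for argv in chunk_commands("ests.fa", "genome.fa", 3):
    print(" ".join(argv))
```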

       -V | --verbose <int>
              Be  verbose - show information about what is going on during the
              analysis.  The default is 1 (little information), the higher the
              number  given,  the more information is printed.  To silence all
              the   default   output   from   exonerate,   use   --verbose   0
              --showalignment no --showvulgar no

ANALYSIS OPTIONS

       -E | --exhaustive <boolean>
              Specify  whether or not exhaustive alignment should be used.  By
              default, this is FALSE, and alignment heuristics will  be  used.
              If   it  is  set  to  TRUE,  an  exhaustive  alignment  will  be
              calculated.  This requires quadratic time,  and  will  be  much,
              much  slower,  but will provide the optimal result for the given
              model.
       -B | --bigseq <int>
              Perform alignment of large (multi-megabase) sequences.  This  is
              very   memory   efficient  and  fast  when  both  sequences  are
              chromosome-sized,  but  does  not  currently  permit   the
              use of a word neighbourhood (ie. exactly matching seeds only).
       --forcescan <none | query | target>
              Force the FSM to scan the query sequence rather than the target.
              This option is useful, for example, if you have a  single  piece
              of  genomic  sequence and you wish to compare it to the whole of
              dbEST.  By scanning the database, rather  than  the  query,  the
              analysis  will  be completed much more quickly, as the overheads
              of multiple query FSM construction, multiple target reading  and
              splice  site predictions will be removed.  By default, exonerate
              will guess the  optimal  strategy  based  on  database  sequence
              sizes.
       --saturatethreshold <number>
              When  set  to  zero,  this option does nothing.  Otherwise, once
              more than this number of words  (in  addition  to  the  expected
              number of words by chance) have matched a position on the query,
              the position on the  query  will  be  ’numbed’  (ignore  further
              matches) for the current pairwise comparison.
       --customserver <command>
              NEW:  When  using  exonerate  in  client:server mode with a non-
              standard server, this  command  allows  you  to  send  a  custom
              command  to  the  server.   This  command  is sent by the client
              (exonerate) before any other commands, and is provided as a  way
              of  passing  parameters or other commands specific to the custom
              server.  See the exonerate-server man page for more  information
              on running exonerate in client:server mode.

FASTA DATABASE OPTIONS

       --fastasuffix <extension>
              If  any  of  the  inputs  given  with  --query  or  --target are
              directories,  then  exonerate  will  recursively  descend  these
              directories,  reading all files ending with this suffix as fasta
              format input.

GAPPED ALIGNMENT OPTIONS

       -m | --model <alignment model>
              Specify the  alignment  model  to  use.   The  models  currently
              supported are:
              ungapped
                     The   simplest  type  of  model,  used  by  default.   An
                     appropriate model will be selected automatically for  the
                     type of input sequences provided.
              ungapped:trans
                     This ungapped model includes translation of all frames of
                     both the query and target sequences.  This is similar  to
                     an ungapped tblastx type search.
              affine:global
                     This  performs  gapped  global  alignment, similar to the
                     Needleman-Wunsch  algorithm,  except  with  affine  gaps.
                     Global  alignment  requires  that  both  the sequences in
                     their entirety are included in the alignment.
              affine:bestfit
                     This performs a best fit or best  location  alignment  of
                     the  query  onto  the  target sequence.  The entire query
                     sequence will be included in the alignment, but only  the
                     best location for its alignment on the target sequence.
              affine:local
                     This  is local alignment with affine gaps, similar to the
                     Smith-Waterman-Gotoh   algorithm.    A    general-purpose
                     alignment  algorithm.   As  this  is local alignment, any
                     subsequence of the query and target sequence  may  appear
                     in the alignment.
              affine:overlap
                     This type of alignment finds the best overlap between the
                     query and target.  The overlap alignment must include the
                     start  of the query or target and the end of the query or
                     the target sequence, to align sequences which overlap  at
                     the  ends,  or  in the mid-section of a longer sequence.
                     This is the type of alignment frequently used in assembly
                     algorithms.
              est2genome
                     This  model  is similar to the affine:local model, but it
                     also includes intron modelling on the target sequence  to
                     allow  alignment of spliced to unspliced coding sequences
                     for both forward and reversed genes.  This is similar  to
                     the  alignment models used in programs such as EST_GENOME
                     and sim4.
              ner    NERs are non-equivalenced regions - large regions in both
                     the  query  and target which are not aligned.  This model
                     can  be  used  for  protein  alignments  where   strongly
                     conserved  helix  regions  will  be  aligned,  but weakly
                     conserved loop regions are not.   Similarly,  this  model
                     could  be  used to look for co-linearly conserved regions
                     in comparison of genomic sequences.
              protein2dna
                     This model compares a protein sequence to a DNA sequence,
                     incorporating all the appropriate gaps and frameshifts.
              protein2dna:bestfit
                     NEW:  This is a bestfit version of the protein2dna model,
                     with  which  the  entire  protein  is  included  in   the
                     alignment.   It  is  currently  only available when using
                     exhaustive alignment.
              protein2genome
                     This model allows alignment  of  a  protein  sequence  to
                     genomic  DNA.   This is similar to the protein2dna model,
                     with the addition of  modelling  of  introns  and  intron
                     phases.  This model is similar to those used by genewise.
              protein2genome:bestfit
                     NEW: This is a  bestfit  version  of  the  protein2genome
                     model,  with  which the entire protein is included in the
                     alignment.  It is currently  only  available  when  using
                     exhaustive alignment.
              coding2coding
                     This model is similar to the ungapped:trans model, except
                     that gaps and frameshifts are allowed.  It is similar  to
                     a gapped tblastx search.
              coding2genome
                     This  is similar to the est2genome model, except that the
                     query sequence is translated during comparison,  allowing
                     a more sensitive comparison.
              cdna2genome
                     This   combines   properties   of   the   est2genome  and
                     coding2genome models, to allow modelling of a whole  cDNA
                     where  a  central  coding  region  can be flanked by non-
                     coding UTRs.  When the CDS start and end are known,  they
                     may  be  specified  using  the  --annotation  option (see
                     below) to permit only the correct coding region to appear
                     in the alignment.
              genome2genome
                     This  model is similar to the coding2coding model, except
                     introns are modelled on  both  sequences.   (not  working
                     well yet)

              The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d, p2d:b,
              p2g, p2g:b, c2c, c2g, cd2g and g2g can also be used for
              specifying models.

       -s | --score <threshold>
              This is the overall score threshold.   Alignments  will  not  be
              reported  below  this  threshold.  For heuristic alignments, the
              higher this threshold, the less time the analysis will take.

       --percent <percentage>
              Report only alignments scoring at least this percentage  of  the
              maximal  score  for  each query.  eg. use --percent 90 to report
              alignments with 90% of the maximal  score  obtainable  for  that
              query.   This  option  is useful not only because it reduces the
              spurious matches in the output, but because it generates  query-
              specific  thresholds  (unlike  --score ) for a set of queries of
              differing  lengths,  and  will  also   speed   up   the   search
              considerably.   NB.   with this option, it is possible to have a
              cDNA match its corresponding gene exactly, yet still score  less
              than  100%,  due  to  the addition of the intron penalty scores,
              hence this option must be used with caution.

       --showalignment <boolean>
              Show the alignments in a human readable form.

       --showsugar <boolean>
              Display "sugar" output for ungapped alignments.  Sugar is Simple
              UnGapped  Alignment  Report,  which displays ungapped alignments
              one-per-line.  The sugar line starts with  the  string  "sugar:"
              for  easy  extraction from the output, and is followed by the
              following 9 fields in the order below:

              query_id        Query identifier
              query_start     Query position at alignment start
              query_end       Query position at alignment end
              query_strand    Strand of query matched
              target_id       |
              target_start    | the same 4 fields
              target_end      | for the target sequence
              target_strand   |
              score           The raw alignment score
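
              A sugar line is straightforward to parse.  The sketch below
              assumes whitespace-separated fields, identifiers without
              embedded whitespace and an integer score; the sample line is
              made up for illustration and is not real exonerate output.

```python
from typing import NamedTuple


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line):
    """Parse one "sugar:" line into a Sugar record."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    f = line[len(prefix):].split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, found %d" % len(f))
    return Sugar(f[0], int(f[1]), int(f[2]), f[3],
                 f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


# Made-up line for illustration only.
record = parse_sugar("sugar: est1 0 120 + chr1 5000 5120 + 600")
print(record.target_id, record.target_start, record.score)   # chr1 5000 600
```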

       --showcigar <boolean>
              Show the alignments in  "cigar"  format.   Cigar  is  a  Compact
              Idiosyncratic  Gapped  Alignment  Report,  which displays gapped
              alignments one-per-line.  The format  starts  with  the  same  9
              fields  as sugar output (see above), and is followed by a series
              of <operation, length> pairs where operation is  one  of  match,
              insert  or  delete, and the length describes the number of times
              this operation is repeated.

       --showvulgar <boolean>
              Shows the alignments in  "vulgar"  format.   Vulgar  is  Verbose
              Useful Labelled Gapped Alignment Report.  This format also starts
              with the same 9 fields as  sugar  output  (see  above),  and  is
              followed  by  a  series  of <label, query_length, target_length>
              triplets.  The label may be one of the following:

              M      Match
              C      Codon
              G      Gap
              N      Non-equivalenced region
              5      5’ splice site
              3      3’ splice site
              I      Intron
              S      Split codon
              F      Frameshift
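
              The triplets can be read back in groups of three.  The sketch
              below assumes whitespace-separated fields and uses a made-up
              sample line (not real exonerate output) containing two matched
              blocks separated by an intron on the target.

```python
def parse_vulgar(line):
    """Split a "vulgar:" line into its sugar fields and labelled triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    sugar, rest = fields[:9], fields[9:]
    if len(sugar) != 9 or len(rest) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                for i in range(0, len(rest), 3)]
    return sugar, triplets


# Made-up line for illustration only.
example = ("vulgar: est1 0 100 + chr1 1000 1350 + 450 "
           "M 60 60 5 0 2 I 0 246 3 0 2 M 40 40")
sugar, triplets = parse_vulgar(example)
query_advance = sum(q for _, q, _ in triplets)
target_advance = sum(t for _, _, t in triplets)
print(query_advance, target_advance)   # 100 350
```

              In this example the summed query and target lengths equal the
              spans given by the start and end fields (100 and 350), which is
              a convenient sanity check when post-processing output.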

       --showquerygff <boolean>
              Report GFF output for  features  on  the  query  sequence.   See
              http://www.sanger.ac.uk/Software/formats/GFF       for      more
              information.

       --showtargetgff <boolean>
              Report GFF output for features on the target sequence.

       --ryo <format>
              Roll-your-own output format.  This  allows  specification  of  a
              printf-esque   format  line  which  is  used  to  specify  which
              information to include in the output, and how it is to be shown.
              The format field may contain the following fields:

              %[qt][idlsSt]
                     For      either      {query,target},      report      the
                     {id,definition,length,sequence,Strand,type}.    Sequences
                     are reported in a fasta-format like block (no headers).
              %[qt]a[bels]
                     For  either  {query,target}  region  which  occurs in the
                     alignment, report the {begin,end,length,sequence}
              %[qt]c[bels]
                     For either {query,target}  region  which  occurs  in  the
                     coding    sequence   in   the   alignment,   report   the
                     {begin,end,length,sequence}
              %s     The raw score
              %r     The rank (in results from a bestn search)
              %m     Model name
              %e[tism]
                     Equivalenced {total,id,similarity,mismatches} (ie. %em ==
                     (%et - %ei))
              %p[is] Percent {id,similarity} over the equivalenced portions of
                     the alignment.  (ie. %pi == 100*(%ei / %et))
              %g     Gene orientation (’+’ = forward, ’-’  =  reverse,  ’.’  =
                     unknown)
              %S     Sugar  block  (the  9  fields  used  in sugar output, see
                     above)
              %C     Cigar block (the fields of a cigar line after  the  sugar
                     portion)
              %V     Vulgar block (the fields of a vulgar line after the sugar
                     portion)
              %%     Expands to a percentage sign (%)
              \n     Newline
              \t     Tab
              \\     Expands to a backslash (\)
              \{     Open curly brace
              \}     Close curly brace
              {      Begin per-transition output section
              }      End per-transition output section
              %P[qt][sabe]
                     Per-transition      output       for       {query,target}
                     {sequence,advance,begin,end}
              %P[nsl]
                     Per-transition output for {name,score,label}

       This  option  is  very useful and flexible.  For example, to report all
       the sections of query sequences which feature in  alignments  in  fasta
       format, use:

       --ryo ">%qi %qd\n%qas\n"

       To  output  all  the  symbols and scores in an alignment, try something
       like:

       --ryo "%V{%Pqs %Pts %Ps\n}"
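
       Tab-separated --ryo formats are convenient for post-processing.  As  a
       sketch, assume exonerate was run with --ryo "%qi\t%ti\t%et\t%ei\n"; the
       function below (a hypothetical helper, fed with a made-up line) derives
       the values that %em and %pi would report from %et and %ei.

```python
def identity_stats(ryo_line):
    """Derive %em and %pi from a line holding %qi, %ti, %et and %ei."""
    query_id, target_id, total, identical = ryo_line.rstrip("\n").split("\t")
    total, identical = int(total), int(identical)
    return {
        "query": query_id,
        "target": target_id,
        "mismatches": total - identical,            # %em == %et - %ei
        "percent_id": 100 * identical / total,      # %pi == 100 * (%ei / %et)
    }


# Made-up line for illustration only.
stats = identity_stats("est1\tchr1\t200\t190\n")
print(stats["mismatches"], stats["percent_id"])   # 10 95.0
```

       Only lines produced by the --ryo format are handled here; any other
       output enabled at the same time would need to be filtered out first.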

       -n | --bestn <number>
              Report the best N results for each query.  (Only results scoring
              better  than  the  score threshold will be reported).  The option
              reduces the amount of  output  generated,  and  also  allows
              exonerate to speed up the search.

       -S | --subopt <boolean>
              This option allows for the reporting of (Waterman-Eggert  style)
              suboptimal  alignments.   (It is on by default.)  All suboptimal
              (ie. non-intersecting) alignments will be reported for each pair
              of sequences scoring at least the threshold provided by --score.

              When this option is used  with  exhaustive  alignments,  several
              full quadratic time passes will be required, so the running time
              will be considerably increased.

       -g | --gappedextension <boolean>
              Causes a gapped extension stage  to  be  performed  ie.  dynamic
              programming  is  applied  in  arbitrarily shaped and dynamically
              sized regions surrounding HSP seeds.  The extension threshold is
              controlled by the --extensionthreshold option.

              Although  sometimes  slower than BSDP, gapped extension improves
              sensitivity with weak, gap-rich alignments such as during cross-
              species comparison.

              NB.  This  option is now the default. Set it to false to  revert
              to the old BSDP type alignments.  This option may be slower than
              BSDP for some large scale analyses with simple alignment models.

       --refine <strategy>
              Force exonerate to refine  alignments  generated  by  heuristics
              using  dynamic programming over larger regions.  This takes more
              time, but improves the quality of the final alignments.

              The strategies available for refinement are:

              none   The default - no refinement is used.
              full   An exhaustive alignment is calculated from  the  pair  of
                     sequences in their entirety.
              region DP is applied just to the region of the sequences covered
                     by the heuristic alignment.

       --refineboundary <size>
              Specify an extra boundary to be included in the  region  subject
              to alignment during refinement by region.

VITERBI ALGORITHM OPTIONS

       -D | --dpmemory <Mb>
              The  exhaustive  alignment traceback routines use a Hughey-style
              reduced memory technique.  This option specifies how much memory
              will  be used for this.  Generally, the more memory is permitted
              here, the faster the alignments will be produced.

CODE GENERATION OPTIONS

       -C | --compiled <boolean>
              This option allows  disabling  of  generated  code  for  dynamic
              programming.  It is mainly used during development of exonerate.
              When set to FALSE,  an  "interpreted"  version  of  the  dynamic
              programming implementation is used, which is much slower.

HEURISTIC OPTIONS

       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
              These  options are used to specify the size of the sub-alignment
              regions to which DP is applied around  the  ends  of  the  HSPs.
              This can be at the HSP ends (terminal range), between HSPs (join
              range), or between HSPs which may be connected by a large region
              such  as  an  intron  or  non-equivalenced  region (span range).
              These ranges can be specified for a number of matches back  onto
              the HSP (internal range) or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS

       -x | --extensionthreshold <score>
              This is the amount by which the score will be allowed to degrade
              during SDP.  This is the equivalent of the hspdropoff penalties,
              except  it  is  applied  during  dynamic  programming,  not  HSP
              extension.  Decreasing this parameter will increase the speed of
              the SDP, and increasing it will increase the sensitivity.

       --singlepass  <boolean>
              By  default  the  suboptimal  SDP  alignments  are reported by a
              singlepass algorithm, which may miss some suboptimal  alignments
              that  are  close together.  This option can be used to force the
              use of a  multipass  suboptimal  alignment  algorithm  for  SDP,
              resulting in higher quality suboptimal alignments.

BSDP OPTIONS

       --joinfilter <limit>
              (experimental)

              Only consider this number of SARs for joining HSPs together.  The
              SARs with the highest potential for appearing in a  high-scoring
              alignment  are  considered.   This option is useful for limiting
              time  and  memory  usage  when  searching  unmasked  data   with
              repetitive  sequences,  but  should not be set too low, as valid
              matches may be ignored.  Something like --joinfilter 32 seems to
              work well.

SEQUENCE OPTIONS

       --annotation <path>
              Specify  basic  sequence  annotation  information.  This is most
              useful with the cdna2genome model,  but  will  work  with  other
              models.  The annotation file contains four fields per line:

              <id> <strand> <cds_start> <cds_length>

              Here is a simple example of such a file for 4 cDNAs:

              dhh.human.cdna + 308 1191
              dhh.mouse.cdna + 250 1191
              csn7a.human.cdna + 178 828
              csn7a.mouse.cdna + 126 828

              These  annotation  lines  will also work when only the first two
              fields are used.  This can be used when specifying which  strand
              of a specific sequence should be included in a comparison.
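
              An annotation file can be checked before use with a few lines of
              code.  The sketch below assumes whitespace-separated fields and
              accepts the two-field and four-field forms described above; the
              two-field line in the sample is invented to show that form.

```python
def parse_annotation(text):
    """Parse annotation lines into a dict keyed by sequence id."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if len(fields) == 2:
            seq_id, strand = fields
            records[seq_id] = (strand, None, None)
        elif len(fields) == 4:
            seq_id, strand, cds_start, cds_length = fields
            records[seq_id] = (strand, int(cds_start), int(cds_length))
        else:
            raise ValueError("expected 2 or 4 fields: %r" % line)
    return records


annotation = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
example.strand.only -
"""
for seq_id, record in parse_annotation(annotation).items():
    print(seq_id, record)
```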

SYMBOL COMPARISON OPTIONS

       --softmaskquery <boolean>
              Indicate  that  the  query is softmasked.  See description below
              for --softmasktarget
       --softmasktarget <boolean>
              Indicate  that  the  target  is  softmasked.   In  a  softmasked
              sequence  file,  instead of masking regions by Ns or Xs they are
              masked by putting those regions in lower case (and with unmasked
              regions  in  upper  case).  This option allows the masking to be
              ignored by some parts of the program,  combining  the  speed  of
              searching  masked  data  with  sensitivity of searching unmasked
              data.   The  utility  fastasoftmask,  which  is  supplied   with
              exonerate,  can  be  used for producing softmasked sequence from
              conventionally masked sequence.
       -d | --dnasubmat <name>
              Specify  the   substitution   matrix  to  be   used   for    DNA
              comparison.   This  should be a path to a substitution matrix in
              the same format as that which is used by blast.
       -p | --proteinsubmat <name>
              Specify the  substitution  matrix  to  be  used   for   protein
              comparison.   (Both  DNA  and  protein substitution matrices are
              required for some types of analysis).  The use  of  the  special
              names,  nucleic,  blosum62,  pam250, edit or identity will cause
              built-in substitution matrices to be used.

ALIGNMENT SEEDING OPTIONS

       -M | --fsmmemory <Mb>
              Specify the amount of memory to use for  the  FSM  in  heuristic
              analyses.   exonerate multiplexes the query to accelerate large-
              throughput database queries.  This figure should always be  less
              than  the  physical  memory  on  the machine, but when searching
              large databases, generally, the more memory  it  is  allowed  to
              use, the faster it will go.
       --forcefsm <none | normal | compact>
              Force the use of more compact finite state machines for analyses
              involving big  sequences  and  large  word  neighbourhoods.   By
              default, exonerate will pick a sensible strategy, so this option
              will rarely need to be set.
       --wordjump <int>
              The  jump  between  query  words  used   to   yield   the   word
              neighbourhood.   If  set  to 1, every word is used, if set to 2,
              every other word is used, and if set  to  the  wordlength,  only
              non-overlapping  words  will  be  used.  This option reduces the
              memory requirements when using very large query  sequences,  and
              makes  the  search  run  faster,  but  it  also  damages  search
              sensitivity when high values are set.
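
              The effect of the jump on the set of query words can be pictured
              with a small sketch.  It assumes that words are taken starting
              from the first position of the query; how exonerate chooses the
              starting offset internally is not stated here.

```python
def query_words(query, wordlength, wordjump=1):
    """List the query words taken at every wordjump-th position."""
    last_start = len(query) - wordlength
    return [query[i:i + wordlength] for i in range(0, last_start + 1, wordjump)]


query = "ACGTACGTAC"
print(len(query_words(query, 4, 1)))   # 7: every word
print(query_words(query, 4, 2))        # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
print(query_words(query, 4, 4))        # ['ACGT', 'ACGT']: non-overlapping
```

              Fewer words mean a smaller word neighbourhood to store and scan,
              at the cost of fewer seeds from which alignments can start.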

AFFINE MODEL OPTIONS

       -o | --gapopen <penalty>
              This is the gap open penalty.
       -e | --gapextend <penalty>
              This is the gap extension penalty.
       --codongapopen <penalty>
              This is the codon gap open penalty.
       --codongapextend <penalty>
              This is the codon gap extension penalty.

NER OPTIONS

       --minner <length>
              Minimum NER length allowed.
       --maxner <length>
              Maximum NER  length  allowed.   NB.  this  option  only  affects
              heuristic alignments.
       --neropen <penalty>
              Penalty for opening a non-equivalenced region.

INTRON MODELLING OPTIONS

       --minintron <length>
              Minimum  intron  length  limit.   NB.  this  option only affects
              heuristic alignments.  This is  not  a  hard  limit  -  it  only
              affects  size  of  introns  which  are  sought  during heuristic
              alignment.
       --maxintron <length>
              Maximum intron length limit.  See notes above for --minintron.
       -i | --intronpenalty <penalty>
              Penalty for introduction of an intron.

FRAMESHIFT MODELLING OPTIONS

       -f | --frameshift <penalty>
              The penalty for the inclusion of a frameshift in an alignment.

ALPHABET OPTIONS

       --useaatla <boolean>
              Use three-letter abbreviations for AA names, i.e. when
              displaying an alignment, "Met" is used instead of " M ".

TRANSLATION OPTIONS

       --geneticcode <code>
              NEW:  Specify an alternative genetic code.  The default code (1)
              is the standard  genetic  code.   Other  genetic  codes  may  be
              specified in shorthand or longhand form.

              In  shorthand form, a number between 1 and 23 is used to specify
              one of 17 built-in genetic code  variants.   These  are  genetic
              code variants taken from:

              http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

              These are:
              1      The Standard Code
              2      The Vertebrate Mitochondrial Code
              3      The Yeast Mitochondrial Code
              4      The  Mold, Protozoan, and Coelenterate Mitochondrial Code
                     and the Mycoplasma/Spiroplasma Code
              5      The Invertebrate Mitochondrial Code
              6      The Ciliate, Dasycladacean and Hexamita Nuclear Code
              9      The Echinoderm and Flatworm Mitochondrial Code
              10     The Euplotid Nuclear Code
              11     The Bacterial and Plant Plastid Code
              12     The Alternative Yeast Nuclear Code
              13     The Ascidian Mitochondrial Code
              14     The Alternative Flatworm Mitochondrial Code
              15     Blepharisma Nuclear Code
              16     Chlorophycean Mitochondrial Code
              21     Trematode Mitochondrial Code
              22     Scenedesmus obliquus mitochondrial Code
              23     Thraustochytrium Mitochondrial Code

              In longhand form, a genetic code variant may be provided as a 64
              byte string in TCAG order, eg. the standard genetic code in this
              form would be:

              FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
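
              A longhand string can be read as a lookup table indexed by
              codon.  The sketch below assumes the usual reading of "TCAG
              order", in which the first base of the codon varies slowest and
              the third fastest; the function names are hypothetical and are
              not part of exonerate.

```python
BASES = "TCAG"
STANDARD_CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate_codon(codon: str, code: str = STANDARD_CODE) -> str:
    """Look up one codon in a 64-character genetic code string (TCAG order)."""
    if len(code) != 64:
        raise ValueError("a longhand genetic code must be exactly 64 characters")
    codon = codon.upper().replace("U", "T")
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    # Assumption: first base varies slowest, third base fastest.
    first, second, third = (BASES.index(base) for base in codon)
    return code[16 * first + 4 * second + third]


def translate(dna: str, code: str = STANDARD_CODE) -> str:
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    return "".join(translate_codon(dna[i:i + 3], code)
                   for i in range(0, len(dna) - 2, 3))


peptide = translate("ATGGCCTGA")  # Met, Ala, stop under the standard code
```

              Passing a different 64-character string as the code argument
              changes the translation in the same way that a longhand value
              would be expected to for --geneticcode.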

HSP CREATION OPTIONS

       --hspfilter <threshold>
              Use  aggressive  HSP  filtering  to speed up heuristic searches.
              The threshold specifies the number of HSPs centred about a point
              in  the query which will be stored.  Any lower scoring HSPs will
              be discarded.  This is an experimental option  to  handle  speed
              problems  caused  by some sequences.  A value of about 100 seems
              to work well.
       --useworddropoff <boolean>
              When this is TRUE, the score threshold for admitting words  into
              the word neighbourhood is set to be the initial word score minus
              the word threshold (see below).  This strategy is designed to
              prevent over-restriction of the word neighbourhood.  When this
              is FALSE, the word threshold is taken to be an absolute value.
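
              The two interpretations of the word threshold can be written
              out as a small function.  This is a sketch of the rule as
              stated here; the function and parameter names are hypothetical,
              and word_limit stands for the value given to one of the
              --*wordlimit options.

```python
def word_admission_threshold(initial_word_score: int,
                             word_limit: int,
                             use_word_dropoff: bool) -> int:
    """Return the minimum score a word needs to enter the neighbourhood.

    Hypothetical helper illustrating the documented rule only.
    """
    if use_word_dropoff:
        # Relative threshold: allowed drop below the initial word's own score.
        return initial_word_score - word_limit
    # Absolute threshold: the limit is used as given.
    return word_limit


relative = word_admission_threshold(60, 4, use_word_dropoff=True)
absolute = word_admission_threshold(60, 4, use_word_dropoff=False)
```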
       --seedrepeat <count>
              NEW: The seedrepeat parameter sets the  number  of  seeds  which
              must  be  found on the same diagonal or reading frame before HSP
              extension will occur.  Increasing  the  value  for  --seedrepeat
              will  speed  up  searches,  and  is usually a better option than
              using  longer  word  lengths,  particularly   when   using   the
              exonerate-server   where   increasing   word   lengths  requires
              recomputing the index, and greatly increases memory
              requirements.
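
              The idea of requiring several seeds on one diagonal can be
              sketched as below.  This is a simplified illustration under
              stated assumptions: seeds are (query position, target position)
              pairs, a diagonal is the difference between the two, and
              reading frames and any limit on the distance between seeds are
              ignored.  The function name is hypothetical.

```python
from collections import Counter


def diagonals_to_extend(seeds, seedrepeat=1):
    """Return the diagonals carrying at least `seedrepeat` seeds.

    seeds: iterable of (query_pos, target_pos) word matches.
    A diagonal is identified here by target_pos - query_pos.
    """
    counts = Counter(target_pos - query_pos for query_pos, target_pos in seeds)
    return sorted(diag for diag, n in counts.items() if n >= seedrepeat)


seeds = [(0, 10), (4, 14), (8, 18), (2, 40)]
permissive = diagonals_to_extend(seeds, seedrepeat=1)  # every seeded diagonal
stricter = diagonals_to_extend(seeds, seedrepeat=2)    # isolated seed dropped
```

              Raising the count removes diagonals supported by a single
              chance word match, so fewer HSP extensions are attempted.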
       -w | --dnawordlen <bases>
       -W | --proteinwordlen <residues>
       -W | --codonwordlen <bases>
              The word length used for DNA, protein or codon words.  When
              performing DNA vs protein comparisons, the DNA wordlength will
              always (automatically) be triple the protein wordlength.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
              The  amount  by  which  an  HSP score will be allowed to degrade
              during HSP extension.  Separate thresholds can be set for DNA,
              protein or codon comparisons.
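
              A drop-off of this kind is commonly implemented as an ungapped
              extension that stops once the running score has fallen too far
              below the best score seen.  The sketch below shows that generic
              technique for extension in one direction only; it is not taken
              from the exonerate source, and the match and mismatch scores
              are placeholder values.

```python
def extend_right(query, target, q_start, t_start, dropoff, match=5, mismatch=-4):
    """Extend an ungapped match rightwards until the score degrades too far.

    Returns (best_score, best_length).  The scoring values are illustrative
    assumptions, not exonerate defaults.
    """
    score = best = 0
    length = best_length = 0
    while q_start + length < len(query) and t_start + length < len(target):
        same = query[q_start + length] == target[t_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > dropoff:
            break  # degraded by more than the allowed drop-off
    return best, best_length


query, target = "ACGTAACGT", "ACGTTTCGT"
short_hsp = extend_right(query, target, 0, 0, dropoff=5)   # stops at the mismatches
long_hsp = extend_right(query, target, 0, 0, dropoff=10)   # extends across them
```

              A larger drop-off lets the extension cross short mismatched
              stretches at the cost of more work per seed.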
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
              The  HSP score thresholds.  An HSP must score at least this much
              before it will be reported  or  be  used  in  preparation  of  a
              heuristic alignment.
       --dnawordlimit  <score>
       --proteinwordlimit  <score>
       --codonwordlimit  <score>
              The threshold for admitting DNA, protein or codon words into
              the word
              neighbourhood.  The behaviour of this option is altered  by  the
              --useworddropoff option (see above).

       --geneseed <threshold>
              Exclude  HSPs  from  gapped  alignment  computation which cannot
              feature in an alignment containing at least one HSP scoring at
              least this threshold.

              This  option provides considerable speed up for gapped alignment
              computation, but may cause some very gap-rich alignments  to  be
              missed.

              It is useful when aligning similar sequences back onto a genome
              quickly, e.g. try --geneseed 250
       --geneseedrepeat <count>
              NEW:  The  geneseedrepeat  parameter  is  like  the   seedrepeat
              parameter,  but  is  only  applied when looking for the geneseed
              HSPs.  Using a larger value for --geneseedrepeat will speed up
              searches   when   the   --geneseed   parameter   is  also  used.
              (experimental, implementation incomplete)

ALIGNMENT OPTIONS

       --alignmentwidth <width>
              Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
              By default, all coordinates are reported on the forward  strand.
              Setting  this  option  to  false  reverts  to  the old behaviour
              (pre-0.8.3) whereby alignments on the reverse  complement  of  a
              sequence   are   reported   using  coordinates  on  the  reverse
              complement.
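
              In the in-between coordinate system, a position p on the
              reverse complement of a sequence of length n corresponds to
              position n - p on the forward strand.  The sketch below applies
              that arithmetic to an interval; the helper names are
              hypothetical, and the order in which exonerate itself prints
              the start and end of a reverse-strand match is not addressed
              here.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def to_forward(start: int, end: int, seq_length: int) -> tuple[int, int]:
    """Map an in-between interval on the reverse complement to the forward strand."""
    return seq_length - end, seq_length - start


forward = "AACGG"
rc = revcomp(forward)                                 # "CCGTT"
fwd_start, fwd_end = to_forward(0, 2, len(forward))   # interval covering rc[0:2]
```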

SUB-ALIGNMENT REGION OPTIONS

       --quality <percent>
              This option  excludes  HSPs  from  BSDP  when  their  components
              outside of the SARs fall below this quality threshold.

SPLICE SITE PREDICTION OPTIONS

       --splice3 <path>
       --splice5 <path>
              NEW:  Provide a file containing a custom PSSM (position specific
              score matrix) for prediction of the intron splice sites.

              The file format for splice data is simple: lines beginning  with
              '#' are comments, a line containing just the word 'splice'
              denotes the position of the splice site,  and  the  other  lines
              show the observed relative frequencies of the bases flanking the
              splice sites in the chosen organism (in ACGT order).

              Example 5’ splice data file:

               # start of example 5’ splice data
               # A C G T
               28 40  17  14
               59 14  13  14
                8  5  81   6
               splice
                0  0 100   0
                0  0   0 100
               54  2  42   2
               74  8  11   8
                5  6  85   4
               16 18  21  45
               # end of example 5’ splice data

              Example 3’ splice data file:

               # start of example 3’ splice data
               # A C G T
                10  31  14  44
                 8  36  14  43
                 6  34  12  48
                 6  34   8  52
                 9  37   9  45
                 9  38  10  44
                 8  44   9  40
                 9  41   8  41
                 6  44   6  45
                 6  40   6  48
                23  28  26  23
                 2  79   1  18
               100   0   0   0
                 0   0 100   0
               splice
                28  14  47  11
               # end of example 3’ splice data
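
              The file format described above is simple enough to check with
              a few lines of code before passing a file to --splice5 or
              --splice3.  The parser below is a sketch written from that
              description alone; the function name is hypothetical and the
              validation it performs (four numeric columns, exactly one
              marker line) is an assumption about what a well-formed file
              looks like.

```python
def parse_splice_data(text: str):
    """Parse splice data: '#' comments, one 'splice' marker, rows in ACGT order.

    Returns (rows, splice_index), where rows is a list of dicts keyed by base
    and splice_index is the number of rows preceding the 'splice' line.
    """
    rows = []
    splice_index = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "splice":
            if splice_index is not None:
                raise ValueError("more than one 'splice' line")
            splice_index = len(rows)
            continue
        values = [float(field) for field in line.split()]
        if len(values) != 4:
            raise ValueError(f"expected 4 columns (A C G T), got: {line!r}")
        rows.append(dict(zip("ACGT", values)))
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index


EXAMPLE_5_PRIME = """\
# start of example 5' splice data
# A C G T
28 40  17  14
59 14  13  14
 8  5  81   6
splice
 0  0 100   0
 0  0   0 100
54  2  42   2
74  8  11   8
 5  6  85   4
16 18  21  45
# end of example 5' splice data
"""

rows, splice_index = parse_splice_data(EXAMPLE_5_PRIME)
```

              For the 5’ example, the two rows immediately after the marker
              give G and T a relative frequency of 100, which corresponds to
              the conserved GT at the start of the intron.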

       --forcegtag <boolean>
              Only allow splice sites at gt....ag sites (or ct....ac sites
              when the gene is reversed).  With this restriction in place, the
              splice site prediction scores  are  still  used  and  allow  tie
              breaking when there is more than one possible splice site.

STRATEGIES FOR SPEED

       Keep all data on local disks.

       Apply  the  highest  acceptable score thresholds using a combination of
       --score, --percent and --bestn.

       Repeat mask and dust the genomic (target)  sequence.   (Softmask  these
       sequences and use --softmasktarget).

       Increase the --fsmmemory option to allow more query multiplexing.

       Increase the value for --seedrepeat.

       When  using  an  alignment  model containing introns, set --geneseed as
       high as possible.

       If you are compiling exonerate yourself, see the README  file  supplied
       with the source code for details of compile-time optimisations.

STRATEGIES FOR SENSITIVITY

       Not documented yet.

       Increase the word neighbourhood.  Decrease the HSP threshold.  Increase
       the SAR ranges.  Run exhaustively.

ENVIRONMENT

       Not documented yet.

EXAMPLES

       exonerate cdna.fasta genomic.fasta
              This is the simplest way in which exonerate may be used.  By
              default,
              an ungapped alignment model will be used.

       exonerate     --exhaustive     y    --model    est2genome    cdna.fasta
       genomic.masked.fasta
              Exhaustively align cdnas to  genomic  sequence.   This  will  be
              much,  much  slower,  but  more  accurate.   This  option causes
              exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
              If the affine:local model is used with exhaustive alignment, you
              have the Smith-Waterman algorithm.

       exonerate     --exhaustive    --model    affine:global    protein.fasta
       protein.fasta
              Switch to a global model, and you have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped  no  --showhsp  yes  protein.fasta
       genome.fasta
              Generate ungapped Protein:DNA alignments.

       exonerate    --model    coding2coding   --score   1000   --bigseq   yes
       --proteinhspthreshold 90 chr21.fa chr22.fa
              Perform quick-and-dirty translated  pairwise  alignment  of  two
              very large DNA sequences.

       Many similar combinations should work.  Try them out.

VERSION

       This  documentation accompanies version 2.2.0 of the exonerate package.

AUTHOR

       Guy St.C. Slater.  <guy@ebi.ac.uk>.  See the AUTHORS file  accompanying
       the source code for a list of contributors.

AVAILABILITY

       The source code for the exonerate package is available under the terms
       of the GNU General Public Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has been developed as part of the Ensembl project.  Please
       see http://www.ensembl.org/ for more information.

SEE ALSO

       exonerate-server(1), ipcress(1), blast(1L).