NAME
exonerate - a generic tool for sequence comparison
SYNOPSIS
exonerate [ options ] <query path> <target path>
DESCRIPTION
exonerate is a general tool for sequence comparison.
It uses the C4 dynamic programming library. It is designed to be both
general and fast. It can produce either gapped or ungapped alignments,
according to a variety of different alignment models. The C4 library
allows sequence alignment using a reduced space full dynamic
programming implementation, but also allows automated generation of
heuristics from the alignment models, using bounded sparse dynamic
programming, so that these alignments may also be rapidly generated.
Alignments generated using these heuristics will represent a valid path
through the alignment model, yet (unlike the exhaustive alignments),
the results are not guaranteed to be optimal.
CONVENTIONS
A number of conventions (and idiosyncrasies) are used within exonerate.
An understanding of them facilitates interpretation of the output.
Coordinates
An in-between coordinate system is used, where the positions are
counted between the symbols, rather than on the symbols. This
numbering scheme starts from zero. This numbering is shown
below for the sequence "ACGT":
A C G T
0 1 2 3 4
Hence the subsequence "CG" would have start=1, end=3, and
length=2. This coordinate system is used internally in
exonerate, and for all the output formats produced with the
exception of the "human readable" alignment display and the GFF
output where convention and standards dictate otherwise.
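The in-between scheme behaves like Python slice indexing, so a short sketch may help. The helper names below are illustrative only and are not part of exonerate; the conversion assumes a target format that numbers the symbols themselves, starting from one, with inclusive ends.

```python
def subseq(seq, start, end):
    """Return the symbols lying between in-between positions start and end."""
    return seq[start:end]


def to_one_based(start, end):
    """Convert an in-between interval to a 1-based inclusive interval.

    The first symbol after position `start` is symbol number start + 1,
    and the last symbol before position `end` is symbol number `end`.
    """
    return start + 1, end


seq = "ACGT"
start, end = 1, 3
print(subseq(seq, start, end))    # CG
print(end - start)                # 2 (the length)
print(to_one_based(start, end))   # (2, 3)
```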
Reverse Complements
When an alignment is reported on the reverse complement of a
sequence, the coordinates are simply given on the reverse
complement copy of the sequence. Hence positions on the
sequences are never negative. Generally, the forward strand is
indicated by ’+’, the reverse strand by ’-’, and an unknown or
not-applicable strand (as in the case of a protein sequence) is
indicated by ’.’
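As a sketch of how coordinates given on the reverse complement copy relate to the original sequence, the in-between positions can be mirrored about the sequence length. The function names are hypothetical and the code is not taken from exonerate.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def flip_interval(start, end, length):
    """Map an in-between interval onto the opposite strand copy.

    Position p on one copy corresponds to position (length - p) on the
    other, so the start and end swap roles.
    """
    return length - end, length - start


seq = "ACGTTG"
rc = reverse_complement(seq)                 # CAACGT
start, end = flip_interval(0, 2, len(seq))   # (4, 6)
# The same bases are described on both copies:
print(rc[0:2], reverse_complement(seq[start:end]))   # CA CA
```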
Alignment Scores
Currently, only the raw alignment scores are displayed. This
score is just the sum of the transition scores used in the dynamic
programming. For example, in the case of a Smith-Waterman
alignment, this will be the sum of the substitution matrix
scores and the gap penalties.
GENERAL OPTIONS
Most arguments have short and long forms. The long forms are more
likely to be stable over time, and hence should be used in scripts
which call exonerate.
-h | --shorthelp <boolean>
Show help. This will display a concise summary of the available
options, defaults and values currently set.
--help <boolean>
This shows all the help options including the defaults, the
value currently set, and the environment variable which may be
used to set each parameter. There will be an indication of
which options are mandatory. Mandatory options have no default,
and must have a value supplied for exonerate to run. If
mandatory options are used in order, their flags may be skipped
from the command line (see examples below). Unlike this man
page, the information from this option will always be up to date
with the latest version of the program.
-v | --version <boolean>
Display the version number. Also displays other information
such as the build date and glib version used.
SEQUENCE INPUT OPTIONS
Pairwise comparisons will be performed between all query sequences and
all target sequences. Generally, for the best performance, shorter
sequences (eg. ESTs, shotgun reads, proteins) should be used as the
query sequences, and longer sequences (eg. genomic sequences) should be
used as the target sequences.
-q | --query <paths>
Specify the query sequences required. These must be in a FASTA
format file. Single or multiple query sequences may be
supplied. Additionally, multiple FASTA files may be supplied
following a --query flag, or by using multiple --query flags.
-t | --target <paths>
Specify the target sequences required. These must also be in a
FASTA format file. As with the query sequences, single or multiple
target sequences and files may be supplied. NEW: the target
filename may be replaced by a server name and port number in the
form of hostname:port when using exonerate-server. See the man
page for exonerate-server for more information on running
exonerate in client:server mode.
-Q | --querytype <dna | protein>
Specify the query type to use. If this is not supplied, the
query type is assumed to be DNA when the first sequence in the
file contains more than 85% [ACGTN] bases. Otherwise, it is
assumed to be peptide. This option forces the query type as
some nucleotide and peptide sequences can fall either side of
this threshold.
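The 85% rule can be sketched as follows. This is an illustration of the rule as stated above, not exonerate's own code; treating lower-case symbols like upper-case ones is an assumption.

```python
def guess_sequence_type(seq, threshold=0.85):
    """Guess 'dna' or 'protein' from the fraction of [ACGTN] symbols."""
    if not seq:
        raise ValueError("cannot guess the type of an empty sequence")
    # Assumption: case is ignored when counting nucleotide symbols.
    dna_like = sum(1 for symbol in seq.upper() if symbol in "ACGTN")
    return "dna" if dna_like / len(seq) > threshold else "protein"


print(guess_sequence_type("ACGTNACGTA"))   # dna
print(guess_sequence_type("ACDEFGHIKL"))   # protein
# A short peptide made only of A, C, G and T residues is misclassified,
# which is the situation --querytype is meant to handle:
print(guess_sequence_type("GATTACA"))      # dna
```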
-T | --targettype <dna | protein>
Specify the target type to use. The same as --querytype
(above), except that it applies to the target. Specifying the
sequence type will avoid the overhead of having to read the
first sequence in the database twice (which may be significant
with chromosome-sized sequences).
--querychunkid <id>
--querychunktotal <total>
--targetchunkid <id>
--targetchunktotal <total>
These options facilitate running exonerate on compute farms,
and avoid having to split up sequence databases into small
chunks to run on different nodes. If, for example, you wished
to split the target database into three parts, you would run
three exonerate jobs on different nodes including the options:
--targetchunkid 1 --targetchunktotal 3
--targetchunkid 2 --targetchunktotal 3
--targetchunkid 3 --targetchunktotal 3
NB. The granularity offered by this option only goes down to a
single sequence, so when there are more chunks than sequences in
the database, some processes will do nothing.
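A job submission script might generate one command line per chunk. The sketch below only assembles argument lists using the flags documented above; the file names are placeholders and the way jobs are dispatched to nodes is left out.

```python
def chunk_commands(query, target, total):
    """Build one exonerate argument list per target chunk (ids start at 1)."""
    return [
        ["exonerate", "--query", query, "--target", target,
         "--targetchunkid", str(chunk_id),
         "--targetchunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


# Placeholder file names, three chunks as in the example above.
for argv in chunk_commands("cdna.fasta", "genomic.fasta", 3):
    print(" ".join(argv))
```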
-V | --verbose <int>
Be verbose - show information about what is going on during the
analysis. The default is 1 (little information); the higher the
number given, the more information is printed. To silence all
the default output from exonerate, use --verbose 0
--showalignment no --showvulgar no
ANALYSIS OPTIONS
-E | --exhaustive <boolean>
Specify whether or not exhaustive alignment should be used. By
default, this is FALSE, and alignment heuristics will be used.
If it is set to TRUE, an exhaustive alignment will be
calculated. This requires quadratic time, and will be much,
much slower, but will provide the optimal result for the given
model.
-B | --bigseq <int>
Perform alignment of large (multi-megabase) sequences. This is
very memory efficient and fast when both sequences are
chromosome-sized, but does not currently permit the
use of a word neighbourhood (ie. exactly matching seeds only).
--forcescan <none | query | target>
Force the FSM to scan the query sequence rather than the target.
This option is useful, for example, if you have a single piece
of genomic sequence and you wish to compare it to the whole of
dbEST. By scanning the database, rather than the query, the
analysis will be completed much more quickly, as the overheads
of multiple query FSM construction, multiple target reading and
splice site predictions will be removed. By default, exonerate
will guess the optimal strategy based on database sequence
sizes.
--saturatethreshold <number>
When set to zero, this option does nothing. Otherwise, once
more than this number of words (in addition to the expected
number of words by chance) have matched a position on the query,
the position on the query will be ’numbed’ (ignore further
matches) for the current pairwise comparison.
--customserver <command>
NEW: When using exonerate in client:server mode with a non-
standard server, this command allows you to send a custom
command to the server. This command is sent by the client
(exonerate) before any other commands, and is provided as a way
of passing parameters or other commands specific to the custom
server. See the exonerate-server man page for more information
on running exonerate in client:server mode.
FASTA DATABASE OPTIONS
--fastasuffix <extension>
If any of the inputs given with --query or --target are
directories, then exonerate will recursively descend these
directories, reading all files ending with this suffix as fasta
format input.
GAPPED ALIGNMENT OPTIONS
-m | --model <alignment model>
Specify the alignment model to use. The models currently
supported are:
ungapped
The simplest type of model, used by default. An
appropriate model will be selected automatically for the
type of input sequences provided.
ungapped:trans
This ungapped model includes translation of all frames of
both the query and target sequences. This is similar to
an ungapped tblastx type search.
affine:global
This performs gapped global alignment, similar to the
Needleman-Wunsch algorithm, except with affine gaps.
Global alignment requires that both the sequences in
their entirety are included in the alignment.
affine:bestfit
This performs a best fit or best location alignment of
the query onto the target sequence. The entire query
sequence will be included in the alignment, but only the
best location for its alignment on the target sequence.
affine:local
This is local alignment with affine gaps, similar to the
Smith-Waterman-Gotoh algorithm. A general-purpose
alignment algorithm. As this is local alignment, any
subsequence of the query and target sequence may appear
in the alignment.
affine:overlap
This type of alignment finds the best overlap between the
query and target. The overlap alignment must include the
start of the query or target and the end of the query or
the target sequence, to align sequences which overlap at
the ends, or in the mid-section of a longer sequence.
This is the type of alignment frequently used in assembly
algorithms.
est2genome
This model is similar to the affine:local model, but it
also includes intron modelling on the target sequence to
allow alignment of spliced to unspliced coding sequences
for both forward and reversed genes. This is similar to
the alignment models used in programs such as EST_GENOME
and sim4.
ner
NERs are non-equivalenced regions - large regions in both
the query and target which are not aligned. This model
can be used for protein alignments where strongly
conserved helix regions will be aligned, but weakly
conserved loop regions are not. Similarly, this model
could be used to look for co-linearly conserved regions
in comparison of genomic sequences.
protein2dna
This model compares a protein sequence to a DNA sequence,
incorporating all the appropriate gaps and frameshifts.
protein2dna:bestfit
NEW: This is a bestfit version of the protein2dna model,
with which the entire protein is included in the
alignment. It is currently only available when using
exhaustive alignment.
protein2genome
This model allows alignment of a protein sequence to
genomic DNA. This is similar to the protein2dna model,
with the addition of modelling of introns and intron
phases. This model is similar to those used by genewise.
protein2genome:bestfit
NEW: This is a bestfit version of the protein2genome
model, with which the entire protein is included in the
alignment. It is currently only available when using
exhaustive alignment.
coding2coding
This model is similar to the ungapped:trans model, except
that gaps and frameshifts are allowed. It is similar to
a gapped tblastx search.
coding2genome
This is similar to the est2genome model, except that the
query sequence is translated during comparison, allowing
a more sensitive comparison.
cdna2genome
This combines properties of the est2genome and
coding2genome models, to allow modelling of a whole cDNA
where a central coding region can be flanked by non-coding
UTRs. When the CDS start and end are known they may
be specified using the --annotation option (see below) to
permit only the correct coding region to appear in the
alignment.
genome2genome
This model is similar to the coding2coding model, except
introns are modelled on both sequences. (not working
well yet)
The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner,
p2d, p2d:b, p2g, p2g:b, c2c, c2g, cd2g and g2g can also be used
for specifying models.
-s | --score <threshold>
This is the overall score threshold. Alignments will not be
reported below this threshold. For heuristic alignments, the
higher this threshold, the less time the analysis will take.
--percent <percentage>
Report only alignments scoring at least this percentage of the
maximal score for each query. eg. use --percent 90 to report
alignments with 90% of the maximal score obtainable for that
query. This option is useful not only because it reduces the
spurious matches in the output, but because it generates query-
specific thresholds (unlike --score) for a set of queries of
differing lengths, and will also speed up the search
considerably. NB. with this option, it is possible to have a
cDNA match its corresponding gene exactly, yet still score less
than 100%, due to the addition of the intron penalty scores,
hence this option must be used with caution.
--showalignment <boolean>
Show the alignments in a human readable form.
--showsugar <boolean>
Display "sugar" output for ungapped alignments. Sugar is Simple
UnGapped Alignment Report, which displays ungapped alignments
one-per-line. The sugar line starts with the string "sugar:"
for easy extraction from the output, and is followed by the
following 9 fields in the order below:
query_id Query identifier
query_start Query position at alignment start
query_end Query position at alignment end
query_strand Strand of query matched
target_id |
target_start | the same 4 fields
target_end | for the target sequence
target_strand |
score The raw alignment score
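A sugar line can be split into its 9 fields with a few lines of code. The sketch below assumes whitespace-separated fields, identifiers without embedded whitespace and an integer score; the sample line is invented for illustration.

```python
from typing import NamedTuple, Optional


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line: str) -> Optional[Sugar]:
    """Parse one output line; return None if it is not a sugar line."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        return None
    f = line[len(prefix):].split()
    if len(f) != 9:
        raise ValueError(f"expected 9 sugar fields, found {len(f)}")
    return Sugar(f[0], int(f[1]), int(f[2]), f[3],
                 f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


# Invented example line, not real exonerate output.
hit = parse_sugar("sugar: est1 0 120 + chr1 5000 5120 + 600")
print(hit.query_id, hit.target_start, hit.score)
```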
--showcigar <boolean>
Show the alignments in "cigar" format. Cigar is a Compact
Idiosyncratic Gapped Alignment Report, which displays gapped
alignments one-per-line. The format starts with the same 9
fields as sugar output (see above), and is followed by a series
of <operation, length> pairs where operation is one of match,
insert or delete, and the length describes the number of times
this operation is repeated.
--showvulgar <boolean>
Shows the alignments in "vulgar" format. Vulgar is Verbose
Useful Labelled Gapped Alignment Report. This format also starts
with the same 9 fields as sugar output (see above), and is
followed by a series of <label, query_length, target_length>
triplets. The label may be one of the following:
M Match
C Codon
G Gap
N Non-equivalenced region
5 5’ splice site
3 3’ splice site
I Intron
S Split codon
F Frameshift
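The triplets after the 9 sugar fields can be read three tokens at a time. The following sketch assumes whitespace-separated fields and uses an invented vulgar line; summing the query and target lengths gives the total distance advanced on each sequence.

```python
VULGAR_LABELS = set("MCGN53ISF")


def parse_vulgar(line):
    """Split a vulgar line into its 9 leading fields and its triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    head, tail = fields[:9], fields[9:]
    if len(head) != 9 or len(tail) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = []
    for i in range(0, len(tail), 3):
        label = tail[i]
        if label not in VULGAR_LABELS:
            raise ValueError(f"unknown vulgar label: {label}")
        triplets.append((label, int(tail[i + 1]), int(tail[i + 2])))
    return head, triplets


# Invented example: two matches separated by an intron on the target.
line = ("vulgar: est1 0 30 + chr1 100 180 + 140 "
        "M 10 10 5 0 2 I 0 46 3 0 2 M 20 20")
head, triplets = parse_vulgar(line)
query_advance = sum(q for _, q, _ in triplets)
target_advance = sum(t for _, _, t in triplets)
print(query_advance, target_advance)
```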
--showquerygff <boolean>
Report GFF output for features on the query sequence. See
http://www.sanger.ac.uk/Software/formats/GFF for more
information.
--showtargetgff <boolean>
Report GFF output for features on the target sequence.
--ryo <format>
Roll-your-own output format. This allows specification of a
printf-esque format line which is used to specify which
information to include in the output, and how it is to be shown.
The format field may contain the following fields:
%[qt][idlsSt]
For either {query,target}, report the
{id,definition,length,sequence,Strand,type}. Sequences are
reported in a fasta-format like block (no headers).
%[qt]a[bels]
For either {query,target} region which occurs in the
alignment, report the {begin,end,length,sequence}
%[qt]c[bels]
For either {query,target} region which occurs in the
coding sequence in the alignment, report the
{begin,end,length,sequence}
%s The raw score
%r The rank (in results from a bestn search)
%m Model name
%e[tism]
Equivalenced {total,id,similarity,mismatches} (ie. %em ==
(%et - %ei))
%p[is] Percent {id,similarity} over the equivalenced portions of
the alignment. (ie. %pi == 100*(%ei / %et))
%g Gene orientation (’+’ = forward, ’-’ = reverse, ’.’ =
unknown)
%S Sugar block (the 9 fields used in sugar output (see
above))
%C Cigar block (the fields of a cigar line after the sugar
portion)
%V Vulgar block (the fields of a vulgar line after the sugar
portion)
%% Expands to a percentage sign (%)
\n Newline
\t Tab
\\ Expands to a backslash (\)
\{ Open curly brace
\} Close curly brace
{ Begin per-transition output section
} End per-transition output section
%P[qt][sabe]
Per-transition output for {query,target}
{sequence,advance,begin,end}
%P[nsl]
Per-transition output for {name,score,label}
This option is very useful and flexible. For example, to report all
the sections of query sequences which feature in alignments in fasta
format, use:
--ryo ">%qi %qd\n%qas\n"
To output all the symbols and scores in an alignment, try something
like:
--ryo "%V{%Pqs %Pts %Ps\n}"
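One way to consume --ryo output from a script is to emit tab-separated fields behind a fixed tag and filter on that tag. The sketch below is an assumption-laden illustration: the "ryo" tag, the file names and the sample output line are invented, and the numeric formatting of %pi is assumed to be parseable as a float.

```python
TAG = "ryo"
# Raw string: the backslash escapes are expanded by exonerate, not Python.
RYO_FORMAT = TAG + r"\t%qi\t%ti\t%s\t%pi\n"

# Argument list that could be passed to subprocess.run(); placeholder files.
argv = ["exonerate", "--showalignment", "no", "--showvulgar", "no",
        "--ryo", RYO_FORMAT, "cdna.fasta", "genomic.fasta"]


def parse_ryo(lines):
    """Yield one dict per tagged line, ignoring any other output lines."""
    for line in lines:
        if not line.startswith(TAG + "\t"):
            continue
        _, query_id, target_id, score, percent_id = line.rstrip("\n").split("\t")
        yield {"query_id": query_id, "target_id": target_id,
               "score": int(score), "percent_id": float(percent_id)}


# Invented output for illustration only.
sample = ["some other output line\n", "ryo\test1\tchr1\t600\t98.50\n"]
records = list(parse_ryo(sample))
print(records)
```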
-n | --bestn <number>
Report the best N results for each query. (Only results scoring
better than the score threshold will be reported). The option
reduces the amount of output generated, and also allows
exonerate to speed up the search.
-S | --subopt <boolean>
This option allows for the reporting of (Waterman-Eggert style)
suboptimal alignments. (It is on by default.) All suboptimal
(ie. non-intersecting) alignments will be reported for each pair
of sequences scoring at least the threshold provided by --score.
When this option is used with exhaustive alignments, several
full quadratic time passes will be required, so the running time
will be considerably increased.
-g | --gappedextension <boolean>
Causes a gapped extension stage to be performed ie. dynamic
programming is applied in arbitrarily shaped and dynamically
sized regions surrounding HSP seeds. The extension threshold is
controlled by the --extensionthreshold option.
Although sometimes slower than BSDP, gapped extension improves
sensitivity with weak, gap-rich alignments such as during cross-
species comparison.
NB. This option is now the default. Set it to false to revert
to the old BSDP type alignments. This option may be slower than
BSDP for some large scale analyses with simple alignment models.
--refine <strategy>
Force exonerate to refine alignments generated by heuristics
using dynamic programming over larger regions. This takes more
time, but improves the quality of the final alignments.
The strategies available for refinement are:
none The default - no refinement is used.
full An exhaustive alignment is calculated from the pair of
sequences in their entirety.
region DP is applied just to the region of the sequences covered
by the heuristic alignment.
--refineboundary <size>
Specify an extra boundary to be included in the region subject
to alignment during refinement by region.
VITERBI ALGORITHM OPTIONS
-D | --dpmemory <Mb>
The exhaustive alignment traceback routines use a Hughey-style
reduced memory technique. This option specifies how much memory
will be used for this. Generally, the more memory is permitted
here, the faster the alignments will be produced.
CODE GENERATION OPTIONS
-C | --compiled <boolean>
This option allows disabling of generated code for dynamic
programming. It is mainly used during development of exonerate.
When set to FALSE, an "interpreted" version of the dynamic
programming implementation is used, which is much slower.
HEURISTIC OPTIONS
--terminalrangeint
--terminalrangeext
--joinrangeint
--joinrangeext
--spanrangeint
--spanrangeext
These options are used to specify the size of the sub-alignment
regions to which DP is applied around the ends of the HSPs.
This can be at the HSP ends (terminal range), between HSPs (join
range), or between HSPs which may be connected by a large region
such as an intron or non-equivalenced region (span range).
These ranges can be specified for a number of matches back onto
the HSP (internal range) or out from the HSP (external range).
SEEDED DYNAMIC PROGRAMMING OPTIONS
-x | --extensionthreshold <score>
This is the amount by which the score will be allowed to degrade
during SDP. This is the equivalent of the hspdropoff penalties,
except it is applied during dynamic programming, not HSP
extension. Decreasing this parameter will increase the speed of
the SDP, and increasing it will increase the sensitivity.
--singlepass <boolean>
By default the suboptimal SDP alignments are reported by a
singlepass algorithm, which may miss some suboptimal alignments
that are close together. This option can be used to force the
use of a multipass suboptimal alignment algorithm for SDP,
resulting in higher quality suboptimal alignments.
BSDP OPTIONS
--joinfilter <limit>
(experimental)
Only consider this number of SARs for joining HSPs
together. The SARs with the highest potential for appearing in
a high-scoring alignment are considered. This option is useful for
limiting time and memory usage when searching unmasked data with
repetitive sequences, but should not be set too low, as valid
matches may be ignored. Something like --joinfilter 32 seems to
work well.
SEQUENCE OPTIONS
--annotation <path>
Specify basic sequence annotation information. This is most
useful with the cdna2genome model, but will work with other
models. The annotation file contains four fields per line:
<id> <strand> <cds_start> <cds_length>
Here is a simple example of such a file for 4 cDNAs:
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna + 178 828
csn7a.mouse.cdna + 126 828
These annotation lines will also work when only the first two
fields are used. This can be used when specifying which strand
of a specific sequence should be included in a comparison.
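An annotation file of this shape is simple to read or validate before a run. The sketch below assumes whitespace-separated fields and accepts both the four-field and two-field forms described above; it is not exonerate's own parser.

```python
def parse_annotation(text):
    """Map each sequence id to (strand, (cds_start, cds_length) or None)."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 4):
            raise ValueError(f"expected 2 or 4 fields: {line!r}")
        seq_id, strand = fields[0], fields[1]
        cds = (int(fields[2]), int(fields[3])) if len(fields) == 4 else None
        records[seq_id] = (strand, cds)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna +
"""
annotation = parse_annotation(example)
print(annotation["dhh.human.cdna"])     # ('+', (308, 1191))
print(annotation["csn7a.human.cdna"])   # ('+', None)
```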
SYMBOL COMPARISON OPTIONS
--softmaskquery <boolean>
Indicate that the query is softmasked. See description below
for --softmasktarget
--softmasktarget <boolean>
Indicate that the target is softmasked. In a softmasked
sequence file, instead of masking regions by Ns or Xs they are
masked by putting those regions in lower case (and with unmasked
regions in upper case). This option allows the masking to be
ignored by some parts of the program, combining the speed of
searching masked data with sensitivity of searching unmasked
data. The utility fastasoftmask, which is supplied with
exonerate, can be used for producing softmasked sequence from
conventionally masked sequence.
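To make the softmasking convention concrete, the sketch below combines an unmasked sequence with its conventionally masked copy. It illustrates the idea only and is not the fastasoftmask implementation; treating N and X (in either case) as the mask symbols is an assumption.

```python
def softmask(unmasked, masked, mask_chars="NXnx"):
    """Lower-case the positions that are masked, upper-case the rest."""
    if len(unmasked) != len(masked):
        raise ValueError("sequences must have the same length")
    return "".join(
        original.lower() if mask in mask_chars else original.upper()
        for original, mask in zip(unmasked, masked)
    )


print(softmask("ACGTACGTAC", "ACGNNNNTAC"))   # ACGtacgTAC
```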
-d | --dnasubmat <name>
Specify the substitution matrix to be used for DNA
comparison. This should be a path to a substitution matrix in
the same format as that which is used by blast.
-p | --proteinsubmat <name>
Specify the substitution matrix to be used for protein
comparison. (Both DNA and protein substitution matrices are
required for some types of analysis). The use of the special
names, nucleic, blosum62, pam250, edit or identity will cause
built-in substitution matrices to be used.
ALIGNMENT SEEDING OPTIONS
-M | --fsmmemory <Mb>
Specify the amount of memory to use for the FSM in heuristic
analyses. exonerate multiplexes the query to accelerate large-
throughput database queries. This figure should always be less
than the physical memory on the machine, but when searching
large databases, generally, the more memory it is allowed to
use, the faster it will go.
--forcefsm <none | normal | compact>
Force the use of more compact finite state machines for analyses
involving big sequences and large word neighbourhoods. By
default, exonerate will pick a sensible strategy, so this option
will rarely need to be set.
--wordjump <int>
The jump between query words used to yield the word
neighbourhood. If set to 1, every word is used, if set to 2,
every other word is used, and if set to the wordlength, only
non-overlapping words will be used. This option reduces the
memory requirements when using very large query sequences, and
makes the search run faster, but it also damages search
sensitivity when high values are set.
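The effect of the jump on which query words are sampled can be sketched as below. This only enumerates the words taken from the query; building the word neighbourhood from them is not modelled, and the function name is hypothetical.

```python
def query_words(seq, wordlen, wordjump=1):
    """Return the query words taken every `wordjump` positions."""
    return [seq[i:i + wordlen]
            for i in range(0, len(seq) - wordlen + 1, wordjump)]


seq = "ACGTACGT"
print(query_words(seq, 4, 1))   # every word
print(query_words(seq, 4, 2))   # every other word
print(query_words(seq, 4, 4))   # non-overlapping words only
```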
AFFINE MODEL OPTIONS
-o | --gapopen <penalty>
This is the gap open penalty.
-e | --gapextend <penalty>
This is the gap extension penalty.
--codongapopen <penalty>
This is the codon gap open penalty.
--codongapextend <penalty>
This is the codon gap extension penalty.
NER OPTIONS
--minner <length>
Minimum NER length allowed.
--maxner <length>
Maximum NER length allowed. NB. this option only affects
heuristic alignments.
--neropen <penalty>
Penalty for opening a non-equivalenced region.
INTRON MODELLING OPTIONS
--minintron <length>
Minimum intron length limit. NB. this option only affects
heuristic alignments. This is not a hard limit - it only
affects size of introns which are sought during heuristic
alignment.
--maxintron <length>
Maximum intron length limit. See notes above for --minintron
-i | --intronpenalty <penalty>
Penalty for introduction of an intron.
FRAMESHIFT MODELLING OPTIONS
-f | --frameshift <penalty>
The penalty for the inclusion of a frameshift in an alignment.
ALPHABET OPTIONS
--useaatla <boolean>
Use three-letter abbreviations for AA names. ie. when
displaying alignment "Met" is used instead of " M "
TRANSLATION OPTIONS
--geneticcode <code>
NEW: Specify an alternative genetic code. The default code (1)
is the standard genetic code. Other genetic codes may be
specified in shorthand or longhand form.
In shorthand form, a number between 1 and 23 is used to specify
one of 17 built-in genetic code variants. These are genetic
code variants taken from:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
These are:
1 The Standard Code
2 The Vertebrate Mitochondrial Code
3 The Yeast Mitochondrial Code
4 The Mold, Protozoan, and Coelenterate Mitochondrial Code
and the Mycoplasma/Spiroplasma Code
5 The Invertebrate Mitochondrial Code
6 The Ciliate, Dasycladacean and Hexamita Nuclear Code
9 The Echinoderm and Flatworm Mitochondrial Code
10 The Euplotid Nuclear Code
11 The Bacterial and Plant Plastid Code
12 The Alternative Yeast Nuclear Code
13 The Ascidian Mitochondrial Code
14 The Alternative Flatworm Mitochondrial Code
15 Blepharisma Nuclear Code
16 Chlorophycean Mitochondrial Code
21 Trematode Mitochondrial Code
22 Scenedesmus obliquus mitochondrial Code
23 Thraustochytrium Mitochondrial Code
In longhand form, a genetic code variant may be provided as a 64
byte string in TCAG order, eg. the standard genetic code in this
form would be:
FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
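The longhand string can be read as a lookup table indexed by codon. The sketch below assumes the usual interpretation of TCAG order, with the first base of the codon varying slowest and the third base fastest; it is an illustration, not exonerate's translation code.

```python
STANDARD_CODE = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                 "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
BASES = "TCAG"


def translate_codon(codon, code=STANDARD_CODE):
    """Look up one codon in a 64-symbol genetic code string."""
    if len(code) != 64 or len(codon) != 3:
        raise ValueError("need a 64-symbol code and a 3-base codon")
    first, second, third = (BASES.index(base) for base in codon.upper())
    return code[16 * first + 4 * second + third]


def translate(dna, code=STANDARD_CODE):
    """Translate complete codons in frame 1; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(translate_codon(dna[i:i + 3], code)
                   for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))   # MA*
```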
HSP CREATION OPTIONS
--hspfilter <threshold>
Use aggressive HSP filtering to speed up heuristic searches.
The threshold specifies the number of HSPs centred about a point
in the query which will be stored. Any lower scoring HSPs will
be discarded. This is an experimental option to handle speed
problems caused by some sequences. A value of about 100 seems
to work well.
--useworddropoff <boolean>
When this is TRUE, the score threshold for admitting words into
the word neighbourhood is set to be the initial word score minus
the word threshold (see below). This strategy is designed to
prevent restricting the word neighbourhood. When
this is FALSE, the word threshold is taken to be an absolute
value.
--seedrepeat <count>
NEW: The seedrepeat parameter sets the number of seeds which
must be found on the same diagonal or reading frame before HSP
extension will occur. Increasing the value for --seedrepeat
will speed up searches, and is usually a better option than
using longer word lengths, particularly when using the
exonerate-server where increasing word lengths requires
recomputing the index, and greatly increases memory
requirements.
-w | --dnawordlen <bases>
-W | --proteinwordlen <residues>
--codonwordlen <bases>
The word length used for DNA, protein or codon words. When
performing DNA vs protein comparisons, the DNA wordlength will
always (automatically) be triple the protein wordlength.
--dnahspdropoff <score>
--proteinhspdropoff <score>
--codonhspdropoff <score>
The amount by which an HSP score will be allowed to degrade
during HSP extension. Separate thresholds can be set for dna or
protein comparisons.
--dnahspthreshold <score>
--proteinhspthreshold <score>
--codonhspthreshold <score>
The HSP score thresholds. An HSP must score at least this much
before it will be reported or be used in preparation of a
heuristic alignment.
--dnawordlimit <score>
--proteinwordlimit <score>
--codonwordlimit <score>
The threshold for admitting DNA or protein words into the word
neighbourhood. The behaviour of this option is altered by the
--useworddropoff option (see above).
--geneseed <threshold>
Exclude HSPs from gapped alignment computation which cannot
feature in an alignment containing at least one HSP scoring at
least this threshold.
This option provides considerable speed up for gapped alignment
computation, but may cause some very gap-rich alignments to be
missed.
It is useful when aligning similar sequences back onto genome
quickly, eg. try --geneseed 250
--geneseedrepeat <count>
NEW: The geneseedrepeat parameter is like the seedrepeat
parameter, but is only applied when looking for the geneseed
hsps. Using a larger value for --geneseedrepeat will speed up
searches when the --geneseed parameter is also used.
(experimental, implementation incomplete)
ALIGNMENT OPTIONS
--alignmentwidth <width>
Width of alignment display. The default is 80.
--forwardcoordinates <boolean>
By default, all coordinates are reported on the forward strand.
Setting this option to false reverts to the old behaviour
(pre-0.8.3) whereby alignments on the reverse complement of a
sequence are reported using coordinates on the reverse
complement.
SUB-ALIGNMENT REGION OPTIONS
--quality <percent>
This option excludes HSPs from BSDP when their components
outside of the SARs fall below this quality threshold.
SPLICE SITE PREDICTION OPTIONS
--splice3 <path>
--splice5 <path>
NEW: Provide a file containing a custom PSSM (position specific
score matrix) for prediction of the intron splice sites.
The file format for splice data is simple: lines beginning with
´#´ are comments, a line containing just the word ´splice´
denotes the position of the splice site, and the other lines
show the observed relative frequencies of the bases flanking the
splice sites in the chosen organism (in ACGT order).
Example 5’ splice data file:
# start of example 5’ splice data
# A C G T
28 40 17 14
59 14 13 14
8 5 81 6
splice
0 0 100 0
0 0 0 100
54 2 42 2
74 8 11 8
5 6 85 4
16 18 21 45
# end of test 5’ splice data
Example 3’ splice data file:
# start of example 3’ splice data
# A C G T
10 31 14 44
8 36 14 43
6 34 12 48
6 34 8 52
9 37 9 45
9 38 10 44
8 44 9 40
9 41 8 41
6 44 6 45
6 40 6 48
23 28 26 23
2 79 1 18
100 0 0 0
0 0 100 0
splice
28 14 47 11
# end of example 3’ splice data
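A file in this format can be checked before use with a small reader. The sketch below follows the description above (comment lines, a single "splice" marker, and rows of four values in ACGT order); the function name and the returned structure are choices made for this illustration.

```python
def parse_splice_file(text):
    """Return (rows, splice_index) for a splice frequency file.

    rows is a list of {base: frequency} dicts; splice_index is the number
    of rows that precede the 'splice' marker line.
    """
    rows, splice_index = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "splice":
            splice_index = len(rows)
            continue
        values = [float(v) for v in line.split()]
        if len(values) != 4:
            raise ValueError(f"expected 4 values in ACGT order: {line!r}")
        rows.append(dict(zip("ACGT", values)))
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index


EXAMPLE_5_PRIME = """\
# A C G T
28 40 17 14
59 14 13 14
8 5 81 6
splice
0 0 100 0
0 0 0 100
54 2 42 2
74 8 11 8
5 6 85 4
16 18 21 45
"""
rows, splice_index = parse_splice_file(EXAMPLE_5_PRIME)
print(len(rows), splice_index, rows[splice_index]["G"])
```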
--forcegtag <boolean>
Only allow splice sites at gt....ag sites (or ct....ac sites
when the gene is reversed) With this restriction in place, the
splice site prediction scores are still used and allow tie
breaking when there is more than one possible splice site.
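The restriction amounts to a check on the first and last two bases of a candidate intron. The sketch below is an illustration only; the function name is hypothetical, and ct....ac is simply the reverse complement of gt....ag.

```python
def is_gtag_intron(intron, gene_reversed=False):
    """Check the intron ends: gt....ag, or ct....ac for a reversed gene."""
    intron = intron.lower()
    if len(intron) < 4:
        return False
    if gene_reversed:
        return intron.startswith("ct") and intron.endswith("ac")
    return intron.startswith("gt") and intron.endswith("ag")


print(is_gtag_intron("GTAAGTCCTTTCAG"))                       # True
print(is_gtag_intron("CTGAAAGGACTTAC", gene_reversed=True))   # True
print(is_gtag_intron("GCAAGTCCTTTCAG"))                       # False
```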
STRATEGIES FOR SPEED
Keep all data on local disks.
Apply the highest acceptable score thresholds using a combination of
--score, --percent and --bestn.
Repeat mask and dust the genomic (target) sequence. (Softmask these
sequences and use --softmasktarget).
Increase the --fsmmemory option to allow more query multiplexing.
Increase the value for --seedrepeat
When using an alignment model containing introns, set --geneseed as
high as possible.
If you are compiling exonerate yourself, see the README file supplied
with the source code for details of compile-time optimisations.
STRATEGIES FOR SENSITIVITY
Not documented yet.
Increase the word neighbourhood. Decrease the HSP threshold. Increase
the SAR ranges. Run exhaustively.
ENVIRONMENT
Not documented yet.
EXAMPLES
exonerate cdna.fasta genomic.fasta
This is the simplest way in which exonerate may be used. By default,
an ungapped alignment model will be used.
exonerate --exhaustive y --model est2genome cdna.fasta
genomic.masked.fasta
Exhaustively align cdnas to genomic sequence. This will be
much, much slower, but more accurate. This option causes
exonerate to behave like EST_GENOME.
exonerate --exhaustive --model affine:local query.fasta target.fasta
If the affine:local model is used with exhaustive alignment, you
have the Smith-Waterman algorithm.
exonerate --exhaustive --model affine:global protein.fasta
protein.fasta
Switch to a global model, and you have Needleman-Wunsch.
exonerate --wordthreshold 1 --gapped no --showhsp yes protein.fasta
genome.fasta
Generate ungapped Protein:DNA alignments.
exonerate --model coding2coding --score 1000 --bigseq yes
--proteinhspthreshold 90 chr21.fa chr22.fa
Perform quick-and-dirty translated pairwise alignment of two
very large DNA sequences.
Many similar combinations should work. Try them out.
VERSION
This documentation accompanies version 2.2.0 of the exonerate package.
AUTHOR
Guy St.C. Slater. <guy@ebi.ac.uk>. See the AUTHORS file accompanying
the source code for a list of contributors.
AVAILABILITY
The source code for the exonerate package is available under the terms
of the GNU general public licence.
Please see the file COPYING which was distributed with this package, or
http://www.gnu.org/licenses/gpl.txt for details.
This package has been developed as part of the ensembl project. Please
see http://www.ensembl.org/ for more information.
SEE ALSO
exonerate-server(1), ipcress(1), blast(1L).