NAME
spidey - align mRNA sequences to a genome
SYNOPSIS
spidey [-] [-F N] [-G] [-L N] [-M filename] [-N filename] [-R filename]
[-S p/m] [-T N] [-X] [-a filename] [-c N] [-d] [-e X] [-f X] [-g X]
-i filename [-j] [-k filename] [-l N] -m filename [-n N] [-o str]
[-p N] [-r c/d/m/p/v] [-s] [-t filename] [-u] [-w]
DESCRIPTION
spidey is a tool for aligning one or more mRNA sequences to a given
genomic sequence. spidey was written with two main goals in mind: find
good alignments regardless of intron size; and avoid getting confused
by nearby pseudogenes and paralogs. Towards the first goal, spidey
uses BLAST and Dot View (another local alignment tool) to find its
alignments; since these are both local alignment tools, spidey does not
intrinsically favor shorter or longer introns and has no maximum intron
size. To avoid mistakenly including exons from paralogs and
pseudogenes, spidey first defines windows on the genomic sequence and
then performs the mRNA-to-genomic alignment separately within each
window. Because of the way the windows are constructed, neighboring
paralogs or pseudogenes should be in separate windows and should not be
included in the final spliced alignment.
Initial alignments and construction of genomic windows
spidey takes as input a single genomic sequence and a set of mRNA
accessions or FASTA sequences. All processing is done one mRNA
sequence at a time. The first step for each mRNA sequence is a high-
stringency BLAST against the genomic sequence. The resulting hits are
analyzed to find the genomic windows.
The BLAST alignments are sorted by score and then assigned into windows
by a recursive function which takes the first alignment and then goes
down the alignment list to find all alignments that are consistent with
the first (same strand of mRNA, both the mRNA and genomic coordinates
are nonoverlapping and linearly consistent). On subsequent passes, the
remaining alignments are examined and are put into their own
nonoverlapping, consistent windows, until no alignments are left.
Depending on how many gene models are desired, the top n windows are
chosen to go on to the next step and the others are deleted.
Aligning in each window
Once the genomic windows are constructed, the initial BLAST alignments
are freed and another BLAST search is performed, this time with the
entire mRNA against the genomic region defined by the window, and at a
lower stringency than the initial search. spidey then uses a greedy
algorithm to generate a high-scoring, nonoverlapping subset of the
alignments from the second BLAST search. This consistent set is
analyzed carefully to make sure that the entire mRNA sequence is
covered by the alignments. When gaps are found between the alignments,
the appropriate region of genomic sequence is searched against the
missing mRNA, first using a very low-stringency BLAST and, if the BLAST
fails to find a hit, using DotView functions to locate the alignment.
When gaps are found at the ends of the alignments, the BLAST and
DotView searches are actually allowed to extend past the boundaries of
the window. If the 3’ end of the mRNA does not align completely, it is
first examined for the presence of a poly(A) tail. No attempt is made
to align the portion of the mRNA that seems to be a poly(A) tail;
sometimes there is a poly(A) tail that does align to the genomic
sequence, and these are noted because they indicate the possibility of
a pseudogene.
Now that the mRNA is completely covered by the set of alignments, the
boundaries of the alignments (there should be one alignment per exon
now) are adjusted so that the alignments abut each other precisely and
so that they are adjacent to good splice donor and acceptor sites.
Most commonly, two adjacent exons’ alignments overlap by as much as 20
or 30 base pairs on the mRNA sequence. The true exon boundary may lie
anywhere within this overlap, or (as we have seen empirically) even a
few base pairs outside the overlap. To position the exon boundaries,
the overlap plus a few base pairs on each side is examined for splice
donor sites, using functions that have different splice matrices
depending on the organism chosen. The top few splice donor sites (by
score) are then evaluated as to how much they affect the original
alignment boundaries. The site that affects the boundaries the least
is chosen, and is evaluated as to the presence of an acceptor site.
The alignments are truncated or extended as necessary so that they
terminate at the splice donor site and so that they do not overlap.
Final result
The windows are examined carefully to get the percent identity per
exon, the number of gaps per exon, the overall percent identity, the
percent coverage of the mRNA, presence of an aligning or non-aligning
poly(A) tail, number of splice donor sites and the presence or absence
of splice donor and acceptor sites for each exon, and the occurrence of
an mRNA that has a 5’ or 3’ end (or both) that does not align to the
genomic sequence. If the overall percent identity and percent length
coverage are above the user-defined cutoffs, a summary report is
printed, and, if requested, a text alignment showing identities and
mismatches is also printed.
Interspecies alignments
spidey is capable of performing interspecies alignments. The major
difference in interspecies alignments is that the mRNA-genomic identity
will not be close to 100% as it is in intraspecies alignments; also,
the alignments have numerous and lengthy gaps. If spidey is used in
its normal mode to do interspecies alignments, it produces gene models
with many, many short exons. When the interspecies flag is set, spidey
uses different BLAST parameters to encourage longer and more gaps and
to not penalize as heavily for mismatches. This way, the alignments
for the exons are much longer and more closely approximate the actual
gene structure.
Extracting CDS alignments
When spidey is run in network-aware mode or when ASN.1 files are used
for the mRNA records, it is capable of extracting a CDS alignment from
an mRNA alignment and printing the CDS information also. Since the CDS
alignment is just a subset of the mRNA alignment, it is relatively
straightforward to truncate the exon alignments as necessary and to
generate a CDS alignment. Furthermore, the untranslated regions are
now defined, so the percent identity for the 5’ and 3’ untranslated
regions is also calculated.
OPTIONS
A summary of options is included below.
- Print usage message.
-F N Start of genomic interval desired (from; 0-based).
-G Input file is a GI list.
-L N The extra-large intron size to use (default = 220000).
-M filename
File with donor splice matrix.
-N filename
File with acceptor splice matrix.
-R filename
File (including path) to repeat blast database for filtering.
-S p/m Restrict to plus (p) or minus (m) strand of genomic sequence.
-T N Stop of genomic interval desired (to; 0-based).
-X Use extra-large intron sizes (increases the limit for initial
and terminal introns from 100kb to 240kb and for all others from
35kb to 120kb); may result in significantly longer compute
times.
-a filename
Output file for alignments when directed to a separate file with
-p 3 (default = spidey.aln).
-c N Identity cutoff, in percent, for quality control purposes.
-d Also try to align coding sequences corresponding to the given
mRNA records (may require network access).
-e X First-pass e-value (default = 1.0e-10). Higher values increase
speed at the cost of sensitivity.
-f X Second-pass e-value (default = 0.001).
-g X Third-pass e-value (default = 10).
-i filename
Input file containing the genomic sequence in ASN.1 or FASTA
format. If your computer is running on a network that can
access GenBank, you can substitute the desired accession number
for the filename.
-j Print ASN.1 alignment?
-k filename
File for ASN.1 output with -k (default = spidey.asn).
-l N Length coverage cutoff, in percent.
-m filename
Input file containing the mRNA sequence(s) in ASN.1 or FASTA
format, or a list of their accessions (with -G). If your
computer is running on a network that can access GenBank, you
can substitute a single accession number for the filename.
-n N Number of gene models to return per input mRNA (default = 1).
-o str Main output file (default = stdout; contents controlled by -p).
-p N Print alignment?
0 summary and alignments together (default)
1 just the summary
2 just the alignments
3 summary and alignments in different files
-r c/d/m/p/v
Organism of genomic sequence, used to determine splice matrices.
c C. elegans
d Drosophila
m Dictyostelium discoideum
p plant
v vertebrate (default)
-s Tune for interspecies alignments.
-t filename
File with feature table, in 4 tab-delimited columns:
seqid (e.g., NM_04377.1)
name (only repetitive_region is currently supported)
start (0-based)
stop (0-based)
-u Make a multiple alignment of all input mRNAs (which must overlap
on the genomic sequence).
-w Consider lowercase characters in input FASTA sequences to be
masked.
AUTHOR
Sarah Wheelan and others at the National Center for Biotechnology
Information; Steffen Moeller contributed to this documentation.
SEE ALSO
blast(1), <http://www.ncbi.nlm.nih.gov/spidey>