spidey - align mRNA sequences to a genome

NAME

       spidey - align mRNA sequences to a genome

SYNOPSIS

       spidey [-] [-F N] [-G] [-L N] [-M filename] [-N filename] [-R filename]
       [-S p/m] [-T N] [-X] [-a filename] [-c N]  [-d]  [-e X]  [-f X]  [-g X]
       -i filename  [-j]  [-k filename]  [-l N]  -m filename  [-n N]  [-o str]
       [-p N] [-r c/d/m/p/v] [-s] [-t filename] [-u] [-w]

DESCRIPTION

       spidey is a tool for aligning one or more mRNA  sequences  to  a  given
       genomic sequence.  spidey was written with two main goals in mind: find
       good alignments regardless of intron size; and avoid  getting  confused
       by  nearby  pseudogenes  and  paralogs.  Towards the first goal, spidey
       uses BLAST and Dot View (another local  alignment  tool)  to  find  its
       alignments; since these are both local alignment tools, spidey does not
       intrinsically favor shorter or longer introns and has no maximum intron
       size.    To   avoid   mistakenly  including  exons  from  paralogs  and
       pseudogenes, spidey first defines windows on the genomic  sequence  and
       then  performs  the  mRNA-to-genomic  alignment  separately within each
       window.  Because of the way the windows  are  constructed,  neighboring
       paralogs or pseudogenes should be in separate windows and should not be
       included in the final spliced alignment.

   Initial alignments and construction of genomic windows
       spidey takes as input a single genomic  sequence  and  a  set  of  mRNA
       accessions  or  FASTA  sequences.   All  processing  is  done  one mRNA
       sequence at a time.  The first step for each mRNA sequence is  a  high-
       stringency  BLAST against the genomic sequence.  The resulting hits are
       analyzed to find the genomic windows.

       The BLAST alignments are sorted by score and then assigned into windows
       by  a  recursive function which takes the first alignment and then goes
       down the alignment list to find all alignments that are consistent with
       the  first  (same strand of mRNA, both the mRNA and genomic coordinates
       are nonoverlapping and linearly consistent).  On subsequent passes, the
       remaining   alignments   are  examined  and  are  put  into  their  own
       nonoverlapping, consistent  windows,  until  no  alignments  are  left.
       Depending  on  how  many gene models are desired, the top n windows are
       chosen to go on to the next step and the others are deleted.

   Aligning in each window
       Once the genomic windows are constructed, the initial BLAST  alignments
       are  freed  and  another  BLAST search is performed, this time with the
       entire mRNA against the genomic region defined by the window, and at  a
       lower  stringency  than  the initial search.  spidey then uses a greedy
       algorithm to generate a  high-scoring,  nonoverlapping  subset  of  the
       alignments  from  the  second  BLAST  search.   This  consistent set is
       analyzed carefully to make  sure  that  the  entire  mRNA  sequence  is
       covered by the alignments.  When gaps are found between the alignments,
       the appropriate region of genomic  sequence  is  searched  against  the
       missing mRNA, first using a very low-stringency BLAST and, if the BLAST
       fails to find a hit, using DotView functions to locate  the  alignment.
       When  gaps  are  found  at  the  ends  of the alignments, the BLAST and
       DotView searches are actually allowed to extend past the boundaries  of
       the window.  If the 3’ end of the mRNA does not align completely, it is
       first examined for the presence of a poly(A) tail.  No attempt is  made
       to  align  the  portion  of  the  mRNA that seems to be a poly(A) tail;
       sometimes there is a poly(A)  tail  that  does  align  to  the  genomic
       sequence,  and these are noted because they indicate the possibility of
       a pseudogene.

       Now that the mRNA is completely covered by the set of  alignments,  the
       boundaries  of  the  alignments (there should be one alignment per exon
       now) are adjusted so that the alignments abut each other precisely  and
       so  that  they  are  adjacent  to good splice donor and acceptor sites.
       Most commonly, two adjacent exons’ alignments overlap by as much as  20
       or  30 base pairs on the mRNA sequence.  The true exon boundary may lie
       anywhere within this overlap, or (as we have seen empirically)  even  a
       few  base  pairs outside the overlap.  To position the exon boundaries,
       the overlap plus a few base pairs on each side is examined  for  splice
       donor  sites,  using  functions  that  have  different  splice matrices
       depending on the organism chosen.  The top few splice donor  sites  (by
       score)  are  then  evaluated  as  to  how much they affect the original
       alignment boundaries.  The site that affects the boundaries  the  least
       is  chosen,  and  is  evaluated as to the presence of an acceptor site.
       The alignments are truncated or extended  as  necessary  so  that  they
       terminate at the splice donor site and so that they do not overlap.

   Final result
       The  windows  are  examined  carefully  to get the percent identity per
       exon, the number of gaps per exon, the overall  percent  identity,  the
       percent  coverage  of the mRNA, presence of an aligning or non-aligning
       poly(A) tail, number of splice donor sites and the presence or  absence
       of splice donor and acceptor sites for each exon, and the occurrence of
       an mRNA that has a 5’ or 3’ end (or both) that does not  align  to  the
       genomic  sequence.   If the overall percent identity and percent length
       coverage are above  the  user-defined  cutoffs,  a  summary  report  is
       printed,  and,  if  requested,  a text alignment showing identities and
       mismatches is also printed.

   Interspecies alignments
       spidey is capable of performing  interspecies  alignments.   The  major
       difference in interspecies alignments is that the mRNA-genomic identity
       will not be close to 100% as it is in  intraspecies  alignments;  also,
       the  alignments  have  numerous and lengthy gaps.  If spidey is used in
       its normal mode to do interspecies alignments, it produces gene  models
       with many, many short exons.  When the interspecies flag is set, spidey
       uses different BLAST parameters to encourage longer and more  gaps  and
       to  not  penalize  as heavily for mismatches.  This way, the alignments
       for the exons are much longer and more closely approximate  the  actual
       gene structure.

   Extracting CDS alignments
       When  spidey  is run in network-aware mode or when ASN.1 files are used
       for the mRNA records, it is capable of extracting a CDS alignment  from
       an mRNA alignment and printing the CDS information also.  Since the CDS
       alignment is just a subset of the  mRNA  alignment,  it  is  relatively
       straightforward  to  truncate  the  exon alignments as necessary and to
       generate a CDS alignment.  Furthermore, the  untranslated  regions  are
       now  defined,  so  the  percent identity for the 5’ and 3’ untranslated
       regions is also calculated.

OPTIONS

       A summary of options is included below.

       -      Print usage message.

       -F N   Start of genomic interval desired (from; 0-based).

       -G     Input file is a GI list.

       -L N   The extra-large intron size to use (default = 220000).

       -M filename
              File with donor splice matrix.

       -N filename
              File with acceptor splice matrix.

       -R filename
              File (including path) to repeat blast database for filtering.

       -S p/m Restrict to plus (p) or minus (m) strand of genomic sequence.

       -T N   Stop of genomic interval desired (to; 0-based).

       -X     Use extra-large intron sizes (increases the  limit  for  initial
              and terminal introns from 100kb to 240kb and for all others from
              35kb to 120kb);  may  result  in  significantly  longer  compute
              times.

       -a filename
              Output file for alignments when directed to a separate file with
              -p 3 (default = spidey.aln).

       -c N   Identity cutoff, in percent, for quality control purposes.

       -d     Also try to align coding sequences corresponding  to  the  given
              mRNA records (may require network access).

       -e X   First-pass  e-value (default = 1.0e-10).  Higher values increase
              speed at the cost of sensitivity.

       -f X   Second-pass e-value (default = 0.001).

       -g X   Third-pass e-value (default = 10).

       -i filename
              Input file containing the genomic sequence  in  ASN.1  or  FASTA
              format.   If  your  computer  is  running  on a network that can
              access GenBank, you can substitute the desired accession  number
              for the filename.

       -j     Print ASN.1 alignment?

       -k filename
              File for ASN.1 output with -k (default = spidey.asn).

       -l N   Length coverage cutoff, in percent.

       -m filename
              Input  file  containing  the  mRNA sequence(s) in ASN.1 or FASTA
              format, or a list  of  their  accessions  (with  -G).   If  your
              computer  is  running  on a network that can access GenBank, you
              can substitute a single accession number for the filename.

       -n N   Number of gene models to return per input mRNA (default = 1).

       -o str Main output file (default = stdout; contents controlled by  -p).

       -p N   Print alignment?
              0      summary and alignments together (default)
              1      just the summary
              2      just the alignments
              3      summary and alignments in different files

       -r c/d/m/p/v
              Organism of genomic sequence, used to determine splice matrices.
              c      C. elegans
              d      Drosophila
              m      Dictyostelium discoideum
              p      plant
              v      vertebrate (default)

       -s     Tune for interspecies alignments.

       -t filename
              File with feature table, in 4 tab-delimited columns:
              seqid  (e.g., NM_04377.1)
              name   (only repetitive_region is currently supported)
              start  (0-based)
              stop   (0-based)

       -u     Make a multiple alignment of all input mRNAs (which must overlap
              on the genomic sequence).

       -w     Consider  lowercase  characters  in  input FASTA sequences to be
              masked.

AUTHOR

       Sarah Wheelan and others  at  the  National  Center  for  Biotechnology
       Information; Steffen Moeller contributed to this documentation.

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

AUTHOR

SEE ALSO