sim4 - align an expressed DNA sequence with a genomic sequence

NAME

       sim4 - align an expressed DNA sequence with a genomic sequence

SYNOPSIS

       sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}

DESCRIPTION

       sim4  is a similarity-based tool for aligning an expressed DNA sequence
       (EST, cDNA, mRNA) with a genomic sequence for the gene. It also detects
       end  matches when the two input sequences overlap at one end (i.e., the
       start of one sequence overlaps the end of the other). If seqfile2 is  a
       database  of  sequences,  the sequence in seqfile1 will be aligned with
       each of the sequences in seqfile2.

       sim4 employs a blast-based  technique  to  first  determine  the  basic
       matching  blocks representing the "exon cores". In this first stage, it
       detects all possible exact matches of W-mers (i.e., DNA words  of  size
       W)  between  the two sequences and extends them to maximal scoring gap-
       free segments. In the second stage, the exon cores  are  extended  into
       the   adjacent   as-yet-unmatched   fragments  using  greedy  alignment
       algorithms, and  heuristics  are  used  to  favor  configurations  that
       conform  to  the  splice-site  recognition  signals  (GT-AG, CT-AC). If
       necessary, the process is repeated with less  stringent  parameters  on
       the unmatched fragments.

       By  default,  sim4  searches  both  strands and reports the best match,
       measured by the number of matching nucleotides found in the  alignment.
       The  R  command  line  option can be used to restrict the search to one
       orientation (strand) only.

       Currently,  five  major  alignment  display  options   are   supported,
       controlled  by  the  A  option.  By  default (A=0), only the endpoints,
       overall similarity, and orientation of the  introns  are  reported.  An
       arrow  sign (‘->’ or ‘<-’) indicates the orientation of the intron (‘+’
       or ‘-’ strand), when the signals flanking the intron have three or more
       position  matches with either the GT-AG or the CT-AC splice recognition
       signals.  When  the  same  number  of  matches  is   found   for   both
       orientations,  the  intron is reported as ambiguous, and represented by
       ‘--’. The sign ‘==’ marks the absence from  the  alignment  of  a  cDNA
       fragment  starting  at  that  position.  Alternative formats (lav-block
       format, text, PipMaker-type ‘exons file’, or  certain  combinations  of
       these  options) can be requested by specifying a different value for A.

       If the P option is specified with a non-zero value,  sim4  will  remove
       any 3’-end poly-A tails that it detects in the alignment.

       Occasionally,  sim4  may  miss an internal exon when surrounded by very
       large introns, typically longer than 100 Kb. When  this  is  suspected,
       the  H  option can be used to reset the exons’ weight to compensate for
       the intron gap penalty.

       Ambiguity codes are by default  allowed  in  sequence  data,  but  sim4
       treats  them  non-differentially.  If desired, the B command option can
       restrict the set of acceptable characters to A,C,G,T,N and X only.

       sim4 compares the lengths of the input sequences to distinguish between
       the   cDNA  (‘short’)  and  the  genomic  (‘long’)  components  in  the
       comparison. When seqfile2 contains a collection of sequences, the first
       entry  in  the  file will be used to determine the type of this and all
       subsequent comparisons.

       In the description below, the term MSP denotes a Maximal Segment  Pair,
       that  is,  a  pair  of  highly  similar fragments in the two sequences,
       obtained during the blast-like procedure by extending a  W-mer  hit  by
       matches and perhaps a few mismatches.

OPTIONS

       The  algorithm  parameters  (included  in the first two sections below)
       have already been tuned and do not normally require adjustment  by  the
       user.

       Parameters internal to the blast-like procedure:

       W      Sets  the  word  size  for  blast hits in the first stage of the
              algorithm. The default value is 12, but it can be increased  for
              a more stringent search or decreased to find weaker matches.

       X      Controls  the  limits  for  terminating  word  extensions in the
              blast-like stage of the algorithm. The default value is 12.

       K      Sets the threshold for the MSP scores when determining the basic
              ‘exon  cores’, during the first stage of the algorithm. (If this
              option is not specified, the  threshold  is  computed  from  the
              lengths  of  the  sequences,  using  statistical  criteria.) For
              example, a good value for genomic sequences in the  range  of  a
              few  hundred  Kb  is  16.  To avoid spurious matches, however, a
              larger value may be needed for longer sequences.

       C      Sets the threshold for the MSP scores when aligning the  as-yet-
              unmatched  fragments,  during the second stage of the algorithm.
              By default, the smaller of the constant  12  and  a  statistics-
              based threshold is chosen.

       Additional algorithm parameters:

       D      Sets  the  bound  for the "diagonal" distance within consecutive
              MSPs in an exon. The default value is 10.

       Context parameters:

       R      Specifies the direction of the search.  If  R=0,  only  the  "+"
              (direct)  strand  is  searched.  If  R=1,  only the "-" (reverse
              complement) matches are sought. By default (R=2), sim4  searches
              both  strands and reports the best match, measured by the number
              of matching pairs in the alignment.

       A      Specifies the format of the output: exon endpoints  only  (A=0),
              exon  endpoints and boundaries of the coding region (CDS) in the
              genomic sequence, when  specified  for  the  input  mRNA  (A=5),
              alignment  text  (A=1),  alignment in lav-block format (A=2), or
              both exon endpoints and  alignment  text  (A=3  or  A=4).  If  a
              reverse  complement  match  is  found, A=0,1,2,3,5 will give its
              position in the "+" strand of the longer sequence  and  the  "-"
              strand  of  the  shorter sequence. A=4 will give its position in
              the "+" strand of the first  sequence  (seqfile1)  and  the  "-"
              strand  of  the  second sequence (seqfile2), regardless of which
              sequence is longer. The A=5  option  can  be  used  with  the  S
              command  line  option to specify the endpoints of the CDS in the
              mRNA, and produces output in the ‘exons file’ format required by
              PipMaker.

       P      Specifies  whether or not the program should report the fragment
              of the alignment containing  the  poly-A  tail  (if  found).  By
              default  (P=0)  the  alignment  is  displayed  as  computed, but
              specifying a non-zero value will  request  sim4  to  remove  the
              poly-A  tails. When this feature is enabled, all display options
              produce additional lav alignment headers.

       H      Resets the MSPs’ weight to compensate for  very  large  introns.
              The  default value is H=500, but some introns larger than 100 Kb
              may require higher values, typically between 1000 and 2500. This
              option  should  be  used cautiously, generally in cases where an
              unmatched internal portion of the cDNA  may  disguise  a  missed
              exon within a very large intron. It is not recommended for ESTs,
              where they may produce spurious exons.

       N      Requests an additional search for  small  marginal  exons  (N=1)
              guided  by  the splice-site recognition signals. This option can
              be used when a high accuracy  match  is  expected.  The  default
              value is N=0, specifying no additional search.

       B      Controls  the  set of characters allowed in the input sequences.
              By default (B=1), ambiguity  characters  (ABCDGHKMNRSTVWXY)  are
              allowed.  By specifying B=0, the set of acceptable characters is
              restricted to A,C,G,T,N and X only.

       S      Allows the user to specify the endpoints of the CDS in the input
              mRNA,  with  the syntax: S=n1..n2. This option is only available
              with the A=5 flag, which produces output in the format  required
              by  PipMaker. Alternatively, the CDS coordinates could appear in
              a construct CDS=n1..n2 in the FastA header of the mRNA sequence.
              When  the  second  file  is  an  mRNA database, the command line
              specification for the CDS will apply to the  first  sequence  in
              the file only.

EXAMPLES

       sim4 est genomic

       sim4 genomic estdb

       sim4 est genomic A=1 P=1

       sim4 est1 est2 R=1

       sim4 mRNA genomic A=5 S=123..1020

       sim4 mouse_cDNA human_genomic K=15 C=11 A=3 W=10

AUTHORS

       sim4 was written by Liliana Florea <florea@gwu.edu> and Scott Schwartz.

       This   manual   page   was   written   by   Nelson   A.   de   Oliveira
       <naoliv@gmail.com>,    based    on    the   online   documentation   at
       http://globin.cse.psu.edu/html/docs/si.html, for the  Debian  project
       (but may be used by others).

                        Wed, 03 Aug 2005 18:40:58 -0300