long-orfs — Find/Score potential genes in genome-file using the

NAME

       long-orfs  —  Find/Score  potential  genes  in  genome-file  using  the
       probability model in icm-file

SYNOPSIS

       tigr-long-orgs [genome-file options]

DESCRIPTION

       Program long-orfs takes a sequence file (in FASTA format) and outputs a
       list  of  all  long  "potential genes" in it that do not overlap by too
       much.  By "potential gene" I mean the portion of an orf from the  first
       start codon to the stop codon at the end.

       The  first  few  lines  of  output  specify  the  settings  of  various
       parameters in the program:

       Minimum gene length is the length of the smallest  fragment  considered
       to  be a gene.  The length is measured from the first base of the start
       codon to the last base *before* the stop  codon.   This  value  can  be
       specified  when  running the program with the  -g  option.  By default,
       the program now (April 2003) will compute an optimal  length  for  this
       parameter,  where  "optimal"  is  the  value that produces the greatest
       number of long ORFs, thereby increasing the amount  of  data  used  for
       training.

       Minimum  overlap length is a lower bound on the number of bases overlap
       between 2 genes that is considered a problem.   Overlaps  shorter  than
       this are ignored.

       Minimum  overlap  percent is another lower bound on the number of bases
       overlap that is considered  a  problem.   Overlaps  shorter  than  this
       percentage of *both* genes are ignored.

       The next portion of the output is a list of potential genes:

       Column  1  is  an  ID  number  for  reference purposes.  It is assigned
       sequentially starting  with   1   to  all  long  potential  genes.   If
       overlapping  genes are eliminated, gaps in the numbers will occur.  The
       ID prefix is specified in the constant  ID_PREFIX .

       Column 2 is the position of the first base of the first start codon  in
       the orf.  Currently I use atg, and gtg as start codons.  This is easily
       changed in the function  Is_Start () .

       Column 3 is the position of the last  base  *before*  the  stop  codon.
       Stop  codons  are taa, tag, and tga.  Note that for orfs in the reverse
       reading frames have their start position higher than the end  position.
       The  order  in  which  orfs  are  listed  is in increasing order by Max
       {OrfStart, End}, i.e., the highest numbered position in the orf, except
       for orfs that "wrap around" the end of the sequence.

       When  two genes with ID numbers overlap by at least a sufficient amount
       (as determined by Min_Olap and Min_Olap_Percent ), they are  eliminated
       and do not appear in the output.

       The  final output of the program (sent to the standard error file so it
       does not show up when output is redirected to a file) is the length  of
       the longest orf found.

       Specifying Different Start and Stop Codons:

       To  specify  different  sets  of start and stop codons, modify the file
       gene.h .  Specifically, the functions:

       Is_Forward_Start      Is_Reverse_Start       Is_Start   Is_Forward_Stop
       Is_Reverse_Stop      Is_Stop

       are used to determine what is used for start and stop codons.

       Is_Start   and   Is_Stop  do simple string comparisons to specify which
       patterns are used.  To add a new pattern, just add the  comparison  for
       it.   To remove a pattern, comment out or delete the comparison for it.

       The other four functions use a bit comparison to  determine  start  and
       stop patterns.  They represent a codon as a 12-bit pattern, with 4 bits
       for each base, one bit for each possible value of the bases, T, G, C or
       A.   Thus  the bit pattern  0010 0101 1100  represents the base pattern
       [C] [A or  G]  [G  or  T].   By  doing  bit  operations  (&  |  ~)  and
       comparisons, more complicated patterns involving ambiguous reads can be
       tested efficiently.  Simple patterns can be tested as  in  the  current
       code.

       For  example,  to  insert  an  additional start codon of CAT requires 3
       changes: 1. The line || (Codon & 0x218) ==  Codon  should  be  inserted
       into   Is_Forward_Start  , since 0x218 = 0010 0001 1000 represents CAT.
       2. The line || (Codon  &  0x184)  ==  Codon  should  be  inserted  into
       Is_Reverse_Start  ,  since 0x184 = 0001 1000 0100 represents ATG, which
       is the reverse-complement of CAT.  Alternately,  the  #define  constant
       ATG_MASK   could  be  used.   3. The line || strncmp (S, "cat", 3) == 0
       should be inserted into  Is_Start .

OPTIONS

       -g n      Set minimum gene length to  n.   Default  is  to  compute  an
                 optimal  value  automatically.   Don’t change this unless you
                 know what you’re doing.

       -l        Regard the genome as linear  (not  circular),  i.e.,  do  not
                 allow  genes  to  "wrap  around" the end of the genome.  This
                 option works on both  glimmer and long-orfs  .   The  default
                 behavior is to regard the genome as circular.

       -o n      Set  maximum overlap length to n.  Overlaps shorter than this
                 are permitted.  (Default is 0 bp.)

       -p n      Set maximum overlap percentage to n%.  Overlaps shorter  than
                 this  percentage  of *both* strings are ignored.  (Default is
                 10%.)

AUTHOR

       This manual page was quickly  copied  from  the  glimmer  web  site  by
       Steffen Moeller moeller@debian.org for the Debian system.

                                                                  LONG-ORFS(1)

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

SEE ALSO

AUTHOR