

       fa2htgs  -  formatter  for  high  throughput  genome sequencing project


       fa2htgs [-] [-6 str] [-7 str] [-A filename] [-C str] [-D] [-L filename]
       [-M str]    [-N]    [-O filename]   [-P str]   [-Q filename]   [-S str]
       [-T filename] [-X] [-a str] [-b N] [-c str] [-d str] [-e filename] [-f]
       -g str   [-h str]   [-i filename]   [-k str]   [-l N]   [-m]   [-n str]
       [-o filename] [-p N] [-q] [-r str] -s str [-t filename] [-u] [-v]  [-w]
       [-x str]


       fa2htgs is a program used to generate Seq-submit files (ASN.1 sequence
       submission files) for high throughput genome sequencing projects.

       fa2htgs reads a FASTA file (or an Ace Contig file with Phrap
       sequence quality values), a Sequin submission template file (to get
       contact and citation information for the submission), and a series of
       command line arguments (see below).  The program then combines this
       information to make a submission suitable for GenBank.  Once you
       have  generated your submission file, you need to follow the submission
       protocol (see the README present on your FTP account or mailed  out  to
       your Center).

       fa2htgs is intended for scripted, automated bulk submission
       of unannotated genome sequence. It can easily be extended from its
       current simple form to allow more complicated processing.  A submission
       prepared with fa2htgs can  also  be  read  into  Psequin(1),  and  then
       annotated more extensively.

       Questions  and  concerns  about this processing protocol, or how to use
       this tool should be forwarded to <>.


       A summary of options is included below.

       -      Print usage message

       -6 str SP6 clone (e.g., Contig1,left)

       -7 str T7 clone (e.g., Contig2,right)

       -A filename
              Filename for accession list input (mutually  exclusive  with  -T
              and  -i).   The  input  file contains a tab-delimited table with
              three  to  five  columns,  which  are  accession  number,  start
              position, stop position, and (optionally) length and strand.  If
              start > stop, the minus strand on the  referenced  accession  is
              used.   A  gap  is  indicated  by  the  word "gap" instead of an
              accession, 0 for the start and stop positions, and a number  for
              the length.
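
              The table layout described above can be produced by a script.
              The following is a minimal sketch in Python, not part of
              fa2htgs; the function name write_accession_table and the
              accession numbers are hypothetical and used only for
              illustration.

```python
import csv
import io


def write_accession_table(rows, handle):
    """Write rows for an -A input file as tab-delimited text.

    Each row is (accession, start, stop), optionally followed by
    length and strand. A gap is written as ("gap", 0, 0, length).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if not 3 <= len(row) <= 5:
            raise ValueError(f"expected 3 to 5 columns, got {len(row)}: {row!r}")
        if row[0] == "gap" and (len(row) < 4 or row[1] != 0 or row[2] != 0):
            raise ValueError(f"a gap row needs start 0, stop 0, and a length: {row!r}")
        writer.writerow(row)


# Hypothetical accession numbers, for illustration only.
rows = [
    ("AC000001", 1, 5000),      # start < stop: plus strand
    ("gap", 0, 0, 100),         # gap of length 100
    ("AC000002", 8000, 2001),   # start > stop: minus strand
]
buffer = io.StringIO()
write_accession_table(rows, buffer)
table_text = buffer.getvalue()
```

              The resulting text can be saved to a file and passed with -A.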

       -C str Clone  library  name  (will  appear  as  /clone-lib="str" on the
              source feature)

       -D     HTGS_DRAFT sequence

       -L filename
              Read phrap contig order from filename.  This is a  tab-delimited
              file  that  can  be used to drive the order of contigs (normally
              specified by -P), as well as indicating the SP6 and T7 ends.  It
              can  also  be  used  when  contigs  are  known to be in opposite
              orientation.  For example:

                  Contig2     +       1       SP6     left
                  Contig3     +       1
                  Contig1     -               T7      right

              The  first  column  is  the  contig  name,  the  second  is  the
              orientation,   the  third  is  the  fragment_group,  the  fourth
              indicates the SP6 or T7 end, and the fifth says  which  side  of
              SP6 or T7 end had vector removed.
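
              To check a contig order file before running fa2htgs, the five
              columns can be read back with a short script.  This is a
              sketch only: the names ContigOrderEntry and parse_contig_order
              are hypothetical, and the assumption that the orientation is
              always "+" or "-" is taken from the example above, not from
              the program itself.

```python
from typing import NamedTuple, Optional


class ContigOrderEntry(NamedTuple):
    name: str
    orientation: str
    fragment_group: Optional[str]
    clone_end: Optional[str]      # "SP6" or "T7" in the example above
    vector_side: Optional[str]    # "left" or "right" in the example above


def parse_contig_order(text):
    """Parse the tab-delimited contig order table into entries.

    Missing or empty trailing columns are returned as None.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        fields += [""] * (5 - len(fields))   # pad short rows to five columns
        name, orientation, group, clone_end, side = fields[:5]
        # Assumption: orientation is "+" or "-", as in the example table.
        if orientation not in ("+", "-"):
            raise ValueError(f"unexpected orientation {orientation!r} for {name}")
        entries.append(
            ContigOrderEntry(name, orientation, group or None,
                             clone_end or None, side or None)
        )
    return entries


example = (
    "Contig2\t+\t1\tSP6\tleft\n"
    "Contig3\t+\t1\n"
    "Contig1\t-\t\tT7\tright\n"
)
entries = parse_contig_order(example)
```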

       -M str Map name (will appear as /map="str" on the source feature)

       -N     Annotate assembly_fragments

       -O filename
              Read comment from filename (100-character-per-line maximum; ~ is
              a linebreak and `~ is a literal ~.  You  can  check  the  format
              with Psequin(1).)

       -P str Contigs to use, separated by commas.  If -P is not given
              with the -T option, the fragments will go in the order in
              which they appear in the ace file (which is appropriate for a
              phase 1 record, but not for a phase 2 or 3).  To set the order
              of the segments of the ace file, use the -P flag, like this:
              -P

       -Q filename
              Read quality scores from filename

       -S str Strain name

       -T filename
              Filename for phrap input (mutually exclusive with -A and -i)

       -X     The coordinates in the input file are on the resulting segmented
              sequence.  (Bases 1 through  n  of  each  accession  are  used.)
              Otherwise,  the  coordinates  are  on the individual accessions,
              which need not start at base 1 of the record.

       -a str GenBank accession; use if and only if updating a sequence.

       -b N   Gap length (default = 100; any value from 0 to 1000000000 is
              accepted)

       -c str Clone  name (will appear as /clone in the source feature; can be
              the same as -s)

       -d str Title for sequence (will appear in GenBank DEFINITION line)

       -e filename
              Log errors to filename

       -f     htgs_fulltop keyword

       -g str Genome Center tag (probably the same as your login name  on  the
              NCBI FTP server)

       -h str Chromosome (will appear as /chromosome in the source feature)

       -i filename
              Filename  for  fasta input (default is stdin; mutually exclusive
              with -A and -T)

       -k str Add the supplied string as a keyword.

       -l N   Length of sequence in bp (default = 0). The  length  is  checked
              against  the  actual  number  of bases we get. For phase 1 and 2
              sequence it is also used to estimate gap lengths.  For  phase  1
              and  2 records, it is important to use a number GREATER than the
              amount of provided  nucleotide,  otherwise  this  will  generate
              false  ‘gaps’.  It is assumed that the putative full length of
              the BAC or cosmid will be used.  There should be at least 20  to
              30  ‘n’  in  between  the  segments  (you can check for these in
              Sequin), as this will ensure proper behavior when this  sequence
              is  used  with BLAST.  Otherwise ‘artifactual’ unrelated segment
              neighbors may be brought into proximity of each other.
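
              The advice above can be turned into a quick check before
              submission.  The sketch below is an illustration, not the
              logic fa2htgs itself uses: how the program distributes the
              extra length among gaps is not described here, so the check
              only verifies that the declared length leaves an average of at
              least min_gap bases per gap.  The function name and the
              min_gap parameter are hypothetical.

```python
def check_declared_length(declared_length, segment_lengths, min_gap=20):
    """Return a list of problems with a -l value for a gapped record.

    declared_length: the value intended for -l (putative clone length).
    segment_lengths: number of bases in each provided segment.
    min_gap: assumed minimum number of 'n' wanted between segments.
    """
    total_bases = sum(segment_lengths)
    gap_count = max(len(segment_lengths) - 1, 0)
    spare = declared_length - total_bases
    problems = []
    if gap_count == 0:
        return problems   # a single segment has no gaps to estimate
    if spare <= 0:
        problems.append(
            f"declared length {declared_length} is not greater than "
            f"the {total_bases} bases provided"
        )
    elif spare < min_gap * gap_count:
        problems.append(
            f"only {spare} spare bases for {gap_count} gap(s); "
            f"fewer than {min_gap} per gap on average"
        )
    return problems


# Two segments of 60000 and 50000 bases in a clone assumed to be 150000 bp.
ok = check_declared_length(150000, [60000, 50000])
too_short = check_declared_length(110000, [60000, 50000])
```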

       -m     Take comment from template

       -n str Organism name (default = Homo sapiens)

       -o filename
              Filename for asn.1 output (default = stdout)

       -p N   HTGS phase:
              1      A collection of unordered contigs with  gaps  of  unknown
                     length.  A Phase 1 record must at the very least have two
                     segments with one gap.  (default)
              2      A series of ordered  contigs,  possibly  with  known  gap
                     lengths.   This  could be a single sequence without gaps,
                     if the sequence has ambiguities to resolve.
              3      A single contiguous sequence.  This sequence is finished,
                     but not necessarily annotated.

       -q     htgs_cancelled keyword

       -r str Remark  for  update  (brief comment describing the nature of the
              update, such as "new  sequence",  "new  citation",  or  "updated

       -s str Sequence  name.   The  sequence  must have a name that is unique
              within the genome center. We use the combination of  the  genome
              center  name  (-g  argument) and the sequence name (-s) to track
              this sequence and to talk to you about it.  The  name  can  have
              any form you like but must be unique within your center.

       -t filename
              Filename for Seq-submit template (default = template.sub)

       -u     Take biosource from template

       -v     htgs_activefin keyword

       -w     Whole Genome Shotgun flag

       -x str Secondary   accession   numbers,   separated   by  commas,  s.t.

              In some cases a large segment will supersede one or more
              other accession numbers (records).  Records that are no
              longer wanted in GenBank should be made secondary.  Using the
              -x argument you can list the accession numbers you want to
              make secondary.  This instructs us to remove the accession
              number(s) from GenBank, and they will no longer be part of the
              GenBank release.  They will nonetheless be available from
              Entrez.

              GREAT CARE should be taken when using this argument!!!  Improper
              use of accession numbers here will result in  the  inappropriate
              withdrawal  of  GenBank records from GenBank, EMBL and DDBJ.  We
              provide this parameter as a convenience to  submitting  centers,
              but this may need to be removed if it is not used carefully.
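
       Because fa2htgs is meant to be driven by scripts, it can help to
       assemble the argument list programmatically and enforce the
       constraints stated above: -g and -s are required, and -A, -T, and -i
       are mutually exclusive.  The following Python sketch is one way to do
       that; the function name build_fa2htgs_command and all option values
       (center tag, sequence name, file names) are hypothetical.

```python
import shlex

# Input sources documented as mutually exclusive.
INPUT_FLAGS = ("-A", "-T", "-i")


def build_fa2htgs_command(options):
    """Build an argument list for fa2htgs from a dict of options.

    Keys are flags such as "-g"; values are the flag's argument, or
    True for flags that take no argument (for example "-D").
    """
    missing = [flag for flag in ("-g", "-s") if not options.get(flag)]
    if missing:
        raise ValueError(f"required option(s) missing: {', '.join(missing)}")
    chosen_inputs = [flag for flag in INPUT_FLAGS if flag in options]
    if len(chosen_inputs) > 1:
        raise ValueError(
            f"{', '.join(chosen_inputs)} are mutually exclusive; choose one"
        )
    command = ["fa2htgs"]
    for flag, value in options.items():
        if value is True:
            command.append(flag)               # flag without an argument
        else:
            command.extend([flag, str(value)])
    return command


# Hypothetical values, for illustration only.
command = build_fa2htgs_command({
    "-g": "mycenter",
    "-s": "clone42",
    "-i": "clone42.fa",
    "-t": "template.sub",
    "-p": 1,
    "-l": 150000,
    "-D": True,
    "-o": "clone42.sqn",
})
command_line = shlex.join(command)
```

       The resulting list can be passed to subprocess.run(command,
       check=True) on a system where fa2htgs is installed.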


       The National Center for Biotechnology Information.


       Psequin(1), /usr/share/doc/ncbi-tools-bin/README.fa2htgs.gz