NAME
fa2htgs - formatter for high throughput genome sequencing project
submissions
SYNOPSIS
fa2htgs [-] [-6 str] [-7 str] [-A filename] [-C str] [-D] [-L filename]
[-M str] [-N] [-O filename] [-P str] [-Q filename] [-S str]
[-T filename] [-X] [-a str] [-b N] [-c str] [-d str] [-e filename] [-f]
-g str [-h str] [-i filename] [-k str] [-l N] [-m] [-n str]
[-o filename] [-p N] [-q] [-r str] -s str [-t filename] [-u] [-v] [-w]
[-x str]
DESCRIPTION
fa2htgs is a program used to generate Seq-submits (an ASN.1 sequence
submission file) for high throughput genome sequencing projects.
fa2htgs reads a FASTA file (or an Ace Contig file with Phrap
sequence quality values), a Sequin submission template file (to get
contact and citation information for the submission), and a series of
command line arguments (see below). The program then combines
this information to make a submission suitable for GenBank. Once you
have generated your submission file, you need to follow the submission
protocol (see the README present on your FTP account or mailed out to
your Center).
fa2htgs is intended for script-driven automation of bulk submissions
of unannotated genome sequence. It can easily be extended from its
current simple form to allow more complicated processing. A submission
prepared with fa2htgs can also be read into Psequin(1), and then
annotated more extensively.
Questions and concerns about this processing protocol, or about how to
use this tool, should be forwarded to <htgs@ncbi.nlm.nih.gov>.
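Since the tool is meant to be driven by scripts, the following Python sketch shows one way a wrapper might assemble a command line for a single FASTA-based submission. It uses only options listed under OPTIONS; the helper name build_fa2htgs_command and all values (center tag, sequence name, file names, length) are placeholders assumed for illustration, not values taken from this manual.

```python
import shlex


def build_fa2htgs_command(center_tag, seq_name, fasta_path, template_path,
                          output_path, phase=1, length=0):
    """Return an argv list for one fa2htgs run on FASTA input.

    Hypothetical helper: only the option letters come from the manual.
    """
    return [
        "fa2htgs",
        "-g", center_tag,      # Genome Center tag (required)
        "-s", seq_name,        # sequence name, unique within the center (required)
        "-i", fasta_path,      # FASTA input
        "-t", template_path,   # Seq-submit template
        "-o", output_path,     # ASN.1 output
        "-p", str(phase),      # HTGS phase (1, 2 or 3)
        "-l", str(length),     # putative full length in bp
    ]


# Placeholder values, assumed for illustration only.
argv = build_fa2htgs_command("mycenter", "clone42", "clone42.fa",
                             "template.sub", "clone42.asn",
                             phase=1, length=150000)
print(shlex.join(argv))
```

A script could pass the resulting list to subprocess.run(argv, check=True) once per clone; building a list rather than a single string avoids shell quoting problems with names that contain spaces.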
OPTIONS
A summary of options is included below.
- Print usage message
-6 str SP6 clone (e.g., Contig1,left)
-7 str T7 clone (e.g., Contig2,right)
-A filename
Filename for accession list input (mutually exclusive with -T
and -i). The input file contains a tab-delimited table with
three to five columns, which are accession number, start
position, stop position, and (optionally) length and strand. If
start > stop, the minus strand on the referenced accession is
used. A gap is indicated by the word "gap" instead of an
accession, 0 for the start and stop positions, and a number for
the length.
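The table layout described for -A can be sketched in Python as follows. The helper names accession_row and gap_row are hypothetical, and the accessions are placeholders. The optional strand column is left out because its encoding is not specified here.

```python
def accession_row(accession, start, stop, length=None):
    """Format one accession line: accession, start, stop, optional length."""
    fields = [accession, str(start), str(stop)]
    if length is not None:
        fields.append(str(length))
    return "\t".join(fields)


def gap_row(length):
    """Format a gap line: the word "gap", 0 for start and stop, then the length."""
    return "\t".join(["gap", "0", "0", str(length)])


rows = [
    accession_row("U10000", 1, 5000),   # plus strand, bases 1 to 5000
    gap_row(100),                       # gap of 100 bases
    accession_row("L11000", 3000, 1),   # start > stop selects the minus strand
]
table = "\n".join(rows) + "\n"
print(table, end="")
```

Writing the string in table to a file gives an input that follows the column description above; whether a given accession range is acceptable still depends on the records themselves.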
-C str Clone library name (will appear as /clone-lib="str" on the
source feature)
-D HTGS_DRAFT sequence
-L filename
Read phrap contig order from filename. This is a tab-delimited
file that can be used to drive the order of contigs (normally
specified by -P), as well as indicating the SP6 and T7 ends. It
can also be used when contigs are known to be in opposite
orientation. For example:
Contig2  +  1  SP6  left
Contig3  +  1
Contig1  -     T7   right
The first column is the contig name, the second is the
orientation, the third is the fragment_group, the fourth
indicates the SP6 or T7 end, and the fifth says which side of
SP6 or T7 end had vector removed.
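As an illustration of the five-column layout just described, this Python sketch formats contig order lines with tab separators. The helper contig_order_line is hypothetical, and representing a missing fragment_group as an empty field between two tabs is an assumption rather than documented behavior.

```python
def contig_order_line(name, orientation, fragment_group="", end="", side=""):
    """Format one line: name, orientation, fragment_group, SP6/T7 end, side."""
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    fields = [name, orientation, str(fragment_group), end, side]
    # Drop trailing empty columns so short lines stay short.
    while fields and fields[-1] == "":
        fields.pop()
    return "\t".join(fields)


lines = [
    contig_order_line("Contig2", "+", 1, "SP6", "left"),
    contig_order_line("Contig3", "+", 1),
    # Assumption: an unset fragment_group is written as an empty field.
    contig_order_line("Contig1", "-", "", "T7", "right"),
]
print("\n".join(lines))
```

Keeping the empty field in the last line preserves the column positions, so T7 still lands in the fourth column.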
-M str Map name (will appear as /map="str" on the source feature)
-N Annotate assembly_fragments
-O filename
Read comment from filename (100-character-per-line maximum; ~ is
a linebreak and `~ is a literal ~. You can check the format
with Psequin(1).)
-P str Contigs to use, separated by commas. If -P is not given
with the -T option, the fragments are added in the order in
which they appear in the ace file (which is appropriate for a
phase 1 record, but not for a phase 2 or 3 record). To set the
order of the segments of the ace file, use the -P flag, like
this: -P
"Contig1,Contig4,Contig3,Contig2,Contig5"
-Q filename
Read quality scores from filename
-S str Strain name
-T filename
Filename for phrap input (mutually exclusive with -A and -i)
-X The coordinates in the input file are on the resulting segmented
sequence. (Bases 1 through n of each accession are used.)
Otherwise, the coordinates are on the individual accessions,
which need not start at base 1 of the record.
-a str GenBank accession; use if and only if updating a sequence.
-b N Gap length (default = 100; anything from 0 to 1000000000 is
legal)
-c str Clone name (will appear as /clone in the source feature; can be
the same as -s)
-d str Title for sequence (will appear in GenBank DEFINITION line)
-e filename
Log errors to filename
-f htgs_fulltop keyword
-g str Genome Center tag (probably the same as your login name on the
NCBI FTP server)
-h str Chromosome (will appear as /chromosome in the source feature)
-i filename
Filename for FASTA input (default is stdin; mutually exclusive
with -A and -T)
-k str Add the supplied string as a keyword.
-l N Length of sequence in bp (default = 0). The length is checked
against the actual number of bases we get. For phase 1 and 2
sequence it is also used to estimate gap lengths. For phase 1
and 2 records, it is important to use a number GREATER than the
amount of provided nucleotide, otherwise this will generate
false ‘gaps’. It is assumed here that the putative full length of
the BAC or cosmid will be used. There should be at least 20 to
30 ‘n’ between the segments (you can check for these in
Sequin), as this will ensure proper behavior when this sequence
is used with BLAST. Otherwise ‘artifactual’ unrelated segment
neighbors may be brought into proximity of each other.
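The requirement that the -l value exceed the amount of provided nucleotide can be checked before submission. The Python sketch below counts sequence characters in FASTA text and compares the total with a declared length. The function names are hypothetical, and counting every non-header character as a base is a simplifying assumption; it does not reproduce how fa2htgs itself estimates gaps.

```python
def count_fasta_bases(lines):
    """Count sequence characters in FASTA lines, skipping headers and blanks."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


def length_is_safe(declared_length, fasta_lines):
    """True if the declared -l value is strictly greater than the bases given."""
    return declared_length > count_fasta_bases(fasta_lines)


# Tiny made-up example: two segments totalling 12 bases.
fasta = [">contig1", "ACGTACGT", ">contig2", "GGCC"]
print(count_fasta_bases(fasta))
print(length_is_safe(150000, fasta))
print(length_is_safe(12, fasta))
```

A wrapper script could refuse to run fa2htgs for phase 1 or 2 records when length_is_safe returns False.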
-m Take comment from template
-n str Organism name (default = Homo sapiens)
-o filename
Filename for ASN.1 output (default = stdout)
-p N HTGS phase:
1 A collection of unordered contigs with gaps of unknown
length. A Phase 1 record must at the very least have two
segments with one gap. (default)
2 A series of ordered contigs, possibly with known gap
lengths. This could be a single sequence without gaps,
if the sequence has ambiguities to resolve.
3 A single contiguous sequence. This sequence is finished,
but not necessarily annotated.
-q htgs_cancelled keyword
-r str Remark for update (brief comment describing the nature of the
update, such as "new sequence", "new citation", or "updated
features")
-s str Sequence name. The sequence must have a name that is unique
within the genome center. We use the combination of the genome
center name (-g argument) and the sequence name (-s) to track
this sequence and to talk to you about it. The name can have
any form you like but must be unique within your center.
-t filename
Filename for Seq-submit template (default = template.sub)
-u Take biosource from template
-v htgs_activefin keyword
-w Whole Genome Shotgun flag
-x str Secondary accession numbers, separated by commas, e.g.,
U10000,L11000.
In some cases a large segment will supersede another accession
number (record) or a group of them. Records which are no
longer wanted in GenBank should be made secondary. Using the -x
argument you can list the accession numbers you want to make
secondary. This instructs us to remove the accession
number(s) from GenBank, and they will no longer be part of the
GenBank release. They will nonetheless be available from Entrez.
GREAT CARE should be taken when using this argument!!! Improper
use of accession numbers here will result in the inappropriate
withdrawal of GenBank records from GenBank, EMBL and DDBJ. We
provide this parameter as a convenience to submitting centers,
but this may need to be removed if it is not used carefully.
AUTHOR
The National Center for Biotechnology Information.
SEE ALSO
Psequin(1), /usr/share/doc/ncbi-tools-bin/README.fa2htgs.gz