sreformat - convert sequence file to different format

NAME

       sreformat - convert sequence file to different format

SYNOPSIS

       sreformat [options] format seqfile

DESCRIPTION

       sreformat  reads  the  sequence  file  seqfile in any supported format,
       reformats it into a new format specified by  format,  then  prints  the
       reformatted text.

       Supported  input formats include (but are not limited to) the unaligned
       formats FASTA, Genbank, EMBL, SWISS-PROT, PIR, and GCG, and the aligned
       formats Stockholm, Clustal, GCG MSF, and Phylip.

       Available  unaligned  output  file  format  codes  include fasta (FASTA
       format); embl (EMBL/SWISSPROT format); genbank  (Genbank  format);  gcg
       (GCG  single  sequence format); gcgdata (GCG flatfile database format);
       pir  (PIR/CODATA  flatfile  format);  raw  (raw  sequence,   no   other
       information).

       The  available  aligned  output  file  format  codes  include stockholm
       (PFAM/Stockholm format); msf (GCG MSF format); a2m  (an  aligned  FASTA
       format);  phylip  (Felsenstein’s  PHYLIP  format);  clustal   (Clustal
       V/W/X format); and selex (the old SELEX/HMMER/Pfam annotated  alignment
       format).

       All  these codes are interpreted case-insensitively (e.g. MSF, Msf, or
       msf all work).

       Unaligned format  files  cannot  be  reformatted  to  aligned  formats.
       However, aligned formats can be reformatted to unaligned formats -- gap
       characters are simply stripped out.
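
       As a rough illustration of the aligned-to-unaligned case, the sketch
       below strips gap characters from each aligned sequence. It is not
       sreformat's own code; the function name strip_gaps and the set of
       gap symbols in GAP_CHARS are assumptions made for the example.

```python
# Illustrative sketch only; not sreformat's actual implementation.
# GAP_CHARS is an assumed set of gap symbols.
GAP_CHARS = "-._~"


def strip_gaps(aligned_seq: str) -> str:
    """Return the sequence with every assumed gap character removed."""
    return "".join(ch for ch in aligned_seq if ch not in GAP_CHARS)


# A toy two-sequence alignment (hypothetical data).
alignment = {"seq1": "ACG--UAC", "seq2": "AC.GGU-C"}
unaligned = {name: strip_gaps(seq) for name, seq in alignment.items()}
print(unaligned)
```

       The reverse direction has no such shortcut, since an unaligned file
       carries no column information to restore.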

       This program was originally named reformat, but that name clashes  with
       a GCG program of the same name.

OPTIONS

       -d     DNA; convert U’s to T’s, to make sure a nucleic acid sequence is
              shown as DNA not RNA. See -r.

       -h     Print brief help; includes version number  and  summary  of  all
              options, including expert options.

       -l     Lowercase; convert all sequence residues to lower case.  See -u.

       -n     For  DNA/RNA  sequences,  converts  any  character  that’s   not
              unambiguous  RNA/DNA (e.g. ACGTU/acgtu) to an N. Used to convert
              IUPAC ambiguity codes to N’s, for software that can’t handle all
              IUPAC codes (some public RNA folding codes, for example). If the
              file is an alignment, gap characters are also left unchanged. If
              sequences  are  not  nucleic  acid  sequences,  this option will
              corrupt the data in a predictable fashion.
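
              The behavior described above can be sketched roughly as
              follows. This is an illustration, not sreformat's own code:
              the function name to_n, the is_alignment flag, and the gap
              symbols in GAP_CHARS are assumptions, as is the choice to
              always emit an upper case N.

```python
# Illustrative sketch only; not sreformat's actual implementation.
UNAMBIGUOUS = set("ACGTUacgtu")
GAP_CHARS = set("-._~")  # assumed gap symbols


def to_n(seq: str, is_alignment: bool = False) -> str:
    """Replace every character that is not unambiguous DNA/RNA with N.

    Gap characters are kept only when the input is treated as an alignment.
    """
    out = []
    for ch in seq:
        if ch in UNAMBIGUOUS or (is_alignment and ch in GAP_CHARS):
            out.append(ch)
        else:
            out.append("N")
    return "".join(out)


print(to_n("ACGRYTU"))                     # ambiguity codes R and Y become N
print(to_n("AC-GR.U", is_alignment=True))  # gaps are left alone
print(to_n("MKVLAT"))                      # protein input is silently mangled
```

              The last call shows why the option is unsafe on protein
              input: only the letters that happen to be valid nucleotides
              survive.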

       -r     RNA; convert T’s to U’s, to make sure a nucleic acid sequence is
              shown as RNA not DNA. See -d.

       -u     Uppercase; convert all sequence residues to upper case.  See -l.

       -x     For DNA sequences, convert non-IUPAC characters (such as X’s) to
              N’s.  This is for compatibility with benighted people who insist
              on using X instead of the IUPAC ambiguity character N. (X is for
              ambiguity in an amino acid residue).

              Warning: like the -n option, the code doesn’t check that you are
              actually giving it DNA. It simply converts non-IUPAC DNA symbols
              to N. So if you accidentally give it protein sequence, it will
              happily convert most every amino acid residue to an N.

EXPERT OPTIONS

       --gapsym <c>
              Convert  all  gap  characters to <c>.  Used to prepare alignment
              files for programs with strict  requirements  for  gap  symbols.
              Only makes sense if the input seqfile is an alignment.

       --informat <s>
              Specify  that  the  sequence  file is in format <s>, rather than
              allowing the program  to  autodetect  the  file  format.  Common
              examples  include  Genbank,  EMBL, GCG, PIR, Stockholm, Clustal,
              MSF, or PHYLIP; see the printed  documentation  for  a  complete
              list of accepted format names.

       --mingap
              If seqfile is an alignment, remove any columns that contain 100%
              gap characters, minimizing the overall length of the  alignment.
              (Often  useful if you’ve extracted a subset of aligned sequences
              from a larger alignment.)

       --nogap
              Remove any aligned columns that contain any gap symbols at  all.
              Useful  as  a  prelude  to phylogenetic analyses, where you only
              want to analyze columns containing 100% residues, so you want to
              strip  out  any  columns with gaps in them.  Only makes sense if
              the file is an alignment file.
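
              The difference between --mingap and --nogap comes down to the
              test applied to each column. The sketch below is an
              illustration rather than sreformat's own code; the helper
              names drop_columns, all_gaps, and any_gap and the gap symbols
              in GAP_CHARS are assumptions.

```python
# Illustrative sketch only; not sreformat's actual implementation.
GAP_CHARS = set("-._~")  # assumed gap symbols


def drop_columns(rows, should_drop):
    """Return the alignment rows without the columns should_drop() selects."""
    keep = [i for i, col in enumerate(zip(*rows)) if not should_drop(col)]
    return ["".join(row[i] for i in keep) for row in rows]


def all_gaps(col):
    """--mingap-like test: the column consists entirely of gaps."""
    return all(ch in GAP_CHARS for ch in col)


def any_gap(col):
    """--nogap-like test: the column contains at least one gap."""
    return any(ch in GAP_CHARS for ch in col)


# A toy alignment (hypothetical data); all rows have equal length.
rows = ["AC--GU",
        "A.-CGU",
        "AC--G-"]

mingap = drop_columns(rows, all_gaps)
nogap = drop_columns(rows, any_gap)
print(mingap)
print(nogap)
```

              On this toy input the all-gap test removes a single column,
              while the any-gap test leaves only the two columns in which
              every sequence has a residue.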

       --pfam For SELEX alignment output format only, put the entire alignment
              in  one  block (don’t wrap into multiple blocks).  This is close
              to  the  format  used  internally  by  Pfam  in  Stockholm   and
              Cambridge.

       --sam  Try  to convert gap characters to UC Santa Cruz SAM style, where
              a .  means a gap in an insert column, and a - means  a  deletion
              in  a  consensus/match  column.  This  only works for converting
              aligned file formats, and only if the alignment already  adheres
              to   the   SAM   convention   of  upper  case  for  residues  in
              consensus/match columns, and lower case for residues  in  insert
              columns.  This is true, for instance, of all alignments produced
              by old versions  of  HMMER.  (HMMER2  produces  alignments  that
              adhere to SAM’s conventions even in gap character choice.)  This
              option was added to allow Pfam alignments to be reformatted into
              something  more  suitable for profile HMM construction using the
              UCSC SAM software.

       --samfrac <x>
              Try to convert the alignment gap characters and residue cases to
              UC  Santa  Cruz  SAM  style, where a .  means a gap in an insert
              column and a - means a deletion in a consensus/match column, and
              upper  case  means match/consensus residues and lower case means
              inserted residues. This will only work  for  converting  aligned
              file  formats,  but  unlike  the  --sam  option,  it  will  work
              regardless of whether the file adheres to the  upper/lower  case
              residue  convention.  Instead, any column containing more than a
              fraction <x> of gap  characters  is  interpreted  as  an  insert
              column,  and all other columns are interpreted as match columns.
              This option was added to allow Pfam alignments to be reformatted
              into  something more suitable for profile HMM construction using
              the UCSC SAM software.
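
              The column rule for --samfrac can be sketched as follows. This
              is an illustration, not sreformat's own code; the function
              name samfrac and the gap symbols in GAP_CHARS are assumptions.

```python
# Illustrative sketch only; not sreformat's actual implementation.
GAP_CHARS = set("-._~")  # assumed gap symbols


def samfrac(rows, x):
    """Rewrite an alignment in SAM style using a gap-fraction threshold.

    A column whose gap fraction is greater than x is treated as an insert
    column (lower case residues, '.' gaps); every other column is treated
    as a match column (upper case residues, '-' gaps).
    """
    nseq = len(rows)
    is_insert = [
        sum(ch in GAP_CHARS for ch in col) / nseq > x
        for col in zip(*rows)
    ]
    out = []
    for row in rows:
        chars = []
        for ch, ins in zip(row, is_insert):
            if ch in GAP_CHARS:
                chars.append("." if ins else "-")
            else:
                chars.append(ch.lower() if ins else ch.upper())
        out.append("".join(chars))
    return out


# A toy alignment (hypothetical data) with inconsistent case and gap symbols.
rows = ["ACg-U", "A--cU", "AC--U", "a.-.u"]
converted = samfrac(rows, 0.5)
print(converted)
```

              In this example the second column is exactly half gaps, so
              with <x> set to 0.5 it still counts as a match column; only
              columns that exceed the fraction become inserts.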

       --wussify
              Convert  RNA  secondary  structure  annotation   strings   (both
              consensus  and individual) from old "KHS" format, ><, to the new
              WUSS notation, <>. If the notation is already  in  WUSS  format,
              this  option  will  screw it up, without warning. Only SELEX and
              Stockholm  format  files  have  secondary  structure  markup  at
              present.

       --dewuss
              Convert  RNA secondary structure annotation strings from the new
              WUSS notation, <>, back to  the  old  KHS  format,  ><.  If  the
              annotation  is  already  in  KHS,  this  option will corrupt it,
              without warning.  Only SELEX and  Stockholm  format  files  have
              secondary structure markup.
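
              Read literally, --wussify and --dewuss both exchange the two
              angle brackets. The sketch below shows only that exchange and
              is not sreformat's own code; the name swap_brackets is an
              assumption, and any other symbol changes the program may make
              are not modeled.

```python
# Illustrative sketch only; models just the bracket exchange described above.
SWAP = str.maketrans("<>", "><")


def swap_brackets(ss: str) -> str:
    """Exchange '<' and '>' in a secondary structure annotation string."""
    return ss.translate(SWAP)


khs = ">>>..<<<"            # old KHS-style pairing (hypothetical example)
wuss = swap_brackets(khs)   # WUSS-style pairing
print(wuss)

# Applying the same exchange again restores the original string.
print(swap_brackets(wuss) == khs)
```

              Under this simplified reading the same exchange serves both
              directions, which is consistent with the warnings above:
              running either option on annotation that is already in the
              target notation flips it back.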

AUTHOR

       Biosquid   and   its   documentation   are   Copyright   (C)  1992-2003
       HHMI/Washington University School of Medicine. Freely distributed under
       the  GNU  General Public License (GPL). See COPYING in the source code
       distribution for more details, or contact me.

       Sean Eddy
       HHMI/Department of Genetics
       Washington University School of Medicine
       4444 Forest Park Blvd., Box 8510
       St Louis, MO 63108 USA
       Phone: 1-314-362-7666
       FAX  : 1-314-362-2157
       Email: eddy@genetics.wustl.edu
