NAME

       shuffle - randomize the sequences in a sequence file

SYNOPSIS

       shuffle [options] seqfile

DESCRIPTION

       shuffle  reads  a  sequence file seqfile, randomizes each sequence, and
       prints the randomized sequences in FASTA format on standard output. The
       sequence  names are unchanged; this allows you to track down the source
       of each randomized sequence if necessary.

       The default is  to  simply  shuffle  each  input  sequence,  preserving
       monosymbol   composition   exactly.  To  shuffle  each  sequence  while
       preserving both its monosymbol and disymbol  composition  exactly,  use
       the -d option.
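
       As an illustration only, the default behaviour can be sketched in
       Python as a uniform random permutation of the symbols. This is not
       shuffle's own source code; the function name shuffle_sequence and the
       example sequence are hypothetical.

```python
import random
from collections import Counter

def shuffle_sequence(seq, rng):
    """Return a uniformly random permutation of the symbols in seq."""
    symbols = list(seq)
    rng.shuffle(symbols)  # in-place Fisher-Yates style permutation
    return "".join(symbols)

rng = random.Random(42)  # fixed seed only so this sketch is repeatable
original = "ACGTACGTTTGA"  # hypothetical input sequence
shuffled = shuffle_sequence(original, rng)

# A permutation cannot change how often each symbol occurs.
assert Counter(shuffled) == Counter(original)
```

       Because the output is a permutation of the input, monosymbol
       composition is preserved exactly; disymbol composition generally is
       not, which is what the -d option addresses.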

       The  -0  and  -1  options allow you to generate sequences with the same
       Markov properties as each input  sequence.  With  -0,  for  each  input
       sequence,  0th  order  Markov  statistics  are  collected  (e.g. symbol
       composition),  and  a  new  sequence  is  generated   with   the   same
       composition.   With  -1,  the generated sequence has the same 1st order
       Markov properties as  the  input  sequence  (e.g.   the  same  disymbol
       frequencies).
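
       The two Markov modes described above can be sketched roughly as
       follows. This is a minimal Python illustration, not shuffle's actual
       implementation: the names markov0 and markov1 are hypothetical, and
       the fallback used when a symbol has no observed successor is an
       assumption of this sketch.

```python
import random
from collections import Counter, defaultdict

def markov0(seq, rng):
    """Draw each position independently from seq's symbol frequencies."""
    if not seq:
        return ""
    counts = Counter(seq)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(seq)))

def markov1(seq, rng):
    """Generate a chain whose transitions follow seq's observed symbol pairs."""
    if not seq:
        return ""
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    overall = Counter(seq)
    out = [seq[0]]  # the -1 option keeps the first residue of the input
    while len(out) < len(seq):
        successors = transitions.get(out[-1])
        # Assumption of this sketch: if the current symbol was never
        # followed by anything in the input, fall back to overall composition.
        table = successors if successors else overall
        symbols = list(table)
        weights = [table[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)

rng = random.Random(0)  # fixed seed only so this sketch is repeatable
print(markov0("ACGTACGTTTGA", rng))
print(markov1("ACGTACGTTTGA", rng))
```

       Both functions sample from estimated frequencies, so the output
       composition matches the input only in expectation.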

       Note that the default and -0 are similar, as are -d and -1. The
       difference is that the shuffling algorithms preserve composition
       exactly, while the Markov algorithms generate a sequence whose
       composition matches the input only on average.

       Other  shuffling  algorithms are also available, as documented below in
       the options.

OPTIONS

       -0     Calculate 0th order Markov frequencies of  each  input  sequence
              (e.g.  residue  composition); generate output sequence using the
              same 0th order Markov frequencies.

       -1     Calculate 1st order Markov frequencies for each  input  sequence
              (e.g. diresidue composition); generate output sequence using the
              same 1st order Markov frequencies.  The  first  residue  of  the
              output  sequence  is always the same as the first residue of the
              input sequence.

       -d     Shuffle the input sequence while preserving both monosymbol  and
              disymbol  composition  exactly.  Uses  an algorithm published by
              S.F. Altschul and B.W. Erickson,  Mol.  Biol.  Evol.  2:526-538,
              1985.

       -h     Print  brief  help;  includes  version number and summary of all
              options, including expert options.

       -l     Look only at the length of  each  input  sequence;  generate  an
              i.i.d. output protein sequence of that length, using monoresidue
              frequencies typical of proteins (taken from Swissprot 35).

       -n <n> Make <n> different randomizations  of  each  input  sequence  in
              seqfile, rather than the default of one.

       -r     Generate  the  output  sequence by reversing the input sequence.
              (Therefore  only  one  "randomization"  per  input  sequence  is
              possible, so it’s not worth using -n if you use reversal.)

       -t <n> Truncate  each  input  sequence to a fixed length of exactly <n>
              residues. If the input  sequence  is  shorter  than  <n>  it  is
              discarded (therefore the output file may contain fewer sequences
              than the input file).  If the input sequence is longer than  <n>
              a contiguous subsequence is randomly chosen.

       -w <n> Regionally shuffle each input sequence in windows of <n>
              residues, preserving local residue composition in each window.
              This is probably a better shuffling algorithm for biosequences
              with nonstationary residue composition (i.e. composition that
              varies along the sequence, such as between different isochores
              in human genome sequence).

       -B     (Babelfish). Autodetect and read a sequence  file  format  other
              than the default (FASTA). Almost any common sequence file format
              is recognized (including Genbank, EMBL, SWISS-PROT, PIR, and GCG
              unaligned  sequence formats, and Stockholm, GCG MSF, and Clustal
              alignment formats). See the printed documentation for a complete
              list of supported formats.
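
       The -t and -w options above can be illustrated with a short Python
       sketch. It is not taken from shuffle's source: the function names are
       hypothetical, and the use of consecutive, non-overlapping windows
       starting at the first residue is an assumption, since this page does
       not say how windows are placed.

```python
import random

def truncate(seq, n, rng):
    """Return a random contiguous slice of length n, or None if seq is too short."""
    if len(seq) < n:
        return None  # the -t option discards sequences shorter than n
    start = rng.randint(0, len(seq) - n)  # randint is inclusive on both ends
    return seq[start:start + n]

def window_shuffle(seq, w, rng):
    """Shuffle seq independently inside consecutive windows of w symbols."""
    if w < 1:
        raise ValueError("window size must be positive")
    out = []
    for i in range(0, len(seq), w):
        window = list(seq[i:i + w])  # the final window may be shorter than w
        rng.shuffle(window)
        out.extend(window)
    return "".join(out)

rng = random.Random(3)  # fixed seed only so this sketch is repeatable
print(truncate("ACGTACGTTTGA", 5, rng))
print(window_shuffle("AAAACCCCGGGGTTTT", 8, rng))
```

       In the second call, the A/C-rich first half and the G/T-rich second
       half stay in place, which is the point of regional shuffling for
       sequences whose composition varies along their length.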

EXPERT OPTIONS

       --informat <s>
              Specify that the sequence file is in format <s>, rather than the
              default FASTA format.  Common examples  include  Genbank,  EMBL,
              GCG,  PIR,  Stockholm,  Clustal, MSF, or PHYLIP; see the printed
              documentation for a complete  list  of  accepted  format  names.
              This  option  overrides  the default expected format (FASTA) and
              the -B Babelfish autodetection option.

       --nodesc
              Do not output any sequence description in the output file,  only
              the sequence names.

       --seed <s>
              Set the random number seed to <s>. If you want reproducible
              results, use the same seed each time. By default, shuffle uses
              a different seed on each run, so it does not generate the same
              output in subsequent runs with the same input.
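
       The effect of a fixed seed can be illustrated in Python. This sketch
       uses Python's own random module, not shuffle's random number
       generator, so it shows only the general principle; the function name
       shuffled is hypothetical.

```python
import random

def shuffled(seq, seed=None):
    """Permute seq with a generator seeded by seed (None means an unpredictable seed)."""
    rng = random.Random(seed)
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)

seq = "ACGTACGTTTGA"  # hypothetical input sequence

# The same seed reproduces the same permutation.
assert shuffled(seq, seed=7) == shuffled(seq, seed=7)

# Without a seed, repeated calls will usually differ from one another.
print(shuffled(seq), shuffled(seq))
```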

SEE ALSO

       afetch(1),   alistat(1),   compalign(1),   compstruct(1),   revcomp(1),
       seqsplit(1),    seqstat(1),    sfetch(1),    sindex(1),   sreformat(1),
       stranslate(1), weight(1).

AUTHOR

       Biosquid and its documentation are Copyright (C) 1992-2003
       HHMI/Washington University School of Medicine. Freely distributed
       under the GNU General Public License (GPL). See COPYING in the source
       code distribution for more details, or contact me.

       Sean Eddy
       HHMI/Department of Genetics
       Washington University School of Medicine
       4444 Forest Park Blvd., Box 8510
       St Louis, MO 63108 USA
       Phone: 1-314-362-7666
       FAX  : 1-314-362-2157
       Email: eddy@genetics.wustl.edu