illumina2srf - Builds an SRF file from an Illumina/Solexa GA run

NAME

       illumina2srf  -  Builds  an  SRF  file  from  an Illumina/Solexa GA run
       folder.

SYNOPSIS

       illumina2srf  [options] tile_seq_file ...

DESCRIPTION

       illumina2srf converts the Illumina GA-pipeline run folder  output  into
       an   SRF  file.  It  should  be  run  from  the  Bustard<version><date>
       directory. It has a wealth of options, listed below, although many
       have defaults and may be omitted if the run folder follows the
       standard directory layout. The arguments, after the options, should be
       the filenames of the sequence files, e.g. s_8_*_seq.txt. All other
       filenames are derived from the _seq.txt filenames.
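
       As a rough illustration of how companion filenames can follow from a
       _seq.txt name, the Python sketch below swaps the suffix and extracts
       the lane and tile numbers. The helper name companion_files is
       hypothetical, the s_LANE_TILE_seq.txt pattern is assumed from the
       example above, and none of this is taken from the illumina2srf
       source. It expects a bare filename without a directory part.

```python
import re

# Assumed naming pattern: s_LANE_TILE_seq.txt (e.g. s_8_0001_seq.txt).
SEQ_NAME = re.compile(r"^(?P<stem>s_(?P<lane>\d+)_(?P<tile>\d+))_seq\.txt$")

def companion_files(seq_filename, suffixes=("prb", "sig2")):
    """Return lane, tile and sibling filenames derived from a _seq.txt name."""
    m = SEQ_NAME.match(seq_filename)
    if m is None:
        raise ValueError("not a _seq.txt filename: %r" % seq_filename)
    # Swap the "seq" suffix for each requested companion suffix.
    names = {s: "%s_%s.txt" % (m.group("stem"), s) for s in suffixes}
    return int(m.group("lane")), int(m.group("tile")), names

lane, tile, names = companion_files("s_8_0001_seq.txt")
print(lane, tile, names["prb"], names["sig2"])
```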

       An SRF file is structured as a container, much like zip or tar. Its
       contents, however, may be split into variable and common
       components allowing for better compression. For illumina2srf that means
       that  we  store  trace  data in ZTR format with common ZTR chunks (text
       identifiers such as base-caller name  and  version,  matrix  files  and
       compression  specifications)  in  an SRF Data Block Header and variable
       components (sequence, quality and traces) in ZTR chunks held within  an
       SRF  Data  Block.  Typically  we have 10,000 Data Blocks per Data Block
       Header.

       The most important decision in producing the SRF file is what data to put
       in  it.  By  default  the  program  writes the sequence and probability
       values along with the "processed" trace intensities. In GAPipeline v1.0
       and  earlier  these  are  in the _seq.txt, _prb.txt and _sig2.txt files
       held within the main Bustard directory. In addition  to  these  the  -r
       option requests storage of the "raw" trace intensities, comprising both
       the pre-processed intensities and noise estimates  from  the  Firecrest
       _int.txt   and   _nse.txt   files   respectively.  To  store  only  raw
       intensities, skipping  processed  data,  specify  the  -r  -P  options.
       Finally the -I option can be used to store data from IPAR format files.

       Confidence values have varied considerably across pipeline releases.
       In GAPipeline 1.0 and earlier the _prb.txt files in the Bustard
       directory contain four quality values per base encoded using a
       log-odds system: 10*log10(P/(1-P)). In addition to this there are
       various calibrated formats in the GERALD directory with one Phred scale
       value per base. See the -qf, -qr and -qc parameters.
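
       The relationship between the two scales can be sketched in Python as
       below. This assumes that P is the probability that the base call is
       correct and that the logarithm is base 10, so the Phred value is
       -10*log10(1-P). The function names are illustrative and are not part
       of illumina2srf.

```python
import math

def logodds_to_prob(q):
    """Invert q = 10*log10(P/(1-P)) to recover P."""
    odds = 10.0 ** (q / 10.0)
    return odds / (1.0 + odds)

def logodds_to_phred(q):
    """Phred score -10*log10(1-P) for the same P: 10*log10(1 + 10**(q/10))."""
    return 10.0 * math.log10(1.0 + 10.0 ** (q / 10.0))

# The two scales differ most at low quality and converge at high quality.
for q in (-5, 0, 10, 40):
    print(q, round(logodds_to_prob(q), 4), round(logodds_to_phred(q), 2))
```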

       There are a number of smaller ancillary data files that get stored too.
       As SRF has no per-lane or per-run storage mechanism, these are added
       to every SRF Data Block Header, of which there may be several per tile.
       However the overhead in duplicating this data is not significant  given
       the  size  of  the individual SRF Data Blocks. The ancillary data files
       also stored  are  .params  files  (for  both  Bustard  and  Firecrest),
       matrices  (specified  using -mf and -mr) and phasing XML files (-pf and
       -pr).

OPTIONS

   Trace data-source options
       -r, -R Specifies to store (-r) or not to store (-R - the default) "raw"
              data. This currently comprises the contents of the _int.txt
              and _nse.txt files in the Firecrest directory.

       -p, -P Specifies to store (-p - the default) or not to store  (-P)  the
              "processed" data. This is the contents of the _sig2.txt files in
              the Bustard directory.

       -u     Deprecated. Older GAPipeline  releases  created  _sig.txt  files
              holding  semi-processed  data  with  compensation  for  the  dye
              spectral overlap, but before phasing correction  steps.  The  -u
              argument  indicates that the processed data should be taken from
              these files instead of _sig2.txt.

       -I     Reads IPAR files instead of the raw trace data files. These  are
              a  different  format used by the incremental processing software
              when the pipeline is run on the instrument control PC itself.

   Quality value data-source options
       -qf filename
              Specifies the filename of the calibrated quality values for  the
              forward-read  or  both  the forward and reverse read combined if
              appropriate. filename should be in Illumina’s  fastq  derivative
              format, with quality values stored as ASCII 64 plus the log-odds
              score.
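
              The "ASCII 64 plus score" encoding can be decoded as in this
              minimal Python sketch. The function name decode_ascii64 is
              hypothetical and the input string is made up for illustration.

```python
def decode_ascii64(quality_string):
    """Turn each character into an integer score by subtracting 64."""
    return [ord(ch) - 64 for ch in quality_string]

# 'h' is ASCII 104, '@' is ASCII 64 and ';' is ASCII 59.
print(decode_ascii64("h@;"))
```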

       -qr filename
              If the calibrated fastq files are split into forward and reverse
              files  then  filename specifies the reverse sequences. Otherwise
              we assume they are tacked onto the end of the forward  sequences
              specified  in  -qf.  Like  the  former  file,  this should be in
              Illumina’s fastq-like format.

       -qc directory
              This is an alternative to the -qf and -qr options above  and  is
              mutually exclusive with them. This specifies that the calibrated
              data should  come  from  files  named  "directory/s_%d_qcal.txt"
              where "%d" is replaced by the current tile number.

   Filtering options
       -c value
              Only store traces that have a "chastity" score >= value. This
              is mutually exclusive with the -C option.

       -C value
              Unlike the -c option, traces with a "chastity" score < value
              are still stored in the SRF file but are marked as bad reads.
              srf2fasta and srf2fastq have options to subsequently filter
              out bad reads using this flag. This is mutually exclusive with
              the -c option.
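
              The difference between dropping and flagging can be sketched
              in Python as follows. This is only an illustration: the
              (name, chastity) tuples and the apply_chastity helper are
              hypothetical, and how the chastity score itself is computed is
              not covered here.

```python
def apply_chastity(reads, threshold, drop):
    """reads: iterable of (name, chastity) pairs.

    drop=True  mimics -c: reads below the threshold are left out.
    drop=False mimics -C: every read is kept, with a bad flag on failures.
    Returns a list of (name, is_bad) pairs.
    """
    stored = []
    for name, chastity in reads:
        is_bad = chastity < threshold
        if is_bad and drop:
            continue
        stored.append((name, is_bad))
    return stored

reads = [("r1", 0.9), ("r2", 0.4), ("r3", 0.6)]
print(apply_chastity(reads, 0.6, drop=True))
print(apply_chastity(reads, 0.6, drop=False))
```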

       -s N   This skips the first N cycles  of  a  trace  (including  signal,
              sequence and quality values) when writing it to an SRF file. The
              purpose of this is  to  remove  primer  bases,  but  it  is  not
              recommended. Instead the SRF file should be using the ZTR region
              chunk (REGN) to indicate which portion of a trace is valid.

   Read naming
       Read names are split into two halves, a prefix and a suffix. One common
       prefix  is  stored  in  each  and every SRF Data Block Header while the
       suffix is stored in every  Data  Block.  This  combination  allows  for
       removal of repetitive data in order to shrink the SRF file size.

       -n format
              Controls  the format used for creating the sequence name suffix.
              This uses a printf style system of percent expansions that  will
              be replaced with the appropriate data. The percent expansions
              are:

              %%     A literal percent character

              %d     Run  date  (taken  from  parsing  the   current   working
                     directory)

              %m     Machine  name  (taken  from  parsing  the current working
                     directory)

              %r     Run  number  (taken  from  parsing  the  current  working
                     directory)

              %l     Lane number (%L for hexadecimal encoding)

              %t     Tile number (%T for hexadecimal encoding)

              %x     X coordinate (%X for hexadecimal encoding)

              %y     Y coordinate (%Y for hexadecimal encoding)

              %c     Counter; increments by 1 for every sequence in the tile
                     (%C for hexadecimal encoding).

              All the above format strings accept an optional numerical value
              between the percent and the format character. This is used to
              control the field width. For example, to print the X and Y
              coordinates to 3 hexadecimal places we could use -n "%3X:%3Y".

              The default format is "%x:%y".

       -N format
              Specifies the format string for encoding the read name
              prefix. It follows the same formatting rules as the -n option
              above.

              The default format is "%m_%r:%l:%t:".
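
              To make the expansion rules concrete, here is a Python sketch
              of how a prefix and suffix might be expanded and joined. It is
              not the illumina2srf implementation. The expand_name helper
              and the sample field values (machine "IL2", run 45 and so on)
              are hypothetical, and the padding behaviour (zero padding for
              numbers, upper-case hexadecimal digits) is an assumption.

```python
import re

# One percent code: optional field width, then a format character.
CODE = re.compile(r"%(\d*)([%dmrltxycLTXYC])")

def expand_name(fmt, fields):
    """Expand percent codes in fmt; fields maps lowercase code letters to values."""
    def repl(match):
        width = int(match.group(1) or 0)
        code = match.group(2)
        if code == "%":
            return "%"
        value = fields[code.lower()]
        if code in "dmr":
            # Run date, machine name, run number: treated as plain text here.
            return str(value).rjust(width)
        if code.isupper():
            # Assumption: hexadecimal output is upper case and zero padded.
            return format(value, "X").zfill(width)
        # Assumption: decimal output is zero padded to the field width.
        return str(value).zfill(width)
    return CODE.sub(repl, fmt)

fields = {"m": "IL2", "r": 45, "l": 4, "t": 21, "x": 26, "y": 255}
# Full read name = prefix (-N default) followed by suffix (-n default).
print(expand_name("%m_%r:%l:%t:", fields) + expand_name("%x:%y", fields))
print(expand_name("%3X:%3Y", fields))
```

              Because the prefix is the same for every read covered by one
              Data Block Header, only the short suffix needs to be stored
              per read.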

   Ancillary data files
       These  options  govern  the  extra  files  stored per tile (or strictly
       speaking per SRF Data Block Header).

       -2 cycle
              This specifies the cycle number, counting from 1, of the  second
              read forming a read-pair. It is used for automatic generation of
              filenames  in  several  of  the  options  below  and  also   for
              construction of the ZTR region (REGN) chunks.

       -mf filename
              The  filename  of  the  forward  matrix file. If a single printf
              numerical percent rule is used (such as "%d") then  it  will  be
              replaced  by  the  lane  number.  When not specified the default
              filename will be ../Matrix/s_%d_02_matrix.txt.

       -mr filename
              The filename of the reverse matrix file - only  used  on  paired
              end  runs.  If  a  single  printf numerical percent rule is used
              (such as "%d") then it will be replaced by the lane number.   If
              a  second  printf  percent rule is used then it will be replaced
              with the cycle number that the paired read starts  on.  This  is
              equivalent  to  the cycle number specified in the -2 option plus
              one. (The plus one comes from using the second cycle per end for
              matrix  calibration.)   When  -mr  is  not specified the default
              filename will be ../Matrix/s_%d_%02d_matrix.txt.

       -pf filename
              Specifies the filename of the forward-read phasing XML file.  As
              with -mf a printf numerical percent rule will be replaced by the
              lane    number.    The    default     filename     format     is
              Phasing/s_%d_01_phasing.xml.

       -pr filename
              Specifies the filename of the reverse-read phasing XML file. As
              with -mr the first two printf numerical percent rules will be
              replaced by the lane number and the cycle number. Unlike -mr,
              though, the cycle number is the value given to the -2 option
              as-is instead of plus one. The default filename format is
              Phasing/s_%d_%02d_phasing.xml.
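
              The default templates can be expanded as in the Python sketch
              below, which relies on the % operator following printf
              conventions for %d and %02d. The helper name
              ancillary_defaults and the lane and cycle values are
              illustrative only. It assumes the second-read cycle is the
              value given to -2, with the matrix file using that cycle plus
              one and the phasing file using it unchanged.

```python
def ancillary_defaults(lane, second_read_cycle=None):
    """Expand the default matrix and phasing filename templates.

    second_read_cycle is the value given to -2 (None for single-end runs).
    """
    names = {
        "mf": "../Matrix/s_%d_02_matrix.txt" % (lane,),
        "pf": "Phasing/s_%d_01_phasing.xml" % (lane,),
    }
    if second_read_cycle is not None:
        # Matrix calibration uses the second cycle of the read, hence +1.
        names["mr"] = "../Matrix/s_%d_%02d_matrix.txt" % (lane, second_read_cycle + 1)
        # The phasing file uses the -2 cycle number unchanged.
        names["pr"] = "Phasing/s_%d_%02d_phasing.xml" % (lane, second_read_cycle)
    return names

for option, name in sorted(ancillary_defaults(4, 37).items()):
    print("-" + option, name)
```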

   Other options
       -o srf_filename
              Specifies the output filename to write the SRF data to.
              Defaults to "traces.srf".

       -i     Indicates that an index should be appended to the SRF file. This
              allows for random access based on the sequence name.

       -d     Enable dots-mode. This outputs a full-stop per input tile.  Most
              useful in conjunction with quiet mode. Default is off.

       -q     Quiet  mode.  Do  not  output  commentary on which tile is being
              processed and the metrics about it. Default off.

EXAMPLES

       To store lane 4 from a paired-end run with raw traces, no processed
       data and calibrated confidence values:

           # From Bustard directory
           illumina2srf -o all.srf -r -P \
               -qf GERALD*/s_4_1_sequence.txt \
               -qr GERALD*/s_4_2_sequence.txt \
               s_4_*_seq.txt

       To store and index only processed traces with chastity >= 0.6:

           illumina2srf -o s4.srf -i -c 0.6 s_4_*_seq.txt

CAVEATS

       There  are  many  mutually  exclusive options, some of which may be for
       processing file formats that no  longer  exist.  This  is  due  to  the
       history  of  the  program  and the rapidly changing nature of the files
       being processed. Some future culling of options and file formats can be
       expected.

       Some assumptions are made as to the directory layout and the ability to
       parse the run folder directory name. There is currently no way to
       override some of this information, including run date, run number and
       GAPipeline program version numbers.

AUTHOR

       James Bonfield, Wellcome Trust Sanger Institute

                                 September 29                  illumina2srf(1)