NAME
illumina2srf - Builds an SRF file from an Illumina/Solexa GA run
folder.
SYNOPSIS
illumina2srf [options] tile_seq_file ...
DESCRIPTION
illumina2srf converts the Illumina GA-pipeline run folder output into
an SRF file. It should be run from the Bustard<version><date>
directory. It has a wealth of options, listed below, although many
have defaults and may be omitted if the run folder follows the
standard directory layout. The arguments, after the options, should be
the filenames of the sequence files, e.g. s_8_*_seq.txt. All other
filenames are derived from the _seq.txt filenames.
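As a rough illustration of that derivation, the sketch below swaps the _seq.txt suffix for the other Bustard suffixes named in this page. The helper name derived_names and the tile filename are hypothetical, and the real program may derive names differently (for example for files held in the Firecrest directory).

```shell
# Hypothetical helper: list the Bustard files assumed to sit beside a
# given _seq.txt file, derived by swapping the filename suffix.
derived_names() {
    base=${1%_seq.txt}                  # strip the _seq.txt suffix
    printf '%s\n' "${base}_prb.txt" "${base}_sig2.txt"
}

derived_names s_8_0001_seq.txt
```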
The main structure of an SRF file is as a container, much like zip or
tar. The contents however may be split into variable and common
components allowing for better compression. For illumina2srf that means
that we store trace data in ZTR format with common ZTR chunks (text
identifiers such as base-caller name and version, matrix files and
compression specifications) in an SRF Data Block Header and variable
components (sequence, quality and traces) in ZTR chunks held within an
SRF Data Block. Typically we have 10,000 Data Blocks per Data Block
Header.
The most important decision in producing the SRF file is what data to put
in it. By default the program writes the sequence and probability
values along with the "processed" trace intensities. In GAPipeline v1.0
and earlier these are in the _seq.txt, _prb.txt and _sig2.txt files
held within the main Bustard directory. In addition to these the -r
option requests storage of the "raw" trace intensities, comprising both
the pre-processed intensities and noise estimates from the Firecrest
_int.txt and _nse.txt files respectively. To store only raw
intensities, skipping processed data, specify the -r -P options.
Finally the -I option can be used to store data from IPAR format files.
Confidence values have been a source of large variation over the
pipeline releases. In GAPipeline 1.0 and earlier the _prb.txt files in
the Bustard directory contain four quality values per base encoded
using a log-odds system: 10*log10(P/(1-P)). In addition to this there are
various calibrated formats in the GERALD directory with one Phred scale
value per base. See the -qf, -qr and -qc parameters.
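To relate the two scales, a log-odds score Q can be converted to a Phred-scale value with 10*log10(1 + 10^(Q/10)), which follows from solving the log-odds formula for P. The sketch below is only an illustration using awk; the function name solexa_to_phred is hypothetical and is not part of illumina2srf.

```shell
# Hypothetical helper: convert one log-odds score to the Phred scale.
# awk's log() is the natural logarithm, so divide by log(10).
solexa_to_phred() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", 10 * log(1 + 10 ^ (q / 10)) / log(10) }'
}

solexa_to_phred 0
solexa_to_phred 40
```

The two scales differ most for low scores and converge as the score rises.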
There are a number of smaller ancillary data files that get stored too.
As there is no per-lane or per-run storage mechanism in SRF, these are added
for every SRF Data Block Header of which there may be several per tile.
However the overhead in duplicating this data is not significant given
the size of the individual SRF Data Blocks. The ancillary data files
also stored are .params files (for both Bustard and Firecrest),
matrices (specified using -mf and -mr) and phasing XML files (-pf and
-pr).
OPTIONS
Trace data-source options
-r, -R Specifies to store (-r) or not to store (-R - the default) "raw"
data. This currently comprises the contents of the
_int.txt and _nse.txt files in the Firecrest directory.
-p, -P Specifies to store (-p - the default) or not to store (-P) the
"processed" data. This is the contents of the _sig2.txt files in
the Bustard directory.
-u Deprecated. Older GAPipeline releases created _sig.txt files
holding semi-processed data with compensation for the dye
spectral overlap, but before phasing correction steps. The -u
argument indicates that the processed data should be taken from
these files instead of _sig2.txt.
-I Reads IPAR files instead of the raw trace data files. These are
a different format used by the incremental processing software
when the pipeline is run on the instrument control PC itself.
Quality value data-source options
-qf filename
Specifies the filename of the calibrated quality values for the
forward read, or for both the forward and reverse reads combined if
appropriate. filename should be in Illumina’s fastq derivative
format, with quality values stored as ASCII 64 plus the log-odds
score.
-qr filename
If the calibrated fastq files are split into forward and reverse
files then filename specifies the reverse sequences. Otherwise
we assume they are tacked onto the end of the forward sequences
specified in -qf. Like the former file, this should be in
Illumina’s fastq-like format.
-qc directory
This is an alternative to the -qf and -qr options above and is
mutually exclusive with them. This specifies that the calibrated
data should come from files named "directory/s_%d_qcal.txt"
where "%d" is replaced by the current tile number.
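The "ASCII 64 plus the log-odds score" encoding described under -qf can be decoded one character at a time by subtracting 64 from the character code. The sketch below is an illustration only; decode_qual is a hypothetical helper, not an illumina2srf feature.

```shell
# Hypothetical helper: decode a single quality character from
# Illumina's fastq-like format back to its log-odds score.
decode_qual() {
    code=$(printf '%d' "'$1")   # POSIX printf: a leading quote yields the character code
    echo $((code - 64))
}

decode_qual h
decode_qual @
```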
Filtering options
-c value
Only store traces that have a "chastity" score >= value. This
is mutually exclusive with the -C option.
-C value
Unlike the -c option, traces with a "chastity" score < value are
still stored in the SRF file but are marked as bad reads
instead. srf2fasta and srf2fastq have options to subsequently
filter out bad reads using this flag. This is mutually
exclusive with the -c option.
-s N This skips the first N cycles of a trace (including signal,
sequence and quality values) when writing it to an SRF file. The
purpose of this is to remove primer bases, but it is not
recommended. Instead the SRF file should be using the ZTR region
chunk (REGN) to indicate which portion of a trace is valid.
Read naming
Read names are split into two halves, a prefix and a suffix. One common
prefix is stored in each and every SRF Data Block Header while the
suffix is stored in every Data Block. This combination allows for
removal of repetitive data in order to shrink the SRF file size.
-n format
Controls the format used for creating the sequence name suffix.
This uses a printf style system of percent expansions that will
be replaced with the appropriate data. The percent expansions
are:
%% A literal percent character
%d Run date (taken from parsing the current working
directory)
%m Machine name (taken from parsing the current working
directory)
%r Run number (taken from parsing the current working
directory)
%l Lane number (%L for hexadecimal encoding)
%t Tile number (%T for hexadecimal encoding)
%x X coordinate (%X for hexadecimal encoding)
%y Y coordinate (%Y for hexadecimal encoding)
%c Counter; increments by 1 for every sequence in the tile
(%C for hexadecimal encoding).
All the above format strings have an optional numerical value
between the percent and the format character. This is used to
control the field width. For example to print the X and Y
coordinates to 3 hexadecimal places we could use -n "%3X:%3Y".
The default format is "%x:%y".
-N format
Specifies the format string for encoding the read name
prefix. It follows the same formatting rules as described for
-n above.
The default format is "%m_%r:%l:%t:".
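To show how the two default formats combine into a full read name, the sketch below joins a prefix built from "%m_%r:%l:%t:" with a suffix built from "%x:%y". All values are made up, the function name default_read_name is hypothetical, and the sketch assumes each field expands as a plain decimal number with no padding when no field width is given.

```shell
# Hypothetical helper mimicking the default -N and -n formats.
# Arguments: machine run lane tile x y (example values only).
default_read_name() {
    prefix=$(printf '%s_%d:%d:%d:' "$1" "$2" "$3" "$4")   # -N "%m_%r:%l:%t:"
    suffix=$(printf '%d:%d' "$5" "$6")                    # -n "%x:%y"
    printf '%s%s\n' "$prefix" "$suffix"
}

default_read_name IL2 45 4 17 123 456
```

Only the suffix changes from read to read within a tile, which is why storing the prefix once per Data Block Header saves space.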
Ancillary data files
These options govern the extra files stored per tile (or strictly
speaking per SRF Data Block Header).
-2 cycle
This specifies the cycle number, counting from 1, of the second
read forming a read-pair. It is used for automatic generation of
filenames in several of the options below and also for
construction of the ZTR region (REGN) chunks.
-mf filename
The filename of the forward matrix file. If a single printf
numerical percent rule is used (such as "%d") then it will be
replaced by the lane number. When not specified the default
filename will be ../Matrix/s_%d_02_matrix.txt.
-mr filename
The filename of the reverse matrix file - only used on paired
end runs. If a single printf numerical percent rule is used
(such as "%d") then it will be replaced by the lane number. If
a second printf percent rule is used then it will be replaced
with the cycle number that the paired read starts on. This is
equivalent to the cycle number specified in the -2 option plus
one. (The plus one comes from using the second cycle per end for
matrix calibration.) When -mr is not specified the default
filename will be ../Matrix/s_%d_%02d_matrix.txt.
-pf filename
Specifies the filename of the forward-read phasing XML file. As
with -mf a printf numerical percent rule will be replaced by the
lane number. The default filename format is
Phasing/s_%d_01_phasing.xml.
-pr filename
Specifies the filename of the reverse-read phasing XML file. As
with -mr the first two printf numerical percent rules will be
replaced by the lane number and the cycle number. Unlike -mr
though the cycle number is the value used in the -2 option as-is
instead of plus one. The default filename format is
Phasing/s_%d_%02d_phasing.xml.
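The default ancillary filenames described above can be previewed with the shell's own printf, which uses the same style of percent rules. This is a sketch under the rules stated in this section (lane number first, then the -2 cycle plus one for the reverse matrix and as-is for the reverse phasing file); ancillary_names is a hypothetical helper, not part of illumina2srf.

```shell
# Hypothetical helper: print the default matrix and phasing filenames.
# Arguments: lane number, cycle number as given to -2.
# Pass plain decimal numbers; a leading zero would be read as octal.
ancillary_names() {
    lane=$1
    second=$2
    printf '../Matrix/s_%d_02_matrix.txt\n' "$lane"                      # -mf default
    printf '../Matrix/s_%d_%02d_matrix.txt\n' "$lane" "$((second + 1))"  # -mr default
    printf 'Phasing/s_%d_01_phasing.xml\n' "$lane"                       # -pf default
    printf 'Phasing/s_%d_%02d_phasing.xml\n' "$lane" "$second"           # -pr default
}

ancillary_names 4 38
```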
Other options
-o srf_filename
Specifies the output filename to write the SRF data to.
Defaults to "traces.srf".
-i Indicates that an index should be appended to the SRF file. This
allows for random access based on the sequence name.
-d Enable dots-mode. This outputs a full-stop per input tile. Most
useful in conjunction with quiet mode. Default is off.
-q Quiet mode. Do not output commentary on which tile is being
processed and the metrics about it. Default off.
EXAMPLES
To store lane 4 from a paired end run with raw traces, no processed
data and calibrated confidence values:
# From Bustard directory
illumina2srf -o all.srf -r -P \
-qf GERALD*/s_4_1_sequence.txt \
-qr GERALD*/s_4_2_sequence.txt \
s_4_*_seq.txt
To store and index only processed traces with chastity >= 0.6:
illumina2srf -o s4.srf -i -c 0.6 s_4_*_seq.txt
CAVEATS
There are many mutually exclusive options, some of which may be for
processing file formats that no longer exist. This is due to the
history of the program and the rapidly changing nature of the files
being processed. Some future culling of options and file formats can be
expected.
Some assumptions are made as to the directory layout and the ability to
parse the run folder directory name. There are currently no ways to
override some of this information, including run date, run number and
GAPipeline program version numbers.
AUTHOR
James Bonfield, Wellcome Trust Sanger Institute
September 29 illumina2srf(1)