samtools - Utilities for the Sequence Alignment/Map (SAM) format

NAME

       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS

       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta

DESCRIPTION

       Samtools  is  a  set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format,  does  sorting,  merging  and  indexing, and allows to retrieve
       reads in any regions swiftly.

       Samtools is designed to work on a stream. It regards an input file  ‘-’
       as  the  standard  input (stdin) and an output file ‘-’ as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always output warning and error messages to the standard error
       output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
       HTTP  server  if  the  BAM file name starts with ‘ftp://’ or ‘http://’.
       Samtools checks the current working directory for the  index  file  and
       will  download  the  index upon absence. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS

       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>

       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort    alignments    by    leftmost    coordinates.     File
                 <out.prefix>.bam  will  be  created.  This  command  may also
                 create temporary files  <out.prefix>.%d.bam  when  the  whole
                 alignment  cannot be fitted into memory (controlled by option
                 -m).

                 OPTIONS:

                 -n      Sort  by  read  names  rather  than  by   chromosomal
                         coordinates

                 -m INT  Approximately    the    maximum    required   memory.
                         [500000000]

       merge     samtools  merge  [-h  inh.sam]   [-n]   <out.bam>   <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments.  The header reference lists
                 of all the input BAM files, and the @SQ headers  of  inh.sam,
                 if  any,  must  all  refer  to  the  same  set  of  reference
                 sequences.  The header reference list and (unless  overridden
                 by  -h) ‘@’ headers of in1.bam will be copied to out.bam, and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use the lines of FILE as ‘@’ headers to be copied  to
                         out.bam,   replacing  any  header  lines  that  would
                         otherwise be copied from in1.bam.  (FILE is  actually
                         in  SAM  format,  though any alignment records it may
                         contain are ignored.)

                 -n      The input alignments are sorted by read names  rather
                         than by chromosomal coordinates

       index     samtools index <aln.bam>

                 Index  sorted  alignment  for  fast random access. Index file
                 <aln.bam>.bai will be created.

       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
                 reqFlag]   [-F   skipFlag]  [-q  minMapQ]  [-l  library]  [-r
                 readGroup] <in.bam>|<in.sam> [region1 [...]]

                 Extract/print all or sub alignments in SAM or BAM format.  If
                 no  region  is specified, all the alignments will be printed;
                 otherwise only alignments overlapping the  specified  regions
                 will  be  output. An alignment may be given multiple times if
                 it is overlapping several regions. A region can be presented,
                 for  example,  in  the  following  format:  ‘chr2’ (the whole
                 chr2), ‘chr2:1000000’ (region starting from  1,000,000bp)  or
                 ‘chr2:1,000,000-2,000,000’   (region  between  1,000,000  and
                 2,000,000bp including the  end  points).  The  coordinate  is
                 1-based.

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decomprssion  and  is  thus  preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input is in SAM. If @SQ header lines are absent,  the
                         ‘-t’ option is required.

                 -t FILE This  file  is  TAB-delimited. Each line must contain
                         the reference name and the length of  the  reference,
                         one  line  for  each  distinct  reference; additional
                         fields are ignored. This file also defines the  order
                         of  the  reference  sequences  in sorting. If you run
                         ‘samtools faidx <ref.fa>’, the resultant  index  file
                         <ref.fa>.fai  can be used as this <in.ref_list> file.

                 -o FILE Output file [stdout]

                 -f INT  Only output alignments with all bits in  INT  present
                         in the FLAG field. INT can be in hex in the format of
                         /^0x[0-9A-F]+/ [0]

                 -F INT  Skip alignments with bits present in INT [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]

       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence  in  the  FASTA  format  or  extract
                 subsequence  from indexed reference sequence. If no region is
                 specified,   faidx   will   index   the   file   and   create
                 <ref.fasta>.fai  on the disk. If regions are speficified, the
                 subsequences will be retrieved and printed to stdout  in  the
                 FASTA  format.  The  input file can be compressed in the RAZF
                 format.

       pileup    samtools  pileup  [-f  in.ref.fasta]  [-t  in.ref_list]   [-l
                 in.site_list]    [-iscgS2]   [-T   theta]   [-N   nHap]   [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in  the  pileup  format.  In  the  pileup
                 format,  each  line represents a genomic position, consisting
                 of chromosome name, coordinate, reference base,  read  bases,
                 read  qualities  and alignment mapping qualities. Information
                 on match, mismatch, indel, strand, mapping quality and  start
                 and end of a read are all encoded at the read base column. At
                 this column, a dot stands for a match to the  reference  base
                 on  the  forward  strand,  a comma for a match on the reverse
                 strand, ‘ACGTN’ for a mismatch  on  the  forward  strand  and
                 ‘acgtn’  for  a  mismatch  on  the  reverse strand. A pattern
                 ‘\+[0-9]+[ACGTNacgtn]+’  indicates  there  is  an   insertion
                 between  this  reference  position  and  the  next  reference
                 position. The length of the insertion is given by the integer
                 in the pattern, followed by the inserted sequence. Similarly,
                 a pattern ‘-[0-9]+[ACGTNacgtn]+’ represents a  deletion  from
                 the  reference. The deleted bases will be presented as ‘*’ in
                 the following lines. Also at the read base column,  a  symbol
                 ‘^’  marks  the start of a read segment which is a contiguous
                 subsequence  on  the  read   separated   by   ‘N/S/H’   CIGAR
                 operations. The ASCII of the character following ‘^’ minus 33
                 gives the mapping quality. A symbol ‘$’ marks the  end  of  a
                 read segment.

                 If  option  -c  is  applied, the consensus base, Phred-scaled
                 consensus  quality,  SNP  quality  (i.e.   the   Phred-scaled
                 probability   of   the   consensus  being  identical  to  the
                 reference) and root mean square (RMS) mapping quality of  the
                 reads   covering  the  site  will  be  inserted  between  the
                 ‘reference base’ and  the  ‘read  bases’  columns.  An  indel
                 occupies  an  additional  line.  Each  indel line consists of
                 chromosome name, coordinate, a star, the genotype,  consensus
                 quality,  SNP quality, RMS mapping quality, # covering reads,
                 the first alllele, the second allele, # reads supporting  the
                 first  allele,  #  reads  supporting  the second allele and #
                 reads containing indels different from the top two alleles.

                 OPTIONS:

                 -s        Print the mapping quality as the last column.  This
                           option  makes  the output easier to parse, although
                           this format is not space efficient.

                 -S        The input file is in SAM.

                 -i        Only output pileup lines containing indels.

                 -f FILE   The reference sequence in the FASTA  format.  Index
                           file FILE.fai will be created if absent.

                 -M INT    Cap mapping quality at INT [60]

                 -t FILE   List  of  reference  names ane sequence lengths, in
                           the format described for  the  import  command.  If
                           this  option is present, samtools assumes the input
                           <in.alignment>  is  in  SAM  format;  otherwise  it
                           assumes in BAM format.

                 -l FILE   List  of sites at which pileup is output. This file
                           is space  delimited.  The  first  two  columns  are
                           required  to  be chromosome and 1-based coordinate.
                           Additional columns are ignored. It  is  recommended
                           to use option -s together with -l as in the default
                           format we may not know the mapping quality.

                 -c        Call the consensus  sequence  using  MAQ  consensus
                           model. Options -T, -N, -I and -r are only effective
                           when -c or -g is in use.

                 -g        Generate genotype likelihood in  the  binary  GLFv3
                           format. This option suppresses -c, -i and -s.

                 -T FLOAT  The  theta parameter (error dependency coefficient)
                           in the maq consensus calling model [0.85]

                 -N INT    Number of haplotypes in the sample (>=2) [2]

                 -r FLOAT  Expected fraction of differences between a pair  of
                           haplotypes [0.001]

                 -I INT    Phred  probability  of an indel in sequencing/prep.
                           [40]

       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In  the
                 viewer,  press  ‘?’  for  help  and  press  ‘g’  to check the
                 alignment  start  from  a   region   in   the   format   like
                 ‘chr10:10,000,000’.

       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.

       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs  have
                 identical  external  coordinates,  only  retain the pair with
                 highest mapping quality.  This command  ONLY  works  with  FR
                 orientation and requires ISIZE is correctly set.

       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove  potential  duplicates  for  single-ended  reads. This
                 command will treat all reads as single-ended even if they are
                 paired in fact.

       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate  the  MD tag. If the MD tag is already present, this
                 command will give a  warning  if  the  MD  tag  generated  is
                 different from the existing tag.

                 OPTIONS:

                 -e      Convert  a  the  read base to = if it is identical to
                         the aligned reference base.  Indel  caller  does  not
                         support the = bases at the moment.

SAM FORMAT

       SAM  is  TAB-delimited.  Apart from the header lines, which are started
       with the ‘@’ symbol, each alignment line consists of:

       +----+-------+----------------------------------------------------------+
       |Col | Field |                       Description                        |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIAGR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (‘=’ if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSistion                                   |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+

       Each bit in the FLAG field is defined as:

             +-------+--------------------------------------------------+
             | Flag  |                   Description                    |
             +-------+--------------------------------------------------+
             |0x0001 | the read is paired in sequencing                 |
             |0x0002 | the read is mapped in a proper pair              |
             |0x0004 | the query sequence itself is unmapped            |
             |0x0008 | the mate is unmapped                             |
             |0x0010 | strand of the query (1 for reverse)              |
             |0x0020 | strand of the mate                               |
             |0x0040 | the read is the first read in a pair             |
             |0x0080 | the read is the second read in a pair            |
             |0x0100 | the alignment is not primary                     |
             |0x0200 | the read fails platform/vendor quality checks    |
             |0x0400 | the read is either a PCR or an optical duplicate |
             +-------+--------------------------------------------------+

LIMITATIONS

       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
         bam_aux.c.

       o CIGAR operation P is not properly handled at the moment.

       o In  merging,  the input files are required to have the same number of
         reference sequences. The requirement can  be  relaxed.  In  addition,
         merging  does  not reconstruct the header dictionaries automatically.
         Endusers have to provide the correct  header.  Picard  is  better  at
         merging.

       o Samtools’ rmdup does not work for single-end data and does not remove
         duplicates across chromosomes. Picard is better.

AUTHOR

       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
       people  in  the  1000Genomes  Project  contributed  to  the  SAM format
       specification.

NAME

SYNOPSIS

DESCRIPTION

COMMANDS AND OPTIONS

SAM FORMAT

LIMITATIONS

AUTHOR

SEE ALSO