theseus - Maximum likelihood, multiple simultaneous superpositions with

NAME

       theseus - Maximum likelihood, multiple simultaneous superpositions with
       statistical analysis

SYNOPSIS

       theseus  [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]   pdbfile1
       [pdbfile2 ...]

       and

       theseus_align   [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]  -f
       pdbfile1 [pdbfile2 ...]

       Default usage is equivalent to:

       theseus -a0 -e2 -g1 -i200 -k-1 -p1e-7 -r theseus -v -P0 your.pdb

DESCRIPTION

       Theseus   superpositions   a   set   of    macromolecular    structures
       simultaneously using the method of maximum likelihood (ML), rather than
       the conventional least-squares criterion.   Theseus  assumes  that  the
       structures  are distributed according to a matrix Gaussian distribution
       and  that  the  eigenvalues  of  the  atomic  covariance   matrix   are
       hierarchically  distributed according to an inverse gamma distribution.
       This ML superpositioning model produces much more accurate  results  by
       essentially  downweighting  variable  regions  of the structures and by
       correcting for correlations among atoms.

       Theseus operates in two main modes, a mode for superimposing structures
       with  identical  sequences  and  a  mode  for structures with different
       sequences but similar structures:

              (1) A mode for superpositioning  macromolecules  with  identical
              sequences and numbers of residues, for instance, multiple models
              in an NMR family or multiple structures from  different  crystal
              forms of the same protein. In this mode, Theseus will read every
              model in every file on the command line and superposition  them.

              Example:

              theseus 1s40.pdb

              In the above example, 1s40.pdb is a pdb file of 10 NMR models.

              (2)  An  "alignment"  mode  for superpositioning structures with
              different sequences, for example,  multiple  structures  of  the
              cytochrome  c protein from different species or multiple mutated
              structures of hen egg white lysozyme.  This  mode  requires  the
              user to supply a sequence alignment file of the structures being
              superpositioned  (see  option  -A  and  "FILE  FORMATS"  below).
              Additionally, it may be necessary to supply a mapfile that tells
              theseus which PDB structure files correspond to which  sequences
              in  the alignment (see option -M and "FILE FORMATS" below). When
              superpositioning based on a seqeunce alignment, theseus  uses  a
              novel maximum likelihood algorithm for superpositioning multiple
              structures that include arbitrary gaps and  insertions  relative
              to   each  other.   Unlike  other  algorithms  for  simultaneous
              superpositioning  of  multiple  structures,   our   Expectation-
              Maximization  algorithm uses all available data by including all
              residues aligned with gaps in the calculations.  In  this  mode,
              if  there  are multiple structural models in a PDB file, theseus
              only reads the first model in each file on the command line.  In
              other  words, theseus treats the files on the command line as if
              there were only one structure per file.

              Example 1:

              theseus -A  cytc.aln  -M  cytc.filemap  d1cih__.pdb  d1csu__.pdb
              d1kyow_.pdb

              In  the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb
              are pdb files of cytochrome c domains from the SCOP database.

              Example 2:

              theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In this example, the theseus_align script is called  to  do  the
              hard  work  for you.  It will calculate a sequence alignment and
              then  superimpose  based  on   that   alignment.    The   script
              theseus_align  takes  the  same  options as the theseus program.
              Note, the first few lines of this script must  be  modified  for
              your  system,  since  it  calls  an  external  multiple sequence
              alignment program  to  do  the  alignment.   See  the  examples/
              directory for more details, including example files.

OPTIONS

   Algorithmic options, defaults in {brackets}:
       -a [selection]
              Atoms  to  include  in the superposition.  This option takes two
              types of arguments, either (1) a number specifying a preselected
              set  of atom types, or (2) an explict PDB-style, colon-delimited
              list of the atoms to include.

              For the preselected atom type  subsets,  the  following  integer
              options are available:

               · 0, alpha carbons for proteins, C1´ atoms for nucleic acids
               · 1, backbone
               · 2, all
               · 3, alpha and beta carbons
               · 4, all heavy atoms (no hydrogens)

              Note,  only  the  -a0  option is available when superpositioning
              structures with different sequences.

              To custom select an explicit set of atom types, the  atom  types
              must  be  specified  exactly  as  given  in  the PDB file field,
              including  spaces,  and  the  atom-types  must  encapsulated  in
              quotation  marks.   Multiple  atom  types must be delimited by a
              colon.  For example,

              -a’ N  : CA : C  : O  ’

              would specify the atom types in the peptide backbone.

       -c     Use ML  atomic  covariance  weighting  (fit  correlations,  much
              slower)

              Unless  you  have  many  different structures with few residues,
              fitting   the   correlation   matrix   is   likely   unwarranted
              statistically  due  to a plethora of parameters and a paucity of
              data.

       -e [n] Embedding algorithm for initializing the average structure
               · 0 = none; use randomly chosen model
               · {2} = {ML embedded structure}

       -f     Only read the first model of a multi-model PDB file

       -g [n] Hierarchical model for variances
               · 0 = none (may not converge)
               · {1} = inverse gamma distribution

       -h     Help/usage

       -i [nnn]
              Maximum iterations, {200}

       -k [n] constant minimum variance {-1} {if set to  negative  value,  the
              minimum variance is determined empirically}

       -p [precision]
              Requested relative precision for convergence, {1e-7}

       -r [root name]
              Root name to be used in naming the output files, {theseus}

       -s [n-n:...]
              Residue selection (e.g. -s15-45:50-55), {all}

       -S [n-n:...]
              Residues to exclude (e.g. -S15-45:50-55) {none}

              The  previous  two  options  have  the  same format. Residue (or
              alignment column) ranges are  indicated  by  beginning  and  end
              separated  by  a dash.  Multiple ranges, in any arbitrary order,
              are separated by a colon.  Chains may also be selected by giving
              the  chain  ID  immediately  preceding  the  residue range.  For
              example, -sA1-20:A40-71 will only include residues 1 through  20
              and  40  through  70 in chain A. Chains cannot be specified when
              superpositioning structures with different sequences.

       -v     use ML variance weighting (no correlations) {default}

   Input/output options:
       -A [sequence alignment file]
              Sequence alignment file to  use  as  a  guide  (CLUSTAL  or  A2M
              format)

              For   use   when   superpositioning  structures  with  different
              sequences.  See "FILE FORMATS" below.

       -E     Print expert options

       -F     Print FASTA files of the sequences in PDB files and quit

              A useful option when superpositioning structures with  different
              sequences.   The  files  output  with this option can be aligned
              with a multiple sequence alignment program such  as  CLUSTAL  or
              MUSCLE,  and the resulting output alignment file used as theseus
              input with the -A option.

       -h     Help/usage

       -I     Just calculate statistics for input file; don’t superposition

       -M [mapfile]
              File that maps sequences in the alignment file to PDB files

              A simple two-column formatted file; see  "FILE  FORMATS"  below.
              Used with mode 2.

       -n     Don’t write transformed pdb file

       -o [reference structure]
              Reference  file  to superposition on, all rotations are relative
              to the first model in this file

              For  example,  ’theseus   -o   cytc1.pdb   cytc1.pdb   cytc2.pdb
              cytc3.pdb’  will  superposition  the  structures  and rotate the
              entire final superposition so that the structure from  cytc1.pdb
              is  in  the  same  orientation  as the structure in the original
              cytc1.pdb PDB file.

       -O     Olve’s segID file

              Useful output when superpositioning  structures  with  different
              sequences   (mode   2).    In  ’theseus.pdb’,  the  main  output
              superposition PDB file, the segID field now holds the number  of
              the  sequence alignment column that it belongs to.  This number,
              divided by 100, is also echoed  in  the  B-factor  field.   When
              using  O (or any other capable molecular visualization program),
              one can then color by B-factor ranges and immediately see in the
              superposition  which regions of the structure are aligned in the
              sequence alignment file.  An additional  file  is  also  output,
              called  ’theseus_olve.pdb’  which  only  contains the very atoms
              that were included in the ML  superposition  calculation.   That
              is, it will only contain alpha carbons or phosphorous atoms, and
              it will only contain atoms from the columns selected with the -s
              or  "-S"  options.   Requested by Olve Peersen of Colorado State
              University.

       -V     Version

   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)

       -P [nnn]
              Number of principal components to calculate {0}

              In both of the above, the corresponding principal  component  is
              written  in  the  B-factor field of the output PDB file. Usually
              only the first few PCs are of any interest (maybe up to six).

               EXAMPLES theseus 2sdf.pdb

       theseus -l -r new2sdf 2sdf.pdb

       theseus -s15-45 -P3 2sdf.pdb

       theseus -A cytc.aln  -M  cytc.mapfile  -o  cytc1.pdb  -s1-40  cytc1.pdb
       cytc2.pdb cytc3.pdb cytc4.pdb

ENVIRONMENT

       You  can  set  the  environment  variable  ’PDBDIR’  to  your  PDB file
       directory and  theseus  will  look  there  after  the  present  working
       directory.   For  example,  in  the  C shell (tcsh or csh), you can put
       something akin to this in your .cshrc file:

       setenv PDBDIR ’/usr/share/pdbs/’

FILE FORMATS

       Theseus   will    read    standard    PDB    formatted    files    (see
       <http://www.rcsb.org/pdb/>).   Every  effort  has  been  made  for  the
       program to accept nonstandard CNS and X-PLOR file formats also.

       Two other files deserve  mention,  a  sequence  alignment  file  and  a
       mapfile.

   Sequence alignment file
       When  superpositioning  structures  with  different  residue identities
       (where the lengths of each the macromolecules in terms of residues  are
       not  necessarily equal), a sequence alignment file must be included for
       theseus to use as a  guide  (specified  by  the  -A  option).   Theseus
       accepts  both  CLUSTAL  and  A2M  (FASTA)  formatted  multiple sequence
       alignment files.

       NOTE 1: The residue sequence in the alignment must  match  exactly  the
       residue  sequence  given  in  the coordinates of the PDB file. That is,
       there can be no missing or extra residues that do not correspond to the
       sequence  in  the  PDB  file. An easy way to ensure that your sequences
       exactly match the PDB files is to generate the sequences using theseus’
       -F  option,  which  writes  out  a FASTA formatted sequence file of the
       chain(s) in the PDB files. The files output with this option  can  then
       be  aligned  with a multiple sequence alignment program such as CLUSTAL
       or MUSCLE, and the resulting output  alignment  file  used  as  theseus
       input with the -A option.

       NOTE  2:  Every  PDB  file  must  have  a corresponding sequence in the
       alignment.  However, not every sequence in the alignment needs to  have
       a  corresponding PDB file. That is, there can be extra sequences in the
       alignment that are not used for guiding the superposition.

   PDB -> Sequence mapfile
       If the names of the PDB  files  and  the  names  of  the  corresponding
       sequences  in  the alignemnt are identical, the mapfile may be omitted.
       Otherwise, Theseus needs to know which sequences in the alignment  file
       correspond  to  which PDB structure files. This information is included
       in a mapfile with a very simple format (specified with the -M  option).
       There  are  only  two columns separated by whitespace: the first column
       lists the names of the PDB structure files,  while  the  second  column
       lists the corresponding sequence names exactly as given in the multiple
       sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3

SCREEN OUTPUT

       Theseus  provides  output  describing  both   the   progress   of   the
       superpositioning and several statistics for the final result:

       Least-squares <sigma>:
              The  standard  deviation  for  the  superposition,  based on the
              conventional assumption of no correlation and  equal  variances.
              Basically equal to the RMSD from the average structure.

       Classical LS pairwise <RMSD>:
              The  conventional  RMSD  for the superposition, the average RMSD
              for all pairwise combinations of structures in the ensemble.

       Maximum Likelihood <sigma>:
              The ML analog of the standard deviation for  the  superposition.
              When  assuming  that  the  correlations  are  zero  (a  diagonal
              covariance matrix), this is equal to  the  square  root  of  the
              harmonic  average  of  the variances for each atom. In contrast,
              the ’Least-squares <sigma>’ given above reports the square  root
              of  the  arithmetic  average  of  the  variances.   The harmonic
              average is always less than  the  arithmetic  average,  and  the
              harmonic  average downweights large values proportional to their
              magnitude.  This  makes  sense   statistically,   because   when
              combining  values  one  should  weight them by the reciprocal of
              their variance (which is in fact what  the  ML  superpositioning
              method does).

       Log Likelihood:
              The  final  log  likelihood  of  the superposition, assuming the
              matrix  Gaussian  distribution  of  the   structures   and   the
              hierarchical  inverse  gamma  distribution of the eigenvalues of
              the covariance matrix.

       AIC:   The Akaike Information Criterion for  the  final  superposition.
              This  is an important statistic in likelihood analysis and model
              selection theory. It allows an objective comparison of  multiple
              theoretical models with different numbers of parameters. In this
              case, the higher the number the  better.  There  is  a  tradeoff
              between  fit to the data and the number of parameters being fit.
              Increasing the number of parameters in a model will always  give
              a  better fit to the data, but it also increases the uncertainty
              of the estimated values.   The  AIC  criterion  finds  the  best
              combination  by  (1)  maximizing  the  fit to the data while (2)
              minimizing the uncertainty due to the number of  parameters.  In
              the  superposition  case,  one  can  compare  the  least squares
              superposition  to  the  maximum  likelihood  superposition.  The
              method (or model) with the higher AIC is preferred. A difference
              in the AIC  of  2  or  more  is  considered  strong  statistical
              evidence for the better model.

       BIC:   The Bayesian Information Criterion. Similar to the AIC, but with
              a Bayesian emphasis.

       Rotational, translational, covar chi^2:
              The reduced chi-squared statistic for the fit of the  structures
              to  the model.  With a good fit it should be close to 1.0, which
              indicates a perfect fit of the data to  the  statistical  model.
              In  the  case  of  least-squares,  the assumed model is a matrix
              Gaussian distribution of the structures with equal variances and
              no correlations.  For the ML fits, the assumed models can either
              be (1) unequal variances and no correlations, as calculated with
              the   -v   option   [default]   or  (2)  unequal  variances  and
              correlations, as calculated with the -c option.  This  statistic
              is  for  the superposition only, and does not include the fit of
              the  covariance  matrix  eigenvalues   to   an   inverse   gamma
              distribution.  See ’Omnibus chi^2’ below.

       Hierarchical minimum var:
              The   hierarchical   fit   of  the  inverse  gamma  distribution
              constrains the variances of  the  atoms  by  making  large  ones
              smaller  and  small  ones  larger.   This  statistic reports the
              minimum possible  variance  given  the  inferred  inverse  gamma
              parameters.

       Hierarchical var (alpha, gamma) chi^2:
              The  reduced  chi-squared  for  the  inverse  gamma  fit  of the
              covariance matrix eigenvalues. As before, it should  ideally  be
              close  to  1.0.   The  two  values in the parentheses are the ML
              estimates of the scale and shape parameters,  respectively,  for
              the inverse gamma distribtuion.

       Omnibus chi^2:
              The  overall  reduced  chi-squared statistic for the entire fit,
              including the  rotations,  translations,  covariances,  and  the
              inverse  gamma  parameters.  This is probably the most important
              statistic for the superposition.  In  some  cases,  the  inverse
              gamma  fit  may be poor, yet the overall fit is still very good.
              Again, it should ideally be close to 1.0, which would indicate a
              perfect fit. However, if you think it is too large, make sure to
              compare it to the chi^2 for the least-squares fit; it’s probably
              not  that  bad  after  all.   A  large  chi^2  often indicates a
              violation of the assumptions of  the  model.   The  most  common
              violation  is  when  superpositioning  two  or  more independent
              domains that can rotate relative to each other. If this  is  the
              case,   then   there  will  likely  be  not  just  one  Gaussian
              distribution, but several mixed Gaussians, one for each  domain.
              Then,   it   would   be  better  to  superposition  each  domain
              independently.

       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
              The skewness and kurtosis of the residuals. Both should  be  0.0
              if  the  residuals  fit a Gaussian distribution perfectly.  They
              are followed by the P-value for the statistics. This is  a  very
              stringent  test;  residuals can be very non-Gaussian and yet the
              estimated rotations, translations,  and  covariance  matrix  may
              still be rather accurate.

       FP error in transformed coordinates:
              The   empirically   determined   floating  point  error  in  the
              coordinates after rotation and translation.

       Minimum RMSD error per atom:
              The empirically determined minimum RMSD error per atom, based on
              the floating point error of the computer.

       Data pts, Free params, D/P:
              The  total  number of data points given all observed structures,
              the number of parameters being fit in the model, and  the  data-
              to-parameter ratio.

       Median structure:
              The  structure  that  is  overall  most  similar  to the average
              structure. This can be  considered  to  be  the  most  "typical"
              structure in the ensemble.

       Total rounds:
              The number of iterations that the algorithm took to converge.

       Fractional precision:
              The actual precision that the algorithm converged to.

OUTPUT FILES

       Theseus writes out the following files:

       theseus_sup.pdb
              The  final  superposition,  rotated to the principle axes of the
              mean structure.

       theseus_ave.pdb
              The estimate of the mean structure.

       theseus_cor.mat, theseus_cov.mat
              The atomic correlation matrix and covariance matrices, based  on
              the  final  superposition.  The  format is suitable for input to
              GNU’s octave.  These are the  matrices  used  in  the  Principal
              Components Analysis.

       theseus_embed_ave.pdb
              The  average structure as calculated by S. Lele’s EDMA embedding
              algorithm, used as the starting point for the maximum likelihood
              iterations.

       theseus_residuals.txt
              The  normalized  residuals  of  the  superposition. These can be
              analyzed for deviations  from  normality  (whether  they  fit  a
              standard  Gaussian distribution). E.g., the chi^2, skewness, and
              kurtosis statistics are based on these values.

       theseus_transf.txt
              The  final  transformation  rotation  matrices  and  translation
              vectors.

       theseus_variances.txt
              The vector of estimated variances for each atom.

       When  Principal  Components  are  calculated  (with the -P option), the
       following files are also produced:

       theseus_pcvecs.txt
              The principal component vectors.

       theseus_pcstats.txt
              Simple  statistics  for  each  principle  component   (loadings,
              variance explained, etc.).

       theseus_pcN_ave.pdb
              The  average  structure with the Nth principal component written
              in the temperature factor field.

       theseus_pcN.pdb
              The final superposition with the Nth principal component written
              in  the  temperature  factor  field.   This file is omitted when
              superpositioning  molecules  with  different  residue  sequences
              (mode 2).

BUGS

       Please send me (DLT) reports of all problems.

RESTRICTIONS

       Theseus  is  not  a  structural alignment program.  The structure-based
       alignment  problem  is  completely  different   from   the   structural
       superposition  problem.   In  order  to  do a structural superposition,
       there must be a  1-to-1  mapping  that  associates  the  atoms  in  one
       structure  with  the  atoms  in  the other structures.  In the simplest
       case, this means that structures must have equivalent numbers of atoms,
       such  as  the models in an NMR PDB file.  For structures with different
       numbers of residues/atoms, superpositioning is only possible  when  the
       sequences  have  been  aligned  previously.   Finding the best sequence
       alignment based on only structural information is a difficult  problem,
       and  one  for  which there is currently no maximum likelihood approach.
       Extending theseus to address the structural  alignment  problem  is  an
       ongoing research project.

AUTHOR

       Douglas L. Theobald
       dtheobald@brandeis.edu
       dtheobald@gmail.com

CITATION

       When using theseus in publications please cite the following:

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "Empirical  Bayes models for regularizing maximum likelihood estimation
       in the matrix Gaussian Procrustes problem."
       PNAS 103(49):18521-18527

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "THESEUS:  Maximum  likelihood   superpositioning   and   analysis   of
       macromolecular structures."
       Bioinformatics 22(17):2171-2172

       Douglas L. Theobald and Deborah S. Wuttke (2008)
       "Accurate    structural    correlations    from    maximum   likelihood
       superpositions."
       PLoS Computational Biology 4(2):e43 in press

HISTORY

       Long, tedious, and sordid.