PDLAHQR - i an auxiliary routine used to find the Schur decomposition

NAME

       PDLAHQR  -  i an auxiliary routine used to find the Schur decomposition
       and or eigenvalues of a matrix already in Hessenberg  form  from   cols
       ILO to IHI

SYNOPSIS

       SUBROUTINE PDLAHQR( WANTT,  WANTZ, N, ILO, IHI, A, DESCA, WR, WI, ILOZ,
                           IHIZ, Z, DESCZ, WORK, LWORK, IWORK, ILWORK, INFO )

           LOGICAL         WANTT, WANTZ

           INTEGER         IHI, IHIZ, ILO, ILOZ, ILWORK, INFO, LWORK, N, ROTN

           INTEGER         DESCA( * ), DESCZ( * ), IWORK( * )

           DOUBLE          PRECISION A( * ), WI( * ), WORK( * ), WR( * ), Z( *
                           )

PURPOSE

       PDLAHQR is an auxiliary routine used to find the Schur decomposition
         and or eigenvalues of a matrix already in Hessenberg form from
         cols ILO to IHI.

       Notes
       =====

       Each  global  data  object  is  described  by an associated description
       vector.  This vector stores the information required to  establish  the
       mapping  between  an  object  element and its corresponding process and
       memory location.

       Let A be a generic term for any 2D block  cyclicly  distributed  array.
       Such a global array has an associated description vector DESCA.  In the
       following comments, the character _ should be read as  "of  the  global
       array".

       NOTATION        STORED IN      EXPLANATION
       ---------------  --------------  --------------------------------------
       DTYPE_A(global) DESCA( DTYPE_ )The descriptor type.  In this case,
                                      DTYPE_A = 1.
       CTXT_A (global) DESCA( CTXT_ ) The BLACS context handle, indicating
                                      the BLACS process grid A is distribu-
                                      ted over. The context itself is glo-
                                      bal, but the handle (the integer
                                      value) may vary.
       M_A    (global) DESCA( M_ )    The number of rows in the global
                                      array A.
       N_A    (global) DESCA( N_ )    The number of columns in the global
                                      array A.
       MB_A   (global) DESCA( MB_ )   The blocking factor used to distribute
                                      the rows of the array.
       NB_A   (global) DESCA( NB_ )   The blocking factor used to distribute
                                      the columns of the array.
       RSRC_A (global) DESCA( RSRC_ ) The process row over which the first
                                      row  of  the  array  A  is  distributed.
       CSRC_A (global) DESCA( CSRC_ ) The process column over which the
                                      first column of the array A is
                                      distributed.
       LLD_A  (local)  DESCA( LLD_ )  The leading dimension of the local
                                      array.  LLD_A >= MAX(1,LOCr(M_A)).

       Let  K  be  the  number of rows or columns of a distributed matrix, and
       assume that its process grid has dimension p x q.
       LOCr( K ) denotes the number of elements of  K  that  a  process  would
       receive  if  K  were  distributed  over  the p processes of its process
       column.
       Similarly, LOCc( K ) denotes the number of elements of K that a process
       would receive if K were distributed over the q processes of its process
       row.
       The values of LOCr() and LOCc() may be determined via  a  call  to  the
       ScaLAPACK tool function, NUMROC:
               LOCr( M ) = NUMROC( M, MB_A, MYROW, RSRC_A, NPROW ),
               LOCc(  N ) = NUMROC( N, NB_A, MYCOL, CSRC_A, NPCOL ).  An upper
       bound for these quantities may be computed by:
               LOCr( M ) <= ceil( ceil(M/MB_A)/NPROW )*MB_A
               LOCc( N ) <= ceil( ceil(N/NB_A)/NPCOL )*NB_A

ARGUMENTS

       WANTT   (global input) LOGICAL
               = .TRUE. : the full Schur form T is required;
               = .FALSE.: only eigenvalues are required.

       WANTZ   (global input) LOGICAL
               = .TRUE. : the matrix of Schur vectors Z is required;
               = .FALSE.: Schur vectors are not required.

       N       (global input) INTEGER
               The order of the Hessenberg matrix A (and Z if WANTZ).  N >= 0.

       ILO     (global input) INTEGER
               IHI      (global input) INTEGER It is assumed that A is already
               upper quasi-triangular in rows and columns  IHI+1:N,  and  that
               A(ILO,ILO-1) = 0 (unless ILO = 1). PDLAHQR works primarily with
               the Hessenberg submatrix in rows and columns ILO  to  IHI,  but
               applies  transformations  to all of H if WANTT is .TRUE..  1 <=
               ILO <= max(1,IHI); IHI <= N.

       A       (global input/output) DOUBLE PRECISION array, dimension
               (DESCA(LLD_),*) On entry, the upper Hessenberg  matrix  A.   On
               exit,  if  WANTT is .TRUE., A is upper quasi-triangular in rows
               and columns ILO:IHI, with any 2-by-2 or larger diagonal  blocks
               not  yet in standard form. If WANTT is .FALSE., the contents of
               A are unspecified on exit.

       DESCA   (global and local input) INTEGER array of dimension DLEN_.
               The array descriptor for the distributed matrix A.

       WR      (global replicated output) DOUBLE PRECISION array,
               dimension  (N)  WI       (global  replicated   output)   DOUBLE
               PRECISION  array,  dimension  (N) The real and imaginary parts,
               respectively, of the computed eigenvalues ILO to IHI are stored
               in  the corresponding elements of WR and WI. If two eigenvalues
               are computed as a complex conjugate pair, they  are  stored  in
               consecutive  elements  of  WR and WI, say the i-th and (i+1)th,
               with WI(i) > 0 and  WI(i+1)  <  0.  If  WANTT  is  .TRUE.,  the
               eigenvalues  are stored in the same order as on the diagonal of
               the Schur form returned in A.  A may be  returned  with  larger
               diagonal blocks until the next release.

       ILOZ    (global input) INTEGER
               IHIZ     (global  input) INTEGER Specify the rows of Z to which
               transformations must be applied if WANTZ is .TRUE..  1 <=  ILOZ
               <= ILO; IHI <= IHIZ <= N.

       Z       (global input/output) DOUBLE PRECISION array.
               If  WANTZ is .TRUE., on entry Z must contain the current matrix
               Z of transformations accumulated by PDHSEQR, and on exit Z  has
               been updated; transformations are applied only to the submatrix
               Z(ILOZ:IHIZ,ILO:IHI).   If  WANTZ  is   .FALSE.,   Z   is   not
               referenced.

       DESCZ   (global and local input) INTEGER array of dimension DLEN_.
               The array descriptor for the distributed matrix Z.

       WORK    (local output) DOUBLE PRECISION array of size LWORK
               (Unless LWORK=-1, in which case WORK must be at least size 1)

       LWORK   (local input) INTEGER
               WORK(LWORK) is a local array and LWORK is assumed big enough so
               that LWORK  >=  3*N  +  MAX(  2*MAX(DESCZ(LLD_),DESCA(LLD_))  +
               2*LOCc(N),    7*Ceil(N/HBL)/LCM(NPROW,NPCOL))   +   MAX(   2*N,
               (8*LCM(NPROW,NPCOL)+2)**2 ) If LWORK=-1, then WORK(1) gets  set
               to the above number and the code returns immediately.

       IWORK   (global and local input) INTEGER array of size ILWORK
               This  will  hold some of the IBLK integer arrays.  This is held
               as  a  place  holder   for   a   future   release.    Currently
               unreferenced.

       ILWORK  (local input) INTEGER
               This  will hold the size of the IWORK array.  This is held as a
               place holder for a future release.  Currently unreferenced.

       INFO    (global output) INTEGER
               < 0: parameter number -INFO incorrect or inconsistent
               = 0: successful exit
               > 0: PDLAHQR failed to compute all the eigenvalues ILO  to  IHI
               in  a total of 30*(IHI-ILO+1) iterations; if INFO = i, elements
               i+1:ihi of WR and WI contain those eigenvalues which have  been
               successfully computed.

               Logic:  This  algorithm  is  very  similar  to  _LAHQR.  Unlike
               _LAHQR, instead of sending one double shift through the largest
               unreduced  submatrix,  this  algorithm  sends  multiple  double
               shifts and spaces them apart so that there can  be  parallelism
               across   several   processor   row/columns.   Another  critical
               difference  is  that  this   algorithm   aggregrates   multiple
               transforms  together in order to apply them in a block fashion.

               Important Local Variables: IBLK = The maximum number of  bulges
               that  can  be computed.  Currently fixed.  Future releases this
               won’t   be   fixed.    HBL    =   The   square    block    size
               (HBL=DESCA(MB_)=DESCA(NB_))  ROTN = The number of transforms to
               block together NBULGE = The  number  of  bulges  that  will  be
               attempted  on  the  current  submatrix.   IBULGE  = The current
               number of bulges started.   K1(*),K2(*)  =  The  current  bulge
               loops from K1(*) to K2(*).

               Subroutines:  From  LAPACK,  this  routine calls: DLAHQR     ->
               Serial QR used  to  determine  shifts  and  eigenvalues  DLARFG
               -> Determine the Householder transforms

               This  ScaLAPACK, this routine calls: PDLACONSB  -> To determine
               where to start each  iteration  DLAMSH      ->  Sends  multiple
               shifts  through  a  small  submatrix to see how the consecutive
               subdiagonals change (if PDLACONSB indicates we can start a  run
               in   the  middle)  PDLAWIL     ->  Given  the  shift,  get  the
               transformation DLASORTE   -> Pair up eigenvalues so that  reals
               are  paired.   PDLACP3    -> Parallel array to local replicated
               array copy & back.  DLAREF     -> Row/column reflector applier.
               Core  routine here.  PDLASMSUB  -> Finds negligible subdiagonal
               elements.

               Current Notes and/or Restrictions: 1.) This code  requires  the
               distributed  block  size  to  be  square  and at least six (6);
               unlike simpler codes  like  LU,  this  algorithm  is  extremely
               sensitive  to  block size.  Unwise choices of too small a block
               size can lead to bad performance.  2.) This code requires A and
               Z  to be distributed identically and have identical contxts.  A
               future version may allow Z to have a different contxt to 1D row
               map  it  to  all nodes (so no communication on Z is necessary.)
               3.)  This  release  currently  does  not  have  a  routine  for
               resolving  the  Schur  blocks  into regular 2x2 form after this
               code is completed.  Because of this, a significant  performance
               impact  is  required while the deflation is done by sometimes a
               single column of processors.  4.) This code does not  currently
               block  the  initial  transforms  so  that  none  of the rows or
               columns for any bulge are completed until all are started.   To
               offset  pipeline  start-up  it  is  recommended  that  at least
               2*LCM(NPROW,NPCOL)  bulges  are  used  (if  possible)  5.)  The
               maximum  number  of  bulges currently supported is fixed at 32.
               In future versions this will be limited only  by  the  incoming
               WORK  and  IWORK  array.   6.)  The  matrix  A must be in upper
               Hessenberg  form.   If  elements  below  the  subdiagonal   are
               nonzero,  the  resulting transforms may be nonsimilar.  This is
               also true  with  the  LAPACK  routine  DLAHQR.   7.)  For  this
               release,  this code has only been tested for RSRC_=CSRC_=0, but
               it has been written for the general case.  8.)  Currently,  all
               the  eigenvalues  are  distributed  to  all  the nodes.  Future
               releases will probably distribute the eigenvalues by the column
               partitioning.  9.) The internals of this routine are subject to
               change.  10.) To  optimize  this  for  your  architecture,  try
               tuning DLAREF.  11.) This code has only been tested for WANTZ =
               .TRUE. and may behave unpredictably for WANTZ set to .FALSE.

               Implemented by:  G. Henry, May 1, 1997